[ https://issues.apache.org/jira/browse/IMPALA-9744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-9744:
----------------------------------
    Summary: Treat corrupt table stats as missing to avoid bad plans  (was: Treat corrupt table stats as )

> Treat corrupt table stats as missing to avoid bad plans
> -------------------------------------------------------
>
>                 Key: IMPALA-9744
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9744
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: ramp-up
>
> We currently detect corrupt stats (0 rows but data in partition) but only
> flag it. The 0 row count is still used for planning. I ran into a scenario
> where this led to an extremely pathological plan: the 0 row count led to
> flipping a nested loop join to put the big table on the build side and
> running out of memory.
> I propose doing something very conservative to avoid this scenario: if we see
> corrupt stats in any partition, and the row count is computed to be zero,
> ignore the row count and treat it the same as missing stats in the planner.
> Here's an example where we end up with corrupt stats. Warning: this can
> remove the data file from your alltypes table; I recommend copying the file
> to a different location before running this.
> {noformat}
> # In beeline against HS2
> !connect jdbc:hive2://localhost:11050 hive org.apache.hive.jdbc.HiveDrive
> set hive.stats.autogather=true;
> CREATE TABLE `alltypes_insert_only`(
>   `id` int COMMENT 'Add a comment',
>   `bool_col` boolean,
>   `tinyint_col` tinyint,
>   `smallint_col` smallint,
>   `int_col` int,
>   `bigint_col` bigint,
>   `float_col` float,
>   `double_col` double,
>   `date_string_col` string,
>   `string_col` string,
>   `timestamp_col` timestamp)
> PARTITIONED BY (
>   `year` int,
>   `month` int)
> STORED AS PARQUET
> TBLPROPERTIES ("transactional"="true",
>   "transactional_properties"="insert_only");
> load data inpath
>   'hdfs://172.19.0.1:20500/test-warehouse/alltypes_parquet/year=2009/month=1/154473eafa08ea0e-f9d70e7100000004_1040780996_data.0.parq'
>   into table alltypes_insert_only partition (year=2009,month=9);
> # In Impala
> show table stats alltypes_insert_only;
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | year  | month | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                               |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> | 2009  | 10    | 0     | 1      | 7.75KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://172.19.0.1:20500/test-warehouse/managed/alltypes_insert_only/year=2009/month=10 |
> | Total |       | -1    | 1      | 7.75KB | 0B           |                   |         |                   |                                                                                        |
> +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
> {noformat}
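
The rule proposed in the description can be sketched roughly as follows. This is only an illustration and not Impala's frontend code (which is written in Java). `PartitionStats`, `has_corrupt_stats` and `planner_row_count` are hypothetical names. The sketch assumes that a missing row count is represented as -1, as in the Total row of the `show table stats` output, and, as a simplification, that a table with any partition lacking stats is treated as having no table-level row count.

```python
from dataclasses import dataclass
from typing import List

# Assumption: -1 stands for "row count unknown", matching the -1 shown
# in the Total row of the `show table stats` output.
MISSING = -1


@dataclass
class PartitionStats:
    """Hypothetical per-partition stats record, for illustration only."""
    num_rows: int    # MISSING (-1) if stats were never computed
    num_files: int
    size_bytes: int


def has_corrupt_stats(p: PartitionStats) -> bool:
    """Corrupt as described in the issue: 0 rows recorded but data present."""
    return p.num_rows == 0 and p.size_bytes > 0


def planner_row_count(partitions: List[PartitionStats]) -> int:
    """Return the row count the planner should use, or MISSING.

    Conservative rule from the proposal: only when the computed total is
    zero AND at least one partition has corrupt stats is the count ignored.
    """
    # Simplifying assumption of this sketch: any partition without stats
    # makes the table-level count unknown.
    if any(p.num_rows == MISSING for p in partitions):
        return MISSING
    total = sum(p.num_rows for p in partitions)
    if total == 0 and any(has_corrupt_stats(p) for p in partitions):
        return MISSING  # treat the same as missing stats
    return total
```

Under this rule, a partition like the one in the example (0 rows, 1 file, 7.75KB) yields a missing row count instead of 0, while a genuinely empty table still reports 0 and a table with a non-zero total keeps its computed count.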