Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans ...................................................................... Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164 PS10, Line 1164: > Without removing it, we can not use the containing logic to detect corrupte I thought that getNumClusteringCols.size() == number of partition columns for a table? do you mean non-partitioned tables? http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179 PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed : // by all of them, as the row count for the table. Responding to Tim's comments: > My original suggestion was to only fall back when the total row count was 0. So to summarize, the goal here (or at least original intention of this JIRA) is that when you have a partitioned table, and the total row count of the table is calculated to be 0, but some of the partitions are "corrupt", then treat the stats as "missing", in which case we fall back to the estimation based on the total table size + average row size? Where "corrupt" stats are defined as stats with a value less than -1 or where the numRows == 0 but the partition size != 0. Where "missing" stats are defined as stats with the value = -1. > I think we could also just treat corrupt partition stats the same as missing > partition stats - i.e. ignore them as long as we have some valid stats. Yeah that seems reasonable. For missing stats, I think that is the current behavior. The code says "Ignore partitions with missing stats in the hope they don't matter enough to change the planning outcome." My vote is to follow the "treat corrupt partition stats the same as missing partition stats" as that is more generic and should simpler and safer to implement. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py File tests/metadata/test_compute_stats.py: http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199 PS10, Line 199: CREATE TABLE {0}.{1} ( > Yes, I would agree. The disadvantage though is that we may miss the code pa Isn't it the load data local inpath query that is setting the stats to a corrupt value? not the create table statement itself? http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215 PS10, Line 215: set hive.stats.autogather=true; > Yes. This is pre-condition for Hive to create corrupted stats. It's set to true by default in Hive already: https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2536 so I don't think you need to set it to true again here. -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 10 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Tue, 30 Jun 2020 00:33:28 +0000 Gerrit-HasComments: Yes