Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans ...................................................................... Patch Set 10: (9 comments) Thanks a lot for the comments! http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164 PS10, Line 1164: > why remove this? Without removing it, we can not use the containing logic to detect corrupted status for single partitioned tables. http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179 PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed : // by all of them, as the row count for the table. > When I filed the JIRA the ask was just to avoid catastrophically bad decisi Yes, the new logic achieves the objective: all partitions are of bad stats. The change of the original logic also prevents a potential under-estimate problem: one or several partitions have good stats, and many others have bad stats. http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179 PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed : // by all of them, as the row count for the table. > is this the intended behavior? so if a user has a table with hundreds of pa Based on the fact that hive.stats.autogather=TRUE is the default setting, I would think it is more often that more number of partitions will have corrupted stats in a table. The case of almost all partitions having good stats and one or two having bad status will be rare. Thus the code as written is optimized for the common use cases. This part of the code fulfills the following for the scan. 1. Provide accurate row count when the stats is good, and do not set the corruption flag; 2. Provide estimate when the stats is corrupted and set the corruption flag. Note that the partitions list does not include eliminated partitions, and therefore the estimate will be relevant to those involved in the scan only. http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1220 PS10, Line 1220: if ( numRows > 0 ) : numRows = Math.min(numRows, estNumRows); : else : numRows = estNumRows; > for any if statement thats is longer than a single line, you should add cur Done http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py File tests/metadata/test_compute_stats.py: http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@180 PS10, Line 180: impala_9744 > I know other methods do this, but I'm generally not a fan of including the Done http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@194 PS10, Line 194: # > nit remove Done http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199 PS10, Line 199: CREATE TABLE {0}.{1} ( > I think you could just use 'create table like parquet' rather than listing Yes, I would agree. The disadvantage though is that we may miss the code path for Hive to create a table with bad stats. Once this JIRA (https://issues.apache.org/jira/browse/HIVE-10593) is fixed, we could do as you suggested from Hive side. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215 PS10, Line 215: set hive.stats.autogather=true; > do you need this? Yes. This is pre-condition for Hive to create corrupted stats. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@236 PS10, Line 236: # > I think its probably best to create separate test methods for the partition Done. I like the idea. -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 10 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Mon, 29 Jun 2020 13:49:14 +0000 Gerrit-HasComments: Yes
