Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans ...................................................................... Patch Set 10: (9 comments) http://gerrit.cloudera.org:8080/#/c/16098/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16098/10//COMMIT_MSG@27 PS10, Line 27: configrations nit: typo http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164 PS10, Line 1164: why remove this? http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179 PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed : // by all of them, as the row count for the table. is this the intended behavior? so if a user has a table with hundreds of partitions, adding a single partition with corrupt stats basically causes the stats estimation to fall back to the table-size based estimation? wouldn't a more reasonable behavior be if you see a partition with corrupt stats you fallback to the file-size based estimation just for that partition? http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1220 PS10, Line 1220: if ( numRows > 0 ) : numRows = Math.min(numRows, estNumRows); : else : numRows = estNumRows; for any if statement thats is longer than a single line, you should add curly braces. so it should be: if (numRows > 0) { numRows = Math.min(numRows, estNumRows); } else { numRows = estNumRows; } you don't need the extra space in the if condition either http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py File tests/metadata/test_compute_stats.py: http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@180 PS10, Line 180: impala_9744 I know other methods do this, but I'm generally not a fan of including the JIRA number in a test method name. I think we can come up with a more descriptive name for this method. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@194 PS10, Line 194: # nit remove http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199 PS10, Line 199: CREATE TABLE {0}.{1} ( I think you could just use 'create table like parquet' rather than listing out all the columns. You can create the table in Impala and still modify it using the load data local inpath statement in Hive. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215 PS10, Line 215: set hive.stats.autogather=true; do you need this? http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@236 PS10, Line 236: # I think its probably best to create separate test methods for the partitioned and unpartitioned cases -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 10 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Sat, 27 Jun 2020 03:43:27 +0000 Gerrit-HasComments: Yes