Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
......................................................................

Patch Set 14:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // Consider partition with good stats.
: if (partNumRows > -1) {
> Yes, I agree that treating the missing and corrupt stats the same is a good
right, as Tim said though:

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1155
PS14, Line 1155: estiamte
typo

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1167
PS14, Line 1167: partitionsWithBadRowcount
I think it's clearer to rename this to `partitionsWithCorruptOrMissingStats`. It's a bit more verbose, but I think the terminology should only categorize bad stats as "corrupt" or "missing".

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1167
PS14, Line 1167: FeFsPartition
not needed

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1178
PS14, Line 1178: else {
: // Consider partition with good stats.
: if
should be combined into a single line

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1189
PS14, Line 1189: && numPartitionsWithNumRows_ > 0
is this condition still necessary?
http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1200
PS14, Line 1200: numRows == 0 && tbl_.getTotalHdfsBytes() > 0
is this condition ever false if hasCorruptTableStats_ is true?

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1201
PS14, Line 1201: )
shouldn't there be a check to see if there are any partitions with bad row counts?

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1230
PS14, Line 1230: partitionNumRows_
if the table is partitioned, partitionNumRows_ needs to be updated with the stats from the partitions with good stats. partitionNumRows_ is an instance variable and is used in getTableStatsExplainString to print out the stats in the explain plan.

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1233
PS14, Line 1233: // Cap the estimated number of rows by numRows if it is > 0.
: if (numRows > 0) {
:   numRows = Math.min(numRows, estNumRows);
: } else {
:   numRows = estNumRows;
: }
why is this needed? doesn't this defeat the purpose of estimating the number of rows from the file size? e.g. if you have a partitioned table and only have row stats for 1/2 of the partitions, the row count from those partitions is now the upper bound for the total row stats, regardless of how many additional partitions are added.
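
To make the concern about the cap concrete, here is a minimal standalone sketch. The class name, method name, and all numbers are hypothetical and not taken from the patch; only the min/else logic mirrors the lines quoted above, and it assumes numRows holds the sum of row counts from partitions with good stats.

```java
public class RowCapSketch {
  // Hypothetical stand-in for the capping logic quoted at PS14, line 1233.
  // numRows: row count summed over partitions that have good stats (assumed).
  // estNumRows: row count estimated from file sizes over all partitions (assumed).
  static long capped(long numRows, long estNumRows) {
    if (numRows > 0) {
      return Math.min(numRows, estNumRows);
    }
    return estNumRows;
  }

  public static void main(String[] args) {
    // Half of the partitions have stats: 1000 rows known, about 2000 estimated in total.
    System.out.println(capped(1_000L, 2_000L));   // prints 1000
    // More partitions are added without stats; the estimate grows, the result does not.
    System.out.println(capped(1_000L, 10_000L));  // prints 1000
    // Without any good partition stats, the file-size estimate is used as-is.
    System.out.println(capped(0L, 2_000L));       // prints 2000
  }
}
```

In the first two calls the result stays at the partial stats total even though the file-size estimate grows fivefold, which is the behavior the question above is asking about.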
-- 
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 10 Jul 2020 17:53:13 +0000
Gerrit-HasComments: Yes
