Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16098 )

Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad 
plans
......................................................................


Patch Set 14:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179:     // Consider partition with good stats.
               :         if (partNumRows > -1) {
> Yes, I agree that treating the missing and corrupt stats the same is a good
right, as Tim said though:


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1155
PS14, Line 1155: estiamte
typo: should be "estimate"


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1167
PS14, Line 1167: partitionsWithBadRowcount
I think it's clearer to rename this to `partitionsWithCorruptOrMissingStats`. 
It's a bit more verbose, but I think the terminology should only 
categorize bad stats as "corrupt" or "missing".


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1167
PS14, Line 1167: FeFsPartition
not needed


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1178
PS14, Line 1178: else {
               :         // Consider partition with good stats.
               :         if
should be combined into a single line


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1189
PS14, Line 1189: && numPartitionsWithNumRows_ > 0
is this condition still necessary?


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1200
PS14, Line 1200: numRows == 0 && tbl_.getTotalHdfsBytes() > 0
is this condition ever false if hasCorruptTableStats_ is true?


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1201
PS14, Line 1201: )
shouldn't there be a check to see if there are any partitions with bad row 
counts?


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1230
PS14, Line 1230: partitionNumRows_
if the table is partitioned, partitionNumRows_ needs to be updated with the 
stats from the partitions with good stats. partitionNumRows_ is an instance 
variable and is used in getTableStatsExplainString to print out the stats in 
the explain plan.
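
For illustration, a minimal standalone sketch of aggregating row counts only from partitions with good stats. It is not the actual HdfsScanNode code: the class and method names are hypothetical, and it assumes (following the `partNumRows > -1` check quoted from PS10) that a row count of -1 or lower marks a partition whose stats are missing or corrupt.

```java
// Hypothetical sketch; not the actual HdfsScanNode implementation.
class PartitionStatsSketch {
  // Returns {sum of row counts over partitions with good stats,
  //          number of partitions with corrupt or missing stats}.
  // Assumption: a row count > -1 means the partition has good stats.
  static long[] summarize(long[] partNumRows) {
    long rowsFromGoodStats = 0;
    long partitionsWithCorruptOrMissingStats = 0;
    for (long n : partNumRows) {
      if (n > -1) {
        rowsFromGoodStats += n;
      } else {
        partitionsWithCorruptOrMissingStats++;
      }
    }
    return new long[] {rowsFromGoodStats, partitionsWithCorruptOrMissingStats};
  }

  public static void main(String[] args) {
    // Made-up input: two partitions with stats, two without.
    long[] result = summarize(new long[] {100, -1, 250, -1});
    System.out.println(result[0] + " rows from good stats, "
        + result[1] + " partitions without usable stats");
  }
}
```

In this sketch the first value is what an instance variable like partitionNumRows_ would hold, and the second is what a "any partitions with bad row counts?" check would test.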


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1233
PS14, Line 1233:       // Cap the estimated number of rows by numRows if it is > 0.
               :       if (numRows > 0) {
               :         numRows = Math.min(numRows, estNumRows);
               :       } else {
               :         numRows = estNumRows;
               :       }
why is this needed? doesn't this defeat the purpose of estimating the number of 
rows from the file size? e.g. if you have a partitioned table and only have row 
stats for 1/2 of the partitions, that is now the upper bound for the total 
row count regardless of how many additional partitions are added.
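
To make the concern concrete, a small standalone sketch that mirrors the quoted capping logic. The class name and the numbers are made up for illustration, and this is not the actual HdfsScanNode code.

```java
// Hypothetical sketch mirroring the quoted snippet at PS14, Line 1233.
class RowCountCapSketch {
  // Cap the file-size-based estimate by numRows whenever numRows is positive.
  static long capped(long numRows, long estNumRows) {
    if (numRows > 0) {
      return Math.min(numRows, estNumRows);
    }
    return estNumRows;
  }

  public static void main(String[] args) {
    // Assumed scenario: half of the partitions have stats totalling 1,000 rows.
    long numRowsFromStats = 1000;
    // File-size-based estimate for the whole table is 2,000 rows.
    System.out.println(capped(numRowsFromStats, 2000));   // prints 1000
    // More partitions without stats are added; the estimate grows, the result does not.
    System.out.println(capped(numRowsFromStats, 10000));  // prints 1000
    // With no usable row count, the estimate passes through unchanged.
    System.out.println(capped(0, 10000));                 // prints 10000
  }
}
```

With a positive numRows derived from only some of the partitions, the result never exceeds that partial count, however large the file-size-based estimate becomes.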



--
To view, visit http://gerrit.cloudera.org:8080/16098

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 10 Jul 2020 17:53:13 +0000
Gerrit-HasComments: Yes