Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16098 )

Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad 
plans
......................................................................


Patch Set 10:

(9 comments)

Thanks a lot for the comments!

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164
PS10, Line 1164:
> why remove this?
Without removing it, we can not use the containing logic to detect corrupted 
status for single partitioned tables.


http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row 
count, contributed
               :     // by all of them, as the row count for the table.
> When I filed the JIRA the ask was just to avoid catastrophically bad decisi
Yes, the new logic achieves the objective: all partitions are of bad stats.

The change of the original logic also prevents a potential under-estimate 
problem: one or several partitions have good stats, and many others have bad 
stats.


http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row 
count, contributed
               :     // by all of them, as the row count for the table.
> is this the intended behavior? so if a user has a table with hundreds of pa
Based on the fact that hive.stats.autogather=TRUE is the default setting, I 
would think it is more often that more number of partitions will have corrupted 
stats in a table. The case of almost all partitions having good stats and one 
or two having bad status will be rare. Thus the code as written is optimized 
for the common use cases.

This part of the code fulfills the following for the scan.

1. Provide accurate row count when the stats is good, and do not set the 
corruption flag;
2. Provide estimate when the stats is corrupted and set the corruption flag.

Note that the partitions list does not include eliminated partitions, and 
therefore the estimate will be relevant to those involved in the scan only.


http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1220
PS10, Line 1220:       if ( numRows > 0 )
               :         numRows = Math.min(numRows, estNumRows);
               :       else
               :         numRows = estNumRows;
> for any if statement thats is longer than a single line, you should add cur
Done


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@180
PS10, Line 180: impala_9744
> I know other methods do this, but I'm generally not a fan of including the
Done


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@194
PS10, Line 194: #
> nit remove
Done


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199
PS10, Line 199:       CREATE TABLE {0}.{1} (
> I think you could just use 'create table like parquet' rather than listing
Yes, I would agree. The disadvantage though is that we may miss the code path 
for Hive to create a table with bad stats.

Once this JIRA (https://issues.apache.org/jira/browse/HIVE-10593) is fixed, we 
could do as you suggested from Hive side.


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215:       set hive.stats.autogather=true;
> do you need this?
Yes. This is pre-condition for Hive to create corrupted stats.


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@236
PS10, Line 236:     #
> I think its probably best to create separate test methods for the partition
Done.  I like the idea.



--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Jun 2020 13:49:14 +0000
Gerrit-HasComments: Yes

Reply via email to