Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16098 )

Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad 
plans
......................................................................


Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164
PS10, Line 1164:
> Without removing it, we can not use the containing logic to detect corrupte
I thought that getNumClusteringCols.size() == number of partition columns for a 
table? do you mean non-partitioned tables?


http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row 
count, contributed
               :     // by all of them, as the row count for the table.
Responding to Tim's comments:

> My original suggestion was to only fall back when the total row count was 0.

So to summarize, the goal here (or at least original intention of this JIRA) is 
that when you have a partitioned table, and the total row count of the table is 
calculated to be 0, but some of the partitions are "corrupt", then treat the 
stats as "missing", in which case we fall back to the estimation based on the 
total table size + average row size?

Where "corrupt" stats are defined as stats with a value less than -1 or where 
the numRows == 0 but the partition size != 0.

Where "missing" stats are defined as stats with the value = -1.

> I think we could also just treat corrupt partition stats the same as missing 
> partition stats - i.e. ignore them as long as we have some valid stats.

Yeah that seems reasonable. For missing stats, I think that is the current 
behavior. The code says "Ignore partitions with missing stats in the hope they 
don't matter enough to change the planning outcome."

My vote is to follow the "treat corrupt partition stats the same as missing 
partition stats" as that is more generic and should simpler and safer to 
implement.


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199
PS10, Line 199:       CREATE TABLE {0}.{1} (
> Yes, I would agree. The disadvantage though is that we may miss the code pa
Isn't it the load data local inpath query that is setting the stats to a 
corrupt value? not the create table statement itself?


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215:       set hive.stats.autogather=true;
> Yes. This is pre-condition for Hive to create corrupted stats.
It's set to true by default in Hive already: 
https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2536
 so I don't think you need to set it to true again here.



--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 00:33:28 +0000
Gerrit-HasComments: Yes

Reply via email to