[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans

Sahil Takiar (Code Review) Wed, 01 Jul 2020 12:00:08 -0700

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16098 )

Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad
plans
......................................................................

Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164
PS10, Line 1164:
> Yes by a single partitioned table I mean a non-partitioned table. We use th
I need to spend more time understanding this code. I'll get back to you on
this.

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed
: // by all of them, as the row count for the table.
> Yes, I agree that treating the missing and corrupt stats the same is a good
I agree that could be an issue, but I think that is currently by design. As per
the code comments,

// Ignore partitions with missing stats in the hope they don't matter
// enough to change the planning outcome.

I think it might make sense to change that, but, as Tim said, I think we should
keep the current patch simple and focused on the problem described in the JIRA.
We can make this change in a follow-up patch.

I also think that if we were to make this change, we should do something
slightly different. The approach you have currently implemented will fall back
to the data-size based estimate whenever a single partition has corrupt stats.
The problem shows up if you have a table with 100 partitions, 99 of which have
correct stats and 1 of which has corrupt stats. The patch will then fall back
to the data-size based estimate and ignore all stats from the 99 partitions
with correct stats. A better approach would be to continue using the stats
from the 99 partitions, and only do the data-size based estimate for the single
partition with corrupt stats. I think this would be considered a follow-up to
IMPALA-7608, which adds the logic for handling unpartitioned tables or
partitioned tables where all partitions have bad stats, but does not handle the
case where a partitioned table has some partitions with good stats and some
with bad stats. However, again, let's tackle this in a different patch. Feel
free to file a follow-up JIRA.
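
To make the per-partition fallback idea concrete, here is a rough sketch. It
is not the actual HdfsScanNode code. The class, field, and method names are
hypothetical, the rule for "unusable" stats (a negative row count, or zero rows
with a non-zero file size) is an assumption made for illustration, and the
average row size is simply passed in rather than derived from column stats.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: hypothetical names, not Impala's planner classes.
class MixedStatsEstimate {
  // Minimal stand-in for a partition's stats.
  static final class PartitionInfo {
    final long numRows;    // row count from stats; negative means missing
    final long sizeBytes;  // total size of the partition's files

    PartitionInfo(long numRows, long sizeBytes) {
      this.numRows = numRows;
      this.sizeBytes = sizeBytes;
    }

    // Assumption: stats are unusable if missing (negative) or if they claim
    // zero rows while the partition clearly contains data.
    boolean hasUsableStats() {
      return numRows > 0 || (numRows == 0 && sizeBytes == 0);
    }
  }

  // Sums the stats-based row counts of partitions with usable stats and adds
  // a data-size based estimate only for the partitions without them.
  static long estimateRowCount(List<PartitionInfo> partitions,
      double avgRowSizeBytes) {
    if (avgRowSizeBytes <= 0) {
      throw new IllegalArgumentException("avgRowSizeBytes must be positive");
    }
    long total = 0;
    for (PartitionInfo p : partitions) {
      if (p.hasUsableStats()) {
        total += p.numRows;
      } else {
        total += (long) Math.ceil(p.sizeBytes / avgRowSizeBytes);
      }
    }
    return total;
  }
}
```

With this shape, one partition with bad stats only affects its own
contribution to the estimate, and the stats from the other partitions are
still used.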

http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199
PS10, Line 199: CREATE TABLE {0}.{1} (
> Thus, if the objective is to test against a Hive table, the current version
> does the job.

Is that the objective? I thought the objective was to test against corrupt
stats created by Hive's load data local inpath query. I guess I don't
understand what difference it makes whether the table is created via Hive or
via Impala.

http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215: set hive.stats.autogather=true;
> Maybe we add a comment here? The rational is to call out the condition to r
So if hive.stats.autogather is false, does the behavior here change?

--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Wed, 01 Jul 2020 18:59:02 +0000
Gerrit-HasComments: Yes
