Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16098 )

Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad 
plans
......................................................................


Patch Set 11:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164
PS10, Line 1164:
> I thought that getNumClusteringCols.size() == number of partition columns f
Yes, by a single-partitioned table I mean a non-partitioned table. We used the 
term a lot at my previous job.


http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179
PS10, Line 1179: // If all partitions have good stats, return the total row 
count, contributed
               :     // by all of them, as the row count for the table.
> > So to summarize, the goal here (or at least original intention of this JI
Yes, I agree that treating missing and corrupt stats the same way is a good 
idea for the purpose of providing a useful row count (RC).

My first impression of the original logic, at line 1179,

if (numPartitionsWithNumRows_ > 0) return partitionNumRows_;

is that it can seriously underestimate the RC when only one partition has 
good stats, because partitions without stats contribute nothing to the sum. 
This is addressed by the fix.
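
To illustrate the underestimate, here is a minimal Python sketch. It is not the 
actual Java in HdfsScanNode: the function names, the use of None for a 
partition whose stats are missing or corrupt, and the -1 "unknown" sentinel are 
all assumptions made for illustration only.

```python
from typing import List, Optional

# Hypothetical model: each entry is one partition's row count, or None when
# that partition's stats are missing or corrupt (both treated the same way).
PartitionRows = List[Optional[int]]

UNKNOWN = -1  # assumed sentinel meaning "no usable row count"


def row_count_any_stats(partitions: PartitionRows) -> int:
    """Mimic the original check: use the partial sum if any partition has stats."""
    known = [n for n in partitions if n is not None]
    return sum(known) if known else UNKNOWN


def row_count_all_stats(partitions: PartitionRows) -> int:
    """Return the total only when every partition has good stats."""
    if partitions and all(n is not None for n in partitions):
        return sum(partitions)
    return UNKNOWN


# Four partitions, of which only the first has good stats.
table = [1000, None, None, None]
print(row_count_any_stats(table))  # partial sum over a single partition
print(row_count_all_stats(table))  # no trusted total for the table
```

With only one of four partitions reporting stats, the first function returns 
that partition's count as if it were the whole table, while the second reports 
the count as unknown so that the caller can fall back to another estimate. The 
fallback used in the patch is not modeled here.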


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199
PS10, Line 199:         int_col int,
> Isn't it the load data local inpath query that is setting the stats to a co
Yes, the bad stats were created by the load step. Sorry, I did not state that 
clearly in my comment above, and I also missed the following point.

If we create the table with 'create table like' in Impala, the test table 
itself becomes an Impala table, regardless of whether it is internal or 
external. See 
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#tables.

Normally, a table is "native" to an engine only if it was created in that 
engine. Thus, if the objective is to test against a Hive table, the current 
version does the job.

We could add a test case, as you suggested, to cover the Impala table.


http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215
PS10, Line 215:     # Make the table visible in Impala.
> It's set to true by default in Hive already: https://github.com/apache/hive
Maybe we could add a comment here? The rationale is to call out the condition 
needed to reproduce the bad stats in the test, regardless of Hive's default 
setting.



--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Jun 2020 15:04:42 +0000
Gerrit-HasComments: Yes
