Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16098 )

Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad 
plans
......................................................................


Patch Set 17:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1178
PS14, Line 1178: if (partNumRows > -1) {
               :           if (partitionNumRows_ == -1) partitionNumRows_ = 0;
               :
> Done
ping - this is still an issue. it should be

 } else if (partNumRows > -1) {


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1189
PS14, Line 1189:
> This is an extra protection for the #partitions in the loop above ever bein
Done


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1200
PS14, Line 1200:  table size from those partitions with bad r
> Yes, it is needed for single-partition tables.
but non-partitioned tables (e.g. getNumClusteringCols == 0), have a single 
FeFsPartition that basically represents the entire table. the for loop about 
over the partitions_ table should still set hasCorruptTableStats_ even for 
non-partitioned tables.


http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1201
PS14, Line 1201: n
> hasCorruptTableStats_ here actually means that some partition is of bad sta
sorry, I meant shouldn't it check of any partitions have either corrupt or 
missing stats? right now it only checks for corrupt stats, but shouldn't it 
check if there are missing stats for any partitions as well?


http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1187
PS17, Line 1187:     if (partitionsWithCorruptOrMissingStats.size() == 0 && 
numPartitionsWithNumRows_ > 0)
               :       return partitionNumRows_;
needs curly braces around the body of the if statement


http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1244
PS17, Line 1244: p:
nit should be "p :"


http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@174
PS17, Line 174:
we need to add a test for the mixed case for partitioned tables - where some 
partitions have good stats and other have missing / corrupt stats


http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@180
PS17, Line 180: Hive
entire test name should be all lower case


http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@197
PS17, Line 197:       CREATE TABLE {0}.{1} (
              :         id int COMMENT 'Add a comment',
              :         bool_col boolean,
              :         tinyint_col tinyint,
              :         smallint_col smallint,
              :         int_col int,
              :         bigint_col bigint,
              :         float_col float,
              :         double_col double,
              :         date_string_col string,
              :         string_col string,
              :         timestamp_col timestamp,
              :         year int,
              :         month int )
              :         PARTITIONED BY (decade string)
              :         STORED AS PARQUET;
I know we discussed this before, but I still don't understand why you can't 
create the table in Impala. it doesn't really matter, but I made the following 
change locally and the test still passes:

 diff --git a/tests/metadata/test_compute_stats.py 
b/tests/metadata/test_compute_stats.py
index c477773..87ecc37 100644
--- a/tests/metadata/test_compute_stats.py
+++ b/tests/metadata/test_compute_stats.py
@@ -193,23 +193,11 @@ class TestComputeStats(ImpalaTestSuite):

     # Setting hive.stats.autogather=true after CRTB DDL but before LOAD DML
     # minimally reproduces the corrupt stats issue.
+    impala_create = "CREATE TABLE {0}.{1} like parquet '{2}' PARTITIONED BY 
(decade string)".format(
+        unique_database, table_name, "file://" + local_file)
+    self.execute_query(impala_create)
+
     create_load_data_stmts = """
-      CREATE TABLE {0}.{1} (
-        id int COMMENT 'Add a comment',
-        bool_col boolean,
-        tinyint_col tinyint,
-        smallint_col smallint,
-        int_col int,
-        bigint_col bigint,
-        float_col float,
-        double_col double,
-        date_string_col string,
-        string_col string,
-        timestamp_col timestamp,
-        year int,
-        month int )
-        PARTITIONED BY (decade string)
-        STORED AS PARQUET;
       set hive.stats.autogather=true;
       load data local inpath '{2}' into table {0}.{1} partition 
(decade="2020s");
     """.format(unique_database, table_name, local_file)


http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@231
PS17, Line 231:          assert("cardinality=[1-9][0-9]*" in 
explain_result.data[i + 2])
there should be some validation of the output for "partitions: %s/%s rows=%s" 
as well.


http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@247
PS17, Line 247:     # Load from a local data file
most of the code in this test method and the above is the same, it should be 
re-factored into a private helper method to avoid code duplication between the 
two methods



--
To view, visit http://gerrit.cloudera.org:8080/16098
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576
Gerrit-Change-Number: 16098
Gerrit-PatchSet: 17
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jul 2020 18:50:43 +0000
Gerrit-HasComments: Yes

Reply via email to