Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 )
Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans ...................................................................... Patch Set 17: (11 comments) http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1178 PS14, Line 1178: if (partNumRows > -1) { : if (partitionNumRows_ == -1) partitionNumRows_ = 0; : > Done ping - this is still an issue. it should be } else if (partNumRows > -1) { http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1189 PS14, Line 1189: > This is an extra protection for the #partitions in the loop above ever bein Done http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1200 PS14, Line 1200: table size from those partitions with bad r > Yes, it is needed for single-partition tables. but non-partitioned tables (e.g. getNumClusteringCols == 0), have a single FeFsPartition that basically represents the entire table. the for loop about over the partitions_ table should still set hasCorruptTableStats_ even for non-partitioned tables. http://gerrit.cloudera.org:8080/#/c/16098/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1201 PS14, Line 1201: n > hasCorruptTableStats_ here actually means that some partition is of bad sta sorry, I meant shouldn't it check of any partitions have either corrupt or missing stats? right now it only checks for corrupt stats, but shouldn't it check if there are missing stats for any partitions as well? http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1187 PS17, Line 1187: if (partitionsWithCorruptOrMissingStats.size() == 0 && numPartitionsWithNumRows_ > 0) : return partitionNumRows_; needs curly braces around the body of the if statement http://gerrit.cloudera.org:8080/#/c/16098/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1244 PS17, Line 1244: p: nit should be "p :" http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py File tests/metadata/test_compute_stats.py: http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@174 PS17, Line 174: we need to add a test for the mixed case for partitioned tables - where some partitions have good stats and other have missing / corrupt stats http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@180 PS17, Line 180: Hive entire test name should be all lower case http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@197 PS17, Line 197: CREATE TABLE {0}.{1} ( : id int COMMENT 'Add a comment', : bool_col boolean, : tinyint_col tinyint, : smallint_col smallint, : int_col int, : bigint_col bigint, : float_col float, : double_col double, : date_string_col string, : string_col string, : timestamp_col timestamp, : year int, : month int ) : PARTITIONED BY (decade string) : STORED AS PARQUET; I know we discussed this before, but I still don't understand why you can't create the table in Impala. it doesn't really matter, but I made the following change locally and the test still passes: diff --git a/tests/metadata/test_compute_stats.py b/tests/metadata/test_compute_stats.py index c477773..87ecc37 100644 --- a/tests/metadata/test_compute_stats.py +++ b/tests/metadata/test_compute_stats.py @@ -193,23 +193,11 @@ class TestComputeStats(ImpalaTestSuite): # Setting hive.stats.autogather=true after CRTB DDL but before LOAD DML # minimally reproduces the corrupt stats issue. + impala_create = "CREATE TABLE {0}.{1} like parquet '{2}' PARTITIONED BY (decade string)".format( + unique_database, table_name, "file://" + local_file) + self.execute_query(impala_create) + create_load_data_stmts = """ - CREATE TABLE {0}.{1} ( - id int COMMENT 'Add a comment', - bool_col boolean, - tinyint_col tinyint, - smallint_col smallint, - int_col int, - bigint_col bigint, - float_col float, - double_col double, - date_string_col string, - string_col string, - timestamp_col timestamp, - year int, - month int ) - PARTITIONED BY (decade string) - STORED AS PARQUET; set hive.stats.autogather=true; load data local inpath '{2}' into table {0}.{1} partition (decade="2020s"); """.format(unique_database, table_name, local_file) http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@231 PS17, Line 231: assert("cardinality=[1-9][0-9]*" in explain_result.data[i + 2]) there should be some validation of the output for "partitions: %s/%s rows=%s" as well. http://gerrit.cloudera.org:8080/#/c/16098/17/tests/metadata/test_compute_stats.py@247 PS17, Line 247: # Load from a local data file most of the code in this test method and the above is the same, it should be re-factored into a private helper method to avoid code duplication between the two methods -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 17 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Fri, 17 Jul 2020 18:50:43 +0000 Gerrit-HasComments: Yes