Fang-Yu Rao has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 )
Change subject: IMPALA-7608: Estimate row count from file size when no stats available ...................................................................... Patch Set 9: (21 comments) > Patch Set 8: > > (1 comment) Hi all, I have tried to addressed your previous comments. Please review the updated patch set. Thank you very much! http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h@48 PS8, Line 48: #define QUERY_OPTS_TABLE\ > Remove the last line, no need to describe symptoms of a DCHECK here. Done http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift@391 PS8, Line 391: // scanning. > EST->ESTIMATE. No real need to abbreviate in a safety valve argument that w Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@285 PS8, Line 285: private int numFilesNoDiskIds_ = 0; > Can you rewrite the comment to just describe what the value is. No need to Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@288 PS8, Line 288: // List of conjuncts for min/max values of parquet::Statistics, that are used to skip > Should be a constant, not a variable, i.e. private static double DEFAULT_RO Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1148 PS8, Line 1148: " sel=" + Double.toString(computeSelectivity())); > I would prefer if you just passed in the query options to this methods. Oth Done http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1190 PS8, Line 1190: } > Do we have a test case for this? It seems currently we do not have a test case for this. After some investigation, I found the following table could exercise this code path. CREATE TABLE array_demo ( pets ARRAY <STRING> ) STORED AS PARQUET; Note that sumAvgRowSizes is equal to 0 not because there is no Column defined for this table. It is because the type of the current Column is not of ScalarType. In this specific case, the current Column is of ArrayType. Similarly, if the column is of MapType, sumAvgRowSizes would be equal to 0 as well. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1198 PS8, Line 1198: // the hdfs table. > Should we estimate the compression factor differently depending on the file Thanks. There are 8 supported file format defined in https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java. After some discussions with Tim, I divided the files into 3 categories - uncompressed, legacy compressed (e.g., text, avro, rc, seq), and columnar (e.g., parquet and orc). Depending on the category of a file, we multiply the size of the file by its corresponding compression factor to derive an estimated original size of the file before compression, based on which we could compute an estimate of the number of rows in the file according to the estimate of the row width. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: PS8: Thanks! I will do as suggested. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@234 PS8, Line 234: > Not true if there's a group by - we should test that too. Thanks! Will add another test case for group by. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@262 PS8, Line 262: verifyCardinality("SELECT COUNT(a) FROM functional.tinytable", 1, true, > Can we use something cleaner for constant lists like: Arrays.asList(0, 1, 0 Thanks! Will do as suggested. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@454 PS8, Line 454: functional.tinytable. > I think this new planner test option is awkward since the planner tests alr Thanks for the suggestion! I will revise verifyCardinality as suggested. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@482 PS8, Line 482: List<Integer> path = Arrays.asList(0); Thanks for the suggestion! I have added a static variable called "tolerance" in this class that denotes the margin of error. The default value of tolerance is 0.05, meaning that (expected value - actual value) <= expected value * 0.05. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@253 PS8, Line 253: public void testFkPkJoinDetection() { > Let's call the ones with the default options testFkPkJoinDetection() - no n Thanks! I will modify the names of other tests with the default options as well. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test: PS8: > I think you can remove test cases that have all known cardinality - duplica Done http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test File testdata/workloads/functional-query/queries/QueryTest/set.test: http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test@5 PS8, Line 5: set buffer_pool_limit=7; > I'd suggest removing this comment. I generally like the idea of documenting Thanks! I will remove this comment. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 9 Gerrit-Owner: Fang-Yu Rao <fangyu....@cloudera.com> Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Mon, 03 Jun 2019 01:31:57 +0000 Gerrit-HasComments: Yes