Fang-Yu Rao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats 
available
......................................................................


Patch Set 9:

(21 comments)

> Patch Set 8:
>
> (1 comment)

Hi all, I have tried to addressed your previous comments. Please review the 
updated patch set. Thank you very much!

http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h@48
PS8, Line 48: #define QUERY_OPTS_TABLE\
> Remove the last line, no need to describe symptoms of a DCHECK here.
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift@391
PS8, Line 391: // scanning.
> EST->ESTIMATE. No real need to abbreviate in a safety valve argument that w
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@285
PS8, Line 285:   private int numFilesNoDiskIds_ = 0;
> Can you rewrite the comment to just describe what the value is. No need to
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@288
PS8, Line 288:   // List of conjuncts for min/max values of 
parquet::Statistics, that are used to skip
> Should be a constant, not a variable, i.e. private static double DEFAULT_RO
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1148
PS8, Line 1148:                   " sel=" + 
Double.toString(computeSelectivity()));
> I would prefer if you just passed in the query options to this methods. Oth
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1190
PS8, Line 1190:     }
> Do we have a test case for this?
It seems currently we do not have a test case for this. After some 
investigation, I found the following table could exercise this code path.

CREATE TABLE array_demo (
  pets ARRAY <STRING>
) STORED AS PARQUET;

Note that sumAvgRowSizes is equal to 0 not because there is no Column defined 
for this table. It is because the type of the current Column is not of 
ScalarType. In this specific case, the current Column is of ArrayType. 
Similarly, if the column is of MapType, sumAvgRowSizes would be equal to 0 as 
well.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1198
PS8, Line 1198:     // the hdfs table.
> Should we estimate the compression factor differently depending on the file
Thanks. There are 8 supported file format defined in 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java.

After some discussions with Tim, I divided the files into 3 categories - 
uncompressed, legacy compressed (e.g., text, avro, rc, seq), and  columnar 
(e.g., parquet and orc). Depending on the category of a file, we multiply the 
size of the file by its corresponding compression factor to derive an estimated 
original size of the file before compression, based on which we could compute 
an estimate of the number of rows in the file according to the estimate of the 
row width.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

PS8:
Thanks! I will do as suggested.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@234
PS8, Line 234:
> Not true if there's a group by - we should test that too.
Thanks! Will add another test case for group by.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@262
PS8, Line 262:     verifyCardinality("SELECT COUNT(a) FROM 
functional.tinytable", 1, true,
> Can we use something cleaner for constant lists like: Arrays.asList(0, 1, 0
Thanks! Will do as suggested.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@454
PS8, Line 454:  functional.tinytable.
> I think this new planner test option is awkward since the planner tests alr
Thanks for the suggestion! I will revise verifyCardinality as suggested.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@482
PS8, Line 482:     List<Integer> path = Arrays.asList(0);
Thanks for the suggestion! I have added a static variable called "tolerance" in 
this class that denotes the margin of error. The default value of tolerance is 
0.05, meaning that (expected value - actual value) <= expected value * 0.05.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@253
PS8, Line 253:   public void testFkPkJoinDetection() {
> Let's call the ones with the default options testFkPkJoinDetection() - no n
Thanks! I will modify the names of other tests with the default options as well.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test:

PS8:
> I think you can remove test cases that have all known cardinality - duplica
Done


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test
File testdata/workloads/functional-query/queries/QueryTest/set.test:

http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test@5
PS8, Line 5: set buffer_pool_limit=7;
> I'd suggest removing this comment. I generally like the idea of documenting
Thanks! I will remove this comment.



--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 9
Gerrit-Owner: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Mon, 03 Jun 2019 01:31:57 +0000
Gerrit-HasComments: Yes

Reply via email to