[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available

Tim Armstrong (Code Review) Thu, 16 May 2019 11:48:57 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12974 )


Change subject: IMPALA-7608: Estimate row count from file size when no stats 
available
......................................................................


Patch Set 8:

(20 comments)

Another round of thoughts on the test infrastructure. I think the product code 
looks fine, just want to make sure the tests are more maintainable and easier 
to work on in the future.

http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h@48
PS8, Line 48: // The Debug webpage won't be able to be connected once the 
DCHECK fails.
Remove the last line, no need to describe symptoms of a DCHECK here.


http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift@391
PS8, Line 391: DISABLE_HDFS_NUM_ROWS_EST
EST->ESTIMATE. No real need to abbreviate in a safety valve argument that will 
be rarely used.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@285
PS8, Line 285:   // Added this field to prevent the case in the method 
getStatsNumRows when
Can you rewrite the comment to just describe what the value is. No need to 
include the story about why you added it here.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@288
PS8, Line 288:   private double defaultRowWidth_ = 1.0;
Should be a constant, not a variable, i.e. private static double 
DEFAULT_ROW_WIDTH; Maybe also rename to indicate that it's an estimate, not the 
real row width, i.e. DEFAULT_ROW_WIDTH_ESTIMATE.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1148
PS8, Line 1148:   private long getStatsNumRows(Analyzer analyzer) {
I would prefer if you just passed in the query options to this methods. 
Otherwise when reading code one might think that it's using the analyzer in a 
more complex way.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1190
PS8, Line 1190:         // In the case when there is no Column defined, we use 
an ultimate
Do we have a test case for this?


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

PS8:
We should also test the nodes with the new estimate functionality disabled in 
these unit tests. Just to make sure that all branches in the code that handle 
missing stats are covered.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@234
PS8, Line 234:     // The cardinality of an AggregateNode is always 1 no matter
Not true if there's a group by - we should test that too.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@262
PS8, Line 262:     path.add(0);
Can we use something cleaner for constant lists like: Arrays.asList(0, 1, 0);


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@454
PS8, Line 454: GENERATE_DISTRIBUTED_PLAN
I think this new planner test option is awkward since the planner tests already 
have a different mechanism to test single-node plans. Also the default for 
planner tests is to generate a distributed plan, which makes it more confusing.

I'd propose that we instead implement it purely within this file instead - i.e. 
just add an argument to verifyCardinality.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@482
PS8, Line 482:   protected void verifyCardinality(String query, long expected) {
For the tests that rely on file-size-based estimates, I think we want some 
margin of error for the cardinality estimates.I think Paul added something for 
the planner tests.

The problem is that file sizes may vary a little bit, e.g. because of changes 
to the file writers or random variation, so cardinality estimates based on file 
size can vary too.


http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@253
PS8, Line 253:   public void testFkPkJoinDetectionWithHDFSNumRowsEstEnabled() {
Let's call the ones with the default options testFkPkJoinDetection() - no need 
to embed the fact that it's using the default options in the test name.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for the fk-pk join detection in 
those cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for the join planning in those 
cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for the filters in those cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for mt-dop in those cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for resource reuqirements in 
those cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for buffer sizing in those cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test:

PS8:
I think you can remove test cases that have all known cardinality - duplicating 
the tests isn't providing additional coverage for subqueries in those cases.


http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test
File testdata/workloads/functional-query/queries/QueryTest/set.test:

http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test@5
PS8, Line 5: # Note that there should not be any space around the comma used to 
separate different
I'd suggest removing this comment. I generally like the idea of documenting 
pitfalls you ran into, but I think the chances of someone else making this 
mistake in this particular file are minimal.



--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 8
Gerrit-Owner: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Thu, 16 May 2019 18:38:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7608: Estimate row count from file size when no stats available

Reply via email to