Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 )
Change subject: IMPALA-7608: Estimate row count from file size when no stats available ...................................................................... Patch Set 8: (20 comments) Another round of thoughts on the test infrastructure. I think the product code looks fine, just want to make sure the tests are more maintainable and easier to work on in the future. http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/12974/8/be/src/service/query-options.h@48 PS8, Line 48: // The Debug webpage won't be able to be connected once the DCHECK fails. Remove the last line, no need to describe symptoms of a DCHECK here. http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12974/8/common/thrift/ImpalaService.thrift@391 PS8, Line 391: DISABLE_HDFS_NUM_ROWS_EST EST->ESTIMATE. No real need to abbreviate in a safety valve argument that will be rarely used. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@285 PS8, Line 285: // Added this field to prevent the case in the method getStatsNumRows when Can you rewrite the comment to just describe what the value is. No need to include the story about why you added it here. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@288 PS8, Line 288: private double defaultRowWidth_ = 1.0; Should be a constant, not a variable, i.e. private static double DEFAULT_ROW_WIDTH; Maybe also rename to indicate that it's an estimate, not the real row width, i.e. DEFAULT_ROW_WIDTH_ESTIMATE. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1148 PS8, Line 1148: private long getStatsNumRows(Analyzer analyzer) { I would prefer if you just passed in the query options to this methods. Otherwise when reading code one might think that it's using the analyzer in a more complex way. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1190 PS8, Line 1190: // In the case when there is no Column defined, we use an ultimate Do we have a test case for this? http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: PS8: We should also test the nodes with the new estimate functionality disabled in these unit tests. Just to make sure that all branches in the code that handle missing stats are covered. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@234 PS8, Line 234: // The cardinality of an AggregateNode is always 1 no matter Not true if there's a group by - we should test that too. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@262 PS8, Line 262: path.add(0); Can we use something cleaner for constant lists like: Arrays.asList(0, 1, 0); http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@454 PS8, Line 454: GENERATE_DISTRIBUTED_PLAN I think this new planner test option is awkward since the planner tests already have a different mechanism to test single-node plans. Also the default for planner tests is to generate a distributed plan, which makes it more confusing. I'd propose that we instead implement it purely within this file instead - i.e. just add an argument to verifyCardinality. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@482 PS8, Line 482: protected void verifyCardinality(String query, long expected) { For the tests that rely on file-size-based estimates, I think we want some margin of error for the cardinality estimates.I think Paul added something for the planner tests. The problem is that file sizes may vary a little bit, e.g. because of changes to the file writers or random variation, so cardinality estimates based on file size can vary too. http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: http://gerrit.cloudera.org:8080/#/c/12974/8/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@253 PS8, Line 253: public void testFkPkJoinDetectionWithHDFSNumRowsEstEnabled() { Let's call the ones with the default options testFkPkJoinDetection() - no need to embed the fact that it's using the default options in the test name. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for the fk-pk join detection in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/joins-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for the join planning in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for the filters in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for mt-dop in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for resource reuqirements in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for buffer sizing in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test File testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite-hdfs-num-rows-est-disabled.test: PS8: I think you can remove test cases that have all known cardinality - duplicating the tests isn't providing additional coverage for subqueries in those cases. http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test File testdata/workloads/functional-query/queries/QueryTest/set.test: http://gerrit.cloudera.org:8080/#/c/12974/8/testdata/workloads/functional-query/queries/QueryTest/set.test@5 PS8, Line 5: # Note that there should not be any space around the comma used to separate different I'd suggest removing this comment. I generally like the idea of documenting pitfalls you ran into, but I think the chances of someone else making this mistake in this particular file are minimal. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 8 Gerrit-Owner: Fang-Yu Rao <fangyu....@cloudera.com> Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Thu, 16 May 2019 18:38:03 +0000 Gerrit-HasComments: Yes