Alex Behm has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input ......................................................................
Patch Set 7: (12 comments) Be changes look pretty good to me. http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift File common/thrift/DataSinks.thrift: Line 70: /// partition keys, meaning partitions can be opened, written, and closed one by one. I think we just use the double // here (and above) Line 71: 4: required bool input_is_clustered; remove trailing semicolon http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java: Line 51: protected final boolean input_is_clustered_; inputIsClustered_ http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java File fe/src/main/java/org/apache/impala/planner/TableSink.java: Line 85: boolean overwrite, boolean ignoreDuplicates, boolean input_is_clustered) { use Java style: inputIsClustered http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 863: insert into table alltypesinsert do we have coverage of the clustered with and without shuffling? http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py File tests/query_test/test_insert.py: Line 112: def test_insert_test(self, vector): ? http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 481: table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database) don't we need to add the filesystem prefix for S3 and local FS? get_fs_path() in filesystem_utils.py Alternatively, disable this test for those cases. Are we sure only a single file will be created on S3? Might be good to give this patch a private try on S3 Line 489: insert_stmt = """insert into {0} partition(year, month) /*+ clustered */ add shuffle hint to make sure we shuffle Line 507: pytest.skip("only runs in exhaustive") give reason. takes too long? Line 509: table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database) same comment about the fs prefix Line 534: insert_stmt = """insert into {0} partition(l_returnflag) /*+ clustered */ select does this test make any assumptions about whether the shuffling behavior? better to enforce with a hint Line 554: expected_partition_files = [("l_returnflag=A", 7), it feels like this could become flaky relatively easily (changing compression or parquet format details), maybe relax the condition to check that the number of files are within some bound? -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes