Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................

Patch Set 8:

(14 comments)

Thanks for the reviews, please see PS8. I will update again once the private builds for S3 and local FS have finished.

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 530: DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
> Doesn't the above check need to be the same as here?
My assumption was that InitOutputPartition wouldn't be called if key==ROOT_PARTITION_KEY, since that partition would exist already. However that seems to be wrong. The key isn't passed into InitOutputPartition so we cannot repeat the check there.

http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

Line 70: // partition keys, meaning partitions can be opened, written, and closed one by one.
> I think we just use the double // here (and above)
Done. The one above was me some time ago, too :)

Line 71: 4: required bool input_is_clustered
> remove trailing semicolon
Done, and elsewhere in the file.

http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 51: protected final boolean inputIsClustered_;
> inputIsClustered_
Done

http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java
File fe/src/main/java/org/apache/impala/planner/TableSink.java:

Line 85: boolean overwrite, boolean ignoreDuplicates, boolean inputIsClustered) {
> use Java style: inputIsClustered
Done

http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered,noshuffle */
> I missed it earlier that we don't have tests for unpartitioned inserts with
Done.
I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch() to make sure we're performing a partitioned insert. However I'm not sure now whether we actually should allow specifying /*+ clustered */ for non-partitioned tables at all, since it won't have any impact on the plan and will just be silently ignored.

http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
> do we have coverage of the clustered with and without shuffling?
So far we only had coverage in the FE. Added a test. Should we run all tests with shuffle and noshuffle? I assume shuffle makes more sense since we mostly will want to use clustered with large inserts. I added it to the other tests.

http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

Line 112: def test_insert_test(self, vector):
> ?
This makes it possible to test only "insert.test" by specifying "-k test_insert_test" to impala-py.test. A lot of our BE tests have this issue of not being able to just select the main test by -k. Should I revert this, or add a comment? It's explained in the commit message.

http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 481: table_path = get_fs_path(
> don't we need to add the filesystem prefix for S3 and local FS?
I wasn't aware that the S3 and local FS cases were handled by the Hdfs code. I added get_fs_path() calls and started private runs for both S3 and local FS and will update this once they're done.

Line 489:
> add shuffle hint to make sure we shuffle
Done

Line 507: # This test takes about 30 seconds and we are unlikely to break it, so only run it in
> give reason. takes too long?
Added a comment.
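
To illustrate the "-k" point made for test_insert.py (Line 112) above, here is a minimal sketch of why the rename helps. It assumes "-k" behaves like plain substring matching on test names, which is a simplification of what py.test actually does, and the test names below are hypothetical stand-ins rather than the real contents of the file:

```python
def select_by_keyword(test_names, keyword):
  """Simplified model of `-k`: keep every test whose name contains the keyword."""
  return [name for name in test_names if keyword in name]


# Hypothetical test names, for illustration only.
names_before = ["test_insert", "test_insert_overwrite", "test_insert_mem_limit"]
names_after = ["test_insert_test", "test_insert_overwrite", "test_insert_mem_limit"]

# Before the rename, the main test's name is a prefix of all the others,
# so the keyword cannot single it out.
print(select_by_keyword(names_before, "test_insert"))

# After the rename, the keyword matches only the main test.
print(select_by_keyword(names_after, "test_insert_test"))
```

Under that assumption, the first call selects all three tests and the second selects only the renamed one.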

Line 509: if self.exploration_strategy() != 'exhaustive':
> same comment about the fs prefix
See comment above.

Line 534:
> does this test make any assumptions about whether the shuffling behavior? b
No, it doesn't make any assumptions. I added the hint.

Line 554: l_returnflag
> it feels like this could become flaky relatively easily (changing compressi
What would be good values here? Should we leave it and see if things break and then gradually adapt?

--
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes