[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

Lars Volker (Code Review) Fri, 04 Nov 2016 04:31:56 -0700

Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................



Patch Set 8:

(14 comments)

Thanks for the reviews, please see PS8. I will update again once the private 
builds for S3 and local FS have finished.

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 530:   DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
> Doesn't the above check need to be the same as here?
My assumption was that InitOutputPartition wouldn't be called if 
key==ROOT_PARTITION_KEY, since that partition would exist already. However that 
seems to be wrong. The key isn't passed into InitOutputPartition so we cannot 
repeat the check there.


http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

Line 70:   // partition keys, meaning partitions can be opened, written, and 
closed one by one.
> I think we just use the double // here (and above)
Done. The one above was me some time ago, too :)


Line 71:   4: required bool input_is_clustered
> remove trailing semicolon
Done, and elsewhere in the file.


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 51:   protected final boolean inputIsClustered_;
> inputIsClustered_
Done


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java
File fe/src/main/java/org/apache/impala/planner/TableSink.java:

Line 85:       boolean overwrite, boolean ignoreDuplicates, boolean 
inputIsClustered) {
> use Java style: inputIsClustered
Done


http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered,noshuffle */
> I missed it earlier that we don't have tests for unpartitioned inserts with
Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch() to 
make sure we're performing a partitioned insert.

However I'm not sure now whether we actually should allow specifying /*+ 
clustered */ for non-partitioned tables at all, since it won't have any impact 
on the plan and will just be silently ignored.


http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
> do we have coverage of the clustered with and without shuffling?
So far we only had coverage in the FE. Added a test. Should we run all tests 
with shuffle and noshuffle? I assume shuffle makes more sense since we mostly 
will want to use clustered with large inserts. I added it to the other tests.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

Line 112:   def test_insert_test(self, vector):
> ?
This makes it possible to test only "insert.test" by specifying "-k 
test_insert_test" to impala-py.test. A lot of our BE tests have this issue of 
not being able to just select the main test by -k. Should I revert this, or add 
a comment? It's explained in the commit message.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 481:     table_path = get_fs_path(
> don't we need to add the filesystem prefix for S3 and local FS?
I wasn't aware that the S3 and local FS cases were handled by the Hdfs code. I 
added get_fs_path() calls and started private runs for both S3 and local FS and 
will update this once they're done.


Line 489: 
> add shuffle hint to make sure we shuffle
Done


Line 507:     # This test takes about 30 seconds and we are unlikely to break 
it, so only run it in
> give reason. takes too long?
Added a comment.


Line 509:     if self.exploration_strategy() != 'exhaustive':
> same comment about the fs prefix
See comment above.


Line 534: 
> does this test make any assumptions about whether the shuffling behavior? b
No, it doesn't make any assumptions. I added the hint.


Line 554:                l_returnflag
> it feels like this could become flaky relatively easily (changing compressi
What would be good values here? Should we leave it and see if things break and 
then gradually adapt?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

Reply via email to