[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

Alex Behm (Code Review) Thu, 03 Nov 2016 15:58:02 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................



Patch Set 7:

(12 comments)

Be changes look pretty good to me.

http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

Line 70:   /// partition keys, meaning partitions can be opened, written, and 
closed one by one.
I think we just use the double // here (and above)


Line 71:   4: required bool input_is_clustered;
remove trailing semicolon


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 51:   protected final boolean input_is_clustered_;
inputIsClustered_


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java
File fe/src/main/java/org/apache/impala/planner/TableSink.java:

Line 85:       boolean overwrite, boolean ignoreDuplicates, boolean 
input_is_clustered) {
use Java style: inputIsClustered


http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
do we have coverage of the clustered with and without shuffling?


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

Line 112:   def test_insert_test(self, vector):
?


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 481:     table_path = 
"test-warehouse/{0}.db/insert_clustered".format(unique_database)
don't we need to add the filesystem prefix for S3 and local FS?

get_fs_path() in filesystem_utils.py

Alternatively, disable this test for those cases. Are we sure only a single 
file will be created on S3?

Might be good to give this patch a private try on S3


Line 489:     insert_stmt = """insert into {0} partition(year, month) /*+ 
clustered */
add shuffle hint to make sure we shuffle


Line 507:       pytest.skip("only runs in exhaustive")
give reason. takes too long?


Line 509:     table_path = 
"test-warehouse/{0}.db/insert_clustered".format(unique_database)
same comment about the fs prefix


Line 534:     insert_stmt = """insert into {0} partition(l_returnflag) /*+ 
clustered */ select
does this test make any assumptions about whether the shuffling behavior? 
better to enforce with a hint


Line 554:     expected_partition_files = [("l_returnflag=A", 7),
it feels like this could become flaky relatively easily (changing compression 
or parquet format details), maybe relax the condition to check that the number 
of files are within some bound?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

Reply via email to