[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. - It adds tests to the Analyzer test suite. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Reviewed-on: http://gerrit.cloudera.org:8080/4863 Reviewed-by: Lars Volker Reviewed-by: Tim Armstrong Tested-by: Internal Jenkins --- M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-writer.cc M be/src/exec/hbase-table-writer.h M be/src/exec/hdfs-avro-table-writer.cc M be/src/exec/hdfs-avro-table-writer.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/hdfs-sequence-table-writer.cc M be/src/exec/hdfs-sequence-table-writer.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M be/src/exec/hdfs-table-writer.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/hdfs-text-table-writer.h M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 26 files changed, 457 insertions(+), 161 deletions(-) Approvals: Lars Volker: Looks good to me, but someone else must approve Internal Jenkins: Verified Tim Armstrong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 20 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Internal Jenkins has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Internal Jenkins has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/518/ -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Internal Jenkins has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/515/ -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Internal Jenkins has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/510/ -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Internal Jenkins has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Verified-1 Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/509/ -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Code-Review+2 Carry the +2 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 19: Code-Review+1 Replaced NULL with nullptr as discussed in previous comments. Carry +1. Now a committer required to carry the +2. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Marcel Kornacker, Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#19). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. - It adds tests to the Analyzer test suite. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-writer.cc M be/src/exec/hbase-table-writer.h M be/src/exec/hdfs-avro-table-writer.cc M be/src/exec/hdfs-avro-table-writer.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/hdfs-sequence-table-writer.cc M be/src/exec/hdfs-sequence-table-writer.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M be/src/exec/hdfs-table-writer.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/hdfs-text-table-writer.h M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 26 files changed, 457 insertions(+), 161 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/19 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 18: Code-Review+1 Rebased and ran the relevant tests locally, carry +1. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 18 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Marcel Kornacker, Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#18). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. - It adds tests to the Analyzer test suite. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-writer.cc M be/src/exec/hbase-table-writer.h M be/src/exec/hdfs-avro-table-writer.cc M be/src/exec/hdfs-avro-table-writer.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/hdfs-sequence-table-writer.cc M be/src/exec/hdfs-sequence-table-writer.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M be/src/exec/hdfs-table-writer.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/hdfs-text-table-writer.h M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 26 files changed, 428 insertions(+), 138 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/18 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 18 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 16: (3 comments) Thanks for the review. I addressed the comments in PS17 and will rebase the change next. http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 299: // then the batch contains rows with the same key and can be written as a whole. > then all rows in the batch have the same key Done http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 214: /// Maps all rows in 'batch' to partitions and appends them to their temporary Hdfs > good point, better revert then. Done http://gerrit.cloudera.org:8080/#/c/4863/16/be/src/exec/hdfs-table-writer.h File be/src/exec/hdfs-table-writer.h: Line 70: /// rows in the batch are appended. > i find this confusing, because it comments on the intention of the caller ( Done -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Marcel Kornacker, Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#17). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. - It adds tests to the Analyzer test suite. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-writer.cc M be/src/exec/hbase-table-writer.h M be/src/exec/hdfs-avro-table-writer.cc M be/src/exec/hdfs-avro-table-writer.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/hdfs-sequence-table-writer.cc M be/src/exec/hdfs-sequence-table-writer.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M be/src/exec/hdfs-table-writer.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/hdfs-text-table-writer.h M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 26 files changed, 428 insertions(+), 138 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/17 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 17 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 16: Code-Review+2 (3 comments) http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 299: // then the batch contains rows with the same key and can be written as a whole. then all rows in the batch have the same key http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 214: /// Maps all rows in 'batch' to partitions and appends them to their temporary Hdfs > Done. Though I think "must be" makes it more clear that it's a prerequisit good point, better revert then. http://gerrit.cloudera.org:8080/#/c/4863/16/be/src/exec/hdfs-table-writer.h File be/src/exec/hdfs-table-writer.h: Line 70: /// rows in the batch are appended. i find this confusing, because it comments on the intention of the caller (if multiple partitions then...), and it doesn't even make that clear. this function doesn't check for any partitions. it appends the rows in 'batch' that were selected via row_group_indices, and if the latter is empty, appends every row. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 16: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 15: (5 comments) Thanks for the comments, please see PS16. http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: PS15, Line 329: key > Could do std:move() to avoid copy of string Done http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: PS15, Line 168: form > from Done Line 214: /// Appends all rows in batch to the temporary Hdfs files their respective partitions. > You're missing a verb in here I think Rephrased. Line 269: PartitionPair* current_clustered_partition_; > "Only set if 'input_is_clustered_' is true" Done Line 272: /// batches. > "Only set if 'input_is_clustered_' is true" Done -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#16). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-writer.cc M be/src/exec/hbase-table-writer.h M be/src/exec/hdfs-avro-table-writer.cc M be/src/exec/hdfs-avro-table-writer.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/hdfs-sequence-table-writer.cc M be/src/exec/hdfs-sequence-table-writer.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M be/src/exec/hdfs-table-writer.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/hdfs-text-table-writer.h M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 26 files changed, 429 insertions(+), 138 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/16 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 15: (5 comments) http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: PS15, Line 329: key Could do std:move() to avoid copy of string http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: PS15, Line 168: form from Line 214: /// Appends all rows in batch to the temporary Hdfs files their respective partitions. You're missing a verb in here I think Line 269: PartitionPair* current_clustered_partition_; "Only set if 'input_is_clustered_' is true" Line 272: /// batches. "Only set if 'input_is_clustered_' is true" -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch( > yes. Done -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#15). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hbase-table-sink.cc M be/src/exec/hbase-table-writer.cc M be/src/exec/hbase-table-writer.h M be/src/exec/hdfs-avro-table-writer.cc M be/src/exec/hdfs-avro-table-writer.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/hdfs-sequence-table-writer.cc M be/src/exec/hdfs-sequence-table-writer.h M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M be/src/exec/hdfs-table-writer.h M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/hdfs-text-table-writer.h M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 26 files changed, 424 insertions(+), 138 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/15 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch( > This is part of the HdfsTableWriter virtual interface. Should I change it h yes. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 13: (10 comments) Thanks for the review. Please see PS14. http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count : > if you break this up, then This has been formatted by clang-format since I touched these lines. While I agree that the formatting could be nicer, I'd like to keep the automatic format here for simplicity. Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key); > we're standardizing on nullptr now I changed all occurrences in lines I touched. Once the change has a +2 I will rebase the change and then replace all other occurrences in the file to keep it consistent (and keep the risk for merge conflicts low until then). Line 266: // The rows of this batch may span across multiple files. We repeatedly pass the row > remove "across" Done Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch( > rename this function, it's not appending the entire row batch (AppendRows?) This is part of the HdfsTableWriter virtual interface. Should I change it here and in all table writer classes? Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) { > to speed this up, you should: Done. AppendRowBatch() already appends the whole batch if the row index vector is empty, so there was no change necessary. http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 170: const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row, > mention that current_row provides the partition key values. Done Line 181: /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and > "pool," Done Line 189: void GetHashTblKey(const TupleRow* current_row, const std::vector& ctxs, > current_row -> row (it doesn't matter to this function that the row is 'cur Done Line 213: /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions. > what "corresponding to partitions" refers to is unclear. Rephrased it. Line 214: /// The input must be ordered by the partition key expressions. > the rows are expected to be ordered...? Done. Though I think "must be" makes it more clear that it's a prerequisite to be guaranteed by the caller. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#14). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 14 files changed, 391 insertions(+), 98 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/14 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 13: (10 comments) http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count : if you break this up, then ...\n ? ...\n : 0 is better Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key); we're standardizing on nullptr now Line 266: // The rows of this batch may span across multiple files. We repeatedly pass the row remove "across" Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch( rename this function, it's not appending the entire row batch (AppendRows?) Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) { to speed this up, you should: - keep a current_clustered_partition_ - also current_clustered_partition_key_ - check the partition key of the last row in batch whether it matches current_partition_key_ - if so, skip the for loop and write the entire batch out also: change WriteRowsToPartition to take the index vector separately; pass a nullptr in the case where the entire batch gets written to the same partition apply that same optimization to writer::AppendRowBatch (or however you rename that). no need to spend time checking some vector in that case. http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: Line 170: const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row, mention that current_row provides the partition key values. Line 181: /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and "pool," Line 189: void GetHashTblKey(const TupleRow* current_row, const std::vector& ctxs, current_row -> row (it doesn't matter to this function that the row is 'current') same for InitOutputPartition Line 213: /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions. what "corresponding to partitions" refers to is unclear. Line 214: /// The input must be ordered by the partition key expressions. the rows are expected to be ordered...? -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 12: Rebased and added tests in AnalyzeStmtsTest as suggested by Alex in https://gerrit.cloudera.org/#/c/5051/1/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#13). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 14 files changed, 357 insertions(+), 95 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/13 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 12: Code-Review+1 Rebased, carry +1 from Alex. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Alex Behm, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#12). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 13 files changed, 341 insertions(+), 95 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/12 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 11: Marcel, can you have a look for a +2? Thanks. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 11: Passed private S3 run with only query_tests enabled here. Relevant lines for S3: 02:22:40.508 [gw2] PASSED query_test/test_insert_behaviour.py::TestInsertBehaviour::test_clustered_partition_single_file 02:25:23.115 [gw2] PASSED query_test/test_insert_behaviour.py::TestInsertBehaviour::test_clustered_partition_multiple_files -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Alex Behm has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 11: Code-Review+1 Assuming the S3 run passes. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 9: (3 comments) Thanks for the reviews. I'm running another private job to test the change on S3 and will update the Jira once it's done. http://gerrit.cloudera.org:8080/#/c/4863/9/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 488: import os > why? Sry, leftover from trying to get the tests to work. Line 491: table_location = DEFAULT_FS + "/" + table_path > I'm confused. Doesn't get_fs_path() do the right thing? Why do we have to a It does. I couldn't figure out why FILESYSTEM_PREFIX works the way it does on the Jenkins host I ran this on, but not on my local dev machine. Sailesh helped me, now I'm convinced it should work. I'm running another private build and will update this change again once I have results. Line 550: table, table_location) > indent 4 Done -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#11). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/util/filesystem_utils.py 14 files changed, 344 insertions(+), 99 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/11 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#10). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/util/filesystem_utils.py 14 files changed, 344 insertions(+), 99 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/10 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Alex Behm has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/4863/9/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 488: import os why? Line 491: table_location = DEFAULT_FS + "/" + table_path I'm confused. Doesn't get_fs_path() do the right thing? Why do we have to add this DEFAULT_FS prefix? Line 550: table, table_location) indent 4 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Alex Behm has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 7: (3 comments) http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 912: partition (year, month) /*+ clustered */ > Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch( We could add warning when inserting into an unpartitioned table with a clustered hint. Not in this patch, please. http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py File tests/query_test/test_insert.py: Line 112: def test_insert_test(self, vector): > This makes it possible to test only "insert.test" by specifying "-k test_in Sorry missed that in the commit msg. Works for me. http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 534: insert_stmt = """insert into {0} partition(l_returnflag) /*+ clustered */ select > No, it doesn't make any assumptions. I added the hint. I think it does make an assumption. If for whatever reason this plan switched to "noshuffle", we'd get more files and fail this test. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 7: S3 private build failed b/c of stale testdata snapshot. I will rebase next and try again. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 481: table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database) > I wasn't aware that the S3 and local FS cases were handled by the Hdfs code get_fs_path() didn't help, since it only adds a common prefix to the path. I fixed the S3 tests and pushed a new PS. test_insert_behavior is disabled as a whole on local FS. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#9). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M bin/impala-config.sh M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py M tests/util/filesystem_utils.py 14 files changed, 345 insertions(+), 99 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/9 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 8: i'll pick this up once it has an overall +1 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 8: Code-Review+1 (3 comments) +1 on the backend part. http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 530: DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY); > My assumption was that InitOutputPartition wouldn't be called if key==ROOT_ Makes sense, thanks. http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 912: partition (year, month) /*+ clustered,noshuffle */ > Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch( That's a good point, I misunderstood that aspect of the feature - I was thinking of the "SORT BY" clause that isn't implemented yet. http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 554:l_returnflag > What would be good values here? Should we leave it and see if things break I agree with Alex. I think we mainly want to check that we're not losing test coverage. So I'd be ok with something really loose like >= 3 and <= 30, since that would only fail if something drastically changed. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 8: (14 comments) Thanks for the reviews, please see PS8. I will update again once the private builds for S3 and local FS have finished. http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 530: DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY); > Doesn't the above check need to be the same as here? My assumption was that InitOutputPartition wouldn't be called if key==ROOT_PARTITION_KEY, since that partition would exist already. However that seems to be wrong. The key isn't passed into InitOutputPartition so we cannot repeat the check there. http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift File common/thrift/DataSinks.thrift: Line 70: // partition keys, meaning partitions can be opened, written, and closed one by one. > I think we just use the double // here (and above) Done. The one above was me some time ago, too :) Line 71: 4: required bool input_is_clustered > remove trailing semicolon Done, and elsewhere in the file. http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java: Line 51: protected final boolean inputIsClustered_; > inputIsClustered_ Done http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java File fe/src/main/java/org/apache/impala/planner/TableSink.java: Line 85: boolean overwrite, boolean ignoreDuplicates, boolean inputIsClustered) { > use Java style: inputIsClustered Done http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 912: partition (year, month) /*+ clustered,noshuffle */ > I missed it earlier that we don't have tests for unpartitioned inserts with Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch() to make sure we're performing a partitioned insert. However I'm not sure now whether we actually should allow specifying /*+ clustered */ for non-partitioned tables at all, since it won't have any impact on the plan and will just be silently ignored. http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 863: insert into table alltypesinsert > do we have coverage of the clustered with and without shuffling? So far we only had coverage in the FE. Added a test. Should we run all tests with shuffle and noshuffle? I assume shuffle makes more sense since we mostly will want to use clustered with large inserts. I added it to the other tests. http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py File tests/query_test/test_insert.py: Line 112: def test_insert_test(self, vector): > ? This makes it possible to test only "insert.test" by specifying "-k test_insert_test" to impala-py.test. A lot of our BE tests have this issue of not being able to just select the main test by -k. Should I revert this, or add a comment? It's explained in the commit message. http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 481: table_path = get_fs_path( > don't we need to add the filesystem prefix for S3 and local FS? I wasn't aware that the S3 and local FS cases were handled by the Hdfs code. I added get_fs_path() calls and started private runs for both S3 and local FS and will update this once they're done. Line 489: > add shuffle hint to make sure we shuffle Done Line 507: # This test takes about 30 seconds and we are unlikely to break it, so only run it in > give reason. takes too long? Added a comment. Line 509: if self.exploration_strategy() != 'exhaustive': > same comment about the fs prefix See comment above. Line 534: > does this test make any assumptions about whether the shuffling behavior? b No, it doesn't make any assumptions. I added the hint. Line 554:l_returnflag > it feels like this could become flaky relatively easily (changing compressi What would be good values here? Should we leave it and see if things break and then gradually adapt? -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#8). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 318 insertions(+), 95 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/8 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Alex Behm has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 7: (12 comments) Be changes look pretty good to me. http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift File common/thrift/DataSinks.thrift: Line 70: /// partition keys, meaning partitions can be opened, written, and closed one by one. I think we just use the double // here (and above) Line 71: 4: required bool input_is_clustered; remove trailing semicolon http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java: Line 51: protected final boolean input_is_clustered_; inputIsClustered_ http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java File fe/src/main/java/org/apache/impala/planner/TableSink.java: Line 85: boolean overwrite, boolean ignoreDuplicates, boolean input_is_clustered) { use Java style: inputIsClustered http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 863: insert into table alltypesinsert do we have coverage of the clustered with and without shuffling? http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py File tests/query_test/test_insert.py: Line 112: def test_insert_test(self, vector): ? http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 481: table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database) don't we need to add the filesystem prefix for S3 and local FS? get_fs_path() in filesystem_utils.py Alternatively, disable this test for those cases. Are we sure only a single file will be created on S3? Might be good to give this patch a private try on S3 Line 489: insert_stmt = """insert into {0} partition(year, month) /*+ clustered */ add shuffle hint to make sure we shuffle Line 507: pytest.skip("only runs in exhaustive") give reason. takes too long? Line 509: table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database) same comment about the fs prefix Line 534: insert_stmt = """insert into {0} partition(l_returnflag) /*+ clustered */ select does this test make any assumptions about whether the shuffling behavior? better to enforce with a hint Line 554: expected_partition_files = [("l_returnflag=A", 7), it feels like this could become flaky relatively easily (changing compression or parquet format details), maybe relax the condition to check that the number of files are within some bound? -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 530: DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY); Doesn't the above check need to be the same as here? http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: Line 912: partition (year, month) /*+ clustered */ I missed it earlier that we don't have tests for unpartitioned inserts with the clustered hint - we should make sure that it goes down the right code path (I think it does). -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 7: (1 comment) Thanks Tim, for the review. I pushed PS7, removing a DCHECK I had added - though I'm not quite sure why it gets hit. Marcel, can you have a look at PS7 for a +2? http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 395: // Build the unique name for this partition from the partition keys, e.g. "j=1/f=foo/" The insert tests are hitting this DCHECK. Without it they pass, though it is not entirely clear to me why, yet. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4863 to look at the new patch set (#7). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 282 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/7 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 6: LGTM but I think it's a big enough change that we should have another pair of eyes on it -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 5: (4 comments) Thanks for the review, please see PS6. http://gerrit.cloudera.org:8080/#/c/4863/5/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 277: if (new_file) { > Can avoid a level of nesting with if (!new_file) break; Done Line 294: current_row_ = batch->GetRow(i); > Oh man, I didn't notice this 'current_row_' thing earlier. It looks like fo Done Line 302: RETURN_IF_ERROR(WriteRowsToPartition(state, batch, current_partition_pair)); > I think we should also remove the entry in partition_keys_to_output_partiti Done http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 476: > Maybe a little much given core already takes too long - I think it's useful Done -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has uploaded a new patch set (#6). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 283 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/6 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/4863/5/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 277: RETURN_IF_ERROR(FinalizePartitionFile(state, output_partition)); Can avoid a level of nesting with if (!new_file) break; Line 294: GetHashTblKey(dynamic_partition_key_expr_ctxs_, &key); Oh man, I didn't notice this 'current_row_' thing earlier. It looks like for some reason the original author of the code didn't want to pass an extra argument into functions so instead did it implicitly via 'current_row_' Can you refactor it so that it's passed as an actual argument? Line 302: RETURN_IF_ERROR(WriteRowsOfPartition(state, batch, current_partition_pair)); I think we should also remove the entry in partition_keys_to_output_partitions_ here so that we use O(1) memory. http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 476: > Done. It takes 30 seconds on my machine. Too much for core? Maybe a little much given core already takes too long - I think it's useful coverage but we're also probably pretty unlikely to break it. Seems fine for exhaustive. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has uploaded a new patch set (#5). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 253 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/5 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 4: (17 comments) http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 268: // Pass the row batch to the writer. If new_file is returned true then the current > I find this comment a little confusing. Maybe explain more the "why", i.e. Done. I had moved this from elsewhere, but tried to make it more clear. Line 272: do { > I feel like this might be clearer if you write it as while(true) and then b Done Line 288: // TODO: Should we adapt hdfs writer to accept start,num pair instead of a vector? > My guess is that it would only help a little bit. It's nice to use the same Ok Line 296: PartitionPair* next_partition_pair = NULL; > It would be more efficient to keep track of the previous key and comparing Done PS4, Line 301: Flush > I think it might be clearer to just say "Write rows" instead of "flush". Fl Done Line 301: // Flush previous partition > I think the comments in here are a little too low level. Here maybe a singl Done Line 310: // Collect row index > This comment isn't too helpful. Done PS4, Line 313: Flush > "Write rows to" is probably clearer than "flush". Done PS4, Line 607: WriteRowsOfPartition > WriteRowsToPartition? Done http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: PS4, Line 183: rows > Maybe reword to be clearer, took me a while to figure out: Done PS4, Line 209: partitions > partition's (missing apostrophe). Done Line 209: /// Writes all rows of a partition to the partitions writer and clears the row index > This could be clearer that it's only writing the rows with indices in 'part Done PS4, Line 215: is expected to > This could be more declarative - i.e. "must" instead of "is expected to" Done http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert.py File tests/query_test/test_insert.py: PS4, Line 71: text > parquet, right? Done http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 476: > Hmm, I think we could also do with a test that writes some large partitions Done. It takes 30 seconds on my machine. Too much for core? Line 477: > Extra blank line Done Line 501: assert len(files) == 1 > Comment why we only expect one file - it might not be that obvious to peopl Done -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 4: (17 comments) Thanks for adding the tests, this is looking pretty good - I think it could be refined and polished a bit more but no major concerns here. http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 268: // Pass the row batch to the writer. If new_file is returned true then the current I find this comment a little confusing. Maybe explain more the "why", i.e. that the rows may be written to multiple files. Line 272: do { I feel like this might be clearer if you write it as while(true) and then break out if 'new_file' is false. Line 288: // TODO: Should we adapt hdfs writer to accept start,num pair instead of a vector? My guess is that it would only help a little bit. It's nice to use the same code path for this and dynamic partitioning at least for now. Line 296: PartitionPair* next_partition_pair = NULL; It would be more efficient to keep track of the previous key and comparing directly instead of looking up the partition hash table for every row. It's a little more complicated but I think would help perf. PS4, Line 301: Flush I think it might be clearer to just say "Write rows" instead of "flush". Flush makes me think it's writing to the filesystem, but it's really just pushing the rows into the HdfsTableWriter, which does its own buffering. Line 301: // Flush previous partition I think the comments in here are a little too low level. Here maybe a single comment along the lines of "Done with previous partition - write rows and close." might be more helpful. Line 310: // Collect row index This comment isn't too helpful. PS4, Line 313: Flush "Write rows to" is probably clearer than "flush". PS4, Line 607: WriteRowsOfPartition WriteRowsToPartition? http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: PS4, Line 183: rows Maybe reword to be clearer, took me a while to figure out: "a vector of indices of the rows in the current batch to insert into the partition" Line 209: /// Writes all rows of a partition to the partitions writer and clears the row index This could be clearer that it's only writing the rows with indices in 'partition_pair', it kind-of sounds like it's writing all the rows in 'batch' into the partition. PS4, Line 209: partitions partition's (missing apostrophe). PS4, Line 215: is expected to This could be more declarative - i.e. "must" instead of "is expected to" http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert.py File tests/query_test/test_insert.py: PS4, Line 71: text parquet, right? http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py File tests/query_test/test_insert_behaviour.py: Line 476: Hmm, I think we could also do with a test that writes some large partitions split across files. E.g. do a CTAS of lineitem partitioned by some low-NDV column. We could also maybe adjust PARQUET_FILE_SIZE to force it to split it into more files. Depending on how long it takes, might make sense to only run it in exhaustive. Line 477: Extra blank line Line 501: assert len(files) == 1 Comment why we only expect one file - it might not be that obvious to people looking at this code later. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. Patch Set 4: (2 comments) Thanks Tim for having a look. Please see PS4. http://gerrit.cloudera.org:8080/#/c/4863/1//COMMIT_MSG Commit Message: PS1, Line 7: A- > - instead of _ Done Line 12: > RE: testing - off the top of my head, I think we need: Thanks for the feedback. I added tests addressing points 1 to 3 and an explanation in the commit message. I also added different row batch sizes to the exhaustive run of insert.test. -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has uploaded a new patch set (#4). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine it took 1778 seconds, compared to 1002 seconds with the just default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 192 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/4 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has uploaded a new patch set (#2). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 163 insertions(+), 54 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/2 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Lars Volker has uploaded a new patch set (#3). Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input .. IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java M testdata/workloads/functional-query/queries/QueryTest/insert.test M tests/query_test/test_insert.py M tests/query_test/test_insert_behaviour.py 12 files changed, 163 insertions(+), 54 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/3 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA 2523: Make HdfsTableSink aware of clustered input
Tim Armstrong has posted comments on this change. Change subject: IMPALA_2523: Make HdfsTableSink aware of clustered input .. Patch Set 1: (2 comments) I didn't do a full pass but had thoughts on testing. http://gerrit.cloudera.org:8080/#/c/4863/1//COMMIT_MSG Commit Message: PS1, Line 7: A_ - instead of _ Line 12: RE: testing - off the top of my head, I think we need: * Very large inserts with partitions spanning many row batches * Inserts with a small number of rows per partition (e.g. 1, 2, 3). * Tests that check that the expected # of files are created It would also be good to add batch_size as a test dimension (if it isn't already for the insert tests) and test with some different batch sizes to hit more edge cases, e.g. 1, 16, default (1024). -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA 2523: Make HdfsTableSink aware of clustered input
Lars Volker has posted comments on this change. Change subject: IMPALA_2523: Make HdfsTableSink aware of clustered input .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/4863/1/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: Line 594: WriteClusteredRowBatch(state, batch); this case is exercised by the tests for IMPALA-2521, however we don't have tests (yet) that make sure we actually ever have one partition output file open. Any ideas how I could test this easily with our current infrastructure? -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA 2523: Make HdfsTableSink aware of clustered input
Lars Volker has uploaded a new change for review. http://gerrit.cloudera.org:8080/4863 Change subject: IMPALA_2523: Make HdfsTableSink aware of clustered input .. IMPALA_2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 --- M be/src/exec/hdfs-table-sink.cc M be/src/exec/hdfs-table-sink.h M be/src/exec/hdfs-table-writer.cc M common/thrift/DataSinks.thrift M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java M fe/src/main/java/org/apache/impala/planner/TableSink.java 9 files changed, 105 insertions(+), 48 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/1 -- To view, visit http://gerrit.cloudera.org:8080/4863 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker