[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Reviewed-on: http://gerrit.cloudera.org:8080/9693 Reviewed-by: Zoltan Borok-Nagy Tested-by: Impala Public Jenkins --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 14 files changed, 1,002 insertions(+), 113 deletions(-) Approvals: Zoltan Borok-Nagy: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 21 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 20: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 20 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 17 May 2018 20:22:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 20: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2493/ -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 20 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 17 May 2018 16:30:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 20: Code-Review+2 Fixed clang-tidy issue, adjusted stats-extrapolation test Carrying +2 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 20 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 17 May 2018 16:30:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#20). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 14 files changed, 1,002 insertions(+), 113 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/20 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 20 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 19: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2492/ -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 19 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 17 May 2018 11:37:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 19: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2492/ -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 19 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 17 May 2018 07:46:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 19: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 19 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 16 May 2018 17:01:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 19: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py@137 PS18, Line 137: alr > update comment to include that it needs to be an open file handle. Updated the comment. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 19 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 16 May 2018 16:54:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#19). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 995 insertions(+), 108 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/19 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 19 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 18: Code-Review+2 (2 comments) LGTM, had one remaining nit. http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@179 PS17, Line 179: > The Python sorted() function also has a reversed flag with the same meaning Thanks for the explanation. Lets keep it as is then. http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py@137 PS18, Line 137: file update comment to include that it needs to be an open file handle. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 18 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 16 May 2018 16:39:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 18: (4 comments) http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@71 PS17, Line 71: if column_index_offset and column_index_length: > It might make things faster / more elegant to pass an open file here, other Done http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@126 PS17, Line 126: es = page_header.data > Why not "not page_is_null"? Guess I wrote too many "is None"... Done. http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@179 PS17, Line 179: > "reverse" seems to rely on everyone thinking of ascending as the default. M The Python sorted() function also has a reversed flag with the same meaning, so maybe it's the Python convention. If you still think I should be more explicit, I'll change the name of it. We could use BoundaryOrder, and define it to UNORDERED, but we couldn't really use it because BoundaryOrder says something about two lists, not just one. I.e., testing the following in case of UNORDERED wouldn't be enough: is_sorted(min_values, UNORDERED) or is_sorted(max_values, UNORDERED) E.g. if min_values is ascending and max_values is descending the above expression would evaluate to false, but the boundary order should be UNORDERED. http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py@163 PS17, Line 163: > If you follow the suggestion in the other file to pass an open file to read Done -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 18 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 15 May 2018 16:53:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#18). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 993 insertions(+), 108 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/18 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 18 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 17: (5 comments) Only a few minor comments, otherwise looks good to me. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68 PS15, Line 68: EvalTruncation(normal_plus_max, "abcdeg", 10, UP); > I had a similar test, when I used the '127' literal. Thx. I don't feel strongly about using decimal vs hex literals. What I was aiming for was the last byte being uchar_max, and the one before being schar_max. LGTM now. http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@71 PS17, Line 71: column_index = read_serialized_object(ColumnIndex, parquet_file, It might make things faster / more elegant to pass an open file here, otherwise you're opening the same file 3 times for each column. Feel free to ignore though as this is not performance critical code. http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@126 PS17, Line 126: page_is_null is False Why not "not page_is_null"? http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@179 PS17, Line 179: reverse "reverse" seems to rely on everyone thinking of ascending as the default. Maybe rename parameter to "ascending" or "descending"? Or even pass BoundaryOrder, and implement UNORDERED as return not is_sorted(, ASCENDING) and not is_sorted(l, DESCENDING) ? http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py@163 PS17, Line 163: metadata = FileMetaData() If you follow the suggestion in the other file to pass an open file to read_serialized_object, you could use it here, too. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 17 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 14 May 2018 19:44:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 17: (19 comments) http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc@769 PS16, Line 769: if (UNLIKELY(!table_sink_mem_tracker_->TryConsume(new_memory_allocation))) { > Need to increment this after the TryConsume() succeeds, otherwise we'll rel Ouch, right. Done. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68 PS15, Line 68: EvalTruncation(normal_plus_max, "abcdeg", 10, UP); > Thanks for adding more tests. I meant the exact 3-character string consisti I had a similar test, when I used the '127' literal. I updated it to use your 3-character string. http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46 PS16, Line 46: // We convert it to unsigned because signed overflow results in undefined behavior. > I think this could use a comment and/or an explicit cast to make the intent Added comment and cast. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36 PS16, Line 36: > update Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55 PS16, Line 55: > nit: PEP8 says the closing """ should be on a new line. You can use flake8 Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90 PS16, Line 90: column_info = ColumnInfo(schema, stats, offset_index, column_index, page_headers) > I think you should be able to just pass all the parameters to the ctor "Col Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96 PS16, Line 96: e_headers) > "ColumnInfo objects" or "column infos" "column infos", done. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97 PS16, Line 97: > the first row group? Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110 PS16, Line 110: sert previous_loc.offset < current > You can zip a list with itself to create (prev, cur) pairs: Nice! Thanks, done. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124 PS16, Line 124: sert page > nit: page_is_null might be more clear Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182 PS16, Line 182: else: > Would it make sense to validate the ordering instead of re-computing it? I. Yeah, I think it's better to use a different algorithm with the same semantics. Rewrote this function. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214 PS16, Line 214: > unused? Removed it. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216 PS16, Line 216: and that th > Stale comment? Yes, removed it. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249 PS16, Line 249: else: > move closer to the comment explaining the intent Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321 PS16, Line 321: tmp > nit: end Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348 PS16, Line 348: : column = > I had missed this looking at the truncation code. Why do we keep the zeroes Yeah, I don't think we need those. Changed the logic. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31 PS16, Line 31: "" : with open(file > That does not actually seem to be a requirement of the function. Should we Removed the last two sentences from the comment and renamed the method to 'read_range'. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@39 PS16, Line 39: contain a serialized thrift object.""" > Mention "serialized_object" in the comment, or rename the parameter to buff Renamed the parameter to 'serialized_object_buffer'. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@171 PS16, Line 171: (serialized_thrift_obj > Can read_serialized_object also create the protocol and return i
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#17). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 988 insertions(+), 98 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/17 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 17 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 16: Code-Review+1 (1 comment) LGTM aside from one bug http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc@769 PS16, Line 769: page_index_memory_consumption_ += new_memory_allocation; Need to increment this after the TryConsume() succeeds, otherwise we'll release too much. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 16 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 11 May 2018 00:17:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 16: (20 comments) http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225 PS15, Line 1225: > Switched to ++i. Ah, I had overlooked row_group->columns[i] http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33 PS15, Line 33: void EvalTruncation(const string& original, const string& expected_result, > In the test it was easier to organize the code like this. Let's leave it as it is, I also don't feel strongly about it. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68 PS15, Line 68: EvalTruncation(normal_plus_max, "abcdeg" + string(4, 0), 10, UP); > Added new tess. Already had tests with 0xFF at L57-L66 Thanks for adding more tests. I meant the exact 3-character string consisting of {0, (char)0x7F, (char)0xFF} to make sure that truncating up would hit the signed char overflow. Apologies for not being more clear. http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46 PS16, Line 46: unsigned char uch = (*result)[i]; I think this could use a comment and/or an explicit cast to make the intent clear. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36 PS16, Line 36: PAGE_INDEX_TRUNCATE_LENGTH update http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55 PS16, Line 55: """ nit: PEP8 says the closing """ should be on a new line. You can use flake8 to check the whole file. It was recently added to our codebase: https://github.com/apache/impala/commit/2fb73f94b425fde488166a19f78050ddbc3d7b50 http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90 PS16, Line 90: column_info = ColumnInfo._make((schema, stats, offset_index, column_index, I think you should be able to just pass all the parameters to the ctor "ColumnInfo(schema, stats,...). Currently you wrap it in a tuple to get an iterable to pass to _make(), but that should not be necessary. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96 PS16, Line 96: nametuples "ColumnInfo objects" or "column infos" http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97 PS16, Line 97: a row group the first row group? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110 PS16, Line 110: current_loc in page_locations[1:]: You can zip a list with itself to create (prev, cur) pairs: In [6]: l = range(5) In [7]: zip(l[:-1], l[1:]) Out[7]: [(0, 1), (1, 2), (2, 3), (3, 4)] http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124 PS16, Line 124: null_page nit: page_is_null might be more clear http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182 PS16, Line 182: previous_max = None Would it make sense to validate the ordering instead of re-computing it? I.e., extract all the min_values and max_values into lists and then just all assert(sorted(min_values))? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214 PS16, Line 214: sorting_column unused? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216 PS16, Line 216: skip_col_idx Stale comment? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249 PS16, Line 249: vector.get_value('exec_option')['num_nodes'] = 1 move closer to the comment explaining the intent http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321 PS16, Line 321: ends nit: end http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348 PS16, Line 348: followed by : # zeroes. I had missed this looking at the truncation code. Why do we keep the zeroes at the end? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31 PS16, Line 31: The range must start with a serialized : thrift object. That does not actually seem to be a requirement of th
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 15: (29 comments) http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108 PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH > PAGE_INDEX_MAX_STRING_LENGTH? Changed the name http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130 PS15, Line 130: Write > nit: "Writes" Done, also fixed the previous and the next comment as well. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48 PS15, Line 48: using namespace parquet; > Can you remove this? I find omitting the parquet:: prefix confusing, and we I agree, done. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@109 PS15, Line 109: HdfsSinkTable > HdfsTableSink? Done http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@118 PS15, Line 118: table_sink_mem_tracker_->Release(page_index_memory_consumption_); > Can we do this in Close() instead? We have generally been trying to do reso Done http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312 PS15, Line 312: OffsetIndex offset_index_; > I find it useful to prefix Parquet types with parquet:: and we do that in p I'm also in favor of prefixing. Done. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@318 PS15, Line 318: MemTracker *table_sink_mem_tracker_; > nit: MemTracker* Done http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651 PS15, Line 651: location.compressed_page_size = -1; > Probably doesn't have much of an effect, but should we set the size to 0 fo Yeah, it makes more sense. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668 PS15, Line 668: location.compressed_page_size = page.header.compressed_page_size + len; > Should we add a comment for this line to explain that one includes the leng Added a comment. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752 PS15, Line 752: string min_val, max_val; > I think we usually do one declaration per line, but I don't feel strongly a Done http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754 PS15, Line 754: Status s_min = TruncateMinValue(page_stats.min_value, PAGE_INDEX_TRUNCATE_LENGTH, > Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was sma Added a comment. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761 PS15, Line 761: column_index_.null_pages.push_back(true); > Maybe DCHECK that both are false, to catch errors that result in only one b Added DCHECK. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@766 PS15, Line 766: Consume > We prefer TryConsume() and returning a MEM_LIMIT_EXCEEDED error synchronous Switched to TryConsume() and check its result. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220 PS15, Line 1220: // Currently we only support Parquet files with one row group. > Can you add a sentence why this particular code relies on this limited supp Updated the comment. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225 PS15, Line 1225: ++) > nit: ++i Switched to ++i. I need 'i' to index 'columns_' and 'row_group->columns'. The same stands for the next for-loop. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88 PS15, Line 88: if (a != a && b != b) return 0; > Can we use std::isnan()? Yes, done. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113 PS15, Line 113: Used for merging page statistics into column chunk statistics. > I think it's better to omit usage from the comment here. Instead say that i Changed the added comment to your sentence. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164 PS15, Line 164: Let's a > nit: "We assume..." so it reads less like a suggestions. Done http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165 PS15, Line 165: If not all values are equal > What happens for NaN, i.e. when all values are NaN? In ColumnStats::Merge(), MinMaxTrait::Compare() will compare the NaNs, and the cu
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#16). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 999 insertions(+), 98 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/16 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 16 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 15: (25 comments) Still need to do a pass over the tests. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108 PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH PAGE_INDEX_MAX_STRING_LENGTH? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130 PS15, Line 130: Write nit: "Writes" http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48 PS15, Line 48: using namespace parquet; Can you remove this? I find omitting the parquet:: prefix confusing, and we often explicitly specify it below. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312 PS15, Line 312: OffsetIndex offset_index_; I find it useful to prefix Parquet types with parquet:: and we do that in parts of this file already. Can you add it in places where it's missing and remove the "using" above? Let me know if you feel we shouldn't prefix them. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651 PS15, Line 651: location.compressed_page_size = -1; Probably doesn't have much of an effect, but should we set the size to 0 for an empty page? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668 PS15, Line 668: location.compressed_page_size = page.header.compressed_page_size + len; Should we add a comment for this line to explain that one includes the length and the other doesn't, to make the intent explicit? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752 PS15, Line 752: string min_val, max_val; I think we usually do one declaration per line, but I don't feel strongly about it. It also might be safer to enclose this block in a nested scope to prevent future editors from re-using min_val and max_val after they have been moved below. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754 PS15, Line 754: Status s_min = TruncateMinValue(page_stats.min_value, PAGE_INDEX_TRUNCATE_LENGTH, Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was small enough? If so, we should add a check that it isn't, or at least add a comment to the definition of PAGE_INDEX_TRUNCATE_LENGTH. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761 PS15, Line 761: column_index_.null_pages.push_back(true); Maybe DCHECK that both are false, to catch errors that result in only one being set? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220 PS15, Line 1220: // Currently we only support Parquet files with one row group. Can you add a sentence why this particular code relies on this limited support? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225 PS15, Line 1225: ++) nit: ++i Or use a range based loop like you do in L1254? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88 PS15, Line 88: if (a != a && b != b) return 0; Can we use std::isnan()? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113 PS15, Line 113: Used for merging page statistics into column chunk statistics. I think it's better to omit usage from the comment here. Instead say that it maintains internal state that tracks whether the values are ordered. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164 PS15, Line 164: Let's a nit: "We assume..." so it reads less like a suggestions. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165 PS15, Line 165: If not all values are equal What happens for NaN, i.e. when all values are NaN? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@54 PS15, Line 54: if (MinMaxTrait::Compare(prev_page_max_value_, cs->max_value_) > 0 || I cannot convince myself that prev_page_max_value_ has to be initialized here. Is it? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@66 PS15, Line 66: Update(cs->min_value_, cs->max_value_); If you do the order checks after this line,
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 15: (5 comments) Thanks for your patience. I'm read to +2 after a couple of tweaks to the MemTracker stuff. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@109 PS15, Line 109: HdfsSinkTable HdfsTableSink? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@118 PS15, Line 118: table_sink_mem_tracker_->Release(page_index_memory_consumption_); Can we do this in Close() instead? We have generally been trying to do resource releasing in Close() so that the sequencing of resource release is more explicit (goes against RAII but there are pros and cons of that). http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@318 PS15, Line 318: MemTracker *table_sink_mem_tracker_; nit: MemTracker* http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@766 PS15, Line 766: Consume We prefer TryConsume() and returning a MEM_LIMIT_EXCEEDED error synchronously so that we don't depend on the memory limit being enforced asynchronously. HdfsTableSink is a bad offender in terms of using Consume(), which we need to fix at some point. http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql File testdata/bin/load-dependent-tables.sql: http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77 PS13, Line 77: (cs CHAR(5), cl CHAR(140), vc VARCHAR(32)) > My bad, I thought if I run a complete data load after formatting, it will e I ended up running these statements manually via copy-and-paste into beeline. I think this should run during pre-commit if we do a full data load. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 15 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 08 May 2018 22:14:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 13: (15 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195 PS2, Line 1195: // Populate RowGroup::sorting_columns with all columns specified by the Frontend. > I didn't follow up on this. It looks like the code only ever writes one row Unfortunately it wouldn't work, because based on the description of the Parquet page indexes (https://github.com/apache/parquet-format/blob/master/PageIndex.md) we should write all the page indexes of all the row groups together, just before the footer. In other words, we have to flush all the row groups, then we can write all the page indexes. Currently Impala writes only one row group per Parquet file, so I just add a DCHECK here. If Impala's behavior changes in the future in this regard, we'll need to update the write path of the page indexes. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107 PS13, Line 107: page_index_mem_tracker_(-1, "Parquet Page Index", > We can have many HdfsTableWriters per HdfsTableSink if we're doing non-clus I track directly against HdfsTableSink. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212 PS13, Line 1212: if (file_metadata_.row_groups.size() != 1) return Status::OK(); > This Answered it at PS2. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214 PS13, Line 1214: parquet::RowGroup* row_group = &(file_metadata_.row_groups[0]); > Can't we use current_row_group_ here? We can't, detailed answer is at PS2. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216 PS13, Line 1216: for (int i = 0; i < columns_.size(); i++) { > I injected an error here by forgetting to write out one of the pages and th Thanks for checking. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241 PS13, Line 1241: row_group->columns[i].__set_offset_index_offset(file_pos_); > I added some errors to the offsets and lengths - looks like the tests caugh Thanks http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31 PS13, Line 31: ascending_boundary_order_ = true; > I tried injecting a bug by initializing these to "false". The tests didn't Currently we don't invoke Reset() on the ColumnStatsBase object where it would matter. In HdfsParquetTableWriter::BaseColumnWriter there are two ColumnStatsBase members: page_stats_base_ and row_group_stats_base_. For each new page, we invoke Reset() on page_stats_base_, but that doesn't corrupt the ordering, because we only consider the state of row_group_stats_base_ for that, and we never call Reset() on it. Even in BaseColumnWriter::Reset(), we throw away the old row_group_stats_base_ and create a brand new one. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56 PS13, Line 56: false > I tried injecting a bug by flipping the value of this and the tests caught Thanks! http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60 PS13, Line 60: if (MinMaxTrait::Compare(prev_page_max_value_, cs->max_value_) < 0 || > I tried injecting a bug by removing the first condition and the tests caugh Thanks! http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68 PS13, Line 68: max_value_ > I tried switching this to cs->min_value. The tests caught it. Thanks! http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69 PS13, Line 69: prev_page_min_buffer_.Clear(); > I tried removing this line. The tests didn't catch it, even under ASAN. Turned out that the _validate_ordering function was too loose in test_parquet_page_index.py. We could remove both prev_page_min_buffer_ and prev_page_max_buffer_ without noticing it. Now _validate_ordering implements the same logic from parquet-column-stats to determine the expected ordering. Interestingly the tests only catch the case when we remove prev_page_max_buffer_.Clear(). The lack of prev_page_min_buffer.Clear() remained unnoticed. I tried to run the tests on a couple of existing tables, but without any luck. The reason is a bit subtle, basically prev_page_min_value_.ptr is "lucky" enough to point at memory addr
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#15). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 923 insertions(+), 48 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/15 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 15 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#14). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 923 insertions(+), 48 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/14 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 13: (15 comments) I spent a bit of time trying to break the code by injecting bugs and it looks like the test coverage is pretty good - there were a couple of things it didn't catch though that might be worth investigating a little. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195 PS2, Line 1195: // Populate RowGroup::sorting_columns with all columns specified by the Frontend. > AFAIK we always write only one row group. I just copy-pasted this part from I didn't follow up on this. It looks like the code only ever writes one row group but it's somewhat generalized to handle multiple row groups. I think people might have used this in the past to generate custom files. If we ever wrote out multiple row groups, we'd have to test it properly. In the meantime I think we could remove this if and just use current_row_group_ below instead of hardcoding [0] and it would all work ok. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107 PS13, Line 107: page_index_mem_tracker_(-1, "Parquet Page Index", We can have many HdfsTableWriters per HdfsTableSink if we're doing non-cluster inserts. As a rule of thumb, I think we should have O(1) MemTrackers per ExecNode or DataSink - O(n) or greater could cause problems. Otherwise the MemTracker dumps could get huge and cause problems if we want to serialize them into a string. We could have one "Parquet Page Indices" MemTracker per HdfsTableSink or we could just track directly against the HdfsTableSink MemTracker. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212 PS13, Line 1212: if (file_metadata_.row_groups.size() != 1) return Status::OK(); This http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214 PS13, Line 1214: parquet::RowGroup* row_group = &(file_metadata_.row_groups[0]); Can't we use current_row_group_ here? http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216 PS13, Line 1216: for (int i = 0; i < columns_.size(); i++) { I injected an error here by forgetting to write out one of the pages and the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241 PS13, Line 1241: row_group->columns[i].__set_offset_index_offset(file_pos_); I added some errors to the offsets and lengths - looks like the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31 PS13, Line 31: ascending_boundary_order_ = true; I tried injecting a bug by initializing these to "false". The tests didn't catch it. Would be good to add the coverage but this seems less bad of a coverage gap than the opposite (claiming an order that is not valid). http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56 PS13, Line 56: false I tried injecting a bug by flipping the value of this and the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60 PS13, Line 60: if (MinMaxTrait::Compare(prev_page_max_value_, cs->max_value_) < 0 || I tried injecting a bug by removing the first condition and the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68 PS13, Line 68: max_value_ I tried switching this to cs->min_value. The tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69 PS13, Line 69: prev_page_min_buffer_.Clear(); I tried removing this line. The tests didn't catch it, even under ASAN. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc@29 PS13, Line 29: int nit: use static_cast<> http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql File testdata/bin/load-dependent-tables.sql: http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77 PS13, Line 77: WITH DEFERRED REBUILD IN TABLE hive_index_tbl missing trailing semicolon here and on some of the "DROP TABLE" statements below -I tried to run this file locally and it failed. http://gerrit.cloudera.org:8080/#/c/9693/13/te
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 13: (5 comments) Thanks, Csaba! Yes, we have tests for the edge cases in string-util-test.cc. And also added a new python test function 'test_max_string_values' http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@87 PS12, Line 87: // Base class for column writers. This contains most of the logic except for > If different value is used for page and column stats, then the constants sh Moved it to HdfsParquetTableWriter and renamed it to PAGE_INDEX_TRUNCATE_LENGTH. http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@749 PS12, Line 749: Status s_max = TruncateMaxValue(page_stats.max_value, PAGE_INDEX_TRUNCATE_LENGTH, : &max_val); : if (!s_min.ok() || !s_max.ok()) valid_page_index_ = false; : column_index_. > I can't find TruncateMinValue and TruncateMaxValue in code. Sorry, I've uploaded it now. http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@1221 PS12, Line 1221: // We always set null_counts. > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@171 PS12, Line 171: if (isinstance(page_max_value, basestring) and > The != should be < - what happens if column_max_value (e.g. "bb") is shorte Oops, you are absolutely right. I use the truncate length now. http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@172 PS12, Line 172: len(page_max_value) == PAGE_INDEX_TRUNCATE_LENGTH): > This may not be true if the last two characters contain the max char value. Fixed that as well. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 25 Apr 2018 10:06:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#13). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt A be/src/util/string-util-test.cc A be/src/util/string-util.cc A be/src/util/string-util.h M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 13 files changed, 864 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/13 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 12: (5 comments) Do we have tests for the edge cases? For example a table with columns of 63/64/65 length strings. http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@87 PS12, Line 87: const int MAX_STAT_VALUE_LENGTH = 64; If different value is used for page and column stats, then the constants should be placed close to each other and their name should reflect their role. http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@749 PS12, Line 749: Status s_min = TruncateMinValue(page_stats.min_value, MAX_STAT_VALUE_LENGTH, : &min_val); : Status s_max = TruncateMaxValue(page_stats.max_value, MAX_STAT_VALUE_LENGTH, : &max_val); I can't find TruncateMinValue and TruncateMaxValue in code. http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@1221 PS12, Line 1221: column.column_index_.__set_boundary_order(column.row_group_stats_base_->GetBoundaryOrder()); nit: long line http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@171 PS12, Line 171: len(page_max_value) != len(column_max_value)): The != should be < - what happens if column_max_value (e.g. "bb") is shorter then page_max_value (e.g "")? I would also consider using the exact max length used for page stats, to make the tests as strict as possible. http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@172 PS12, Line 172: assert page_max_value[:-1] <= column_max_value[:len(page_max_value) - 1] This may not be true if the last two characters contain the max char value. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 24 Apr 2018 10:42:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 10: (9 comments) http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector min_values_; > Options 1 or 2 seem fine to me. Chose option 1, but also added a memtracker. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735 PS10, Line 735: min_values_.push_back(std::string("")); > I don't know if we need the call to std::string() here, I think it should w Right. Now the code looks a bit different, and I call push_back after the 'if'. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227 PS10, Line 1227: for (auto& column : columns_) { > nit: can fit loop on one line. Done http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159 PS10, Line 159: // If true, min/max values are ascending. > Maybe briefly mention why they both start off true? And both can be true at Added some explanation here and to GetBoundaryOrder(). http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37 PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite): > We've got a lot of good coverage in this test. I agree. It would be best to have a fully-fledged parquet reader implementation in Python that we could easily use. I only found https://pypi.org/project/parquet/ but it seems uncomplete. Or, maybe we can hard-code the expected values for a concrete parquet file. http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177 PS10, Line 177: previouse_value > typo in variable name Done http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205 PS10, Line 205: falied > nit: failed Done http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205 PS10, Line 205: column_info_schema > this variable isn't defined - did you mean column_info? Done http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244 PS10, Line 244: chars_formats > chars_formats is weird in that it's created by a different test - TestChars I moved table creation to 'testdata/bin/load-dependent-tables.sql'. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 23 Apr 2018 15:50:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#12). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/util/CMakeLists.txt M common/thrift/parquet.thrift M testdata/bin/load-dependent-tables.sql M tests/query_test/test_chars.py A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 10 files changed, 641 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/12 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector min_values_; > Could we not just track the size of page_stats.min_value in a memtracker in Options 1 or 2 seem fine to me. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 19 Apr 2018 16:13:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector min_values_; > I had a discussion with the Parquet guys who implement the page indexes for Could we not just track the size of page_stats.min_value in a memtracker in FinalizeCurrentPage when adding elements? Then we can free it after flushing out the row group. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 16 Apr 2018 16:34:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector min_values_; > I'm still concerned about the amount of untracked memory from min_values_ a I had a discussion with the Parquet guys who implement the page indexes for parquet-mr and they plan to truncate the min/max values at 64 bytes. It seems reasonable because it is unlikely that users want to filter against very long string data. Maybe we could also use 64 bytes by default and let the users override it with an advanced query option. (specifying different sizes for different columns would make more sense, but I have no idea how to do that) If we go with this default, do you still think it will end up using too much untracked memory? As you wrote, if we start using StringValue/StringBuffer, we will need to convert min/max values to vectors of strings when we write out the column index, which is a lot of unnecessary copying and a short temporary peak of untracked memory. Or, to avoid the untracked memory peak, we can implement our own ColumnIndex class that uses StringBuffers and has a 'write(TProtocol *prot)' method. I think we can't avoid copying everything, because the TProtocol class only accepts std::strings for writing binary data. To summarize, I can only see 3 options, all with some drawbacks: * use std::strings pro: code is straight-forward pro: no copying con: lot of untracked memory * use StringBuffers pro: avoids long-living untracked memory con: lots of copying con: short-living untracked memory * implement our own ColumnIndex class pro: avoids untracked memory con: lots of copying con: need to maintain that class Am I missing something? Which option do you prefer? -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 16 Apr 2018 13:48:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 10: (9 comments) Overall this is looking good. I had some specific concerns about some of the nitty-gritty details. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector min_values_; I'm still concerned about the amount of untracked memory from min_values_ and max_values_, even if we truncate the string values to 1KB or similar - it seems like could end up with multiple MB of untracked memory. We could probably live with it since it's smaller than the actual data, but it's a step in the wrong direction. Maybe we could store min_values_ and max_values_ as StringValues backed by memory per_file_mem_pool_ and then only convert to strings when writing out each column to the page index? http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735 PS10, Line 735: min_values_.push_back(std::string("")); I don't know if we need the call to std::string() here, I think it should work if we just emplace_back() to instantiate an empty string. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227 PS10, Line 1227: for (auto& column : columns_) { nit: can fit loop on one line. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159 PS10, Line 159: // If true, min/max values are ascending. Maybe briefly mention why they both start off true? And both can be true at the same time? It's slightly subtle. http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37 PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite): We've got a lot of good coverage in this test. I'm wondering if we're missing some basic tests that confirm that the values in the page match the min/max values in the page index. It seems like these validations might not catch some kinds of bugs. E.g. min/max values in the index are somehow out-of-sync with the pages. Most bugs that I can imagine would get caught by one validation or another but it would be nice to have a sanity test where we confirm that the values in each page match the values in the page index. http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177 PS10, Line 177: previouse_value typo in variable name http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205 PS10, Line 205: falied nit: failed http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205 PS10, Line 205: column_info_schema this variable isn't defined - did you mean column_info? http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244 PS10, Line 244: chars_formats chars_formats is weird in that it's created by a different test - TestCharsFormats. I.e. it's not present unless that test ran before this one. Maybe we should change it so that the table is loaded during normal data loading? -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 12 Apr 2018 00:23:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > About the formula for prefix size: what would happen, if there is no common The formula was only a conversation-starter, but the situation is actually worse. On yesterday's Parquet sync meeting it turned out that it is a bad idea to have different limits for each page, here's why: Let's assume we have the following pages: Page 0: 12345612 12345613 12345614 min_value: 12345612 max_value: 12345614 Page 1: 12345615 23456789 min_value: 123 max_value: 235 Note that the values are in ascending order, but Page 0's min value is greater than Page 1's min_value. I'll investigate the second part of your comment. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: : > I was thinking about catching the exception in _validate_parquet_page_index I added the column name. I might add the filename as well in a later commit. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 30 Mar 2018 00:05:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#9). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 7 files changed, 570 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/9 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 8: (2 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > About leaving max_value unset: we cannot really do that, because 'max_value About the formula for prefix size: what would happen, if there is no common prefix? If num_values is small (like 64 in the plain encoded 1k length case), the result would be 0. About the untracked memory usage: the the min/max values could be collected by page_stats_base_, which knows the type of the column, so it wouldn't need to store fixed length data as string. When a data page is over, it could write its min/max value to a buffer. The min/max values would still had to be converted to string for thrift serialization, but columns could do this one-by-one, so their peak heap usage wouldn't be added together. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: if not null_page: : page_max_value = decod > I added the column name to this specific error message. I was thinking about catching the exception in _validate_parquet_page_index, adding the column name to its message or printing it, and then re-throwing the AssertionError. The same could be done for page index by using an iterator that is held in a member / argument, so its state is still intact in the caller function when the exception is caught, and using it in every loop to iterate over pages. The iterator stuff does probably not worth the effort, but knowing the name of the offending column could be really helpful if the test catches an error. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 29 Mar 2018 19:44:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: for null_page in null_pages: : assert null_page > It would be nice to always add the column name and the page number (if appl I added the column name to this specific error message. Adding that information to all the asserts would need a lot of code modifications and I'm not sure that it would worth the effort. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 28 Mar 2018 18:06:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#8). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 7 files changed, 566 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/8 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131 PS6, Line 131: > You meant validate_page_locations(), right? No, I meant the whole body of the for loop, but I am satisfied with the new structure, the code looks easier to read to me. :) http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: if not null_page: : page_max_value = decod > OK, used enumerate, and added a message to the assert. It would be nice to always add the column name and the page number (if applicable) to the error message, but I do not know how do to do that without a lot of extra code. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 28 Mar 2018 15:53:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 6: (10 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > If all strings in a page have the same 1KB prefix, then performance and spa About leaving max_value unset: we cannot really do that, because 'max_values' is a required list of binary data. We can only set it to empty string, but an empty string can also be a valid value. And an empty string is less than all strings, so not really a good choice. About determining the prefix size: why don't we calculate it dynamically based on the actual data and with some heuristic? E.g for max_value.: min_element( length(max_value), length(common_prefix(min_value, max_value)) * 1.5 + num_values / 100, 4 kB ) http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc@727 PS6, Line 727: // max_values_ and mark the values as valid. In case min and max values are not > It is a bit misleading to by talking about marking the min/max values valid Yeah, I agree that the comment is misleading, so I updated it. Currently the write path logic is reversed however. We mark the page as null if it doesn't have min and max values, and we mark it as not-null if it has min and max values as well. http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h@54 PS6, Line 54: if (MinMaxTrait::Compare(prev_page_max_value_, cs->max_value_) == 1 || > I know it is not Java but it might be readable to use a java-like contract I changed the ifs to use < and > against 0 to allow some flexibility for the Compare() implementations. However, for now I leave the definition of Compare as it is, because I can't use a - b for string values, therefore I'd need to write one more template specialization for strings. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131 PS6, Line 131: for column_info in columns: > The next block is a good candidate to move to a separate function. You meant validate_page_locations(), right? http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134 PS6, Line 134: previous_loc = column_info.offset_index.page_locations[0] : for current_loc in column_info.offset_index.page_locations[1:]: : assert previous_loc.offset < current_loc.offset : assert previous_loc.first_row_index < current_loc.first_row_index : previous_loc = current_loc > I would move this to a separate function like "validate_page_locations". Done http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139 PS6, Line 139: null_pages = column_info.column_index.null_pages > The next empty line could be moved before this line to group it together wi I refactored a lot, this comment doesn't really apply anymore. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157 PS6, Line 157: if column_info.stats: > This is more subjective than the comment in line 131, but I would move the I created validate functions for the members of ColumnIndex and OffsetIndex. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: for null_page in null_pages: : assert null_page > Python has any() and all(). However, it might actually be better to use enu OK, used enumerate, and added a message to the assert. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166 PS6, Line 166: return > This "return" should be "continue", but it would be even better to move to Right, now it's in a different function, called _validate_min_max_values http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169 PS6, Line 169: for null_page, value in zip(null_pages, min_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value >= min_value
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#7). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 7 files changed, 566 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/7 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > About long strings: I think that it would be safer to start with a smaller If all strings in a page have the same 1KB prefix, then performance and space consumption is still likely to be dominated by those. If we want to cap at a small value, we will face this for much less extreme workloads and not writing column stats becomes more of an issue. If you want to dive into this, please do comprehensive experiments to determine a good size. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: for null_page in null_pages: : assert null_page > This could be "assert False not in null_pages". Creating a variable to stor Python has any() and all(). However, it might actually be better to use enumerate() and print the index if a page is not null to make debugging easier. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 27 Mar 2018 17:49:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131 PS6, Line 131: for column_info in columns: The next block is a good candidate to move to a separate function. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134 PS6, Line 134: previous_loc = column_info.offset_index.page_locations[0] : for current_loc in column_info.offset_index.page_locations[1:]: : assert previous_loc.offset < current_loc.offset : assert previous_loc.first_row_index < current_loc.first_row_index : previous_loc = current_loc I would move this to a separate function like "validate_page_locations". http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139 PS6, Line 139: null_pages = column_info.column_index.null_pages The next empty line could be moved before this line to group it together with the related asserts. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157 PS6, Line 157: if column_info.stats: This is more subjective than the comment in line 131, but I would move the next block to a separate function like "validate_index_stats_consistency". http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: for null_page in null_pages: : assert null_page This could be "assert False not in null_pages". Creating a variable to store whether all pages are null could be useful to check that if all pages are null, the min and max stats should not be set. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166 PS6, Line 166: return This "return" should be "continue", but it would be even better to move to implement the suggestion in line 131 and keep this as a "return". http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169 PS6, Line 169: for null_page, value in zip(null_pages, min_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value >= min_value : : max_value = decode_stats_value(column_info.schema, max_value_str) : for null_page, value in zip(null_pages, max_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value <= max_value These two for loops could be merged. I would also prefer more exact variable names, like min/max_value_in_page instead of current_value, min/max_value in column_chunk instead of min/max_value. Another test ideas: - the sum of page null counts should be equal to the total null count - in most cases the minimum/maximum among the page min/max values should be equal to the column chunk's min/max value. As all the tests are checking tables created by Impala, we can be stricter/more specific than the Praquet spec. Special columns where this is not true could be listed, or their expected values could be set in a map to check the logic behind these values. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 27 Mar 2018 16:15:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Anonymous Coward #248 has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 6: (2 comments) I have 2 minor findings only. Otherwise, LGTM. (I'm not an Impala nor a C/C++ expert, though.) http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc@727 PS6, Line 727: // max_values_ and mark the values as valid. In case min and max values are not It is a bit misleading to by talking about marking the min/max values valid/invalid. It is about the null pages which case the min/max values do not make sense so we store empty strings instead. http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h@54 PS6, Line 54: if (MinMaxTrait::Compare(prev_page_max_value_, cs->max_value_) == 1 || I know it is not Java but it might be readable to use a java-like contract for compare. It says compare(a, b) < 0 if a < b etc. It also might have performance benefit in some cases if you simply return a - b instead of if-else. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 27 Mar 2018 12:32:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 5: (1 comment) Thanks, I fixed a few other comments as well. http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h@162 PS5, Line 162: i > nit: Capital I Done -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 27 Mar 2018 00:41:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#6). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 7 files changed, 547 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/6 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h@162 PS5, Line 162: nit: Capital I -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 26 Mar 2018 19:17:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 5: (16 comments) Thanks Csaba! (and I also updated my vimrc... :) ) http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@159 PS4, Line 159: // If true, min/max values are ascending. > nit: missing . Done http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@162 PS4, Line 162: // if true min/max values are descending. > nit: missing . Done http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h@207 PS4, Line 207: if (prev_page_min_buffer_.IsEmpty()) { : RETURN_IF_ERROR(CopyToBuffer(&prev_page_min_buffer_, &prev_page_min_value_)); > nit: long lines Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@71 PS4, Line 71: column_index_length) > nit: not enough indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@82 PS4, Line 82: offset_index_length) > nit: not enough indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@85 PS4, Line 85: page_header = get_page_header(parquet_file, page_loc.offset, : page_loc.compressed_page_size) : page_headers.append(page_header) > nit: too much indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@113 PS4, Line 113: for null_page, value in zip(null_pages, values): > Are these prints intentional? No, jus left them here. http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@118 PS4, Line 118: elif ordering == BoundaryOrder.DESCENDING and previouse_value is not None: > Is this print intentional? Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@120 PS4, Line 120: previous_value = current_value > nit: too much indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@122 PS4, Line 122: > nit: too much indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@164 PS4, Line 164: assert null_page : # Everything is None, no further checks needed. : return : : min_value = decode_stats_value(column_info.schema, min_value_str) : for null_p > nit: too much indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@200 PS4, Line 200: ).format(qualified_table_name, sorting_column, source_table) > nit: not enough indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@203 PS4, Line 203: self._validate_parquet_index(hdfs_path, sorting_column, tmpdir) > nit: not enough indentation Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@244 PS4, Line 244: > nit: missing . Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@249 PS4, Line 249: self._ctas_table_and_verify_index(vector, unique_database, > nit: missing . Done http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py@169 PS4, Line 169: def get_column_index(filename, file_pos, length): : """Reads a ColumnIndex object from a file at the given position.""" : serialized_column_index = read_serialized_object(filename, file_pos, length) : protocol = create_protocol(serialized_column_index) : column_index = ColumnIndex() : column_index.read(protocol) : return column_index : : : def get_page_header(filename, file_pos, length): : """Reads a PageHeader object from a file at the given position.""" : serialized_page_header = read_serialized_object(filename, file_pos, length) : protocol = create_protocol(serialized_page_header) : page_header = PageHeader() : page_header.read(protocol)
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Tim Armstrong, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#5). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 7 files changed, 547 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/5 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 4: (17 comments) I have added a lot of nits, feel free to skip them for now if you think that it is too early to deal with such things at the moment. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > OK, I'll implement this part after that. About long strings: I think that it would be safer to start with a smaller max size than 1KB, for example 64, to avoid making inserts slower / the written files larger. Truncating long string can only make performance worse if a query contains =/ comparison with a string longer than the truncated length, which seems a non-typical use case to me. http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@159 PS4, Line 159: // If true, min/max values are ascending nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@162 PS4, Line 162: // if true min/max values are descending nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h@207 PS4, Line 207: if (prev_page_min_buffer_.IsEmpty()) RETURN_IF_ERROR(CopyToBuffer(&prev_page_min_buffer_, &prev_page_min_value_)); : if (prev_page_max_buffer_.IsEmpty()) RETURN_IF_ERROR(CopyToBuffer(&prev_page_max_buffer_, &prev_page_max_value_)); nit: long lines http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@71 PS4, Line 71: column_index_length) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@82 PS4, Line 82: offset_index_length) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@85 PS4, Line 85: page_header = get_page_header(parquet_file, page_loc.offset, : page_loc.compressed_page_size) : page_headers.append(page_header) nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@113 PS4, Line 113: print schema Are these prints intentional? http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@118 PS4, Line 118: print current_value Is this print intentional? http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@120 PS4, Line 120: assert current_value >= previous_value nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@122 PS4, Line 122: assert current_value <= previous_value nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@164 PS4, Line 164: # If either is None, then both need to be None. : assert min_value_str is None and max_value_str is None : for null_page in null_pages: : assert null_page : # Everything is None, no further checks needed. : return nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@200 PS4, Line 200: qualified_table_name, source_table) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@203 PS4, Line 203: ).format(qualified_table_name, sorting_column, source_table) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@244 PS4, Line 244: """Test table with wide row""" nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@249 PS4, Line 249: """Test tables with wide rows and many columns""" nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py@169 PS4, Line 169: def get_column_index(filename, file_pos, length): : """Reads a
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#4). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py M tests/util/get_parquet_metadata.py 7 files changed, 546 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/4 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 2: (20 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > Just saw Tim's comment. I think we discussed this a while ago in the Parque OK, I'll implement this part after that. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ > Sry for the confusion. The spec says that this must be set if and only if a Oh, sure. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195 PS2, Line 1195: // If the file contains more than one row_group, don't write the index. > I missed this earlier - this seems weird. Do we ever write more than one ro AFAIK we always write only one row group. I just copy-pasted this part from Pooja's work. Should I substitute it to a DCHECK maybe? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < > That sounds good to me. Great, I implemented it that way. http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211 PS3, Line 1211: // Let's see if there's an ordering on min/max values. > After moving column.min_values_ above it is invalid here. It also leaves th We are using columns_ in the next section to write the offset index, so can't reset them here, but at the end of the function. I only call BaseColumnWriter::Reset() on them, because unique_ptr::reset generates segfaults later. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55 PS3, Line 55: > We should assert that the schema is flat, e.g. by comparing lengths of row_ Added an assert. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63 PS3, Line 63: > You should not need "is not None" here and elsewhere. The parentheses also I removed it from couple of places. Kept them in asserts, because I think it is more readable that way. Also kept for "if max_value is not None" and such, because max_value can be set to 0. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84 PS3, Line 84: > This looks complicated enough to warrant a namedtuple, maybe even a class. I chose namedtuple http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87 PS3, Line 87: > ...row_groups... (plural) Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90 PS3, Line 90: > this comment seems wrong Removed the last sentence. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103 PS3, Line 103: > nit: All comments should follow PEP8 class comment format. Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118 PS3, Line 118: > long line Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122 PS3, Line 122: > It seems that this method is only called once, and without skip_col_idxs. R Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147 PS3, Line 147: > Impala currently does not write these, add an assertion Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164 PS3, Line 164: > What happens if it is None? Added some extra asserts about it. If either stats.min_value or stats.max_value is None, the other must be None as well. And all pages need to be null. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165 PS3, Line 165: > Don't overwrite the variable Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@205 PS3, Line 205: > copy-paste error? yup, done. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@230 PS3, Line 230: > We also should have a test with a file that has 4K of 0xFF in it, just to m Do you know a quick way of generate a parquet file like that? I added the other tests. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py File tests/util/get_parquet_index.py: PS3: > Maybe we should merge this file with get_parquet_metadata.py (and clean it Moved the contents to get_parquet_metadata. http://gerrit.cl
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > I think that null_pages should also have covered non-null pages with invali Just saw Tim's comment. I think we discussed this a while ago in the Parquet sync and people were opposed to amending the standard for this case. I might be wrong though, let's check in the next meeting. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 20 Mar 2018 23:03:17 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 3: (19 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > Tim: I think that null_pages should also have covered non-null pages with invalid stats, but the current spec doesn't seem to allow that. In light of that I see two options: 1) Store more than 4KB of the pathological value until a non-0xFF character. I think we don't allow unlimited Parquet footer sizes, but I cannot find the code that enforces this right now. 2) Not store a ColumnIndex for that column at all. I am in favor of 2). This case seems highly hypothetical to me and we really just need to deal with it in a standard-conforming way. I don't think we need to care about the actual performance in this particular scenario. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ > I'm not sure if I've found the relevant comment. Sry for the confusion. The spec says that this must be set if and only if all values are null. Can you add a DCHECK to make sure that the null_count is equal to the number of values in the page? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: n > I just realized that the min/max values are only strings at this point, at That sounds good to me. http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211 PS3, Line 1211: for (int j = 0; j < column.min_values_.size() - 1; ++j) { After moving column.min_values_ above it is invalid here. It also leaves the whole BaseColumnWriter in an inconsistent state and we risk to introduce errors in the future. To improve this we could wrap the moves in a sub-scope that starts with dereferencing columns_[i] and ends with the move. We should then reset the column in columns_ and leave the array with NULL elements. That will require to make it very obvious that this happens through the name and comment of this function. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55 PS3, Line 55: # We only support flat schemas, the additional element is the root element. We should assert that the schema is flat, e.g. by comparing lengths of row_group.columns and schemas. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63 PS3, Line 63: is not None You should not need "is not None" here and elsewhere. The parentheses also shouldn't be necessary. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84 PS3, Line 84: row_group_index.append((schema, stats, offset_index, column_index, page_headers)) This looks complicated enough to warrant a namedtuple, maybe even a class. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87 PS3, Line 87: get_row_group ...row_groups... (plural) http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90 PS3, Line 90: containing row groups and filename this comment seems wrong http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103 PS3, Line 103: nit: All comments should follow PEP8 class comment format. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118 PS3, Line 118: def _validate_parquet_index(self, hdfs_path, sorting_column, tmpdir, skip_col_idxs = None): long line http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122 PS3, Line 122: skip_col_idxs = skip_col_idxs or [] It seems that this method is only called once, and without skip_col_idxs. Remove the parameter? http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147 PS3, Line 147: num_values = page_header.data_page_header_v2.num_values Impala currently does not write these, add an assertion http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164 PS3, Line 164: if max_value is not None: What happens if it is None? http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165 PS3, Line 165: max_value = decode_stats_value(schema, max_value) Don't overwrite the variable http://gerr
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > Tim: Yeah, maybe we should combine the code into a shared utility function, since I think the core logic is the same. IMO the sensible thing is to leave the max unset in that case. Hopefully we can amend the spec to allow that. I think technically now our choices are to drop the entire column index or write out a huge value. Dropping the index wouldn't be a total disaster, since this should be a rare edge case, but it is kind of crappy. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195 PS2, Line 1195: // If the file contains more than one row_group, don't write the index. I missed this earlier - this seems weird. Do we ever write more than one row group? Do we test it? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: n > I just realized that the min/max values are only strings at this point, at I think I like the idea of determining asc/desc incrementally instead of doing an extra pass. Doing it in Merge() could make sense, but I think we would have to clearly document what that means, since Merge() would no longer be a commutative operation. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 20 Mar 2018 22:21:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 2: (12 comments) Thanks for the comments! I fixed what I could, but also have some further questions. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > The spec for page indexes actually allows to store min/max values that do n Tim: StringMinMaxFilter::MaterializeValues(): thanks for the tip, I'll use something similar. Lars/Tim: So, the cap should be 4kb? About the case when the string starts with 4kb of 0xFF: It seems to me that the behavior in this case should be decided by the Parquet community. Anyway, what if we don't set the max_value in this case? Looking at HdfsParquetScanner::EvaluateStatsConjuncts() that should do the job. However, now the ordering of pages gets complicated. We need "no max_value and not null pages" at the end for ascending order. Null pages also don't have max_value, but they can be anywhere in the list of pages. However, based on the comment in https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L797 , we cannot do that with the current specification. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ > See my comment above, this should only be true for null pages. I'm not sure if I've found the relevant comment. There is a comment that says the spec allows to set min and max values that are not present in the column. But it is only for compaction, right? If there are no values in the column, does it make sense to set min and max values? Yeah, the spec seems to allow that, but if Impala's implementation doesn't populate min and max values only for null pages, maybe we can use this information, can't we? Or, I can of course smarten up the parquet ColumnStats class. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199 PS2, Line 1199: uint8_t* buffer; > declare in the most nested scope possible. Done http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207 PS2, Line 1207: column_index.__set_null_pages(column.null_pages_); > Can we move() these to avoid a copy? Yes, done. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < > Ideally the comparators would be implemented only once and we would call th I just realized that the min/max values are only strings at this point, at it is kind of cumbersome to retrieve the type information here. At L746 we call Merge() on row_group_stats_base_ to merge the page statistics into the column chunk statistics. Maybe Merge() could keep track of the ordering of the merged pages and then we could just call "GetBoundaryOrder()" on row_group_stats_base_. What do you think? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254 PS2, Line 1254: > nit: extra blank line Done http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift File common/thrift/parquet.thrift: PS2: > Is this file now identical with the upstream parquet.thrift? No, I only added the changes of PARQUET-922. diff says there's quite a lot differences between the two files. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683 PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite): > I think this should live in its own file. It also needs a class comment. Done http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734 PS2, Line 734: tmp_dir = make_tmp_dir() > This should use the tmpdir fixture from pytest (see other tests in this fil Done http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746 PS2, Line 746: def _validate_ordering(ordering, schema, null_pages, values): > This should have a function comment. Added a comment. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785 PS2, Line 785: # Not accurate. Does not make sure null_count is equal to number of rows. > I think this should eventually be accurate (but I could be missing somethin Now I'm using num_values from PageHeader. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835 PS2, Line 835: def test_write_index_alltypes(self, vector, unique_database): > Let's make sure that we cover all data types (the name of alltypes is unfor Added tests for
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#3). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds heavily on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M common/thrift/parquet.thrift A tests/query_test/test_parquet_page_index.py A tests/util/get_parquet_index.py 5 files changed, 488 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/3 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 2: (12 comments) I, too, left some high level comments. http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@126 PS1, Line 126: /// Write the column index and offset index of each page in the file. Do we otherwise support writing more than one rowgroup? http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@127 PS1, Line 127: ... would only be written... ? http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@127 PS1, Line 127: typo http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; > I'm concerned that this may introduce a non-trivial amount of untracked mem The spec for page indexes actually allows to store min/max values that do not occur in the page: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L802 We should compute shorter prefixes when possible and cut them off at 4kb. We also need to decide what to do if a string starts with 4kb of 0xFF. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ See my comment above, this should only be true for null pages. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199 PS2, Line 1199: uint8_t* buffer; declare in the most nested scope possible. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < > I think we should avoid using builtin operators to compare values since the Ideally the comparators would be implemented only once and we would call the same methods here and in the parquet stats. http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift File common/thrift/parquet.thrift: PS2: Is this file now identical with the upstream parquet.thrift? http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683 PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite): I think this should live in its own file. It also needs a class comment. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734 PS2, Line 734: tmp_dir = make_tmp_dir() This should use the tmpdir fixture from pytest (see other tests in this file) http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746 PS2, Line 746: def _validate_ordering(ordering, schema, null_pages, values): This should have a function comment. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785 PS2, Line 785: # Not accurate. Does not make sure null_count is equal to number of rows. I think this should eventually be accurate (but I could be missing something). -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 16 Mar 2018 22:31:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 ) Change subject: IMPALA-5842: Write page index in Parquet files .. Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector null_pages_; I'm concerned that this may introduce a non-trivial amount of untracked memory, particularly for non-clustered dynamically-partitioned inserts where we have multiple table writers going at the same time. I don't think we cap the size of min/max values, so this is an additional concern. StringMinMaxFilter::MaterializeValues() has some relevant logic to cap the size of strings to 1KB, which we should probably use. I did a rough calculation for a 512MB parquet file. That could have 8192 64KB pages. Even with BIGINT values, that would be 128kb of stats, which is an amount we should be tracking. With 1kb strings that would be 8MB of untracked min/max values. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207 PS2, Line 1207: column_index.__set_null_pages(column.null_pages_); Can we move() these to avoid a copy? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < I think we should avoid using builtin operators to compare values since the Parquet spec's order won't necessarily always match the C++ types order. Having a helper function would make it clear that the code means the Parquet spec's order. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254 PS2, Line 1254: nit: extra blank line http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835 PS2, Line 835: def test_write_index_alltypes(self, vector, unique_database): Let's make sure that we cover all data types (the name of alltypes is unfortunate, since it's missing DECIMAL, CHAR and VARCHAR). -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 16 Mar 2018 22:03:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Hello Lars Volker, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9693 to look at the new patch set (#2). Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds heavily on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M common/thrift/parquet.thrift M tests/query_test/test_insert_parquet.py A tests/util/get_parquet_index.py 5 files changed, 433 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/2 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files
Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9693 Change subject: IMPALA-5842: Write page index in Parquet files .. IMPALA-5842: Write page index in Parquet files This commit builds heavily on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 --- M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-parquet-table-writer.h M common/thrift/parquet.thrift M tests/query_test/test_insert_parquet.py A tests/util/get_parquet_index.py 5 files changed, 433 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/1 -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy