[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 11: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 11 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 07 Nov 2017 04:34:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Reviewed-on: http://gerrit.cloudera.org:8080/7822 Reviewed-by: Bikramjeet VigTested-by: Impala Public Jenkins --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc M testdata/data/README A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 19 files changed, 634 insertions(+), 283 deletions(-) Approvals: Bikramjeet Vig: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 12 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 11: Code-Review+2 Carrying over +2, rebased, fixed clang errors -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 11 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 07 Nov 2017 01:11:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 11: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1437/ -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 11 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 07 Nov 2017 01:11:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#11). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc M testdata/data/README A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 19 files changed, 634 insertions(+), 283 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/11 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 11 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 9: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1435/ -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 9 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 06 Nov 2017 23:02:09 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 9: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1435/ -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 9 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 06 Nov 2017 19:03:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 9: Code-Review+2 LGTM. Probably best to wait until after the weekend to merge it though :). -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 9 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 04 Nov 2017 00:29:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 8: (2 comments) @Tim, can you take a last look at the changes and carry over the +2 if they are ok? http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33 PS7, Line 33: int Encode(const InternalType& t, int encoded_byte_size, uint8_t* buffer, > Oh ok, thanks for the explanation. I think we should also add a comment men Done http://gerrit.cloudera.org:8080/#/c/7822/8/testdata/data/binary_decimal_dictionary.parquet File testdata/data/binary_decimal_dictionary.parquet: PS8: > We should also mention how these files were generated in the README Done -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 8 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 04 Nov 2017 00:23:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#9). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc M testdata/data/README A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 19 files changed, 639 insertions(+), 283 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/9 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 9 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 8: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33 PS7, Line 33: int Encode(const InternalType& t, int encoded_byte_size, uint8_t* buffer, > This test suite basically takes a test value and its expected size and firs Oh ok, thanks for the explanation. I think we should also add a comment mentioning that this doesn't exactly implement the Parquet spec, since the parquet spec would require that it remove leading zero bytes. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 8 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 23:29:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#8). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 18 files changed, 629 insertions(+), 283 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/8 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 8 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391 PS7, Line 391: fixed_len_size > Looked again. The variable name (and recycling the argument storage) is con Done http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33 PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int fixed_len_size, uint8_t* buffer){ > I took another look at the standard and it says that the minimum number of This test suite basically takes a test value and its expected size and first encodes it then decodes it, in order to make minimal changes to the test suite I reused the function signature and passed the "minimum number of bytes required to store the unscaled value" as the expected size which is passed to the encode methods names as "fixed_len_size". As per you suggestion in a previous comment, I believe this will be more clear if i change the name to encoded_byte_size. Also, I gave an explanation of what "fixed_len_size" for decimals stored as BYTE_ARRAY as a comment at line 46. I believe it'll be better to move that comment right above this method instead. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 7 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 23:21:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 7: (2 comments) I think we'll need to do some more work on testing for the int32/64 patches - we don't have pre-existing test files from parquet-mr. I think we'll have to generate some more test files with parquet-mr for the other cases, and we could consider turning that code into a data generator to generate more test files. From what I could tell Hive doesn't have a knob to generate some of these alternative output encodings. I feel ok with the coverage since we have end-to-end tests then more exhaustive unit tests for the various ways of encoding it. http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391 PS7, Line 391: fixed_len_size Looked again. The variable name (and recycling the argument storage) is confusing. Maybe 'encoded_byte_size'? http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33 PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int fixed_len_size, uint8_t* buffer){ I took another look at the standard and it says that the minimum number of bytes required to store the unscaled value should be used: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal I think this means that we should not be including any preceding "0" bytes. I.e. we should not have a fixed_len_size argument and instead determine the size based on the number of leading zero bytes in the value. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 7 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 20:53:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 7: > Looks good to me and it looks like you addressed Dan's comments. > Dan, did you want to take another look? No, I don't plan to look through the details. How confident are we that we have sufficient test coverage with the added tests? If we're confident then, Tim, please do the +2 (or ask for a second review from another reviewer if you think it's warranted). -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 7 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 20:40:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 7: Code-Review+1 Looks good to me and it looks like you addressed Dan's comments. Dan, did you want to take another look? -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 7 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 00:12:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#7). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 18 files changed, 633 insertions(+), 283 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/7 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 7 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc@207 PS6, Line 207: InternalType, parquet::Type::type PARQUET_TYPE > you should explain these template parameters Done http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@40 PS6, Line 40: IMPALA_TO_PARQUET_TYPES I think it makes sense to rename this to INTERNAL_TO_PARQUET_TYPES to be consistent with our naming of template parameters. http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@88 PS6, Line 88: InternalType > let's explain what that is Done http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@177 PS6, Line 177: static int DecodeByParquetType(const uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size, > nit: long line Done http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@333 PS6, Line 333: inline int EncodeDecimal(const T& v, int fixed_len_size, uint8_t* buffer){ > nit: missing space Done http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@34 PS6, Line 34: int > int32_t for consistency? Done http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@79 PS6, Line 79: LOGICAL_TYPE > LogicalType here and below Done -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 6 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Oct 2017 00:32:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 6: Code-Review+1 (3 comments) LGTM modulo some minor things. Will let Lars take another look. http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@333 PS6, Line 333: inline int EncodeDecimal(const T& v, int fixed_len_size, uint8_t* buffer){ nit: missing space http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@34 PS6, Line 34: int int32_t for consistency? http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@79 PS6, Line 79: LOGICAL_TYPE LogicalType here and below -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 6 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Oct 2017 22:53:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#6). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 17 files changed, 619 insertions(+), 273 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/6 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 6 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 5: (7 comments) http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h@45 PS5, Line 45: INTERNAL_TYPE > I should have mentioned this earlier, but I think we mostly use CamelCase f Done http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@235 PS5, Line 235: template <> : inline int ParquetPlainEncoder::ByteSize(const Decimal4Value&) { : DCHECK(false); : return -1; : } : template <> : inline int ParquetPlainEncoder::ByteSize(const Decimal8Value&) { : DCHECK(false); : return -1; : } : template <> : inline int ParquetPlainEncoder::ByteSize(const Decimal16Value&) { : DCHECK(false); : return -1; : } fixed code duplication here too. http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@284 PS5, Line 284: template <> : inline int ParquetPlainEncoder::Encode( : const int8_t& v, int fixed_len_size, uint8_t* buffer) { : int32_t val = v; : memcpy(buffer, , sizeof(int32_t)); : return ByteSize(v); : } : : template <> : inline int ParquetPlainEncoder::Encode( : const int16_t& v, int fixed_len_size, uint8_t* buffer) { : int32_t val = v; : memcpy(buffer, , sizeof(int32_t)); : return ByteSize(v); : } Also, duplicated methods for both int8 and int16, since both are written out as int32 http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@328 PS5, Line 328: inline int ParquetPlainEncoder::Encode( > It looks like we could also deduplicate the Encode/Decode methods for FIXED Done http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@350 PS5, Line 350: Decode(const uint8_t* buffer, > nit: missing space Done http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@112 PS5, Line 112: //Decimal4Value: General test case, > I find this a bit confusing still because the values being encoded aren't i As discussed, moving these test cases back to its original form http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@200 PS5, Line 200: sizeof(int32_t) > Is there any reason to keep the encoding overhead separate from the encoded see comment above. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 5 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Oct 2017 00:29:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 5: (6 comments) Mostly just cleanup and simplification comments now. http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h@45 PS5, Line 45: INTERNAL_TYPE I should have mentioned this earlier, but I think we mostly use CamelCase for type names inside templates. We're not totally consistent throughout the codebase but that's the general pattern. http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@328 PS5, Line 328: inline int ParquetPlainEncoder::Encode( It looks like we could also deduplicate the Encode/Decode methods for FIXED_LEN_BYTE_ARRAY with the same technique. http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@350 PS5, Line 350: Decode(const uint8_t* buffer, nit: missing space http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193 PS4, Line 193: izeof(D > for var len decimals, its encoded size is bytes required to store size + le Thanks for the explanation. http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@112 PS5, Line 112: //Decimal4Value: General test case, I find this a bit confusing still because the values being encoded aren't in the table. Could we take it even further and have something like: struct DecimalTestCase { int decimal_byte_size; int128_t decimal_val; int expected_size; } That way all of the inputs to each test would be in the same place. http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@200 PS5, Line 200: sizeof(int32_t) Is there any reason to keep the encoding overhead separate from the encoded size? I would have thought we would just check the total size. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 5 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 00:36:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 4: (16 comments) http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277 PS4, Line 1277: switch (node.element->type) { > This case is only going to get bigger with the follow on patches - it would Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376 PS4, Line 376: inline int ParquetPlainEncoder::Decode( > I think you can avoid stamping this out if you leave it still parameterized Unfortunately, the call to Decode is explicit in our code, eg Decode therefore it will always look for a specialization of those template parameters. Similarly, in the dummy program you wrote, if I explicitly call foo (, ), this will look for a specialization of the double templated method and return 0; Since partial specialization is not possible, the only way to reduce redundant code will be to do what Lars suggested earlier, i.e., create a helper method that is only templatized on and call that for each full specialization. I will make changes accordingly in the next iteration of this patch. http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237 PS2, Line 237: > I think this particular call here will always return sizeof(int32_t) (line should we do that everywhere else in this file wherever we specialize a method(both Decode and encode)? http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338 PS2, Line 338: return fixed_len_size; > I'm not sure I'm following: it looks like the next three methods are exactl Oh I see what you mean now. Done! http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66 PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type, > let's make it static - we don't want to export the symbol to be linked with Done. enclosed in an anonymous namespace. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70 PS4, Line 70: ( > unnecessary parens Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173 PS4, Line 173: // TODO: Is this check too strict? > I don't see why this shouldn't match the file metadata, this seems valid to Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182 PS4, Line 182: parquet::SchemaElement* > Why not pass by const ref like above? Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209 PS4, Line 209: stringstream ss; > Can you convert to using Substitute() while we're here? We've been very gra Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176 PS4, Line 176: TestType (Decimal4Value(test_val), > It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193 PS4, Line 193: int32_t > Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above for var len decimals, its encoded size is bytes required to store size + least num of bytes required to store num. in this case bytes required to store numeric_limits::max() is sizeof(int32_t) which is less than the sizeof(Decimal8Value) since Decimal8Value is stored using int64 I will add a comment at the top of these byte_array test to clarify what expected size is. would you recommend I add more comments before limit tests? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212 PS4, Line 212: for (int i = 1; i <=16; ++i) { > nit: missing space Done http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104 PS4, Line 104: PHYSICAL_TYPE > Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in reverted to 'T' to minimize change from previous code http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199 PS4, Line 199: PHYSICAL_TYPE > It doesn't seem
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#5). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 17 files changed, 568 insertions(+), 256 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/5 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 5 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 4: (3 comments) Few more comments related to previous ones. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104 PS4, Line 104: PHYSICAL_TYPE Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in the decoder below? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199 PS4, Line 199: PHYSICAL_TYPE It doesn't seem like the whole decoding class needs to be templated by this parameter, since it only affects the behaviour of the Reset() function that actually reads the dictionary. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@323 PS4, Line 323: DictDecoder::GetNextValue( The logic here is also the same as above - we should find a way to avoid the duplication. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 4 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 16 Oct 2017 22:13:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 4: (12 comments) Did a first pass over it - thought I should flush out comments since you've been waiting quite a while for feedback here. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277 PS4, Line 1277: switch (node.element->type) { This case is only going to get bigger with the follow on patches - it would be best to make it a separate function. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@213 PS4, Line 213: template Logical type and physical type I think aren't quite right, since e.g. "StringValue" isn't a logical type. It's really decoding/encoding between Parquet physical types and Impala's internal representation. Maybe INTERNAL_TYPE and PARQUET_PHYSICAL_TYPE? Or PARQUET_TYPE seems ok since that's what parquet calls it. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376 PS4, Line 376: inline int ParquetPlainEncoder::Decode( I think you can avoid stamping this out if you leave it still parameterized by the internal Impala type, since the logic is the same in all cases. There may be an opportunity to reduce the redundancy above too, since there are functions above with identical bodies. I tried this out on a dummy program and it looks like it's possible: #include template int foo(T* a, S* b) { return 0; } template <> int foo(int* a, int* b) { return 1; } template int foo(T* a, double* b) { return 2; } int main(int argc, char**argv) { int a, b, c; double x, y, z; printf("%d\n", foo(, )); printf("%d\n", foo(, )); } http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66 PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type, let's make it static - we don't want to export the symbol to be linked with code outside this file. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70 PS4, Line 70: ( unnecessary parens http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173 PS4, Line 173: // TODO: Is this check too strict? I don't see why this shouldn't match the file metadata, this seems valid to me. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182 PS4, Line 182: parquet::SchemaElement* Why not pass by const ref like above? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209 PS4, Line 209: stringstream ss; Can you convert to using Substitute() while we're here? We've been very gradually converting to using that for cases like this. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176 PS4, Line 176: TestType (Decimal4Value(test_val), It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. Mostly the tests are the same except for the expected size, right? Could we make them table-driven so that we test all the same cases but just have different expected output? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193 PS4, Line 193: int32_t Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above? Or if this is the better way to express it, should we change it above? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212 PS4, Line 212: for (int i = 1; i <=16; ++i) { nit: missing space http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet File testdata/data/binary_decimal_dictionary.parquet: PS4: Can you update the README to describe how these files were generated. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 4 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 ) Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h@89 PS2, Line 89: switch(parquet_type){ > Do you mean to have a Decode wrapper around the templatized Decode methods? The former, so that the interface of Decode() is simpler. How this is implemented seems more a concern of the decoder than the column stats. http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237 PS2, Line 237: ByteSize > Do you mean to have something like I think this particular call here will always return sizeof(int32_t) (line 220). I'd just put that here, since your change explicitly documents that as a template parameter. http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338 PS2, Line 338: template <> > thats kinda difficult because DecimalUtil::DecodeFromFixedLenByteArray is a I'm not sure I'm following: it looks like the next three methods are exactly the same. Couldn't you move them into a new method DecodeDecimalValue(const uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size, T* v) and then call it and return in one line here? I may be missing something though :) -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 2 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 12 Oct 2017 00:33:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7822 to look at the new patch set (#4). Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e --- M be/src/exec/data-source-scan-node.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exec/parquet-plain-test.cc M be/src/util/dict-encoding.h M be/src/util/dict-test.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test M tests/query_test/test_scanners.py 17 files changed, 622 insertions(+), 259 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/4 -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 4 Gerrit-Owner: Bikramjeet VigGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has abandoned this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Abandoned -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 3: Agree with Tim about the template code changes, and there's less urgency to get this in so I'll abandon this approach for now. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 3: s/binary/dictionary in last comment. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has uploaded a new patch set (#3). Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner * Extend metadata checks to allow more than one possible physical type for a given logical type. * Change decimal decoding to handle non-fixed-length format in same path as fixed-length encoding. Testing: * Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Perf: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded decimal(12,2) values using FIXED_BYTE_ARRAY physical type encoding. * The overhead of decoding with the extra branch was measured at 1s; i.e. the per-decode overhead is 1ns. Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc A testdata/data/binary_decimal_dictionary.parquet A testdata/data/binary_decimal_no_dictionary.parquet A testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test M tests/query_test/test_scanners.py 7 files changed, 135 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/5115/3 -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG Commit Message: Line 21: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded > I'm assuming you are measuring response time. Since there is overall more w Right, but I'm measuring the absolute difference in response time, not the relative difference. So it doesn't totally matter what the rest of the query is doing, as long as it's stable (I agree that variance can drown out the signal). Thankfully all the runtimes I measured were remarkably stable. However, your next point is a very good one. Line 23: * No performance difference measured by introduction of extra > I'm assuming you compared response times. With multi-threaded scans the los Doh, of course you're right, and I wasn't properly measuring the overhead, because of parallelism. With a single thread, decoding 1B decimals is about 1 second slower, which is about 1ns per decode. I've updated the commit message (still think that's not a perf difference worth chasing). http://gerrit.cloudera.org:8080/#/c/5115/1/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: Line 203: /// assumed to be BYTE_ARRAY, otherwise it is assumed to be FIXED_LEN_BYTE_ARRAY. > Document return value? Done PS1, Line 328: fixed_len_size > This name seems inaccurate since it can be variable. Maybe 'encoded_size'? Hm, I think it's accurate - either it's set (> 0) and fixed, or it's not set (-1). 'encoded_size' wouldn't work well later in the method (because it's set to the size of the buffer, not the total encoded size). I could of course have another variable to manage that, but this seems ok to me. PS1, Line 338: fixed_len_size > What if fixed_len_size is negative? Done. http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: Line 150: # Decimal Parquet encodings written by Java Parquet library. > If you're going to create the table as part of the test, might as well copy Done. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has uploaded a new patch set (#2). Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner * Extend metadata checks to allow more than one possible physical type for a given logical type. * Change decimal decoding to handle non-fixed-length format in same path as fixed-length encoding. Testing: * Query test that decodes dictionary-encoded decimals using binary encoding. Perf: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded decimal(12,2) values using FIXED_BYTE_ARRAY physical type encoding. * The overhead of decoding with the extra branch was measured at 1s; i.e. the per-decode overhead is 1ns. Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc A testdata/data/byte_array_decimal_dict_encoded.parquet A testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test M tests/query_test/test_scanners.py 6 files changed, 119 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/5115/2 -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Alex Behm has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG Commit Message: Line 21: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded > Can do - but where would that extra slowness really come from? I would have I'm assuming you are measuring response time. Since there is overall more work for the scanner to do in your dict-encoded experiment, any difference in perf will be less pronounced because it affects a relatively smaller portion of the work. With plain encoded there is no "overhead" of decoding the dictionary indexes and fetching the values from the dictionary. For a single decimal column, the work of decoding the dict indexes and fetching their values should be in the same ball park as just populating the slot directly with plain encoding, so there is roughly 50% "noise" it seems. Line 23: * No performance difference measured by introduction of extra > No, but I can do. What do you expect to change? I'm assuming you compared response times. With multi-threaded scans the loss in perf might not be apparent. With mt_dop=1 we're running the whole query in a single thread, so any slowdown along that critical path should prominently affect response time. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Tim Armstrong has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/5115/1/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: Line 203: /// assumed to be BYTE_ARRAY, otherwise it is assumed to be FIXED_LEN_BYTE_ARRAY. Document return value? PS1, Line 328: fixed_len_size This name seems inaccurate since it can be variable. Maybe 'encoded_size'? PS1, Line 338: fixed_len_size What if fixed_len_size is negative? http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: Line 150: # Decimal Parquet encodings written by Java Parquet library. If you're going to create the table as part of the test, might as well copy the data as part of the test (and avoid the need to reload test data). E.g. like https://github.com/apache/incubator-impala/blob/master/tests/query_test/test_scanners.py#L248 http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test File testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test: PS1, Line 1: : QUERY : select col2 from decimal_encodings; : RESULTS : 123.00 : 123.00 : 123.00 : 123.00 : 123.00 : TYPES : DECIMAL > We're working on getting some richer test data - it's just a bit of a pain Sounds good, I think we need a bit more, especially boundary cases for different decimal sizes. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG Commit Message: Line 21: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded > Can also verify on plain encoding? Seems like that would be the extreme cas Can do - but where would that extra slowness really come from? I would have thought decoding 1 billion values would represent a large portion of the execution time of such a simple query, and if the proportion is really that small I'm inclined to worry even less about the perf differential. Line 23: * No performance difference measured by introduction of extra > Did you run this with num_scanner_threads=1 or even better mt_dop=1? No, but I can do. What do you expect to change? -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Alex Behm has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG Commit Message: Line 23: * No performance difference measured by introduction of extra Did you run this with num_scanner_threads=1 or even better mt_dop=1? -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Alex Behm has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG Commit Message: Line 21: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded Can also verify on plain encoding? Seems like that would be the extreme case. The perf difference may not be pronounced enough with dict-encoding due to the general slowness of a dictionary-encoded column with only distinct values. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has posted comments on this change. Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test File testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test: PS1, Line 1: : QUERY : select col2 from decimal_encodings; : RESULTS : 123.00 : 123.00 : 123.00 : 123.00 : 123.00 : TYPES : DECIMAL We're working on getting some richer test data - it's just a bit of a pain to write it out. This will include > 1 dict-encoded value, and some values that are plain encoded. -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry RobinsonGerrit-Reviewer: Henry Robinson Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Henry Robinson has uploaded a new change for review. http://gerrit.cloudera.org:8080/5115 Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner .. IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner * Extend metadata checks to allow more than one possible physical type for a given logical type. * Change decimal decoding to handle non-fixed-length format in same path as fixed-length encoding. Testing: * Query test that decodes dictionary-encoded decimals using binary encoding. Perf: * Tested computing SUM(col) for 1 billion distinct dictionary-encoded decimal(12,2) values using FIXED_BYTE_ARRAY physical type encoding. * No performance difference measured by introduction of extra predictable branch to Decode() path. Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-common.h M be/src/exec/parquet-metadata-utils.cc M testdata/bin/create-load-data.sh A testdata/data/byte_array_decimal_dict_encoded.parquet A testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test M tests/query_test/test_scanners.py 7 files changed, 118 insertions(+), 48 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/5115/1 -- To view, visit http://gerrit.cloudera.org:8080/5115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry Robinson