[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 11: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 07 Nov 2017 04:34:25 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Reviewed-on: http://gerrit.cloudera.org:8080/7822
Reviewed-by: Bikramjeet Vig 
Tested-by: Impala Public Jenkins
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M testdata/data/README
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
19 files changed, 634 insertions(+), 283 deletions(-)

Approvals:
  Bikramjeet Vig: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 12
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Bikramjeet Vig (Code Review)
Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 11: Code-Review+2

Carrying over +2, rebased, fixed clang errors


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 07 Nov 2017 01:11:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1437/


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 07 Nov 2017 01:11:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht, Impala Public 
Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#11).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M testdata/data/README
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
19 files changed, 634 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/11
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 11
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 9: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1435/


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 06 Nov 2017 23:02:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1435/


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 06 Nov 2017 19:03:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 9: Code-Review+2

LGTM. Probably best to wait until after the weekend to merge it though :).


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 04 Nov 2017 00:29:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-03 Thread Bikramjeet Vig (Code Review)
Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 8:

(2 comments)

@Tim, can you take a last look at the changes and carry over the +2 if they are 
ok?

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int Encode(const InternalType& t, int encoded_byte_size, uint8_t* 
buffer,
> Oh ok, thanks for the explanation. I think we should also add a comment men
Done


http://gerrit.cloudera.org:8080/#/c/7822/8/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS8:
> We should also mention how these files were generated in the README
Done



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 04 Nov 2017 00:23:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-03 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#9).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M testdata/data/README
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
19 files changed, 639 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/9
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 8: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int Encode(const InternalType& t, int encoded_byte_size, uint8_t* 
buffer,
> This test suite basically takes a test value and its expected size and firs
Oh ok, thanks for the explanation. I think we should also add a comment 
mentioning that this doesn't exactly implement the Parquet spec, since the 
parquet spec would require that it remove leading zero bytes.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:29:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-01 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#8).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
18 files changed, 629 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/8
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 8
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-01 Thread Bikramjeet Vig (Code Review)
Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391
PS7, Line 391: fixed_len_size
> Looked again. The variable name (and recycling the argument storage) is con
Done


http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int 
fixed_len_size, uint8_t* buffer){
> I took another look at the standard and it says that the minimum number of
This test suite basically takes a test value and its expected size and first 
encodes it then decodes it, in order to make minimal changes to the test suite 
I reused the  function signature and passed the "minimum number of bytes 
required to store the unscaled value" as the expected size which is passed to 
the encode methods names as  "fixed_len_size". As per you suggestion in a 
previous comment, I believe this will be more clear if i change the name to 
encoded_byte_size.

Also, I gave an explanation of  what "fixed_len_size" for decimals stored as 
BYTE_ARRAY as a comment at line 46. I believe it'll be better to move that 
comment right above this method instead.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:21:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 7:

(2 comments)

I think we'll need to do some more work on testing for the int32/64 patches - 
we don't have pre-existing test files from parquet-mr. I think we'll have to 
generate some more test files with parquet-mr for the other cases, and we could 
consider turning that code into a data generator to generate more test files. 
From what I could tell Hive doesn't have a knob to generate some of these 
alternative output encodings.

 I feel ok with the coverage since we have end-to-end tests then more 
exhaustive unit tests for the various ways of encoding it.

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391
PS7, Line 391: fixed_len_size
Looked again. The variable name (and recycling the argument storage) is 
confusing. Maybe 'encoded_byte_size'?


http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int 
fixed_len_size, uint8_t* buffer){
I took another look at the standard and it says that the minimum number of 
bytes required to store the unscaled value should be used: 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

I think this means that we should not be including any preceding "0" bytes. 
I.e. we should not have a fixed_len_size argument and instead determine the 
size based on the number of leading zero bytes in the value.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:53:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-01 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 7:

> Looks good to me and it looks like you addressed Dan's comments.
 > Dan, did you want to take another look?

No, I don't plan to look through the details.

How confident are we that we have sufficient test coverage with the added 
tests? If we're confident then, Tim, please do the +2 (or ask for a second 
review from another reviewer if you think it's warranted).


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:40:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-31 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 7: Code-Review+1

Looks good to me and it looks like you addressed Dan's comments. Dan, did you 
want to take another look?


--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 01 Nov 2017 00:12:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-26 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#7).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
18 files changed, 633 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/7
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-26 Thread Bikramjeet Vig (Code Review)
Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-column-readers.cc@207
PS6, Line 207: InternalType, parquet::Type::type PARQUET_TYPE
> you should explain these template parameters
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@40
PS6, Line 40: IMPALA_TO_PARQUET_TYPES
I think it makes sense to rename this to INTERNAL_TO_PARQUET_TYPES to be 
consistent with our naming of template parameters.


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@88
PS6, Line 88: InternalType
> let's explain what that is
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@177
PS6, Line 177:   static int DecodeByParquetType(const uint8_t* buffer, const 
uint8_t* buffer_end, int fixed_len_size,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@333
PS6, Line 333: inline int EncodeDecimal(const T& v, int fixed_len_size, 
uint8_t* buffer){
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@34
PS6, Line 34: int
> int32_t for consistency?
Done


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@79
PS6, Line 79: LOGICAL_TYPE
> LogicalType here and below
Done



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 27 Oct 2017 00:32:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-25 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 6: Code-Review+1

(3 comments)

LGTM modulo some minor things. Will let Lars take another look.

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-common.h@333
PS6, Line 333: inline int EncodeDecimal(const T& v, int fixed_len_size, 
uint8_t* buffer){
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@34
PS6, Line 34: int
int32_t for consistency?


http://gerrit.cloudera.org:8080/#/c/7822/6/be/src/exec/parquet-plain-test.cc@79
PS6, Line 79: LOGICAL_TYPE
LogicalType here and below



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Oct 2017 22:53:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-23 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#6).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 619 insertions(+), 273 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/6
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 6
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-23 Thread Bikramjeet Vig (Code Review)
Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h@45
PS5, Line 45: INTERNAL_TYPE
> I should have mentioned this earlier, but I think we mostly use CamelCase f
Done


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@235
PS5, Line 235: template <>
 : inline int ParquetPlainEncoder::ByteSize(const Decimal4Value&) {
 :   DCHECK(false);
 :   return -1;
 : }
 : template <>
 : inline int ParquetPlainEncoder::ByteSize(const Decimal8Value&) {
 :   DCHECK(false);
 :   return -1;
 : }
 : template <>
 : inline int ParquetPlainEncoder::ByteSize(const Decimal16Value&) {
 :   DCHECK(false);
 :   return -1;
 : }
fixed code duplication here too.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@284
PS5, Line 284: template <>
 : inline int ParquetPlainEncoder::Encode(
 : const int8_t& v, int fixed_len_size, uint8_t* buffer) {
 :   int32_t val = v;
 :   memcpy(buffer, , sizeof(int32_t));
 :   return ByteSize(v);
 : }
 :
 : template <>
 : inline int ParquetPlainEncoder::Encode(
 : const int16_t& v, int fixed_len_size, uint8_t* buffer) {
 :   int32_t val = v;
 :   memcpy(buffer, , sizeof(int32_t));
 :   return ByteSize(v);
 : }
Also, duplicated methods for both int8 and int16, since both are written out as 
int32


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@328
PS5, Line 328: inline int ParquetPlainEncoder::Encode(
> It looks like we could also deduplicate the Encode/Decode methods for FIXED
Done


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@350
PS5, Line 350: Decode(const 
uint8_t* buffer,
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@112
PS5, Line 112: //Decimal4Value: General test case,
> I find this a bit confusing still because the values being encoded aren't i
As discussed, moving these test cases back to its original form


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@200
PS5, Line 200: sizeof(int32_t)
> Is there any reason to keep the encoding overhead separate from the encoded
see comment above.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 5
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Oct 2017 00:29:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 5:

(6 comments)

Mostly just cleanup and simplification comments now.

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/hdfs-parquet-scanner.h@45
PS5, Line 45: INTERNAL_TYPE
I should have mentioned this earlier, but I think we mostly use CamelCase for 
type names inside templates. We're not totally consistent throughout the 
codebase but that's the general pattern.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@328
PS5, Line 328: inline int ParquetPlainEncoder::Encode(
It looks like we could also deduplicate the Encode/Decode methods for 
FIXED_LEN_BYTE_ARRAY with the same technique.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-common.h@350
PS5, Line 350: Decode(const 
uint8_t* buffer,
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: izeof(D
> for var len decimals, its encoded size is bytes required to store size + le
Thanks for the explanation.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@112
PS5, Line 112: //Decimal4Value: General test case,
I find this a bit confusing still because the values being encoded aren't in 
the table. Could we take it even further and have something like:

  struct DecimalTestCase {
int decimal_byte_size;
int128_t decimal_val;
int expected_size;
  }

That way all of the inputs to each test would be in the same place.


http://gerrit.cloudera.org:8080/#/c/7822/5/be/src/exec/parquet-plain-test.cc@200
PS5, Line 200: sizeof(int32_t)
Is there any reason to keep the encoding overhead separate from the encoded 
size? I would have thought we would just check the total size.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 5
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 00:36:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-19 Thread Bikramjeet Vig (Code Review)
Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 4:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277
PS4, Line 1277: switch (node.element->type) {
> This case is only going to get bigger with the follow on patches - it would
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376
PS4, Line 376: inline int ParquetPlainEncoder::Decode(
> I think you can avoid stamping this out if you leave it still parameterized
Unfortunately, the call to Decode is explicit in our code, eg Decode 
therefore it will always look for a specialization of those template parameters.
Similarly, in the dummy program you wrote, if I explicitly call foo(, ), this will look for a specialization of the double templated  
method and return 0;

Since partial specialization is not possible, the only way to reduce redundant 
code will be to do what Lars suggested earlier, i.e., create a helper method 
that is only templatized on  and call that for each full 
specialization.
I will make changes accordingly in the next iteration of this patch.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237
PS2, Line 237:
> I think this particular call here will always return sizeof(int32_t) (line
should we do that everywhere else in this file wherever we specialize a 
method(both Decode and encode)?


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338
PS2, Line 338:   return fixed_len_size;
> I'm not sure I'm following: it looks like the next three methods are exactl
Oh I see what you mean now. Done!


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66
PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type,
> let's make it static - we don't want to export the symbol to be linked with
Done. enclosed in an anonymous namespace.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70
PS4, Line 70: (
> unnecessary parens
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173
PS4, Line 173: // TODO: Is this check too strict?
> I don't see why this shouldn't match the file metadata, this seems valid to
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182
PS4, Line 182:  parquet::SchemaElement*
> Why not pass by const ref like above?
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209
PS4, Line 209: stringstream ss;
> Can you convert to using Substitute() while we're here? We've been very gra
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176
PS4, Line 176:   TestType(Decimal4Value(test_val),
> It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests.
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: int32_t
> Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above
for var len decimals, its encoded size is bytes required to store size + least 
num of bytes required to store num.

in this case bytes required to store numeric_limits::max() is 
sizeof(int32_t) which is less than the sizeof(Decimal8Value) since 
Decimal8Value is stored using int64

I will add a comment at the top of these byte_array test to clarify what 
expected size is. would you recommend I add more comments before limit tests?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212
PS4, Line 212:   for (int i = 1; i <=16; ++i) {
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104
PS4, Line 104: PHYSICAL_TYPE
> Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in
reverted to 'T' to minimize change from previous code


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199
PS4, Line 199: PHYSICAL_TYPE
> It doesn't seem 

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-19 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#5).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 568 insertions(+), 256 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/5
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 5
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-16 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 4:

(3 comments)

Few more comments related to previous ones.

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104
PS4, Line 104: PHYSICAL_TYPE
Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in the 
decoder below?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199
PS4, Line 199: PHYSICAL_TYPE
It doesn't seem like the whole decoding class needs to be templated by this 
parameter, since it only affects the behaviour of the Reset() function that 
actually reads the dictionary.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@323
PS4, Line 323: DictDecoder::GetNextValue(
The logic here is also the same as above - we should find a way to avoid the 
duplication.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 16 Oct 2017 22:13:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-11 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 4:

(12 comments)

Did a first pass over it - thought I should flush out comments since you've 
been waiting quite a while for feedback here.

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277
PS4, Line 1277: switch (node.element->type) {
This case is only going to get bigger with the follow on patches - it would be 
best to make it a separate function.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@213
PS4, Line 213:   template 
Logical type and physical type I think aren't quite right, since e.g. 
"StringValue" isn't a logical type. It's really decoding/encoding between 
Parquet physical types and Impala's internal representation.

Maybe INTERNAL_TYPE and PARQUET_PHYSICAL_TYPE? Or PARQUET_TYPE seems ok since 
that's what parquet calls it.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376
PS4, Line 376: inline int ParquetPlainEncoder::Decode(
I think you can avoid stamping this out if you leave it still parameterized by 
the internal Impala type, since the logic is the same in all cases. There may 
be an opportunity to reduce the redundancy above too, since there are functions 
above with identical bodies.

I tried this out on a dummy program and it looks like it's possible:

  #include 

  template 
  int foo(T* a, S* b) {
return 0;
  }

  template <>
  int foo(int* a, int* b) { return 1; }

  template 
  int foo(T* a, double* b) { return 2; }

  int main(int argc, char**argv) {
int a, b, c;
double x, y, z;
printf("%d\n", foo(, ));
printf("%d\n", foo(, ));
  }


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66
PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type,
let's make it static - we don't want to export the symbol to be linked with 
code outside this file.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70
PS4, Line 70: (
unnecessary parens


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173
PS4, Line 173: // TODO: Is this check too strict?
I don't see why this shouldn't match the file metadata, this seems valid to me.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182
PS4, Line 182:  parquet::SchemaElement*
Why not pass by const ref like above?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209
PS4, Line 209: stringstream ss;
Can you convert to using Substitute() while we're here? We've been very 
gradually converting to using that for cases like this.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176
PS4, Line 176:   TestType(Decimal4Value(test_val),
It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. 
Mostly the tests are the same except for the expected size, right? Could we 
make them table-driven so that we test all the same cases but just have 
different expected output?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: int32_t
Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above? Or 
if this is the better way to express it, should we change it above?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212
PS4, Line 212:   for (int i = 1; i <=16; ++i) {
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS4:
Can you update the README to describe how these files were generated.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-11 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-column-stats.inline.h@89
PS2, Line 89:   switch(parquet_type){
> Do you mean to have a Decode wrapper around the templatized Decode methods?
The former, so that the interface of Decode() is simpler. How this is 
implemented seems more a concern of the decoder than the column stats.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237
PS2, Line 237: ByteSize
> Do you mean to have something like
I think this particular call here will always return sizeof(int32_t) (line 
220). I'd just put that here, since your change explicitly documents that as a 
template parameter.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338
PS2, Line 338: template <>
> thats kinda difficult because DecimalUtil::DecodeFromFixedLenByteArray is a
I'm not sure I'm following: it looks like the next three methods are exactly 
the same. Couldn't you move them into a new method DecodeDecimalValue(const 
uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size, T* v) and then 
call it and return in one line here? I may be missing something though :)



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 12 Oct 2017 00:33:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-10-04 Thread Bikramjeet Vig (Code Review)
Hello Lars Volker, Matthew Jacobs, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7822

to look at the new patch set (#4).

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Extendes parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column level metadata verification which was currently being
done per row group will now only be done once per column per file.

Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 622 insertions(+), 259 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/4
--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2017-03-07 Thread Henry Robinson (Code Review)
Henry Robinson has abandoned this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Abandoned

-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2017-01-06 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 3:

Agree with Tim about the template code changes, and there's less urgency to get 
this in so I'll abandon this approach for now.

-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-17 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 3:

s/binary/dictionary in last comment.

-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-17 Thread Henry Robinson (Code Review)
Henry Robinson has uploaded a new patch set (#3).

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

 * Extend metadata checks to allow more than one possible physical type
   for a given logical type.
 * Change decimal decoding to handle non-fixed-length format in same path
   as fixed-length encoding.

Testing:

 * Query test that decodes both plain and dictionary-encoded decimals
   using binary encoding.

Perf:

 * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
   decimal(12,2) values using FIXED_BYTE_ARRAY physical type encoding.
 * The overhead of decoding with the extra branch was measured at 1s;
   i.e. the per-decode overhead is 1ns.

Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test
M tests/query_test/test_scanners.py
7 files changed, 135 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/5115/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG
Commit Message:

Line 21:  * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
> I'm assuming you are measuring response time. Since there is overall more w
Right, but I'm measuring the absolute difference in response time, not  the 
relative difference. So it doesn't totally matter what the rest of the query is 
doing, as long as it's stable (I agree that variance can drown out the signal). 
Thankfully all the runtimes I measured were remarkably stable. However, your 
next point is a very good one.


Line 23:  * No performance difference measured by introduction of extra
> I'm assuming you compared response times. With multi-threaded scans the los
Doh, of course you're right, and I wasn't properly measuring the overhead, 
because of parallelism.

With a single thread, decoding 1B decimals is about 1 second slower, which is 
about 1ns per decode. I've updated the commit message (still think that's not a 
perf difference worth chasing).


http://gerrit.cloudera.org:8080/#/c/5115/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 203:   /// assumed to be BYTE_ARRAY, otherwise it is assumed to be 
FIXED_LEN_BYTE_ARRAY.
> Document return value?
Done


PS1, Line 328: fixed_len_size
> This name seems inaccurate since it can be variable. Maybe 'encoded_size'?
Hm, I think it's accurate - either it's set (> 0) and fixed, or it's not set 
(-1). 'encoded_size' wouldn't work well later in the method (because it's set 
to the size of the buffer, not the total encoded size). I could of course have 
another variable to manage that, but this seems ok to me.


PS1, Line 338: fixed_len_size
> What if fixed_len_size is negative?
Done.


http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 150:   # Decimal Parquet encodings written by Java Parquet library.
> If you're going to create the table as part of the test, might as well copy
Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Henry Robinson (Code Review)
Henry Robinson has uploaded a new patch set (#2).

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

 * Extend metadata checks to allow more than one possible physical type
   for a given logical type.
 * Change decimal decoding to handle non-fixed-length format in same path
   as fixed-length encoding.

Testing:

 * Query test that decodes dictionary-encoded decimals using binary
   encoding.

Perf:

 * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
   decimal(12,2) values using FIXED_BYTE_ARRAY physical type encoding.
 * The overhead of decoding with the extra branch was measured at 1s;
   i.e. the per-decode overhead is 1ns.

Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
A testdata/data/byte_array_decimal_dict_encoded.parquet
A testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test
M tests/query_test/test_scanners.py
6 files changed, 119 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/5115/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG
Commit Message:

Line 21:  * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
> Can do - but where would that extra slowness really come from? I would have
I'm assuming you are measuring response time. Since there is overall more work 
for the scanner to do in your dict-encoded   experiment, any difference in perf 
will be less pronounced because it affects a relatively smaller portion of the 
work. With plain encoded there is no "overhead" of decoding the dictionary 
indexes and fetching the values from the dictionary. For a single decimal 
column, the work of decoding the dict indexes and fetching their values should 
be in the same ball park as just populating the slot directly with plain 
encoding, so there is roughly 50% "noise" it seems.


Line 23:  * No performance difference measured by introduction of extra
> No, but I can do. What do you expect to change?
I'm assuming you compared response times. With multi-threaded scans the loss in 
perf might not be apparent.

With mt_dop=1 we're running the whole query in a single thread, so any slowdown 
along that critical path should prominently affect response time.


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5115/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 203:   /// assumed to be BYTE_ARRAY, otherwise it is assumed to be 
FIXED_LEN_BYTE_ARRAY.
Document return value?


PS1, Line 328: fixed_len_size
This name seems inaccurate since it can be variable. Maybe 'encoded_size'?


PS1, Line 338: fixed_len_size
What if fixed_len_size is negative?


http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 150:   # Decimal Parquet encodings written by Java Parquet library.
If you're going to create the table as part of the test, might as well copy the 
data as part of the test (and avoid the need to reload test data). E.g. like 
https://github.com/apache/incubator-impala/blob/master/tests/query_test/test_scanners.py#L248


http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test
File 
testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test:

PS1, Line 1: 
   :  QUERY
   : select col2 from decimal_encodings;
   :  RESULTS
   : 123.00
   : 123.00
   : 123.00
   : 123.00
   : 123.00
   :  TYPES
   : DECIMAL
> We're working on getting some richer test data - it's just a bit of a pain 
Sounds good, I think we need a bit more, especially boundary cases for 
different decimal sizes.


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG
Commit Message:

Line 21:  * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
> Can also verify on plain encoding? Seems like that would be the extreme cas
Can do - but where would that extra slowness really come from? I would have 
thought decoding 1 billion values would represent a large portion of the 
execution time of such a simple query, and if the proportion is really that 
small I'm inclined to worry even less about the perf differential.


Line 23:  * No performance difference measured by introduction of extra
> Did you run this with num_scanner_threads=1 or even better mt_dop=1?
No, but I can do. What do you expect to change?


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG
Commit Message:

Line 23:  * No performance difference measured by introduction of extra
Did you run this with num_scanner_threads=1 or even better mt_dop=1?


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG
Commit Message:

Line 21:  * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
Can also verify on plain encoding? Seems like that would be the extreme case. 
The perf difference may not be pronounced enough with dict-encoding due to the 
general slowness of a dictionary-encoded column with only distinct values.


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Henry Robinson (Code Review)
Henry Robinson has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test
File 
testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test:

PS1, Line 1: 
   :  QUERY
   : select col2 from decimal_encodings;
   :  RESULTS
   : 123.00
   : 123.00
   : 123.00
   : 123.00
   : 123.00
   :  TYPES
   : DECIMAL
We're working on getting some richer test data - it's just a bit of a pain to 
write it out. This will include > 1 dict-encoded value, and some values that 
are plain encoded.


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson 
Gerrit-Reviewer: Henry Robinson 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

2016-11-16 Thread Henry Robinson (Code Review)
Henry Robinson has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5115

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet 
scanner
..

IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner

* Extend metadata checks to allow more than one possible physical type
  for a given logical type.
* Change decimal decoding to handle non-fixed-length format in same path
  as fixed-length encoding.

Testing:

 * Query test that decodes dictionary-encoded decimals using binary
   encoding.

Perf:

 * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
   decimal(12,2) values using FIXED_BYTE_ARRAY physical type encoding.
 * No performance difference measured by introduction of extra
   predictable branch to Decode() path.

Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M testdata/bin/create-load-data.sh
A testdata/data/byte_array_decimal_dict_encoded.parquet
A testdata/workloads/functional-query/queries/QueryTest/decimal-encodings.test
M tests/query_test/test_scanners.py
7 files changed, 118 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/5115/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson