Re: Fallback Encoding for Very Sparse or Sorted Datasets
Hi Gang,

thanks for your reply.

On 01.03.23 03:09, Gang Wu wrote:
> If at least one record in the beginning 2 rows is not null, then the
> encoded size will be much better.

That is the workaround I have been using for the past weeks, although my
tests show that at least two values are required.

> 3. If dictionary encoding is in effect, the first page must be a
> dictionary page followed by a set of data pages that are only indices of
> the dictionary.
> [...]
> 5. By default, the parquet-mr implementation has to decide the encoding
> of a page when it reaches 2 records.

I agree that this is at the core of the problem; the question is, can
this be changed to allow for better encoding decisions in the scenario I
described? An all-null page contains just definition and (possibly)
repetition levels, no value entries, so there is no need to choose their
encoding yet. What are the reasons for forcing the dictionary to be the
first page?

Kind Regards

Patrick
[jira] [Created] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
Yujiang Zhong created PARQUET-2252:
--------------------------------------

Summary: Make some methods public to allow external projects to implement page skipping
Key: PARQUET-2252
URL: https://issues.apache.org/jira/browse/PARQUET-2252
Project: Parquet
Issue Type: New Feature
Reporter: Yujiang Zhong

Iceberg hopes to implement the column index filter based on Iceberg's own expressions, so we would like to be able to use some of the methods in the Parquet repo, for example the methods in `RowRanges` and `IndexIterator`; however, these are currently not public, and we can only rely on reflection to use them.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
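The reflection workaround mentioned in the issue looks roughly like the sketch below. The class and method names here are illustrative stand-ins, not the actual parquet-mr internals (the real classes are `RowRanges` and `IndexIterator`); the point is only why external callers resort to `setAccessible`:

```java
import java.lang.reflect.Method;

// Illustrative stand-in for a parquet-mr class whose factory method is not public.
class HiddenRanges {
    private final long rows;

    private HiddenRanges(long rows) {
        this.rows = rows;
    }

    // Package-private, like the methods the issue asks to open up.
    static HiddenRanges single(long rows) {
        return new HiddenRanges(rows);
    }

    long rowCount() {
        return rows;
    }
}

public class ReflectionWorkaround {
    // Callers in a different package would have to do something like this:
    static long rowCountViaReflection(long rows) {
        try {
            Method m = HiddenRanges.class.getDeclaredMethod("single", long.class);
            m.setAccessible(true); // brittle: breaks under JPMS strong encapsulation
            HiddenRanges r = (HiddenRanges) m.invoke(null, rows);
            return r.rowCount();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rowCountViaReflection(42L));
    }
}
```

Besides the boilerplate, this kind of access can fail at runtime when the target is later moved or renamed, which is why a public, maintained API is preferable.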
[GitHub] [parquet-mr] zhongyujiang opened a new pull request, #1038: PARQUET-2252: Make some methods public to allow external projects to …
zhongyujiang opened a new pull request, #1038:
URL: https://github.com/apache/parquet-mr/pull/1038

…implement page skipping.

Issue: [PARQUET-2252](https://issues.apache.org/jira/browse/PARQUET-2252)

This PR makes the methods required to implement a column index filter public, allowing Iceberg to build its own column index filtering. Since Iceberg is going to calculate `RowRanges` itself, this also adds a public method to `ParquetFileReader` that lets users pass in `RowRanges` to read a filtered row group. Example usage of these changes can be found in this [PR](https://github.com/apache/iceberg/pull/6967), which currently uses reflection as a workaround.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [parquet-mr] zhongyujiang commented on pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …
zhongyujiang commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1449991193

@wgtmac @rdblue Can you please help review this?
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695023#comment-17695023 ] ASF GitHub Bot commented on PARQUET-2252: - zhongyujiang commented on PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1449991193 @wgtmac @rdblue Can you please help review this? > Make some methods public to allow external projects to implement page skipping > -- > > Key: PARQUET-2252 > URL: https://issues.apache.org/jira/browse/PARQUET-2252 > Project: Parquet > Issue Type: New Feature >Reporter: Yujiang Zhong >Priority: Major > > Iceberg hopes to implement the column index filter based on Iceberg's own > expressions, we would like to be able to use some of the methods in Parquet > repo, for example: methods in `RowRanges` and `IndexIterator`, however these > are currently not public. Currently we can only rely on reflection to use > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [parquet-mr] gszadovszky commented on pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite
gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450139433

Thanks a lot, @wgtmac. It looks good to me.
[jira] [Commented] (PARQUET-2230) Add a new rewrite command powered by ParquetRewriter
[ https://issues.apache.org/jira/browse/PARQUET-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695070#comment-17695070 ]

ASF GitHub Bot commented on PARQUET-2230:

gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450139433

Thanks a lot, @wgtmac. It looks good to me.

> Add a new rewrite command powered by ParquetRewriter
> ----------------------------------------------------
>
> Key: PARQUET-2230
> URL: https://issues.apache.org/jira/browse/PARQUET-2230
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-cli
> Reporter: Gang Wu
> Assignee: Gang Wu
> Priority: Major
>
> parquet-cli has several commands for rewriting files but is missing a
> consolidated one that provides the full features of ParquetRewriter.
[GitHub] [parquet-mr] gszadovszky commented on pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite
gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450142055

(Congrats for the committership! From now on I won't push your PRs. :wink: )
Re: Fallback Encoding for Very Sparse or Sorted Datasets
> What are the reasons for forcing the dictionary to be the first page?

This is by design. I guess it benefits sequential scans, where the dictionary page is read first and then followed by its encoded indices in the data pages. Otherwise we would need to seek anyway.

> can this be changed to allow for better encoding decisions in the scenario I described?

I think that is possible. But it requires a code change to buffer the input data and postpone the encoding decision until the writer has sufficient knowledge of the data.

Best,
Gang

On Wed, Mar 1, 2023 at 6:33 PM Patrick Hansert wrote:

> Hi Gang,
>
> thanks for your reply.
>
> On 01.03.23 03:09, Gang Wu wrote:
> > If at least one record in the beginning 2 rows is not null, then the
> > encoded size will be much better.
>
> That is the workaround I have been using for the past weeks, although my
> tests show that at least two values are required.
>
> > 3. If dictionary encoding is in effect, the first page must be a
> > dictionary page followed by a set of data pages that are only indices of
> > the dictionary.
> > [...]
> > 5. By default, the parquet-mr implementation has to decide the encoding
> > of a page when it reaches 2 records.
>
> I agree that this is at the core of the problem; the question is, can
> this be changed to allow for better encoding decisions in the scenario I
> described? An all-null page contains just definition and (possibly)
> repetition levels, no value entries, so there is no need to choose their
> encoding yet. What are the reasons for forcing the dictionary to be the
> first page?
>
> Kind Regards
>
> Patrick
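The buffering idea Gang describes, postponing the encoding decision until enough non-null values have arrived, could be sketched as a toy model like the one below. This is not the parquet-mr writer; the class, the threshold, and the dictionary heuristic are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy writer that defers the dictionary-vs-plain decision until it has seen a
// minimum number of non-null values. Leading all-null pages (which carry only
// definition/repetition levels, no value entries) then never force an early,
// badly informed encoding choice.
public class DeferredEncodingWriter {
    enum Encoding { UNDECIDED, DICTIONARY, PLAIN }

    static final int MIN_NON_NULLS_TO_DECIDE = 2; // illustrative threshold
    static final int MAX_DISTINCT_FOR_DICT = 1;   // toy heuristic, not parquet-mr's

    private final List<Integer> buffered = new ArrayList<>();
    private Encoding encoding = Encoding.UNDECIDED;

    void write(Integer value) {
        if (value != null) {
            buffered.add(value); // nulls contribute no value entries, so skip them
        }
        if (encoding == Encoding.UNDECIDED && buffered.size() >= MIN_NON_NULLS_TO_DECIDE) {
            long distinct = buffered.stream().distinct().count();
            encoding = distinct <= MAX_DISTINCT_FOR_DICT ? Encoding.DICTIONARY : Encoding.PLAIN;
        }
    }

    Encoding encoding() {
        return encoding;
    }

    public static void main(String[] args) {
        DeferredEncodingWriter writer = new DeferredEncodingWriter();
        for (int i = 0; i < 1000; i++) {
            writer.write(null); // a long all-null prefix leaves the choice open
        }
        System.out.println(writer.encoding()); // still UNDECIDED
        writer.write(5);
        writer.write(5);
        System.out.println(writer.encoding()); // DICTIONARY
    }
}
```

In this model an arbitrarily long run of leading nulls never triggers a fallback, which is the behavior Patrick's scenario would benefit from; the cost, as Gang notes, is holding buffered data until the decision can be made.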
[GitHub] [parquet-mr] wgtmac commented on pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …
wgtmac commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450359195

@gszadovszky @shangxinli Do you have any concerns?
[GitHub] [parquet-mr] wgtmac commented on pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite
wgtmac commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450365870

> (Congrats for the committership! From now on I won't push your PRs. 😉 )

Thank you for your help all the time! @gszadovszky
[GitHub] [parquet-mr] wgtmac merged pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite
wgtmac merged PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036
[GitHub] [parquet-mr] gszadovszky commented on pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …
gszadovszky commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450409997

Since these are already used in Iceberg, I think it is better to make them public and maintain backward compatibility.
[GitHub] [parquet-format] wgtmac commented on a diff in pull request #190: Minor: add FIXED_LEN_BYTE_ARRAY under Types in doc
wgtmac commented on code in PR #190:
URL: https://github.com/apache/parquet-format/pull/190#discussion_r1122508344

## README.md:
@@ -132,6 +132,7 @@ readers and writers for the format. The types are:
 - FLOAT: IEEE 32-bit floating point values
 - DOUBLE: IEEE 64-bit floating point values
 - BYTE_ARRAY: arbitrarily long byte arrays.
+ - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays.

Review Comment:
Thanks for adding this! It is odd that only `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` end with a period; better to remove the periods for consistency. Should we also mark INT96 as deprecated? cc @gszadovszky @shangxinli
[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362

## .github/workflows/vector-plugins.yml:
@@ -0,0 +1,56 @@

+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        java: [ '17' ]
+        codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+    name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+    steps:
+      - uses: actions/checkout@master
+      - name: Set up JDK ${{ matrix.java }}
+        uses: actions/setup-java@v1
+        with:
+          java-version: ${{ matrix.java }}
+      - name: before_install
+        env:
+          CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+        run: |
+          bash dev/ci-before_install.sh
+      - name: install
+        run: |
+          EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout)
+          export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+          mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl -parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
Because these modules (parquet-hadoop, parquet-arrow, ...) are already executed in the Test workflow, I think vector-plugins should execute only the modules associated with the vector plugins and not repeat the part already covered by the Test workflow.
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695405#comment-17695405 ]

ASF GitHub Bot commented on PARQUET-2159:

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Fang-Xie
> Assignee: Fang-Xie
> Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png,
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png,
> image-2022-06-15-22-58-40-704.png
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with
> jdk.incubator.vector in OpenJDK 18 brings a prominent performance
> improvement.
> Because the Vector API was added to OpenJDK in version 16, this
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-unpack
> decoding function *_public final void unpack8Values(final byte[] in,
> final int inPos, final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out,
> final int outPos)_*.
> We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized
> vectorized SIMD implementation) of decode functions with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and
> tested the Parquet batch-reader ability from Spark's
> VectorizedParquetRecordReader, which reads Parquet column data in
> batches. We constructed Parquet files with different row and column
> counts; the column data type is Int32, the maximum int value is 127
> (which satisfies bit-pack encoding with bit width=7), the row count
> ranges from 10k to 100 million, and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362 ## .github/workflows/vector-plugins.yml: ## @@ -0,0 +1,56 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +name: Vector-plugins + +on: [push, pull_request] + +jobs: + build: + +runs-on: ubuntu-latest +strategy: + fail-fast: false + matrix: +java: [ '17' ] +codes: [ 'uncompressed,brotli', 'gzip,snappy' ] +name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }} + +steps: + - uses: actions/checkout@master + - name: Set up JDK ${{ matrix.java }} +uses: actions/setup-java@v1 +with: + java-version: ${{ matrix.java }} + - name: before_install +env: + CI_TARGET_BRANCH: $GITHUB_HEAD_REF +run: | + bash dev/ci-before_install.sh + - name: install +run: | + EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout) + export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS" + mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift Review Comment: because these modules have been execute in the Test workflow. I think vector-plugins should execute only the modules associated with vector. vector-plugins should not execute repeated part with Test workflow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695406#comment-17695406 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362 ## .github/workflows/vector-plugins.yml: ## @@ -0,0 +1,56 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +name: Vector-plugins + +on: [push, pull_request] + +jobs: + build: + +runs-on: ubuntu-latest +strategy: + fail-fast: false + matrix: +java: [ '17' ] +codes: [ 'uncompressed,brotli', 'gzip,snappy' ] +name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }} + +steps: + - uses: actions/checkout@master + - name: Set up JDK ${{ matrix.java }} +uses: actions/setup-java@v1 +with: + java-version: ${{ matrix.java }} + - name: before_install +env: + CI_TARGET_BRANCH: $GITHUB_HEAD_REF +run: | + bash dev/ci-before_install.sh + - name: install +run: | + EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout) + export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS" + mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl -parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift Review Comment: because these modules have been execute in the Test workflow. I think vector-plugins should execute only the modules associated with vector. vector-plugins should not execute repeated part with Test workflow. > Parquet bit-packing de/encode optimization > -- > > Key: PARQUET-2159 > URL: https://issues.apache.org/jira/browse/PARQUET-2159 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Fang-Xie >Assignee: Fang-Xie >Priority: Major > Fix For: 1.13.0 > > Attachments: image-2022-06-15-22-56-08-396.png, > image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, > image-2022-06-15-22-58-40-704.png > > > Current Spark use Parquet-mr as parquet reader/writer library, but the > built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector > in Open JDK18 brings prominent performance improvement. > Due to Vector API is added to OpenJDK since 16, So this optimization request > JDK16 or higher. > *Below are our test results* > Functional test is based on open-source parquet-mr Bit-pack decoding > function: *_public final void unpack8Values(final byte[] in, final int inPos, > final int[] out, final int outPos)_* __ > compared with our implementation with vector API *_public final void > unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final > int outPos)_* > We tested 10 pairs (open source parquet bit unpacking vs ours optimized > vectorized SIMD implementation) decode function with bit > width=\{1,2,3,4,5,6,7,8,9,10}, below are test results: > !image-2022-06-15-22-56-08-396.png|width=437,height=223! > We integrated our bit-packing decode implementation into parquet-mr, tested > the parquet batch reader ability from Spark VectorizedParquetRecordReader > which get parquet column data by the batch way. We construct parquet file > with different row count and column count, the column data type is Int32, the > maximum int value is 127 which satisfies bit pack encode with bit width=7, > the count of the row is from 10k to 100 million and the count of the column > is from 1 to 4. > !image-2022-06-15-22-57-15-964.png|width=453,height=229! > !image-2022-06-15-22-58-01-442.png|width=439,height=217! > !image-2022-06-15-22-58-40-704.
[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362

## .github/workflows/vector-plugins.yml: ##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        java: [ '17' ]
+        codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+    name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+    steps:
+      - uses: actions/checkout@master
+      - name: Set up JDK ${{ matrix.java }}
+        uses: actions/setup-java@v1
+        with:
+          java-version: ${{ matrix.java }}
+      - name: before_install
+        env:
+          CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+        run: |
+          bash dev/ci-before_install.sh
+      - name: install
+        run: |
+          EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout)
+          export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+          mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl -parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment: Because these modules (parquet-hadoop, parquet-arrow, ...) have already been executed in the Test workflow, I think vector-plugins should run only the modules associated with the vector work; it should not repeat the part already covered by the Test workflow.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …
wgtmac commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1122511179

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException {
     }
     RowRanges rowRanges = getRowRanges(blockIndex);
+    return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns, row
+   * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column, or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException {
+    if (blockIndex < 0 || blockIndex >= blocks.size()) {
+      throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range is: " +
+          "[%s, %s]", blockIndex, 0, blocks.size() - 1));
+    }
+
+    if (Objects.isNull(rowRanges)) {
+      throw new IllegalArgumentException("RowRanges must not be null");
+    }
+
+    BlockMetaData block = blocks.get(blockIndex);
+    if (block.getRowCount() == 0L) {
+      throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment: The reader now simply skips empty row groups instead of throwing. Could you change this to be consistent?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
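To make the contract of the proposed overload concrete, here is a self-contained sketch of the same argument-validation pattern (bounds-check the block index, then reject a null ranges argument). `Block` and `RangeStub` are hypothetical stand-ins for parquet-mr's `BlockMetaData` and `RowRanges`; this is illustrative only, not the PR's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class Main {
    // Stand-in types: Block mimics BlockMetaData, RangeStub mimics RowRanges.
    record Block(long rowCount) {}
    record RangeStub(long from, long to) {}

    static Block validate(List<Block> blocks, int blockIndex, RangeStub rowRanges) {
        // Reject out-of-range block indexes with an informative message.
        if (blockIndex < 0 || blockIndex >= blocks.size()) {
            throw new IllegalArgumentException(String.format(
                "Invalid block index %s, the valid block index range is: [%s, %s]",
                blockIndex, 0, blocks.size() - 1));
        }
        // A null ranges argument is a caller error, not an empty selection.
        if (Objects.isNull(rowRanges)) {
            throw new IllegalArgumentException("RowRanges must not be null");
        }
        return blocks.get(blockIndex);
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(new Block(10L), new Block(0L));
        // Valid index: the requested block is returned.
        if (validate(blocks, 0, new RangeStub(0, 9)).rowCount() != 10L) {
            throw new AssertionError();
        }
        // Invalid index: IllegalArgumentException, as in the quoted diff.
        boolean threw = false;
        try {
            validate(blocks, 5, new RangeStub(0, 9));
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError();
        System.out.println("validation checks passed");
    }
}
```

Validating eagerly at the public entry point keeps the failure close to the caller's mistake, which matters once external projects such as Iceberg start passing in their own `RowRanges`.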
[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695407#comment-17695407 ]

ASF GitHub Bot commented on PARQUET-2252:
-----------------------------------------

wgtmac commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1122511179

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException {
     }
     RowRanges rowRanges = getRowRanges(blockIndex);
+    return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns, row
+   * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column, or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException {
+    if (blockIndex < 0 || blockIndex >= blocks.size()) {
+      throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range is: " +
+          "[%s, %s]", blockIndex, 0, blocks.size() - 1));
+    }
+
+    if (Objects.isNull(rowRanges)) {
+      throw new IllegalArgumentException("RowRanges must not be null");
+    }
+
+    BlockMetaData block = blocks.get(blockIndex);
+    if (block.getRowCount() == 0L) {
+      throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment: The reader now simply skips empty row groups instead of throwing. Could you change this to be consistent?

> Make some methods public to allow external projects to implement page skipping
> ------------------------------------------------------------------------------
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
> Issue Type: New Feature
> Reporter: Yujiang Zhong
> Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own expressions. To do so, we would like to use some of the methods in the Parquet repo, for example those in `RowRanges` and `IndexIterator`; however, these are currently not public, so we can only rely on reflection to use them.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695408#comment-17695408 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362

## .github/workflows/vector-plugins.yml: ##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        java: [ '17' ]
+        codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+    name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+    steps:
+      - uses: actions/checkout@master
+      - name: Set up JDK ${{ matrix.java }}
+        uses: actions/setup-java@v1
+        with:
+          java-version: ${{ matrix.java }}
+      - name: before_install
+        env:
+          CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+        run: |
+          bash dev/ci-before_install.sh
+      - name: install
+        run: |
+          EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout)
+          export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+          mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl -parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment: Because these modules (parquet-hadoop, parquet-arrow, ...) have already been executed in the Test workflow, I think vector-plugins should run only the modules associated with the vector work; it should not repeat the part already covered by the Test workflow.

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Fang-Xie
> Assignee: Fang-Xie
> Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, image-2022-06-15-22-58-40-704.png
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the built-in bit-packing en/decoding is not efficient enough. Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector in OpenJDK 18 brings a prominent performance improvement. Because the Vector API has been part of OpenJDK since version 16, this optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding function *_public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos)_*, compared with our Vector-API implementation *_public final void unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. our optimized vectorized SIMD implementation) with bit width = {1,2,3,4,5,6,7,8,9,10}; the results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested the Parquet batch-reader capability from Spark's VectorizedParquetRecordReader, which reads Parquet column data in batches. We constructed Parquet files with different row and column counts; the column data type is Int32 and the maximum int value is 127, which satisfies bit-pack encoding with bit width = 7. The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2251) Avoid generating Bloomfilter when all pages of a column are encoded by dictionary
[ https://issues.apache.org/jira/browse/PARQUET-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mars resolved PARQUET-2251.
---------------------------
Resolution: Fixed

> Avoid generating Bloomfilter when all pages of a column are encoded by dictionary
> ---------------------------------------------------------------------------------
>
> Key: PARQUET-2251
> URL: https://issues.apache.org/jira/browse/PARQUET-2251
> Project: Parquet
> Issue Type: Bug
> Reporter: Mars
> Priority: Major
>
> In Parquet page V1, a Bloom filter is still generated even when all pages of a column are dictionary-encoded. Generating the Bloom filter in that case is unnecessary, and it costs time and occupies storage. Parquet page V2 does not generate a Bloom filter when all pages of a column are dictionary-encoded.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
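The rule the issue describes can be sketched as writer-side state per column chunk: emit the Bloom filter only if at least one page fell back from dictionary encoding. The types and method names below are hypothetical illustrations, not parquet-mr's writer API.

```java
import java.util.HashSet;
import java.util.Set;

public class Main {
    // Hypothetical per-column-chunk state tracking whether any page
    // abandoned dictionary encoding (e.g. because the dictionary overflowed).
    static final class ColumnChunkState {
        final Set<Long> bloomCandidates = new HashSet<>(); // stand-in for a real Bloom filter
        boolean hasNonDictionaryPage = false;

        void onPageWritten(boolean dictionaryEncoded, long[] values) {
            if (!dictionaryEncoded) {
                hasNonDictionaryPage = true;
            }
            for (long v : values) {
                bloomCandidates.add(v);
            }
        }

        // If every page is dictionary-encoded, the dictionary already enumerates
        // all distinct values, so a Bloom filter adds no pruning power.
        boolean shouldWriteBloomFilter() {
            return hasNonDictionaryPage;
        }
    }

    public static void main(String[] args) {
        ColumnChunkState allDict = new ColumnChunkState();
        allDict.onPageWritten(true, new long[] {1, 2, 3});
        allDict.onPageWritten(true, new long[] {2, 3, 4});
        if (allDict.shouldWriteBloomFilter()) throw new AssertionError();

        ColumnChunkState fellBack = new ColumnChunkState();
        fellBack.onPageWritten(true, new long[] {1, 2});
        fellBack.onPageWritten(false, new long[] {5, 6}); // fell back to plain encoding
        if (!fellBack.shouldWriteBloomFilter()) throw new AssertionError();

        System.out.println("bloom-filter decision checks passed");
    }
}
```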
[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122538089

## .github/workflows/vector-plugins.yml: ##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        java: [ '17' ]
+        codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+    name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+    steps:
+      - uses: actions/checkout@master
+      - name: Set up JDK ${{ matrix.java }}
+        uses: actions/setup-java@v1
+        with:
+          java-version: ${{ matrix.java }}
+      - name: before_install
+        env:
+          CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+        run: |
+          bash dev/ci-before_install.sh
+      - name: install
+        run: |
+          EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout)
+          export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+          mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl -parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment: @wgtmac I updated the vector-plugins workflow; it now specifies only the modules that need to be executed.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695418#comment-17695418 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122538089

## .github/workflows/vector-plugins.yml: ##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        java: [ '17' ]
+        codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+    name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+    steps:
+      - uses: actions/checkout@master
+      - name: Set up JDK ${{ matrix.java }}
+        uses: actions/setup-java@v1
+        with:
+          java-version: ${{ matrix.java }}
+      - name: before_install
+        env:
+          CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+        run: |
+          bash dev/ci-before_install.sh
+      - name: install
+        run: |
+          EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate -Dexpression=extraJavaTestArgs -q -DforceStdout)
+          export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+          mvn install --batch-mode -Pvector-plugins -DskipTests=true -Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} -pl -parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment: @wgtmac I updated the vector-plugins workflow; it now specifies only the modules that need to be executed.

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Fang-Xie
> Assignee: Fang-Xie
> Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, image-2022-06-15-22-58-40-704.png
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the built-in bit-packing en/decoding is not efficient enough. Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector in OpenJDK 18 brings a prominent performance improvement. Because the Vector API has been part of OpenJDK since version 16, this optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding function *_public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos)_*, compared with our Vector-API implementation *_public final void unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. our optimized vectorized SIMD implementation) with bit width = {1,2,3,4,5,6,7,8,9,10}; the results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested the Parquet batch-reader capability from Spark's VectorizedParquetRecordReader, which reads Parquet column data in batches. We constructed Parquet files with different row and column counts; the column data type is Int32 and the maximum int value is 127, which satisfies bit-pack encoding with bit width = 7. The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
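The `unpack8Values` shape the issue benchmarks can be sketched in plain scalar Java. The round trip below is a minimal LSB-first bit-packing illustration with the same signature shape; it is not parquet-mr's generated unpacker, nor the vectorized jdk.incubator.vector implementation being proposed.

```java
public class Main {
    // Pack 8 ints of `bitWidth` bits each into out[outPos..], LSB-first.
    static void pack8Values(int[] in, int inPos, byte[] out, int outPos, int bitWidth) {
        long buffer = 0;
        int bits = 0;
        int o = outPos;
        long mask = (1L << bitWidth) - 1;
        for (int i = 0; i < 8; i++) {
            buffer |= (in[inPos + i] & mask) << bits;
            bits += bitWidth;
            while (bits >= 8) {        // flush full bytes as they accumulate
                out[o++] = (byte) buffer;
                buffer >>>= 8;
                bits -= 8;
            }
        }
        if (bits > 0) out[o] = (byte) buffer;
    }

    // Inverse: read 8 ints of `bitWidth` bits each from in[inPos..].
    static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
        long buffer = 0;
        int bits = 0;
        int idx = inPos;
        long mask = (1L << bitWidth) - 1;
        for (int i = 0; i < 8; i++) {
            while (bits < bitWidth) {  // refill until one value is available
                buffer |= (long) (in[idx++] & 0xFF) << bits;
                bits += 8;
            }
            out[outPos + i] = (int) (buffer & mask);
            buffer >>>= bitWidth;
            bits -= bitWidth;
        }
    }

    public static void main(String[] args) {
        int bitWidth = 7;                    // max value 127, as in the JIRA test setup
        int[] values = {0, 1, 42, 63, 64, 100, 126, 127};
        byte[] packed = new byte[bitWidth];  // 8 values * 7 bits = 7 bytes
        pack8Values(values, 0, packed, 0, bitWidth);
        int[] roundTrip = new int[8];
        unpack8Values(packed, 0, roundTrip, 0, bitWidth);
        if (!java.util.Arrays.equals(values, roundTrip)) throw new AssertionError();
        System.out.println(java.util.Arrays.toString(roundTrip));
    }
}
```

The inner refill/flush loops are exactly the data-dependent scalar work that a SIMD implementation replaces with fixed shuffle-and-mask patterns per bit width, which is where the reported speedup comes from.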
[GitHub] [parquet-mr] wgtmac commented on pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI
wgtmac commented on PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1451385421 I'd request sign off from @gszadovszky @shangxinli -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695499#comment-17695499 ]

ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

wgtmac commented on PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1451385421

I'd request sign-off from @gszadovszky @shangxinli

> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.13.0
> Reporter: Fang-Xie
> Assignee: Fang-Xie
> Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, image-2022-06-15-22-58-40-704.png
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but the built-in bit-packing en/decoding is not efficient enough. Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector in OpenJDK 18 brings a prominent performance improvement. Because the Vector API has been part of OpenJDK since version 16, this optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding function *_public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos)_*, compared with our Vector-API implementation *_public final void unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. our optimized vectorized SIMD implementation) with bit width = {1,2,3,4,5,6,7,8,9,10}; the results are below:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested the Parquet batch-reader capability from Spark's VectorizedParquetRecordReader, which reads Parquet column data in batches. We constructed Parquet files with different row and column counts; the column data type is Int32 and the maximum int value is 127, which satisfies bit-pack encoding with bit width = 7. The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

-- This message was sent by Atlassian Jira (v8.20.10#820010)