Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-01 Thread Patrick Hansert

Hi Gang,

thanks for your reply.

On 01.03.23 03:09, Gang Wu wrote:

> If at least one record in the beginning 2 rows is not null, then the
> encoded size will be much better.

That is the workaround I have been using for the past weeks, although my
tests show that at least two values are required.



> 3. If dictionary encoding is in effect, the first page must be a dictionary
> page followed by a set of data pages that are only indices of the dictionary.
> [...]
> 5. By default, the parquet-mr implementation has to decide the encoding of a
> page when it reaches 2 records.


I agree that this is at the core of the problem; the question is, can 
this be changed to allow for better encoding decisions in the scenario I 
described? An all-null page contains just definition and (possibly) 
repetition levels, no value entries, so there is no need to choose their 
encoding yet. What are the reasons for forcing the dictionary to be the 
first page?
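
For illustration (a minimal sketch, assuming an optional, non-repeated INT32
column whose maximum definition level is 1), a data page holding only the rows
[null, null, null] would contain roughly:

    definition levels: 0 0 0    (0 = null, 1 = value present)
    repetition levels: (none; the column is not repeated)
    values:            (none)

so nothing in such a page forces a choice between dictionary and plain
encoding for the value stream yet.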


Kind Regards

Patrick



[jira] [Created] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread Yujiang Zhong (Jira)
Yujiang Zhong created PARQUET-2252:
--

 Summary: Make some methods public to allow external projects to 
implement page skipping
 Key: PARQUET-2252
 URL: https://issues.apache.org/jira/browse/PARQUET-2252
 Project: Parquet
  Issue Type: New Feature
Reporter: Yujiang Zhong


Iceberg hopes to implement the column index filter based on Iceberg's own
expressions, so we would like to be able to use some of the methods in the Parquet
repo, for example the methods in `RowRanges` and `IndexIterator`; however, these
are currently not public, and we can only rely on reflection to use them.
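
For context, the reflection workaround looks roughly like the sketch below
(generic code; the target method name is hypothetical, only the pattern of
calling a non-public Parquet method via reflection is the point):

    import java.lang.reflect.Method;

    // Sketch of the reflection workaround: call a non-public method on a
    // Parquet object because no public API is exposed for it.
    final class ReflectionWorkaround {
      static Object invokeHidden(Object target, String methodName,
                                 Class<?>[] signature, Object... args)
          throws ReflectiveOperationException {
        Method m = target.getClass().getDeclaredMethod(methodName, signature);
        m.setAccessible(true);  // needed because the method is not public
        return m.invoke(target, args);
      }
    }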





[GitHub] [parquet-mr] zhongyujiang opened a new pull request, #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-01 Thread via GitHub


zhongyujiang opened a new pull request, #1038:
URL: https://github.com/apache/parquet-mr/pull/1038

   …implement page skipping.
   
   Issue: [PARQUET-2252](https://issues.apache.org/jira/browse/PARQUET-2252)
   
   This PR makes public some methods required to implement the column index filter,
so that Iceberg can build its own column index filtering. Since Iceberg is going
to calculate `RowRanges` itself, this also adds a public method to
`ParquetFileReader` that allows users to pass in `RowRanges` to read a filtered
row group. For an example of how these changes are used, see this
[PR](https://github.com/apache/iceberg/pull/6967), which currently uses reflection
as a workaround.
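
   Roughly, the intended call pattern from an external engine would be something
like the sketch below (assumptions: `rowRangesForBlock` stands in for the
engine's own column-index evaluation and is not a parquet-mr method; the
`readFilteredRowGroup(int, RowRanges)` overload is the one added by this PR):

    import java.io.IOException;
    import java.util.function.IntFunction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.page.PageReadStore;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.internal.filter2.columnindex.RowRanges;

    // Sketch only: read each row group, skipping pages according to RowRanges
    // that were computed outside parquet-mr (e.g. by Iceberg's own filter).
    class FilteredReadSketch {
      static void readFiltered(Path file, Configuration conf,
                               IntFunction<RowRanges> rowRangesForBlock) throws IOException {
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
          for (int i = 0; i < reader.getRowGroups().size(); i++) {
            RowRanges ranges = rowRangesForBlock.apply(i);      // computed externally
            PageReadStore pages = reader.readFilteredRowGroup(i, ranges);
            // ... hand `pages` to the engine's record or vector reader
          }
        }
      }
    }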





[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695022#comment-17695022
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

zhongyujiang opened a new pull request, #1038:
URL: https://github.com/apache/parquet-mr/pull/1038

   …implement page skipping.
   
   Issue: [PARQUET-2252](https://issues.apache.org/jira/browse/PARQUET-2252)
   
   This PR makes public some methods required to implement the column index filter,
so that Iceberg can build its own column index filtering. Since Iceberg is going
to calculate `RowRanges` itself, this also adds a public method to
`ParquetFileReader` that allows users to pass in `RowRanges` to read a filtered
row group. For an example of how these changes are used, see this
[PR](https://github.com/apache/iceberg/pull/6967), which currently uses reflection
as a workaround.




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own
> expressions, so we would like to be able to use some of the methods in the Parquet
> repo, for example the methods in `RowRanges` and `IndexIterator`; however, these
> are currently not public, and we can only rely on reflection to use them.





[GitHub] [parquet-mr] zhongyujiang commented on pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-01 Thread via GitHub


zhongyujiang commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1449991193

   @wgtmac @rdblue Can you please help review this?





[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695023#comment-17695023
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

zhongyujiang commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1449991193

   @wgtmac @rdblue Can you please help review this?




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own
> expressions, so we would like to be able to use some of the methods in the Parquet
> repo, for example the methods in `RowRanges` and `IndexIterator`; however, these
> are currently not public, and we can only rely on reflection to use them.





[GitHub] [parquet-mr] gszadovszky commented on pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite

2023-03-01 Thread via GitHub


gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450139433

   Thanks a lot, @wgtmac. It looks good to me.





[jira] [Commented] (PARQUET-2230) Add a new rewrite command powered by ParquetRewriter

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695070#comment-17695070
 ] 

ASF GitHub Bot commented on PARQUET-2230:
-

gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450139433

   Thanks a lot, @wgtmac. It looks good to me.




> Add a new rewrite command powered by ParquetRewriter
> 
>
> Key: PARQUET-2230
> URL: https://issues.apache.org/jira/browse/PARQUET-2230
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has several commands for rewriting files but missing a 
> consolidated one to provide the full features of ParquetRewriter.





[GitHub] [parquet-mr] gszadovszky commented on pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite

2023-03-01 Thread via GitHub


gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450142055

   (Congrats for the committership! From now on I won't push your PRs. :wink: )





[jira] [Commented] (PARQUET-2230) Add a new rewrite command powered by ParquetRewriter

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695073#comment-17695073
 ] 

ASF GitHub Bot commented on PARQUET-2230:
-

gszadovszky commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450142055

   (Congrats for the committership! From now on I won't push your PRs. :wink: )




> Add a new rewrite command powered by ParquetRewriter
> 
>
> Key: PARQUET-2230
> URL: https://issues.apache.org/jira/browse/PARQUET-2230
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has several commands for rewriting files but missing a 
> consolidated one to provide the full features of ParquetRewriter.





Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-01 Thread Gang Wu
>  What are the reasons for forcing the dictionary to be the first page?

This is by design. I guess it benefits sequential scans, where the dictionary
page is read first, followed by its encoded indices in the data
pages. Otherwise we would need to seek anyway.

> can this be changed to allow for better encoding decisions in the
> scenario I described?

I think that is possible. But it requires a code change to buffer the input
data and postpone the encoding decision until the writer has sufficient
knowledge of the data.
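
As a very rough sketch of that idea (hypothetical types and names, nothing
like the actual parquet-mr writer classes):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Objects;

    // Sketch only: buffer incoming values and defer the dictionary-vs-plain
    // decision until enough non-null values have been seen, instead of
    // deciding after the first two records.
    class DeferredEncodingBuffer {
      enum Encoding { UNDECIDED, DICTIONARY, PLAIN }

      private final List<Integer> buffered = new ArrayList<>();
      private Encoding encoding = Encoding.UNDECIDED;

      void add(Integer value) {  // value may be null
        buffered.add(value);
        if (encoding == Encoding.UNDECIDED && nonNullCount() >= 2) {
          // Only now commit to an encoding; an all-null prefix never forces
          // a premature (and possibly bad) decision.
          encoding = looksDictionaryFriendly() ? Encoding.DICTIONARY : Encoding.PLAIN;
        }
      }

      Encoding encoding() { return encoding; }

      private long nonNullCount() {
        return buffered.stream().filter(Objects::nonNull).count();
      }

      private boolean looksDictionaryFriendly() {
        long distinct = buffered.stream().filter(Objects::nonNull).distinct().count();
        return distinct * 2 <= nonNullCount();  // crude heuristic, for the sketch only
      }
    }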

Best,
Gang




On Wed, Mar 1, 2023 at 6:33 PM Patrick Hansert 
wrote:

> Hi Gang,
>
> thanks for your reply.
>
> On 01.03.23 03:09, Gang Wu wrote:
> > If at least one record in the beginning 2 rows is not null, then the
> encoded size will be much better.
> That is the workaround I have been using for the past weeks, although my
> tests show that at least two values are required.
>
> > 3. If dictionary encoding is in effect, the first page must be a
> dictionary page followed by a set of data pages that are only indices of
> the dictionary.
> > [...]
> > 5. By default, the parquet-mr implementation has to decide the encoding
> of a page when it reaches 2 records.
>
> I agree that this is at the core of the problem; the question is, can
> this be changed to allow for better encoding decisions in the scenario I
> described? An all-null page contains just definition and (possibly)
> repetition levels, no value entries, so there is no need to choose their
> encoding yet. What are the reasons for forcing the dictionary to be the
> first page?
>
> Kind Regards
>
> Patrick
>
>


[GitHub] [parquet-mr] wgtmac commented on pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-01 Thread via GitHub


wgtmac commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450359195

   @gszadovszky @shangxinli Do you have any concern?





[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695166#comment-17695166
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

wgtmac commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450359195

   @gszadovszky @shangxinli Do you have any concern?




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own
> expressions, so we would like to be able to use some of the methods in the Parquet
> repo, for example the methods in `RowRanges` and `IndexIterator`; however, these
> are currently not public, and we can only rely on reflection to use them.





[GitHub] [parquet-mr] wgtmac commented on pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite

2023-03-01 Thread via GitHub


wgtmac commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450365870

   > (Congrats for the committership! From now on I won't push your PRs. 😉 )
   
   Thank you for your help all the time! @gszadovszky 





[jira] [Commented] (PARQUET-2230) Add a new rewrite command powered by ParquetRewriter

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695170#comment-17695170
 ] 

ASF GitHub Bot commented on PARQUET-2230:
-

wgtmac commented on PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036#issuecomment-1450365870

   > (Congrats for the committership! From now on I won't push your PRs. 😉 )
   
   Thank you for your help all the time! @gszadovszky 




> Add a new rewrite command powered by ParquetRewriter
> 
>
> Key: PARQUET-2230
> URL: https://issues.apache.org/jira/browse/PARQUET-2230
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has several commands for rewriting files but missing a 
> consolidated one to provide the full features of ParquetRewriter.





[GitHub] [parquet-mr] wgtmac merged pull request #1036: PARQUET-2230: [CLI] Deprecate commands replaced by rewrite

2023-03-01 Thread via GitHub


wgtmac merged PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036





[jira] [Commented] (PARQUET-2230) Add a new rewrite command powered by ParquetRewriter

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695171#comment-17695171
 ] 

ASF GitHub Bot commented on PARQUET-2230:
-

wgtmac merged PR #1036:
URL: https://github.com/apache/parquet-mr/pull/1036




> Add a new rewrite command powered by ParquetRewriter
> 
>
> Key: PARQUET-2230
> URL: https://issues.apache.org/jira/browse/PARQUET-2230
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has several commands for rewriting files but missing a 
> consolidated one to provide the full features of ParquetRewriter.





[GitHub] [parquet-mr] gszadovszky commented on pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-01 Thread via GitHub


gszadovszky commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450409997

   Since these are already used in Iceberg, I think it is better to make them
public and maintain backward compatibility.





[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695180#comment-17695180
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

gszadovszky commented on PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#issuecomment-1450409997

   Since these are already used in Iceberg, I think it is better to make them
public and maintain backward compatibility.




> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own
> expressions, so we would like to be able to use some of the methods in the Parquet
> repo, for example the methods in `RowRanges` and `IndexIterator`; however, these
> are currently not public, and we can only rely on reflection to use them.





[GitHub] [parquet-format] wgtmac commented on a diff in pull request #190: Minor: add FIXED_LEN_BYTE_ARRAY under Types in doc

2023-03-01 Thread via GitHub


wgtmac commented on code in PR #190:
URL: https://github.com/apache/parquet-format/pull/190#discussion_r1122508344


##
README.md:
##
@@ -132,6 +132,7 @@ readers and writers for the format.  The types are:
   - FLOAT: IEEE 32-bit floating point values
   - DOUBLE: IEEE 64-bit floating point values
   - BYTE_ARRAY: arbitrarily long byte arrays.
+  - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays.

Review Comment:
   Thanks for adding this!
   
   It is weird that only `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` end with a
period. Better to remove the periods for consistency.
   
   Should we also mark INT96 as deprecated? cc @gszadovszky @shangxinli 
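
   For reference, a FIXED_LEN_BYTE_ARRAY field can be declared with parquet-mr's
schema builder roughly as follows (a small sketch; the field names are made up):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    // Sketch: a schema with a 16-byte fixed-length field (e.g. for a UUID)
    // next to a variable-length BYTE_ARRAY field.
    class SchemaSketch {
      static final MessageType SCHEMA = Types.buildMessage()
          .required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY).length(16).named("uuid")
          .required(PrimitiveTypeName.BYTE_ARRAY).named("payload")
          .named("example");
    }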






[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-03-01 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Because these modules have already been run in the Test workflow, I think
vector-plugins should run only the modules associated with the vector code.






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695405#comment-17695405
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Because these modules have already been run in the Test workflow, I think
vector-plugins should run only the modules associated with the vector code.





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Currently Spark uses parquet-mr as its Parquet reader/writer library, but the
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding
> function *_public final void unpack8Values(final byte[] in, final int inPos,
> final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs.
> our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested
> the Parquet batch reader via Spark's VectorizedParquetRecordReader,
> which fetches Parquet column data in batches. We constructed Parquet files
> with different row and column counts; the column data type is Int32 and the
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7.
> The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
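
As a rough illustration of the technique (not the actual patch; a hand-written
example for bit width 4 only, assuming Parquet's LSB-first packing of two 4-bit
values per byte; requires --add-modules jdk.incubator.vector on JDK 16+):

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    // Illustrative sketch: unpack 8 values of bit width 4 into ints using the
    // Vector API. Each packed byte feeds two lanes; a per-lane shift selects
    // the low or high nibble and a mask keeps 4 bits.
    class Unpack4BitSketch {
      private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes
      private static final int[] SHIFTS = {0, 4, 0, 4, 0, 4, 0, 4};

      static void unpack8Values(byte[] in, int inPos, int[] out, int outPos) {
        int[] expanded = new int[8];
        for (int i = 0; i < 8; i++) {
          expanded[i] = in[inPos + (i >> 1)] & 0xFF;  // byte j feeds lanes 2j and 2j+1
        }
        IntVector bytes = IntVector.fromArray(SPECIES, expanded, 0);
        IntVector shifts = IntVector.fromArray(SPECIES, SHIFTS, 0);
        bytes.lanewise(VectorOperators.LSHR, shifts)  // select low or high nibble per lane
             .lanewise(VectorOperators.AND, 0xF)      // keep 4 bits per lane
             .intoArray(out, outPos);
      }
    }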




[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-03-01 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Because these modules have already been executed in the Test workflow, I think
vector-plugins should execute only the modules associated with the vector code;
vector-plugins should not repeat the parts already covered by the Test workflow.






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695406#comment-17695406
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Because these modules have already been executed in the Test workflow, I think
vector-plugins should execute only the modules associated with the vector code;
vector-plugins should not repeat the parts already covered by the Test workflow.





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Currently Spark uses parquet-mr as its Parquet reader/writer library, but the
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding
> function *_public final void unpack8Values(final byte[] in, final int inPos,
> final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs.
> our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested
> the Parquet batch reader via Spark's VectorizedParquetRecordReader,
> which fetches Parquet column data in batches. We constructed Parquet files
> with different row and column counts; the column data type is Int32 and the
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7.
> The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-03-01 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Because these modules (parquet-hadoop, parquet-arrow, ...) have already been executed
in the Test workflow, I think vector-plugins should execute only the modules
associated with the vector code; vector-plugins should not repeat the parts already
covered by the Test workflow.






[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-01 Thread via GitHub


wgtmac commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1122511179


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   Now the reader simply skips empty row groups instead of throwing. Could you
change this to be consistent?
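
   For consistency with the javadoc in this hunk ("or null if there are no rows
in this block"), the empty-block branch could presumably just return instead of
throwing, e.g. (a fragment of the method above, not a tested patch):

    BlockMetaData block = blocks.get(blockIndex);
    if (block.getRowCount() == 0L) {
      // no rows in this block: nothing to read, mirror the filtered-read path
      return null;
    }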






[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695407#comment-17695407
 ] 

ASF GitHub Bot commented on PARQUET-2252:
-

wgtmac commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1122511179


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int 
blockIndex) throws IOException {
 }
 
 RowRanges rowRanges = getRowRanges(blockIndex);
+return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages 
of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or 
the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, 
RowRanges rowRanges) throws IOException {
+if (blockIndex < 0 || blockIndex >= blocks.size()) {
+  throw new IllegalArgumentException(String.format("Invalid block index 
%s, the valid block index range are: " +
+"[%s, %s]", blockIndex, 0, blocks.size() - 1));
+}
+
+if (Objects.isNull(rowRanges)) {
+  throw new IllegalArgumentException("RowRanges must not be null");
+}
+
+BlockMetaData block = blocks.get(blockIndex);
+if (block.getRowCount() == 0L) {
+  throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   Now the reader simply skips empty row groups instead of throwing. Could you
change this to be consistent?





> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own
> expressions, so we would like to be able to use some of the methods in the Parquet
> repo, for example the methods in `RowRanges` and `IndexIterator`; however, these
> are currently not public, and we can only rely on reflection to use them.





[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695408#comment-17695408
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121226362


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Because these modules (parquet-hadoop, parquet-arrow, ...) have already been executed
in the Test workflow, I think vector-plugins should execute only the modules
associated with the vector code; vector-plugins should not repeat the parts already
covered by the Test workflow.





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Currently Spark uses parquet-mr as its Parquet reader/writer library, but the
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding
> function *_public final void unpack8Values(final byte[] in, final int inPos,
> final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs.
> our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested
> the Parquet batch reader via Spark's VectorizedParquetRecordReader,
> which fetches Parquet column data in batches. We constructed Parquet files
> with different row and column counts; the column data type is Int32 and the
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7.
> The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!

[jira] [Resolved] (PARQUET-2251) Avoid generating Bloomfilter when all pages of a column are encoded by dictionary

2023-03-01 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars resolved PARQUET-2251.
---
Resolution: Fixed

> Avoid generating Bloomfilter when all pages of a column are encoded by 
> dictionary
> -
>
> Key: PARQUET-2251
> URL: https://issues.apache.org/jira/browse/PARQUET-2251
> Project: Parquet
>  Issue Type: Bug
>Reporter: Mars
>Priority: Major
>
> With Parquet page V1, even when all pages of a column are dictionary-encoded, a
> BloomFilter is still generated. Generating it is actually unnecessary in that case,
> and it costs time and occupies storage.
> Parquet page V2 doesn't generate a BloomFilter if all pages of a column are
> dictionary-encoded.





[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-01 Thread via GitHub


jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122538089


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   @wgtmac I updated the vector-plugins workflow; it now specifies only the modules
that need to execute.






[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695418#comment-17695418
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

jiangjiguang commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122538089


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   @wgtmac I updated the vector-plugins workflow; it now specifies only the modules
that need to execute.





> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Currently Spark uses parquet-mr as its Parquet reader/writer library, but the
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding
> function *_public final void unpack8Values(final byte[] in, final int inPos,
> final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs.
> our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested
> the Parquet batch reader via Spark's VectorizedParquetRecordReader,
> which fetches Parquet column data in batches. We constructed Parquet files
> with different row and column counts; the column data type is Int32 and the
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7.
> The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!





[GitHub] [parquet-mr] wgtmac commented on pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-01 Thread via GitHub


wgtmac commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1451385421

   I'd request sign off from @gszadovszky @shangxinli 





[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695499#comment-17695499
 ] 

ASF GitHub Bot commented on PARQUET-2159:
-

wgtmac commented on PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1451385421

   I'd request sign off from @gszadovszky @shangxinli 




> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Currently Spark uses parquet-mr as its Parquet reader/writer library, but the
> built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this optimization
> requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding
> function *_public final void unpack8Values(final byte[] in, final int inPos,
> final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final
> int outPos)_*.
> We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs.
> our optimized vectorized SIMD implementation) with bit
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested
> the Parquet batch reader via Spark's VectorizedParquetRecordReader,
> which fetches Parquet column data in batches. We constructed Parquet files
> with different row and column counts; the column data type is Int32 and the
> maximum int value is 127, which satisfies bit-pack encoding with bit width=7.
> The row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!


