[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695963#comment-17695963 ] ASF GitHub Bot commented on PARQUET-2237: - yabola commented on PR #1039: URL: h

[GitHub] [parquet-mr] yabola commented on pull request #1039: PARQUET-2237 Improve performance by skipping BloomFilter when column has a dictionary filter

2023-03-02 Thread via GitHub
yabola commented on PR #1039: URL: https://github.com/apache/parquet-mr/pull/1039#issuecomment-1452828333 @wgtmac @gszadovszky If you have time, please take a look, thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

[jira] [Assigned] (PARQUET-2250) Expose column descriptor through RecordReader

2023-03-02 Thread Weston Pace (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned PARQUET-2250: Assignee: fatemah > Expose column descriptor through RecordReader >

[jira] [Resolved] (PARQUET-2250) Expose column descriptor through RecordReader

2023-03-02 Thread Weston Pace (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved PARQUET-2250. -- Fix Version/s: cpp-11.0.0 Resolution: Fixed Issue resolved by pull request 34318 https

[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695732#comment-17695732 ] ASF GitHub Bot commented on PARQUET-2252: - wgtmac commented on code in PR #1038

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-02 Thread via GitHub
wgtmac commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123155108 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex)

[GitHub] [parquet-format] wgtmac merged pull request #190: MINOR: Add FIXED_LEN_BYTE_ARRAY Type

2023-03-02 Thread via GitHub
wgtmac merged PR #190: URL: https://github.com/apache/parquet-format/pull/190

[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695692#comment-17695692 ] ASF GitHub Bot commented on PARQUET-2252: - zhongyujiang commented on code in PR

[GitHub] [parquet-mr] zhongyujiang commented on a diff in pull request #1038: PARQUET-2252: Make some methods public to allow external projects to …

2023-03-02 Thread via GitHub
zhongyujiang commented on code in PR #1038: URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123028615 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int block

[GitHub] [parquet-format] XinyuZeng commented on a diff in pull request #190: MINOR: Add FIXED_LEN_BYTE_ARRAY Type

2023-03-02 Thread via GitHub
XinyuZeng commented on code in PR #190: URL: https://github.com/apache/parquet-format/pull/190#discussion_r1122985424 ## README.md: ## @@ -132,6 +132,7 @@ readers and writers for the format. The types are: - FLOAT: IEEE 32-bit floating point values - DOUBLE: IEEE 64-bit f

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #190: MINOR: Add FIXED_LEN_BYTE_ARRAY Type

2023-03-02 Thread via GitHub
wgtmac commented on code in PR #190: URL: https://github.com/apache/parquet-format/pull/190#discussion_r1122967418 ## README.md: ## @@ -132,6 +132,7 @@ readers and writers for the format. The types are: - FLOAT: IEEE 32-bit floating point values - DOUBLE: IEEE 64-bit floa

Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-02 Thread Gang Wu
I have filed a JIRA: https://issues.apache.org/jira/browse/PARQUET-2253
Best, Gang
On Thu, Mar 2, 2023 at 5:39 PM Patrick Hansert wrote:
> > This is by design. I guess it benefits sequential scan where the dictionary page is read first and then followed by its encoded indices in the data

[jira] [Created] (PARQUET-2253) Postpone dictionary encoding decision for starting null pages.

2023-03-02 Thread Gang Wu (Jira)
Gang Wu created PARQUET-2253: Summary: Postpone dictionary encoding decision for starting null pages. Key: PARQUET-2253 URL: https://issues.apache.org/jira/browse/PARQUET-2253 Project: Parquet I

[jira] [Commented] (PARQUET-2198) Vulnerabilities in jackson-databind

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695669#comment-17695669 ] ASF GitHub Bot commented on PARQUET-2198: - steveloughran commented on PR #1005:

[GitHub] [parquet-mr] steveloughran commented on pull request #1005: PARQUET-2198 : Updating jackson data bind version to fix CVEs

2023-03-02 Thread via GitHub
steveloughran commented on PR #1005: URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1451725294 +been some issues with transient dependencies from jackson releases in hadoop, hence "HADOOP-18332. Remove rs-api dependency by downgrading jackson to 2.12.7.". jersey 1.0 coexis

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695626#comment-17695626 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-02 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122863674 ## parquet-plugins/parquet-encoding-vector/src/test/java/org/apache/parquet/column/values/bitpacking/TestByteBitPacking512VectorLE.java: ## @@ -0,0 +1,172 @@ +/

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695625#comment-17695625 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-02 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122860932 ## .github/workflows/vector-plugins.yml: ## @@ -0,0 +1,56 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreem

Re: Workaround lack of union data type

2023-03-02 Thread Hinko Kocevar
Hi Gang,
It is related, yes. I opened that issue about ORC the other day. Thank you for looking into it! I'm not that picky about the file format (ORC vs. Parquet); as long as the features are there, I'm OK.
//hinxx
From: Gang Wu Sent: Thursday, Marc

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695611#comment-17695611 ] ASF GitHub Bot commented on PARQUET-2159: - gszadovszky commented on code in PR

[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-02 Thread via GitHub
gszadovszky commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122834543 ## parquet-plugins/parquet-encoding-vector/src/test/java/org/apache/parquet/column/values/bitpacking/TestByteBitPacking512VectorLE.java: ## @@ -0,0 +1,172 @@ +/*

[jira] [Commented] (PARQUET-2198) Vulnerabilities in jackson-databind

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695596#comment-17695596 ] ASF GitHub Bot commented on PARQUET-2198: - botchniaque commented on PR #1005: U

[GitHub] [parquet-mr] botchniaque commented on pull request #1005: PARQUET-2198 : Updating jackson data bind version to fix CVEs

2023-03-02 Thread via GitHub
botchniaque commented on PR #1005: URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1451576518 I don't have any specifics right now, but I guess there were some differences between versions `2.13.x` and `2.14.x` in their Scala version compatibility. This may be the cause why `2.13.x` i

Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-02 Thread Patrick Hansert
> This is by design. I guess it benefits sequential scan where the dictionary page is read first and then followed by its encoded indices in the data pages. Otherwise we need to seek anyway.
Good, then it shouldn't cause problems when putting the dictionary after all-null pages
I think that is
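
The layout discussed in this message, a dictionary page followed by data pages that hold indices into it, can be illustrated with a toy encoder. This is a sketch for intuition only: `dictionary_encode` and `dictionary_decode` are hypothetical helpers, not parquet-mr APIs, and real Parquet additionally encodes the indices with an RLE/bit-packing hybrid and tracks nulls through definition levels.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a list of hashable values.

    The dictionary keeps each distinct value once, in first-seen order;
    indices refer back into it. Hypothetical helper, not a Parquet API.
    """
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values; needs the dictionary up front."""
    return [dictionary[i] for i in indices]


dictionary, indices = dictionary_encode(["a", "b", "a", "c", "b"])
restored = dictionary_decode(dictionary, indices)
```

A reader cannot decode any index until it has the dictionary, which is why reading the dictionary page first suits a sequential scan.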

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695595#comment-17695595 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-02 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122821855 ## parquet-plugins/parquet-encoding-vector/src/test/java/org/apache/parquet/column/values/bitpacking/TestByteBitPacking512VectorLE.java: ## @@ -0,0 +1,172 @@ +/

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695594#comment-17695594 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695591#comment-17695591 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695590#comment-17695590 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR

Re: Workaround lack of union data type

2023-03-02 Thread Gang Wu
Hi Hinko,
Yes, the Parquet spec does not support a union type natively. The workarounds that you have mentioned require additional work on the reader side. I am working on this issue already: https://github.com/apache/arrow/issues/34262 . Not sure if this is what you are looking for.
Best, Gang
On Thu

Workaround lack of union data type

2023-03-02 Thread Hinko Kocevar
I have data points that can briefly be described with the following fields:
- timestamp
- name
- value
The value field can be 1) a scalar integer (all sizes), a float or double, or a string, or 2) an array (list) of any of the data types in 1). A named data point can for its lifetime only have a single/
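
The value shape described in this message can be sketched as a tagged union. The following is only an illustration, not code from the thread: the names `DataPoint`, `Value`, and `to_row`, the column names, and the one-nullable-column-per-variant layout are all assumptions. It shows one common way to flatten a union into columns that a format without a native union type can store, and it may differ from the workarounds the poster had in mind.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical types; the thread does not define any of these names.
Scalar = Union[int, float, str]
Value = Union[Scalar, List[int], List[float], List[str]]


@dataclass
class DataPoint:
    timestamp: int  # unit is an assumption, e.g. nanoseconds since epoch
    name: str
    value: Value


def _kind(x) -> str:
    """Map a scalar to the column prefix used for it."""
    if isinstance(x, bool):  # bool is a subclass of int in Python; reject it
        raise TypeError("bool is not one of the described value types")
    if isinstance(x, int):
        return "int"
    if isinstance(x, float):
        return "float"
    if isinstance(x, str):
        return "string"
    raise TypeError(f"unsupported value type: {type(x).__name__}")


def to_row(p: DataPoint) -> dict:
    """Flatten the union into one nullable column per variant."""
    row = {"timestamp": p.timestamp, "name": p.name,
           "int_value": None, "float_value": None, "string_value": None,
           "int_list": None, "float_list": None, "string_list": None}
    v = p.value
    if isinstance(v, list):
        if not v:
            raise ValueError("cannot infer the element type of an empty list")
        kinds = {_kind(x) for x in v}
        if len(kinds) != 1:
            raise TypeError("list elements must share one type")
        row[f"{kinds.pop()}_list"] = v
    else:
        row[f"{_kind(v)}_value"] = v
    return row
```

With this layout exactly one of the six value columns is non-null per row, so a reader has to check which one is set. That is the kind of extra reader-side work a native union type would avoid.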

[jira] [Commented] (PARQUET-2198) Vulnerabilities in jackson-databind

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695551#comment-17695551 ] ASF GitHub Bot commented on PARQUET-2198: - nikhilenr commented on PR #1005: URL

[GitHub] [parquet-mr] nikhilenr commented on pull request #1005: PARQUET-2198 : Updating jackson data bind version to fix CVEs

2023-03-02 Thread via GitHub
nikhilenr commented on PR #1005: URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1451500758 @shangxinli: Jackson has released a newer version, 2.14.2. https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind/2.14.2 Please try to fix as earl

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-02 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695546#comment-17695546 ] ASF GitHub Bot commented on PARQUET-2159: - gszadovszky commented on code in PR

[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1011: PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI

2023-03-02 Thread via GitHub
gszadovszky commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1122744256 ## parquet-plugins/parquet-encoding-vector/src/test/java/org/apache/parquet/column/values/bitpacking/TestByteBitPacking512VectorLE.java: ## @@ -0,0 +1,172 @@ +/*

[GitHub] [parquet-format] gszadovszky commented on a diff in pull request #190: MINOR: Add FIXED_LEN_BYTE_ARRAY Type

2023-03-02 Thread via GitHub
gszadovszky commented on code in PR #190: URL: https://github.com/apache/parquet-format/pull/190#discussion_r1122727045 ## README.md: ## @@ -132,6 +132,7 @@ readers and writers for the format. The types are: - FLOAT: IEEE 32-bit floating point values - DOUBLE: IEEE 64-bit