[GitHub] [parquet-format] JFinis commented on pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

2023-03-28 Thread via GitHub
JFinis commented on PR #196: URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762 The gist of all open issues is the question of how to encode pages/column chunks that contain only NaNs. This is actually only an issue for the `ColumnIndex`. For statistics in

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705868#comment-17705868 ] ASF GitHub Bot commented on PARQUET-2249: - JFinis commented on PR #196: URL: ht

[DISCUSS](PARQUET-2249) Add nan_count to handle NaNs in statistics

2023-03-28 Thread Jan Finis
Dear contributors, My PR has now gathered comments for a week and the gist of all open issues is the question of how to encode pages/column chunks that contain only NaNs. There are different suggestions and I don't see one common favorite yet. I have outlined three alternatives of how we can hand
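To make the open question concrete, here is a minimal sketch of why pages that contain only NaNs are awkward for min/max statistics. It assumes NaNs are excluded from the bounds and counted separately, in the spirit of the `nan_count` field the PR title proposes. The names `PageStats` and `page_stats` are hypothetical and are not taken from `parquet.thrift` or from any of the alternatives discussed in the thread.

```python
import math
from typing import Iterable, NamedTuple, Optional


class PageStats(NamedTuple):
    """Hypothetical per-page statistics; not the Thrift definition."""
    min_value: Optional[float]
    max_value: Optional[float]
    nan_count: int


def page_stats(values: Iterable[float]) -> PageStats:
    """Compute min/max over non-NaN values and count NaNs separately."""
    nan_count = 0
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        if math.isnan(v):
            # Assumption: NaNs never contribute to the bounds.
            nan_count += 1
            continue
        lo = v if lo is None or v < lo else lo
        hi = v if hi is None or v > hi else hi
    return PageStats(lo, hi, nan_count)


mixed = page_stats([1.0, math.nan, 3.0])
only_nans = page_stats([math.nan, math.nan])
```

For the mixed page the bounds are well defined and the NaN is only counted. For the NaN-only page there is no non-NaN value to use as a bound, so `min_value` and `max_value` stay `None`; how a writer should encode that case is the question this thread raises.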

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

2023-03-28 Thread via GitHub
mapleFU commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150282767 ## src/main/thrift/parquet.thrift: ## @@ -952,6 +961,9 @@ struct ColumnIndex { * Such more compact values must still be valid values within the column's *

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705900#comment-17705900 ] ASF GitHub Bot commented on PARQUET-2249: - mapleFU commented on code in PR #196

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1150299027 ## src/main/thrift/parquet.thrift: ## @@ -223,6 +223,17 @@ struct Statistics { */ 5: optional binary max_value; 6: optional binary min_value; +

[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

2023-03-28 Thread via GitHub
JFinis commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150299691 ## src/main/thrift/parquet.thrift: ## @@ -952,6 +961,9 @@ struct ColumnIndex { * Such more compact values must still be valid values within the column's * l

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705906#comment-17705906 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705907#comment-17705907 ] ASF GitHub Bot commented on PARQUET-2249: - JFinis commented on code in PR #196:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1150300360 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,41 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEA

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705909#comment-17705909 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1150301523 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,41 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEA

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705910#comment-17705910 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

[GitHub] [parquet-format] mapleFU commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

2023-03-28 Thread via GitHub
mapleFU commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150378596 ## README.md: ## @@ -163,18 +163,25 @@ following rules: [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union. They are summ

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705930#comment-17705930 ] ASF GitHub Bot commented on PARQUET-2249: - mapleFU commented on code in PR #196

[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-28 Thread Steve Loughran (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705940#comment-17705940 ] Steve Loughran commented on PARQUET-2224: - it's not Spark, it's a cyclone/maven

[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-28 Thread Dongjoon Hyun (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705962#comment-17705962 ] Dongjoon Hyun commented on PARQUET-2224: To [~ste...@apache.org], I don't think

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Xinli shang
+1 Verified signature and ran internal tests. Thanks Gang for leading this effort! On Mon, Mar 27, 2023 at 9:38 AM Dongjoon Hyun wrote: > +1 > > Thank you, Gang and Yuming. > > Dongjoon. > > On 2023/03/27 05:44:14 "Wang, Yuming" wrote: > > +1. Tested this release through Spark UT: > https://gi

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Gidon Gershinsky
+1 Verified signature and ran the tests. Thanks Gang and all contributors! Cheers, Gidon On Tue, Mar 28, 2023 at 5:19 PM Xinli shang wrote: > +1 > > Verified signature and ran internal tests. Thanks Gang for leading this > effort! > > On Mon, Mar 27, 2023 at 9:38 AM Dongjoon Hyun wrote: > >

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Chao Sun
+1 (non-binding). Verified checksum & signature, and ran all the tests locally. Thanks Gang! On Tue, Mar 28, 2023 at 9:37 AM Gidon Gershinsky wrote: > > +1 > > Verified signature and ran the tests. Thanks Gang and all contributors! > > Cheers, Gidon > > > On Tue, Mar 28, 2023 at 5:19 PM Xinli s

[GitHub] [parquet-format] yqiu2 commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
yqiu2 commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151066419 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[GitHub] [parquet-format] yqiu2 commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
yqiu2 commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151066788 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706120#comment-17706120 ] ASF GitHub Bot commented on PARQUET-2261: - yqiu2 commented on code in PR #197:

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706119#comment-17706119 ] ASF GitHub Bot commented on PARQUET-2261: - yqiu2 commented on code in PR #197:

[GitHub] [parquet-format] yqiu2 commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
yqiu2 commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151066788 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706121#comment-17706121 ] ASF GitHub Bot commented on PARQUET-2261: - yqiu2 commented on code in PR #197:

[GitHub] [parquet-format] yqiu2 commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
yqiu2 commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151068954 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[GitHub] [parquet-format] yqiu2 commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
yqiu2 commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151068954 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706122#comment-17706122 ] ASF GitHub Bot commented on PARQUET-2261: - yqiu2 commented on code in PR #197:

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706123#comment-17706123 ] ASF GitHub Bot commented on PARQUET-2261: - yqiu2 commented on code in PR #197:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151145783 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEA

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706146#comment-17706146 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151146236 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,45 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEA

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706147#comment-17706147 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread L. C. Hsieh
+1 (non-binding) Verified checksum and ran the tests locally. Thanks Gang. One question is that the public key I saw on key server (pgpkeys.mit.edu) is different to the one in https://downloads.apache.org/parquet/KEYS. On 2023/03/28 17:01:30 Chao Sun wrote: > +1 (non-binding). Verified checksum

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151301394 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,41 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706170#comment-17706170 ] ASF GitHub Bot commented on PARQUET-2261: - wgtmac commented on code in PR #197:

[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-28 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706171#comment-17706171 ] Gang Wu commented on PARQUET-2224: -- Thanks [~dongjoon] for the detail! > Publish SBOM

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
wgtmac commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151311588 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,41 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEATED =

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706175#comment-17706175 ] ASF GitHub Bot commented on PARQUET-2261: - wgtmac commented on code in PR #197:

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Proposal for unencoded/uncompressed statistics

2023-03-28 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151338163 ## src/main/thrift/parquet.thrift: ## @@ -190,6 +190,41 @@ enum FieldRepetitionType { /** The field is repeated and can contain 0 or more values */ REPEA

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706188#comment-17706188 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

[GitHub] [parquet-format] wgtmac commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

2023-03-28 Thread via GitHub
wgtmac commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1151335207 ## README.md: ## @@ -163,18 +163,25 @@ following rules: [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union. They are summa

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706189#comment-17706189 ] ASF GitHub Bot commented on PARQUET-2249: - wgtmac commented on code in PR #196:

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Gang Wu
Hi L.C. Could you please elaborate on the issue with the public key? How can I check that myself? Thanks, Gang On Wed, Mar 29, 2023 at 7:48 AM L. C. Hsieh wrote: > +1 (non-binding) Verified checksum and ran the tests locally. > > Thanks Gang. > > One question is that the public key I saw on key se

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread L. C. Hsieh
Hi Gang, I tried to search your public key on http://pgp.mit.edu/. It shows a different public key: pub 4096R/26D4D78E 2018-04-11 Gang Wu Looks like it is your older public key? Wondering why your new public key is not updated on key server. On 2023/03/29 02:59:47 Gang Wu wrote: > Hi L.C. >

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Gang Wu
Yes, I have updated my GPG key but have not sent it to http://pgp.mit.edu/. You may find my key from keys.openpgp.org Best, Gang On Wed, Mar 29, 2023 at 1:51 PM L. C. Hsieh wrote: > Hi Gang, > > I tried to search your public key on http://pgp.mit.edu/. > It shows a different public key: > > pu