[jira] [Commented] (PARQUET-2075) Unified Rewriter Tool

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646980#comment-17646980 ] ASF GitHub Bot commented on PARQUET-2075: - wgtmac opened a new pull request, #1014: URL:

[GitHub] [parquet-mr] wgtmac opened a new pull request, #1014: PARQUET-2075: Implement ParquetRewriter

2022-12-13 Thread GitBox
wgtmac opened a new pull request, #1014: URL: https://github.com/apache/parquet-mr/pull/1014 ### Jira - This patch aims to solve the first step of [PARQUET-2075](https://issues.apache.org/jira/browse/PARQUET-2075). ### Tests - Make sure all tasks pass, especially

Re: parquet checksum coverage

2022-12-13 Thread Micah Kornfield
> > I think there's a good case for turning it on, as (a) there are lots of > other filesystems out there, including NTFS on Windows laptops, *and* > there's the risk of corruption of data in flight between the HDFS data node > processes where the CRC checks take place and the actual reader code. Yep,

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646943#comment-17646943 ] ASF GitHub Bot commented on PARQUET-2159: - jatin-bhateja commented on code in PR #1011: URL:

[GitHub] [parquet-mr] jatin-bhateja commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2022-12-13 Thread GitBox
jatin-bhateja commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1048036480 ## parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/BytePacker.java: ## @@ -105,4 +116,16 @@ public void unpack8Values(final byte[]

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646941#comment-17646941 ] ASF GitHub Bot commented on PARQUET-2159: - jatin-bhateja commented on code in PR #1011: URL:

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646899#comment-17646899 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on PR #1011: URL:

[GitHub] [parquet-mr] jiangjiguang commented on pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2022-12-13 Thread GitBox
jiangjiguang commented on PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#issuecomment-1350287549 > This work looks promising! It would be great if you can add some micro-benchmark to parquet-benchmarks. @wgtmac I have added the micro-benchmark to parquet-benchmarks, this

[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646685#comment-17646685 ] ASF GitHub Bot commented on PARQUET-2218: - mapleFU commented on PR #188: URL:

[GitHub] [parquet-format] mapleFU commented on pull request #188: PARQUET-2218: [Format] Clarify CRC computation

2022-12-13 Thread GitBox
mapleFU commented on PR #188: URL: https://github.com/apache/parquet-format/pull/188#issuecomment-1348743274 The change looks good to me! Thanks a lot! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646675#comment-17646675 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL:

[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348718160 I opened https://github.com/apache/parquet-format/pull/188 to clarify the wording. -- This is an automated message from the Apache Git Service. To respond to the message, please

[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646674#comment-17646674 ] ASF GitHub Bot commented on PARQUET-2218: - pitrou commented on PR #188: URL:

[jira] [Commented] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646672#comment-17646672 ] ASF GitHub Bot commented on PARQUET-2218: - pitrou opened a new pull request, #188: URL:

[GitHub] [parquet-format] pitrou commented on pull request #188: PARQUET-2218: [Format] Clarify CRC computation

2022-12-13 Thread GitBox
pitrou commented on PR #188: URL: https://github.com/apache/parquet-format/pull/188#issuecomment-1348714880 @bbraams @gszadovszky @mapleFU thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [parquet-format] pitrou opened a new pull request, #188: PARQUET-2218: [Format] Clarify CRC computation

2022-12-13 Thread GitBox
pitrou opened a new pull request, #188: URL: https://github.com/apache/parquet-format/pull/188 When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous. Clarify that CRC computation happens on the exact binary serialization (instead of a

[jira] [Updated] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread Antoine Pitrou (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated PARQUET-2218: Description: The format spec on CRC checksumming felt ambiguous when trying to implement

[jira] [Created] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2218: --- Summary: [Format] Clarify CRC computation Key: PARQUET-2218 URL: https://issues.apache.org/jira/browse/PARQUET-2218 Project: Parquet Issue Type:

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646656#comment-17646656 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL:

[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348674622 @wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes. -- This is an automated message from the Apache Git Service. To respond to

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646655#comment-17646655 ] ASF GitHub Bot commented on PARQUET-1539: - wgtmac commented on PR #126: URL:

[GitHub] [parquet-format] wgtmac commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
wgtmac commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348672686 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. Quick question: is there any rule to sync the `parquet.thrift` file from

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646613#comment-17646613 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL:

[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348441147 > And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. OK, thanks for your patience. I updated the descriptions in

[jira] [Commented] (PARQUET-1629) Page-level CRC checksum verification for DataPageV2

2022-12-13 Thread Antoine Pitrou (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646612#comment-17646612 ] Antoine Pitrou commented on PARQUET-1629: - [~mwish] for the record. Perhaps you would be

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646609#comment-17646609 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL:

[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348435655 And, yes, it would probably be nice to make the spec wording clearer. I can try to submit something. -- This is an automated message from the Apache Git Service. To respond to the

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646610#comment-17646610 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL:

[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348433863 It seems it was done deliberately in parquet-mr and all Parquet committers there agreed that it was how the spec should be interpreted: https://github.com/apache/parquet-mr/pull/647

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646595#comment-17646595 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL:

[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348405417 So, should we update the `parquet-format`, or just keep it here and not implement crc in parquet c++ version? @pitrou -- This is an automated message from the Apache Git

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646591#comment-17646591 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL:

[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348378187 It does seem that parquet-mr writes a CRC value for dictionary pages...

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646585#comment-17646585 ] ASF GitHub Bot commented on PARQUET-1539: - mapleFU commented on PR #126: URL:

[GitHub] [parquet-format] mapleFU commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
mapleFU commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323 > (also cc @mapleFU, who's working on CRC support for Parquet C++) Hi, all, I have a question here, the format says: ``` /** The 32bit CRC for the page, to be be

[jira] [Commented] (PARQUET-1539) Clarify CRC checksum in page header

2022-12-13 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646534#comment-17646534 ] ASF GitHub Bot commented on PARQUET-1539: - pitrou commented on PR #126: URL:

[GitHub] [parquet-format] pitrou commented on pull request #126: PARQUET-1539: Clarify CRC checksum in page header

2022-12-13 Thread GitBox
pitrou commented on PR #126: URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 @bbraams @gszadovszky Could you explain why the spec's wording is so complex? It seems to me that the CRC is basically computed over the entire serialized data exactly as it's