Re: Congrats to Julien Le Dem for being next PMC Chair

2024-07-08 Thread Gidon Gershinsky
I'm back online after the vacation. Thank you Xinli, and glad to see you
back Julien.

Cheers, Gidon


On Thu, Jul 4, 2024 at 4:18 AM Gang Wu  wrote:

> Thanks Xinli and welcome back Julien!
>
> Best,
> Gang
>
> On Thu, Jul 4, 2024 at 1:10 AM Parth Chandra  wrote:
>
> > Thanks Xinli for your leadership! And welcome back Julien!
> >
> > -Parth
> >
> > On Wed, Jul 3, 2024 at 5:13 AM Rok Mihevc  wrote:
> >
> > > Congrats Julien and thanks Xinli!
> > >
> > > Rok
> > >
> > > On Wed, Jul 3, 2024 at 8:02 AM Fokko Driesprong 
> > wrote:
> > >
> > > > Thank you Xinli for leading the project for the last few years, and
> > > > congrats Julien!
> > > >
> > > > Kind regards,
> > > > Fokko
> > > >
> > > > Op wo 3 jul 2024 om 03:21 schreef Julien Le Dem :
> > > >
> > > > > Thank you all and thank you Xinli for your leadership!
> > > > >
> > > > >
> > > > > On Tue, Jul 2, 2024 at 6:13 PM Vinoo Ganesh <
> vinoo.gan...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Congrats, Julien!
> > > > > >
> > > > > >
> > > > > > 
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 2, 2024 at 9:01 PM wish maple <
> maplewish...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Congrats Julien
> > > > > > >
> > > > > > > Best,
> > > > > > > Xuwei Fu
> > > > > > >
> > > > > > > Micah Kornfield  于2024年7月3日周三 08:58写道:
> > > > > > >
> > > > > > > > Congrats Julien
> > > > > > > >
> > > > > > > > On Tuesday, July 2, 2024, Andrew Lamb <
> andrewlam...@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Congratulations Julien!
> > > > > > > > >
> > > > > > > > > On Tue, Jul 2, 2024, 19:28 Xinli shang
> >  > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I am delighted to share some exciting news with you. Please
> > > > > > > > > > join me in congratulating Julien Le Dem on his return as the
> > > > > > > > > > next PMC Chair!
> > > > > > > > > >
> > > > > > > > > > Julien is not only the co-author of Apache Parquet but has
> > > > > > > > > > also previously served as the PMC Chair, where his
> > > > > > > > > > leadership and contributions have been invaluable. His
> > > > > > > > > > expertise and dedication continue to shape our community
> > > > > > > > > > and drive innovation.
> > > > > > > > > >
> > > > > > > > > > We look forward to the continued success and growth of
> > > > > > > > > > Apache Parquet under Julien's capable leadership.
> > > > > > > > > >
> > > > > > > > > > Xinli Shang
> > > > > > > > > > ex - Apache Parquet PMC Chair
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Parquet-Java 1.14.1 RC0

2024-06-13 Thread Gidon Gershinsky
+1 (binding).
Ran unit tests.

Cheers, Gidon


On Thu, Jun 13, 2024 at 11:45 AM Gábor Szádovszky  wrote:

> Checked tarball content, signature and checksum. Executed unit tests. All
> pass.
> +1 (binding)
>
> Gang Wu  ezt írta (időpont: 2024. jún. 13., Cs, 8:43):
>
> > Hi everyone,
> >
> > I propose the following RC to be released as the official Apache
> > Parquet-Java 1.14.1 release.
> >
> > The commit id is 97ede968377400d1d79e3196636ba3de392196ba
> > * This corresponds to the tag: apache-parquet-1.14.1-rc0
> > * https://github.com/apache/parquet-java/tree/97ede968377400d1d79e3196636ba3de392196ba
> >
> > The release tarball, signature, and checksums are here:
> > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.14.1-rc0
> >
> > You can find the KEYS file here:
> > * https://downloads.apache.org/parquet/KEYS
> >
> > Binary artifacts are staged in Nexus here:
> > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >
> > This release includes important bug fixes:
> > * PARQUET-2468 - ParquetMetadata.toPrettyJSON throws exception on file
> >   read when LOG.isDebugEnabled()
> > * PARQUET-2472 - Close resources in finally block in ParquetFileWriter#end
> > * PARQUET-2498 - Hadoop vector IO API doesn't handle empty list of ranges
> >
> > Full change logs can be viewed here:
> > * https://github.com/apache/parquet-java/blob/parquet-1.14.x/CHANGES.md#version-1141
> >
> > Please download, verify, and test. Please vote in the next 72 hours.
> >
> > [ ] +1 Release this as Apache Parquet-Java 1.14.1
> > [ ] +0
> > [ ] -1 Do not release this because...
> >
> > Best,
> > Gang
> >
>


Re: [ANNOUNCE] New Parquet PMC Member: Gang Wu

2024-05-11 Thread Gidon Gershinsky
Congrats Gang, well deserved!


Cheers, Gidon


On Sat, 11 May 2024 at 20:19 Xinli shang  wrote:

> Hi all,
>
> As a Parquet committer, Gang Wu has remained very active and instructive in
> the community. The Parquet community invited him to be a PMC member, and he
> accepted. It's my pleasure to announce that Gang is now officially a PMC
> member of Apache Parquet.
>
> Congratulations, Gang!
>
> Xinli Shang, on behalf of the Apache Parquet PMC
>


Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-07 Thread Gidon Gershinsky
+1 (binding)

- ran the tests
- ran with the Iceberg encryption code

Cheers, Gidon


On Tue, May 7, 2024 at 4:28 AM Gang Wu  wrote:

> Hi,
>
> It has been open for more than 72 hours already. We still need 2 more
> binding votes. Considering that there was a weekend during the voting
> hours, let's extend it. Thanks!
>
> Best,
> Gang
>
> On Mon, May 6, 2024 at 4:07 PM Fokko Driesprong  wrote:
>
> > Good catch Gábor!
> >
> > I've created PRs to fix this for future releases:
> >
> >- https://github.com/apache/parquet-mr/pull/1347
> >- https://github.com/apache/parquet-mr/pull/1348
> >
> > Kind regards,
> > Fokko
> >
> > Op ma 6 mei 2024 om 08:50 schreef Gábor Szádovszky :
> >
> > > Thanks Fokko, Gang for working on this.
> > > I have some findings:
> > > * nit correction in the original mail: tag is apache-parquet-1.14.0-rc1
> > > (not apache-parquet-1.4.0-rc1)
> > > * The CHANGES.md should have been updated with the one fix you've
> > mentioned
> > > (PARQUET-2465)
> > >
> > > Since I've never used CHANGES.md to actually check a release content, I
> > > don't feel this issue is so crucial to fail this vote. I would let the
> > > other voters decide.
> > > +1 (binding)
> > >
> > > Gang Wu  ezt írta (időpont: 2024. máj. 6., H, 3:33):
> > >
> > > > +1 (non-binding)
> > > >
> > > > Verified signature, checksum and build.
> > > >
> > > > Thanks Fokko for doing this! Let me take care of the rest.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Mon, May 6, 2024 at 4:36 AM Fokko Driesprong 
> > > wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - Checked against Trino and the RC1 runs cleanly
> > > > > 
> > > > > - Checked against Iceberg and the tests passed locally. To let the
> > > > >   CI pass we must upgrade Gradle, this is because Parquet ships with
> > > > >   a new Jackson version that contains JDK21 code, but this is an
> > > > >   issue on the Iceberg side
> > > > >   <https://github.com/apache/iceberg/pull/10209#issuecomment-2094939429>.
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > >
> > > > > Op vr 3 mei 2024 om 17:46 schreef Fokko Driesprong <
> fo...@apache.org
> > >:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Since Gang is enjoying a well-deserved vacation
> > > > > > <https://github.com/apache/parquet-mr/pull/1342#issuecomment-2092774404>,
> > > > > > I'm jumping in for this RC. I propose the following RC to be
> > released
> > > > as
> > > > > > the official Apache Parquet 1.14.0 release.
> > > > > >
> > > > > > The commit ID is fe9179414906cc19b550d13d2819b4e16fddf8a1
> > > > > > * This corresponds to the tag: apache-parquet-1.4.0-rc1
> > > > > > * https://github.com/apache/parquet-mr/tree/fe9179414906cc19b550d13d2819b4e16fddf8a1
> > > > > >
> > > > > > The release tarball, signature, and checksums are here:
> > > > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.14.0-rc1/
> > > > > >
> > > > > > You can find the KEYS file here:
> > > > > > * https://downloads.apache.org/parquet/KEYS
> > > > > >
> > > > > > Binary artifacts are staged in Nexus here:
> > > > > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > >
> > > > > > This release includes important changes:
> > > > > >
> > > > > > * https://github.com/apache/parquet-mr/blob/parquet-1.14.x/CHANGES.md#version-1140
> > > > > >
> > > > > > Since RC0 one commit has been added:
> > > > > > https://github.com/apache/parquet-mr/pull/1342
> > > > > >
> > > > > > Please download, verify, and test.
> > > > > >
> > > > > > Please vote in the next 72 hours.
> > > > > >
> > > > > > [ ] +1 Release this as Apache Parquet 1.14.0
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not release this because...
> > > > > >
> > > > > > Kind regards,
> > > > > > Fokko
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-05-02 Thread Gidon Gershinsky
+1 (binding)

Ran the build and tests.

I'm told by the Spark community that they'd like to integrate the new
parquet-mr in Spark 4.0, so they are interested in having v1.14 as soon as
possible.


On Tue, Apr 30, 2024 at 6:26 PM Vinoo Ganesh  wrote:

> +1 (non-binding)
>
> Bumped to 1.14.0-SNAPSHOT in Spark and ran a few tests too
>
>
> 
>
>
> On Tue, Apr 30, 2024 at 10:20 AM Xinli shang 
> wrote:
>
> > +1 (binding)
> >
> > Validated the KEY
> >
> > On Tue, Apr 30, 2024 at 1:18 AM Gang Wu  wrote:
> >
> > > Thank you!
> > >
> > > On Tue, Apr 30, 2024 at 4:16 PM Gábor Szádovszky 
> > wrote:
> > >
> > > > By importing the KEYS file under [1] the check of the .asc file
> passed!
> > > > So, I went forward and updated the KEYS file under [2] with your new
> > one.
> > > >
> > > > Giving +1 (binding) for the release
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > Gang Wu  ezt írta (időpont: 2024. ápr. 30., K,
> > 9:58):
> > > >
> > > > > I have appended my new key to [1]. Please verify again. However, I
> > > > > don't have the permission to update [2]. That may not be an issue, as
> > > > > I don't have the permission to upload the final tarball to the svn
> > > > > release repo.
> > > > >
> > > > > [1] https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > > > [2] https://dist.apache.org/repos/dist/release/parquet/KEYS
> > > > >
> > > > > On Tue, Apr 30, 2024 at 3:45 PM Gábor Szádovszky  >
> > > > wrote:
> > > > >
> > > > > > Sure, please add your new public key to the referenced KEYS file
> > then
> > > > we
> > > > > > should be good. (The previous one would still be required to
> check
> > > the
> > > > > > previous releases, so do not remove it.)
> > > > > >
> > > > > > Gang Wu  ezt írta (időpont: 2024. ápr. 30., K,
> > > > 9:27):
> > > > > >
> > > > > > > Hi Gabor,
> > > > > > >
> > > > > > > Thanks for raising the issue! My original key was accidentally
> > > > > > > deleted by a shell script and cannot be recovered anymore. I have
> > > > > > > created a new key and used it to sign the tarball. That's why it
> > > > > > > does not exist in the KEYS file. I have sent the new key to some
> > > > > > > key servers already. Does it make sense to add my new key to the
> > > > > > > KEYS file instead?
> > > > > > >
> > > > > > > Best,
> > > > > > > Gang
> > > > > > >
> > > > > > > On Tue, Apr 30, 2024 at 3:11 PM Gábor Szádovszky <
> > ga...@apache.org
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Gang,
> > > > > > > >
> > > > > > > > Thank you for taking care of the release!
> > > > > > > >
> > > > > > > > Unfortunately, the .asc check fails for me even after importing
> > > > > > > > the KEYS file. Could you double check if you signed it with the
> > > > > > > > correct key? No other issues were discovered, so no RC1 is
> > > > > > > > required for now if you can change the .asc file for the
> > > > > > > > current tarball.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gabor
> > > > > > > >
> > > > > > > > Gang Wu  ezt írta (időpont: 2024. ápr.
> 30.,
> > K,
> > > > > > 7:45):
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I propose the following RC to be released as the official
> > > Apache
> > > > > > > Parquet
> > > > > > > > > 1.14.0 release.
> > > > > > > > >
> > > > > > > > > The commit id is af0740229929337e1395fd24253a4ed787df2db3
> > > > > > > > > * This corresponds to the tag: apache-parquet-1.14.0-rc0
> > > > > > > > > * https://github.com/apache/parquet-mr/tree/af0740229929337e1395fd24253a4ed787df2db3
> > > > > > > > >
> > > > > > > > > The release tarball, signature, and checksums are here:
> > > > > > > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.14.0-rc0
> > > > > > > > >
> > > > > > > > > You can find the KEYS file here:
> > > > > > > > > * https://downloads.apache.org/parquet/KEYS
> > > > > > > > >
> > > > > > > > > Binary artifacts are staged in Nexus here:
> > > > > > > > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > > > > >
> > > > > > > > > This release includes important changes:
> > > > > > > > > * https://github.com/apache/parquet-mr/blob/parquet-1.14.x/CHANGES.md#version-1140
> > > > > > > > >
> > > > > > > > > Please download, verify, and test.
> > > > > > > > >
> > > > > > > > > Please vote in the next 72 hours.
> > > > > > > > >
> > > > > > > > > [ ] +1 Release this as Apache Parquet 1.14.0
> > > > > > > > > [ ] +0
> > > > > > > > > [ ] -1 Do not release this because...
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Gang
> > > > > > > > >
> > > 

Re: How the key rotation works when using Parquet Modular Encryption

2023-11-30 Thread Gidon Gershinsky
On Wed, Nov 29, 2023 at 5:40 PM Priyanshu Sharma
 wrote:

> With Parquet Modular Encryption
> 1. With each key rotation , Is it possible to avoid encryption and
> decryption of existing data?
>
Yes

>
> 2. If master key rotation does not require modification of the data file
> then how would the KMS work.
>
- Basic key rotation simply means the master key version is updated in the
KMS, so future parquet files are encrypted with the rotated master key
(namely, their data keys will be encrypted with the new master key version).
- In addition, if your threat model requires re-wrapping the data keys of
existing parquet files with the rotated master key, this can be done without
modifying the parquet files if they were encrypted in the "external key
material" mode (parquet.encryption.key.material.store.internally=false; see
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-propertiesdrivencryptofactory).
In this mode, the data keys (encrypted with master keys in the KMS) are stored
in separate small key_material files. The key re-wrapping will re-encrypt the
data keys with the rotated master key and replace the key_material files.
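For illustration, a minimal spark-shell sketch of this flow. It assumes a
recent parquet-mr (1.12+) where `KeyToolkit.rotateMasterKeys(folderPath,
hadoopConfig)` is available; the table path and KMS setup are placeholders.

```scala
// Sketch only: assumes a KMS client class is already configured, and that the
// table is written with external key material. The path is a placeholder.
sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
// Store data-key material in separate key_material files next to the data,
// so rotation can re-wrap keys without touching the parquet files themselves:
sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")

// ... later, after the master key version has been rotated inside the KMS:
import org.apache.parquet.crypto.keytools.KeyToolkit
// Re-encrypts the data keys in the key_material files with the new master key
// version; the parquet data files are not rewritten.
KeyToolkit.rotateMasterKeys("/path/to/encrypted/table", sc.hadoopConfiguration)
```

Only the small key_material files change; a reader configured with the same
crypto factory picks up the re-wrapped keys transparently.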

>
> 3. Do we have any constraints for key structure while updating a key.
>
This is up to the KMS service implementation.

>
> It would be better if you could provide a git link having the interface to
> implement KMS. I am already following this git page
> https://github.com/apache/parquet-format/blob/master/Encryption.md but
> still have a few doubts.
>
The links and basic details can be found in
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#columnar-encryption
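For reference, a sketch of what a custom KMS plug-in looks like: the
`org.apache.parquet.crypto.keytools.KmsClient` interface has roughly the shape
below (check the interface in your parquet-mr version for the exact
signatures). `MyVaultApi` is a hypothetical stand-in for your actual KMS SDK.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.KmsClient

// Hypothetical KMS SDK facade -- replace with your real KMS client library.
object MyVaultApi {
  def encrypt(masterKeyId: String, key: Array[Byte]): String = ???
  def decrypt(masterKeyId: String, wrapped: String): Array[Byte] = ???
}

// Minimal sketch of a KmsClient implementation; the class name is plugged in
// via the parquet.encryption.kms.client.class property.
class MyKmsClient extends KmsClient {
  override def initialize(configuration: Configuration, kmsInstanceID: String,
                          kmsInstanceURL: String, accessToken: String): Unit = {
    // Authenticate / connect to the KMS instance here.
  }

  // Encrypt ("wrap") a parquet data key with the current master key version.
  override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String =
    MyVaultApi.encrypt(masterKeyIdentifier, keyBytes)

  // Decrypt a wrapped data key; the KMS resolves the master key version
  // from the wrapped material.
  override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] =
    MyVaultApi.decrypt(masterKeyIdentifier, wrappedKey)
}
```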



Cheers, Gidon


Re: [VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-19 Thread Gidon Gershinsky
+1 (binding).

Thanks Gang.

Cheers, Gidon


On Fri, Nov 17, 2023 at 5:07 PM Xinli shang  wrote:

> +1 (binding)
>
> Verified the signature. Thanks Gang for leading the effort!
>
> On Thu, Nov 16, 2023 at 9:41 PM wish maple  wrote:
>
> > +1 (non-binding)
> >
> > Thanks Gang for the release!
> >
> > Best,
> > Xuwei Fu
> >
> > Gang Wu  于2023年11月16日周四 14:07写道:
> >
> > > Hi everyone,
> > >
> > > I propose the following RC to be released as the official Apache
> Parquet
> > > Format 2.10.0 release.
> > >
> > > The commit id is b9c4fa81c3be13dc98760c92b037fa4dd465cef8
> > > * This corresponds to the tag: apache-parquet-format-2.10.0-rc0
> > > * https://github.com/apache/parquet-format/tree/b9c4fa81c3be13dc98760c92b037fa4dd465cef8
> > >
> > > The release tarball, signature, and checksums are here:
> > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.10.0-rc0
> > >
> > > You can find the KEYS file here:
> > > * https://downloads.apache.org/parquet/KEYS
> > >
> > > Binary artifacts are staged in Nexus here:
> > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.10.0/
> > >
> > > This release includes important changes listed below:
> > > * https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-2100
> > > * https://issues.apache.org/jira/projects/PARQUET/versions/12350092
> > >
> > > Please download, verify, and test.
> > >
> > > This vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Parquet Format 2.10.0
> > > [ ] +0
> > > [ ] -1 Do not release this because...
> > >
> > > Thanks,
> > > Gang
> > >
> >
>
>
> --
> Xinli Shang
>


Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-13 Thread Gidon Gershinsky
+1 (binding)

Cheers, Gidon


On Tue, Nov 14, 2023 at 5:31 AM Xinli shang  wrote:

> Yeah, we need one more PMC to vote. If you can help, appreciate it.
>
> On Mon, Nov 13, 2023 at 6:23 AM Fokko Driesprong  wrote:
>
> > +1 non-binding
> >
> > Great work Micah, I went through the PR and it looks very promising.
> >
> > Kind regards,
> > Fokko Driesprong
> >
> >  (Also pinged two more PMC members, hopefully they have time to jump in
> > here)
> >
> > Op vr 10 nov 2023 om 19:40 schreef Micah Kornfield <
> emkornfi...@gmail.com
> > >:
> >
> > > Hello, we need one more PMC member to approve this before the result
> can
> > > become official.  Would someone mind chiming in?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Wed, Nov 8, 2023 at 8:55 AM Gábor Szádovszky 
> > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > On 2023/11/07 02:46:37 Xinli shang wrote:
> > > > > +1 (binding)
> > > > >
> > > > > On Mon, Nov 6, 2023 at 4:56 PM Gang Wu  wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Best,
> > > > > > Gang
> > > > > >
> > > > > > On Tue, Nov 7, 2023 at 3:57 AM Ed Seidl 
> wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Ed
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Xinli Shang
> > > > >
> > > >
> > >
> >
>
>
> --
> Xinli Shang
>


[jira] [Resolved] (PARQUET-2364) Encrypt all columns option

2023-11-08 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2364.
---
Fix Version/s: 1.14.0
   Resolution: Fixed

> Encrypt all columns option
> --
>
> Key: PARQUET-2364
> URL: https://issues.apache.org/jira/browse/PARQUET-2364
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0
>
>
> The column encryption mode currently encrypts only the explicitly specified 
> columns. Other columns stay unencrypted. This Jira will add an option to 
> encrypt (and tamper-proof) the other columns with the default footer key.
> Decryption / reading is not affected. The current readers will be able to 
> decrypt the new files, as long as they have access to the required keys.
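For illustration, a hedged spark-shell sketch of how such an option would be
used with the properties-driven factory. The property name
`parquet.encryption.complete.columns` is an assumption based on the
parquet-mr 1.14.0 crypto factory and should be verified against your version;
`df` is a placeholder DataFrame.

```scala
// Sketch only; df is a hypothetical DataFrame, and the key names follow the
// usual PropertiesDrivenCryptoFactory examples. The complete-column-encryption
// property name below is an assumption -- verify it in your parquet-mr version.
sc.hadoopConfiguration.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

df.write
  .option("parquet.encryption.footer.key", "keyz")
  .option("parquet.encryption.column.keys", "key1a:secret_col;")
  // With the PARQUET-2364 option enabled, columns not listed above are
  // encrypted (and tamper-proofed) with the footer key instead of being
  // left in plaintext:
  .option("parquet.encryption.complete.columns", "true")
  .parquet("/tmp/table_all_encrypted")
```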



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2370) Crypto factory activation of "all column encryption" mode

2023-11-08 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2370.
---
Resolution: Fixed

> Crypto factory activation of "all column encryption" mode
> -
>
> Key: PARQUET-2370
> URL: https://issues.apache.org/jira/browse/PARQUET-2370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0
>
>
> Enable the crypto factory to activate the "encrypt all columns" option 
> (https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2370) Crypto factory activation of "all column encryption" mode

2023-11-08 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2370:
--
Fix Version/s: 1.14.0

> Crypto factory activation of "all column encryption" mode
> -
>
> Key: PARQUET-2370
> URL: https://issues.apache.org/jira/browse/PARQUET-2370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>    Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0
>
>
> Enable the crypto factory to activate the "encrypt all columns" option 
> (https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2370) Crypto factory activation of "all column encryption" mode

2023-10-23 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2370:
-

 Summary: Crypto factory activation of "all column encryption" mode
 Key: PARQUET-2370
 URL: https://issues.apache.org/jira/browse/PARQUET-2370
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Enable the crypto factory to activate the "encrypt all columns" option 
(https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2364) Encrypt all columns option

2023-10-16 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2364:
-

 Summary: Encrypt all columns option
 Key: PARQUET-2364
 URL: https://issues.apache.org/jira/browse/PARQUET-2364
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


The column encryption mode currently encrypts only the explicitly specified 
columns. Other columns stay unencrypted. This Jira will add an option to 
encrypt (and tamper-proof) the other columns with the default footer key.

Decryption / reading is not affected. The current readers will be able to 
decrypt the new files, as long as they have access to the required keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2223) Parquet Data Masking for Column Encryption

2023-06-16 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733538#comment-17733538
 ] 

Gidon Gershinsky commented on PARQUET-2223:
---

Yep, I also think so. I'll have a look at the current version of the design 
document.

> Parquet Data Masking for Column Encryption
> --
>
> Key: PARQUET-2223
> URL: https://issues.apache.org/jira/browse/PARQUET-2223
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Jiashen Zhang
>Priority: Major
>
> h1. Background
> h2. What is Data Masking?
> Data masking is a technique used to protect sensitive data by replacing it 
> with modified or obscured values. The purpose of data masking is to ensure 
> that sensitive information, such as Personally Identifiable Information 
> (PII), remains hidden from unauthorized users while allowing authorized users 
> to perform their tasks.
> Here are a few key points about data masking:
>  * Protection of Sensitive Data: Data masking helps to safeguard sensitive 
> data, such as Social Security numbers, credit card numbers, names, addresses, 
> and other personally identifiable information. By applying masking 
> techniques, the original values are replaced with fictional or transformed 
> data that retains the format and structure but removes any identifiable 
> information.
>  * Controlled Access: Data masking enables controlled access to sensitive 
> data. Authorized users, typically with appropriate permissions, can access 
> the unmasked or original data, while unauthorized users or users without the 
> necessary permissions will only see the masked data.
>  * Various Masking Techniques: There are different masking techniques 
> available, depending on the specific data privacy requirements and use cases. 
> Some commonly used techniques include:
>  ** Nullification: Replacing original data with NULL values.
>  ** Randomization: Replacing sensitive data with randomly generated values.
>  ** Substitution: Replacing sensitive data with fictional but realistic 
> values.
>  ** Hashing: Transforming sensitive data into irreversible hashed values.
>  ** Redaction: Removing or masking specific parts of sensitive data while 
> retaining other non-sensitive information.
>  * Compliance and Data Privacy: Data masking is often employed to comply with 
> data protection regulations and maintain data privacy. By masking sensitive 
> data, we can reduce the risk of data breaches and unauthorized access while 
> still allowing legitimate users to perform their tasks.
>  * Maintaining Data Consistency: Data masking techniques aim to maintain data 
> consistency and integrity by ensuring that masked data retains the original 
> data's format, structure, and relationships. This allows applications and 
> processes that rely on the data to continue functioning correctly.
> h2. Why do we need it?
> Data masking serves several important purposes and provides numerous 
> benefits. Here are some reasons why we need data masking:
>  * Data Privacy and Compliance: Data masking helps us comply with data 
> privacy regulations such as the General Data Protection Regulation (GDPR) and 
> the Health Insurance Portability and Accountability Act (HIPAA). These 
> regulations require us to protect sensitive data and ensure that it is only 
> accessible to authorized individuals. Data masking enables us to comply with 
> these regulations by de-identifying sensitive data.
>  * Minimize Data Exposure: By masking sensitive data, we can reduce the risk 
> of data breaches and unauthorized access. If a security breach occurs, the 
> exposed data will be meaningless to unauthorized users due to the masking. 
> This helps protect individuals' privacy and prevents misuse of sensitive 
> information.
>  * Secure Testing and Development Environments: Data masking is particularly 
> useful in creating secure testing and development environments. By masking 
> sensitive data, we can use realistic but fictional data for testing, 
> analysis, and development activities without exposing real personal or 
> sensitive information.
>  * Enhanced Data Sharing: Data masking allows us to share data with external 
> parties, such as partners or third-party vendors, while protecting sensitive 
> information. Masked data can be shared with confidence, as the original 
> sensitive values are replaced with transformed or fictional data.
>  * Employee Privacy: Data masking helps protect employee privacy by 
> obfuscating sensitive employee information, such as social security numbers 
> or salary details, in databases or HR systems. This s

Re: [VOTE] Release Apache Parquet 1.13.1 RC0

2023-05-14 Thread Gidon Gershinsky
+1

Ran the test suite.

Cheers, Gidon


On Sat, May 13, 2023 at 11:48 PM Xinli shang 
wrote:

> +1
>
> I verified the signature and ran a sanity test.
>
>
>
> On Fri, May 12, 2023 at 6:15 PM pk singh  wrote:
>
> > Thanks Fokko, this is super-helpful and unblocks the parquet 1.13 upgrade
> > for iceberg!
> >
> > +1 (non-binding) from my end as well.
> >
> > Regards,
> > Prashant Singh
> >
> >
> >
> > On 2023/05/12 13:37:30 Fokko Driesprong wrote:
> > > Hi everyone,
> > >
> > >
> > > I propose the following RC to be released as the official Apache
> Parquet
> > > 1.13.1 release.
> > >
> > >
> > > The commit id is db4183109d5b734ec5930d870cdae161e408ddba
> > >
> > > * This corresponds to the tag: apache-parquet-1.13.1-rc0
> > >
> > > * https://github.com/apache/parquet-mr/tree/db4183109d5b734ec5930d870cdae161e408ddba
> > >
> > >
> > > The release tarball, signature, and checksums are here:
> > >
> > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.13.1-rc0
> > >
> > >
> > > You can find the KEYS file here:
> > >
> > > * https://downloads.apache.org/parquet/KEYS
> > >
> > >
> > > Binary artifacts are staged in Nexus here:
> > >
> > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > >
> > >
> > > This release includes important changes:
> > >
> > > * https://github.com/apache/parquet-mr/commits/parquet-1.13.x
> > >
> > >
> > > Handy commands for verifying the release:
> > >
> > > * https://iceberg.apache.org/how-to-release/#validating-a-source-release-candidate
> > >
> > > Replace Iceberg with Parquet :)
> > >
> > >
> > > Please download, verify, and test.
> > >
> > >
> > > Please vote in the next 72 hours.
> > >
> > >
> > > [ ] +1 Release this as Apache Parquet 1.13.1
> > >
> > > [ ] +0
> > >
> > > [ ] -1 Do not release this because...
> > >
>
>
>
> --
> Xinli Shang
>


[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2023-05-04 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719294#comment-17719294
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

[~Nageswaran] A couple of updates on this.

We should be able to skip this verification for encrypted files; a pull request 
has been sent to parquet-mr.

Also, I've tried the new Spark 3.4.0 (as is, no modifications) with the scala 
test above - no exception was thrown. Probably, the updated Spark code bypasses 
the problematic parquet read path. Can you check if Spark 3.4.0 works ok for 
your usecase.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, it was found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications which do not have the encryption keys to decrypt it, I cannot 
> read the remaining fields of the nested column without keys. 
> Example: 
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> {code}
> In the case class `SquareItem`, the `nestedCol` field is a nested field, and I want 
> to encrypt the field `ic` within it. 
>  
> I also want the footer to be non-encrypted, so that the encrypted 
> parquet file can be used by legacy applications. 
>  
> Encryption is successful. However, when I query the parquet file using spark 
> 3.3.0 without any configuration for parquet encryption set up, I 
> cannot read the non-encrypted field `sic` of `nestedCol`. I was expecting that only 
> the `ic` field of `nestedCol` would not be queryable.
>  
>  
> Reproducer. 
> Spark 3.3.0 Using Spark-shell 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create encrypted data: 
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol, nestedItem(i, i))))
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, trying to access the non-encrypted nested field by opening 
> a new spark-shell: 
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> As you can see, nestedCol.sic is not encrypted, so I was expecting 
> results, but
> I get the below error:
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMeta

[jira] [Created] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2297:
-

 Summary: Encrypted files should not be checked for delta encoding 
problem
 Key: PARQUET-2297
 URL: https://issues.apache.org/jira/browse/PARQUET-2297
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.13.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky
 Fix For: 1.14.0, 1.13.1


The delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) has been 
fixed in writers since parquet-mr-1.8. The fix also added a 
`checkDeltaByteArrayProblem` method in readers, which runs over all columns and 
checks for this problem in older files. 

This now triggers an unrelated exception when reading encrypted files, in the 
following situation: trying to read an unencrypted column, without having keys 
for encrypted columns (see https://issues.apache.org/jira/browse/PARQUET-2193). 
This happens in Spark, with nested columns (files with regular columns are ok).

Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
encrypted files - because these files can be written only with parquet-mr-1.12 
and newer, where the delta encoding problem is already fixed.
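The possible solution above can be sketched as a simple guard. This is an illustrative sketch only - the class and method names below are hypothetical and do not reflect the actual parquet-mr reader code:

```java
// Sketch of the proposed fix: only run the legacy delta-encoding check on
// files that are NOT encrypted. Encrypted files can only be produced by
// parquet-mr 1.12 and newer, where PARQUET-246 is already fixed, so the
// check is unnecessary there (and would touch column metadata we may not
// have keys for, as in PARQUET-2193).
public class DeltaCheckGuard {

    /** Returns true if the reader should run checkDeltaByteArrayProblem. */
    public static boolean shouldCheckDeltaByteArrayProblem(boolean fileIsEncrypted) {
        return !fileIsEncrypted;
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckDeltaByteArrayProblem(true));   // false
        System.out.println(shouldCheckDeltaByteArrayProblem(false));  // true
    }
}
```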



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2023-05-02 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718795#comment-17718795
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Yep, sorry about the delay. This turned out to be more challenging than I 
hoped; a fix at the encryption code level would require some changes in the 
format specification - a rather big deal, and likely unjustified in this case. 
The immediate trigger is the `checkDeltaByteArrayProblem` verification, added 8 
years ago to detect encoding irregularities in older files. For some reason 
this check is done only on files with nested columns, and not on files with 
regular columns (at least in Spark). Maybe the right thing today is to remove 
that verification. I'll check with the community.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, I found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot read 
> the remaining fields of the nested column without keys. 
> Example: 
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> {code}
> In the case class `SquareItem`, the `nestedCol` field is a nested field, and I want 
> to encrypt the field `ic` within it. 
>  
> I also want the footer to be non-encrypted, so that the encrypted 
> parquet file can be used by legacy applications. 
>  
> Encryption is successful. However, when I query the parquet file using spark 
> 3.3.0 without any configuration for parquet encryption set up, I 
> cannot read the non-encrypted field `sic` of `nestedCol`. I was expecting that only 
> the `ic` field of `nestedCol` would not be queryable.
>  
>  
> Reproducer. 
> Spark 3.3.0 Using Spark-shell 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create encrypted data: 
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol, nestedItem(i, i))))
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, trying to access the non-encrypted nested field by opening 
> a new spark-shell: 
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> As you can see, nestedCol.sic is not encrypted, so I was expecting 
> results, but
> I get the below error:
>  
> {code:java}
> Caused by: org.apa

Re: [VOTE] Release Apache Parquet 1.13.0 RC0

2023-04-05 Thread Gidon Gershinsky
+1

Ran the tests.
Thanks Gang and all contributors!

Cheers, Gidon


On Tue, Apr 4, 2023 at 3:54 AM Xinli shang  wrote:

> +1
>
> Verified checksum and signature, and ran internal tests.
>
> Gang, thanks a lot for leading this effort!
>
> On Mon, Apr 3, 2023 at 12:06 AM Gábor Szádovszky  wrote:
>
> > Verified checksum and signature, diffed tarball and repo content,
> > build/unit tests pass.
> > +1 (binding) for releasing this content as 1.13.0
> >
> > NOTE: It is completely fine or even a good practice to release the first
> > minor release from its separate branch (instead of master). Do not forget
> > to merge back CHANGES.md and the new version numbers update
> > (1.14.0-SNAPSHOT) to master, please.
> >
> > Thank you again, Gang for working on this release!
> >
> > On 2023/04/03 05:43:58 "Wang, Yuming" wrote:
> > > +1. Tested this release through Apache Spark UT:
> > https://github.com/apache/spark/pull/40555
> > >
> > > From: Gang Wu 
> > > Date: Monday, April 3, 2023 at 00:40
> > > To: dev@parquet.apache.org 
> > > Subject: [VOTE] Release Apache Parquet 1.13.0 RC0
> > > External Email
> > >
> > > Hi everyone,
> > >
> > > I propose the following RC to be released as the official Apache
> Parquet
> > > 1.13.0 release.
> > >
> > > The commit id is 2e369ed173f66f057c296e63c1bc31d77f294f41
> > > * This corresponds to the tag: apache-parquet-1.13.0-rc0
> > > * https://github.com/apache/parquet-mr/tree/2e369ed173f66f057c296e63c1bc31d77f294f41
> > >
> > > The release tarball, signature, and checksums are here:
> > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.13.0-rc0
> > >
> > > You can find the KEYS file here:
> > > * https://downloads.apache.org/parquet/KEYS
> > >
> > > Binary artifacts are staged in Nexus here:
> > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > >
> > > This release includes important changes listed:
> > > * https://github.com/apache/parquet-mr/blob/parquet-1.13.x/CHANGES.md
> > >
> > > Please download, verify, and test.
> > >
> > > Please vote in the next 72 hours.
> > >
> > > [ ] +1 Release this as Apache Parquet 1.13.0
> > > [ ] +0
> > > [ ] -1 Do not release this because...
> > >
> > > Best regards,
> > > Gang
> > >
> >
>
>
> --
> Xinli Shang
>


Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-28 Thread Gidon Gershinsky
+1

Verified signature and ran the tests. Thanks Gang and all contributors!

Cheers, Gidon


On Tue, Mar 28, 2023 at 5:19 PM Xinli shang  wrote:

> +1
>
> Verified signature and ran internal tests.  Thanks Gang for leading this
> effort!
>
> On Mon, Mar 27, 2023 at 9:38 AM Dongjoon Hyun  wrote:
>
> > +1
> >
> > Thank you, Gang and Yuming.
> >
> > Dongjoon.
> >
> > On 2023/03/27 05:44:14 "Wang, Yuming" wrote:
> > > +1. Tested this release through Spark UT:
> > https://github.com/apache/spark/pull/40555.
> > >
> > >
> > > From: Gang Wu 
> > > Date: Sunday, March 26, 2023 at 22:42
> > > To: dev@parquet.apache.org 
> > > Subject: [VOTE] Release Apache Parquet 1.12.4 RC0
> > > External Email
> > >
> > > Hi everyone,
> > >
> > > I propose the following RC to be released as the official Apache
> Parquet
> > > 1.12.4 release.
> > >
> > > The commit id is 22069e58494e7cb5d50e664c7ffa1cf1468404f8
> > > * This corresponds to the tag: apache-parquet-1.12.4-rc0
> > > * https://github.com/apache/parquet-mr/tree/22069e58494e7cb5d50e664c7ffa1cf1468404f8
> > >
> > > The release tarball, signature, and checksums are here:
> > > * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.4-rc0
> > >
> > > You can find the KEYS file here:
> > > * https://downloads.apache.org/parquet/KEYS
> > >
> > > Binary artifacts are staged in Nexus here:
> > > * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > >
> > > This release includes important changes listed:
> > > * https://github.com/apache/parquet-mr/blob/parquet-1.12.4/CHANGES.md
> > >
> > > Please download, verify, and test.
> > >
> > > Please vote in the next 72 hours.
> > >
> > > [ ] +1 Release this as Apache Parquet 1.12.4
> > > [ ] +0
> > > [ ] -1 Do not release this because...
> > >
> > > Best regards,
> > > Gang
> > >
> >
>
>
> --
> Xinli Shang
>


Re: Gang Wu as new Apache Parquet committer

2023-03-04 Thread Gidon Gershinsky
Congrats Gang!

Cheers, Gidon


On Sat, Mar 4, 2023 at 10:41 PM Micah Kornfield 
wrote:

> Congrats!
>
> On Monday, February 27, 2023, Xinli shang  wrote:
>
> > The Project Management Committee (PMC) for Apache Parquet has invited
> Gang
> > Wu (gangwu) to become a committer and we are pleased to announce that he
> > has accepted.
> >
> > Congratulations and welcome, Gang!
> >
> > --
> > Xinli Shang
> >
>


[jira] [Assigned] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky reassigned PARQUET-2103:
-

Assignee: Gidon Gershinsky

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Minor
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*in encrypted files with plaintext footer*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Affects Version/s: 1.12.3

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*in encrypted files with plaintext footer*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shad

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Priority: Minor  (was: Major)

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Priority: Minor
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*in encrypted files with plaintext footer*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
&

Re: Modular encryption to support arrays and nested arrays

2022-10-31 Thread Gidon Gershinsky
Parquet columnar encryption supports these types. Currently, it requires an
explicit full path for each column to be encrypted.
Your sample will work with
*spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
"k2:rider.list.element.foo,rider.list.element.bar")*

Having said that, there are a couple of things that can be improved (thank
you for running these checks!)

- the exception text is not informative enough, doesn't help much in
correcting the parameters. I've opened a Jira for this (and for updating
the parameter documentation).
The goal is to make the exception print something like:
*Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
Encrypted column [rider] not in file schema column list: [foo] ,
[rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]*
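Building such a message is straightforward. A minimal sketch of the improved exception text (the class and method names here are illustrative, not the actual parquet-mr code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: include the file's schema column list in the exception text, so a
// misconfigured "parquet.encryption.column.keys" entry is easy to correct.
public class ColumnNotFoundMessage {

    /** Builds an error message listing all column paths present in the file schema. */
    public static String build(String badColumn, List<String> schemaColumns) {
        String cols = schemaColumns.stream()
                .map(c -> "[" + c + "]")
                .collect(Collectors.joining(" , "));
        return "Encrypted column [" + badColumn + "] not in file schema column list: " + cols;
    }

    public static void main(String[] args) {
        // Reproduces the example message from the mail above.
        System.out.println(build("rider", Arrays.asList(
                "foo", "rider.list.element.foo", "rider.list.element.bar", "ts", "uuid")));
    }
}
```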

- Configuring a key for all children of a nested schema node (e.g.
"k2:rider."). This had been discussed in the past, but not followed up.
Is this something you'd be interested in building? Alternatively, I can do it,
but it will take me a while to get to.
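Conceptually, such a "k2:rider." entry would just expand to the explicit full
paths that parquet.encryption.column.keys requires today. A minimal sketch of
that expansion (hypothetical helper; schemaPaths stands in for the leaf column
paths of the file schema):

```java
import java.util.List;
import java.util.stream.Collectors;

public class WildcardColumnKeys {
    // Hypothetical expansion of a "k2:rider." style prefix entry into the
    // explicit full column paths the current config format requires.
    static String expand(String keyId, String prefix, List<String> schemaPaths) {
        String cols = schemaPaths.stream()
                .filter(p -> p.startsWith(prefix))
                .collect(Collectors.joining(","));
        return keyId + ":" + cols;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("foo", "rider.list.element.foo",
                "rider.list.element.bar", "ts", "uuid");
        // prints: k2:rider.list.element.foo,rider.list.element.bar
        System.out.println(expand("k2", "rider.", schema));
    }
}
```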


Cheers, Gidon


On Sat, Oct 29, 2022 at 12:45 AM nicolas paris 
wrote:

> Hello,
>
> apparently, modular encryption does not yet support **array** types.
>
> ```scala
> spark.sparkContext.hadoopConfiguration.set("parquet.crypto.factory.class",
> "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.kms.client.class"
> , "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.key.list",
> "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.plaintext.footer",
> "true")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.footer.key",
> "k1")
> spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
> "k2:rider")
>
> val df = spark.sql("select 1 as foo, array(named_struct('foo',2, 'bar',3))
> as rider, 3 as ts, uuid() as uuid")
> df.write.format("parquet").mode("overwrite").save("/tmp/enc")
>
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
> Encrypted column [rider] not in file schema
>
> ```
>
> also, the dotted column path does not support encrypting within nested
> structures mixed with arrays. For example, there is no way I am aware of to
> target "all foo in rider".
>
> ```
> root
>  |-- foo: integer (nullable = false)
>  |-- rider: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- foo: integer (nullable = false)
>  |||-- bar: integer (nullable = false)
>  |-- ts: integer (nullable = false)
>  |-- uuid: string (nullable = false)
> ```
>
> so far, those two issues make arrays of confidential information
> impossible to encrypt, or am I missing something?
>
> Thanks,
>


[jira] [Created] (PARQUET-2208) Add details to nested column encryption config doc and exception text

2022-10-31 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2208:
-

 Summary: Add details to nested column encryption config doc and 
exception text
 Key: PARQUET-2208
 URL: https://issues.apache.org/jira/browse/PARQUET-2208
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.3
Reporter: Gidon Gershinsky


Parquet columnar encryption requires an explicit full path for each column to 
be encrypted. If a partial path is configured, the thrown exception is not 
informative enough and does not help much in correcting the parameters.
The goal is to make the exception print something like:
_Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted 
column [rider] not in file schema column list: [foo] , [rider.list.element.foo] 
, [rider.list.element.bar] , [ts] , [uuid]_
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Modular encryption to return null values instead of Crypto exception when bad key provided

2022-10-27 Thread Gidon Gershinsky
Trying to project columns without authorization can be very costly, for two
reasons:
- unnecessary per-column/file calls to the (remote) KMS service, plus the
cost of per-call authorization checks
- red-flagging unauthorized calls and triggering "breach attempt" alerts

IMO, the best way to handle this is to have a layer on top of parquet that
gets the list of authorized columns for the reader (e.g. from a policy
engine) and allows projecting only those (returning nulls for the others).
Cheers, Gidon


On Thu, Oct 27, 2022 at 1:01 AM nicolas paris 
wrote:

> hello,
>
> as mentioned in several places [1], from a data analyst point of view,
> having null values for encrypted columns when one has no key to decrypt
> is better than getting exceptions, and it eases data exploration by
> allowing select * instead of writing out each allowed column.
>
> I have been digging into the crypto source code to find an easy way to
> catch the crypto exception and turn values to null from the
> DecryptionPropertiesFactory that can be passed to the query engine
> through hadoop configs.
>
> I might be missing something, but I haven't found a way to tell the
> ParquetReader to put nulls and go ahead reading un-encrypted columns
> when something goes wrong with the KMS.
>
> Is such behavior available or are you willing to add such feature at
> parquet level in the future ?
>
> Thanks
>
>
> [1]
>
> https://www.uber.com/en-FR/blog/one-stone-three-birds-finer-grained-encryption-apache-parquet/
>


Re: Parquet modular encryption on nested fields

2022-10-26 Thread Gidon Gershinsky
There is a discussion on this at
https://issues.apache.org/jira/browse/PARQUET-2193 .
Basically, a workaround exists today, please check if it works for you.
Currently, I'm checking options for a more permanent solution.

(in the future, please send emails with text, instead of attaching it as a
file).

Cheers, Gidon


On Tue, Oct 25, 2022 at 1:02 PM nicolas paris 
wrote:

>


[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2022-10-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614917#comment-17614917
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Welcome.

From the sound of it, this might require each file to be processed by one 
thread only (instead of reading a single file by multiple threads); which 
should be ok in typical usecases where one thread/executor reads multiple 
files anyway. But I'll have a deeper look at this.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, I found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot 
> read the remaining fields of the nested column without keys. 
> Example 
> `
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> `{code}
> In the case class `SquareItem` , `nestedCol` field is nested field and I want 
> to encrypt a field `ic` within it. 
>  
> I also want the footer to be non encrypted , so that I can use the encrypted 
> parquet file by legacy applications. 
>  
> Encryption is successful. However, when I query the parquet file using spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted field `sic` of `nestedCol`. I was expecting that only the 
> `nestedCol` field `ic` would not be queryable.
>  
>  
> Reproducer. 
> Spark 3.3.0 Using Spark-shell 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create encrypted data. #  
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol,nestedItem(i,i
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data trying to access non encrypted nested field by opening 
> a new spark-shell
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> As you can see that nestedCol.sic is not encrypted , I was expecting the 
> results, but
> I get the below error
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at

[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2022-09-29 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610868#comment-17610868
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Hmm, looks like this method runs over all columns, projected and not projected:
org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
 

Please check if setting "parquet.split.files" to "false" solves this problem.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, I found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot 
> read the remaining fields of the nested column without keys. 
> Example 
> `
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> `{code}
> In the case class `SquareItem` , `nestedCol` field is nested field and I want 
> to encrypt a field `ic` within it. 
>  
> I also want the footer to be non encrypted , so that I can use the encrypted 
> parquet file by legacy applications. 
>  
> Encryption is successful. However, when I query the parquet file using spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted field `sic` of `nestedCol`. I was expecting that only the 
> `nestedCol` field `ic` would not be queryable.
>  
>  
> Reproducer. 
> Spark 3.3.0 Using Spark-shell 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create encrypted data. #  
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol,nestedItem(i,i
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data trying to access non encrypted nested field by opening 
> a new spark-shell
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> As you can see that nestedCol.sic is not encrypted , I was expecting the 
> results, but
> I get the below error
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
&

[jira] [Commented] (PARQUET-2194) parquet.encryption.plaintext.footer parameter being true, code expects parquet.encryption.footer.key

2022-09-29 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610855#comment-17610855
 ] 

Gidon Gershinsky commented on PARQUET-2194:
---

The footer key is also required in plaintext footer mode - it is used to sign 
the footer: 
https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#class-propertiesdrivencryptofactory
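The behavior described above - a footer key is demanded regardless of the
plaintext footer setting, since it signs rather than encrypts the footer - can
be sketched as a simple validation (illustrative stdlib code, not the actual
PropertiesDrivenCryptoFactory implementation):

```java
public class FooterKeyCheck {
    // Illustrative check mirroring the described behavior: even with
    // parquet.encryption.plaintext.footer=true, a footer key must be set,
    // because it is used to sign (not encrypt) the plaintext footer.
    static void validate(boolean plaintextFooter, String footerKeyId) {
        // plaintextFooter deliberately does not bypass the key requirement
        if (footerKeyId == null || footerKeyId.isEmpty()) {
            throw new IllegalStateException("Undefined footer key");
        }
    }

    public static void main(String[] args) {
        validate(true, "keyz");   // plaintext footer + signing key: accepted
        try {
            validate(true, null); // plaintext footer but no key: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```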

> parquet.encryption.plaintext.footer parameter being true, code expects 
> parquet.encryption.footer.key
> 
>
> Key: PARQUET-2194
> URL: https://issues.apache.org/jira/browse/PARQUET-2194
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> I want my footer in the parquet file to be non encrypted, so I set the 
> _parquet.encryption.plaintext.footer_ property to _true_, but when I tried to 
> run my code, parquet-mr expects a value for the property 
> _parquet.encryption.footer.key_.
> Reproducer
> Spark 3.3.0 
> Download the 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and place it in the spark jars directory 
> using spark-shell
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory") 
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS") 
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==") 
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
>  
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted" 
> val partitionCol = 1 
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0) 
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem) 
> val dataRange = (1 to 100).toList 
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol,nestedItem(i,i 
> squares.toDS().show() 
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).parquet(encryptedParquetPath){code}
> I get the below error. My expectation is that if I configure the footer to 
> be plain text, no footer keys should be needed.
>  
> {code:java}
>  
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Undefined 
> footer key
>   at 
> org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:88)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:155)
>   at 
> org.apache.spark.sql.execution.datasources.BaseDynamicPartitionDataWriter.renewCurrentWriter(FileFormatDataWriter.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionDataSingleWriter.write(FileFormatDataWriter.scala:365)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:331)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:338)
>   ... 9 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2197) Document uniform encryption

2022-09-28 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2197:
-

 Summary: Document uniform encryption
 Key: PARQUET-2197
 URL: https://issues.apache.org/jira/browse/PARQUET-2197
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.3
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Document the hadoop parameter for uniform encryption



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-14 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605098#comment-17605098
 ] 

Gidon Gershinsky commented on PARQUET-1711:
---

[~emkornfield] what do you think about these 3 alternatives?

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite loop 
> due to the "general type" support in protobuf. The current implementation keeps 
> recursing into the 6 possible types defined in protobuf (null, bool, number, 
> string, struct, list) and enters an infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowErrorjava.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-08 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602127#comment-17602127
 ] 

Gidon Gershinsky edited comment on PARQUET-1711 at 9/9/22 5:45 AM:
---

Hi to all on this Jira. Looks like we have a number of alternative solutions to 
this problem today:
[https://github.com/apache/parquet-mr/pull/995]

[https://github.com/apache/parquet-mr/pull/445]

[https://github.com/apache/parquet-mr/pull/988]

Can you take a look and provide your opinion on them?


was (Author: gershinsky):
Hi to all on this Jira. Looks like we have a number of alternative solutions to 
this problem today, 

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite loop 
> due to the "general type" support in protobuf. The current implementation keeps 
> recursing into the 6 possible types defined in protobuf (null, bool, number, 
> string, struct, list) and enters an infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowErrorjava.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-08 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602127#comment-17602127
 ] 

Gidon Gershinsky commented on PARQUET-1711:
---

Hi to all on this Jira. Looks like we have a number of alternative solutions to 
this problem today, 

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite loop 
> due to the "general type" support in protobuf. The current implementation keeps 
> recursing into the 6 possible types defined in protobuf (null, bool, number, 
> string, struct, list) and enters an infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowErrorjava.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2040) Uniform encryption

2022-07-28 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2040.
---
Resolution: Fixed

> Uniform encryption
> --
>
> Key: PARQUET-2040
> URL: https://issues.apache.org/jira/browse/PARQUET-2040
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> PME low-level spec supports using the same encryption key for all columns, 
> which is useful in a number of scenarios. However, this feature is not 
> exposed yet in the high-level API, because its misuse can break the NIST 
> limit on the number of AES GCM operations with one key. We will develop a 
> limit-enforcing code and provide an API for uniform table encryption.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2136) File writer construction with encryptor

2022-07-28 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2136.
---
Resolution: Fixed

> File writer construction with encryptor
> ---
>
> Key: PARQUET-2136
> URL: https://issues.apache.org/jira/browse/PARQUET-2136
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently, a file writer object can be constructed with encryption 
> properties. We need an additional constructor, that can accept an encryptor 
> instead, in order to support lazy materialization of parquet file writers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Review of Q2 Parquet report

2022-07-05 Thread Gidon Gershinsky
nit: MR-1.12.3 released on 202*2*-05-26.

Cheers, Gidon


On Tue, Jul 5, 2022 at 6:04 PM Xinli shang  wrote:

> Hi all,
>
> The report below is what I am going to submit for the past quarter. Please
> review and comment on it. Thanks.
>
>
> ## Description:
> The mission of Parquet is the creation and maintenance of software related
> to
> columnar storage format available to any project in the Apache Hadoop
> ecosystem
>
> ## Issues:
> no
>
> ## Membership Data:
> Apache Parquet was founded 2015-04-21 (7 years ago)
> There are currently 37 committers and 27 PMC members in this project.
> The Committer-to-PMC ratio is roughly 5:4.
>
> Community changes, past quarter:
> - No new PMC members. Last addition was Gidon Gershinsky on 2021-11-23.
> - No new committers. Last addition was Gidon Gershinsky on 2021-04-05.
>
> ## Project Activity:
> MR-1.12.3 was released on 2021-05-26.
> MR-1.11.2 was released on 2021-10-06.
> MR-1.12.2 was released on 2021-10-06.
> MR-1.12.0 was released on 2021-03-25.
>
> ## Community Health:
> dev@parquet.apache.org had a 65% decrease in traffic in the past quarter
> (270 emails compared to 751)
> 27 issues opened in JIRA, past quarter (no change)
> 8 issues closed in JIRA, past quarter (-52% change)
> 38 commits in the past quarter (18% increase)
> 12 code contributors in the past quarter (20% increase)
> 27 PRs opened on GitHub, past quarter (-20% change)
> 17 PRs closed on GitHub, past quarter (-43% change)
>
>
> --
> Xinli Shang
>
-- 

Cheers, Gidon


[jira] [Resolved] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2120.
---
Resolution: Fixed

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
> Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARY  S   _    1     46.00 B  0   "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARY  S _ R    200   0.34 B   0   "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.
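The fix described above amounts to guarding against the `null` return before dereferencing the page. A minimal sketch of that pattern, using simplified stand-in types rather than the real parquet-mr classes (`DictionaryPage` and `readDictionaryPage` here are stubs, not the actual API):

```java
// Sketch of the null-guard pattern; DictionaryPage and readDictionaryPage
// are stand-ins for the parquet-mr types, not the real implementation.
public class DictionaryGuard {
    // Stand-in for org.apache.parquet.column.page.DictionaryPage.
    static class DictionaryPage {
        String getEncoding() { return "PLAIN_DICTIONARY"; }
    }

    // Stand-in reader: returns null when a column chunk has no dictionary,
    // which is the case ShowDictionaryCommand failed to handle.
    static DictionaryPage readDictionaryPage(boolean hasDictionary) {
        return hasDictionary ? new DictionaryPage() : null;
    }

    static String describeDictionary(boolean hasDictionary) {
        DictionaryPage page = readDictionaryPage(hasDictionary);
        if (page == null) {
            // Skip instead of dereferencing: avoids the NPE and lets the
            // command continue to row groups that do have dictionaries.
            return "no dictionary";
        }
        return page.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println(describeDictionary(true));   // PLAIN_DICTIONARY
        System.out.println(describeDictionary(false));  // no dictionary
    }
}
```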



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-06-21 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556828#comment-17556828
 ] 

Gidon Gershinsky commented on PARQUET-2120:
---

[~shangxinli] and the Parquet community, can you assign this Jira to [~rshkv]?

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
> Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARY  S   _    1     46.00 B  0   "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARY  S _ R    200   0.34 B   0   "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.





[jira] [Resolved] (PARQUET-2148) Enable uniform decryption with plaintext footer

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2148.
---
Resolution: Fixed

> Enable uniform decryption with plaintext footer
> ---
>
> Key: PARQUET-2148
> URL: https://issues.apache.org/jira/browse/PARQUET-2148
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>    Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently, uniform decryption is not enabled in the plaintext footer mode - 
> for no good reason. Column metadata is available, we just need to decrypt and 
> use it.





[jira] [Resolved] (PARQUET-2144) Fix ColumnIndexBuilder for notIn predicate

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2144.
---
Resolution: Fixed

> Fix ColumnIndexBuilder for notIn predicate
> --
>
> Key: PARQUET-2144
> URL: https://issues.apache.org/jira/browse/PARQUET-2144
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
> Fix For: 1.12.3
>
>
> Column Index is not built correctly for notIn predicate. Need to fix the bug.





[jira] [Resolved] (PARQUET-2145) Release 1.12.3

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2145.
---
Resolution: Fixed

> Release 1.12.3
> --
>
> Key: PARQUET-2145
> URL: https://issues.apache.org/jira/browse/PARQUET-2145
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>    Reporter: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>






[jira] [Commented] (PARQUET-2145) Release 1.12.3

2022-06-21 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556825#comment-17556825
 ] 

Gidon Gershinsky commented on PARQUET-2145:
---

This version is already released, 
[https://parquet.incubator.apache.org/blog/2022/05/26/1.12.3/]

 

Let's indeed close this Jira.

> Release 1.12.3
> --
>
> Key: PARQUET-2145
> URL: https://issues.apache.org/jira/browse/PARQUET-2145
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>    Reporter: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>






[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-06-13 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553425#comment-17553425
 ] 

Gidon Gershinsky commented on PARQUET-2117:
---

[~sha...@uber.com] Could you add [~prakharjain09] to the Parquet contributors?

> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a
> parquet file in columnar fashion or record-by-record.
> It would be great to extend them to also support a rowPosition API that can
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet
> table (e.g. Spark/Hive).
> There are multiple projects in the parquet eco-system which can benefit from 
> such a functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already as it relies on low level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980





Re: [VOTE] Release Apache Parquet 1.12.3 RC1

2022-05-22 Thread Gidon Gershinsky
+1. Downloaded, verified and tested.

Cheers, Gidon


On Fri, May 20, 2022 at 8:49 PM Xinli shang  wrote:

> Hi everyone,
>
>
> I propose the following RC to be released as the official Apache Parquet
>  1.12.3 release.
>
>
> The commit id is f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b
>
> * This corresponds to the tag: apache-parquet-1.12.3-rc1
>
> *
> https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.12.3-rc1
>
>
> The release tarball, signature, and checksums are here:
>
> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.3-rc1
>
>
> You can find the KEYS file here:
>
> * https://dist.apache.org/repos/dist/release/parquet/KEYS
>
>
> Binary artifacts are staged in Nexus here:
>
> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>
>
> This release includes important changes listed at
> https://github.com/apache/parquet-mr/blob/parquet-1.12.3/CHANGES.md.
>
>
> Please download, verify, and test.
>
>
> Please vote in the next 72 hours.
>
>
> [ ] +1 Release this as Apache Parquet 1.12.3
>
> [ ] +0
>
> [ ] -1 Do not release this because...
>
>
>
> 
>
> Xinli Shang
>
> PMC Chair of Apache Parquet
>
> TLM Uber Data Infra
>


[jira] [Updated] (PARQUET-2101) Fix wrong descriptions about the default block size

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2101:
--
Fix Version/s: 1.12.3

> Fix wrong descriptions about the default block size
> ---
>
> Key: PARQUET-2101
> URL: https://issues.apache.org/jira/browse/PARQUET-2101
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-mr, parquet-protobuf
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Trivial
> Fix For: 1.12.3
>
>
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L90
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L240
> https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L80
> These javadocs say the default block size is 50 MB but it's actually 128 MB.





[jira] [Updated] (PARQUET-2081) Encryption translation tool - Parquet-hadoop

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2081:
--
Fix Version/s: 1.12.3
   (was: 1.13.0)

> Encryption translation tool - Parquet-hadoop
> 
>
> Key: PARQUET-2081
> URL: https://issues.apache.org/jira/browse/PARQUET-2081
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.3
>
>
> This implements the core part of the Encryption translation tool in 
> parquet-hadoop. After this, we will have another Jira/PR for parquet-cli to 
> integrate with key tools for encryption properties. 





[jira] [Updated] (PARQUET-2102) Typo in ColumnIndexBase toString

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2102:
--
Fix Version/s: 1.12.3

> Typo in ColumnIndexBase toString
> 
>
> Key: PARQUET-2102
> URL: https://issues.apache.org/jira/browse/PARQUET-2102
> Project: Parquet
>  Issue Type: Bug
>Reporter: Ryan Rupp
>Assignee: Ryan Rupp
>Priority: Trivial
> Fix For: 1.12.3
>
>
> Trivial thing but noticed [here|https://github.com/trinodb/trino/issues/9890] 
> since ColumnIndexBase.toString() was used in a wrapped exception message - 
> "boundary" has a typo (boudary).





[jira] [Updated] (PARQUET-2040) Uniform encryption

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2040:
--
Fix Version/s: 1.12.3

> Uniform encryption
> --
>
> Key: PARQUET-2040
> URL: https://issues.apache.org/jira/browse/PARQUET-2040
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> PME low-level spec supports using the same encryption key for all columns, 
> which is useful in a number of scenarios. However, this feature is not 
> exposed yet in the high-level API, because its misuse can break the NIST 
> limit on the number of AES GCM operations with one key. We will develop a 
> limit-enforcing code and provide an API for uniform table encryption.
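The limit-enforcing idea can be sketched as a simple per-key operation budget. This is an illustration only, not parquet-mr code; the `1L << 32` default reflects the commonly cited NIST SP 800-38D bound on GCM invocations with one key and is an assumption here, not Parquet's actual constant:

```java
// Illustrative sketch of a per-key AES-GCM operation budget. Not the
// parquet-mr implementation; the default limit is an assumption based on
// the commonly cited NIST SP 800-38D figure.
public class GcmKeyBudget {
    static final long DEFAULT_MAX_OPS = 1L << 32;

    private final long maxOps;
    private long ops = 0;

    GcmKeyBudget(long maxOps) { this.maxOps = maxOps; }
    GcmKeyBudget() { this(DEFAULT_MAX_OPS); }

    // Returns true while the current key may still be used; false signals
    // the caller to rotate to a fresh key before the next encrypt call.
    synchronized boolean tryUse() {
        if (ops >= maxOps) return false;
        ops++;
        return true;
    }

    public static void main(String[] args) {
        GcmKeyBudget budget = new GcmKeyBudget(2); // tiny limit for demo
        System.out.println(budget.tryUse()); // true
        System.out.println(budget.tryUse()); // true
        System.out.println(budget.tryUse()); // false: rotate the key
    }
}
```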





[jira] [Updated] (PARQUET-2076) Improve Travis CI build Performance

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2076:
--
Fix Version/s: 1.12.3

> Improve Travis CI build Performance
> ---
>
> Key: PARQUET-2076
> URL: https://issues.apache.org/jira/browse/PARQUET-2076
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Chen Zhang
>Priority: Trivial
> Fix For: 1.12.3
>
>
> According to [Common Build Problems - Travis CI 
> (travis-ci.com)|https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received],
>  we should carefully use travis_wait, as it may make the build unstable and 
> extend the build time.





[jira] [Updated] (PARQUET-2107) Travis failures

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2107:
--
Fix Version/s: 1.12.3

> Travis failures
> ---
>
> Key: PARQUET-2107
> URL: https://issues.apache.org/jira/browse/PARQUET-2107
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.3
>
>
> There have been Travis failures in our PRs for a while. See e.g. 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286





[jira] [Updated] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2106:
--
Fix Version/s: 1.12.3

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> ---
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects – It's taking up to *16%* of total amount of 
> allocations in our benchmarks, putting substantial pressure on a Garbage 
> Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We're proposing to adjust lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot-path of every Parquet 
> write, therefore causing substantial churn amplification.
>  
>  
>  
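The proposal above can be illustrated with a self-contained sketch: compare bytes as unsigned values directly, with no per-call `ByteBuffer.wrap` allocation. This shows the technique only; it is not the actual parquet-mr `BinaryComparator` code:

```java
// Allocation-free lexicographic comparison over byte arrays: no ByteBuffer
// objects are created per comparison, so nothing short-lived hits the heap
// on the write hot path. Illustrative sketch, not parquet-mr's comparator.
public class UnsignedLexCompare {
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Byte.toUnsignedInt avoids sign-extension bugs for bytes >= 0x80.
            int cmp = Integer.compare(Byte.toUnsignedInt(a[i]),
                                      Byte.toUnsignedInt(b[i]));
            if (cmp != 0) return cmp;
        }
        // Equal prefixes: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        System.out.println(compare(new byte[]{1, 2}, new byte[]{1, 3}) < 0);      // true
        System.out.println(compare(new byte[]{(byte) 0xFF}, new byte[]{1}) > 0);  // true: 0xFF unsigned > 1
    }
}
```

On JDK 9+ the same unsigned semantics are available via `java.util.Arrays.compareUnsigned`, which is also allocation-free.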





[jira] [Updated] (PARQUET-2105) Refactor the test code of creating the test file

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2105:
--
Fix Version/s: 1.12.3

> Refactor the test code of creating the test file 
> -
>
> Key: PARQUET-2105
> URL: https://issues.apache.org/jira/browse/PARQUET-2105
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.3
>
>
> In the tests, there are many places that need to create a test parquet file 
> with different settings. Currently, each test creates the file with its own code. 
> It would be better to have a test file builder for that. 





[jira] [Updated] (PARQUET-2112) Fix typo in MessageColumnIO

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2112:
--
Fix Version/s: 1.12.3
   (was: 1.13.0)

> Fix typo in MessageColumnIO
> ---
>
> Key: PARQUET-2112
> URL: https://issues.apache.org/jira/browse/PARQUET-2112
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.3
>
>
> Typo of the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes'





[jira] [Updated] (PARQUET-2128) Bump Thrift to 0.16.0

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2128:
--
Fix Version/s: 1.12.3

> Bump Thrift to 0.16.0
> -
>
> Key: PARQUET-2128
> URL: https://issues.apache.org/jira/browse/PARQUET-2128
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
> Fix For: 1.12.3
>
>
> Thrift 0.16.0 has been released 
> https://github.com/apache/thrift/releases/tag/v0.16.0





[jira] [Updated] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2120:
--
Fix Version/s: 1.12.3

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
> Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARY  S   _    1     46.00 B  0   "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARY  S _ R    200   0.34 B   0   "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.





[jira] [Updated] (PARQUET-2129) Add uncompressedSize to "meta" output

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2129:
--
Fix Version/s: 1.12.3

> Add uncompressedSize to "meta" output
> -
>
> Key: PARQUET-2129
> URL: https://issues.apache.org/jira/browse/PARQUET-2129
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
> Fix For: 1.12.3
>
>
> The `uncompressedSize` is currently not printed in the output of the parquet 
> meta command. This PR adds the uncompressedSize to the output. 
> This was also reported by Deepak Gangwar. 





[jira] [Updated] (PARQUET-2121) Remove descriptions for the removed modules

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2121:
--
Fix Version/s: 1.12.3

> Remove descriptions for the removed modules
> ---
>
> Key: PARQUET-2121
> URL: https://issues.apache.org/jira/browse/PARQUET-2121
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
> Fix For: 1.12.3
>
>
> PARQUET-2020 removed some deprecated modules, but the related descriptions 
> still remain in some documents. They should be removed since their existence 
> is misleading.





[jira] [Updated] (PARQUET-2136) File writer construction with encryptor

2022-05-18 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2136:
--
Fix Version/s: 1.12.3

> File writer construction with encryptor
> ---
>
> Key: PARQUET-2136
> URL: https://issues.apache.org/jira/browse/PARQUET-2136
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently, a file writer object can be constructed with encryption 
> properties. We need an additional constructor, that can accept an encryptor 
> instead, in order to support lazy materialization of parquet file writers.





[jira] [Updated] (PARQUET-2144) Fix ColumnIndexBuilder for notIn predicate

2022-05-18 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2144:
--
Fix Version/s: 1.12.3

> Fix ColumnIndexBuilder for notIn predicate
> --
>
> Key: PARQUET-2144
> URL: https://issues.apache.org/jira/browse/PARQUET-2144
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
> Fix For: 1.12.3
>
>
> Column Index is not built correctly for notIn predicate. Need to fix the bug.





[jira] [Updated] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-05-18 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2127:
--
Fix Version/s: 1.12.3

> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
> Fix For: 1.12.3
>
>
> The embedded jackson-databind 2.11.4 has a security risk: possible DoS when 
> using JDK serialization to serialize JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]); upgrading to 
> 2.13.1 fixes this.





[jira] [Created] (PARQUET-2148) Enable uniform decryption with plaintext footer

2022-05-16 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2148:
-

 Summary: Enable uniform decryption with plaintext footer
 Key: PARQUET-2148
 URL: https://issues.apache.org/jira/browse/PARQUET-2148
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky
 Fix For: 1.12.3


Currently, uniform decryption is not enabled in the plaintext footer mode - for 
no good reason. Column metadata is available, we just need to decrypt and use 
it.





Re: Meeting notes for Parquet monthly sync - 4/27/2022

2022-05-04 Thread Gidon Gershinsky
Hi all, we're starting to work on this part:

*Release 1.12.3: SNAPSHOT release*

Meaning that technically there will be two releases, starting with an
unofficial snapshot of the current master for completing dependent PRs in
other projects - followed by the official parquet-mr-1.12.3 release.

For the latter, I've created
https://issues.apache.org/jira/browse/PARQUET-2145 . Feel free to add
relevant jiras as dependencies for this one (preferably if their PRs are
already merged in the master branch). I'll also make a pass over the recent
commits / jiras.


Cheers, Gidon


On Wed, Apr 27, 2022 at 8:03 PM Xinli shang  wrote:

> 4/27/2022
>
> Attendees (Timothy Miller, Vinoo Ganesh, Satish K, Gidon Gershinsky, Xinli
> Shang, Huaxin Gao)
>
> 1. Cell-Level encryption
>    1. Internal implementation and rollout
>    2. Welcome new comments
> 2. Release 1.12.3
>    1. SNAPSHOT release - Gidon will take the lead
> 3. ID resolution
>    1. Huaxin will address Ryan's comments
> 4. UUID support for parquet-cli
>    1. See some exceptions when running the tool. Timothy will investigate it.
> 5. The next meeting will be at 8:30 am on Tuesday
>
>
> --
> Xinli Shang
> VP Apache Parquet PMC Chair
> Tech Lead Manager @ Uber Data Infra
>


[jira] [Created] (PARQUET-2145) Release 1.12.3

2022-05-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2145:
-

 Summary: Release 1.12.3
 Key: PARQUET-2145
 URL: https://issues.apache.org/jira/browse/PARQUET-2145
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Gidon Gershinsky
 Fix For: 1.12.3








[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher

2022-04-24 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526997#comment-17526997
 ] 

Gidon Gershinsky commented on PARQUET-2098:
---

[~theosib-amazon] I got ~half of this done (the code; not the unit tests yet). But in 
the meantime, it became unclear whether we need this functionality in the upcoming 
release. Do you have a use case for it?

> Add more methods into interface of BlockCipher
> --
>
> Key: PARQUET-2098
> URL: https://issues.apache.org/jira/browse/PARQUET-2098
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Currently the BlockCipher interface has methods that don't let the caller specify 
> a length/offset. In some use cases, like Presto, it is necessary to pass in a byte 
> array where the data to be encrypted occupies only part of the array. So 
> we need to add a new method, something like the one below, for decrypt. Similar 
> methods might be needed for encrypt. 
> byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
> byte[] aad);
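A hedged sketch of what such an offset/length-aware decrypt could look like, using plain JCE AES-GCM rather than the actual parquet-mr `BlockCipher` implementation. The 12-byte IV prepended to the ciphertext slice, the 128-bit tag, and the helper names are assumptions for illustration:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Illustrative sketch, not the parquet-mr BlockCipher implementation.
// Assumes each ciphertext slice is laid out as [12-byte IV][ciphertext+tag].
public class OffsetDecrypt {
    static byte[] decrypt(SecretKey key, byte[] buf, int off, int len, byte[] aad)
            throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        // First 12 bytes of the slice are the IV; the rest is ciphertext + tag.
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, buf, off, 12));
        if (aad != null) cipher.updateAAD(aad);
        return cipher.doFinal(buf, off + 12, len - 12);
    }

    // Helper for the demo: encrypt and place IV + ciphertext at an offset
    // inside a larger array, as in the Presto use case where the data
    // occupies only part of the buffer.
    static byte[] encryptWithPadding(SecretKey key, byte[] plain, byte[] aad, int pad)
            throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        if (aad != null) cipher.updateAAD(aad);
        byte[] ct = cipher.doFinal(plain);
        byte[] buf = new byte[pad + 12 + ct.length + pad];
        System.arraycopy(iv, 0, buf, pad, 12);
        System.arraycopy(ct, 0, buf, pad + 12, ct.length);
        return buf;
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] aad = "footer".getBytes();
        byte[] buf = encryptWithPadding(key, "column chunk".getBytes(), aad, 7);
        // Slice starts at offset 7 and spans IV + ciphertext + tag.
        byte[] out = decrypt(key, buf, 7, buf.length - 14, aad);
        System.out.println(new String(out)); // column chunk
    }
}
```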





[jira] [Created] (PARQUET-2136) File writer construction with encryptor

2022-04-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2136:
-

 Summary: File writer construction with encryptor
 Key: PARQUET-2136
 URL: https://issues.apache.org/jira/browse/PARQUET-2136
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.2
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Currently, a file writer object can be constructed with encryption properties. 
We need an additional constructor, that can accept an encryptor instead, in 
order to support lazy materialization of parquet file writers.





Re: Parquet Column Resolution by ID

2022-02-12 Thread Gidon Gershinsky
Thanks Xinli, works well now. I've reviewed the doc.

Cheers, Gidon


On Fri, Feb 11, 2022 at 7:21 PM Xinli shang  wrote:

> Hi Gidon,
>
> I just shared the 'comment' permission for everybody. Let me know if you
> still have issues with it.
>
> Xinli
>
> On Thu, Feb 10, 2022 at 9:45 PM Gidon Gershinsky  wrote:
>
> > Hi Huaxin,
> >
> > Can you open this document for comments?
> >
> > Cheers, Gidon
> >
> >
> > On Fri, Feb 11, 2022 at 6:01 AM huaxin gao 
> wrote:
> >
> > > Hi Parquet community,
> > >
> > > Xinli and I drafted a design doc to support ID based column resolution
> in
> > > Parquet. Here is the link
> > > <
> > >
> >
> https://docs.google.com/document/d/1hDLFIKuVhhnTNpA5bTo4nfD-MUZz8Iq4V9FXrr1WPsw/edit?usp=sharing
> > > >.
> > > We'd like to start a discussion on the doc and any feedback is welcome!
> > >
> > > Thanks,
> > > Huaxin
> > >
> >
>
>
> --
> Xinli Shang
>


Re: Parquet Column Resolution by ID

2022-02-10 Thread Gidon Gershinsky
Hi Huaxin,

Can you open this document for comments?

Cheers, Gidon


On Fri, Feb 11, 2022 at 6:01 AM huaxin gao  wrote:

> Hi Parquet community,
>
> Xinli and I drafted a design doc to support ID based column resolution in
> Parquet. Here is the link
> <
> https://docs.google.com/document/d/1hDLFIKuVhhnTNpA5bTo4nfD-MUZz8Iq4V9FXrr1WPsw/edit?usp=sharing
> >.
> We'd like to start a discussion on the doc and any feedback is welcome!
>
> Thanks,
> Huaxin
>


[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher

2022-01-27 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483575#comment-17483575
 ] 

Gidon Gershinsky commented on PARQUET-2098:
---

sure, I can take this one

> Add more methods into interface of BlockCipher
> --
>
> Key: PARQUET-2098
> URL: https://issues.apache.org/jira/browse/PARQUET-2098
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Currently, the BlockCipher interface has methods that do not let the caller 
> specify a length/offset. In some use cases, such as Presto, a byte array is 
> passed in where the data to be encrypted occupies only part of the array. So 
> we need to add a new method, something like the one below, for decrypt. 
> Similar methods might be needed for encrypt. 
> byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
> byte[] aad);
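The motivation can be shown with plain JCE AES-GCM (the cipher Parquet modular encryption uses) rather than the BlockCipher interface itself; the class and method names here are illustrative. The offset/length variant of `Cipher.doFinal` decrypts a ciphertext that occupies only part of a larger buffer, with no intermediate copy.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class OffsetDecryptDemo {
    // Decrypt a ciphertext that occupies only part of a larger buffer,
    // without first copying it out: the offset/length variant of doFinal.
    static byte[] decryptAt(SecretKey key, byte[] iv, byte[] aad,
                            byte[] buffer, int offset, int length) throws Exception {
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        dec.updateAAD(aad);
        return dec.doFinal(buffer, offset, length);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        byte[] aad = "column0".getBytes(StandardCharsets.UTF_8);

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        enc.updateAAD(aad);
        byte[] plain = "page bytes".getBytes(StandardCharsets.UTF_8);
        byte[] ct = enc.doFinal(plain);

        // The ciphertext sits at offset 16 inside a larger buffer, as a
        // caller such as Presto might hand it over.
        byte[] buffer = new byte[16 + ct.length + 8];
        System.arraycopy(ct, 0, buffer, 16, ct.length);

        byte[] out = decryptAt(key, iv, aad, buffer, 16, ct.length);
        System.out.println(Arrays.equals(out, plain)); // true
    }
}
```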





[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-24 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448596#comment-17448596
 ] 

Gidon Gershinsky commented on PARQUET-2103:
---

[~gszadovszky] thanks for pointing us in the right direction. We can check that a 
file is encrypted, and then skip printing its column metadata; this solves the 
problem at hand. We will still be able to print the file-wide metadata (as 
opposed to the per-column metadata, which is encrypted with column-specific 
keys). I'll start working on a patch. 

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*in encrypted files with plaintext footer*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.par

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-24 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Description: 
In debug mode, this code 
{{if (LOG.isDebugEnabled()) {}}
{{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
{{}}}
called in 
{{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}

 

_*in encrypted files with plaintext footer*_ 

triggers an exception:

 
{{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
Null File Decryptor     }}

{{    at 
org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]}}
{{    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]}}
{{    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]}}
{{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    ... 23 more}}

  was:
In debug mode, this code 
{{if (LOG.isDebugEnabled()) {}}
{{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
{{}}}
called in 
{{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}

 

_*for unencrypted files

[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-22 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447372#comment-17447372
 ] 

Gidon Gershinsky commented on PARQUET-2103:
---

[~gszadovszky] [~sha...@uber.com] I will appreciate your advice on the 
solution options (here or at the sync call). This seems to be a print with a 
blind reflection loop that calls all nested classes / methods in an object. 
Since v1.12.0, there is an EncryptedColumnChunkMetaData class inside 
ColumnChunkMetaData. Creating an instance of it and calling the "decrypt" 
method is not a good idea, for either unencrypted or encrypted files.
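The failure mode can be reproduced in miniature without Jackson (all names below are illustrative stand-ins, not the parquet-mr classes): a serializer that reflectively invokes every getter trips over a getter that needs state, here a decryptor, that is absent; skipping such fields, analogous to the direction discussed above, keeps the rest of the printout intact.

```java
import java.lang.reflect.Method;

public class ReflectionPrintDemo {
    // Stand-in for column chunk metadata: one getter needs a decryptor.
    static class ChunkMeta {
        public String getPath() { return "a.b.c"; }
        public String getEncodingStats() {
            throw new IllegalStateException("[id]. Null File Decryptor");
        }
    }

    // Blindly invoke every zero-arg getter, as a JSON bean serializer does.
    static String describe(Object bean) {
        StringBuilder sb = new StringBuilder();
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().startsWith("get") && m.getParameterCount() == 0
                    && !m.getName().equals("getClass")) {
                try {
                    sb.append(m.getName()).append('=').append(m.invoke(bean)).append(' ');
                } catch (Exception e) {
                    // Skip fields whose getters require a decryptor instead
                    // of letting the whole printout fail.
                    sb.append(m.getName()).append("=<encrypted, skipped> ");
                }
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe(new ChunkMeta()));
    }
}
```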

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*for unencrypted files*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSe

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-16 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Description: 
In debug mode, this code 
{{if (LOG.isDebugEnabled()) {}}
{{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
{{}}}
called in 
{{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}

 

_*for unencrypted files*_ 

triggers an exception:

 
{{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
Null File Decryptor     }}

{{    at 
org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]}}
{{    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]}}
{{    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]}}
{{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    ... 23 more}}

  was:
In debug mode, this code 
if (LOG.isDebugEnabled()) \{
  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
}
called in org.apache.parquet.format.converter.

ParquetMetadataConverter.

readParquetMetadata()

 

_*for unencrypted files*_ 

triggers an exception:

 
Caused

[jira] [Created] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-16 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2103:
-

 Summary: crypto exception in print toPrettyJSON
 Key: PARQUET-2103
 URL: https://issues.apache.org/jira/browse/PARQUET-2103
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.12.2, 1.12.1, 1.12.0
Reporter: Gidon Gershinsky


In debug mode, this code 
if (LOG.isDebugEnabled()) \{
  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
}
called in org.apache.parquet.format.converter.

ParquetMetadataConverter.

readParquetMetadata()

 

_*for unencrypted files*_ 

triggers an exception:

 
Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null 
File Decryptor     at 
org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
 ~[parquet-hadoop-1.12.0jar:1.12.0]
    at 
org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
 ~[parquet-hadoop-1.12.0jar:1.12.0]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]
    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]
    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
 ~[parquet-hadoop-1.12.0jar:1.12.0]
    ... 23 more





[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-28 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421323#comment-17421323
 ] 

Gidon Gershinsky commented on PARQUET-2080:
---

Oh, sorry, done.

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089
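A common workaround sketch for an unreliable `RowGroup.file_offset` (class and method names here are illustrative, not parquet-mr API): derive the row group's start from its column chunks' page offsets, where a dictionary page, when present, precedes the first data page.

```java
public class RowGroupOffsetSketch {
    // Derive a column chunk's start offset from its page offsets: the
    // dictionary page, when present, precedes the first data page.
    static long chunkStart(long dataPageOffset, long dictionaryPageOffset) {
        boolean hasDict = dictionaryPageOffset > 0 && dictionaryPageOffset < dataPageOffset;
        return hasDict ? dictionaryPageOffset : dataPageOffset;
    }

    // A row group starts where its earliest column chunk starts; deriving
    // this avoids trusting the unreliable RowGroup.file_offset field.
    static long rowGroupStart(long[] dataPageOffsets, long[] dictPageOffsets) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < dataPageOffsets.length; i++) {
            min = Math.min(min, chunkStart(dataPageOffsets[i], dictPageOffsets[i]));
        }
        return min;
    }

    public static void main(String[] args) {
        long[] data = {1200L, 3000L};
        long[] dict = {1000L, 0L}; // second chunk has no dictionary page
        System.out.println(rowGroupStart(data, dict)); // 1000
    }
}
```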





[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-28 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421193#comment-17421193
 ] 

Gidon Gershinsky commented on PARQUET-2080:
---

Hi [~gszadovszky], I've prepared a short writeup on this alternative solution, 
with a discussion of the tradeoffs. After writing it, my feeling is that the 
trade-off is not in favor of this alternative option; but [here it 
goes|https://docs.google.com/document/d/1zr6-4em8C8DGi-D3jGosQe2gvJKluat-8uUbS0y7F-0/edit?usp=sharing],
 just to cover all bases. I will appreciate your opinion on this.

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089





Re: [VOTE] Release Apache Parquet 1.12.1 RC1

2021-09-15 Thread Gidon Gershinsky
A late +1 (non-binding).

- ran build and test, everything was ok
- ran extra tests with encryption, standalone and Spark, everything passed

Thanks Xinli and all for contributing to this release!

Cheers, Gidon


On Wed, Sep 15, 2021 at 6:53 AM Xinli shang  wrote:

> The vote to release 1.12.1 RC1 as Apache Parquet MR 1.12.1 is PASSED with
> the required three +1 binding votes and one +1 non-binding vote. (There
> were no -1 or 0 votes.)
>
> Thank you all who verified and voted!
>
> I'm going forward with the release process soon.
>
> On Tue, Sep 14, 2021 at 5:23 PM Julien Le Dem  wrote:
>
> > +1 (binding)
> > I verified the signature
> > the build and tests pass (with java 8)
> >
> > On Tue, Sep 14, 2021 at 4:14 PM Xinli shang 
> > wrote:
> >
> > > I also vote +1 (binding). Thanks everybody for verifying!
> > >
> > > On Tue, Sep 14, 2021 at 2:00 PM Chao Sun  wrote:
> > >
> > > > +1 (non-binding).
> > > >
> > > > - tested on the Spark side and all tests passed, including the issue
> in
> > > > SPARK-36696
> > > > - verified signature and checksum of the release
> > > >
> > > > Thanks Xinli for driving the release work!
> > > >
> > > > Chao
> > > >
> > > > On Tue, Sep 14, 2021 at 3:01 AM Gabor Szadovszky 
> > > wrote:
> > > >
> > > > > Thanks for the new RC, Xinli.
> > > > >
> > > > > The content seems correct to me. The checksum and sign are correct.
> > > Unit
> > > > > tests pass.
> > > > >
> > > > > My vote is +1 (binding)
> > > > >
> > > > > On Mon, Sep 13, 2021 at 8:11 PM Xinli shang
>  > >
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > >
> > > > > > I propose the following RC to be released as the official Apache
> > > > Parquet
> > > > > > 1.12.1 release.
> > > > > >
> > > > > >
> > > > > > The commit id is 2a5c06c58fa987f85aa22170be14d927d5ff6e7d
> > > > > >
> > > > > > * This corresponds to the tag: apache-parquet-1.12.1-rc1
> > > > > >
> > > > > > *
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/tree/2a5c06c58fa987f85aa22170be14d927d5ff6e7d
> > > > > >
> > > > > >
> > > > > > The release tarball, signature, and checksums are here:
> > > > > >
> > > > > > *
> > > > > >
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.1-rc1/
> > > > > >
> > > > > >
> > > > > > You can find the KEYS file here:
> > > > > >
> > > > > > * *https://dist.apache.org/repos/dist/release/parquet/KEYS
> > > > > > *
> > > > > >
> > > > > >
> > > > > > Binary artifacts are staged in Nexus here:
> > > > > >
> > > > > > *
> > > > >
> > >
> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > >
> > > > > >
> > > > > > This release includes important changes listed
> > > > > >
> > https://github.com/apache/parquet-mr/blob/parquet-1.12.x/CHANGES.md
> > > > > >
> > > > > >
> > > > > > Please download, verify, and test.
> > > > > >
> > > > > >
> > > > > > Please vote in the next 72 hours.
> > > > > >
> > > > > >
> > > > > > [ ] +1 Release this as Apache Parquet 1.12.1
> > > > > >
> > > > > > [ ] +0
> > > > > >
> > > > > > [ ] -1 Do not release this because...
> > > > > >
> > > > > > --
> > > > > > Xinli Shang | Tech Lead Manager @ Uber Data Infra
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Xinli Shang
> > >
> >
>
>
> --
> Xinli Shang
>


[jira] [Updated] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-14 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2080:
--
Description: 
Due to PARQUET-2078 RowGroup.file_offset is not reliable.

This field is also wrongly calculated in the C++ oss parquet implementation 
PARQUET-2089

  was:
Due to PARQUET-2078 RowGroup.file_offset is not reliable.

This field is also wrongly calculated in the C++ oss parquet (Arrow rep).


> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
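Since RowGroup.file_offset cannot be trusted (PARQUET-2078), readers can derive a row group's start position from the page offsets of its first column chunk instead. The following is a minimal Java sketch of that derivation; the method and field names are illustrative stand-ins for the Thrift metadata fields, not parquet-mr API.

```java
public class RowGroupStart {
    /**
     * Start offset of a column chunk, derived from its page offsets.
     * A dictionaryPageOffset <= 0 is treated as "no dictionary page";
     * these are illustrative stand-ins for the Thrift metadata fields.
     */
    public static long columnChunkStart(long dictionaryPageOffset, long dataPageOffset) {
        return dictionaryPageOffset > 0
                ? Math.min(dictionaryPageOffset, dataPageOffset)
                : dataPageOffset;
    }

    public static void main(String[] args) {
        // Row group start = start of its first column chunk, rather than
        // trusting the unreliable RowGroup.file_offset field.
        System.out.println(columnChunkStart(100, 140)); // dictionary page comes first -> 100
        System.out.println(columnChunkStart(0, 140));   // no dictionary page -> 140
    }
}
```

The key point is that the derived offset stays consistent with the pages actually written, which is why the field itself can be deprecated.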


[jira] [Assigned] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-14 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky reassigned PARQUET-2080:
-

Assignee: Gidon Gershinsky  (was: Gabor Szadovszky)

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-13 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2080:
--
Description: 
Due to PARQUET-2078 RowGroup.file_offset is not reliable.

This field is also wrongly calculated in the C++ oss parquet (Arrow repo).

  was:Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall 
deprecate the field and add suggestions how to calculate the value.


> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet (Arrow repo).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-13 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414075#comment-17414075
 ] 

Gidon Gershinsky commented on PARQUET-2080:
---

[~gszadovszky] yes, I'll take it. There might be a different solution (also 
format-related) that bypasses the need to calculate this parameter in any 
implementation, so it can be fully deprecated. I'll get back with the details 
and we'll discuss the trade-offs.

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate 
> the field and add suggestions how to calculate the value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Parquet 1.12.1 RC0

2021-09-13 Thread Gidon Gershinsky
+1 (non-binding)

- checked the sum
- ran build and test, everything was ok
- ran additional framework tests with the built jars, passed

Cheers, Gidon


On Sun, Sep 12, 2021 at 12:05 AM Xinli shang wrote:

> Hi everyone,
>
>
> I propose the following RC to be released as the official Apache Parquet
> 1.12.1 release.
>
>
> The commit id is d1dccf6e680d86e94ce97005f5ac51848ba6d794
>
> * This corresponds to the tag: apache-parquet-1.12.1-rc0
>
> * https://github.com/apache/parquet-mr/tree/d1dccf6e680d86e94ce97005f5ac51848ba6d794
>
>
> The release tarball, signature, and checksums are here:
>
> *
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.12.1-rc0/
>
>
> You can find the KEYS file here:
>
> * https://dist.apache.org/repos/dist/release/parquet/KEYS
>
>
> Binary artifacts are staged in Nexus here:
>
> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>
>
> This release includes important changes listed at
> https://github.com/apache/parquet-mr/blob/parquet-1.12.x/CHANGES.md
>
>
> Please download, verify, and test.
>
>
> Please vote in the next 72 hours.
>
>
> [ ] +1 Release this as Apache Parquet 1.12.1
>
> [ ] +0
>
> [ ] -1 Do not release this because...
>
> --
> Xinli Shang
>


[jira] [Commented] (PARQUET-2071) Encryption translation tool

2021-08-05 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393982#comment-17393982
 ] 

Gidon Gershinsky commented on PARQUET-2071:
---

A very useful tool, I'll be glad to review the pr.

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at the page level, without 
> decoding records and rewriting them. This will speed up the process considerably. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1908) CLONE - [C++] Update cpp crypto package to match signed-off specification

2021-08-03 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1908.
---
Resolution: Fixed

PR merged in May 2019

> CLONE - [C++] Update cpp crypto package to match signed-off specification
> -
>
> Key: PARQUET-1908
> URL: https://issues.apache.org/jira/browse/PARQUET-1908
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Akshay
>    Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-5.0.0
>
>
> An initial version of the crypto package is merged. This Jira updates the crypto 
> code to 
>  # conform to the signed-off specification (wire protocol updates, signature tag 
> creation, AAD support, etc.)
>  # improve performance by extending the cipher lifecycle to file writing/reading, 
> instead of creating a cipher on each encrypt/decrypt operation  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: New Parquet PMC chair

2021-05-28 Thread Gidon Gershinsky
Congratulations Xinli, well deserved!!

Cheers, Gidon


On Sat, May 29, 2021 at 12:34 AM Julien Le Dem wrote:

> Hello Parquet community,
> The Parquet PMC discussed and decided some time ago to move to a rotating
> chair.
> Every year around this time the PMC will elect a new chair to represent the
> project to the board.
> I'm happy to announce that Xinli Shang is the first to be elected VP of
> Apache Parquet since the inception of the project.
> Xinli has been driving several community efforts and is instrumental to the
> project.
> Please join me in congratulating him.
> congrats Xinli!
> Julien
> - former Parquet PMC chair
>


[jira] [Created] (PARQUET-2053) Pluggable key material store

2021-05-25 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2053:
-

 Summary: Pluggable key material store
 Key: PARQUET-2053
 URL: https://issues.apache.org/jira/browse/PARQUET-2053
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Encryption key material can be stored either inside Parquet files or outside them 
(configurable). For outside storage, Parquet already has a pluggable interface 
for custom implementations, {{FileKeyMaterialStore}}, but no mechanism to load 
them (currently, one implementation is packaged in parquet-mr and is always 
loaded when outside storage is configured). We will provide a way to load 
custom implementations of {{FileKeyMaterialStore}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
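Loading a configured store implementation typically comes down to reflection over a class name. The sketch below shows one plausible shape of such a mechanism; `KeyMaterialStore`, `InMemoryStore`, and `load` are hypothetical stand-ins for the real `FileKeyMaterialStore` plug point, not parquet-mr API.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyStoreLoader {
    /** Hypothetical stand-in for the FileKeyMaterialStore plug point. */
    public interface KeyMaterialStore {
        void put(String fileId, String keyMaterial);
        String get(String fileId);
    }

    /** A trivial implementation, loadable by class name. */
    public static class InMemoryStore implements KeyMaterialStore {
        private final Map<String, String> map = new HashMap<>();
        public void put(String fileId, String keyMaterial) { map.put(fileId, keyMaterial); }
        public String get(String fileId) { return map.get(fileId); }
    }

    /** Load a configured implementation via reflection (requires a no-arg constructor). */
    public static KeyMaterialStore load(String className) {
        try {
            return (KeyMaterialStore) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "cannot load key material store: " + className, e);
        }
    }

    public static void main(String[] args) {
        KeyMaterialStore store = load("KeyStoreLoader$InMemoryStore");
        store.put("part-00000.parquet", "wrapped-key-material");
        System.out.println(store.get("part-00000.parquet")); // prints "wrapped-key-material"
    }
}
```

The class name would come from a Hadoop-style configuration property, so deployments can swap store implementations without code changes.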


[jira] [Updated] (PARQUET-1230) CLI tools for encrypted files

2021-05-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1230:
--
  Component/s: parquet-mr
Affects Version/s: 1.12.0

> CLI tools for encrypted files
> -
>
> Key: PARQUET-1230
> URL: https://issues.apache.org/jira/browse/PARQUET-1230
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1230) CLI tools for encrypted files

2021-05-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1230:
--
Parent: (was: PARQUET-1178)
Issue Type: New Feature  (was: Sub-task)

> CLI tools for encrypted files
> -
>
> Key: PARQUET-1230
> URL: https://issues.apache.org/jira/browse/PARQUET-1230
> Project: Parquet
>  Issue Type: New Feature
>    Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2040) Uniform encryption

2021-04-29 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2040:
-

 Summary: Uniform encryption
 Key: PARQUET-2040
 URL: https://issues.apache.org/jira/browse/PARQUET-2040
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


The PME low-level spec supports using the same encryption key for all columns, 
which is useful in a number of scenarios. However, this feature is not yet exposed 
in the high-level API, because its misuse can exceed the NIST limit on the 
number of AES GCM operations with one key. We will develop limit-enforcing 
code and provide an API for uniform table encryption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
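The limit-enforcing code described above amounts to counting GCM operations per key and refusing to go past a threshold. A minimal Java sketch, assuming a hypothetical limiter class and an illustrative threshold (NIST SP 800-38D is what bounds safe key usage; the exact Parquet limit is not specified here):

```java
public class GcmKeyUsageLimiter {
    // The concrete threshold is an assumption for illustration; NIST SP 800-38D
    // bounds how many times one AES GCM key may safely be used.
    private final long maxOperations;
    private long used = 0;

    public GcmKeyUsageLimiter(long maxOperations) {
        this.maxOperations = maxOperations;
    }

    /** Call before every GCM encryption with the shared (uniform) key. */
    public synchronized void checkAndCount() {
        if (used >= maxOperations) {
            throw new IllegalStateException(
                    "uniform encryption key exhausted after " + used
                    + " GCM operations; rotate the key");
        }
        used++;
    }

    public static void main(String[] args) {
        GcmKeyUsageLimiter limiter = new GcmKeyUsageLimiter(2);
        limiter.checkAndCount(); // 1st operation: allowed
        limiter.checkAndCount(); // 2nd operation: allowed
        try {
            limiter.checkAndCount(); // 3rd operation: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast with an exception (rather than silently reusing the key) is what keeps a misconfigured uniform-encryption job from violating the GCM usage bound.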


[jira] [Created] (PARQUET-2033) Make "null decryptor" exception more informative

2021-04-20 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2033:
-

 Summary: Make "null decryptor" exception more informative
 Key: PARQUET-2033
 URL: https://issues.apache.org/jira/browse/PARQUET-2033
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Forgetting to pass decryption properties when reading an encrypted column in 
files with a plaintext footer results in a "null decryptor" exception thrown in 
the ColumnChunkMetaData class. The exception text should be updated to 
point to the likely cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
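The improved message could name the column and hint at the missing decryption properties. A hypothetical guard sketching that idea (the class, method, and message here are illustrative, not the actual parquet-mr change):

```java
public class DecryptorGuard {
    /**
     * Hypothetical guard sketching the kind of message the ColumnChunkMetaData
     * "null decryptor" path could throw; names and wording are illustrative only.
     */
    public static <T> T requireDecryptor(T decryptor, String columnPath) {
        if (decryptor == null) {
            throw new IllegalStateException(
                    "Null decryptor for encrypted column [" + columnPath + "] in a "
                    + "plaintext-footer file. Were file decryption properties passed "
                    + "to the reader?");
        }
        return decryptor;
    }

    public static void main(String[] args) {
        try {
            requireDecryptor(null, "user.ssn");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```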


[jira] [Created] (PARQUET-2014) Local key wrapping with rotation

2021-04-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2014:
-

 Summary: Local key wrapping with rotation
 Key: PARQUET-2014
 URL: https://issues.apache.org/jira/browse/PARQUET-2014
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


parquet-mr 1.12.0 has experimental support for local wrapping of encryption 
keys that doesn't handle master key versions or key rotation. This Jira will 
add these capabilities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1613) Key rotation tool

2021-04-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1613.
---
Resolution: Done

Handled by PR 615

> Key rotation tool
> -
>
> Key: PARQUET-1613
> URL: https://issues.apache.org/jira/browse/PARQUET-1613
> Project: Parquet
>  Issue Type: Sub-task
>    Reporter: Gidon Gershinsky
>Assignee: Maya Anderson
>Priority: Major
>
> Rotates the master key, for both single and double wrappers.
> For the latter, enables support for a single KMS call per column, in readers 
> of any data sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1612) Double wrapped key manager

2021-04-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1612.
---
Resolution: Done

Handled by PR 615

> Double wrapped key manager
> --
>
> Key: PARQUET-1612
> URL: https://issues.apache.org/jira/browse/PARQUET-1612
> Project: Parquet
>  Issue Type: Sub-task
>    Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
>
> To minimize interaction with KMS, this manager will wrap the encryption keys 
> twice.  Might be combined with key rotation for further optimization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1178) Parquet modular encryption

2021-03-26 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1178.
---
Resolution: Done

Released. Thanks to all who've contributed to this new Parquet capability!

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> keeping data fully encrypted in storage, while enabling efficient analytics 
> on the data via reader-side extraction / authentication / decryption of data 
> subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1178) Parquet modular encryption

2021-03-26 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1178:
--
Fix Version/s: 1.12.0

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>    Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> keeping data fully encrypted in storage, while enabling efficient analytics 
> on the data via reader-side extraction / authentication / decryption of data 
> subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [RESULT] Release Apache Parquet 1.12.0 RC4

2021-03-25 Thread Gidon Gershinsky
Great news!!!
And thanks Gabor and Xinli for handling the release process!

Cheers, Gidon


On Thu, Mar 25, 2021 at 7:01 PM Xinli shang  wrote:

> Thanks everybody for the verification and special thanks to all the
> contributors to this release! This release includes awesome features and
> improvements. We look forward to the industry's adoption!
>
> On Thu, Mar 25, 2021 at 3:35 AM Gabor Szadovszky  wrote:
>
> > The vote to release 1.12.0 RC4 as Apache Parquet MR 1.12.0 is PASSED with
> > the required three +1 binding votes and two +1 non-binding votes. (There
> > were no -1 or 0 votes.)
> > Thank you all who verified and voted!
> >
> > I'm going forward with the release process soon.
> >
> > > On Thu, Mar 25, 2021 at 1:26 AM Julien Le Dem wrote:
> >
> > > +1 (binding)
> > > I verified the signature and built from source.
> > > All tests pass.
> > > Looks good.
> > >
> > > > On Wed, Mar 24, 2021 at 2:07 AM Gabor Szadovszky wrote:
> > >
> > > > I currently have the feeling that the Avro/Jackson related issue has
> > been
> > > > discussed and the community agrees on moving forward with this RC as
> is
> > > > (without upgrading the Avro and the Jackson dependencies).
> > > > So, I'm giving my +1 (binding) vote.
> > > >
> > > > On Tue, Mar 23, 2021 at 9:28 PM Aaron Niskode-Dossett wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - cloned the 1.12.0-rc-4 tag from github
> > > > > - compiled jars locally and all tests passed
> > > > > - used the 1.12.0 jars as dependencies for a local application that
> > > > > streams data into protobuf-parquet files
> > > > > - confirmed data is correct and can be read with parquet-tools
> > > > > compiled from parquet 1.11.1
> > > > >
> > > > > > On Tue, Mar 23, 2021 at 10:47 AM Xinli shang wrote:
> > > > >
> > > > > > Let's discuss it in today's community sync meeting.
> > > > > >
> > > > > > On Tue, Mar 23, 2021 at 8:37 AM Aaron Niskode-Dossett wrote:
> > > > > >
> > > > > > > Gabor and Ismaël, thank you both for the very clear explanations
> > > > > > > of what's going on.
> > > > > > >
> > > > > > > Based on Gabor's description of avro compatibility I would be +1
> > > > > > > (non-binding) for the current RC.
> > > > > > >
> > > > > > > > On Tue, Mar 23, 2021 at 4:36 AM Gabor Szadovszky <ga...@apache.org> wrote:
> > > > > > >
> > > > > > > > Thanks, Ismaël for the explanation. I have a couple of notes
> > > > > > > > about your concerns.
> > > > > > > >
> > > > > > > > - Parquet 1.12.0 as per semantic versioning is not a major
> > > > > > > > but a minor release. (It is different from the Avro versioning
> > > > > > > > strategy, where the second version number means major version
> > > > > > > > changes.)
> > > > > > > > - The jackson dependency is shaded in the parquet jars so
> > > > > > > > synchronization of the version is not needed (and not even
> > > > > > > > possible).
> > > > > > > > - Using the latest Avro version makes sense, but if we do not
> > > > > > > > use it for the current release it should not cause any issues
> > > > > > > > in our clients. Let's check the following example. We upgrade
> > > > > > > > to the latest 1.10.2 Avro release in parquet, then release it
> > > > > > > > under 1.12.0. Later on Avro creates a new release (e.g. 1.10.3
> > > > > > > > or even 1.11.0) while Parquet does not. In this case our
> > > > > > > > clients need to upgrade Avro without Parquet. If it is a major
> > > > > > > > Avro release it might occur that the Parquet code has to be
> > > > > > > > updated, but usually it is not the case. (The last time we've
> > > > > > > > had to change production code for an Avro upgrade was from
> > > > > > > > 1.7.6 to 1.8.0.) I think our clients should be able to upgrade
> > > > > > > > Avro independently from Parquet and vice versa (until there
> > > > > > > > are incompatibility issues). I would even change Parquet's
> > > > > > > > Avro dependency to "provided", but that might be a breaking
> > > > > > > > change and we clearly won't do it just before the release.
> > > > > > > >
> > > > > > > > What do you think? Anyone have a strong opinion about this
> > > > > > > > topic?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gabor
> > > > > > > >
> > > > > > > > > On Mon, Mar 22, 2021 at 6:31 PM Ismaël Mejía <ieme...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Sure. The Avro upgrade feature/API wise is minor for Parquet,
> > > > > > > > > so the possibility of adding a regression is really REALLY
> > > > > > > > > minor. The hidden issue is the new transitive dependencies 

[jira] [Commented] (PARQUET-1997) [C++] AesEncryptor and AesDecryptor primitives are unsafe

2021-03-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299054#comment-17299054
 ] 

Gidon Gershinsky commented on PARQUET-1997:
---

I recall discussing this with Tham, but I forgot the details... [~thamha], 
do you remember what the return value is for? 

> [C++] AesEncryptor and AesDecryptor primitives are unsafe
> -
>
> Key: PARQUET-1997
> URL: https://issues.apache.org/jira/browse/PARQUET-1997
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {{AesEncryptor::Encrypt}}, {{AesDecryptor::Decrypt}} take a pointer to the 
> output buffer but without the output buffer length. The caller is required to 
> guess the expected output length. The functions also return the written 
> output length, but at this point it's too late: data may have been written 
> out of bounds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1997) [C++] AesEncryptor and AesDecryptor primitives are unsafe

2021-03-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299041#comment-17299041
 ] 

Gidon Gershinsky commented on PARQUET-1997:
---

[~apitrou] This point is addressed by the _int 
AesEncryptor::CiphertextSizeDelta()_ function - the caller uses it to allocate 
the output buffer. This is not a part of public Parquet API; the caller is the 
parquet code.

> [C++] AesEncryptor and AesDecryptor primitives are unsafe
> -
>
> Key: PARQUET-1997
> URL: https://issues.apache.org/jira/browse/PARQUET-1997
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {{AesEncryptor::Encrypt}}, {{AesDecryptor::Decrypt}} take a pointer to the 
> output buffer but without the output buffer length. The caller is required to 
> guess the expected output length. The functions also return the written 
> output length, but at this point it's too late: data may have been written 
> out of bounds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
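The safe pattern the comment describes (size the output buffer from the cipher before encrypting, and have the encrypt call itself enforce the bound) can be illustrated in Java with `javax.crypto.Cipher`, whose `getOutputSize` plays the role of `AesEncryptor::CiphertextSizeDelta()` and whose buffer-taking `doFinal` throws `ShortBufferException` instead of writing out of bounds. This is a sketch of the pattern only, not the parquet-cpp code:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class SizedGcmEncrypt {
    /** Encrypt into a caller-provided buffer; returns the number of bytes written. */
    public static int encrypt(byte[] plaintext, byte[] out, SecretKey key, byte[] nonce)
            throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        // doFinal with an explicit output buffer throws ShortBufferException if
        // `out` is too small -- the bounds check the C++ primitives lack.
        return cipher.doFinal(plaintext, 0, plaintext.length, out, 0);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] nonce = new byte[12];
        new SecureRandom().nextBytes(nonce);

        byte[] plaintext = "page bytes".getBytes(StandardCharsets.UTF_8);
        Cipher sizer = Cipher.getInstance("AES/GCM/NoPadding");
        sizer.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        // Ask the cipher for the required output size (plaintext + 16-byte GCM tag),
        // analogous to calling AesEncryptor::CiphertextSizeDelta() before allocating.
        byte[] out = new byte[sizer.getOutputSize(plaintext.length)];
        int written = encrypt(plaintext, out, key, nonce);
        System.out.println(written); // 26: 10 plaintext bytes + 16-byte tag
    }
}
```

Passing the output length into the primitive moves the bounds check to where the write happens, rather than trusting every caller to have sized the buffer correctly.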


[jira] [Commented] (PARQUET-1992) Cannot build from tarball because of git submodules

2021-03-02 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293509#comment-17293509
 ] 

Gidon Gershinsky commented on PARQUET-1992:
---

This contribution was added by [~mayaa]; she knows the subject better than 
I do. Maya, could you address the comments and the question in this Jira?

> Cannot build from tarball because of git submodules
> ---
>
> Key: PARQUET-1992
> URL: https://issues.apache.org/jira/browse/PARQUET-1992
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Priority: Blocker
>
> Because we use git submodules (to get test parquet files), a simple "mvn clean 
> install" fails from the unpacked tarball due to "not a git repository".
> I think we have 2 options to solve this situation:
> * Include all the required files (even those only used for testing) in the tarball 
> and somehow avoid the git submodule update when executed in a non-git 
> environment
> * Make the downloading of the parquet files and the related tests optional so 
> it won't fail the build from the tarball



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

