Re: Adding Parquet encryption support to PyArrow

2020-09-09 Thread Gidon Gershinsky
Thanks guys. I'll go over the intro sections to merge/streamline the text there. I've added "commenter" access for all, so everybody can take part in the doc's discussion threads. For edit access, please contact Itamar (by pressing the request button). Cheers, Gidon On Wed, Sep 9, 2020 at

Re: Adding Parquet encryption support to PyArrow

2020-09-09 Thread Roee Shlomo
Hi Itamar, Thanks for starting the document. I've added an initial draft version of the API (parts of it at least). I have also added problem statement and goals sections listing what I understand we want to achieve. On 2020/09/08 17:44:07, "Itamar Turner-Trauring" wrote: > Still

Re: Adding Parquet encryption support to PyArrow

2020-09-08 Thread Itamar Turner-Trauring
Still learning from the discussion/docs, but in the meantime I created https://issues.apache.org/jira/projects/ARROW/issues/ARROW-9947 which has a link to a Google

Re: Adding Parquet encryption support to PyArrow

2020-09-06 Thread Itamar Turner-Trauring
On Tuesday when I'm back at work I will read all the above, and can coordinate on starting a design doc. On Sun, Sep 6, 2020, at 5:03 AM, Gidon Gershinsky wrote: > Cool, thank you. This would solve the problem at hand. > I agree it'd be good to kick off the PyArrow API discussion in parallel >

Re: Adding Parquet encryption support to PyArrow

2020-09-06 Thread Gidon Gershinsky
Cool, thank you. This would solve the problem at hand. I agree it'd be good to kick off the PyArrow API discussion in parallel with the PR8023 review. Maybe you and Itamar could prep a googledoc draft for the community to have a look and to comment. Cheers, Gidon On Fri, Sep 4, 2020 at 6:08 PM

Re: Adding Parquet encryption support to PyArrow

2020-09-06 Thread Gidon Gershinsky
Here it goes, as promised. Briefly, the low level interface is harder to use, demands deeper encryption expertise, and requires a developer to manually implement a number of components, critical for production deployment. The high level interface is simple and has these critical components

Re: Adding Parquet encryption support to PyArrow

2020-09-04 Thread Roee Shlomo
Sounds good. In the suggestion above the builders for FileEncryptionProperties/FileDecryptionProperties should not be exposed, so only key tools would create those. This is just one option of course. On 2020/09/03 20:44:26, Antoine Pitrou wrote: > > It would be useful for outsiders to expose

Re: Adding Parquet encryption support to PyArrow

2020-09-04 Thread Gidon Gershinsky
Sure, I'll prep a brief summary on this by Sunday, got a weekend kicking in here today. Cheers, Gidon On Thu, Sep 3, 2020 at 11:44 PM Antoine Pitrou wrote: > > It would be useful for outsiders to expose what those two API levels > are, and to what usage they correspond. > Is Parquet

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Antoine Pitrou
It would be useful for outsiders to expose what those two API levels are, and to what usage they correspond. Is Parquet encryption used only with Spark? While Spark interoperability is important, Parquet files are more ubiquitous than that. Regards Antoine. On 03/09/2020 at 22:31, Gidon

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Gidon Gershinsky
Why would the low level API be exposed directly? This will break the interop between the two analytic ecosystems down the road. Again, let me suggest leveraging the high level interface, based on the PropertiesDrivenCryptoFactory. It should address your technical requirements; if it doesn't, we

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Roee Shlomo
Hi Itamar, I implemented some Python wrappers for the low level API and would be happy to collaborate on that. The reason I didn't push this forward yet is what Gidon mentioned. The API to expose to Python users needs to be finalized first and it must include the key tools API for interop with

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Itamar Turner-Trauring
On Thu, Sep 3, 2020, at 11:01 AM, Antoine Pitrou wrote: > > Hi Gidon, > > On 03/09/2020 at 16:53, Gidon Gershinsky wrote: > > Hi Itamar, > > > > My suggestion would be to wrap a different API in Python - the high-level > > encryption interface of > > https://github.com/apache/arrow/pull/8023 >

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Gidon Gershinsky
Hi Antoine, Sounds good to me. This PR is already being actively reviewed, and it'd be good to have Itamar's assessment. Cheers, Gidon On Thu, Sep 3, 2020 at 6:01 PM Antoine Pitrou wrote: > > Hi Gidon, > > On 03/09/2020 at 16:53, Gidon Gershinsky wrote: > > Hi Itamar, > > > > My

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Antoine Pitrou
Hi Gidon, On 03/09/2020 at 16:53, Gidon Gershinsky wrote: > Hi Itamar, > > My suggestion would be to wrap a different API in Python - the high-level > encryption interface of > https://github.com/apache/arrow/pull/8023 We need a strategy for reviewing those changes. The PR is quite large,

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Gidon Gershinsky
Hi Itamar, My suggestion would be to wrap a different API in Python - the high-level encryption interface of https://github.com/apache/arrow/pull/8023 This will enable interoperability with Apache Spark (and other frameworks), where we don't expose the low level parquet encryption API. If such a

Adding Parquet encryption support to PyArrow

2020-09-03 Thread Itamar Turner-Trauring
Hi, I'm looking into implementing this, and it seems like there are two parts: packaging, but also wrapping the APIs in Python. Is the latter item accurate? If so, any examples of similar existing wrapped APIs, or should I just come up with something on my own? Context: