Sure, I'll prep a brief summary on this by Sunday; the weekend is
kicking in here today.

Cheers, Gidon


On Thu, Sep 3, 2020 at 11:44 PM Antoine Pitrou <anto...@python.org> wrote:

>
> For outsiders, it would be useful to spell out what those two API levels
> are, and what usage each corresponds to.
> Is Parquet encryption used only with Spark? While Spark
> interoperability is important, Parquet files are more ubiquitous than that.
>
> Regards
>
> Antoine.
>
>
> On 03/09/2020 at 22:31, Gidon Gershinsky wrote:
> > Why would the low-level API be exposed directly? That would break
> > interop between the two analytics ecosystems down the road.
> > Again, let me suggest leveraging the high-level interface, based on the
> > PropertiesDrivenCryptoFactory.
> > It should address your technical requirements; if it doesn't, we can
> > discuss the gaps.
> > All questions are welcome.
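> >
> > To give a flavour, here is a rough sketch of the properties-driven flow.
> > The Python names are purely illustrative (no wrapper API is defined yet);
> > the point is that the factory talks to the KMS and handles key wrapping,
> > so the application never touches raw data keys:
> >
> > ```
> > # Illustrative sketch only; PropertiesDrivenCryptoFactory has no Python
> > # binding yet, and my_kms_client is a hypothetical KMS connection.
> > factory = PropertiesDrivenCryptoFactory(kms_client=my_kms_client)
> > encryption_props = factory.getFileEncryptionProperties(
> >     footer_key="k1",                        # master key ID for the footer
> >     column_keys={"k2": ["ssn", "salary"]})  # master key ID -> column names
> > ```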
> >
> > Cheers, Gidon
> >
> >
> > On Thu, Sep 3, 2020 at 10:11 PM Roee Shlomo <roe...@gmail.com> wrote:
> >
> >> Hi Itamar,
> >>
> >> I implemented some Python wrappers for the low-level API and would be
> >> happy to collaborate on that. The reason I didn't push this forward yet
> >> is what Gidon mentioned: the API to expose to Python users needs to be
> >> finalized first, and it must include the key tools API for interop with
> >> Spark.
> >>
> >> Perhaps it would be good to kick off a discussion on what the pyarrow
> >> API for PME should look like (in parallel with reviewing the arrow-cpp
> >> implementation of key tools, to ensure that wrapping it would be a
> >> reasonable effort).
> >>
> >> One possible approach is to expose both the low-level API and key tools
> >> separately. A user would create and initialize a
> >> PropertiesDrivenCryptoFactory and use it to create the
> >> FileEncryptionProperties/FileDecryptionProperties to pass to the
> >> lower-level API. In pandas this would translate to something like:
> >> ```
> >> factory = PropertiesDrivenCryptoFactory(...)
> >> df.to_parquet(path, engine="pyarrow",
> >>               encryption=factory.getFileEncryptionProperties(...))
> >> df = pd.read_parquet(path, engine="pyarrow",
> >>                      decryption=factory.getFileDecryptionProperties(...))
> >> ```
> >> This should also work for reading datasets, since decryption uses a
> >> KeyRetriever, but I'm not sure what will need to be done once datasets
> >> support writes.
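> >>
> >> For the dataset read path, I'd imagine something like the sketch below.
> >> This is purely hypothetical: ParquetFileFormat takes no
> >> decryption_properties argument today, and the final spelling may well
> >> differ.
> >> ```
> >> import pyarrow.dataset as ds
> >>
> >> # Hypothetical: hand file decryption properties to the Parquet format
> >> # object; keys would be resolved lazily through the KeyRetriever.
> >> fmt = ds.ParquetFileFormat(
> >>     decryption_properties=factory.getFileDecryptionProperties(...))
> >> table = ds.dataset(path, format=fmt).to_table()
> >> ```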
> >>
> >> On 2020/09/03 14:11:51, "Itamar Turner-Trauring" <ita...@pythonspeed.com>
> >> wrote:
> >>> Hi,
> >>>
> >>> I'm looking into implementing this, and it seems like there are two
> >>> parts: packaging, but also wrapping the APIs in Python. Is the latter
> >>> accurate? If so, are there any examples of similar existing wrapped
> >>> APIs, or should I just come up with something on my own?
> >>>
> >>> Context:
> >>> https://github.com/apache/arrow/pull/4826
> >>> https://issues.apache.org/jira/browse/ARROW-8040
> >>>
> >>> Thanks,
> >>>
> >>> —Itamar
> >>
> >
>
