Here it goes, as promised.

Briefly, the low-level interface is harder to use, demands deeper
encryption expertise, and requires a developer to manually implement a
number of components that are critical for production deployment.
The high-level interface is simple and has these critical components
already implemented. But this, of course, is more or less the definition
of "low-level" vs "high-level" in general.

In more depth: the low-level interface exposes direct implementations of
the spec,
https://github.com/apache/parquet-format/blob/apache-parquet-format-2.7.0/Encryption.md

This is low-level crypto machinery. Also, the spec explicitly leaves key
management out of scope.
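
To give a flavor of what "low level" means here: the spec encrypts each
module of a Parquet file with AES-GCM, binding ciphertexts to their position
in the file via AAD. A toy Python sketch of that primitive (using the
cryptography package; the AAD string below is purely illustrative, not the
spec's exact format):
```
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)  # the caller must manage this key
aesgcm = AESGCM(key)
nonce = os.urandom(12)                     # must be unique per encrypted module
module_aad = b"file-id:ssn:page:0"         # illustrative module AAD only

ciphertext = aesgcm.encrypt(nonce, b"sensitive page bytes", module_aad)
plaintext = aesgcm.decrypt(nonce, ciphertext, module_aad)  # raises on tampering
```
Getting details like nonce uniqueness and key reuse right across many files
is exactly what makes direct use of this layer error-prone.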

The high-level interface is simple:
https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing
Basically, all you need to do in order to encrypt/decrypt data is to
configure a few properties. To get key management, you need to create an
implementation of the provided simple KmsClient interface, which will
interact with the KMS (key management service) deployed in your production
environment (or in the cloud where you run). We'll provide a sample
implementation of such a client for a popular open source KMS. The rest is
handled by https://github.com/apache/arrow/pull/8023 , which implements best
practices in this security field (including prevention of mistakes like
using the same data key for many files) and calls the low-level API as
needed. It'd be good if Roee and Itamar could create the appropriate Python
wrappers for 8023 in PyArrow and pandas.
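
To make this concrete, here is a rough sketch of what the Python side could
look like. All names below (the KmsClient and CryptoFactory classes, the
module path, the method names) follow the design document and PR 8023, but
the PyArrow wrapper API is not finalized, so treat every name as an
assumption:
```
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.parquet import encryption  # hypothetical module path

class MyKmsClient(encryption.KmsClient):
    """Skeleton client; in production, wrap_key/unwrap_key call your KMS."""

    def __init__(self, kms_connection_config):
        super().__init__()
        self._config = kms_connection_config

    def wrap_key(self, key_bytes, master_key_identifier):
        # Ask the KMS to encrypt the data key with the named master key.
        raise NotImplementedError("call your KMS here")

    def unwrap_key(self, wrapped_key, master_key_identifier):
        # Ask the KMS to decrypt the wrapped data key.
        raise NotImplementedError("call your KMS here")

# The factory generates, wraps and caches data keys, so no two files
# share a data key; the user only configures properties.
factory = encryption.CryptoFactory(lambda config: MyKmsClient(config))
kms_config = encryption.KmsConnectionConfig(
    kms_instance_url="https://kms.example.com")  # hypothetical endpoint
enc_config = encryption.EncryptionConfiguration(
    footer_key="footer_master_key_id",
    column_keys={"column_master_key_id": ["ssn", "salary"]})

table = pa.table({"ssn": ["123-45-6789"], "salary": [100000]})
pq.write_table(
    table, "encrypted.parquet",
    encryption_properties=factory.file_encryption_properties(
        kms_config, enc_config))
```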

Re Spark and all other Scala/Java analytic frameworks: they will be
automatically enabled for encryption when they upgrade their parquet-mr
version to 1.12, the upcoming release planned with PME (already in master).
Of course, this is possible only with the high-level interface, accessed by
passing the properties via the built-in configuration channels.
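
For example, from PySpark this would amount to setting a few Hadoop
properties. The property names below follow the current parquet-mr master
and the KMS client class is a stand-in for your own implementation; both
could still change before the 1.12 release:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hconf.set("parquet.encryption.kms.client.class",
          "com.example.MyKmsClient")  # your KmsClient implementation
hconf.set("parquet.encryption.footer.key", "footer_master_key_id")
hconf.set("parquet.encryption.column.keys",
          "column_master_key_id:ssn,salary")

df = spark.createDataFrame([("123-45-6789", 100000)], ["ssn", "salary"])
df.write.parquet("/path/to/encrypted")  # encrypted per the config above
```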

Cheers, Gidon


On Thu, Sep 3, 2020 at 11:44 PM Antoine Pitrou <anto...@python.org> wrote:

>
> It would be useful to outsiders to expose what those two API levels
> are, and what usage they correspond to.
> Is Parquet encryption used only with Spark?  While Spark
> interoperability is important, Parquet files are more ubiquitous than that.
>
> Regards
>
> Antoine.
>
>
> Le 03/09/2020 à 22:31, Gidon Gershinsky a écrit :
> > Why would the low-level API be exposed directly? This will break the
> > interop between the two analytic ecosystems down the road.
> > Again, let me suggest leveraging the high level interface, based on the
> > PropertiesDrivenCryptoFactory.
> > It should address your technical requirements; if it doesn't, we can
> > discuss the gaps.
> > All questions are welcome.
> >
> > Cheers, Gidon
> >
> >
> > On Thu, Sep 3, 2020 at 10:11 PM Roee Shlomo <roe...@gmail.com> wrote:
> >
> >> Hi Itamar,
> >>
> >> I implemented some python wrappers for the low level API and would be
> >> happy to collaborate on that. The reason I didn't push this forward yet
> >> is what Gidon mentioned. The API to expose to python users needs to be
> >> finalized first and it must include the key tools API for interop with
> >> Spark.
> >>
> >> Perhaps it would be good to kick off a discussion on what the pyarrow
> >> API for PME should look like (in parallel to reviewing the arrow-cpp
> >> implementation of key-tools, to ensure that wrapping it would be a
> >> reasonable effort).
> >>
> >> One possible approach is to expose both the low level API and keytools
> >> separately. A user would create and initialize a
> >> PropertiesDrivenCryptoFactory and use it to create the
> >> FileEncryptionProperties/FileDecryptionProperties to pass to the lower
> >> level API. In pandas this would translate to something like:
> >> ```
> >> factory = PropertiesDrivenCryptoFactory(...)
> >> df.to_parquet(path, engine="pyarrow",
> >>               encryption=factory.getFileEncryptionProperties(...))
> >> df = pd.read_parquet(path, engine="pyarrow",
> >>                      decryption=factory.getFileDecryptionProperties(...))
> >> ```
> >> This should also work with reading datasets since decryption uses a
> >> KeyRetriever, but I'm not sure what will need to be done once datasets
> >> support write.
> >>
> >> On 2020/09/03 14:11:51, "Itamar Turner-Trauring" <ita...@pythonspeed.com>
> >> wrote:
> >>> Hi,
> >>>
> >>> I'm looking into implementing this, and it seems like there are two
> >>> parts: packaging, and also wrapping the APIs in Python. Is the latter
> >>> item accurate? If so, any examples of similar existing wrapped APIs,
> >>> or should I just come up with something on my own?
> >>>
> >>> Context:
> >>> https://github.com/apache/arrow/pull/4826
> >>> https://issues.apache.org/jira/browse/ARROW-8040
> >>>
> >>> Thanks,
> >>>
> >>> —Itamar
> >>
> >
>
