Precisely, the main change is in the threading model. Afaik, the document
proposes a model that fits pandas, but might be problematic for other users
of this library.
Technically, this is not showstopper though; if the community decides on
this model, it will be compatible with the high-level encryption design;
but the change implementation would need to be done by pandas experts (not
us; but we'll help where we can).
Micah, you know this subject (and the community) better than we do - we'd
much appreciate it if you'd take a lead on removing this roadblock.

Cheers, Gidon


On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I think some of the comments might be conflicting.  One of the concerns
> (that I would need to refresh myself on to offer an opinion which was
> covered in Ben's doc) was the threading model we expect in the library.
>
> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi Gidon,
> >
> > Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> > > Regarding the high-level layer, I think it waits for a progress at
> > >
> >
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> > > No activity there since last November. This is unfortunate, because
> Tham
> > > has put a lot of work in coding the high-level layer (and addressing
> 200+
> > > review comments) in the PR https://github.com/apache/arrow/pull/8023.
> > The
> > > code is functional, compatible with the Java version in parquet-mr, and
> > can
> > > be updated with the threading changes in the doc above. I hope all this
> > > good work will not be wasted.
> >
> > I'm sorry for the possibly frustrating process.  Looking at the PR,
> > though, it seems a bunch of comments were not addressed.  Is it possible
> > to go through them and ensure they get an answer and/or a resolution?
> >
> > Best regards
> >
> > Antoine.
> >
> >
> >
> > >
> > > Cheers, Gidon
> > >
> > >
> > > On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com
> >
> > > wrote:
> > >
> > >> My thoughts:
> > >> 1.  I've lost track of the higher level encryption implementation in
> > C++.
> > >> I think we were trying to come to a consensus on the threading/thread
> > >> safety model?
> > >>
> > >> 2.  I'm open to exposing the lower level encryption libraries in
> python
> > >> (without appropriate namespacing/communication).  It seems at least
> for
> > >> reading, there is potentially less harm (I'll caveat that with I'm
> not a
> > >> security expert).  Are both the low level read and write
> implementations
> > >> necessary?  (it probably makes sense to have a few smaller PRs for
> > exposing
> > >> this functionality anyways).
> > >>
> > >>
> > >>
> > >> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> > >> ita...@pythonspeed.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Since the PR for high-level C++ Parquet encryption API appears
> stalled
> > (
> > >>> https://github.com/apache/arrow/pull/8023), I'm looking into
> exposing
> > >> the
> > >>> low-level Parquet encryption API to Python.
> > >>>
> > >>> Arguments for doing this: the low-level API is all the users I'm
> > talking
> > >>> to need, at the moment, so it's plausible others would also find some
> > >>> benefit in having the Pyarrow API expose low-level Parquet
> encryption.
> > >> Then
> > >>> again, it might only be this one company and no one else cares.
> > >>>
> > >>> The arguments against, per Gidon Gershinsky:
> > >>>
> > >>>>  * security: low-level encryption API is easy to misuse (eg giving
> the
> > >>> same keys for a number of different files; this'd break the AES GCM
> > >>> cipher). The high-level encryption layer handles that by applying
> > >> envelope
> > >>> encryption and other best practices in data security. Also, this
> layer
> > is
> > >>> maintained by the community, meaning that future improvements and
> > >> security
> > >>> fixes can be upstreamed by anyone, and available to all.
> > >>>>  * compatibility: parquet-mr implements the high-level encryption
> > >> layer.
> > >>> If we want the files produced by Spark/Presto/etc to be readable by
> > >>> pandas/PyArrow (and vice versa), we need to provide the Arrow users
> > with
> > >>> the high-level API.
> > >>>> ...
> > >>>>
> > >>>> The current situation is not ideal, it'd be good to merge the
> > >> high-level
> > >>> PR (and maybe hide the low level), but here we are; also, C++ is a
> kind
> > >> of
> > >>> a low-level language; Python would expose it to a less experienced
> > >> audience.
> > >>>
> > >>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> > >>>
> > >>> I find the compatibility argument less compelling, that's readily
> > >>> addressed by documentation. I am not a crypto expert so I can't
> > evaluate
> > >>> how risky exposing the low-level encryption APIs would be, but I can
> > see
> > >>> how that would be a significant concern.
> > >>>
> > >>> Some options are:
> > >>>  * Status quo, no Python API for low-level Parquet encryption. This
> has
> > >>> two possible outcomes:
> > >>>    * Eventually high-level API gets merged, gets Python binding.
> > >>>    * High-level encryption API is never merged, Python users never
> get
> > >>> access to encryption.
> > >>>  * Add low-level Parquet encryption API to Pyarrow, perhaps using
> > >> "hazmat"
> > >>> idiom used by the Python cryptography package (API namespace
> indicating
> > >>> "use at your own risk, this is dangerous", basically, e.g.
> > >>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
> > >>>    * Gidon Gershinsky did not find this suggestion compelling enough
> to
> > >>> override his security concerns.
> > >>>  * Low-level encryption done as third party Python package, either
> > >> private
> > >>> or open source. This is annoying technically, plausibly would require
> > >>> maintaining a fork.
> > >>> Any other ideas? Thoughts on these options?
> > >>>
> > >>> —Itamar
> > >>
> > >
> >
>

Reply via email to