Hi Antoine,

My part there is mostly review and some advice. The bulk of the work is
done by Tham and by the community members who've reviewed the PR; my
frustration is with seeing it in limbo for a while now.

Regarding the remaining comments: currently, the main sticking points are
the change proposals in this Google doc. Once their status is clarified, I
hope Tham will be able to resume addressing the comments (I'll help with
some of them if needed).

Cheers, Gidon


On Tue, Feb 16, 2021 at 6:03 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Gidon,
>
> On 16/02/2021 at 16:42, Gidon Gershinsky wrote:
> > Regarding the high-level layer, I think it is waiting for progress on
> > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> > No activity there since last November. This is unfortunate, because Tham
> > has put a lot of work into coding the high-level layer (and addressing 200+
> > review comments) in the PR https://github.com/apache/arrow/pull/8023. The
> > code is functional, compatible with the Java version in parquet-mr, and can
> > be updated with the threading changes in the doc above. I hope all this
> > good work will not be wasted.
>
> I'm sorry for the possibly frustrating process.  Looking at the PR,
> though, it seems a bunch of comments were not addressed.  Is it possible
> to go through them and ensure they get an answer and/or a resolution?
>
> Best regards
>
> Antoine.
>
>
>
> >
> > Cheers, Gidon
> >
> >
> > On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> My thoughts:
> >> 1.  I've lost track of the higher-level encryption implementation in C++.
> >> I think we were trying to come to a consensus on the threading/thread
> >> safety model?
> >>
> >> 2.  I'm open to exposing the lower-level encryption libraries in Python
> >> (with appropriate namespacing/communication).  It seems that, at least for
> >> reading, there is potentially less harm (with the caveat that I'm not a
> >> security expert).  Are both the low-level read and write implementations
> >> necessary?  (It probably makes sense to have a few smaller PRs for exposing
> >> this functionality anyway.)
> >>
> >>
> >>
> >> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> >> ita...@pythonspeed.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Since the PR for the high-level C++ Parquet encryption API appears stalled
> >>> (https://github.com/apache/arrow/pull/8023), I'm looking into exposing the
> >>> low-level Parquet encryption API to Python.
> >>>
> >>> Arguments for doing this: the low-level API is all that the users I'm talking
> >>> to need at the moment, so it's plausible others would also find some
> >>> benefit in having the PyArrow API expose low-level Parquet encryption. Then
> >>> again, it might only be this one company and no one else cares.
> >>>
> >>> The arguments against, per Gidon Gershinsky:
> >>>
> >>>>  * security: low-level encryption API is easy to misuse (e.g. giving the
> >>>> same keys for a number of different files; this'd break the AES GCM
> >>>> cipher). The high-level encryption layer handles that by applying envelope
> >>>> encryption and other best practices in data security. Also, this layer is
> >>>> maintained by the community, meaning that future improvements and security
> >>>> fixes can be upstreamed by anyone, and available to all.
> >>>>  * compatibility: parquet-mr implements the high-level encryption layer.
> >>>> If we want the files produced by Spark/Presto/etc. to be readable by
> >>>> pandas/PyArrow (and vice versa), we need to provide the Arrow users with
> >>>> the high-level API.
> >>>> ...
> >>>>
> >>>> The current situation is not ideal, it'd be good to merge the high-level
> >>>> PR (and maybe hide the low level), but here we are; also, C++ is a kind of
> >>>> a low-level language; Python would expose it to a less experienced audience.
> >>>
> >>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> >>>
> >>> I find the compatibility argument less compelling; that's readily
> >>> addressed by documentation. I am not a crypto expert so I can't evaluate
> >>> how risky exposing the low-level encryption APIs would be, but I can see
> >>> how that would be a significant concern.
> >>>
> >>> Some options are:
> >>>  * Status quo, no Python API for low-level Parquet encryption. This has
> >>> two possible outcomes:
> >>>    * Eventually the high-level API gets merged and gets a Python binding.
> >>>    * The high-level encryption API is never merged, and Python users never
> >>> get access to encryption.
> >>>  * Add low-level Parquet encryption API to PyArrow, perhaps using the
> >>> "hazmat" idiom used by the Python cryptography package (an API namespace
> >>> indicating "use at your own risk, this is dangerous", basically, e.g.
> >>> `cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
> >>>    * Gidon Gershinsky did not find this suggestion compelling enough to
> >>> override his security concerns.
> >>>  * Low-level encryption done as a third-party Python package, either private
> >>> or open source. This is annoying technically, and plausibly would require
> >>> maintaining a fork.
> >>>
> >>> Any other ideas? Thoughts on these options?
> >>>
> >>> —Itamar
> >>
> >
>
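
For readers less familiar with the security argument quoted above (giving the
same keys to a number of different files can break AES GCM), here is a rough
illustration. It is a toy sketch, not Parquet, PyArrow or AES code: the
keystream is a stand-in built from SHA-256, and the names `toy_keystream`,
`toy_encrypt`, `file_a` and `file_b` are made up for this example. It assumes
the failure case where a (key, nonce) pair repeats across two files, which
becomes more likely the more files share one key. Like GCM's underlying
counter mode, the toy cipher XORs the plaintext with a keystream, so a repeated
keystream cancels out.

```python
import hashlib
import secrets


def toy_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream built from SHA-256. NOT AES, NOT secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings (truncates to the shorter one)."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Encrypt by XORing the plaintext with the keystream."""
    return xor_bytes(plaintext, toy_keystream(key, nonce, len(plaintext)))


# Two hypothetical "files" of equal length.
file_a = b"salary=100000;id=1"
file_b = b"salary=250000;id=2"

# Misuse: the same key and the same nonce for both files.
shared_key = secrets.token_bytes(32)
repeated_nonce = bytes(12)
ct_a = toy_encrypt(shared_key, repeated_nonce, file_a)
ct_b = toy_encrypt(shared_key, repeated_nonce, file_b)

# The keystream cancels: XOR of ciphertexts equals XOR of plaintexts.
leak = xor_bytes(ct_a, ct_b)
assert leak == xor_bytes(file_a, file_b)

# Anyone who knows (or guesses) one file recovers the other without the key.
recovered_b = xor_bytes(leak, file_a)
assert recovered_b == file_b

# With a fresh random key per file, the keystreams differ and no longer cancel.
ct_a2 = toy_encrypt(secrets.token_bytes(32), repeated_nonce, file_a)
ct_b2 = toy_encrypt(secrets.token_bytes(32), repeated_nonce, file_b)
assert xor_bytes(ct_a2, ct_b2) != xor_bytes(file_a, file_b)
```

The last step only hints at what the quoted message attributes to the
high-level layer: with envelope encryption, each file would get its own data
key, so a caller cannot accidentally share one key across many files. How the
actual layer generates and wraps those keys is not shown here.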
