>From the doc,
"To maintain consistency with the style of parquet-cpp, the above
structures should not be explicitly synchronized with individual mutexes.
In the case of a parquet::arrow::FileReader, the request to read a given
selection of row groups and columns is issued from a single main thread.
Note that this does require that all keys required for a read are assembled
on the main thread so that DecryptionKeyRetriever objects are not directly
accessing any caches"

The current PR code doesn't require a single main thread. Any thread can
read any file, both footer and pages. So the key cache is shared, to save
N-1 interactions with the KMS server.

Cheers, Gidon


On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I'm not sure a threading model is expected for an encryption layer.  Am
> I missing something?
>
> Regards
>
> Antoine.
>
>
> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> > Precisely, the main change is in the threading model. Afaik, the document
> > proposes a model that fits pandas, but might be problematic for other
> users
> > of this library.
> > Technically, this is not showstopper though; if the community decides on
> > this model, it will be compatible with the high-level encryption design;
> > but the change implementation would need to be done by pandas experts
> (not
> > us; but we'll help where we can).
> > Micah, you know this subject (and the community) better than we do - we'd
> > much appreciate it if you'd take a lead on removing this roadblock.
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> I think some of the comments might be conflicting.  One of the concerns
> >> (that I would need to refresh myself on to offer an opinion which was
> >> covered in Ben's doc) was the threading model we expect in the library.
> >>
> >> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >>
> >>>
> >>> Hi Gidon,
> >>>
> >>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> >>>> Regarding the high-level layer, I think it waits for a progress at
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> >>>> No activity there since last November. This is unfortunate, because
> >> Tham
> >>>> has put a lot of work in coding the high-level layer (and addressing
> >> 200+
> >>>> review comments) in the PR https://github.com/apache/arrow/pull/8023.
> >>> The
> >>>> code is functional, compatible with the Java version in parquet-mr,
> and
> >>> can
> >>>> be updated with the threading changes in the doc above. I hope all
> this
> >>>> good work will not be wasted.
> >>>
> >>> I'm sorry for the possibly frustrating process.  Looking at the PR,
> >>> though, it seems a bunch of comments were not addressed.  Is it
> possible
> >>> to go through them and ensure they get an answer and/or a resolution?
> >>>
> >>> Best regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>>>
> >>>> Cheers, Gidon
> >>>>
> >>>>
> >>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
> >>>> wrote:
> >>>>
> >>>>> My thoughts:
> >>>>> 1.  I've lost track of the higher level encryption implementation in
> >>> C++.
> >>>>> I think we were trying to come to a consensus on the threading/thread
> >>>>> safety model?
> >>>>>
> >>>>> 2.  I'm open to exposing the lower level encryption libraries in
> >> python
> >>>>> (without appropriate namespacing/communication).  It seems at least
> >> for
> >>>>> reading, there is potentially less harm (I'll caveat that with I'm
> >> not a
> >>>>> security expert).  Are both the low level read and write
> >> implementations
> >>>>> necessary?  (it probably makes sense to have a few smaller PRs for
> >>> exposing
> >>>>> this functionality anyways).
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> >>>>> ita...@pythonspeed.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Since the PR for high-level C++ Parquet encryption API appears
> >> stalled
> >>> (
> >>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into
> >> exposing
> >>>>> the
> >>>>>> low-level Parquet encryption API to Python.
> >>>>>>
> >>>>>> Arguments for doing this: the low-level API is all the users I'm
> >>> talking
> >>>>>> to need, at the moment, so it's plausible others would also find
> some
> >>>>>> benefit in having the Pyarrow API expose low-level Parquet
> >> encryption.
> >>>>> Then
> >>>>>> again, it might only be this one company and no one else cares.
> >>>>>>
> >>>>>> The arguments against, per Gidon Gershinsky:
> >>>>>>
> >>>>>>>  * security: low-level encryption API is easy to misuse (eg giving
> >> the
> >>>>>> same keys for a number of different files; this'd break the AES GCM
> >>>>>> cipher). The high-level encryption layer handles that by applying
> >>>>> envelope
> >>>>>> encryption and other best practices in data security. Also, this
> >> layer
> >>> is
> >>>>>> maintained by the community, meaning that future improvements and
> >>>>> security
> >>>>>> fixes can be upstreamed by anyone, and available to all.
> >>>>>>>  * compatibility: parquet-mr implements the high-level encryption
> >>>>> layer.
> >>>>>> If we want the files produced by Spark/Presto/etc to be readable by
> >>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow users
> >>> with
> >>>>>> the high-level API.
> >>>>>>> ...
> >>>>>>>
> >>>>>>> The current situation is not ideal, it'd be good to merge the
> >>>>> high-level
> >>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a
> >> kind
> >>>>> of
> >>>>>> a low-level language; Python would expose it to a less experienced
> >>>>> audience.
> >>>>>>
> >>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> >>>>>>
> >>>>>> I find the compatibility argument less compelling, that's readily
> >>>>>> addressed by documentation. I am not a crypto expert so I can't
> >>> evaluate
> >>>>>> how risky exposing the low-level encryption APIs would be, but I can
> >>> see
> >>>>>> how that would be a significant concern.
> >>>>>>
> >>>>>> Some options are:
> >>>>>>  * Status quo, no Python API for low-level Parquet encryption. This
> >> has
> >>>>>> two possible outcomes:
> >>>>>>    * Eventually high-level API gets merged, gets Python binding.
> >>>>>>    * High-level encryption API is never merged, Python users never
> >> get
> >>>>>> access to encryption.
> >>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps using
> >>>>> "hazmat"
> >>>>>> idiom used by the Python cryptography package (API namespace
> >> indicating
> >>>>>> "use at your own risk, this is dangerous", basically, e.g.
> >>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
> >>>>>>    * Gidon Gershinsky did not find this suggestion compelling enough
> >> to
> >>>>>> override his security concerns.
> >>>>>>  * Low-level encryption done as third party Python package, either
> >>>>> private
> >>>>>> or open source. This is annoying technically, plausibly would
> require
> >>>>>> maintaining a fork.
> >>>>>> Any other ideas? Thoughts on these options?
> >>>>>>
> >>>>>> —Itamar
> >>>>>
> >>>>
> >>>
> >>
> >
>

Reply via email to