This certainly sounds good to me.

Cheers, Gidon


On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.
>  I don't know where the idea of a "main thread" comes from, but it
> probably shouldn't exist in a C++ library.
>
> Regards
>
> Antoine.
>
>
>
> Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
> > Just to clarify. There are two options, which one do you refer to? A
> design
> > with a main thread that handles projections and the keys (relevant for
> the
> > projected columns); or the current code with any thread allowed to handle
> > full file reading, inc the footer, column projections and their keys? Can
> > you finalize this with Micah?
> > The good news is, Tham is still interested to resume this work, and is ok
> > with either option. Please let her know whether the current threading
> model
> > stays, or should be modified with the changes proposed in the doc (for
> the
> > latter, some guidance with the details would be needed).
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >>
> >>
> >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> >>> From the doc,
> >>> "To maintain consistency with the style of parquet-cpp, the above
> >>> structures should not be explicitly synchronized with individual
> mutexes.
> >>> In the case of a parquet::arrow::FileReader, the request to read a
> given
> >>> selection of row groups and columns is issued from a single main
> thread.
> >>> Note that this does require that all keys required for a read are
> >> assembled
> >>> on the main thread so that DecryptionKeyRetriever objects are not
> >> directly
> >>> accessing any caches"
> >>>
> >>> The current PR code doesn't require a single main thread. Any thread
> can
> >>> read any file, both footer and pages. So the key cache is shared, to
> save
> >>> N-1 interactions with the KMS server.
> >>
> >> I don't think there's any contention on this.  IMHO the only concerns
> >> are about the implementation, not the semantics.
> >>
> >> Best regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Cheers, Gidon
> >>>
> >>>
> >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org>
> >> wrote:
> >>>
> >>>>
> >>>> I'm not sure a threading model is expected for an encryption layer.
> Am
> >>>> I missing something?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> >>>>> Precisely, the main change is in the threading model. Afaik, the
> >> document
> >>>>> proposes a model that fits pandas, but might be problematic for other
> >>>> users
> >>>>> of this library.
> >>>>> Technically, this is not showstopper though; if the community decides
> >> on
> >>>>> this model, it will be compatible with the high-level encryption
> >> design;
> >>>>> but the change implementation would need to be done by pandas experts
> >>>> (not
> >>>>> us; but we'll help where we can).
> >>>>> Micah, you know this subject (and the community) better than we do -
> >> we'd
> >>>>> much appreciate it if you'd take a lead on removing this roadblock.
> >>>>>
> >>>>> Cheers, Gidon
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> I think some of the comments might be conflicting.  One of the
> >> concerns
> >>>>>> (that I would need to refresh myself on to offer an opinion which
> was
> >>>>>> covered in Ben's doc) was the threading model we expect in the
> >> library.
> >>>>>>
> >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org>
> >>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Hi Gidon,
> >>>>>>>
> >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> >>>>>>>> Regarding the high-level layer, I think it waits for a progress at
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> >>>>>>>> No activity there since last November. This is unfortunate,
> because
> >>>>>> Tham
> >>>>>>>> has put a lot of work in coding the high-level layer (and
> addressing
> >>>>>> 200+
> >>>>>>>> review comments) in the PR
> >> https://github.com/apache/arrow/pull/8023.
> >>>>>>> The
> >>>>>>>> code is functional, compatible with the Java version in
> parquet-mr,
> >>>> and
> >>>>>>> can
> >>>>>>>> be updated with the threading changes in the doc above. I hope all
> >>>> this
> >>>>>>>> good work will not be wasted.
> >>>>>>>
> >>>>>>> I'm sorry for the possibly frustrating process.  Looking at the PR,
> >>>>>>> though, it seems a bunch of comments were not addressed.  Is it
> >>>> possible
> >>>>>>> to go through them and ensure they get an answer and/or a
> resolution?
> >>>>>>>
> >>>>>>> Best regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Cheers, Gidon
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> >>>> emkornfi...@gmail.com
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> My thoughts:
> >>>>>>>>> 1.  I've lost track of the higher level encryption implementation
> >> in
> >>>>>>> C++.
> >>>>>>>>> I think we were trying to come to a consensus on the
> >> threading/thread
> >>>>>>>>> safety model?
> >>>>>>>>>
> >>>>>>>>> 2.  I'm open to exposing the lower level encryption libraries in
> >>>>>> python
> >>>>>>>>> (without appropriate namespacing/communication).  It seems at
> least
> >>>>>> for
> >>>>>>>>> reading, there is potentially less harm (I'll caveat that with
> I'm
> >>>>>> not a
> >>>>>>>>> security expert).  Are both the low level read and write
> >>>>>> implementations
> >>>>>>>>> necessary?  (it probably makes sense to have a few smaller PRs
> for
> >>>>>>> exposing
> >>>>>>>>> this functionality anyways).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> >>>>>>>>> ita...@pythonspeed.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears
> >>>>>> stalled
> >>>>>>> (
> >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into
> >>>>>> exposing
> >>>>>>>>> the
> >>>>>>>>>> low-level Parquet encryption API to Python.
> >>>>>>>>>>
> >>>>>>>>>> Arguments for doing this: the low-level API is all the users I'm
> >>>>>>> talking
> >>>>>>>>>> to need, at the moment, so it's plausible others would also find
> >>>> some
> >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
> >>>>>> encryption.
> >>>>>>>>> Then
> >>>>>>>>>> again, it might only be this one company and no one else cares.
> >>>>>>>>>>
> >>>>>>>>>> The arguments against, per Gidon Gershinsky:
> >>>>>>>>>>
> >>>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg
> >> giving
> >>>>>> the
> >>>>>>>>>> same keys for a number of different files; this'd break the AES
> >> GCM
> >>>>>>>>>> cipher). The high-level encryption layer handles that by
> applying
> >>>>>>>>> envelope
> >>>>>>>>>> encryption and other best practices in data security. Also, this
> >>>>>> layer
> >>>>>>> is
> >>>>>>>>>> maintained by the community, meaning that future improvements
> and
> >>>>>>>>> security
> >>>>>>>>>> fixes can be upstreamed by anyone, and available to all.
> >>>>>>>>>>>  * compatibility: parquet-mr implements the high-level
> encryption
> >>>>>>>>> layer.
> >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be readable
> >> by
> >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow
> >> users
> >>>>>>> with
> >>>>>>>>>> the high-level API.
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>> The current situation is not ideal, it'd be good to merge the
> >>>>>>>>> high-level
> >>>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++
> is a
> >>>>>> kind
> >>>>>>>>> of
> >>>>>>>>>> a low-level language; Python would expose it to a less
> experienced
> >>>>>>>>> audience.
> >>>>>>>>>>
> >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> >>>>>>>>>>
> >>>>>>>>>> I find the compatibility argument less compelling, that's
> readily
> >>>>>>>>>> addressed by documentation. I am not a crypto expert so I can't
> >>>>>>> evaluate
> >>>>>>>>>> how risky exposing the low-level encryption APIs would be, but I
> >> can
> >>>>>>> see
> >>>>>>>>>> how that would be a significant concern.
> >>>>>>>>>>
> >>>>>>>>>> Some options are:
> >>>>>>>>>>  * Status quo, no Python API for low-level Parquet encryption.
> >> This
> >>>>>> has
> >>>>>>>>>> two possible outcomes:
> >>>>>>>>>>    * Eventually high-level API gets merged, gets Python binding.
> >>>>>>>>>>    * High-level encryption API is never merged, Python users
> never
> >>>>>> get
> >>>>>>>>>> access to encryption.
> >>>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps
> using
> >>>>>>>>> "hazmat"
> >>>>>>>>>> idiom used by the Python cryptography package (API namespace
> >>>>>> indicating
> >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
> >>>>>>>>>>
> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
> >>>>>>>>>>    * Gidon Gershinsky did not find this suggestion compelling
> >> enough
> >>>>>> to
> >>>>>>>>>> override his security concerns.
> >>>>>>>>>>  * Low-level encryption done as third party Python package,
> either
> >>>>>>>>> private
> >>>>>>>>>> or open source. This is annoying technically, plausibly would
> >>>> require
> >>>>>>>>>> maintaining a fork.
> >>>>>>>>>> Any other ideas? Thoughts on these options?
> >>>>>>>>>>
> >>>>>>>>>> —Itamar
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Reply via email to