Thanks, then we'll just go ahead and address the remaining comments.

Cheers, Gidon


On Thu, Feb 18, 2021 at 5:45 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I don't think there's any concern around having a process-global shared
> key cache.  The discussion was just around the implementation.
>
> Also, FTR, a standalone LRU cache class is proposed here, which may
> reduce the amount of original code in the Parquet encryption PR:
> https://github.com/apache/arrow/pull/8716
>
> Best regards
>
> Antoine.
>
>
> Le 18/02/2021 à 16:40, Gidon Gershinsky a écrit :
> > I believe the shared structures that were debated are the key caches.
> >
> > Cheers, Gidon
> >
> >
> > On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >>>
> >>> I don't think any notion of threading should be present in the
> >>> implementation, except for the required locks around shared structures.
> >>
> >>
> >> I seem to recall the debate was how to model some class interactions to
> >> determine what should be considered shared structures and what should
> not.
> >>
> >> On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com>
> wrote:
> >>
> >>> This certainly sounds good to me.
> >>>
> >>> Cheers, Gidon
> >>>
> >>>
> >>> On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org>
> >> wrote:
> >>>
> >>>>
> >>>> I don't think any notion of threading should be present in the
> >>>> implementation, except for the required locks around shared
> structures.
> >>>>  I don't know where the idea of a "main thread" comes from, but it
> >>>> probably shouldn't exist in a C++ library.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>>
> >>>> Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
> >>>>> Just to clarify. There are two options, which one do you refer to? A
> >>>> design
> >>>>> with a main thread that handles projections and the keys (relevant
> >> for
> >>>> the
> >>>>> projected columns); or the current code with any thread allowed to
> >>> handle
> >>>>> full file reading, inc the footer, column projections and their keys?
> >>> Can
> >>>>> you finalize this with Micah?
> >>>>> The good news is, Tham is still interested to resume this work, and
> >> is
> >>> ok
> >>>>> with either option. Please let her know whether the current threading
> >>>> model
> >>>>> stays, or should be modified with the changes proposed in the doc
> >> (for
> >>>> the
> >>>>> latter, some guidance with the details would be needed).
> >>>>>
> >>>>> Cheers, Gidon
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>
> >>>> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> >>>>>>> From the doc,
> >>>>>>> "To maintain consistency with the style of parquet-cpp, the above
> >>>>>>> structures should not be explicitly synchronized with individual
> >>>> mutexes.
> >>>>>>> In the case of a parquet::arrow::FileReader, the request to read a
> >>>> given
> >>>>>>> selection of row groups and columns is issued from a single main
> >>>> thread.
> >>>>>>> Note that this does require that all keys required for a read are
> >>>>>> assembled
> >>>>>>> on the main thread so that DecryptionKeyRetriever objects are not
> >>>>>> directly
> >>>>>>> accessing any caches"
> >>>>>>>
> >>>>>>> The current PR code doesn't require a single main thread. Any
> >> thread
> >>>> can
> >>>>>>> read any file, both footer and pages. So the key cache is shared,
> >> to
> >>>> save
> >>>>>>> N-1 interactions with the KMS server.
> >>>>>>
> >>>>>> I don't think there's any contention on this.  IMHO the only
> >> concerns
> >>>>>> are about the implementation, not the semantics.
> >>>>>>
> >>>>>> Best regards
> >>>>>>
> >>>>>> Antoine.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Cheers, Gidon
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <
> >> anto...@python.org>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I'm not sure a threading model is expected for an encryption
> >> layer.
> >>>> Am
> >>>>>>>> I missing something?
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> >>>>>>>>> Precisely, the main change is in the threading model. Afaik, the
> >>>>>> document
> >>>>>>>>> proposes a model that fits pandas, but might be problematic for
> >>> other
> >>>>>>>> users
> >>>>>>>>> of this library.
> >>>>>>>>> Technically, this is not showstopper though; if the community
> >>> decides
> >>>>>> on
> >>>>>>>>> this model, it will be compatible with the high-level encryption
> >>>>>> design;
> >>>>>>>>> but the change implementation would need to be done by pandas
> >>> experts
> >>>>>>>> (not
> >>>>>>>>> us; but we'll help where we can).
> >>>>>>>>> Micah, you know this subject (and the community) better than we
> >> do
> >>> -
> >>>>>> we'd
> >>>>>>>>> much appreciate it if you'd take a lead on removing this
> >> roadblock.
> >>>>>>>>>
> >>>>>>>>> Cheers, Gidon
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
> >>>> emkornfi...@gmail.com
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I think some of the comments might be conflicting.  One of the
> >>>>>> concerns
> >>>>>>>>>> (that I would need to refresh myself on to offer an opinion
> >> which
> >>>> was
> >>>>>>>>>> covered in Ben's doc) was the threading model we expect in the
> >>>>>> library.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <
> >>> anto...@python.org>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Gidon,
> >>>>>>>>>>>
> >>>>>>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> >>>>>>>>>>>> Regarding the high-level layer, I think it waits for a
> >> progress
> >>> at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> >>>>>>>>>>>> No activity there since last November. This is unfortunate,
> >>>> because
> >>>>>>>>>> Tham
> >>>>>>>>>>>> has put a lot of work in coding the high-level layer (and
> >>>> addressing
> >>>>>>>>>> 200+
> >>>>>>>>>>>> review comments) in the PR
> >>>>>> https://github.com/apache/arrow/pull/8023.
> >>>>>>>>>>> The
> >>>>>>>>>>>> code is functional, compatible with the Java version in
> >>>> parquet-mr,
> >>>>>>>> and
> >>>>>>>>>>> can
> >>>>>>>>>>>> be updated with the threading changes in the doc above. I hope
> >>> all
> >>>>>>>> this
> >>>>>>>>>>>> good work will not be wasted.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm sorry for the possibly frustrating process.  Looking at the
> >>> PR,
> >>>>>>>>>>> though, it seems a bunch of comments were not addressed.  Is it
> >>>>>>>> possible
> >>>>>>>>>>> to go through them and ensure they get an answer and/or a
> >>>> resolution?
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards
> >>>>>>>>>>>
> >>>>>>>>>>> Antoine.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers, Gidon
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> >>>>>>>> emkornfi...@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> My thoughts:
> >>>>>>>>>>>>> 1.  I've lost track of the higher level encryption
> >>> implementation
> >>>>>> in
> >>>>>>>>>>> C++.
> >>>>>>>>>>>>> I think we were trying to come to a consensus on the
> >>>>>> threading/thread
> >>>>>>>>>>>>> safety model?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2.  I'm open to exposing the lower level encryption libraries
> >>> in
> >>>>>>>>>> python
> >>>>>>>>>>>>> (without appropriate namespacing/communication).  It seems at
> >>>> least
> >>>>>>>>>> for
> >>>>>>>>>>>>> reading, there is potentially less harm (I'll caveat that
> >> with
> >>>> I'm
> >>>>>>>>>> not a
> >>>>>>>>>>>>> security expert).  Are both the low level read and write
> >>>>>>>>>> implementations
> >>>>>>>>>>>>> necessary?  (it probably makes sense to have a few smaller
> >> PRs
> >>>> for
> >>>>>>>>>>> exposing
> >>>>>>>>>>>>> this functionality anyways).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> >>>>>>>>>>>>> ita...@pythonspeed.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Since the PR for high-level C++ Parquet encryption API
> >> appears
> >>>>>>>>>> stalled
> >>>>>>>>>>> (
> >>>>>>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking
> >> into
> >>>>>>>>>> exposing
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> low-level Parquet encryption API to Python.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Arguments for doing this: the low-level API is all the users
> >>> I'm
> >>>>>>>>>>> talking
> >>>>>>>>>>>>>> to need, at the moment, so it's plausible others would also
> >>> find
> >>>>>>>> some
> >>>>>>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
> >>>>>>>>>> encryption.
> >>>>>>>>>>>>> Then
> >>>>>>>>>>>>>> again, it might only be this one company and no one else
> >>> cares.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The arguments against, per Gidon Gershinsky:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg
> >>>>>> giving
> >>>>>>>>>> the
> >>>>>>>>>>>>>> same keys for a number of different files; this'd break the
> >>> AES
> >>>>>> GCM
> >>>>>>>>>>>>>> cipher). The high-level encryption layer handles that by
> >>>> applying
> >>>>>>>>>>>>> envelope
> >>>>>>>>>>>>>> encryption and other best practices in data security. Also,
> >>> this
> >>>>>>>>>> layer
> >>>>>>>>>>> is
> >>>>>>>>>>>>>> maintained by the community, meaning that future
> >> improvements
> >>>> and
> >>>>>>>>>>>>> security
> >>>>>>>>>>>>>> fixes can be upstreamed by anyone, and available to all.
> >>>>>>>>>>>>>>>  * compatibility: parquet-mr implements the high-level
> >>>> encryption
> >>>>>>>>>>>>> layer.
> >>>>>>>>>>>>>> If we want the files produced by Spark/Presto/etc to be
> >>> readable
> >>>>>> by
> >>>>>>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the
> >> Arrow
> >>>>>> users
> >>>>>>>>>>> with
> >>>>>>>>>>>>>> the high-level API.
> >>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The current situation is not ideal, it'd be good to merge
> >> the
> >>>>>>>>>>>>> high-level
> >>>>>>>>>>>>>> PR (and maybe hide the low level), but here we are; also,
> >> C++
> >>>> is a
> >>>>>>>>>> kind
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>> a low-level language; Python would expose it to a less
> >>>> experienced
> >>>>>>>>>>>>> audience.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I find the compatibility argument less compelling, that's
> >>>> readily
> >>>>>>>>>>>>>> addressed by documentation. I am not a crypto expert so I
> >>> can't
> >>>>>>>>>>> evaluate
> >>>>>>>>>>>>>> how risky exposing the low-level encryption APIs would be,
> >>> but I
> >>>>>> can
> >>>>>>>>>>> see
> >>>>>>>>>>>>>> how that would be a significant concern.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Some options are:
> >>>>>>>>>>>>>>  * Status quo, no Python API for low-level Parquet
> >> encryption.
> >>>>>> This
> >>>>>>>>>> has
> >>>>>>>>>>>>>> two possible outcomes:
> >>>>>>>>>>>>>>    * Eventually high-level API gets merged, gets Python
> >>> binding.
> >>>>>>>>>>>>>>    * High-level encryption API is never merged, Python users
> >>>> never
> >>>>>>>>>> get
> >>>>>>>>>>>>>> access to encryption.
> >>>>>>>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps
> >>>> using
> >>>>>>>>>>>>> "hazmat"
> >>>>>>>>>>>>>> idiom used by the Python cryptography package (API namespace
> >>>>>>>>>> indicating
> >>>>>>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
> >>>>>>>>>>>>>>
> >>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
> >>>>>>>>>>>>>>    * Gidon Gershinsky did not find this suggestion
> >> compelling
> >>>>>> enough
> >>>>>>>>>> to
> >>>>>>>>>>>>>> override his security concerns.
> >>>>>>>>>>>>>>  * Low-level encryption done as third party Python package,
> >>>> either
> >>>>>>>>>>>>> private
> >>>>>>>>>>>>>> or open source. This is annoying technically, plausibly
> >> would
> >>>>>>>> require
> >>>>>>>>>>>>>> maintaining a fork.
> >>>>>>>>>>>>>> Any other ideas? Thoughts on these options?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> —Itamar
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Reply via email to