I believe the shared structures that were debated are the key caches.

Cheers, Gidon


On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> >
> > I don't think any notion of threading should be present in the
> > implementation, except for the required locks around shared structures.
>
>
> I seem to recall the debate was how to model some class interactions to
> determine what should be considered shared structures and what should not.
>
> On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> wrote:
>
> > This certainly sounds good to me.
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > >
> > > I don't think any notion of threading should be present in the
> > > implementation, except for the required locks around shared structures.
> > >  I don't know where the idea of a "main thread" comes from, but it
> > > probably shouldn't exist in a C++ library.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
> > > > Just to clarify. There are two options, which one do you refer to? A
> > > design
> > > > with a main thread that handles projections and the keys (relevant
> for
> > > the
> > > > projected columns); or the current code with any thread allowed to
> > handle
> > > > full file reading, inc the footer, column projections and their keys?
> > Can
> > > > you finalize this with Micah?
> > > > The good news is, Tham is still interested to resume this work, and
> is
> > ok
> > > > with either option. Please let her know whether the current threading
> > > model
> > > > stays, or should be modified with the changes proposed in the doc
> (for
> > > the
> > > > latter, some guidance with the details would be needed).
> > > >
> > > > Cheers, Gidon
> > > >
> > > >
> > > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > >>
> > > >>
> > > >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> > > >>> From the doc,
> > > >>> "To maintain consistency with the style of parquet-cpp, the above
> > > >>> structures should not be explicitly synchronized with individual
> > > mutexes.
> > > >>> In the case of a parquet::arrow::FileReader, the request to read a
> > > given
> > > >>> selection of row groups and columns is issued from a single main
> > > thread.
> > > >>> Note that this does require that all keys required for a read are
> > > >> assembled
> > > >>> on the main thread so that DecryptionKeyRetriever objects are not
> > > >> directly
> > > >>> accessing any caches"
> > > >>>
> > > >>> The current PR code doesn't require a single main thread. Any
> thread
> > > can
> > > >>> read any file, both footer and pages. So the key cache is shared,
> to
> > > save
> > > >>> N-1 interactions with the KMS server.
> > > >>
> > > >> I don't think there's any contention on this.  IMHO the only
> concerns
> > > >> are about the implementation, not the semantics.
> > > >>
> > > >> Best regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >>>
> > > >>> Cheers, Gidon
> > > >>>
> > > >>>
> > > >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <
> anto...@python.org>
> > > >> wrote:
> > > >>>
> > > >>>>
> > > >>>> I'm not sure a threading model is expected for an encryption
> layer.
> > > Am
> > > >>>> I missing something?
> > > >>>>
> > > >>>> Regards
> > > >>>>
> > > >>>> Antoine.
> > > >>>>
> > > >>>>
> > > >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> > > >>>>> Precisely, the main change is in the threading model. Afaik, the
> > > >> document
> > > >>>>> proposes a model that fits pandas, but might be problematic for
> > other
> > > >>>> users
> > > >>>>> of this library.
> > > >>>>> Technically, this is not showstopper though; if the community
> > decides
> > > >> on
> > > >>>>> this model, it will be compatible with the high-level encryption
> > > >> design;
> > > >>>>> but the change implementation would need to be done by pandas
> > experts
> > > >>>> (not
> > > >>>>> us; but we'll help where we can).
> > > >>>>> Micah, you know this subject (and the community) better than we
> do
> > -
> > > >> we'd
> > > >>>>> much appreciate it if you'd take a lead on removing this
> roadblock.
> > > >>>>>
> > > >>>>> Cheers, Gidon
> > > >>>>>
> > > >>>>>
> > > >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
> > > emkornfi...@gmail.com
> > > >>>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> I think some of the comments might be conflicting.  One of the
> > > >> concerns
> > > >>>>>> (that I would need to refresh myself on to offer an opinion
> which
> > > was
> > > >>>>>> covered in Ben's doc) was the threading model we expect in the
> > > >> library.
> > > >>>>>>
> > > >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <
> > anto...@python.org>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> Hi Gidon,
> > > >>>>>>>
> > > >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> > > >>>>>>>> Regarding the high-level layer, I think it waits for a
> progress
> > at
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>
> > >
> >
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> > > >>>>>>>> No activity there since last November. This is unfortunate,
> > > because
> > > >>>>>> Tham
> > > >>>>>>>> has put a lot of work in coding the high-level layer (and
> > > addressing
> > > >>>>>> 200+
> > > >>>>>>>> review comments) in the PR
> > > >> https://github.com/apache/arrow/pull/8023.
> > > >>>>>>> The
> > > >>>>>>>> code is functional, compatible with the Java version in
> > > parquet-mr,
> > > >>>> and
> > > >>>>>>> can
> > > >>>>>>>> be updated with the threading changes in the doc above. I hope
> > all
> > > >>>> this
> > > >>>>>>>> good work will not be wasted.
> > > >>>>>>>
> > > >>>>>>> I'm sorry for the possibly frustrating process.  Looking at the
> > PR,
> > > >>>>>>> though, it seems a bunch of comments were not addressed.  Is it
> > > >>>> possible
> > > >>>>>>> to go through them and ensure they get an answer and/or a
> > > resolution?
> > > >>>>>>>
> > > >>>>>>> Best regards
> > > >>>>>>>
> > > >>>>>>> Antoine.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Cheers, Gidon
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> > > >>>> emkornfi...@gmail.com
> > > >>>>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> My thoughts:
> > > >>>>>>>>> 1.  I've lost track of the higher level encryption
> > implementation
> > > >> in
> > > >>>>>>> C++.
> > > >>>>>>>>> I think we were trying to come to a consensus on the
> > > >> threading/thread
> > > >>>>>>>>> safety model?
> > > >>>>>>>>>
> > > >>>>>>>>> 2.  I'm open to exposing the lower level encryption libraries
> > in
> > > >>>>>> python
> > > >>>>>>>>> (without appropriate namespacing/communication).  It seems at
> > > least
> > > >>>>>> for
> > > >>>>>>>>> reading, there is potentially less harm (I'll caveat that
> with
> > > I'm
> > > >>>>>> not a
> > > >>>>>>>>> security expert).  Are both the low level read and write
> > > >>>>>> implementations
> > > >>>>>>>>> necessary?  (it probably makes sense to have a few smaller
> PRs
> > > for
> > > >>>>>>> exposing
> > > >>>>>>>>> this functionality anyways).
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> > > >>>>>>>>> ita...@pythonspeed.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API
> appears
> > > >>>>>> stalled
> > > >>>>>>> (
> > > >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking
> into
> > > >>>>>> exposing
> > > >>>>>>>>> the
> > > >>>>>>>>>> low-level Parquet encryption API to Python.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Arguments for doing this: the low-level API is all the users
> > I'm
> > > >>>>>>> talking
> > > >>>>>>>>>> to need, at the moment, so it's plausible others would also
> > find
> > > >>>> some
> > > >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
> > > >>>>>> encryption.
> > > >>>>>>>>> Then
> > > >>>>>>>>>> again, it might only be this one company and no one else
> > cares.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The arguments against, per Gidon Gershinsky:
> > > >>>>>>>>>>
> > > >>>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg
> > > >> giving
> > > >>>>>> the
> > > >>>>>>>>>> same keys for a number of different files; this'd break the
> > AES
> > > >> GCM
> > > >>>>>>>>>> cipher). The high-level encryption layer handles that by
> > > applying
> > > >>>>>>>>> envelope
> > > >>>>>>>>>> encryption and other best practices in data security. Also,
> > this
> > > >>>>>> layer
> > > >>>>>>> is
> > > >>>>>>>>>> maintained by the community, meaning that future
> improvements
> > > and
> > > >>>>>>>>> security
> > > >>>>>>>>>> fixes can be upstreamed by anyone, and available to all.
> > > >>>>>>>>>>>  * compatibility: parquet-mr implements the high-level
> > > encryption
> > > >>>>>>>>> layer.
> > > >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be
> > readable
> > > >> by
> > > >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the
> Arrow
> > > >> users
> > > >>>>>>> with
> > > >>>>>>>>>> the high-level API.
> > > >>>>>>>>>>> ...
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> The current situation is not ideal, it'd be good to merge
> the
> > > >>>>>>>>> high-level
> > > >>>>>>>>>> PR (and maybe hide the low level), but here we are; also,
> C++
> > > is a
> > > >>>>>> kind
> > > >>>>>>>>> of
> > > >>>>>>>>>> a low-level language; Python would expose it to a less
> > > experienced
> > > >>>>>>>>> audience.
> > > >>>>>>>>>>
> > > >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> > > >>>>>>>>>>
> > > >>>>>>>>>> I find the compatibility argument less compelling, that's
> > > readily
> > > >>>>>>>>>> addressed by documentation. I am not a crypto expert so I
> > can't
> > > >>>>>>> evaluate
> > > >>>>>>>>>> how risky exposing the low-level encryption APIs would be,
> > but I
> > > >> can
> > > >>>>>>> see
> > > >>>>>>>>>> how that would be a significant concern.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Some options are:
> > > >>>>>>>>>>  * Status quo, no Python API for low-level Parquet
> encryption.
> > > >> This
> > > >>>>>> has
> > > >>>>>>>>>> two possible outcomes:
> > > >>>>>>>>>>    * Eventually high-level API gets merged, gets Python
> > binding.
> > > >>>>>>>>>>    * High-level encryption API is never merged, Python users
> > > never
> > > >>>>>> get
> > > >>>>>>>>>> access to encryption.
> > > >>>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps
> > > using
> > > >>>>>>>>> "hazmat"
> > > >>>>>>>>>> idiom used by the Python cryptography package (API namespace
> > > >>>>>> indicating
> > > >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
> > > >>>>>>>>>>
> > > `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
> > > >>>>>>>>>>    * Gidon Gershinsky did not find this suggestion
> compelling
> > > >> enough
> > > >>>>>> to
> > > >>>>>>>>>> override his security concerns.
> > > >>>>>>>>>>  * Low-level encryption done as third party Python package,
> > > either
> > > >>>>>>>>> private
> > > >>>>>>>>>> or open source. This is annoying technically, plausibly
> would
> > > >>>> require
> > > >>>>>>>>>> maintaining a fork.
> > > >>>>>>>>>> Any other ideas? Thoughts on these options?
> > > >>>>>>>>>>
> > > >>>>>>>>>> —Itamar
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
> >
>

Reply via email to