I believe the shared structures that were debated are the key caches. Cheers, Gidon
On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > > > > I don't think any notion of threading should be present in the > > implementation, except for the required locks around shared structures. > > > I seem to recall the debate was how to model some class interactions to > determine what should be considered shared structures and what should not. > > On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky <gg5...@gmail.com> wrote: > > > This certainly sounds good to me. > > > > Cheers, Gidon > > > > > > On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> > wrote: > > > > > > > > I don't think any notion of threading should be present in the > > > implementation, except for the required locks around shared structures. > > > I don't know where the idea of a "main thread" comes from, but it > > > probably shouldn't exist in a C++ library. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > > > > Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit : > > > > Just to clarify. There are two options, which one do you refer to? A > > > design > > > > with a main thread that handles projections and the keys (relevant > for > > > the > > > > projected columns); or the current code with any thread allowed to > > handle > > > > full file reading, inc the footer, column projections and their keys? > > Can > > > > you finalize this with Micah? > > > > The good news is, Tham is still interested to resume this work, and > is > > ok > > > > with either option. Please let her know whether the current threading > > > model > > > > stays, or should be modified with the changes proposed in the doc > (for > > > the > > > > latter, some guidance with the details would be needed). > > > > > > > > Cheers, Gidon > > > > > > > > > > > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> > > > wrote: > > > > > > > >> > > > >> > > > >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : > > > >>> From the doc, > > > >>> "To maintain consistency with the style of parquet-cpp, the above > > > >>> structures should not be explicitly synchronized with individual > > > mutexes. > > > >>> In the case of a parquet::arrow::FileReader, the request to read a > > > given > > > >>> selection of row groups and columns is issued from a single main > > > thread. > > > >>> Note that this does require that all keys required for a read are > > > >> assembled > > > >>> on the main thread so that DecryptionKeyRetriever objects are not > > > >> directly > > > >>> accessing any caches" > > > >>> > > > >>> The current PR code doesn't require a single main thread. Any > thread > > > can > > > >>> read any file, both footer and pages. So the key cache is shared, > to > > > save > > > >>> N-1 interactions with the KMS server. > > > >> > > > >> I don't think there's any contention on this. IMHO the only > concerns > > > >> are about the implementation, not the semantics. > > > >> > > > >> Best regards > > > >> > > > >> Antoine. > > > >> > > > >> > > > >>> > > > >>> Cheers, Gidon > > > >>> > > > >>> > > > >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou < > anto...@python.org> > > > >> wrote: > > > >>> > > > >>>> > > > >>>> I'm not sure a threading model is expected for an encryption > layer. > > > Am > > > >>>> I missing something? > > > >>>> > > > >>>> Regards > > > >>>> > > > >>>> Antoine. > > > >>>> > > > >>>> > > > >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > > > >>>>> Precisely, the main change is in the threading model. Afaik, the > > > >> document > > > >>>>> proposes a model that fits pandas, but might be problematic for > > other > > > >>>> users > > > >>>>> of this library. > > > >>>>> Technically, this is not showstopper though; if the community > > decides > > > >> on > > > >>>>> this model, it will be compatible with the high-level encryption > > > >> design; > > > >>>>> but the change implementation would need to be done by pandas > > experts > > > >>>> (not > > > >>>>> us; but we'll help where we can). > > > >>>>> Micah, you know this subject (and the community) better than we > do > > - > > > >> we'd > > > >>>>> much appreciate it if you'd take a lead on removing this > roadblock. > > > >>>>> > > > >>>>> Cheers, Gidon > > > >>>>> > > > >>>>> > > > >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield < > > > emkornfi...@gmail.com > > > >>> > > > >>>>> wrote: > > > >>>>> > > > >>>>>> I think some of the comments might be conflicting. One of the > > > >> concerns > > > >>>>>> (that I would need to refresh myself on to offer an opinion > which > > > was > > > >>>>>> covered in Ben's doc) was the threading model we expect in the > > > >> library. > > > >>>>>> > > > >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou < > > anto...@python.org> > > > >>>> wrote: > > > >>>>>> > > > >>>>>>> > > > >>>>>>> Hi Gidon, > > > >>>>>>> > > > >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > > > >>>>>>>> Regarding the high-level layer, I think it waits for a > progress > > at > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>> > > > >> > > > > > > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > > > >>>>>>>> No activity there since last November. This is unfortunate, > > > because > > > >>>>>> Tham > > > >>>>>>>> has put a lot of work in coding the high-level layer (and > > > addressing > > > >>>>>> 200+ > > > >>>>>>>> review comments) in the PR > > > >> https://github.com/apache/arrow/pull/8023. > > > >>>>>>> The > > > >>>>>>>> code is functional, compatible with the Java version in > > > parquet-mr, > > > >>>> and > > > >>>>>>> can > > > >>>>>>>> be updated with the threading changes in the doc above. I hope > > all > > > >>>> this > > > >>>>>>>> good work will not be wasted. > > > >>>>>>> > > > >>>>>>> I'm sorry for the possibly frustrating process. Looking at the > > PR, > > > >>>>>>> though, it seems a bunch of comments were not addressed. Is it > > > >>>> possible > > > >>>>>>> to go through them and ensure they get an answer and/or a > > > resolution? > > > >>>>>>> > > > >>>>>>> Best regards > > > >>>>>>> > > > >>>>>>> Antoine. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> > > > >>>>>>>> Cheers, Gidon > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < > > > >>>> emkornfi...@gmail.com > > > >>>>>>> > > > >>>>>>>> wrote: > > > >>>>>>>> > > > >>>>>>>>> My thoughts: > > > >>>>>>>>> 1. I've lost track of the higher level encryption > > implementation > > > >> in > > > >>>>>>> C++. > > > >>>>>>>>> I think we were trying to come to a consensus on the > > > >> threading/thread > > > >>>>>>>>> safety model? > > > >>>>>>>>> > > > >>>>>>>>> 2. I'm open to exposing the lower level encryption libraries > > in > > > >>>>>> python > > > >>>>>>>>> (without appropriate namespacing/communication). It seems at > > > least > > > >>>>>> for > > > >>>>>>>>> reading, there is potentially less harm (I'll caveat that > with > > > I'm > > > >>>>>> not a > > > >>>>>>>>> security expert). Are both the low level read and write > > > >>>>>> implementations > > > >>>>>>>>> necessary? (it probably makes sense to have a few smaller > PRs > > > for > > > >>>>>>> exposing > > > >>>>>>>>> this functionality anyways). > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > > > >>>>>>>>> ita...@pythonspeed.com> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Hi, > > > >>>>>>>>>> > > > >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API > appears > > > >>>>>> stalled > > > >>>>>>> ( > > > >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking > into > > > >>>>>> exposing > > > >>>>>>>>> the > > > >>>>>>>>>> low-level Parquet encryption API to Python. > > > >>>>>>>>>> > > > >>>>>>>>>> Arguments for doing this: the low-level API is all the users > > I'm > > > >>>>>>> talking > > > >>>>>>>>>> to need, at the moment, so it's plausible others would also > > find > > > >>>> some > > > >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet > > > >>>>>> encryption. > > > >>>>>>>>> Then > > > >>>>>>>>>> again, it might only be this one company and no one else > > cares. > > > >>>>>>>>>> > > > >>>>>>>>>> The arguments against, per Gidon Gershinsky: > > > >>>>>>>>>> > > > >>>>>>>>>>> * security: low-level encryption API is easy to misuse (eg > > > >> giving > > > >>>>>> the > > > >>>>>>>>>> same keys for a number of different files; this'd break the > > AES > > > >> GCM > > > >>>>>>>>>> cipher). The high-level encryption layer handles that by > > > applying > > > >>>>>>>>> envelope > > > >>>>>>>>>> encryption and other best practices in data security. Also, > > this > > > >>>>>> layer > > > >>>>>>> is > > > >>>>>>>>>> maintained by the community, meaning that future > improvements > > > and > > > >>>>>>>>> security > > > >>>>>>>>>> fixes can be upstreamed by anyone, and available to all. > > > >>>>>>>>>>> * compatibility: parquet-mr implements the high-level > > > encryption > > > >>>>>>>>> layer. > > > >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be > > readable > > > >> by > > > >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the > Arrow > > > >> users > > > >>>>>>> with > > > >>>>>>>>>> the high-level API. > > > >>>>>>>>>>> ... > > > >>>>>>>>>>> > > > >>>>>>>>>>> The current situation is not ideal, it'd be good to merge > the > > > >>>>>>>>> high-level > > > >>>>>>>>>> PR (and maybe hide the low level), but here we are; also, > C++ > > > is a > > > >>>>>> kind > > > >>>>>>>>> of > > > >>>>>>>>>> a low-level language; Python would expose it to a less > > > experienced > > > >>>>>>>>> audience. > > > >>>>>>>>>> > > > >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > > > >>>>>>>>>> > > > >>>>>>>>>> I find the compatibility argument less compelling, that's > > > readily > > > >>>>>>>>>> addressed by documentation. I am not a crypto expert so I > > can't > > > >>>>>>> evaluate > > > >>>>>>>>>> how risky exposing the low-level encryption APIs would be, > > but I > > > >> can > > > >>>>>>> see > > > >>>>>>>>>> how that would be a significant concern. > > > >>>>>>>>>> > > > >>>>>>>>>> Some options are: > > > >>>>>>>>>> * Status quo, no Python API for low-level Parquet > encryption. > > > >> This > > > >>>>>> has > > > >>>>>>>>>> two possible outcomes: > > > >>>>>>>>>> * Eventually high-level API gets merged, gets Python > > binding. > > > >>>>>>>>>> * High-level encryption API is never merged, Python users > > > never > > > >>>>>> get > > > >>>>>>>>>> access to encryption. > > > >>>>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps > > > using > > > >>>>>>>>> "hazmat" > > > >>>>>>>>>> idiom used by the Python cryptography package (API namespace > > > >>>>>> indicating > > > >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g. > > > >>>>>>>>>> > > > `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > > > >>>>>>>>>> * Gidon Gershinsky did not find this suggestion > compelling > > > >> enough > > > >>>>>> to > > > >>>>>>>>>> override his security concerns. > > > >>>>>>>>>> * Low-level encryption done as third party Python package, > > > either > > > >>>>>>>>> private > > > >>>>>>>>>> or open source. This is annoying technically, plausibly > would > > > >>>> require > > > >>>>>>>>>> maintaining a fork. > > > >>>>>>>>>> Any other ideas? Thoughts on these options? > > > >>>>>>>>>> > > > >>>>>>>>>> —Itamar > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > > > > > > > >