This certainly sounds good to me. Cheers, Gidon
On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou <anto...@python.org> wrote: > > I don't think any notion of threading should be present in the > implementation, except for the required locks around shared structures. > I don't know where the idea of a "main thread" comes from, but it > probably shouldn't exist in a C++ library. > > Regards > > Antoine. > > > > Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit : > > Just to clarify. There are two options, which one do you refer to? A > design > > with a main thread that handles projections and the keys (relevant for > the > > projected columns); or the current code with any thread allowed to handle > > full file reading, inc the footer, column projections and their keys? Can > > you finalize this with Micah? > > The good news is, Tham is still interested to resume this work, and is ok > > with either option. Please let her know whether the current threading > model > > stays, or should be modified with the changes proposed in the doc (for > the > > latter, some guidance with the details would be needed). > > > > Cheers, Gidon > > > > > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> > wrote: > > > >> > >> > >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : > >>> From the doc, > >>> "To maintain consistency with the style of parquet-cpp, the above > >>> structures should not be explicitly synchronized with individual > mutexes. > >>> In the case of a parquet::arrow::FileReader, the request to read a > given > >>> selection of row groups and columns is issued from a single main > thread. > >>> Note that this does require that all keys required for a read are > >> assembled > >>> on the main thread so that DecryptionKeyRetriever objects are not > >> directly > >>> accessing any caches" > >>> > >>> The current PR code doesn't require a single main thread. Any thread > can > >>> read any file, both footer and pages. So the key cache is shared, to > save > >>> N-1 interactions with the KMS server. > >> > >> I don't think there's any contention on this. IMHO the only concerns > >> are about the implementation, not the semantics. > >> > >> Best regards > >> > >> Antoine. > >> > >> > >>> > >>> Cheers, Gidon > >>> > >>> > >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> > >> wrote: > >>> > >>>> > >>>> I'm not sure a threading model is expected for an encryption layer. > Am > >>>> I missing something? > >>>> > >>>> Regards > >>>> > >>>> Antoine. > >>>> > >>>> > >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > >>>>> Precisely, the main change is in the threading model. Afaik, the > >> document > >>>>> proposes a model that fits pandas, but might be problematic for other > >>>> users > >>>>> of this library. > >>>>> Technically, this is not showstopper though; if the community decides > >> on > >>>>> this model, it will be compatible with the high-level encryption > >> design; > >>>>> but the change implementation would need to be done by pandas experts > >>>> (not > >>>>> us; but we'll help where we can). > >>>>> Micah, you know this subject (and the community) better than we do - > >> we'd > >>>>> much appreciate it if you'd take a lead on removing this roadblock. > >>>>> > >>>>> Cheers, Gidon > >>>>> > >>>>> > >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield < > emkornfi...@gmail.com > >>> > >>>>> wrote: > >>>>> > >>>>>> I think some of the comments might be conflicting. One of the > >> concerns > >>>>>> (that I would need to refresh myself on to offer an opinion which > was > >>>>>> covered in Ben's doc) was the threading model we expect in the > >> library. > >>>>>> > >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> > >>>> wrote: > >>>>>> > >>>>>>> > >>>>>>> Hi Gidon, > >>>>>>> > >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > >>>>>>>> Regarding the high-level layer, I think it waits for a progress at > >>>>>>>> > >>>>>>> > >>>>>> > >>>> > >> > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > >>>>>>>> No activity there since last November. This is unfortunate, > because > >>>>>> Tham > >>>>>>>> has put a lot of work in coding the high-level layer (and > addressing > >>>>>> 200+ > >>>>>>>> review comments) in the PR > >> https://github.com/apache/arrow/pull/8023. > >>>>>>> The > >>>>>>>> code is functional, compatible with the Java version in > parquet-mr, > >>>> and > >>>>>>> can > >>>>>>>> be updated with the threading changes in the doc above. I hope all > >>>> this > >>>>>>>> good work will not be wasted. > >>>>>>> > >>>>>>> I'm sorry for the possibly frustrating process. Looking at the PR, > >>>>>>> though, it seems a bunch of comments were not addressed. Is it > >>>> possible > >>>>>>> to go through them and ensure they get an answer and/or a > resolution? > >>>>>>> > >>>>>>> Best regards > >>>>>>> > >>>>>>> Antoine. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> Cheers, Gidon > >>>>>>>> > >>>>>>>> > >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < > >>>> emkornfi...@gmail.com > >>>>>>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> My thoughts: > >>>>>>>>> 1. I've lost track of the higher level encryption implementation > >> in > >>>>>>> C++. > >>>>>>>>> I think we were trying to come to a consensus on the > >> threading/thread > >>>>>>>>> safety model? > >>>>>>>>> > >>>>>>>>> 2. I'm open to exposing the lower level encryption libraries in > >>>>>> python > >>>>>>>>> (without appropriate namespacing/communication). It seems at > least > >>>>>> for > >>>>>>>>> reading, there is potentially less harm (I'll caveat that with > I'm > >>>>>> not a > >>>>>>>>> security expert). Are both the low level read and write > >>>>>> implementations > >>>>>>>>> necessary? (it probably makes sense to have a few smaller PRs > for > >>>>>>> exposing > >>>>>>>>> this functionality anyways). > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > >>>>>>>>> ita...@pythonspeed.com> wrote: > >>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears > >>>>>> stalled > >>>>>>> ( > >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into > >>>>>> exposing > >>>>>>>>> the > >>>>>>>>>> low-level Parquet encryption API to Python. > >>>>>>>>>> > >>>>>>>>>> Arguments for doing this: the low-level API is all the users I'm > >>>>>>> talking > >>>>>>>>>> to need, at the moment, so it's plausible others would also find > >>>> some > >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet > >>>>>> encryption. > >>>>>>>>> Then > >>>>>>>>>> again, it might only be this one company and no one else cares. > >>>>>>>>>> > >>>>>>>>>> The arguments against, per Gidon Gershinsky: > >>>>>>>>>> > >>>>>>>>>>> * security: low-level encryption API is easy to misuse (eg > >> giving > >>>>>> the > >>>>>>>>>> same keys for a number of different files; this'd break the AES > >> GCM > >>>>>>>>>> cipher). The high-level encryption layer handles that by > applying > >>>>>>>>> envelope > >>>>>>>>>> encryption and other best practices in data security. Also, this > >>>>>> layer > >>>>>>> is > >>>>>>>>>> maintained by the community, meaning that future improvements > and > >>>>>>>>> security > >>>>>>>>>> fixes can be upstreamed by anyone, and available to all. > >>>>>>>>>>> * compatibility: parquet-mr implements the high-level > encryption > >>>>>>>>> layer. > >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be readable > >> by > >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow > >> users > >>>>>>> with > >>>>>>>>>> the high-level API. > >>>>>>>>>>> ... > >>>>>>>>>>> > >>>>>>>>>>> The current situation is not ideal, it'd be good to merge the > >>>>>>>>> high-level > >>>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++ > is a > >>>>>> kind > >>>>>>>>> of > >>>>>>>>>> a low-level language; Python would expose it to a less > experienced > >>>>>>>>> audience. > >>>>>>>>>> > >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > >>>>>>>>>> > >>>>>>>>>> I find the compatibility argument less compelling, that's > readily > >>>>>>>>>> addressed by documentation. I am not a crypto expert so I can't > >>>>>>> evaluate > >>>>>>>>>> how risky exposing the low-level encryption APIs would be, but I > >> can > >>>>>>> see > >>>>>>>>>> how that would be a significant concern. > >>>>>>>>>> > >>>>>>>>>> Some options are: > >>>>>>>>>> * Status quo, no Python API for low-level Parquet encryption. > >> This > >>>>>> has > >>>>>>>>>> two possible outcomes: > >>>>>>>>>> * Eventually high-level API gets merged, gets Python binding. > >>>>>>>>>> * High-level encryption API is never merged, Python users > never > >>>>>> get > >>>>>>>>>> access to encryption. > >>>>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps > using > >>>>>>>>> "hazmat" > >>>>>>>>>> idiom used by the Python cryptography package (API namespace > >>>>>> indicating > >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g. > >>>>>>>>>> > `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > >>>>>>>>>> * Gidon Gershinsky did not find this suggestion compelling > >> enough > >>>>>> to > >>>>>>>>>> override his security concerns. > >>>>>>>>>> * Low-level encryption done as third party Python package, > either > >>>>>>>>> private > >>>>>>>>>> or open source. This is annoying technically, plausibly would > >>>> require > >>>>>>>>>> maintaining a fork. > >>>>>>>>>> Any other ideas? Thoughts on these options? > >>>>>>>>>> > >>>>>>>>>> —Itamar > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > >