Just to clarify. There are two options, which one do you refer to? A design with a main thread that handles projections and the keys (relevant for the projected columns); or the current code with any thread allowed to handle full file reading, inc the footer, column projections and their keys? Can you finalize this with Micah? The good news is, Tham is still interested to resume this work, and is ok with either option. Please let her know whether the current threading model stays, or should be modified with the changes proposed in the doc (for the latter, some guidance with the details would be needed).
Cheers, Gidon On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> wrote: > > > Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : > > From the doc, > > "To maintain consistency with the style of parquet-cpp, the above > > structures should not be explicitly synchronized with individual mutexes. > > In the case of a parquet::arrow::FileReader, the request to read a given > > selection of row groups and columns is issued from a single main thread. > > Note that this does require that all keys required for a read are > assembled > > on the main thread so that DecryptionKeyRetriever objects are not > directly > > accessing any caches" > > > > The current PR code doesn't require a single main thread. Any thread can > > read any file, both footer and pages. So the key cache is shared, to save > > N-1 interactions with the KMS server. > > I don't think there's any contention on this. IMHO the only concerns > are about the implementation, not the semantics. > > Best regards > > Antoine. > > > > > > Cheers, Gidon > > > > > > On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> > wrote: > > > >> > >> I'm not sure a threading model is expected for an encryption layer. Am > >> I missing something? > >> > >> Regards > >> > >> Antoine. > >> > >> > >> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > >>> Precisely, the main change is in the threading model. Afaik, the > document > >>> proposes a model that fits pandas, but might be problematic for other > >> users > >>> of this library. > >>> Technically, this is not showstopper though; if the community decides > on > >>> this model, it will be compatible with the high-level encryption > design; > >>> but the change implementation would need to be done by pandas experts > >> (not > >>> us; but we'll help where we can). > >>> Micah, you know this subject (and the community) better than we do - > we'd > >>> much appreciate it if you'd take a lead on removing this roadblock. > >>> > >>> Cheers, Gidon > >>> > >>> > >>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com > > > >>> wrote: > >>> > >>>> I think some of the comments might be conflicting. One of the > concerns > >>>> (that I would need to refresh myself on to offer an opinion which was > >>>> covered in Ben's doc) was the threading model we expect in the > library. > >>>> > >>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> > >> wrote: > >>>> > >>>>> > >>>>> Hi Gidon, > >>>>> > >>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > >>>>>> Regarding the high-level layer, I think it waits for a progress at > >>>>>> > >>>>> > >>>> > >> > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > >>>>>> No activity there since last November. This is unfortunate, because > >>>> Tham > >>>>>> has put a lot of work in coding the high-level layer (and addressing > >>>> 200+ > >>>>>> review comments) in the PR > https://github.com/apache/arrow/pull/8023. > >>>>> The > >>>>>> code is functional, compatible with the Java version in parquet-mr, > >> and > >>>>> can > >>>>>> be updated with the threading changes in the doc above. I hope all > >> this > >>>>>> good work will not be wasted. > >>>>> > >>>>> I'm sorry for the possibly frustrating process. Looking at the PR, > >>>>> though, it seems a bunch of comments were not addressed. Is it > >> possible > >>>>> to go through them and ensure they get an answer and/or a resolution? > >>>>> > >>>>> Best regards > >>>>> > >>>>> Antoine. > >>>>> > >>>>> > >>>>> > >>>>>> > >>>>>> Cheers, Gidon > >>>>>> > >>>>>> > >>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < > >> emkornfi...@gmail.com > >>>>> > >>>>>> wrote: > >>>>>> > >>>>>>> My thoughts: > >>>>>>> 1. I've lost track of the higher level encryption implementation > in > >>>>> C++. > >>>>>>> I think we were trying to come to a consensus on the > threading/thread > >>>>>>> safety model? > >>>>>>> > >>>>>>> 2. I'm open to exposing the lower level encryption libraries in > >>>> python > >>>>>>> (without appropriate namespacing/communication). It seems at least > >>>> for > >>>>>>> reading, there is potentially less harm (I'll caveat that with I'm > >>>> not a > >>>>>>> security expert). Are both the low level read and write > >>>> implementations > >>>>>>> necessary? (it probably makes sense to have a few smaller PRs for > >>>>> exposing > >>>>>>> this functionality anyways). > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > >>>>>>> ita...@pythonspeed.com> wrote: > >>>>>>> > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> Since the PR for high-level C++ Parquet encryption API appears > >>>> stalled > >>>>> ( > >>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into > >>>> exposing > >>>>>>> the > >>>>>>>> low-level Parquet encryption API to Python. > >>>>>>>> > >>>>>>>> Arguments for doing this: the low-level API is all the users I'm > >>>>> talking > >>>>>>>> to need, at the moment, so it's plausible others would also find > >> some > >>>>>>>> benefit in having the Pyarrow API expose low-level Parquet > >>>> encryption. > >>>>>>> Then > >>>>>>>> again, it might only be this one company and no one else cares. > >>>>>>>> > >>>>>>>> The arguments against, per Gidon Gershinsky: > >>>>>>>> > >>>>>>>>> * security: low-level encryption API is easy to misuse (eg > giving > >>>> the > >>>>>>>> same keys for a number of different files; this'd break the AES > GCM > >>>>>>>> cipher). The high-level encryption layer handles that by applying > >>>>>>> envelope > >>>>>>>> encryption and other best practices in data security. Also, this > >>>> layer > >>>>> is > >>>>>>>> maintained by the community, meaning that future improvements and > >>>>>>> security > >>>>>>>> fixes can be upstreamed by anyone, and available to all. > >>>>>>>>> * compatibility: parquet-mr implements the high-level encryption > >>>>>>> layer. > >>>>>>>> If we want the files produced by Spark/Presto/etc to be readable > by > >>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow > users > >>>>> with > >>>>>>>> the high-level API. > >>>>>>>>> ... > >>>>>>>>> > >>>>>>>>> The current situation is not ideal, it'd be good to merge the > >>>>>>> high-level > >>>>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a > >>>> kind > >>>>>>> of > >>>>>>>> a low-level language; Python would expose it to a less experienced > >>>>>>> audience. > >>>>>>>> > >>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > >>>>>>>> > >>>>>>>> I find the compatibility argument less compelling, that's readily > >>>>>>>> addressed by documentation. I am not a crypto expert so I can't > >>>>> evaluate > >>>>>>>> how risky exposing the low-level encryption APIs would be, but I > can > >>>>> see > >>>>>>>> how that would be a significant concern. > >>>>>>>> > >>>>>>>> Some options are: > >>>>>>>> * Status quo, no Python API for low-level Parquet encryption. > This > >>>> has > >>>>>>>> two possible outcomes: > >>>>>>>> * Eventually high-level API gets merged, gets Python binding. > >>>>>>>> * High-level encryption API is never merged, Python users never > >>>> get > >>>>>>>> access to encryption. > >>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps using > >>>>>>> "hazmat" > >>>>>>>> idiom used by the Python cryptography package (API namespace > >>>> indicating > >>>>>>>> "use at your own risk, this is dangerous", basically, e.g. > >>>>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > >>>>>>>> * Gidon Gershinsky did not find this suggestion compelling > enough > >>>> to > >>>>>>>> override his security concerns. > >>>>>>>> * Low-level encryption done as third party Python package, either > >>>>>>> private > >>>>>>>> or open source. This is annoying technically, plausibly would > >> require > >>>>>>>> maintaining a fork. > >>>>>>>> Any other ideas? Thoughts on these options? > >>>>>>>> > >>>>>>>> —Itamar > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > >