>From the doc, "To maintain consistency with the style of parquet-cpp, the above structures should not be explicitly synchronized with individual mutexes. In the case of a parquet::arrow::FileReader, the request to read a given selection of row groups and columns is issued from a single main thread. Note that this does require that all keys required for a read are assembled on the main thread so that DecryptionKeyRetriever objects are not directly accessing any caches"
The current PR code doesn't require a single main thread. Any thread can read any file, both footer and pages. So the key cache is shared, to save N-1 interactions with the KMS server. Cheers, Gidon On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> wrote: > > I'm not sure a threading model is expected for an encryption layer. Am > I missing something? > > Regards > > Antoine. > > > Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > > Precisely, the main change is in the threading model. Afaik, the document > > proposes a model that fits pandas, but might be problematic for other > users > > of this library. > > Technically, this is not showstopper though; if the community decides on > > this model, it will be compatible with the high-level encryption design; > > but the change implementation would need to be done by pandas experts > (not > > us; but we'll help where we can). > > Micah, you know this subject (and the community) better than we do - we'd > > much appreciate it if you'd take a lead on removing this roadblock. > > > > Cheers, Gidon > > > > > > On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > >> I think some of the comments might be conflicting. One of the concerns > >> (that I would need to refresh myself on to offer an opinion which was > >> covered in Ben's doc) was the threading model we expect in the library. > >> > >> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> > wrote: > >> > >>> > >>> Hi Gidon, > >>> > >>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > >>>> Regarding the high-level layer, I think it waits for a progress at > >>>> > >>> > >> > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > >>>> No activity there since last November. This is unfortunate, because > >> Tham > >>>> has put a lot of work in coding the high-level layer (and addressing > >> 200+ > >>>> review comments) in the PR https://github.com/apache/arrow/pull/8023. > >>> The > >>>> code is functional, compatible with the Java version in parquet-mr, > and > >>> can > >>>> be updated with the threading changes in the doc above. I hope all > this > >>>> good work will not be wasted. > >>> > >>> I'm sorry for the possibly frustrating process. Looking at the PR, > >>> though, it seems a bunch of comments were not addressed. Is it > possible > >>> to go through them and ensure they get an answer and/or a resolution? > >>> > >>> Best regards > >>> > >>> Antoine. > >>> > >>> > >>> > >>>> > >>>> Cheers, Gidon > >>>> > >>>> > >>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < > emkornfi...@gmail.com > >>> > >>>> wrote: > >>>> > >>>>> My thoughts: > >>>>> 1. I've lost track of the higher level encryption implementation in > >>> C++. > >>>>> I think we were trying to come to a consensus on the threading/thread > >>>>> safety model? > >>>>> > >>>>> 2. I'm open to exposing the lower level encryption libraries in > >> python > >>>>> (without appropriate namespacing/communication). It seems at least > >> for > >>>>> reading, there is potentially less harm (I'll caveat that with I'm > >> not a > >>>>> security expert). Are both the low level read and write > >> implementations > >>>>> necessary? (it probably makes sense to have a few smaller PRs for > >>> exposing > >>>>> this functionality anyways). > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > >>>>> ita...@pythonspeed.com> wrote: > >>>>> > >>>>>> Hi, > >>>>>> > >>>>>> Since the PR for high-level C++ Parquet encryption API appears > >> stalled > >>> ( > >>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into > >> exposing > >>>>> the > >>>>>> low-level Parquet encryption API to Python. > >>>>>> > >>>>>> Arguments for doing this: the low-level API is all the users I'm > >>> talking > >>>>>> to need, at the moment, so it's plausible others would also find > some > >>>>>> benefit in having the Pyarrow API expose low-level Parquet > >> encryption. > >>>>> Then > >>>>>> again, it might only be this one company and no one else cares. > >>>>>> > >>>>>> The arguments against, per Gidon Gershinsky: > >>>>>> > >>>>>>> * security: low-level encryption API is easy to misuse (eg giving > >> the > >>>>>> same keys for a number of different files; this'd break the AES GCM > >>>>>> cipher). The high-level encryption layer handles that by applying > >>>>> envelope > >>>>>> encryption and other best practices in data security. Also, this > >> layer > >>> is > >>>>>> maintained by the community, meaning that future improvements and > >>>>> security > >>>>>> fixes can be upstreamed by anyone, and available to all. > >>>>>>> * compatibility: parquet-mr implements the high-level encryption > >>>>> layer. > >>>>>> If we want the files produced by Spark/Presto/etc to be readable by > >>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow users > >>> with > >>>>>> the high-level API. > >>>>>>> ... > >>>>>>> > >>>>>>> The current situation is not ideal, it'd be good to merge the > >>>>> high-level > >>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a > >> kind > >>>>> of > >>>>>> a low-level language; Python would expose it to a less experienced > >>>>> audience. > >>>>>> > >>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > >>>>>> > >>>>>> I find the compatibility argument less compelling, that's readily > >>>>>> addressed by documentation. I am not a crypto expert so I can't > >>> evaluate > >>>>>> how risky exposing the low-level encryption APIs would be, but I can > >>> see > >>>>>> how that would be a significant concern. > >>>>>> > >>>>>> Some options are: > >>>>>> * Status quo, no Python API for low-level Parquet encryption. This > >> has > >>>>>> two possible outcomes: > >>>>>> * Eventually high-level API gets merged, gets Python binding. > >>>>>> * High-level encryption API is never merged, Python users never > >> get > >>>>>> access to encryption. > >>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps using > >>>>> "hazmat" > >>>>>> idiom used by the Python cryptography package (API namespace > >> indicating > >>>>>> "use at your own risk, this is dangerous", basically, e.g. > >>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > >>>>>> * Gidon Gershinsky did not find this suggestion compelling enough > >> to > >>>>>> override his security concerns. > >>>>>> * Low-level encryption done as third party Python package, either > >>>>> private > >>>>>> or open source. This is annoying technically, plausibly would > require > >>>>>> maintaining a fork. > >>>>>> Any other ideas? Thoughts on these options? > >>>>>> > >>>>>> —Itamar > >>>>> > >>>> > >>> > >> > > >