I'm not sure a threading model is expected for an encryption layer. Am I missing something?
Regards Antoine. Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : > Precisely, the main change is in the threading model. Afaik, the document > proposes a model that fits pandas, but might be problematic for other users > of this library. > Technically, this is not showstopper though; if the community decides on > this model, it will be compatible with the high-level encryption design; > but the change implementation would need to be done by pandas experts (not > us; but we'll help where we can). > Micah, you know this subject (and the community) better than we do - we'd > much appreciate it if you'd take a lead on removing this roadblock. > > Cheers, Gidon > > > On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> I think some of the comments might be conflicting. One of the concerns >> (that I would need to refresh myself on to offer an opinion which was >> covered in Ben's doc) was the threading model we expect in the library. >> >> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> wrote: >> >>> >>> Hi Gidon, >>> >>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : >>>> Regarding the high-level layer, I think it waits for a progress at >>>> >>> >> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing >>>> No activity there since last November. This is unfortunate, because >> Tham >>>> has put a lot of work in coding the high-level layer (and addressing >> 200+ >>>> review comments) in the PR https://github.com/apache/arrow/pull/8023. >>> The >>>> code is functional, compatible with the Java version in parquet-mr, and >>> can >>>> be updated with the threading changes in the doc above. I hope all this >>>> good work will not be wasted. >>> >>> I'm sorry for the possibly frustrating process. Looking at the PR, >>> though, it seems a bunch of comments were not addressed. Is it possible >>> to go through them and ensure they get an answer and/or a resolution? >>> >>> Best regards >>> >>> Antoine. >>> >>> >>> >>>> >>>> Cheers, Gidon >>>> >>>> >>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com >>> >>>> wrote: >>>> >>>>> My thoughts: >>>>> 1. I've lost track of the higher level encryption implementation in >>> C++. >>>>> I think we were trying to come to a consensus on the threading/thread >>>>> safety model? >>>>> >>>>> 2. I'm open to exposing the lower level encryption libraries in >> python >>>>> (without appropriate namespacing/communication). It seems at least >> for >>>>> reading, there is potentially less harm (I'll caveat that with I'm >> not a >>>>> security expert). Are both the low level read and write >> implementations >>>>> necessary? (it probably makes sense to have a few smaller PRs for >>> exposing >>>>> this functionality anyways). >>>>> >>>>> >>>>> >>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < >>>>> ita...@pythonspeed.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Since the PR for high-level C++ Parquet encryption API appears >> stalled >>> ( >>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into >> exposing >>>>> the >>>>>> low-level Parquet encryption API to Python. >>>>>> >>>>>> Arguments for doing this: the low-level API is all the users I'm >>> talking >>>>>> to need, at the moment, so it's plausible others would also find some >>>>>> benefit in having the Pyarrow API expose low-level Parquet >> encryption. >>>>> Then >>>>>> again, it might only be this one company and no one else cares. >>>>>> >>>>>> The arguments against, per Gidon Gershinsky: >>>>>> >>>>>>> * security: low-level encryption API is easy to misuse (eg giving >> the >>>>>> same keys for a number of different files; this'd break the AES GCM >>>>>> cipher). The high-level encryption layer handles that by applying >>>>> envelope >>>>>> encryption and other best practices in data security. Also, this >> layer >>> is >>>>>> maintained by the community, meaning that future improvements and >>>>> security >>>>>> fixes can be upstreamed by anyone, and available to all. >>>>>>> * compatibility: parquet-mr implements the high-level encryption >>>>> layer. >>>>>> If we want the files produced by Spark/Presto/etc to be readable by >>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow users >>> with >>>>>> the high-level API. >>>>>>> ... >>>>>>> >>>>>>> The current situation is not ideal, it'd be good to merge the >>>>> high-level >>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a >> kind >>>>> of >>>>>> a low-level language; Python would expose it to a less experienced >>>>> audience. >>>>>> >>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) >>>>>> >>>>>> I find the compatibility argument less compelling, that's readily >>>>>> addressed by documentation. I am not a crypto expert so I can't >>> evaluate >>>>>> how risky exposing the low-level encryption APIs would be, but I can >>> see >>>>>> how that would be a significant concern. >>>>>> >>>>>> Some options are: >>>>>> * Status quo, no Python API for low-level Parquet encryption. This >> has >>>>>> two possible outcomes: >>>>>> * Eventually high-level API gets merged, gets Python binding. >>>>>> * High-level encryption API is never merged, Python users never >> get >>>>>> access to encryption. >>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps using >>>>> "hazmat" >>>>>> idiom used by the Python cryptography package (API namespace >> indicating >>>>>> "use at your own risk, this is dangerous", basically, e.g. >>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). >>>>>> * Gidon Gershinsky did not find this suggestion compelling enough >> to >>>>>> override his security concerns. >>>>>> * Low-level encryption done as third party Python package, either >>>>> private >>>>>> or open source. This is annoying technically, plausibly would require >>>>>> maintaining a fork. >>>>>> Any other ideas? Thoughts on these options? >>>>>> >>>>>> —Itamar >>>>> >>>> >>> >> >