Precisely, the main change is in the threading model. Afaik, the document proposes a model that fits pandas, but might be problematic for other users of this library. Technically, this is not showstopper though; if the community decides on this model, it will be compatible with the high-level encryption design; but the change implementation would need to be done by pandas experts (not us; but we'll help where we can). Micah, you know this subject (and the community) better than we do - we'd much appreciate it if you'd take a lead on removing this roadblock.
Cheers, Gidon On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > I think some of the comments might be conflicting. One of the concerns > (that I would need to refresh myself on to offer an opinion which was > covered in Ben's doc) was the threading model we expect in the library. > > On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> wrote: > > > > > Hi Gidon, > > > > Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : > > > Regarding the high-level layer, I think it waits for a progress at > > > > > > https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing > > > No activity there since last November. This is unfortunate, because > Tham > > > has put a lot of work in coding the high-level layer (and addressing > 200+ > > > review comments) in the PR https://github.com/apache/arrow/pull/8023. > > The > > > code is functional, compatible with the Java version in parquet-mr, and > > can > > > be updated with the threading changes in the doc above. I hope all this > > > good work will not be wasted. > > > > I'm sorry for the possibly frustrating process. Looking at the PR, > > though, it seems a bunch of comments were not addressed. Is it possible > > to go through them and ensure they get an answer and/or a resolution? > > > > Best regards > > > > Antoine. > > > > > > > > > > > > Cheers, Gidon > > > > > > > > > On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com > > > > > wrote: > > > > > >> My thoughts: > > >> 1. I've lost track of the higher level encryption implementation in > > C++. > > >> I think we were trying to come to a consensus on the threading/thread > > >> safety model? > > >> > > >> 2. I'm open to exposing the lower level encryption libraries in > python > > >> (without appropriate namespacing/communication). It seems at least > for > > >> reading, there is potentially less harm (I'll caveat that with I'm > not a > > >> security expert). Are both the low level read and write > implementations > > >> necessary? (it probably makes sense to have a few smaller PRs for > > exposing > > >> this functionality anyways). > > >> > > >> > > >> > > >> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < > > >> ita...@pythonspeed.com> wrote: > > >> > > >>> Hi, > > >>> > > >>> Since the PR for high-level C++ Parquet encryption API appears > stalled > > ( > > >>> https://github.com/apache/arrow/pull/8023), I'm looking into > exposing > > >> the > > >>> low-level Parquet encryption API to Python. > > >>> > > >>> Arguments for doing this: the low-level API is all the users I'm > > talking > > >>> to need, at the moment, so it's plausible others would also find some > > >>> benefit in having the Pyarrow API expose low-level Parquet > encryption. > > >> Then > > >>> again, it might only be this one company and no one else cares. > > >>> > > >>> The arguments against, per Gidon Gershinsky: > > >>> > > >>>> * security: low-level encryption API is easy to misuse (eg giving > the > > >>> same keys for a number of different files; this'd break the AES GCM > > >>> cipher). The high-level encryption layer handles that by applying > > >> envelope > > >>> encryption and other best practices in data security. Also, this > layer > > is > > >>> maintained by the community, meaning that future improvements and > > >> security > > >>> fixes can be upstreamed by anyone, and available to all. > > >>>> * compatibility: parquet-mr implements the high-level encryption > > >> layer. > > >>> If we want the files produced by Spark/Presto/etc to be readable by > > >>> pandas/PyArrow (and vice versa), we need to provide the Arrow users > > with > > >>> the high-level API. > > >>>> ... > > >>>> > > >>>> The current situation is not ideal, it'd be good to merge the > > >> high-level > > >>> PR (and maybe hide the low level), but here we are; also, C++ is a > kind > > >> of > > >>> a low-level language; Python would expose it to a less experienced > > >> audience. > > >>> > > >>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) > > >>> > > >>> I find the compatibility argument less compelling, that's readily > > >>> addressed by documentation. I am not a crypto expert so I can't > > evaluate > > >>> how risky exposing the low-level encryption APIs would be, but I can > > see > > >>> how that would be a significant concern. > > >>> > > >>> Some options are: > > >>> * Status quo, no Python API for low-level Parquet encryption. This > has > > >>> two possible outcomes: > > >>> * Eventually high-level API gets merged, gets Python binding. > > >>> * High-level encryption API is never merged, Python users never > get > > >>> access to encryption. > > >>> * Add low-level Parquet encryption API to Pyarrow, perhaps using > > >> "hazmat" > > >>> idiom used by the Python cryptography package (API namespace > indicating > > >>> "use at your own risk, this is dangerous", basically, e.g. > > >>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). > > >>> * Gidon Gershinsky did not find this suggestion compelling enough > to > > >>> override his security concerns. > > >>> * Low-level encryption done as third party Python package, either > > >> private > > >>> or open source. This is annoying technically, plausibly would require > > >>> maintaining a fork. > > >>> Any other ideas? Thoughts on these options? > > >>> > > >>> —Itamar > > >> > > > > > >