Regarding the high-level layer, I think it is waiting on progress in the doc at https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing. There has been no activity there since last November. This is unfortunate, because Tham has put a lot of work into coding the high-level layer (and addressing 200+ review comments) in the PR https://github.com/apache/arrow/pull/8023. The code is functional, compatible with the Java version in parquet-mr, and can be updated with the threading changes in the doc above. I hope all this good work will not be wasted.

Cheers,
Gidon

On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> My thoughts:
> 1. I've lost track of the higher level encryption implementation in C++.
> I think we were trying to come to a consensus on the threading/thread
> safety model?
>
> 2. I'm open to exposing the lower level encryption libraries in python
> (without appropriate namespacing/communication). It seems at least for
> reading, there is potentially less harm (I'll caveat that with I'm not a
> security expert). Are both the low level read and write implementations
> necessary? (it probably makes sense to have a few smaller PRs for exposing
> this functionality anyways).
>
> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> ita...@pythonspeed.com> wrote:
>
> > Hi,
> >
> > Since the PR for high-level C++ Parquet encryption API appears stalled
> > (https://github.com/apache/arrow/pull/8023), I'm looking into exposing
> > the low-level Parquet encryption API to Python.
> >
> > Arguments for doing this: the low-level API is all the users I'm talking
> > to need, at the moment, so it's plausible others would also find some
> > benefit in having the Pyarrow API expose low-level Parquet encryption.
> > Then again, it might only be this one company and no one else cares.
> >
> > The arguments against, per Gidon Gershinsky:
> >
> > > * security: low-level encryption API is easy to misuse (eg giving the
> > > same keys for a number of different files; this'd break the AES GCM
> > > cipher). The high-level encryption layer handles that by applying
> > > envelope encryption and other best practices in data security. Also,
> > > this layer is maintained by the community, meaning that future
> > > improvements and security fixes can be upstreamed by anyone, and
> > > available to all.
> > > * compatibility: parquet-mr implements the high-level encryption layer.
> > > If we want the files produced by Spark/Presto/etc to be readable by
> > > pandas/PyArrow (and vice versa), we need to provide the Arrow users
> > > with the high-level API.
> > > ...
> > >
> > > The current situation is not ideal, it'd be good to merge the
> > > high-level PR (and maybe hide the low level), but here we are; also,
> > > C++ is a kind of a low-level language; Python would expose it to a
> > > less experienced audience.
> >
> > (Source: https://issues.apache.org/jira/browse/ARROW-8040)
> >
> > I find the compatibility argument less compelling, that's readily
> > addressed by documentation. I am not a crypto expert so I can't evaluate
> > how risky exposing the low-level encryption APIs would be, but I can see
> > how that would be a significant concern.
> >
> > Some options are:
> > * Status quo, no Python API for low-level Parquet encryption. This has
> >   two possible outcomes:
> >   * Eventually high-level API gets merged, gets Python binding.
> >   * High-level encryption API is never merged, Python users never get
> >     access to encryption.
> > * Add low-level Parquet encryption API to Pyarrow, perhaps using "hazmat"
> >   idiom used by the Python cryptography package (API namespace indicating
> >   "use at your own risk, this is dangerous", basically, e.g.
> >   `cryptography.hazmat.primitives.ciphers.aead.ChaCha20Poly1305`).
> >   * Gidon Gershinsky did not find this suggestion compelling enough to
> >     override his security concerns.
> > * Low-level encryption done as third party Python package, either
> >   private or open source. This is annoying technically, plausibly would
> >   require maintaining a fork.
> >
> > Any other ideas? Thoughts on these options?
> >
> > --Itamar