I don't think any notion of threading should be present in the implementation, except for the required locks around shared structures. I don't know where the idea of a "main thread" comes from, but it probably shouldn't exist in a C++ library.
Regards Antoine. Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit : > Just to clarify. There are two options, which one do you refer to? A design > with a main thread that handles projections and the keys (relevant for the > projected columns); or the current code with any thread allowed to handle > full file reading, inc the footer, column projections and their keys? Can > you finalize this with Micah? > The good news is, Tham is still interested to resume this work, and is ok > with either option. Please let her know whether the current threading model > stays, or should be modified with the changes proposed in the doc (for the > latter, some guidance with the details would be needed). > > Cheers, Gidon > > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> wrote: > >> >> >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit : >>> From the doc, >>> "To maintain consistency with the style of parquet-cpp, the above >>> structures should not be explicitly synchronized with individual mutexes. >>> In the case of a parquet::arrow::FileReader, the request to read a given >>> selection of row groups and columns is issued from a single main thread. >>> Note that this does require that all keys required for a read are >> assembled >>> on the main thread so that DecryptionKeyRetriever objects are not >> directly >>> accessing any caches" >>> >>> The current PR code doesn't require a single main thread. Any thread can >>> read any file, both footer and pages. So the key cache is shared, to save >>> N-1 interactions with the KMS server. >> >> I don't think there's any contention on this. IMHO the only concerns >> are about the implementation, not the semantics. >> >> Best regards >> >> Antoine. >> >> >>> >>> Cheers, Gidon >>> >>> >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> >> wrote: >>> >>>> >>>> I'm not sure a threading model is expected for an encryption layer. Am >>>> I missing something? >>>> >>>> Regards >>>> >>>> Antoine. >>>> >>>> >>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit : >>>>> Precisely, the main change is in the threading model. Afaik, the >> document >>>>> proposes a model that fits pandas, but might be problematic for other >>>> users >>>>> of this library. >>>>> Technically, this is not showstopper though; if the community decides >> on >>>>> this model, it will be compatible with the high-level encryption >> design; >>>>> but the change implementation would need to be done by pandas experts >>>> (not >>>>> us; but we'll help where we can). >>>>> Micah, you know this subject (and the community) better than we do - >> we'd >>>>> much appreciate it if you'd take a lead on removing this roadblock. >>>>> >>>>> Cheers, Gidon >>>>> >>>>> >>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com >>> >>>>> wrote: >>>>> >>>>>> I think some of the comments might be conflicting. One of the >> concerns >>>>>> (that I would need to refresh myself on to offer an opinion which was >>>>>> covered in Ben's doc) was the threading model we expect in the >> library. >>>>>> >>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> >>>> wrote: >>>>>> >>>>>>> >>>>>>> Hi Gidon, >>>>>>> >>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit : >>>>>>>> Regarding the high-level layer, I think it waits for a progress at >>>>>>>> >>>>>>> >>>>>> >>>> >> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing >>>>>>>> No activity there since last November. This is unfortunate, because >>>>>> Tham >>>>>>>> has put a lot of work in coding the high-level layer (and addressing >>>>>> 200+ >>>>>>>> review comments) in the PR >> https://github.com/apache/arrow/pull/8023. >>>>>>> The >>>>>>>> code is functional, compatible with the Java version in parquet-mr, >>>> and >>>>>>> can >>>>>>>> be updated with the threading changes in the doc above. I hope all >>>> this >>>>>>>> good work will not be wasted. >>>>>>> >>>>>>> I'm sorry for the possibly frustrating process. Looking at the PR, >>>>>>> though, it seems a bunch of comments were not addressed. Is it >>>> possible >>>>>>> to go through them and ensure they get an answer and/or a resolution? >>>>>>> >>>>>>> Best regards >>>>>>> >>>>>>> Antoine. >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> Cheers, Gidon >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield < >>>> emkornfi...@gmail.com >>>>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> My thoughts: >>>>>>>>> 1. I've lost track of the higher level encryption implementation >> in >>>>>>> C++. >>>>>>>>> I think we were trying to come to a consensus on the >> threading/thread >>>>>>>>> safety model? >>>>>>>>> >>>>>>>>> 2. I'm open to exposing the lower level encryption libraries in >>>>>> python >>>>>>>>> (without appropriate namespacing/communication). It seems at least >>>>>> for >>>>>>>>> reading, there is potentially less harm (I'll caveat that with I'm >>>>>> not a >>>>>>>>> security expert). Are both the low level read and write >>>>>> implementations >>>>>>>>> necessary? (it probably makes sense to have a few smaller PRs for >>>>>>> exposing >>>>>>>>> this functionality anyways). >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring < >>>>>>>>> ita...@pythonspeed.com> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears >>>>>> stalled >>>>>>> ( >>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into >>>>>> exposing >>>>>>>>> the >>>>>>>>>> low-level Parquet encryption API to Python. >>>>>>>>>> >>>>>>>>>> Arguments for doing this: the low-level API is all the users I'm >>>>>>> talking >>>>>>>>>> to need, at the moment, so it's plausible others would also find >>>> some >>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet >>>>>> encryption. >>>>>>>>> Then >>>>>>>>>> again, it might only be this one company and no one else cares. >>>>>>>>>> >>>>>>>>>> The arguments against, per Gidon Gershinsky: >>>>>>>>>> >>>>>>>>>>> * security: low-level encryption API is easy to misuse (eg >> giving >>>>>> the >>>>>>>>>> same keys for a number of different files; this'd break the AES >> GCM >>>>>>>>>> cipher). The high-level encryption layer handles that by applying >>>>>>>>> envelope >>>>>>>>>> encryption and other best practices in data security. Also, this >>>>>> layer >>>>>>> is >>>>>>>>>> maintained by the community, meaning that future improvements and >>>>>>>>> security >>>>>>>>>> fixes can be upstreamed by anyone, and available to all. >>>>>>>>>>> * compatibility: parquet-mr implements the high-level encryption >>>>>>>>> layer. >>>>>>>>>> If we want the files produced by Spark/Presto/etc to be readable >> by >>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow >> users >>>>>>> with >>>>>>>>>> the high-level API. >>>>>>>>>>> ... >>>>>>>>>>> >>>>>>>>>>> The current situation is not ideal, it'd be good to merge the >>>>>>>>> high-level >>>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a >>>>>> kind >>>>>>>>> of >>>>>>>>>> a low-level language; Python would expose it to a less experienced >>>>>>>>> audience. >>>>>>>>>> >>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040) >>>>>>>>>> >>>>>>>>>> I find the compatibility argument less compelling, that's readily >>>>>>>>>> addressed by documentation. I am not a crypto expert so I can't >>>>>>> evaluate >>>>>>>>>> how risky exposing the low-level encryption APIs would be, but I >> can >>>>>>> see >>>>>>>>>> how that would be a significant concern. >>>>>>>>>> >>>>>>>>>> Some options are: >>>>>>>>>> * Status quo, no Python API for low-level Parquet encryption. >> This >>>>>> has >>>>>>>>>> two possible outcomes: >>>>>>>>>> * Eventually high-level API gets merged, gets Python binding. >>>>>>>>>> * High-level encryption API is never merged, Python users never >>>>>> get >>>>>>>>>> access to encryption. >>>>>>>>>> * Add low-level Parquet encryption API to Pyarrow, perhaps using >>>>>>>>> "hazmat" >>>>>>>>>> idiom used by the Python cryptography package (API namespace >>>>>> indicating >>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g. >>>>>>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`). >>>>>>>>>> * Gidon Gershinsky did not find this suggestion compelling >> enough >>>>>> to >>>>>>>>>> override his security concerns. >>>>>>>>>> * Low-level encryption done as third party Python package, either >>>>>>>>> private >>>>>>>>>> or open source. This is annoying technically, plausibly would >>>> require >>>>>>>>>> maintaining a fork. >>>>>>>>>> Any other ideas? Thoughts on these options? >>>>>>>>>> >>>>>>>>>> —Itamar >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >