I don't think any notion of threading should be present in the
implementation, except for the required locks around shared structures.
 I don't know where the idea of a "main thread" comes from, but it
probably shouldn't exist in a C++ library.

Regards

Antoine.



Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
> Just to clarify. There are two options, which one do you refer to? A design
> with a main thread that handles projections and the keys (relevant for the
> projected columns); or the current code with any thread allowed to handle
> full file reading, inc the footer, column projections and their keys? Can
> you finalize this with Micah?
> The good news is, Tham is still interested to resume this work, and is ok
> with either option. Please let her know whether the current threading model
> stays, or should be modified with the changes proposed in the doc (for the
> latter, some guidance with the details would be needed).
> 
> Cheers, Gidon
> 
> 
> On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou <anto...@python.org> wrote:
> 
>>
>>
>> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
>>> From the doc,
>>> "To maintain consistency with the style of parquet-cpp, the above
>>> structures should not be explicitly synchronized with individual mutexes.
>>> In the case of a parquet::arrow::FileReader, the request to read a given
>>> selection of row groups and columns is issued from a single main thread.
>>> Note that this does require that all keys required for a read are
>> assembled
>>> on the main thread so that DecryptionKeyRetriever objects are not
>> directly
>>> accessing any caches"
>>>
>>> The current PR code doesn't require a single main thread. Any thread can
>>> read any file, both footer and pages. So the key cache is shared, to save
>>> N-1 interactions with the KMS server.
>>
>> I don't think there's any contention on this.  IMHO the only concerns
>> are about the implementation, not the semantics.
>>
>> Best regards
>>
>> Antoine.
>>
>>
>>>
>>> Cheers, Gidon
>>>
>>>
>>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org>
>> wrote:
>>>
>>>>
>>>> I'm not sure a threading model is expected for an encryption layer.  Am
>>>> I missing something?
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
>>>>> Precisely, the main change is in the threading model. Afaik, the
>> document
>>>>> proposes a model that fits pandas, but might be problematic for other
>>>> users
>>>>> of this library.
>>>>> Technically, this is not showstopper though; if the community decides
>> on
>>>>> this model, it will be compatible with the high-level encryption
>> design;
>>>>> but the change implementation would need to be done by pandas experts
>>>> (not
>>>>> us; but we'll help where we can).
>>>>> Micah, you know this subject (and the community) better than we do -
>> we'd
>>>>> much appreciate it if you'd take a lead on removing this roadblock.
>>>>>
>>>>> Cheers, Gidon
>>>>>
>>>>>
>>>>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com
>>>
>>>>> wrote:
>>>>>
>>>>>> I think some of the comments might be conflicting.  One of the
>> concerns
>>>>>> (that I would need to refresh myself on to offer an opinion which was
>>>>>> covered in Ben's doc) was the threading model we expect in the
>> library.
>>>>>>
>>>>>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org>
>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Gidon,
>>>>>>>
>>>>>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
>>>>>>>> Regarding the high-level layer, I think it waits for a progress at
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
>>>>>>>> No activity there since last November. This is unfortunate, because
>>>>>> Tham
>>>>>>>> has put a lot of work in coding the high-level layer (and addressing
>>>>>> 200+
>>>>>>>> review comments) in the PR
>> https://github.com/apache/arrow/pull/8023.
>>>>>>> The
>>>>>>>> code is functional, compatible with the Java version in parquet-mr,
>>>> and
>>>>>>> can
>>>>>>>> be updated with the threading changes in the doc above. I hope all
>>>> this
>>>>>>>> good work will not be wasted.
>>>>>>>
>>>>>>> I'm sorry for the possibly frustrating process.  Looking at the PR,
>>>>>>> though, it seems a bunch of comments were not addressed.  Is it
>>>> possible
>>>>>>> to go through them and ensure they get an answer and/or a resolution?
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers, Gidon
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
>>>> emkornfi...@gmail.com
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> My thoughts:
>>>>>>>>> 1.  I've lost track of the higher level encryption implementation
>> in
>>>>>>> C++.
>>>>>>>>> I think we were trying to come to a consensus on the
>> threading/thread
>>>>>>>>> safety model?
>>>>>>>>>
>>>>>>>>> 2.  I'm open to exposing the lower level encryption libraries in
>>>>>> python
>>>>>>>>> (without appropriate namespacing/communication).  It seems at least
>>>>>> for
>>>>>>>>> reading, there is potentially less harm (I'll caveat that with I'm
>>>>>> not a
>>>>>>>>> security expert).  Are both the low level read and write
>>>>>> implementations
>>>>>>>>> necessary?  (it probably makes sense to have a few smaller PRs for
>>>>>>> exposing
>>>>>>>>> this functionality anyways).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
>>>>>>>>> ita...@pythonspeed.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Since the PR for high-level C++ Parquet encryption API appears
>>>>>> stalled
>>>>>>> (
>>>>>>>>>> https://github.com/apache/arrow/pull/8023), I'm looking into
>>>>>> exposing
>>>>>>>>> the
>>>>>>>>>> low-level Parquet encryption API to Python.
>>>>>>>>>>
>>>>>>>>>> Arguments for doing this: the low-level API is all the users I'm
>>>>>>> talking
>>>>>>>>>> to need, at the moment, so it's plausible others would also find
>>>> some
>>>>>>>>>> benefit in having the Pyarrow API expose low-level Parquet
>>>>>> encryption.
>>>>>>>>> Then
>>>>>>>>>> again, it might only be this one company and no one else cares.
>>>>>>>>>>
>>>>>>>>>> The arguments against, per Gidon Gershinsky:
>>>>>>>>>>
>>>>>>>>>>>  * security: low-level encryption API is easy to misuse (eg
>> giving
>>>>>> the
>>>>>>>>>> same keys for a number of different files; this'd break the AES
>> GCM
>>>>>>>>>> cipher). The high-level encryption layer handles that by applying
>>>>>>>>> envelope
>>>>>>>>>> encryption and other best practices in data security. Also, this
>>>>>> layer
>>>>>>> is
>>>>>>>>>> maintained by the community, meaning that future improvements and
>>>>>>>>> security
>>>>>>>>>> fixes can be upstreamed by anyone, and available to all.
>>>>>>>>>>>  * compatibility: parquet-mr implements the high-level encryption
>>>>>>>>> layer.
>>>>>>>>>> If we want the files produced by Spark/Presto/etc to be readable
>> by
>>>>>>>>>> pandas/PyArrow (and vice versa), we need to provide the Arrow
>> users
>>>>>>> with
>>>>>>>>>> the high-level API.
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> The current situation is not ideal, it'd be good to merge the
>>>>>>>>> high-level
>>>>>>>>>> PR (and maybe hide the low level), but here we are; also, C++ is a
>>>>>> kind
>>>>>>>>> of
>>>>>>>>>> a low-level language; Python would expose it to a less experienced
>>>>>>>>> audience.
>>>>>>>>>>
>>>>>>>>>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>>>>>>>>>>
>>>>>>>>>> I find the compatibility argument less compelling, that's readily
>>>>>>>>>> addressed by documentation. I am not a crypto expert so I can't
>>>>>>> evaluate
>>>>>>>>>> how risky exposing the low-level encryption APIs would be, but I
>> can
>>>>>>> see
>>>>>>>>>> how that would be a significant concern.
>>>>>>>>>>
>>>>>>>>>> Some options are:
>>>>>>>>>>  * Status quo, no Python API for low-level Parquet encryption.
>> This
>>>>>> has
>>>>>>>>>> two possible outcomes:
>>>>>>>>>>    * Eventually high-level API gets merged, gets Python binding.
>>>>>>>>>>    * High-level encryption API is never merged, Python users never
>>>>>> get
>>>>>>>>>> access to encryption.
>>>>>>>>>>  * Add low-level Parquet encryption API to Pyarrow, perhaps using
>>>>>>>>> "hazmat"
>>>>>>>>>> idiom used by the Python cryptography package (API namespace
>>>>>> indicating
>>>>>>>>>> "use at your own risk, this is dangerous", basically, e.g.
>>>>>>>>>> `cryptography.hazmat.primitives.ciphers.aead.``ChaCha20Poly1305`).
>>>>>>>>>>    * Gidon Gershinsky did not find this suggestion compelling
>> enough
>>>>>> to
>>>>>>>>>> override his security concerns.
>>>>>>>>>>  * Low-level encryption done as third party Python package, either
>>>>>>>>> private
>>>>>>>>>> or open source. This is annoying technically, plausibly would
>>>> require
>>>>>>>>>> maintaining a fork.
>>>>>>>>>> Any other ideas? Thoughts on these options?
>>>>>>>>>>
>>>>>>>>>> —Itamar
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

Reply via email to