Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-16 Thread Gidon Gershinsky
Thanks Ben, I left a few comments there.

Cheers, Gidon


On Mon, Nov 16, 2020 at 2:58 AM Benjamin Kietzman 
wrote:

> @Gidon
>
> Copied the gist to a google doc for commenting:
>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit#
>
> @Micah
>
> > it would be preferable to put them in an internal namespace to ensure
> adequate unit testing is in place.
>
> Agreed; my intent was only to indicate that implementation details should
> be private. We can even have
> *_internal.h header files for those to make them available for unit
> testing as we do with some other
> classes. I've added a note of this caveat to the doc.
>
> > It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
> still be implemented
> > with internal synchronization to guarantee thread-safety?
>
> If these are exclusively accessed from the main thread (where they are
> used to assemble
> FileDecryptionProperties) then there is still no need to synchronize them.
>
> @Tham
>
> > As I see, the returned FileDecryptionProperties object from 
> > PropertiesDrivenCryptoFactory
> is not mutable,
> > since it has a DecryptionKeyRetriever object.
>
> If the DecryptionKeyRetriever accesses the caches directly, then it will
> indeed need to be synchronized.
> Instead, we can ensure that cache access happens on the main thread before
> worker tasks are launched.
> After we assemble this ahead-of-time set of keys it will not change during
> the course of a read, so the
> DecryptionKeyRetriever can safely access it without mutexes. I've added a
> comment to the doc
>
> On Fri, Nov 13, 2020 at 3:09 AM Gidon Gershinsky  wrote:
>
>> Hi all,
>>
>> Glad to see the parquet-cpp progress on this! Can I suggest creating a
>> googledoc for the technical discussion? The current md doc format seems to
>> be harder for pinpointed comments. I got a few, but they are too minor for
>> sending to the two mailing lists.
>>
>> Cheers, Gidon
>>
>>
>> On Fri, Nov 13, 2020 at 9:17 AM Tham Ha  wrote:
>>
>>> Hi Ben,
>>>
>>> Can you reconsider this point:
>>> > Properties (including FileDecryptionProperties) are immutable after
>>> construction and thus can be safely shared between threads with no further
>>> synchronization.
>>>
>>> As I see, the returned FileDecryptionProperties object from 
>>> PropertiesDrivenCryptoFactory
>>> is not mutable, since it has a DecryptionKeyRetriever object (i.e. 
>>> FileKeyUnwrapper
>>> object in this pull) that helps get the key from key metadata.
>>> From the current implementation of FileKeyUnwrapper with the cache, I
>>> think synchronization is necessary, unless you have another idea.
>>>
>>> Cheers, Tham
>>>
>>> On Fri, Nov 13, 2020 at 1:28 PM Micah Kornfield 
>>> wrote:
>>>
 I skimmed through and this seems like a clean design (I would have to
 reread the PR to do a comparison.  A few thoughts of the top of my head:


 > - Multiple internal classes are left public in header files, where it
 > would be
 >   preferred that public classes be kept to a minimum.

 I think some of these classes are non-trivial.  If that  is the case it
 would be preferable to put them in an internal namespace to ensure
 adequate unit testing is in place.

 It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
 still be implemented with internal synchronization to guarantee
 thread-safety?



 On Wed, Nov 11, 2020 at 6:19 PM Benjamin Kietzman 
 wrote:

 > In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
 > https://github.com/apache/arrow/pull/8023 which port the parquet-mr
 > design to c++: there's some question whether that design is consistent
 > with the style and conventions of the c++ implementation of parquet.
 >
 > Here is a Gist with a sketched alternative design adapted to c++
 > https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43
 >
 > Specific concerns in brief:
 > - Multiple internal classes are left public in header files, where it
 > would be
 >   preferred that public classes be kept to a minimum.
 > - Several levels of explicit synchronization with mutexes are used to
 > guarantee
 >   that each class is completely thread safe, but this is unnecessary
 > overhead
 >   given the pattern of use of `parquet-cpp`.
 >

>>>
>>>
>>> --
>>> Tham Ha Thi
>>> C++ Developer
>>> *EMOTIV* www.emotiv.com
>>> Neurotech for the Global Community
>>>
>>> LEGAL NOTICE
>>>
>>> This message (including all attachments) contains confidential
>>> information intended for a specific individual and purpose, and is
>>> protected by law.  Any confidentiality or privilege is not waived or lost
>>> because this email has been sent to you by mistake. If you have received it
>>> in error, please let us know by reply email, delete it from your system and
>>> destroy any copies.
>>>
>>> This email is also subject to 

Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-15 Thread Benjamin Kietzman
@Gidon

Copied the gist to a google doc for commenting:
https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit#

@Micah

> it would be preferable to put them in an internal namespace to ensure
adequate unit testing is in place.

Agreed; my intent was only to indicate that implementation details should
be private. We can even have
*_internal.h header files for those to make them available for unit testing
as we do with some other
classes. I've added a note of this caveat to the doc.

> It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
still be implemented
> with internal synchronization to guarantee thread-safety?

If these are exclusively accessed from the main thread (where they are used
to assemble
FileDecryptionProperties) then there is still no need to synchronize them.

@Tham

> As I see, the returned FileDecryptionProperties object from 
> PropertiesDrivenCryptoFactory
is not mutable,
> since it has a DecryptionKeyRetriever object.

If the DecryptionKeyRetriever accesses the caches directly, then it will
indeed need to be synchronized.
Instead, we can ensure that cache access happens on the main thread before
worker tasks are launched.
After we assemble this ahead-of-time set of keys it will not change during
the course of a read, so the
DecryptionKeyRetriever can safely access it without mutexes. I've added a
comment to the doc

On Fri, Nov 13, 2020 at 3:09 AM Gidon Gershinsky  wrote:

> Hi all,
>
> Glad to see the parquet-cpp progress on this! Can I suggest creating a
> googledoc for the technical discussion? The current md doc format seems to
> be harder for pinpointed comments. I got a few, but they are too minor for
> sending to the two mailing lists.
>
> Cheers, Gidon
>
>
> On Fri, Nov 13, 2020 at 9:17 AM Tham Ha  wrote:
>
>> Hi Ben,
>>
>> Can you reconsider this point:
>> > Properties (including FileDecryptionProperties) are immutable after
>> construction and thus can be safely shared between threads with no further
>> synchronization.
>>
>> As I see, the returned FileDecryptionProperties object from 
>> PropertiesDrivenCryptoFactory
>> is not mutable, since it has a DecryptionKeyRetriever object (i.e. 
>> FileKeyUnwrapper
>> object in this pull) that helps get the key from key metadata.
>> From the current implementation of FileKeyUnwrapper with the cache, I
>> think synchronization is necessary, unless you have another idea.
>>
>> Cheers, Tham
>>
>> On Fri, Nov 13, 2020 at 1:28 PM Micah Kornfield 
>> wrote:
>>
>>> I skimmed through and this seems like a clean design (I would have to
>>> reread the PR to do a comparison.  A few thoughts of the top of my head:
>>>
>>>
>>> > - Multiple internal classes are left public in header files, where it
>>> > would be
>>> >   preferred that public classes be kept to a minimum.
>>>
>>> I think some of these classes are non-trivial.  If that  is the case it
>>> would be preferable to put them in an internal namespace to ensure
>>> adequate unit testing is in place.
>>>
>>> It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
>>> still be implemented with internal synchronization to guarantee
>>> thread-safety?
>>>
>>>
>>>
>>> On Wed, Nov 11, 2020 at 6:19 PM Benjamin Kietzman 
>>> wrote:
>>>
>>> > In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
>>> > https://github.com/apache/arrow/pull/8023 which port the parquet-mr
>>> > design to c++: there's some question whether that design is consistent
>>> > with the style and conventions of the c++ implementation of parquet.
>>> >
>>> > Here is a Gist with a sketched alternative design adapted to c++
>>> > https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43
>>> >
>>> > Specific concerns in brief:
>>> > - Multiple internal classes are left public in header files, where it
>>> > would be
>>> >   preferred that public classes be kept to a minimum.
>>> > - Several levels of explicit synchronization with mutexes are used to
>>> > guarantee
>>> >   that each class is completely thread safe, but this is unnecessary
>>> > overhead
>>> >   given the pattern of use of `parquet-cpp`.
>>> >
>>>
>>
>>
>> --
>> Tham Ha Thi
>> C++ Developer
>> *EMOTIV* www.emotiv.com
>> Neurotech for the Global Community
>>
>> LEGAL NOTICE
>>
>> This message (including all attachments) contains confidential
>> information intended for a specific individual and purpose, and is
>> protected by law.  Any confidentiality or privilege is not waived or lost
>> because this email has been sent to you by mistake. If you have received it
>> in error, please let us know by reply email, delete it from your system and
>> destroy any copies.
>>
>> This email is also subject to copyright. Any disclosure, copying, or
>> distribution of this message, or the taking of any action based on it, is
>> strictly prohibited.
>>
>> Emails may be interfered with, may contain computer viruses or other
>> defects and may not be successfully replicated on other systems. We 

Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-13 Thread Tham Ha
Hi Ben,

Can you reconsider this point:
> Properties (including FileDecryptionProperties) are immutable after
construction and thus can be safely shared between threads with no further
synchronization.

As I see, the returned FileDecryptionProperties object from
PropertiesDrivenCryptoFactory
is not mutable, since it has a DecryptionKeyRetriever object (i.e.
FileKeyUnwrapper
object in this pull) that helps get the key from key metadata.
>From the current implementation of FileKeyUnwrapper with the cache, I think
synchronization is necessary, unless you have another idea.

Cheers, Tham

On Fri, Nov 13, 2020 at 1:28 PM Micah Kornfield 
wrote:

> I skimmed through and this seems like a clean design (I would have to
> reread the PR to do a comparison.  A few thoughts of the top of my head:
>
>
> > - Multiple internal classes are left public in header files, where it
> > would be
> >   preferred that public classes be kept to a minimum.
>
> I think some of these classes are non-trivial.  If that  is the case it
> would be preferable to put them in an internal namespace to ensure
> adequate unit testing is in place.
>
> It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
> still be implemented with internal synchronization to guarantee
> thread-safety?
>
>
>
> On Wed, Nov 11, 2020 at 6:19 PM Benjamin Kietzman 
> wrote:
>
> > In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
> > https://github.com/apache/arrow/pull/8023 which port the parquet-mr
> > design to c++: there's some question whether that design is consistent
> > with the style and conventions of the c++ implementation of parquet.
> >
> > Here is a Gist with a sketched alternative design adapted to c++
> > https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43
> >
> > Specific concerns in brief:
> > - Multiple internal classes are left public in header files, where it
> > would be
> >   preferred that public classes be kept to a minimum.
> > - Several levels of explicit synchronization with mutexes are used to
> > guarantee
> >   that each class is completely thread safe, but this is unnecessary
> > overhead
> >   given the pattern of use of `parquet-cpp`.
> >
>


-- 
Tham Ha Thi
C++ Developer
*EMOTIV* www.emotiv.com
Neurotech for the Global Community

LEGAL NOTICE

This message (including all attachments) contains confidential information
intended for a specific individual and purpose, and is protected by law.
Any confidentiality or privilege is not waived or lost because this email
has been sent to you by mistake. If you have received it in error, please
let us know by reply email, delete it from your system and destroy any
copies.

This email is also subject to copyright. Any disclosure, copying, or
distribution of this message, or the taking of any action based on it, is
strictly prohibited.

Emails may be interfered with, may contain computer viruses or other
defects and may not be successfully replicated on other systems. We give no
warranties in relation to these matters. If you have any doubts about the
authenticity of an email purportedly sent by us, please contact us
immediately.


Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-13 Thread Gidon Gershinsky
Hi all,

Glad to see the parquet-cpp progress on this! Can I suggest creating a
googledoc for the technical discussion? The current md doc format seems to
be harder for pinpointed comments. I got a few, but they are too minor for
sending to the two mailing lists.

Cheers, Gidon


On Fri, Nov 13, 2020 at 9:17 AM Tham Ha  wrote:

> Hi Ben,
>
> Can you reconsider this point:
> > Properties (including FileDecryptionProperties) are immutable after
> construction and thus can be safely shared between threads with no further
> synchronization.
>
> As I see, the returned FileDecryptionProperties object from 
> PropertiesDrivenCryptoFactory
> is not mutable, since it has a DecryptionKeyRetriever object (i.e. 
> FileKeyUnwrapper
> object in this pull) that helps get the key from key metadata.
> From the current implementation of FileKeyUnwrapper with the cache, I
> think synchronization is necessary, unless you have another idea.
>
> Cheers, Tham
>
> On Fri, Nov 13, 2020 at 1:28 PM Micah Kornfield 
> wrote:
>
>> I skimmed through and this seems like a clean design (I would have to
>> reread the PR to do a comparison.  A few thoughts of the top of my head:
>>
>>
>> > - Multiple internal classes are left public in header files, where it
>> > would be
>> >   preferred that public classes be kept to a minimum.
>>
>> I think some of these classes are non-trivial.  If that  is the case it
>> would be preferable to put them in an internal namespace to ensure
>> adequate unit testing is in place.
>>
>> It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
>> still be implemented with internal synchronization to guarantee
>> thread-safety?
>>
>>
>>
>> On Wed, Nov 11, 2020 at 6:19 PM Benjamin Kietzman 
>> wrote:
>>
>> > In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
>> > https://github.com/apache/arrow/pull/8023 which port the parquet-mr
>> > design to c++: there's some question whether that design is consistent
>> > with the style and conventions of the c++ implementation of parquet.
>> >
>> > Here is a Gist with a sketched alternative design adapted to c++
>> > https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43
>> >
>> > Specific concerns in brief:
>> > - Multiple internal classes are left public in header files, where it
>> > would be
>> >   preferred that public classes be kept to a minimum.
>> > - Several levels of explicit synchronization with mutexes are used to
>> > guarantee
>> >   that each class is completely thread safe, but this is unnecessary
>> > overhead
>> >   given the pattern of use of `parquet-cpp`.
>> >
>>
>
>
> --
> Tham Ha Thi
> C++ Developer
> *EMOTIV* www.emotiv.com
> Neurotech for the Global Community
>
> LEGAL NOTICE
>
> This message (including all attachments) contains confidential information
> intended for a specific individual and purpose, and is protected by law.
> Any confidentiality or privilege is not waived or lost because this email
> has been sent to you by mistake. If you have received it in error, please
> let us know by reply email, delete it from your system and destroy any
> copies.
>
> This email is also subject to copyright. Any disclosure, copying, or
> distribution of this message, or the taking of any action based on it, is
> strictly prohibited.
>
> Emails may be interfered with, may contain computer viruses or other
> defects and may not be successfully replicated on other systems. We give no
> warranties in relation to these matters. If you have any doubts about the
> authenticity of an email purportedly sent by us, please contact us
> immediately.
>
>


Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-12 Thread Micah Kornfield
I skimmed through and this seems like a clean design (I would have to
reread the PR to do a comparison.  A few thoughts of the top of my head:


> - Multiple internal classes are left public in header files, where it
> would be
>   preferred that public classes be kept to a minimum.

I think some of these classes are non-trivial.  If that  is the case it
would be preferable to put them in an internal namespace to ensure
adequate unit testing is in place.

It seems like KmsClient and maybe PropertiesDrivenCryptoFactory should
still be implemented with internal synchronization to guarantee
thread-safety?



On Wed, Nov 11, 2020 at 6:19 PM Benjamin Kietzman 
wrote:

> In the context of https://issues.apache.org/jira/browse/ARROW-9318 /
> https://github.com/apache/arrow/pull/8023 which port the parquet-mr
> design to c++: there's some question whether that design is consistent
> with the style and conventions of the c++ implementation of parquet.
>
> Here is a Gist with a sketched alternative design adapted to c++
> https://gist.github.com/bkietz/f701d32add6f5a2aeed89ce36a443d43
>
> Specific concerns in brief:
> - Multiple internal classes are left public in header files, where it
> would be
>   preferred that public classes be kept to a minimum.
> - Several levels of explicit synchronization with mutexes are used to
> guarantee
>   that each class is completely thread safe, but this is unnecessary
> overhead
>   given the pattern of use of `parquet-cpp`.
>