Sorry, Micah, and thanks again.

Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky <gg5...@gmail.com>
Date: Fri, Jul 10, 2020 at 10:41 AM
Subject: Re: Property-driven Parquet encryption
To: dev <dev@arrow.apache.org>, <emkornfi...@gmail.com>


Hi Micah,

Thanks! I was hoping for community feedback; it's better to discuss these
things now than during the pull request review.


>
>
> * 1. "kms_client_class" This sounds like it might be a very Java centric
> approach.  Have you given consideration to how this can be used in
> C++/Python?  *


Yes, yesterday Tham helped me realize this. If no easy-to-use "C++
magic" is found (something that allows instantiating classes by name), we'll
simply define a KmsClient factory interface that will be explicitly
registered in the key management tools.
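
To make the factory idea a bit more concrete, here is a rough sketch of what
explicit registration could look like. Nothing here is settled API: the
names KmsClientRegistry, Register, Create, WrapKey and InMemoryKmsClient are
placeholders I made up for illustration, and the toy client does no real
key wrapping.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical client interface; the single method is illustrative only.
class KmsClient {
 public:
  virtual ~KmsClient() = default;
  virtual std::string WrapKey(const std::string& key_bytes,
                              const std::string& master_key_id) = 0;
};

// A factory is any callable that builds a client for a given KMS instance URL.
using KmsClientFactory =
    std::function<std::unique_ptr<KmsClient>(const std::string& kms_instance_url)>;

// Hypothetical registry that replaces instantiation by class name.
class KmsClientRegistry {
 public:
  // Returns false if a factory is already registered under this name.
  bool Register(const std::string& name, KmsClientFactory factory) {
    return factories_.emplace(name, std::move(factory)).second;
  }

  // Builds a client with the named factory; throws if the name is unknown.
  std::unique_ptr<KmsClient> Create(const std::string& name,
                                    const std::string& kms_instance_url) const {
    auto it = factories_.find(name);
    if (it == factories_.end()) {
      throw std::invalid_argument("no KmsClient factory registered as: " + name);
    }
    return it->second(kms_instance_url);
  }

 private:
  std::unordered_map<std::string, KmsClientFactory> factories_;
};

// Toy client used only to exercise the registry; it performs no encryption.
class InMemoryKmsClient : public KmsClient {
 public:
  std::string WrapKey(const std::string& key_bytes,
                      const std::string& master_key_id) override {
    return master_key_id + ":" + key_bytes;
  }
};

// Usage: register once at startup, then create clients by name.
inline void RegistryUsageExample() {
  KmsClientRegistry registry;
  registry.Register("in_memory", [](const std::string&) -> std::unique_ptr<KmsClient> {
    return std::unique_ptr<KmsClient>(new InMemoryKmsClient());
  });
  std::unique_ptr<KmsClient> client =
      registry.Create("in_memory", "https://kms.example.invalid");
  assert(client->WrapKey("abc", "k1") == "k1:abc");
}
```

Since the factory is an ordinary callable, a custom KMS implementation can
capture whatever settings it needs when it is registered, and the same
pattern should be straightforward to expose to Python.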



> *Should I just RTFD*
>

I googled RTFD definitions; not sure if any of those I found apply here.
Could you elaborate? :)


>
> * 2.  std::unordered_map<std::string, std::string> custom_kms_conf; Is
> uniqueness of "keys" intentional (i.e. why not std::vector<std::pair<>>)?*
>

Yes. Property keys must be unique.
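
As a small illustration of why the map is a natural fit: with
std::unordered_map a second insert under the same key is rejected, so a
duplicated property is easy to detect, whereas a vector of pairs would
silently hold both. The helper name FirstDuplicateKey and the property
names in the usage function are made up for this sketch.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using PropertyMap = std::unordered_map<std::string, std::string>;
using PropertyList = std::vector<std::pair<std::string, std::string>>;

// Copies the pairs into `out` and returns the first key that appears twice,
// or an empty string if all keys are unique. emplace() never overwrites,
// so the first value seen for a key is the one that stays in the map.
inline std::string FirstDuplicateKey(const PropertyList& pairs, PropertyMap* out) {
  for (const auto& kv : pairs) {
    if (!out->emplace(kv.first, kv.second).second) {
      return kv.first;
    }
  }
  return "";
}

// Usage with hypothetical property names.
inline void DuplicateKeyExample() {
  PropertyList pairs = {{"region", "eu"}, {"timeout", "30"}, {"region", "us"}};
  PropertyMap conf;
  assert(FirstDuplicateKey(pairs, &conf) == "region");
  assert(conf.at("region") == "eu");  // the first value is kept
}
```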


>
>
>
> * 3.  It seems like a slightly asymmetric API to have key-value pairs
> separately  for custom_kms_conf and packing all column key metadata into a
> serialized string.  Is there a reason for this?*
>

The custom_kms_conf will be consumed by custom KMS implementations. But if
we go with the KmsClientFactory approach, then this parameter might indeed
be dropped, because the factory will already be configured with the
custom properties.


> * packing all column key metadata into a serialized string.  Is there a
> reason for this?*
>

I'm not sure I understand. By column key metadata, do you mean the
column_keys parameter?

Cheers, Gidon


>
>
> On Wed, Jul 8, 2020 at 11:06 PM Gidon Gershinsky <gg5...@gmail.com> wrote:
>
> > Ok, so we had a look with Tham at the current pyarrow and parquet-cpp
> > configuration objects. There is no Hadoop-like free map (this is good, I
> > guess). Instead, the property keys are pre-defined in most objects.
> >
> > But some objects (such as  HdfsConnectionConfig ,
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L87)
> > have a number of pre-defined keys - and a free string-to-string map,
> > `extra_conf`. This approach is a good fit for us, because we build tools
> > that allow to work with different external KMS's (encryption Key Management
> > Services). Each KMS requires a custom client that will connect
> > parquet encryption to the KMS server. We provide an interface for such
> > clients; many properties are pre-defined, but the custom client
> > implementations will require custom properties. We'll define configuration
> > objects that will look like this:
> >
> > struct KmsConnectionConfig {
> >     std::string kms_client_class;
> >     std::string kms_instance_id;
> >     std::string kms_instance_url;
> >     std::string key_access_token;
> >     std::unordered_map<std::string, std::string> custom_kms_conf;
> > };
> >
> > struct EncryptionConfig {
> >     std::string column_keys;
> >     std::string footer_key;
> >     std::string encryption_algorithm;
> > };
> >
> > Cheers, Gidon
> >
> >
> > ---------- Forwarded message ---------
> > From: Gidon Gershinsky <gg5...@gmail.com>
> > Date: Tue, Jul 7, 2020 at 9:35 AM
> > Subject: Property-driven Parquet encryption
> > To: dev <dev@arrow.apache.org>
> > Cc: tham <t...@emotiv.com>
> >
> >
> > Hi all,
> >
> > We are working on the Parquet modular encryption, and are currently adding
> > a high-level interface that allows to encrypt/decrypt parquet files via
> > properties only (without calling the low level API). In the
> > spark/parquet-mr domain, we're using the Hadoop configuration properties
> > for that purpose - they are already passed from Spark to Parquet, and allow
> > to add custom key-value properties that can carry the list of encrypted
> > columns, key identities etc, as described in the
> >
> >
> https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing
> >
> > I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp
> > ecosystem. Is there an analog of Hadoop configuration (a free key-value
> > map, passed all the way down to parquet-cpp)? Or a more structured
> > configuration object (where we'll need to add the encryption-related
> > properties)? All suggestions are welcome.
> >
> > Cheers, Gidon
> >
>
