Hi Michah,

Thanks! I was hoping for community feedback; it's better to discuss these things now than during the pull request review.

> 1. "kms_client_class" This sounds like it might be a very Java centric
> approach. Have you given consideration to how this can be used in
> C++/Python?

Yes, yesterday Tham helped me to realize this. If no easy-to-use "C++ magic" is found (something that allows instantiating classes by their names), we'll simply define a KmsClient factory interface that will be explicitly registered in the key management tools.

> Should I just RTFD

I googled RTFD definitions; not sure if any of those I found apply here. Could you elaborate? :)

> 2. std::unordered_map<std::string, std::string> custom_kms_conf; Is
> uniqueness of "keys" intentional (i.e. why not std::vector<std::pair<>>)?

Yes. Property keys must be unique.

> 3. It seems like a slightly asymmetric API to have key-value pairs
> separately for custom_kms_conf and packing all column key metadata into a
> serialized string. Is there a reason for this?

The custom_kms_conf will be consumed by custom KMS implementations. But if we go with the KmsClientFactory approach, then this parameter might indeed be dropped, because the factory will already be configured with the custom properties.

> packing all column key metadata into a serialized string. Is there a
> reason for this?

I'm not sure I understand. By column key metadata, do you mean the column_keys parameter?

Cheers, Gidon

> On Wed, Jul 8, 2020 at 11:06 PM Gidon Gershinsky <gg5...@gmail.com> wrote:
>
> > Ok, so we had a look with Tham at the current pyarrow and parquet-cpp
> > configuration objects. There is no Hadoop-like free map (this is good, I
> > guess). Instead, the property keys are pre-defined in most objects.
> >
> > But some objects (such as HdfsConnectionConfig,
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L87)
> > have a number of pre-defined keys - and a free string-to-string map,
> > `extra_conf`. This approach is a good fit for us, because we build tools
> > that work with different external KMSs (encryption Key Management
> > Services). Each KMS requires a custom client that will connect
> > parquet encryption to the KMS server. We provide an interface for such
> > clients; many properties are pre-defined, but the custom client
> > implementations will require custom properties. We'll define configuration
> > objects that will look like this:
> >
> > struct KmsConnectionConfig {
> >   std::string kms_client_class;
> >   std::string kms_instance_id;
> >   std::string kms_instance_url;
> >   std::string key_access_token;
> >   std::unordered_map<std::string, std::string> custom_kms_conf;
> > };
> >
> > struct EncryptionConfig {
> >   std::string column_keys;
> >   std::string footer_key;
> >   std::string encryption_algorithm;
> > };
> >
> > Cheers, Gidon
> >
> >
> > ---------- Forwarded message ---------
> > From: Gidon Gershinsky <gg5...@gmail.com>
> > Date: Tue, Jul 7, 2020 at 9:35 AM
> > Subject: Property-driven Parquet encryption
> > To: dev <dev@arrow.apache.org>
> > Cc: tham <t...@emotiv.com>
> >
> >
> > Hi all,
> >
> > We are working on the Parquet modular encryption, and are currently adding
> > a high-level interface that allows encrypting/decrypting parquet files via
> > properties only (without calling the low level API). In the
> > spark/parquet-mr domain, we're using the Hadoop configuration properties
> > for that purpose - they are already passed from Spark to Parquet, and allow
> > adding custom key-value properties that can carry the list of encrypted
> > columns, key identities etc, as described in the
> >
> > https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing
> >
> > I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp
> > ecosystem. Is there an analog of Hadoop configuration (a free key-value
> > map, passed all the way down to parquet-cpp)? Or a more structured
> > configuration object (where we'll need to add the encryption-related
> > properties)? All suggestions are welcome.
> >
> > Cheers, Gidon
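
To make the KmsClientFactory idea from the reply to point 1 more concrete, here is a rough C++ sketch of how an explicitly registered factory could replace the "kms_client_class" string. Apart from the KmsConnectionConfig fields quoted above, everything here is an assumption made for illustration: the KmsClient methods (WrapKey/UnwrapKey), the KeyManagementTools holder, the DemoKmsClient toy implementation and the "demo.separator" property are hypothetical names, not an existing parquet-cpp API. The demo client does no real encryption.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

using PropertyMap = std::unordered_map<std::string, std::string>;

// Pre-defined connection properties from the thread. kms_client_class and
// custom_kms_conf are omitted on the assumption that the factory replaces them.
struct KmsConnectionConfig {
  std::string kms_instance_id;
  std::string kms_instance_url;
  std::string key_access_token;
};

// Hypothetical client interface; method names are placeholders.
class KmsClient {
 public:
  virtual ~KmsClient() = default;
  virtual std::string WrapKey(const std::string& key_bytes,
                              const std::string& master_key_id) = 0;
  virtual std::string UnwrapKey(const std::string& wrapped_key,
                                const std::string& master_key_id) = 0;
};

// Hypothetical factory interface, implemented once per KMS type.
class KmsClientFactory {
 public:
  virtual ~KmsClientFactory() = default;
  virtual std::unique_ptr<KmsClient> CreateKmsClient(
      const KmsConnectionConfig& config) = 0;
};

// Toy client: "wrapping" only prefixes the key with the master key id.
class DemoKmsClient : public KmsClient {
 public:
  explicit DemoKmsClient(const PropertyMap& custom_properties) {
    auto it = custom_properties.find("demo.separator");
    separator_ = (it != custom_properties.end()) ? it->second : ":";
  }
  std::string WrapKey(const std::string& key_bytes,
                      const std::string& master_key_id) override {
    return master_key_id + separator_ + key_bytes;
  }
  std::string UnwrapKey(const std::string& wrapped_key,
                        const std::string& master_key_id) override {
    const std::string prefix = master_key_id + separator_;
    if (wrapped_key.compare(0, prefix.size(), prefix) != 0) {
      throw std::invalid_argument("key was not wrapped with this master key");
    }
    return wrapped_key.substr(prefix.size());
  }

 private:
  std::string separator_;
};

// The factory carries the custom properties, so the connection config
// does not need a free string-to-string map.
class DemoKmsClientFactory : public KmsClientFactory {
 public:
  explicit DemoKmsClientFactory(PropertyMap custom_properties)
      : custom_properties_(std::move(custom_properties)) {}
  std::unique_ptr<KmsClient> CreateKmsClient(
      const KmsConnectionConfig& /*config*/) override {
    return std::make_unique<DemoKmsClient>(custom_properties_);
  }

 private:
  PropertyMap custom_properties_;
};

// Hypothetical stand-in for the key management tools: the factory is
// registered explicitly instead of being looked up by class name.
class KeyManagementTools {
 public:
  void RegisterKmsClientFactory(std::shared_ptr<KmsClientFactory> factory) {
    factory_ = std::move(factory);
  }
  std::unique_ptr<KmsClient> GetKmsClient(const KmsConnectionConfig& config) const {
    if (!factory_) throw std::runtime_error("no KmsClientFactory registered");
    return factory_->CreateKmsClient(config);
  }

 private:
  std::shared_ptr<KmsClientFactory> factory_;
};
```

With something along these lines, an application would construct its factory with whatever custom properties its KMS needs, register it once, and the tools would call GetKmsClient with the pre-defined connection properties only. Python bindings could presumably wrap the same factory interface, but that part is not sketched here.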