Ok, so Tham and I had a look at the current pyarrow and parquet-cpp
configuration objects. There is no Hadoop-like free map (which is good, I
guess). Instead, the property keys are pre-defined in most objects.

But some objects (such as HdfsConnectionConfig,
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L87)
have a number of pre-defined keys - plus a free string-to-string map,
`extra_conf`. This approach is a good fit for us, because we build tools
for working with different external KMSs (encryption Key Management
Services). Each KMS requires a custom client that connects parquet
encryption to the KMS server. We provide an interface for such clients;
many properties are pre-defined, but custom client implementations will
require custom properties. We'll define configuration objects that will
look like this:

#include <string>
#include <unordered_map>

struct KmsConnectionConfig {
    std::string kms_client_class;   // which KMS client implementation to load
    std::string kms_instance_id;
    std::string kms_instance_url;
    std::string key_access_token;
    // free map for properties specific to a custom KMS client
    std::unordered_map<std::string, std::string> custom_kms_conf;
};

struct EncryptionConfig {
    std::string column_keys;           // encrypted columns and their key IDs
    std::string footer_key;            // key ID for footer encryption
    std::string encryption_algorithm;  // e.g. AES_GCM_V1
};
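
For illustration, a minimal sketch of how a caller might fill these in; the
client class name, the keys in custom_kms_conf, and the column_keys format
below are hypothetical, just to show where the custom properties would go:

KmsConnectionConfig kms_conf;
kms_conf.kms_client_class = "MyVaultKmsClient";          // hypothetical custom client
kms_conf.kms_instance_id = "kms-instance-1";
kms_conf.kms_instance_url = "https://kms.example.com";
kms_conf.key_access_token = "<token obtained out of band>";
// properties that only this particular client understands go into the free map
kms_conf.custom_kms_conf["vault.namespace"] = "analytics";

EncryptionConfig enc_conf;
enc_conf.column_keys = "key1:col_a,col_b;key2:col_c";    // illustrative format
enc_conf.footer_key = "key3";
enc_conf.encryption_algorithm = "AES_GCM_V1";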

Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky <gg5...@gmail.com>
Date: Tue, Jul 7, 2020 at 9:35 AM
Subject: Property-driven Parquet encryption
To: dev <dev@arrow.apache.org>
Cc: tham <t...@emotiv.com>


Hi all,

We are working on Parquet modular encryption, and are currently adding a
high-level interface that allows encrypting/decrypting parquet files via
properties only (without calling the low-level API). In the
spark/parquet-mr domain, we're using the Hadoop configuration properties
for that purpose - they are already passed from Spark to Parquet, and allow
adding custom key-value properties that can carry the list of encrypted
columns, key identities, etc., as described in
https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing

I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp
ecosystem. Is there an analog of Hadoop configuration (a free key-value
map, passed all the way down to parquet-cpp)? Or a more structured
configuration object (where we'll need to add the encryption-related
properties)? All suggestions are welcome.
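
For concreteness, a rough sketch of what we mean by a free key-value map, in
C++ terms; the property names are illustrative, in the spirit of the
Hadoop-style keys from the design document, and the idea of handing such a
map down to parquet-cpp is hypothetical at this point:

// a free string-to-string map, populated by the application layer
std::unordered_map<std::string, std::string> encryption_properties = {
    {"parquet.encryption.column.keys", "key1:col_a,col_b;key2:col_c"},
    {"parquet.encryption.footer.key", "key3"},
    {"parquet.encryption.kms.instance.url", "https://kms.example.com"},
};
// ...which would be passed all the way down to parquet-cpp, where the
// encryption-related entries are picked up when writing/reading a file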

Cheers, Gidon
