Ok, so Tham and I had a look at the current pyarrow and parquet-cpp configuration objects. There is no Hadoop-like free map (this is good, I guess). Instead, the property keys are pre-defined in most objects.
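For instance (a sketch from memory; I may be off on the exact setter names), the parquet-cpp writer properties are configured through a builder with typed, pre-defined setters rather than a free map:

    #include <memory>
    #include <parquet/properties.h>

    // Each property has its own typed setter; there is no generic
    // string-to-string map on this object.
    std::shared_ptr<parquet::WriterProperties> props =
        parquet::WriterProperties::Builder()
            .compression(parquet::Compression::SNAPPY)
            ->enable_dictionary()
            ->build();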
But some objects (such as HdfsConnectionConfig, https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L87) have a number of pre-defined keys - plus a free string-to-string map, `extra_conf`.

This approach is a good fit for us, because we build tools that work with different external KMSs (encryption Key Management Services). Each KMS requires a custom client that connects Parquet encryption to the KMS server. We provide an interface for such clients; many properties are pre-defined, but the custom client implementations will require custom properties. We'll define configuration objects that will look like this (a rough usage sketch is appended after the forwarded message below):

    struct KmsConnectionConfig {
      std::string kms_client_class;
      std::string kms_instance_id;
      std::string kms_instance_url;
      std::string key_access_token;
      std::unordered_map<std::string, std::string> custom_kms_conf;
    };

    struct EncryptionConfig {
      std::string column_keys;
      std::string footer_key;
      std::string encryption_algorithm;
    };

Cheers, Gidon

---------- Forwarded message ---------
From: Gidon Gershinsky <gg5...@gmail.com>
Date: Tue, Jul 7, 2020 at 9:35 AM
Subject: Property-driven Parquet encryption
To: dev <dev@arrow.apache.org>
Cc: tham <t...@emotiv.com>

Hi all,

We are working on Parquet modular encryption, and are currently adding a high-level interface that allows encrypting/decrypting Parquet files via properties only (without calling the low-level API). In the Spark/parquet-mr domain, we're using the Hadoop configuration properties for that purpose - they are already passed from Spark to Parquet, and allow adding custom key-value properties that can carry the list of encrypted columns, key identities, etc., as described in https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing

I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp ecosystem. Is there an analog of the Hadoop configuration (a free key-value map, passed all the way down to parquet-cpp)? Or a more structured configuration object (where we'd need to add the encryption-related properties)? All suggestions are welcome.

Cheers, Gidon
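P.S. To make the proposal above more concrete, here is a minimal sketch of how an application might populate the two structs. The field values, the custom map keys, and the column_keys string format are hypothetical placeholders, only meant to show how the free custom_kms_conf map complements the pre-defined fields:

    // Assumes the KmsConnectionConfig / EncryptionConfig definitions above.

    // Connection settings for a hypothetical KMS deployment.
    KmsConnectionConfig kms_config;
    kms_config.kms_client_class = "MyVaultKmsClient";             // hypothetical custom client
    kms_config.kms_instance_id  = "kms-instance-1";               // hypothetical
    kms_config.kms_instance_url = "https://kms.example.com:8200"; // hypothetical
    kms_config.key_access_token = "<access token>";
    // Client-specific settings go into the free map.
    kms_config.custom_kms_conf["vault.namespace"]   = "analytics"; // hypothetical key
    kms_config.custom_kms_conf["vault.timeout.sec"] = "30";        // hypothetical key

    // Per-file encryption settings.
    EncryptionConfig enc_config;
    enc_config.column_keys = "key1:ssn,credit_card;key2:address"; // hypothetical format
    enc_config.footer_key  = "footer_key_id";                     // hypothetical
    enc_config.encryption_algorithm = "AES_GCM_V1";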