adamreeve opened a new issue, #15216:
URL: https://github.com/apache/datafusion/issues/15216

   ### Is your feature request related to a problem or challenge?
   
   arrow-rs is in the process of gaining support for [Parquet modular 
encryption](https://parquet.apache.org/docs/file-format/data-pages/encryption/) 
- see https://github.com/apache/arrow-rs/issues/7278. It would be useful to be 
able to read and write encrypted Parquet files with DataFusion, but it's not 
clear how to integrate this feature due to the complex configuration required.
   
   Examples of this complex configuration are:
   * Users may require different encryption or decryption keys to be specified 
per Parquet file
   * The encryption and decryption keys specified may depend on the file schema
   * The encryption keys may need to be generated per file by interacting with 
a user's key management service (KMS)
   * Decryption keys may need to be retrieved dynamically based on the metadata 
read from Parquet files and require interaction with a KMS. This process would 
be opaque to DataFusion, but requires the `FileDecryptionProperties` in 
arrow-rs to be created with a callback that can't be represented as a string 
configuration option (https://github.com/apache/arrow-rs/issues/7257).
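To make the last point concrete, here is a minimal sketch of a key-retrieval callback. The names `ExampleKeyRetriever` and `InMemoryKms` are hypothetical and only for illustration; they are not claimed to match the arrow-rs API, and a real implementation would call out to an actual KMS rather than a map.

```rust
use std::collections::HashMap;

/// Hypothetical callback interface: maps the key metadata stored in a
/// Parquet file to the raw decryption key bytes.
pub trait ExampleKeyRetriever: Send + Sync {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String>;
}

/// Stand-in for a real key management service, backed by a map.
pub struct InMemoryKms {
    keys: HashMap<Vec<u8>, Vec<u8>>,
}

impl InMemoryKms {
    pub fn new() -> Self {
        Self { keys: HashMap::new() }
    }

    /// Associate key bytes with the metadata that identifies them.
    pub fn register(&mut self, key_metadata: &[u8], key: &[u8]) {
        self.keys.insert(key_metadata.to_vec(), key.to_vec());
    }
}

impl ExampleKeyRetriever for InMemoryKms {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String> {
        // Look the key up by the metadata read from the file.
        self.keys
            .get(key_metadata)
            .cloned()
            .ok_or_else(|| format!("no key registered for metadata {:?}", key_metadata))
    }
}
```

Because the lookup is arbitrary code, it cannot be round-tripped through a string configuration option, which is the core of the integration problem.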
   
   I have an example of what using a KMS to read and write encrypted files 
might look like, but it isn't yet merged in arrow-rs: 
https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs
   
   Currently all Parquet format options can be easily encoded as strings or 
primitive types, and live in `datafusion-common`, which has an optional 
dependency on the parquet crate, although `TableParquetOptions` is always 
defined even if the parquet feature is disabled.
   
   We're experimenting with using encryption in DataFusion by adding encoded 
keys to the `ParquetOptions` struct, but this is quite limited and doesn't 
support the more complex configuration options I mention above.
   
   ### Describe the solution you'd like
   
   One solution might be to allow users to arbitrarily customize the Parquet 
writing and reading options, e.g. with something like:
   ```diff
   --- a/datafusion/common/src/config.rs
   +++ b/datafusion/common/src/config.rs
   @@ -1615,6 +1615,12 @@ pub struct TableParquetOptions {
        /// )
        /// ```
        pub key_value_metadata: HashMap<String, Option<String>>,
   +    /// Callback to modify the Parquet WriterPropertiesBuilder with custom configuration
   +    #[cfg(feature = "parquet")]
   +    pub writer_configuration: Option<Arc<dyn Fn(WriterPropertiesBuilder) -> WriterPropertiesBuilder>>,
   +    /// Callback to modify the Parquet ArrowReaderOptions with custom configuration
   +    #[cfg(feature = "parquet")]
   +    pub read_configuration: Option<Arc<dyn Fn(ArrowReaderOptions) -> ArrowReaderOptions>>,
    }
    
    impl TableParquetOptions {
   ```
   
   These callbacks would probably need some other inputs too, such as the file 
schema. This would allow DataFusion users to specify encryption-specific 
options without DataFusion itself needing to know about them, and might be 
useful for applying other Parquet options that aren't already exposed in 
DataFusion. It also supports generating different encryption properties per 
file.
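A rough sketch of how a per-file writer callback could be applied, using stand-in types instead of the real parquet crate types. `MockWriterPropertiesBuilder`, `FileContext`, `WriterConfigFn`, and `MockTableParquetOptions` are all hypothetical names, and the extra schema input is reduced to a list of column names.

```rust
use std::sync::Arc;

/// Simplified stand-in for a writer properties builder (hypothetical).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MockWriterPropertiesBuilder {
    pub encryption_key_id: Option<String>,
    pub encrypted_columns: Vec<String>,
}

/// Per-file inputs a callback might need (hypothetical).
pub struct FileContext {
    pub path: String,
    pub column_names: Vec<String>,
}

/// Callback that customizes the builder for one file.
pub type WriterConfigFn = Arc<
    dyn Fn(MockWriterPropertiesBuilder, &FileContext) -> MockWriterPropertiesBuilder
        + Send
        + Sync,
>;

/// Stand-in for the options struct holding the optional callback.
#[derive(Clone, Default)]
pub struct MockTableParquetOptions {
    pub writer_configuration: Option<WriterConfigFn>,
}

impl MockTableParquetOptions {
    /// Build writer properties for one file, applying the callback if set.
    pub fn writer_builder_for(&self, ctx: &FileContext) -> MockWriterPropertiesBuilder {
        let builder = MockWriterPropertiesBuilder::default();
        match &self.writer_configuration {
            Some(configure) => configure(builder, ctx),
            None => builder,
        }
    }
}

/// Example callback: derive a key id from the path and encrypt only
/// columns whose names start with "secret_".
pub fn example_options() -> MockTableParquetOptions {
    let configure: WriterConfigFn = Arc::new(
        |mut builder: MockWriterPropertiesBuilder, ctx: &FileContext| {
            builder.encryption_key_id = Some(format!("key-for-{}", ctx.path));
            builder.encrypted_columns = ctx
                .column_names
                .iter()
                .filter(|name| name.starts_with("secret_"))
                .cloned()
                .collect();
            builder
        },
    );
    MockTableParquetOptions {
        writer_configuration: Some(configure),
    }
}
```

Passing the file context by reference keeps the callback free to produce different properties for each file, at the cost of making the options struct non-serializable.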
   
   `TableParquetOptions` can currently be created from environment variables, 
which wouldn't be possible for these extra fields, but I don't think that 
should be a problem?
   
   Another minor issue is that `TableParquetOptions` implements `PartialEq`, 
and I don't think it would be possible to sanely implement equality while 
allowing custom callbacks like this.
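One possible compromise, sketched here with hypothetical types, is to compare callbacks by pointer identity with `Arc::ptr_eq`. This only says whether two options share the same callback allocation, not whether two callbacks behave the same, so it may still not count as "sane" equality for every use.

```rust
use std::sync::Arc;

/// Hypothetical callback type standing in for a Parquet configuration hook.
pub type ConfigFn = Arc<dyn Fn(u32) -> u32 + Send + Sync>;

/// Hypothetical options struct mixing a plain field with a callback.
#[derive(Clone, Default)]
pub struct OptionsWithCallback {
    pub compression_level: u32,
    pub callback: Option<ConfigFn>,
}

impl PartialEq for OptionsWithCallback {
    fn eq(&self, other: &Self) -> bool {
        // Callbacks are equal only if both are absent or both point to
        // the same Arc allocation; behavior is never compared.
        let callbacks_match = match (&self.callback, &other.callback) {
            (Some(a), Some(b)) => Arc::ptr_eq(a, b),
            (None, None) => true,
            _ => false,
        };
        self.compression_level == other.compression_level && callbacks_match
    }
}
```

With this approach a cloned options value compares equal to its source, while two separately constructed but identical closures compare unequal.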
   
   ### Describe alternatives you've considered
   
   @alamb also suggested in https://github.com/delta-io/delta-rs/issues/3300 
that it could be possible to use an `Arc<dyn Any>` to allow passing more 
complex configuration types through `TableParquetOptions`.
   
   I'm not sure exactly what this would look like though. Maybe the option 
would still hold a callback function, just hidden behind the `Any` trait, or 
maybe we would want to limit this to encryption-specific configuration options. 
Either way, I think we'd need to maintain the ability to generate 
`ArrowReaderOptions` and `WriterProperties` per file.
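As a minimal sketch of the `Arc<dyn Any>` idea, assuming a single opaque extension slot: `OptionsWithExtension` and `ExampleEncryptionConfig` are hypothetical names, and the real design might store a callback rather than a plain struct.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical user-defined encryption configuration.
#[derive(Debug, PartialEq)]
pub struct ExampleEncryptionConfig {
    pub kms_endpoint: String,
    pub footer_key_id: String,
}

/// Hypothetical options struct with one type-erased extension slot.
#[derive(Clone, Default)]
pub struct OptionsWithExtension {
    pub extension: Option<Arc<dyn Any + Send + Sync>>,
}

impl OptionsWithExtension {
    /// Return the extension if it is set and has the concrete type `T`.
    pub fn extension_as<T: Any>(&self) -> Option<&T> {
        self.extension.as_deref()?.downcast_ref::<T>()
    }
}
```

The code that owns the concrete type (for example, the Parquet reader and writer integration) would downcast the value, while the options struct itself stays unaware of encryption.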
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

