Kontinuation commented on issue #1766: URL: https://github.com/apache/datafusion-comet/issues/1766#issuecomment-2905034413
There are two potential approaches:

**Approach 1: Translate Hadoop configurations to DataFusion object store configurations.** We'll use the native DataFusion object store implementations to access cloud storage, but we have to set up credentials and other settings from the values retrieved from the Hadoop configuration. For instance, we can construct credential configs for [object_store](https://github.com/apache/arrow-rs-object-store) using the values of `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` in the Hadoop config.

A concrete example of this approach is Apache Gluten's [ConfigExtractor](https://github.com/apache/incubator-gluten/blob/v1.3.0/cpp/velox/utils/ConfigExtractor.cc#L53-L147). It uses [Velox's native storage accessors](https://github.com/facebookincubator/velox/tree/main/velox/connectors/hive/storage_adapters) and maps Hadoop configurations to the storage accessors' configurations.

This approach does not guarantee 100% compatibility with vanilla Spark: not all configurations are supported by Rust's object store implementations, and some Hadoop configurations cannot be translated at all.

**Approach 2: Call the Hadoop FileSystem API using JNI.** This approach provides better compatibility with vanilla Spark, because we use the original Hadoop FileSystem API implementations to access cloud storage. However, it adds an additional JNI interface and has higher overall complexity.

I suggest that we implement approach 1 first, as it is straightforward and covers many useful use cases, then consider approach 2 for the cases that approach 1 cannot handle.
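To make approach 1 more concrete, here is a minimal, std-only sketch of the translation step. It is an illustration, not Comet's implementation: `translate_s3a_config`, `TranslatedConfig`, and `S3A_KEY_MAP` are hypothetical names, the mapping table covers only a handful of S3A keys, and the target names are the string forms that I believe `object_store`'s S3 config keys accept. S3A keys the sketch cannot map are collected rather than silently dropped, which shows where the compatibility gap with vanilla Spark would surface.

```rust
use std::collections::HashMap;

/// Hadoop S3A keys paired with the `object_store` S3 option names they map to.
/// Assumption: this table is illustrative and deliberately incomplete.
const S3A_KEY_MAP: &[(&str, &str)] = &[
    ("fs.s3a.access.key", "aws_access_key_id"),
    ("fs.s3a.secret.key", "aws_secret_access_key"),
    ("fs.s3a.session.token", "aws_session_token"),
    ("fs.s3a.endpoint", "aws_endpoint"),
    ("fs.s3a.endpoint.region", "aws_region"),
];

/// Outcome of a translation: the options that could be mapped,
/// plus the S3A keys this sketch has no mapping for.
#[derive(Debug, Default, PartialEq)]
pub struct TranslatedConfig {
    pub options: HashMap<String, String>,
    pub unsupported: Vec<String>,
}

/// Translate Spark/Hadoop S3A settings into object-store-style options.
/// Keys may carry the `spark.hadoop.` prefix or be plain Hadoop keys.
pub fn translate_s3a_config(spark_conf: &HashMap<String, String>) -> TranslatedConfig {
    let mut out = TranslatedConfig::default();
    for (key, value) in spark_conf {
        // Spark forwards `spark.hadoop.*` entries to the Hadoop configuration.
        let hadoop_key = key.strip_prefix("spark.hadoop.").unwrap_or(key.as_str());
        // Ignore everything that is not an S3A setting.
        if !hadoop_key.starts_with("fs.s3a.") {
            continue;
        }
        match S3A_KEY_MAP.iter().find(|(from, _)| *from == hadoop_key) {
            Some((_, to)) => {
                out.options.insert((*to).to_string(), value.clone());
            }
            // No mapping in this sketch: record it so the caller can warn or fall back.
            None => out.unsupported.push(hadoop_key.to_string()),
        }
    }
    // Sort for deterministic output, since HashMap iteration order is unspecified.
    out.unsupported.sort();
    out
}
```

In a real integration, each entry of `options` would presumably be handed to the object store builder (for S3, something along the lines of parsing the key into an `AmazonS3ConfigKey` and calling `AmazonS3Builder::with_config`), while a non-empty `unsupported` list could trigger a warning or a fallback to approach 2.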
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org