Re: [I] Support reading data from S3 using native_datafusion Parquet scanner [datafusion-comet]

via GitHub Fri, 23 May 2025 09:30:01 -0700


Kontinuation commented on issue #1766:
URL: 
https://github.com/apache/datafusion-comet/issues/1766#issuecomment-2905034413


   There are 2 potential approaches:
   
   **Approach 1: Translating hadoop configurations to datafusion object store 
configurations.**
   
   We'll use DataFusion's native object store implementations to access cloud 
storage, but we have to set up credentials and other settings using the 
values retrieved from the Hadoop configuration. For instance, we can 
construct credential configs for 
[object_store](https://github.com/apache/arrow-rs-object-store) using the 
values of `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` 
in the Hadoop config.
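
   To make the translation step concrete, here is a minimal sketch in Python (chosen only for brevity; the real implementation would live in Comet's native code). The function name `translate_s3a_config` and the mapping table are hypothetical. The left-hand keys are standard Hadoop S3A options; the right-hand names are assumed to match the string option keys accepted by `object_store`'s S3 builder and should be checked against the crate version in use.

```python
# Hypothetical sketch: translate Hadoop S3A settings into object_store-style
# S3 options. The target names on the right are assumptions to verify.
HADOOP_TO_OBJECT_STORE = {
    "fs.s3a.access.key": "aws_access_key_id",
    "fs.s3a.secret.key": "aws_secret_access_key",
    "fs.s3a.session.token": "aws_session_token",
    "fs.s3a.endpoint": "aws_endpoint",
    "fs.s3a.endpoint.region": "aws_region",
}

SPARK_HADOOP_PREFIX = "spark.hadoop."


def translate_s3a_config(spark_conf):
    """Split S3A settings into translated options and untranslated keys.

    Returns a pair: a dict of object_store-style options, and a sorted list
    of fs.s3a.* keys that this sketch has no mapping for.
    """
    translated = {}
    untranslated = []
    for key, value in spark_conf.items():
        # Spark passes Hadoop settings through with a "spark.hadoop." prefix.
        if key.startswith(SPARK_HADOOP_PREFIX):
            hadoop_key = key[len(SPARK_HADOOP_PREFIX):]
        else:
            hadoop_key = key
        if not hadoop_key.startswith("fs.s3a."):
            continue  # not an S3A setting, so nothing to translate
        target = HADOOP_TO_OBJECT_STORE.get(hadoop_key)
        if target is None:
            untranslated.append(hadoop_key)
        else:
            translated[target] = value
    return translated, sorted(untranslated)
```

   Returning the untranslated keys alongside the translated options lets the caller log or reject settings that would otherwise be silently ignored.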
   
   A concrete example of this approach is Apache Gluten's 
[ConfigExtractor](https://github.com/apache/incubator-gluten/blob/v1.3.0/cpp/velox/utils/ConfigExtractor.cc#L53-L147).
 It uses [Velox's native storage 
accessors](https://github.com/facebookincubator/velox/tree/main/velox/connectors/hive/storage_adapters)
 and maps Hadoop configurations to storage accessors' configurations.
   
   This approach does not guarantee 100% compatibility with Vanilla Spark, as 
not all configurations are supported by Rust's object store implementations and 
translations for some Hadoop configurations are not possible.
   
   **Approach 2: Call Hadoop FileSystem API using JNI.**
   
   This approach provides better compatibility with Vanilla Spark, as we are 
using the original Hadoop FileSystem API implementations to access cloud 
storage. However, it adds an additional JNI interface and has higher 
overall complexity.
   
   I suggest that we implement approach 1 first, as it is straightforward and 
covers many common use cases, then consider approach 2 for the cases that 
approach 1 cannot handle.
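
   As a rough illustration of how the two approaches could coexist, the sketch below picks a read path per configuration. Everything in it is hypothetical: the function `choose_s3_read_path`, the path labels, and the set of natively supported keys are placeholders, not Comet's actual behavior. The idea is only that any `fs.s3a.*` key without a native equivalent (for example `fs.s3a.aws.credentials.provider`, which names a Java class) would route the scan to the Hadoop FileSystem path.

```python
# Hypothetical sketch of the "approach 1 first, approach 2 as fallback" idea.
# The supported-key set is a placeholder, not Comet's actual list.
NATIVELY_SUPPORTED_KEYS = frozenset({
    "fs.s3a.access.key",
    "fs.s3a.secret.key",
    "fs.s3a.session.token",
    "fs.s3a.endpoint",
    "fs.s3a.endpoint.region",
})


def choose_s3_read_path(hadoop_conf):
    """Return (path, reasons): the read path to use and the keys that forced it."""
    unsupported = sorted(
        key
        for key in hadoop_conf
        if key.startswith("fs.s3a.") and key not in NATIVELY_SUPPORTED_KEYS
    )
    if unsupported:
        # Some setting cannot be expressed natively, so use Hadoop via JNI.
        return "hadoop_filesystem_jni", unsupported
    return "native_object_store", []
```

   Whether an unsupported key should trigger a fallback, a warning, or an error is a design decision this sketch does not settle.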


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

