LantaoJin opened a new issue, #70:
URL: https://github.com/apache/datafusion-java/issues/70
### Is your feature request related to a problem or challenge?
`SessionContext.registerParquet(name, path)` (and the read/register
counterparts for CSV, NDJSON, Arrow, Avro) accept arbitrary path strings,
but there is no Java surface to attach an `object_store::ObjectStore`
implementation to a URL scheme + bucket. As a result, today the only
remote-storage paths that work from `datafusion-java` are the ones the
default `RuntimeEnv` resolves out of process-level environment variables
— there is no way to:
- Pass S3 access key / secret / session token / region / endpoint per
context (so multi-tenant Java apps cannot give two contexts different
buckets or different credentials in the same JVM).
- Use anything other than the AWS-SDK env-var defaults (no GCS, no
Azure Blob, no plain HTTP-listing, no MinIO with a custom endpoint).
- Re-point an `s3://` URL at a different region / endpoint without
process-wide env mutation.
Concretely, the following fails today even with valid AWS env vars set,
because no S3 store is registered with the runtime:
```java
ctx.registerParquet("orders", "s3://my-bucket/orders/2026-05/");
// RuntimeException: No suitable object store found for
// s3://my-bucket/orders/2026-05/
```
DataFusion's Rust `RuntimeEnv::register_object_store(url, store)` already
solves this end of the problem; the gap is purely in the Java surface
above the JNI line.
### Describe the solution you'd like
A typed registration API at construction time, on the existing
`SessionContextBuilder`. Stores are registered before `SessionContext` is
returned; the registration travels through the same
`session_options.proto` byte channel that the rest of the builder uses,
so no new JNI signature is needed.
```java
SessionContext ctx = SessionContext.builder()
    .registerObjectStore(ObjectStoreOptions.s3()
        .bucket("my-bucket")
        .region("us-east-1")
        .accessKeyId("...")
        .secretAccessKey("...")
        .build())
    .registerObjectStore(ObjectStoreOptions.s3()
        .bucket("other-bucket")
        .region("eu-west-1")
        .endpoint("https://minio.internal:9000")
        .allowHttp(true)
        .build())
    .build();
ctx.registerParquet("orders", "s3://my-bucket/orders/");
ctx.registerParquet("audit", "s3://other-bucket/audit/");
```
`ObjectStoreOptions` is a sealed-style hierarchy with one concrete
factory per backend:
- `ObjectStoreOptions.s3()` — `AmazonS3` (also covers MinIO / R2 / any
S3-compatible endpoint via `endpoint(...)` + `allowHttp(...)`).
- `ObjectStoreOptions.gcs()` — `GoogleCloudStorage`.
- `ObjectStoreOptions.azure()` — `MicrosoftAzure` Blob Storage.
- `ObjectStoreOptions.http()` — listing-capable HTTP store.
For a v1 the four above are the natural set: they're the four that
upstream `object_store` exposes as first-class, and they cover essentially
every reported `s3://` / `gs://` / `az://` / `https://` use case I've
seen in datafusion-java issues.
Each builder maps 1:1 to the corresponding `object_store` Rust builder
fields (`AmazonS3Builder`, `GoogleCloudStorageBuilder`,
`MicrosoftAzureBuilder`, `HttpBuilder`); the JNI side decodes the proto
once and constructs the store with `.build()`.
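To make the "sealed-style hierarchy" concrete, here is a minimal sketch of one way it could be shaped. Every name in it (`StoreOptionsSketch`, the nested classes, the `account` / `container` / `baseUrl` fields) is hypothetical and not part of datafusion-java. It uses plain factory arguments instead of the fluent setters proposed above to stay short, and a private constructor instead of the `sealed` keyword so that it does not assume a particular Java version.

```java
// Hypothetical sketch only: none of these types exist in datafusion-java.
abstract class StoreOptionsSketch {
    // Private constructor: only the nested classes below can extend this
    // type, approximating a sealed hierarchy without the `sealed` keyword.
    private StoreOptionsSketch() {}

    /** URL scheme the backend would be registered under. */
    abstract String scheme();

    // One factory per backend, mirroring the four proposed in this issue.
    static S3 s3(String bucket, String region) { return new S3(bucket, region); }
    static Gcs gcs(String bucket) { return new Gcs(bucket); }
    static Azure azure(String account, String container) { return new Azure(account, container); }
    static Http http(String baseUrl) { return new Http(baseUrl); }

    static final class S3 extends StoreOptionsSketch {
        final String bucket;
        final String region;
        private S3(String bucket, String region) {
            this.bucket = java.util.Objects.requireNonNull(bucket, "bucket");
            this.region = region;
        }
        @Override String scheme() { return "s3"; }
    }

    static final class Gcs extends StoreOptionsSketch {
        final String bucket;
        private Gcs(String bucket) {
            this.bucket = java.util.Objects.requireNonNull(bucket, "bucket");
        }
        @Override String scheme() { return "gs"; }
    }

    static final class Azure extends StoreOptionsSketch {
        final String account;
        final String container;
        private Azure(String account, String container) {
            this.account = java.util.Objects.requireNonNull(account, "account");
            this.container = java.util.Objects.requireNonNull(container, "container");
        }
        @Override String scheme() { return "az"; }
    }

    static final class Http extends StoreOptionsSketch {
        final java.net.URI baseUrl;
        private Http(String baseUrl) { this.baseUrl = java.net.URI.create(baseUrl); }
        // The scheme follows the base URL, so both http and https are possible.
        @Override String scheme() { return baseUrl.getScheme(); }
    }
}
```

Because the set of subclasses is closed, the encoder that writes the options into the proto message can handle each backend explicitly and treat any other type as a programming error.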
The URL that DataFusion uses to look up the store is derived from the
options: for S3 it is `s3://<bucket>`, where `<bucket>` is the name passed to
`AmazonS3Builder::new().with_bucket_name(b)`. Callers
who want a non-default scheme (e.g. `s3a://`) can opt in via an explicit
`url(...)` setter.
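A rough sketch of that derivation rule follows. The class `S3UrlSketch` and its methods are hypothetical, and the matching logic assumes the store is looked up by scheme plus authority (the bucket), which is my reading of how the registry keys URLs rather than something this issue specifies.

```java
// Hypothetical sketch: illustrates the URL derivation rule only.
final class S3UrlSketch {
    private final String bucket;
    private final String urlOverride; // null means "derive from the bucket"

    S3UrlSketch(String bucket, String urlOverride) {
        this.bucket = java.util.Objects.requireNonNull(bucket, "bucket");
        this.urlOverride = urlOverride;
    }

    /** URL the store would be registered under. */
    String registrationUrl() {
        return urlOverride != null ? urlOverride : "s3://" + bucket;
    }

    /** True if a table path has the same scheme and authority as the registration URL. */
    boolean matches(String tablePath) {
        java.net.URI reg = java.net.URI.create(registrationUrl());
        java.net.URI path = java.net.URI.create(tablePath);
        return reg.getScheme().equalsIgnoreCase(path.getScheme())
            && reg.getAuthority().equals(path.getAuthority());
    }
}
```

With no override, `new S3UrlSketch("my-bucket", null)` registers under `s3://my-bucket` and matches `s3://my-bucket/orders/` but not `s3://other-bucket/audit/`. With an override of `s3a://my-bucket`, only `s3a://` paths match.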
### Describe alternatives you've considered
**A free-form `Map<String,String>` setter.** Easier on the API surface
but loses every type-safety / discoverability benefit. The four
`ObjectStoreOptions` factories are mostly mechanical — once one is
written, the rest follow the same shape — so the cost is small.
**Exposing a Java `ObjectStore` SPI** so callers can implement their own
backend in Java. Out of scope for v1: every `get`/`list`/`put`/`delete`
becomes a JNI upcall, and the request rate of those calls (one per
parquet footer, plus per row group) makes Java upcalls a serious hot
path. The right shape there is a separate issue once anyone reports a
real need; for now, embedders that want a custom backend have the same
options Rust users do (build their own `ObjectStore` impl in Rust and
ship a fork).
**Process-level singletons via `RuntimeEnv` builder.** Doesn't scale to
multi-tenant JVMs that want different credentials per context. The
proposed API already supports the singleton case (one builder, one
context, one shared registration) without forcing it.
### Additional context
The cloud backends are heavy dependencies, so the `datafusion-jni` crate
should expose them behind opt-in Cargo features:
```toml
[features]
default = []
object-store-aws = ["object_store/aws"]
object-store-gcp = ["object_store/gcp"]
object-store-azure = ["object_store/azure"]
object-store-http = ["object_store/http"]
```
The Java side always *compiles* the four `ObjectStoreOptions.*` classes.
Default `make test` builds with all four features enabled, so CI covers
them. A slimmer downstream build (just `object-store-aws`, say) is
supported; if a caller tries to register a backend whose feature is not
built in, the native side fails with a clear, explicit error from the JNI
layer.
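For illustration, the gating behaviour can be modelled on the Java side as below. This is only a model: the real check would live in the Rust JNI layer behind `#[cfg(feature = ...)]`, and the class name, exception type, and message text are all assumptions of mine.

```java
// Hypothetical Java-side model; the real check would live in the native layer.
final class BackendGateSketch {
    private final java.util.Set<String> compiledIn;

    BackendGateSketch(java.util.Collection<String> compiledInFeatures) {
        this.compiledIn = new java.util.HashSet<>(compiledInFeatures);
    }

    /** Throws if the given Cargo feature was not enabled in this (modelled) build. */
    void requireBackend(String feature) {
        if (!compiledIn.contains(feature)) {
            throw new UnsupportedOperationException(
                "object store backend requires Cargo feature '" + feature
                    + "', which is not compiled into this native library");
        }
    }
}
```

A build modelled with only `object-store-aws` accepts S3 registrations and rejects a GCS registration with a message that names `object-store-gcp`.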
This matches PR #60's pattern for the `avro` feature: the Cargo feature
is opt-in on `object_store`, but always enabled in our default build so
that Java callers can rely on backends being present without juggling
features through Maven.