LantaoJin opened a new issue, #48:
URL: https://github.com/apache/datafusion-java/issues/48
### Is your feature request related to a problem or challenge?
`SessionContextBuilder` (introduced in #28) exposes a typed setter for the
six most-common `SessionConfig` knobs: `batchSize`, `targetPartitions`,
`collectStatistics`, `informationSchema`, `memoryLimit`, and
`tempDirectory`.
The Rust `ConfigOptions` struct that backs `SessionConfig` carries roughly
200 keys split across seven sections (`datafusion.catalog.*`,
`datafusion.execution.*`, `datafusion.optimizer.*`, `datafusion.sql_parser.*`,
`datafusion.explain.*`, `datafusion.format.*`, plus user `extensions`). The
Java builder reaches none of these except the six it names explicitly.
A few representative gaps the *typed* approach would have to fill one at a
time, but a *generic* `setOption(key, value)` lights up in a single PR:
- `datafusion.execution.parquet.pushdown_filters` — decode-time predicate
pushdown via `RowFilter`. Workload-dependent default; many embedders flip this
per-context.
- `datafusion.execution.parquet.bloom_filter_on_read`,
`…parquet.reorder_filters`, `…parquet.schema_force_view_types`,
`…parquet.binary_as_string` — the rest of the parquet family.
- `datafusion.optimizer.prefer_hash_join`,
`…optimizer.default_filter_selectivity`, `…optimizer.repartition_joins`,
`…optimizer.expand_views_at_output` — optimizer dials almost every embedder
ends up tuning.
- `datafusion.execution.time_zone` — required for any timestamp arithmetic
that needs a non-UTC zone.
- `datafusion.format.timestamp_format`, `…format.date_format`,
`…format.binary_format` — affect every `DataFrame.show()` and any text
rendering.
- `datafusion.sql_parser.dialect`, `…sql_parser.enable_ident_normalization`
— needed to integrate with non-PostgreSQL frontends (MySQL, Snowflake, Hive, …).
- `datafusion.explain.show_statistics`, `…explain.show_sizes`,
`…explain.format` — needed by tooling/UI built on top of `EXPLAIN`.
- User `extensions` set via `Extensions::insert(...)` — currently
unreachable from Java at any granularity.
DataFusion already exposes a string-keyed setter for these, so the Java
binding can mirror it directly:
```rust
config.options_mut().set(key, value)?; // ConfigOptions::set
```
This issue tracks adding the matching Java surface.
### Describe the solution you'd like
One additional method on `SessionContextBuilder`:
```java
ctx.builder()
.batchSize(8192) // existing typed
.targetPartitions(16) // existing typed
.setOption("datafusion.execution.parquet.pushdown_filters", "true")
.setOption("datafusion.optimizer.prefer_hash_join", "true")
.setOption("datafusion.execution.time_zone", "UTC")
.build();
```
### Wire format
Purely additive — one new field on the existing `SessionOptions` proto:
```protobuf
message SessionOptions {
optional uint64 batch_size = 1;
// ... unchanged ...
optional string temp_directory = 6;
map<string, string> options = 7; // new
}
```
`map<string, string>` over `repeated KeyValue` because:
- The Java/Rust generated APIs are nicer (`putOptions` /
`HashMap<String,String>` rather than a wrapper struct).
- Each key is unique by definition, which is the constraint we want.
- Wire bytes are identical to a repeated KV under proto3.
The trade-off is that `map` has no defined iteration order on the Rust side.
`ConfigOptions::set` is order-independent, so this is fine — but it does mean a
Java-side `LinkedHashMap` will preserve caller order for serialization-order
tests, while reapplying on the Rust side may use a different order. Documenting
this is enough.
### Java surface
```java
public SessionContextBuilder setOption(String key, String value) {
if (key == null || value == null) {
throw new IllegalArgumentException(
"setOption key and value must be non-null");
}
this.options.put(key, value); // LinkedHashMap<String, String>
return this;
}
```
Stored into a `LinkedHashMap<String, String>` so caller-order is preserved
on the Java side; serialized via `b.putAllOptions(...)` into the proto.
### Rust surface
In `native/src/lib.rs::createSessionContextWithOptions`, after the
existing typed-field decoding:
```rust
let mut config = SessionConfig::new();
// ... existing typed setters unchanged ...
for (k, v) in &opts.options {
config.options_mut().set(k, v)?;
}
```
`ConfigOptions::set` returns a `DataFusionError` on unknown keys or
unparseable values, which `try_unwrap_or_throw` already surfaces as a
`RuntimeException` with the underlying message. Same error path as the rest of
the JNI crate.
### Out of scope (separate follow-ups)
- Read-time options like `CsvReadOptions.has_header` per-call, already
typed, not session-level.
- Compile-time validation of keys. Java has no way to know whether a key
string is valid in the pinned DataFusion version; we rely on the runtime error
from `ConfigOptions::set`. Same UX as the Rust string-keyed API.
- A typed `ConfigException` hierarchy. Today every JNI error collapses to
`RuntimeException`. Promoting these to typed exceptions is a separate
cross-cutting concern (see related work in `prs/CONTRIB_ISSUES.md` #12 in the
OpenSearch-sourced backlog).
- A read-side `getOption(key)` getter. Useful for tooling but orthogonal to
setting; can be added later without changing the setter shape.
### Describe alternatives you've considered
1. **One named setter per key**, e.g. `parquetPushdownFilters(boolean)`.
Originally drafted as the first cut. Rejected because it scales to ~200 setters
and forces a separate PR for every new dial DataFusion adds upstream. Named
setters for the *common* knobs (the six already in #28) are the right call; a
generic escape hatch covers the long tail.
2. **A bundled `parquetOptions(ParquetSessionOptions)` builder**
per-section. Decent ergonomics but loses everything outside parquet, and forces
us to re-design the surface every time DataFusion adds a section. The
`setOption(key, value)` shape tracks the upstream API exactly and needs no
follow-up when new keys are added.
3. **Map setter** `setOptions(Map<String, String>)`. Probably ships
alongside the single-entry setter as a convenience overload for callers loading
config from a properties file.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]