LantaoJin opened a new issue, #48:
URL: https://github.com/apache/datafusion-java/issues/48

   ### Is your feature request related to a problem or challenge?
   
   `SessionContextBuilder` (introduced in #28) exposes a typed setter for the 
six most-common `SessionConfig` knobs: `batchSize`, `targetPartitions`, 
`collectStatistics`, `informationSchema`, `memoryLimit`, and
   `tempDirectory`.
   
   The Rust `ConfigOptions` struct that backs `SessionConfig` carries roughly 
200 keys split across seven sections (`datafusion.catalog.*`, 
`datafusion.execution.*`, `datafusion.optimizer.*`, `datafusion.sql_parser.*`, 
`datafusion.explain.*`, `datafusion.format.*`, plus user `extensions`). The 
Java builder reaches none of these except the six it names explicitly.
   
   A few representative gaps the *typed* approach would have to fill one at a 
time, but a *generic* `setOption(key, value)` lights up in a single PR:
   
   - `datafusion.execution.parquet.pushdown_filters` — decode-time predicate 
pushdown via `RowFilter`. Workload-dependent default; many embedders flip this 
per-context.
   - `datafusion.execution.parquet.bloom_filter_on_read`, 
`…parquet.reorder_filters`, `…parquet.schema_force_view_types`,  
`…parquet.binary_as_string` — the rest of the parquet family.
   - `datafusion.optimizer.prefer_hash_join`,  
`…optimizer.default_filter_selectivity`, `…optimizer.repartition_joins`, 
`…optimizer.expand_views_at_output` — optimizer dials almost every embedder 
ends up tuning.
   - `datafusion.execution.time_zone` — required for any timestamp arithmetic 
that needs a non-UTC zone.
   - `datafusion.format.timestamp_format`, `…format.date_format`, 
`…format.binary_format` — affect every `DataFrame.show()` and any text 
rendering.
   - `datafusion.sql_parser.dialect`, `…sql_parser.enable_ident_normalization` 
— needed to integrate with non-PostgreSQL frontends (MySQL, Snowflake, Hive, …).
   - `datafusion.explain.show_statistics`, `…explain.show_sizes`, 
`…explain.format` — needed by tooling/UI built on top of `EXPLAIN`.
   - User `extensions` set via `Extensions::insert(...)` — currently 
unreachable from Java at any granularity.
   
   DataFusion already exposes a string-keyed setter for these, so the Java 
binding can mirror it directly:
   
   ```rust
   config.options_mut().set(key, value)?;   // ConfigOptions::set
   ```
   
   This issue tracks adding the matching Java surface.
   
   ### Describe the solution you'd like
   
   One additional method on `SessionContextBuilder`:
   
   ```java
   ctx.builder()
       .batchSize(8192)                                       // existing typed
       .targetPartitions(16)                                  // existing typed
       .setOption("datafusion.execution.parquet.pushdown_filters", "true")
       .setOption("datafusion.optimizer.prefer_hash_join", "true")
       .setOption("datafusion.execution.time_zone", "UTC")
       .build();
   ```
   
   ### Wire format
   
   Purely additive — one new field on the existing `SessionOptions` proto:
   
   ```protobuf
   message SessionOptions {
     optional uint64 batch_size = 1;
     // ... unchanged ...
     optional string temp_directory = 6;
     map<string, string> options = 7;     // new
   }
   ```
   
   `map<string, string>` over `repeated KeyValue` because:
   - The Java/Rust generated APIs are nicer (`putOptions` / 
`HashMap<String,String>` rather than a wrapper struct).
   - Each key is unique by definition, which is the constraint we want.
   - Wire bytes are identical to a repeated KV under proto3.
   
   The trade-off is that `map` has no defined iteration order on the Rust side. 
`ConfigOptions::set` is order-independent, so this is fine — but it does mean a 
Java-side `LinkedHashMap` will preserve caller order for serialization-order 
tests, while reapplying on the Rust side may use a different order. Documenting 
this is enough.
   
   ### Java surface
   
   ```java
   public SessionContextBuilder setOption(String key, String value) {
       if (key == null || value == null) {
           throw new IllegalArgumentException(
               "setOption key and value must be non-null");
       }
       this.options.put(key, value);   // LinkedHashMap<String, String>
       return this;
   }
   ```
   
   Stored into a `LinkedHashMap<String, String>` so caller-order is preserved 
on the Java side; serialized via `b.putAllOptions(...)` into the proto.
   
   ### Rust surface
   
   In `native/src/lib.rs::createSessionContextWithOptions`, after the
   existing typed-field decoding:
   
   ```rust
   let mut config = SessionConfig::new();
   // ... existing typed setters unchanged ...
   for (k, v) in &opts.options {
       config.options_mut().set(k, v)?;
   }
   ```
   
   `ConfigOptions::set` returns a `DataFusionError` on unknown keys or 
unparseable values, which `try_unwrap_or_throw` already surfaces as a 
`RuntimeException` with the underlying message. Same error path as the rest of 
the JNI crate.
   
   
   ### Out of scope (separate follow-ups)
   
   - Read-time options like `CsvReadOptions.has_header` per-call, already 
typed, not session-level.
   - Compile-time validation of keys. Java has no way to know whether a key 
string is valid in the pinned DataFusion version; we rely on the runtime error 
from `ConfigOptions::set`. Same UX as the Rust string-keyed API.
   - A typed `ConfigException` hierarchy. Today every JNI error collapses to 
`RuntimeException`. Promoting these to typed exceptions is a separate 
cross-cutting concern (see related work in `prs/CONTRIB_ISSUES.md` #12 in the 
OpenSearch-sourced backlog).
   - A read-side `getOption(key)` getter. Useful for tooling but orthogonal to 
setting; can be added later without changing the setter shape.
   
   ### Describe alternatives you've considered
   
   1. **One named setter per key**, e.g. `parquetPushdownFilters(boolean)`. 
Originally drafted as the first cut. Rejected because it scales to ~200 setters 
and forces a separate PR for every new dial DataFusion adds upstream. Named 
setters for the *common* knobs (the six already in #28) are the right call; a 
generic escape hatch covers the long tail.
   
   2. **A bundled `parquetOptions(ParquetSessionOptions)` builder** 
per-section. Decent ergonomics but loses everything outside parquet, and forces 
us to re-design the surface every time DataFusion adds a section. The 
`setOption(key, value)` shape tracks the upstream API exactly and needs no 
follow-up when new keys are added.
   
   3. **Map setter**  `setOptions(Map<String, String>)`. Probably ships 
alongside the single-entry setter as a convenience overload for callers loading 
config from a properties file.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to