andygrove opened a new pull request, #27:
URL: https://github.com/apache/datafusion-java/pull/27
## Which issue does this PR close?
- Closes #.
## Rationale for this change
DataFusion's `DataFrame::write_parquet` is the natural sink for transformed
data, but today the Java bindings have no way to write results back to Parquet
— `collect()` into JVM-side Arrow batches is the only option. This blocks
ETL-shaped workloads where read → transform → write needs to stay in native
code.
## What changes are included in this PR?
- New `org.apache.datafusion.ParquetWriteOptions` (fluent setters for
`compression` and `singleFileOutput`), shaped to mirror the existing
`ParquetReadOptions`.
- `DataFrame.writeParquet(String)` and `DataFrame.writeParquet(String,
ParquetWriteOptions)` overloads. Both retain the DataFrame (clone on the Rust
side), matching the existing `count()` / `show()` pattern; the receiver stays
usable and must still be closed.
- One new JNI function
`Java_org_apache_datafusion_DataFrame_writeParquetWithOptions` in
`native/src/lib.rs`. Compression strings (e.g. `"zstd(3)"`, `"snappy"`,
`"uncompressed"`) are passed verbatim to DataFusion; invalid values surface as
`RuntimeException` at write time.
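The fluent-setter shape described above can be sketched in plain Java. This is a hedged stand-in, not the PR's actual class: the real `ParquetWriteOptions` lives in `org.apache.datafusion`, and the default values and getter names used here are assumptions for illustration only.

```java
// Minimal sketch of a fluent-options class mirroring the shape described
// in the PR. Field defaults and accessors are assumptions, not taken from
// the actual org.apache.datafusion.ParquetWriteOptions.
public class ParquetWriteOptionsSketch {
    private String compression;              // e.g. "zstd(3)", "snappy", "uncompressed";
                                             // null = let DataFusion pick (assumption)
    private boolean singleFileOutput;        // assumed default: false (multi-file output)

    public ParquetWriteOptionsSketch compression(String compression) {
        this.compression = compression;
        return this;  // returning `this` is what makes the setters chainable
    }

    public ParquetWriteOptionsSketch singleFileOutput(boolean singleFileOutput) {
        this.singleFileOutput = singleFileOutput;
        return this;
    }

    public String compression() { return compression; }
    public boolean singleFileOutput() { return singleFileOutput; }

    public static void main(String[] args) {
        // Chained configuration, as a caller of the real API might write it.
        ParquetWriteOptionsSketch opts = new ParquetWriteOptionsSketch()
                .compression("zstd(3)")
                .singleFileOutput(true);
        System.out.println(opts.compression() + " " + opts.singleFileOutput());
    }
}
```

With the real API, the same chained expression would be passed to the two-argument `DataFrame.writeParquet(String, ParquetWriteOptions)` overload.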
Deliberately not included in this round:
- An `overwriteMode` knob / Java `InsertOp` enum — DataFusion 53.1.0's
`write_parquet` only implements `Append` (`Overwrite` and `Replace` raise `"not
implemented"`). Re-adding the knob once upstream support lands is a
non-breaking addition.
- `partition_by`, `sort_by`, row-group / page-size / dictionary / statistics
/ bloom-filter knobs — out of scope for the first cut.
- CSV / JSON / Avro write outputs and any shared `WriteOptions` base.
## Are these changes tested?
Yes. Two test files added:
- `ParquetWriteOptionsTest` — pure-Java unit tests for defaults and fluent
setter behavior.
- `DataFrameWriteParquetTest` — four integration tests guarded by
`Assumptions.assumeTrue(Files.exists(lineitem))` so they skip cleanly when
`tpch-data/` is absent:
- Round-trip row count via the no-options overload (multi-file output).
- `singleFileOutput(true)` produces a regular file at the supplied path.
- `compression("zstd(3)")` writes and reads back with row count preserved.
- DataFrame remains usable after `writeParquet` (pins the retain
semantics).
`make test` shows 42 tests, 0 failures. `cargo clippy --all-targets
--workspace -- -D warnings` is clean.
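The skip-when-data-is-absent guard used by the integration tests can be sketched without a JUnit dependency; the early return stands in for `Assumptions.assumeTrue`, and the exact file name under `tpch-data/` is an assumption here.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class GuardSketch {
    public static void main(String[] args) {
        // Stand-in for Assumptions.assumeTrue(Files.exists(lineitem)):
        // when the TPC-H data directory is absent, skip cleanly rather
        // than report a failure. The file name is hypothetical.
        Path lineitem = Path.of("tpch-data/lineitem");
        if (!Files.exists(lineitem)) {
            System.out.println("SKIPPED: tpch-data/ not present");
            return;
        }
        System.out.println("running integration test against " + lineitem);
    }
}
```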
## Are there any user-facing changes?
Yes — adds new public API (`DataFrame.writeParquet` overloads +
`ParquetWriteOptions`). No removals or breaking changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]