andygrove opened a new pull request, #27:
URL: https://github.com/apache/datafusion-java/pull/27
## Which issue does this PR close?
- Closes #.
## Rationale for this change
DataFusion's `DataFrame::write_parquet` is the natural sink for transformed
data, but today the Java bindings have no way to write results back to Parquet
— `collect()` into JVM-side Arrow batches is the only option. This blocks
ETL-shaped workloads where read → transform → write needs to stay in native
code.
## What changes are included in this PR?
- New `org.apache.datafusion.ParquetWriteOptions` (fluent setters for
`compression` and `singleFileOutput`), shaped to mirror the existing
`ParquetReadOptions`.
- `DataFrame.writeParquet(String)` and `DataFrame.writeParquet(String,
ParquetWriteOptions)` overloads. Both retain the DataFrame (clone on the Rust
side), matching the existing `count()` / `show()` pattern; the receiver stays
usable and must still be closed.
- One new JNI function
`Java_org_apache_datafusion_DataFrame_writeParquetWithOptions` in
`native/src/lib.rs`. Compression strings (e.g. `"zstd(3)"`, `"snappy"`,
`"uncompressed"`) are passed verbatim to DataFusion; invalid values surface as
`RuntimeException` at write time.
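The fluent-setter shape described above can be sketched in plain Java. This is a hedged stand-in, not the PR's actual class: the real `ParquetWriteOptions` lives in `org.apache.datafusion`, and the default values and getter names used here are assumptions for illustration only.

```java
// Minimal sketch of a fluent-options class mirroring the shape described
// in the PR. Field defaults and accessors are assumptions, not taken from
// the actual org.apache.datafusion.ParquetWriteOptions.
public class ParquetWriteOptionsSketch {
    private String compression;              // e.g. "zstd(3)", "snappy", "uncompressed";
                                             // null = let DataFusion pick (assumption)
    private boolean singleFileOutput;        // assumed default: false (multi-file output)

    public ParquetWriteOptionsSketch compression(String compression) {
        this.compression = compression;
        return this;  // returning `this` is what makes the setters chainable
    }

    public ParquetWriteOptionsSketch singleFileOutput(boolean singleFileOutput) {
        this.singleFileOutput = singleFileOutput;
        return this;
    }

    public String compression() { return compression; }
    public boolean singleFileOutput() { return singleFileOutput; }

    public static void main(String[] args) {
        // Chained configuration, as a caller of the real API might write it.
        ParquetWriteOptionsSketch opts = new ParquetWriteOptionsSketch()
                .compression("zstd(3)")
                .singleFileOutput(true);
        System.out.println(opts.compression() + " " + opts.singleFileOutput());
    }
}
```

With the real API, the same chained expression would be passed to the two-argument `DataFrame.writeParquet(String, ParquetWriteOptions)` overload.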
Deliberately not included in this round:
- An `overwriteMode` knob / Java `InsertOp` enum — DataFusion 53.1.0's
`write_parquet` only implements `Append` (`Overwrite` and `Replace` raise `"not
implemented"`). Re-adding the knob once upstream support lands is a
non-breaking addition.
- `partition_by`, `sort_by`, row-group / page-size / dictionary / statistics
/ bloom-filter knobs — out of scope for the first cut.
- CSV / JSON / Avro write outputs and any shared `WriteOptions` base.
## Are these changes tested?
Yes. Two test files added:
- `ParquetWriteOptionsTest` — pure-Java unit tests for defaults and fluent
setter behavior.
- `DataFrameWriteParquetTest` — four integration tests guarded by
`Assumptions.assumeTrue(Files.exists(lineitem))` so they skip cleanly when
`tpch-data/` is absent:
- Round-trip row count via the no-options overload (multi-file output).
- `singleFileOutput(true)` produces a regular file at the supplied path.
- `compression("zstd(3)")` writes and reads back with row count preserved.
- DataFrame remains usable after `writeParquet` (pins the retain
semantics).
`make test` shows 42 tests, 0 failures. `cargo clippy --all-targets
--workspace -- -D warnings` is clean.
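The skip-when-data-is-absent guard used by the integration tests can be sketched without a JUnit dependency; the early return stands in for `Assumptions.assumeTrue`, and the exact file name under `tpch-data/` is an assumption here.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class GuardSketch {
    public static void main(String[] args) {
        // Stand-in for Assumptions.assumeTrue(Files.exists(lineitem)):
        // when the TPC-H data directory is absent, skip cleanly rather
        // than report a failure. The file name is hypothetical.
        Path lineitem = Path.of("tpch-data/lineitem");
        if (!Files.exists(lineitem)) {
            System.out.println("SKIPPED: tpch-data/ not present");
            return;
        }
        System.out.println("running integration test against " + lineitem);
    }
}
```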
## Are there any user-facing changes?
Yes — adds new public API (`DataFrame.writeParquet` overloads +
`ParquetWriteOptions`). No removals or breaking changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]