LantaoJin opened a new pull request, #53:
URL: https://github.com/apache/datafusion-java/pull/53

   ## Which issue does this PR close?
   
   - Closes #38.
   
   ## Rationale for this change
   
   `DataFrame.writeParquet` (#27) lets Java callers materialize a query result, 
but CSV remains read-only. DataFusion supports `DataFrame::write_csv` upstream 
with the full writer-side option surface (delimiter, quote, escape, null token, 
compression, partitioning, single-file vs directory output). Issue #38 tracks 
exposing it on the Java side.
   
   The CSV write surface is wider than parquet's — six writer-side knobs plus 
`singleFileOutput` and `partitionCols` — so this PR uses the proto-over-JNI 
pattern (introduced in #29 and reused by all the read-side option classes) 
instead of the wide-JNI pattern `writeParquet` shipped with. Sending a single 
`byte[]` keeps the JNI signature stable as more knobs are added.
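
As a rough illustration of the signature-stability argument, here is a minimal sketch, not the PR's code. `OptionsEnvelopeSketch` and its methods are hypothetical names, and `DataOutputStream` stands in for the generated protobuf serializer so the example runs on the JDK alone. The real binding would declare the boundary method `native` and decode the bytes in Rust.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a proto-over-JNI boundary; not the PR's code.
class OptionsEnvelopeSketch {
    // Only knobs the caller set are recorded; an absent key means "use the default".
    private final Map<String, String> setFields = new LinkedHashMap<>();

    OptionsEnvelopeSketch set(String name, String value) {
        setFields.put(name, value);
        return this;
    }

    // Serialize to one byte[]. A real implementation would call the generated
    // protobuf message's toByteArray() instead of this ad-hoc encoding.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(setFields.size());
            for (Map.Entry<String, String> e : setFields.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            // In-memory streams do not fail; rethrow to keep the signature clean.
            throw new IllegalStateException(e);
        }
    }

    // Shape of the boundary: adding a knob never changes this signature.
    // Here it only reports how many bytes it received.
    static int writeCsvWithOptions(String path, byte[] optionsBytes) {
        return optionsBytes.length;
    }

    public static void main(String[] args) {
        byte[] none = new OptionsEnvelopeSketch().toBytes();
        byte[] some = new OptionsEnvelopeSketch().set("delimiter", ";").toBytes();
        System.out.println(none.length + " bytes with no knobs set");
        System.out.println(some.length + " bytes with one knob set");
    }
}
```

With the wide-JNI pattern, each new knob would add a parameter to the native method on both the Java and Rust sides; here only the payload grows.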
   
   ## What changes are included in this PR?
   
   - `proto/csv_write_options.proto` — new `CsvWriteOptionsProto` message. 
Fields are `optional` so unset values preserve DataFusion's defaults; 
`partition_cols` is `repeated` so the empty list round-trips unambiguously. 
`FileCompressionType` is reused from `csv_read_options.proto` because the codec 
set is identical between read and write at the upstream level. PR #47 promotes 
the enum to a shared `compression.proto`; once that lands, this PR's import 
changes by one line.
   - `CsvWriteOptions` Java builder mirroring the upstream `CsvOptions` 
writer-side API: `singleFileOutput`, `partitionCols`, `hasHeader`, `delimiter`, 
`quote`, `escape`, `nullValue`, `fileCompressionType`. All defaults are unset 
(null) so callers only pay for knobs they touch.
   - `DataFrame.writeCsv(String)` and `DataFrame.writeCsv(String, 
CsvWriteOptions)` overloads with up-front null-arg validation. The receiver 
remains usable after the call, matching `writeParquet`'s "retain after write" 
semantics.
   - `Java_org_apache_datafusion_DataFrame_writeCsvWithOptions` JNI handler in 
`native/src/csv.rs` (co-located with the read-side handlers since they share 
the proto-decode plumbing). Decodes the proto, builds `DataFrameWriteOptions` 
and an `Option<CsvOptions>`, then calls `DataFrame::write_csv`. 
`Option<CsvOptions>` is left as `None` when no writer knob is set so 
DataFusion's defaults apply.
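
The behavior described in the list above (null-by-default knobs, up-front null checks, and skipping `CsvOptions` when no writer knob is set) can be sketched as follows. This is a self-contained approximation under stated assumptions: `DataFrameSketch` and `CsvWriteOptionsSketch` are hypothetical stand-ins, the field types are guesses, and having the one-argument overload delegate to the two-argument one is an assumption rather than something the PR states.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for DataFrame showing the two overloads and null checks.
class DataFrameSketch {
    String lastCall; // records what would have been handed to the native side

    void writeCsv(String path) {
        // Assumption: the short overload delegates with all-default options.
        writeCsv(path, new CsvWriteOptionsSketch());
    }

    void writeCsv(String path, CsvWriteOptionsSketch options) {
        // Validate up front so a null never reaches native code.
        Objects.requireNonNull(path, "path");
        Objects.requireNonNull(options, "options");
        lastCall = path + (options.hasWriterKnobs() ? " [CsvOptions set]" : " [CsvOptions none]");
    }

    public static void main(String[] args) {
        DataFrameSketch df = new DataFrameSketch();
        df.writeCsv("/tmp/out");
        System.out.println(df.lastCall);
        df.writeCsv("/tmp/out", new CsvWriteOptionsSketch().singleFileOutput(true).delimiter(';'));
        System.out.println(df.lastCall);
    }
}

// Hypothetical stand-in for the PR's CsvWriteOptions; names and types are assumptions.
class CsvWriteOptionsSketch {
    // Destination-level settings (these would map to DataFrameWriteOptions).
    Boolean singleFileOutput; // null = unset
    List<String> partitionCols = Collections.emptyList();

    // Writer-side knobs (these would map to CsvOptions); null = keep the default.
    Boolean hasHeader;
    Character delimiter;
    Character quote;
    Character escape;
    String nullValue;
    String fileCompressionType;

    CsvWriteOptionsSketch singleFileOutput(boolean b) { this.singleFileOutput = b; return this; }
    CsvWriteOptionsSketch hasHeader(boolean b) { this.hasHeader = b; return this; }
    CsvWriteOptionsSketch delimiter(char c) { this.delimiter = c; return this; }
    CsvWriteOptionsSketch partitionCols(String... cols) {
        this.partitionCols = Arrays.asList(cols);
        return this;
    }

    // True when at least one writer knob is set; otherwise the native side
    // could leave the CsvOptions argument as None.
    boolean hasWriterKnobs() {
        return hasHeader != null || delimiter != null || quote != null
            || escape != null || nullValue != null || fileCompressionType != null;
    }
}
```

Note that setting only `singleFileOutput` or `partitionCols` leaves `hasWriterKnobs()` false in this sketch, since those two settings travel separately from the six writer-side knobs.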
   
   Out of scope (for follow-ups):
   - Other writer fields exposed by upstream `CsvOptions` but not in #38's 
checklist: `terminator`, `doubleQuote`, `dateFormat`, `datetimeFormat`, 
`timestampFormat`, `timestampTzFormat`, `timeFormat`, `truncatedRows`. These 
are an easy follow-up: same proto, just add fields.
   - `compressionLevel`: separate from the compression codec; the upstream 
`CsvOptions` exposes `with_compression_level` but the issue doesn't list it.
   
   ## Are these changes tested?
   
   Yes, 11 new tests across `CsvWriteOptionsTest` and `DataFrameWriteCsvTest`.
   
   ## Are there any user-facing changes?
   
   Yes, purely additive. New public API:
   
   - `org.apache.datafusion.CsvWriteOptions`
   - `DataFrame.writeCsv(String)`
   - `DataFrame.writeCsv(String, CsvWriteOptions)`
   
   The new `org.apache.datafusion.protobuf.CsvWriteOptionsProto` generated 
class is also exposed via the protobuf-Java output, consistent with how the 
read-side option protos are exposed. No API removals, no deprecations, no 
behavior change for existing callers.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

