LantaoJin opened a new pull request, #52: URL: https://github.com/apache/datafusion-java/pull/52
## Which issue does this PR close?

- Closes #37.

## Rationale for this change

DataFusion 53.x supports Arrow IPC files via `SessionContext::register_arrow` / `read_arrow`, but the Java bindings only expose Parquet, CSV, and (in #47) NDJSON. Since JVM results already come back as Arrow batches via the C Data Interface, an Arrow IPC reader on the Java side closes the natural round trip: Java callers can write Arrow IPC to disk with arrow-vector's `ArrowFileWriter`, then read it back through DataFusion without going through Parquet or any other intermediate format. Today they have to fall back to `CREATE EXTERNAL TABLE … STORED AS ARROW` via SQL, which works but bypasses the typed builder pattern.

This PR is the Java surface for the existing upstream functionality. Issue #37 tracks it; the implementation follows the same proto-over-JNI pattern as #47 (NDJSON), #29 (the CSV/Parquet refactor), and the merged CSV/Parquet readers.

## What changes are included in this PR?

- `proto/arrow_read_options.proto` — new `ArrowReadOptionsProto` message with a single field, `file_extension` (default `.arrow`). An explicit Arrow schema rides on the existing IPC byte channel through the JNI layer, mirroring the Parquet/CSV/JSON paths, and is therefore not encoded in this message. There is no `FileCompressionType` field: Arrow IPC files carry body compression (LZ4_FRAME / ZSTD, per buffer) inside the file format itself.
- `ArrowReadOptions` Java builder with `fileExtension(String)` and `schema(Schema)` setters.
- `SessionContext.registerArrow(name, path[, options])` and `readArrow(path[, options])` overloads, structurally identical to the Parquet/CSV/JSON entry points. All validate null arguments up front, applying Andy's #47 review feedback proactively so reviewers don't have to flag the same pattern again.
- `native/src/arrow.rs` — JNI module that decodes `ArrowReadOptionsProto`, constructs the upstream `ArrowReadOptions`, and forwards to `register_arrow` / `read_arrow`.
  It imports `ArrowReadOptions` from `datafusion::execution::options` rather than the `prelude` (it is not re-exported there — the same situation as `JsonReadOptions`).

Out of scope (for follow-ups):

- `tablePartitionCols` — neither Parquet, CSV, nor NDJSON exposes Hive-style partitioning on the Java side yet; adding it for Arrow only would diverge.

## Are these changes tested?

Yes, 9 new tests across `ArrowReadOptionsTest` and `SessionContextArrowTest`.

## Are there any user-facing changes?

Yes, purely additive. New public API:

- `org.apache.datafusion.ArrowReadOptions`
- `SessionContext.registerArrow(String, String)`
- `SessionContext.registerArrow(String, String, ArrowReadOptions)`
- `SessionContext.readArrow(String)` → `DataFrame`
- `SessionContext.readArrow(String, ArrowReadOptions)` → `DataFrame`

The generated `org.apache.datafusion.protobuf.ArrowReadOptionsProto` class is also exposed via the protobuf-Java output, consistent with how `CsvReadOptionsProto`, `NdJsonReadOptionsProto`, and `ParquetReadOptionsProto` are exposed. No API removals, no deprecations, no behavior change for existing callers.
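For reviewers skimming without the diff, the new message described in the PR presumably looks something like the following sketch. The field number and the proto3 syntax line are assumptions; the single `file_extension` field and the deliberate absence of schema and compression fields are taken from the description.

```proto
// proto/arrow_read_options.proto -- sketch; field number assumed.
syntax = "proto3";

message ArrowReadOptionsProto {
  // File extension used to match Arrow IPC files (the bindings fall
  // back to ".arrow" when unset). The Arrow schema travels over the
  // existing IPC byte channel, and compression lives inside the Arrow
  // IPC file format itself, so neither appears here.
  string file_extension = 1;
}
```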
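To make the builder semantics concrete, here is a self-contained sketch of the behavior the description implies. This is an illustrative stand-in, not the PR's code: the class name `ArrowReadOptionsSketch`, the `Builder` shape, and the `.ipc` override value are all assumptions; only the `fileExtension(String)` setter name and the `.arrow` default come from the PR text.

```java
// Hypothetical mirror of the ArrowReadOptions builder described in the PR.
// Real code lives in org.apache.datafusion.ArrowReadOptions; this sketch
// only demonstrates the default-extension behavior.
public class ArrowReadOptionsSketch {
    private final String fileExtension;

    private ArrowReadOptionsSketch(String fileExtension) {
        this.fileExtension = fileExtension;
    }

    public static Builder builder() {
        return new Builder();
    }

    public String fileExtension() {
        return fileExtension;
    }

    public static final class Builder {
        // ".arrow" default, as stated in the PR description.
        private String fileExtension = ".arrow";

        public Builder fileExtension(String fileExtension) {
            if (fileExtension == null) {
                // Mirrors the up-front null validation the PR applies.
                throw new NullPointerException("fileExtension must not be null");
            }
            this.fileExtension = fileExtension;
            return this;
        }

        public ArrowReadOptionsSketch build() {
            return new ArrowReadOptionsSketch(fileExtension);
        }
    }

    public static void main(String[] args) {
        System.out.println(builder().build().fileExtension());               // .arrow
        System.out.println(builder().fileExtension(".ipc").build().fileExtension()); // .ipc
    }
}
```

In the API this PR adds, such an options object would be handed to the new entry points, e.g. `ctx.readArrow(path, options)` or `ctx.registerArrow(name, path, options)`.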
