LantaoJin opened a new pull request, #52: URL: https://github.com/apache/datafusion-java/pull/52
## Which issue does this PR close?

- Closes #37.

## Rationale for this change

DataFusion 53.x supports Arrow IPC files via `SessionContext::register_arrow` / `read_arrow`, but the Java bindings only expose Parquet, CSV, and (in #47) NDJSON. Since JVM results already come back as Arrow batches via the C Data Interface, an Arrow IPC reader on the Java side closes the natural round trip: Java callers can write Arrow IPC to disk with arrow-vector's `ArrowFileWriter`, then read it back through DataFusion without going through Parquet or any other intermediate format. Today they have to fall back to `CREATE EXTERNAL TABLE … STORED AS ARROW` via SQL, which works but bypasses the typed builder pattern.

This PR is the Java surface for the existing upstream functionality. Issue #37 tracks it; the implementation follows the same proto-over-JNI pattern as #47 (NDJSON), #29 (the CSV/Parquet refactor), and the merged CSV/Parquet readers.

## What changes are included in this PR?

- `proto/arrow_read_options.proto` — new `ArrowReadOptionsProto` message with a single field, `file_extension` (default `.arrow`). An explicit Arrow schema rides on the existing IPC byte channel through the JNI layer, mirroring the Parquet/CSV/JSON paths, and is therefore not encoded in this message. There is no `FileCompressionType` field: Arrow IPC files carry body compression (LZ4_FRAME / ZSTD, per buffer) inside the file format itself.
- `ArrowReadOptions` Java builder with `fileExtension(String)` and `schema(Schema)` setters.
- `SessionContext.registerArrow(name, path[, options])` and `readArrow(path[, options])` overloads, structurally identical to the Parquet/CSV/JSON entry points. All validate null arguments up front, applying Andy's #47 review feedback proactively so reviewers don't have to flag the same pattern again.
- `native/src/arrow.rs` — JNI module that decodes `ArrowReadOptionsProto`, constructs the upstream `ArrowReadOptions`, and forwards to `register_arrow` / `read_arrow`.
  It imports `ArrowReadOptions` from `datafusion::execution::options` rather than the `prelude` (it is not re-exported there — the same situation as `JsonReadOptions`).

Out of scope (for follow-ups):

- `tablePartitionCols` — neither Parquet, CSV, nor NDJSON exposes Hive-style partitioning on the Java side yet; adding it for Arrow only would diverge.

## Are these changes tested?

Yes, 9 new tests across `ArrowReadOptionsTest` and `SessionContextArrowTest`.

## Are there any user-facing changes?

Yes, purely additive. New public API:

- `org.apache.datafusion.ArrowReadOptions`
- `SessionContext.registerArrow(String, String)`
- `SessionContext.registerArrow(String, String, ArrowReadOptions)`
- `SessionContext.readArrow(String)` → `DataFrame`
- `SessionContext.readArrow(String, ArrowReadOptions)` → `DataFrame`

The generated `org.apache.datafusion.protobuf.ArrowReadOptionsProto` class is also exposed via the protobuf-Java output, consistent with how `CsvReadOptionsProto`, `NdJsonReadOptionsProto`, and `ParquetReadOptionsProto` are exposed. No API removals, no deprecations, no behavior change for existing callers.
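For reviewers skimming without the diff, the new message described in the PR presumably looks something like the following sketch. The field number and the proto3 syntax line are assumptions; the single `file_extension` field and the deliberate absence of schema and compression fields are taken from the description.

```proto
// proto/arrow_read_options.proto -- sketch; field number assumed.
syntax = "proto3";

message ArrowReadOptionsProto {
  // File extension used to match Arrow IPC files (the bindings fall
  // back to ".arrow" when unset). The Arrow schema travels over the
  // existing IPC byte channel, and compression lives inside the Arrow
  // IPC file format itself, so neither appears here.
  string file_extension = 1;
}
```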
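To make the builder semantics concrete, here is a self-contained sketch of the behavior the description implies. This is an illustrative stand-in, not the PR's code: the class name `ArrowReadOptionsSketch`, the `Builder` shape, and the `.ipc` override value are all assumptions; only the `fileExtension(String)` setter name and the `.arrow` default come from the PR text.

```java
// Hypothetical mirror of the ArrowReadOptions builder described in the PR.
// Real code lives in org.apache.datafusion.ArrowReadOptions; this sketch
// only demonstrates the default-extension behavior.
public class ArrowReadOptionsSketch {
    private final String fileExtension;

    private ArrowReadOptionsSketch(String fileExtension) {
        this.fileExtension = fileExtension;
    }

    public static Builder builder() {
        return new Builder();
    }

    public String fileExtension() {
        return fileExtension;
    }

    public static final class Builder {
        // ".arrow" default, as stated in the PR description.
        private String fileExtension = ".arrow";

        public Builder fileExtension(String fileExtension) {
            if (fileExtension == null) {
                // Mirrors the up-front null validation the PR applies.
                throw new NullPointerException("fileExtension must not be null");
            }
            this.fileExtension = fileExtension;
            return this;
        }

        public ArrowReadOptionsSketch build() {
            return new ArrowReadOptionsSketch(fileExtension);
        }
    }

    public static void main(String[] args) {
        System.out.println(builder().build().fileExtension());               // .arrow
        System.out.println(builder().fileExtension(".ipc").build().fileExtension()); // .ipc
    }
}
```

In the API this PR adds, such an options object would be handed to the new entry points, e.g. `ctx.readArrow(path, options)` or `ctx.registerArrow(name, path, options)`.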
