LantaoJin opened a new pull request, #47:
URL: https://github.com/apache/datafusion-java/pull/47
## Which issue does this PR close?
Closes #35
## Rationale for this change
DataFusion 53.x supports newline-delimited JSON via
`SessionContext::read_json` / `register_json`, but the Java bindings only
expose Parquet and CSV readers today. Users with NDJSON input have to fall back
to `CREATE EXTERNAL TABLE … STORED AS JSON` through `SessionContext.sql`, which
works but loses the typed-builder ergonomics the Parquet/CSV bindings already
provide. Issue #35 tracks closing that gap; this PR is the implementation.
## What changes are included in this PR?
- `proto/json_read_options.proto` — new `NdJsonReadOptionsProto` message.
Reuses `FileCompressionType` from `csv_read_options.proto` (CSV and JSON accept
the same compression set in DataFusion).
- `NdJsonReadOptions` Java builder with `fileExtension`,
`fileCompressionType`, `schemaInferMaxRecords`, and an explicit Arrow
`schema(Schema)`. Defaults match the Rust struct (`.json`, `UNCOMPRESSED`,
infer from data).
- `SessionContext.registerJson(name, path[, options])` and `readJson(path[,
options])` overloads, structurally identical to the Parquet/CSV entry points
(Java builds the proto, JNI hands a `byte[]` to native).
- `native/src/json.rs` — JNI module that decodes `NdJsonReadOptionsProto`,
constructs the upstream `JsonReadOptions`, and forwards to `register_json` /
`read_json`. Imports `prelude::JsonReadOptions` rather than the deprecated
`NdJsonReadOptions` alias; the user-facing Java/proto name still matches the
issue ask.
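To make the builder surface concrete, the sketch below compiles on its own and mirrors the option shape and defaults described above. It is not the shipped class: the stand-in enum, the nested `Builder`, and the 1000-record inference cap are assumptions (the PR only states `.json`, `UNCOMPRESSED`, and infer-from-data), and the real `NdJsonReadOptions` additionally holds an Arrow `Schema` and serializes to `NdJsonReadOptionsProto`.

```java
// Standalone sketch of the NdJsonReadOptions builder shape; stand-in types
// let it compile without datafusion-java or Arrow on the classpath.
class NdJsonReadOptionsSketch {
    // Stand-in for the FileCompressionType enum reused from the CSV proto.
    enum FileCompressionType { UNCOMPRESSED, GZIP, BZIP2, XZ, ZSTD }

    private final String fileExtension;
    private final FileCompressionType fileCompressionType;
    private final long schemaInferMaxRecords;

    private NdJsonReadOptionsSketch(Builder b) {
        this.fileExtension = b.fileExtension;
        this.fileCompressionType = b.fileCompressionType;
        this.schemaInferMaxRecords = b.schemaInferMaxRecords;
    }

    static Builder builder() { return new Builder(); }

    String fileExtension() { return fileExtension; }
    FileCompressionType fileCompressionType() { return fileCompressionType; }
    long schemaInferMaxRecords() { return schemaInferMaxRecords; }

    static final class Builder {
        // Defaults per the PR description: ".json", UNCOMPRESSED, infer from
        // data. The 1000-record cap is an assumed stand-in for "infer".
        private String fileExtension = ".json";
        private FileCompressionType fileCompressionType = FileCompressionType.UNCOMPRESSED;
        private long schemaInferMaxRecords = 1000;

        Builder fileExtension(String ext) { fileExtension = ext; return this; }
        Builder fileCompressionType(FileCompressionType t) { fileCompressionType = t; return this; }
        Builder schemaInferMaxRecords(long n) { schemaInferMaxRecords = n; return this; }
        NdJsonReadOptionsSketch build() { return new NdJsonReadOptionsSketch(this); }
    }

    public static void main(String[] args) {
        NdJsonReadOptionsSketch defaults = builder().build();
        System.out.println(defaults.fileExtension()); // prints ".json"
        NdJsonReadOptionsSketch gz = builder()
                .fileExtension(".ndjson")
                .fileCompressionType(FileCompressionType.GZIP)
                .build();
        System.out.println(gz.fileCompressionType()); // prints "GZIP"
    }
}
```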
Out of scope (kept for follow-ups so each PR stays small):
- `tablePartitionCols`, `fileSortOrder` — neither Parquet nor CSV exposes
these in the Java surface today; adding them only for JSON would diverge.
- `newline_delimited` — DataFusion 53.x exposes the knob, but the JSON-array
reader path is not yet stable upstream. Both the issue title and the Rust API
name (`NdJson`) imply newline-delimited.
- AVRO source — separate issue.
## Are these changes tested?
Yes.
- `NdJsonReadOptionsTest` (4 tests):
- defaults round-trip through proto,
- fully-configured options round-trip through proto,
- `schema(Schema)` is held by reference and not embedded in proto bytes,
- sweep over every `FileCompressionType` variant.
- `SessionContextJsonTest` (3 tests):
- `registerJson` + SQL `COUNT(*)` and projection on an inferred-schema
NDJSON file,
- `readJson` with an explicit Arrow schema,
- `registerJson` with a custom `.ndjson` file extension.
- `make test` is green: 68 tests, 0 failures, 0 errors. The 12 skipped
cases are pre-existing parquet/TPC-H data-dependent tests unaffected
by this PR.
- `cargo clippy --all-targets -- -D warnings`, `cargo fmt -- --check`,
and `./mvnw spotless:apply` are all clean.
## Are there any user-facing changes?
Yes — purely additive. New public API:
- `org.apache.datafusion.NdJsonReadOptions`
- `SessionContext.registerJson(String, String)`
- `SessionContext.registerJson(String, String, NdJsonReadOptions)`
- `SessionContext.readJson(String) → DataFrame`
- `SessionContext.readJson(String, NdJsonReadOptions) → DataFrame`
No existing API changes; no deprecations.
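For orientation, here is how the four overloads line up at the call site. Everything in this snippet is a local stub so it compiles standalone: the `SessionContext`, `DataFrame`, and `NdJsonReadOptions` classes below only echo the method shapes listed above and are not the real `org.apache.datafusion` bindings.

```java
// Sketch of the new overloads' call shapes. All types are local stubs so the
// snippet is self-contained; the real signatures live in org.apache.datafusion.
class JsonApiShapes {
    static class DataFrame {}
    static class NdJsonReadOptions {}

    static class SessionContext {
        // Register an NDJSON file under a table name for SQL access.
        void registerJson(String name, String path) {}
        void registerJson(String name, String path, NdJsonReadOptions options) {}
        // Read an NDJSON file straight into a DataFrame.
        DataFrame readJson(String path) { return new DataFrame(); }
        DataFrame readJson(String path, NdJsonReadOptions options) { return new DataFrame(); }
    }

    public static void main(String[] args) {
        SessionContext ctx = new SessionContext();
        ctx.registerJson("events", "data/events.json");
        DataFrame df = ctx.readJson("data/events.json", new NdJsonReadOptions());
        System.out.println(df != null); // prints "true"
    }
}
```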
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]