LantaoJin opened a new pull request, #51:
URL: https://github.com/apache/datafusion-java/pull/51
## Which issue does this PR close?

- Closes #50.

## Rationale for this change

`DataFrame.collect(allocator)` is the only way to retrieve query results today. Internally, the native handler calls `df.collect().await`, which materializes every batch into a `Vec<RecordBatch>` on the Rust heap *before* the first batch crosses the FFI boundary. For TB-scale or unbounded result sets, this OOMs the Rust side regardless of how memory accounting is configured downstream.

DataFusion's own `SessionContext::execute_stream` already returns a `SendableRecordBatchStream` that yields one batch at a time. The Java binding just needs to wire that to the existing `FFI_ArrowArrayStream` path so that each `loadNextBatch()` from Java drives one `stream.next()` on the Rust side.

## What changes are included in this PR?

- `DataFrame.executeStream(BufferAllocator) → ArrowReader` — a peer to `collect`. Same lifecycle (consumes the DataFrame, the caller closes the returned reader, and the allocator must outlive it) and the same return type, so existing reader-driven code paths work unchanged.
- `native/src/lib.rs` — a new `Java_org_apache_datafusion_DataFrame_executeStreamDataFrame` JNI handler plus a small `StreamingReader` adapter that bridges DataFusion's async `SendableRecordBatchStream` to Arrow's synchronous `RecordBatchReader` interface. Each `next()` drives one `runtime().block_on(stream.next())`.
- `native/Cargo.toml` — adds `futures = "0.3"` for `StreamExt::next`.
- Updated `collect`'s Javadoc to point at `executeStream` for analytics-scale queries. `collect` itself is unchanged.
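The memory difference between the two paths can be sketched with plain Java, independent of the binding. The names below (`collectAll`, `stream`, the `int[]` batches) are illustrative stand-ins, not the binding's real types: the eager path materializes every batch before the caller sees the first one, while the streaming path produces exactly one batch per `next()` call — the same contract as `loadNextBatch()` driving one `stream.next()` over FFI.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntFunction;

// Illustrative sketch only: int[] stands in for a RecordBatch, and the
// producer function stands in for DataFusion's SendableRecordBatchStream.
public class PullVsCollect {
    static int batchesProduced = 0;

    // Eager path (collect-style): every batch is materialized up front,
    // so the entire result set lives in memory at once.
    static List<int[]> collectAll(int totalBatches, IntFunction<int[]> producer) {
        List<int[]> all = new ArrayList<>();
        for (int i = 0; i < totalBatches; i++) {
            all.add(producer.apply(i));
        }
        return all;
    }

    // Streaming path (executeStream-style): each next() drives exactly one
    // pull on the producer; only the current batch is live.
    static Iterator<int[]> stream(int totalBatches, IntFunction<int[]> producer) {
        return new Iterator<int[]>() {
            int i = 0;
            public boolean hasNext() { return i < totalBatches; }
            public int[] next() { return producer.apply(i++); }
        };
    }

    public static void main(String[] args) {
        IntFunction<int[]> producer = i -> { batchesProduced++; return new int[]{i}; };

        Iterator<int[]> it = stream(1_000, producer);
        it.next();
        it.next();
        System.out.println(batchesProduced); // prints 2: only two batches exist so far

        batchesProduced = 0;
        collectAll(1_000, producer);
        System.out.println(batchesProduced); // prints 1000: all batches were materialized
    }
}
```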
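The lifecycle contract above (the caller closes the returned reader, and the allocator must outlive it) maps onto ordinary try-with-resources nesting. A minimal sketch with stand-in types — `Resource` here is hypothetical, not an Arrow class — shows the close ordering that nesting guarantees:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in types to illustrate close ordering; a real BufferAllocator and
// ArrowReader would follow the same nesting.
public class LifecycleSketch {
    static final Deque<String> closed = new ArrayDeque<>();

    record Resource(String name) implements AutoCloseable {
        public void close() { closed.add(name); }
    }

    public static void main(String[] args) {
        // try-with-resources closes in reverse declaration order, so the
        // reader is closed before the allocator backing its buffers.
        try (Resource allocator = new Resource("allocator");
             Resource reader = new Resource("reader")) {
            // ... drive the reader here ...
        }
        System.out.println(closed); // prints [reader, allocator]
    }
}
```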
The two methods coexist: `collect` remains the right call for small results that benefit from a single owning buffer, while `executeStream` is the right call for unbounded or very large result sets.

## Are these changes tested?

Yes — 6 new tests in `DataFrameExecuteStreamTest`.

## Are there any user-facing changes?

Yes — purely additive. New public API:

- `DataFrame.executeStream(BufferAllocator) → ArrowReader`

`collect`'s behavior is unchanged; only its Javadoc gains a forward pointer to `executeStream`. No deprecations, no API removals.

The native crate now declares `futures = "0.3"` as a direct build dependency for `StreamExt`. It was already pulled in transitively by `tokio` and `datafusion`, so this just makes the dependency explicit on the import path.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
