LantaoJin opened a new pull request, #51:
URL: https://github.com/apache/datafusion-java/pull/51
## Which issue does this PR close?

- Closes #50.

## Rationale for this change

`DataFrame.collect(allocator)` is the only way to retrieve query results today. Internally, the native handler calls `df.collect().await`, which materializes every batch into a `Vec<RecordBatch>` on the Rust heap *before* the first batch crosses the FFI boundary. For TB-scale or unbounded result sets, this OOMs the Rust side regardless of how memory accounting is configured downstream.

DataFusion's own `SessionContext::execute_stream` already returns a `SendableRecordBatchStream` that yields one batch at a time. The Java binding just needs to wire that to the existing `FFI_ArrowArrayStream` path so that each `loadNextBatch()` from Java drives one `stream.next()` on the Rust side.

## What changes are included in this PR?

- `DataFrame.executeStream(BufferAllocator) → ArrowReader` — a peer to `collect`. Same lifecycle (consumes the DataFrame, the caller closes the returned reader, and the allocator must outlive it) and the same return type, so existing reader-driven code paths work unchanged.
- `native/src/lib.rs` — a new `Java_org_apache_datafusion_DataFrame_executeStreamDataFrame` JNI handler plus a small `StreamingReader` adapter that bridges DataFusion's async `SendableRecordBatchStream` to Arrow's synchronous `RecordBatchReader` interface. Each `next()` drives one `runtime().block_on(stream.next())`.
- `native/Cargo.toml` — adds `futures = "0.3"` for `StreamExt::next`.
- Updated `collect`'s Javadoc to point at `executeStream` for analytics-scale queries. `collect` itself is unchanged.
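The memory difference between the two paths can be sketched with plain Java, independent of the binding. The names below (`collectAll`, `stream`, the `int[]` batches) are illustrative stand-ins, not the binding's real types: the eager path materializes every batch before the caller sees the first one, while the streaming path produces exactly one batch per `next()` call — the same contract as `loadNextBatch()` driving one `stream.next()` over FFI.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntFunction;

// Illustrative sketch only: int[] stands in for a RecordBatch, and the
// producer function stands in for DataFusion's SendableRecordBatchStream.
public class PullVsCollect {
    static int batchesProduced = 0;

    // Eager path (collect-style): every batch is materialized up front,
    // so the entire result set lives in memory at once.
    static List<int[]> collectAll(int totalBatches, IntFunction<int[]> producer) {
        List<int[]> all = new ArrayList<>();
        for (int i = 0; i < totalBatches; i++) {
            all.add(producer.apply(i));
        }
        return all;
    }

    // Streaming path (executeStream-style): each next() drives exactly one
    // pull on the producer; only the current batch is live.
    static Iterator<int[]> stream(int totalBatches, IntFunction<int[]> producer) {
        return new Iterator<int[]>() {
            int i = 0;
            public boolean hasNext() { return i < totalBatches; }
            public int[] next() { return producer.apply(i++); }
        };
    }

    public static void main(String[] args) {
        IntFunction<int[]> producer = i -> { batchesProduced++; return new int[]{i}; };

        Iterator<int[]> it = stream(1_000, producer);
        it.next();
        it.next();
        System.out.println(batchesProduced); // prints 2: only two batches exist so far

        batchesProduced = 0;
        collectAll(1_000, producer);
        System.out.println(batchesProduced); // prints 1000: all batches were materialized
    }
}
```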
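The lifecycle contract above (the caller closes the returned reader, and the allocator must outlive it) maps onto ordinary try-with-resources nesting. A minimal sketch with stand-in types — `Resource` here is hypothetical, not an Arrow class — shows the close ordering that nesting guarantees:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in types to illustrate close ordering; a real BufferAllocator and
// ArrowReader would follow the same nesting.
public class LifecycleSketch {
    static final Deque<String> closed = new ArrayDeque<>();

    record Resource(String name) implements AutoCloseable {
        public void close() { closed.add(name); }
    }

    public static void main(String[] args) {
        // try-with-resources closes in reverse declaration order, so the
        // reader is closed before the allocator backing its buffers.
        try (Resource allocator = new Resource("allocator");
             Resource reader = new Resource("reader")) {
            // ... drive the reader here ...
        }
        System.out.println(closed); // prints [reader, allocator]
    }
}
```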
The two methods coexist: `collect` remains the right call for small results that benefit from a single owning buffer, while `executeStream` is the right call for unbounded or very large result sets.

## Are these changes tested?

Yes — 6 new tests in `DataFrameExecuteStreamTest`.

## Are there any user-facing changes?

Yes — purely additive. New public API:

- `DataFrame.executeStream(BufferAllocator) → ArrowReader`

`collect`'s behavior is unchanged; only its Javadoc gains a forward pointer to `executeStream`. No deprecations, no API removals.

The native crate now declares `futures = "0.3"` as a direct build dependency for `StreamExt`. It was already pulled in transitively by `tokio` and `datafusion`, so this just makes the dependency explicit on the import path.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
