andygrove commented on PR #1389: URL: https://github.com/apache/datafusion-ballista/pull/1389#issuecomment-3765473575
Thanks for the review @milenkovicm! I've addressed the feedback in the following commits: ## Addressed ### 1. Return stream instead of `Vec<RecordBatch>` (d4fcf1a7) Added `stream_sort_shuffle_partition()` that returns a `SendableRecordBatchStream` instead of collecting all batches into a vector. This reads batches lazily on-demand, reducing memory usage for large partitions. ### 2. Documentation improvements (0445c48c) - Added the Spark algorithm description: *"Internally, results from individual map tasks are kept in memory until they can't fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks."* - Fixed wording: "that writes" → "It writes" ### 3. Flight service support for remote reads (53b5d571) - Updated `do_get` to detect sort-based shuffle files and read only the requested partition using the index - Added detection in `IO_BLOCK_TRANSPORT` to return an informative error (block transport can't support sort-based shuffle since it transfers the entire file containing all partitions) ### 4. Parameterized tests with rstest (a297d475) Added remote flight read tests using `rstest` to parameterize all 15 sort shuffle tests. Tests now run in two modes: - `Local`: reads shuffle data from local filesystem - `RemoteFlight`: forces remote read through flight service with `BALLISTA_SHUFFLE_READER_FORCE_REMOTE_READ=true` and `BALLISTA_SHUFFLE_READER_REMOTE_PREFER_FLIGHT=true` All 32 tests pass (15 tests × 2 modes + 2 comparison tests). ## Notes - **Streaming for `spill.rs`**: Evaluated but deferred - batches flow through quickly during finalization so memory impact is less critical - **Block transport**: Returns `Status::unimplemented` for sort-based shuffle with a message directing users to enable flight transport -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
