andygrove opened a new pull request, #1615:
URL: https://github.com/apache/datafusion-ballista/pull/1615

   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
   
   Sort-shuffle finalize previously decoded every spilled batch and re-emitted 
it through an IPC `FileWriter`, paying decompress + Arrow allocation + 
recompress for every spilled byte. This PR replaces that round-trip with a 
`std::io::copy` of the spill file straight into the consolidated output. On 
Linux this engages `copy_file_range` / `sendfile`, so spilled bytes never 
re-enter user space.
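
   As a rough illustration of the copy-based finalize described above (a minimal sketch, not the code in this PR), the helper below appends spill files verbatim to an already-open output file with `std::io::copy`. The name `append_spills` and its signature are hypothetical and chosen only for this example.

```rust
use std::fs::File;
use std::io::{self, Seek};
use std::path::Path;

/// Hypothetical helper: appends the raw bytes of each spill file to `output`
/// and returns the output's byte position after the last copy.
fn append_spills(output: &mut File, spills: &[&Path]) -> io::Result<u64> {
    for path in spills {
        let mut spill = File::open(path)?;
        // Bytes are copied verbatim: no IPC decode, no allocation of Arrow
        // arrays, no recompression. For file-to-file copies the standard
        // library may use a kernel-side copy on Linux.
        io::copy(&mut spill, &mut *output)?;
    }
    // The current position is the end offset of everything written so far,
    // assuming `output` was not opened in append mode.
    output.stream_position()
}
```

   The returned position is the kind of value an index of byte offsets could record after each partition is written.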
   
   # What changes are included in this PR?
   
   The on-disk format for sort-shuffle output changes:
   
   - **Data file**: was a single IPC File with a footer of batch-block offsets. 
Now it is a leading schema-header IPC stream followed by per-partition byte 
ranges, each holding zero or more concatenated self-contained IPC streams.
   - **Index file**: was little-endian i64 cumulative batch indices, even though 
the docstring already promised byte offsets. Now it stores actual 
little-endian i64 byte offsets, as the docstring describes.
   - **Reader**: `stream_sort_shuffle_partition` recovers the schema from the 
leading header stream and uses a new bounded multi-stream reader that crosses 
concatenated stream EOS markers within a partition's byte range.
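
   The index encoding and the bounded read can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: it assumes the index holds one more offset than there are partitions, so that entries `p` and `p + 1` delimit partition `p`, and the names `encode_index`, `decode_index`, and `partition_reader` are hypothetical. It only bounds the bytes; parsing the concatenated IPC streams inside the range is left out.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Encodes byte offsets as consecutive little-endian i64 values.
fn encode_index(offsets: &[i64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(offsets.len() * 8);
    for offset in offsets {
        out.extend_from_slice(&offset.to_le_bytes());
    }
    out
}

/// Decodes a buffer of little-endian i64 values, rejecting partial entries.
fn decode_index(bytes: &[u8]) -> io::Result<Vec<i64>> {
    if bytes.len() % 8 != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "index length is not a multiple of 8",
        ));
    }
    let mut offsets = Vec::with_capacity(bytes.len() / 8);
    for chunk in bytes.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        offsets.push(i64::from_le_bytes(buf));
    }
    Ok(offsets)
}

/// Returns a reader limited to partition `p`'s byte range.
/// ASSUMPTION: offsets[p] is the start and offsets[p + 1] is the end.
fn partition_reader<R: Read + Seek>(
    mut data: R,
    offsets: &[i64],
    p: usize,
) -> io::Result<io::Take<R>> {
    let (start, end) = match (offsets.get(p), offsets.get(p + 1)) {
        (Some(&s), Some(&e)) if 0 <= s && s <= e => (s as u64, e as u64),
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "missing or invalid partition range",
            ))
        }
    };
    data.seek(SeekFrom::Start(start))?;
    // `take` stops the reader at the end of the range, so a consumer cannot
    // run into the next partition's bytes.
    Ok(data.take(end - start))
}
```

   An empty partition is simply a range whose start and end offsets are equal, so the bounded reader yields no bytes for it.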
   
   Hash-based shuffle is intentionally untouched. Public API of 
`is_sort_shuffle_output`, `get_index_path`, and `stream_sort_shuffle_partition` 
is unchanged, so `ShuffleReaderExec` and the executor's Arrow Flight service 
work without modification.
   
   New tests cover multi-spill, in-memory-only, and empty-partition round-trips.
   
   # Are there any user-facing changes?
   
   No public-API changes. The sort-shuffle on-disk format changes. It is 
executor-internal, but in-flight files written by older binaries are not 
readable by this version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

