[PR] [VL] Restore hash shuffle reader payload merging [gluten]

via GitHub Thu, 14 May 2026 22:56:12 -0700


zhli1142015 opened a new pull request, #12097:
URL: https://github.com/apache/gluten/pull/12097


   ## What changes are proposed in this pull request?
   
   After #10499, the hash shuffle reader stopped merging multiple payloads 
into larger batches and instead returns one batch per payload. As a result, 
shuffle-read output batches stay small, which increases per-batch overhead in 
downstream operators.
   
   Restore reader-side coalescing for mergeable plain hash shuffle payloads, 
but flush at Spark shuffle stream boundaries so that payloads from different 
input streams are never combined. Keep dictionary and complex-type payloads 
unmerged, reset dictionary state per stream, and carry over to the next batch 
any payload that would push the merged batch past the configured batch size.
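
The coalescing rules above can be sketched roughly as follows. This is an illustrative Python model, not the code in this PR: `Payload`, `coalesce_streams`, and the `mergeable` flag are hypothetical names, batch size is assumed to be counted in rows, and flushing any pending batch before emitting an unmergeable payload is an assumption made to preserve row order. Per-stream dictionary state is not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Payload:
    """Hypothetical stand-in for one shuffle payload."""
    rows: int
    mergeable: bool = True  # False models dictionary / complex-type payloads


def coalesce_streams(
    streams: Iterable[Iterable[Payload]], batch_size: int
) -> Iterator[List[Payload]]:
    """Yield groups of payloads; each group models one output batch."""
    if batch_size <= 0:
        # Assumption: non-positive batch sizes are rejected up front.
        raise ValueError("batch_size must be positive")

    for stream in streams:
        pending: List[Payload] = []
        pending_rows = 0
        for payload in stream:
            if not payload.mergeable:
                # Assumption: flush first so rows keep their original order.
                if pending:
                    yield pending
                    pending, pending_rows = [], 0
                yield [payload]
                continue
            if pending and pending_rows + payload.rows > batch_size:
                # Carry-over: this payload starts the next batch instead.
                yield pending
                pending, pending_rows = [], 0
            pending.append(payload)
            pending_rows += payload.rows
        # Stream boundary: never merge across input streams.
        if pending:
            yield pending


def batch_row_counts(streams, batch_size: int) -> List[int]:
    """Helper for inspection: total rows in each output batch."""
    return [sum(p.rows for p in b) for b in coalesce_streams(streams, batch_size)]
```

In this model, a payload that is larger than the batch size on its own is still emitted, as a batch of its own, rather than being split.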
   
   Add stream-local merge tests covering multi-column 
primitive/bool/string/nullable data, per-stream merge boundaries, carry-over, 
dictionary and complex-type paths, and invalid batch sizes.
   
   ## How was this patch tested?
   
   Unit tests (UT).
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: GitHub Copilot
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

