zhli1142015 commented on PR #12097: URL: https://github.com/apache/gluten/pull/12097#issuecomment-4459250747
@marin-ma I re-checked the current code paths. `VeloxResizeBatchesExec` can address a similar symptom, but I do not think it is an exact replacement for this reader-side merge. In our internal evaluations, we have seen cases where the cost of `VeloxResizeBatchesExec` is higher than its benefit and causes regressions, so we have not been relying on it for those workloads. The issue we are trying to address here is not really a shuffle regression by itself. After #10499, the hash shuffle reader can emit one `RowVector` / `ColumnarBatch` per payload. When the shuffle output contains many small payloads, the downstream operators after shuffle have to consume many small vectors, and that is where the overhead shows up. Restoring coalescing inside the hash shuffle reader reduces this fragmentation before those vectors reach the downstream operators. I also re-checked the side-effect risk in this PR. The merge is local to the hash shuffle reader, capped by `batchSize`, flushed at Spark shuffle stream boundaries, and skipped for dictionary / complex payloads. If the next payload cannot be merged, the code carries it over instead of crossing the size limit. So the behavior remains conservative and should not change ordering or mix data across streams. `VeloxResizeBatchesExec` can still be useful when it is enabled, especially because it can combine across stream boundaries, but this PR does not remove or block that path. These two mechanisms can coexist: the reader-side fast path handles the common plain hash-shuffle payload fragmentation early, and `VeloxResizeBatchesExec` can still do additional resizing afterwards if a deployment chooses to enable it. So I do not see this as an either-or choice; keeping this targeted reader-side coalescing looks reasonable to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
