marin-ma opened a new pull request, #10499: URL: https://github.com/apache/incubator-gluten/pull/10499
The current design of the shuffle read process reuses Spark's `BlockStoreShuffleReader`, which processes each input stream individually with a dedicated `SerializerInstance` to deserialize the input data. This works fine when the output is rows. However, in Gluten we convert the deserialized row data into columnar batches during sort-based shuffle read. In real use cases, each input stream may contain only a small number of rows, so deserialization in the sort-based shuffle reader can be very slow because the row-to-column (r2c) conversion runs on small batches. In this situation it is hard to tune the performance of the r2c process.

This patch adds a new `ColumnarShuffleReader` that creates only one `SerializerInstance` per reducer task and uses it to deserialize all input streams. This allows the native reader to load all input streams and perform the r2c conversion while reading/accumulating a larger number of rows. This change also eliminates the `VeloxResizeBatches` operation for sort-based shuffle read. ***There are no benefits for the hash-based shuffle reader and the rss-sort shuffle reader.***

The charts below show the shuffle read process before and after this change.

Before:

<img width="303" height="411" alt="image" src="https://github.com/user-attachments/assets/a4183e44-e921-4704-affd-3a97fc1e2fe4" />

After:

<img width="338" height="434" alt="image" src="https://github.com/user-attachments/assets/5d349c83-d36e-4788-89cf-cb2c9ec235d9" />
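As a rough illustration of the batching effect (a hypothetical sketch, not Gluten's actual code; all names here are made up), compare batching rows per input stream against batching across a single pass over all streams. With many small streams, a per-stream reader emits many tiny batches, while a single reader over the concatenated streams can accumulate rows up to the target batch size before converting:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Hypothetical helper: group rows into batches of at most `target` rows,
    // standing in for the r2c conversion's batch accumulation.
    static List<List<Integer>> toBatches(List<Integer> rows, int target) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += target) {
            batches.add(rows.subList(i, Math.min(i + target, rows.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 8 input streams, each carrying only 5 rows (small per-stream counts).
        List<List<Integer>> streams = new ArrayList<>();
        for (int s = 0; s < 8; s++) {
            List<Integer> stream = new ArrayList<>();
            for (int r = 0; r < 5; r++) stream.add(s * 5 + r);
            streams.add(stream);
        }
        int target = 20;

        // Before: one serializer instance per stream -> 8 tiny batches of 5 rows each.
        List<List<Integer>> perStream = new ArrayList<>();
        for (List<Integer> stream : streams) perStream.addAll(toBatches(stream, target));

        // After: one serializer instance over all streams -> 2 full batches of 20 rows.
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> stream : streams) merged.addAll(stream);
        List<List<Integer>> acrossStreams = toBatches(merged, target);

        System.out.println("per-stream batch count: " + perStream.size());       // 8
        System.out.println("single-reader batch count: " + acrossStreams.size()); // 2
    }
}
```

Fewer, larger batches amortize the per-batch overhead of the r2c conversion, which is the motivation for handing all streams to one `SerializerInstance`.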
