marin-ma opened a new pull request, #10499:
URL: https://github.com/apache/incubator-gluten/pull/10499

   The current design of the shuffle read process reuses Spark's 
`BlockStoreShuffleReader`. The shuffle reader processes each input stream 
individually with a dedicated `SerializerInstance` to deserialize the input 
data. This works fine when the output is rows. However, in Gluten we convert 
the deserialized row data into columnar batches during sort-based shuffle read. 
   
   In real use cases, each input stream may contain only a small number of 
rows. Deserialization in the sort-based shuffle reader can then be very slow, 
because the row-to-column (r2c) conversion operates on small batches. In this 
case, it's hard to tune the performance of the r2c process.
   
   This patch adds a new `ColumnarShuffleReader` that creates a single 
`SerializerInstance` per reducer task to deserialize all input streams. This 
allows the native reader to load all input streams and perform the r2c 
conversion while reading/accumulating a larger number of rows. This change 
also eliminates the `VeloxResizeBatches` operation for sort-based shuffle 
read.
   
   ***There is no benefit for the hash-based shuffle reader or the rss-sort 
shuffle reader.***
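   A minimal sketch of the single-deserializer idea, using only standard Java 
streams (the names below are illustrative, not Gluten's actual API): instead of 
handing each fetched map output to its own deserializer, the streams are 
chained so one reader sees a single continuous byte stream and can accumulate 
rows across all inputs before the row-to-column conversion.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.io.SequenceInputStream;
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.List;

   public class SingleReaderSketch {
       public static void main(String[] args) throws IOException {
           // Each element simulates the shuffle blocks fetched from one mapper.
           List<InputStream> mapOutputs = Arrays.asList(
                   new ByteArrayInputStream("row1\n".getBytes()),
                   new ByteArrayInputStream("row2\n".getBytes()),
                   new ByteArrayInputStream("row3\n".getBytes()));

           // Before: one deserializer per stream, so each r2c conversion sees
           // only a tiny batch of rows.
           // After: chain all streams so a single deserializer can accumulate
           // a larger row batch before converting it to a columnar batch.
           InputStream combined =
                   new SequenceInputStream(Collections.enumeration(mapOutputs));

           byte[] allBytes = combined.readAllBytes();
           System.out.println(new String(allBytes).trim());
       }
   }
   ```

   The chained stream is why a separate batch-resizing step becomes 
unnecessary: the batch size is controlled at read time rather than patched up 
afterwards.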
   
   The charts below show the shuffle read process before and after this 
change.
   
   Before:
   <img width="303" height="411" alt="image" 
src="https://github.com/user-attachments/assets/a4183e44-e921-4704-affd-3a97fc1e2fe4" />
   
   After:
   <img width="338" height="434" alt="image" 
src="https://github.com/user-attachments/assets/5d349c83-d36e-4788-89cf-cb2c9ec235d9" />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
