[GitHub] [arrow-ballista] yahoNanJing commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

via GitHub Tue, 14 Feb 2023 21:53:54 -0800


yahoNanJing commented on issue #660:
URL: https://github.com/apache/arrow-ballista/issues/660#issuecomment-1430792369


   Thanks @andygrove for raising the discussion for this topic. 
   
   For the second approach, the optimization change is also limited when tasks 
of a query stage are assigned to different executors, which is a common case 
when using the RoundRobin task scheduling policy for load balancing.
   
   Actually, to reduce the shuffle write file, I recommend to use the 
sort-based shuffle writer used in Spark 
https://issues.apache.org/jira/browse/SPARK-2045. Then for each original 
`ShuffleWriterExec`, there will be only 2 output files rather than N files for 
its downside stage. One file for shuffling data with concatenating all of the 
output partition data, and the other one for the indexes of each partition's 
offset in the data file.
   
   An intuitive graph can be find here, 
https://github.com/blaze-init/blaze/blob/master/dev/doc/architectural_overview.md.
   
   Hi @yjshen, could you share your opinions?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-ballista] yahoNanJing commented on issue #660: Proposal for more efficient disk-based shuffle mechanism

Reply via email to