Re: Question on Celeborn workers

2023-10-16 Thread Keyong Zhou
Yeah, retaining the map output can reduce the number of tasks that need to be recomputed for DETERMINATE stages when an output file is lost. This is one important design tradeoff. Currently Celeborn also supports MapPartition for Flink Batch, in which case partition data is not aggregated; instead one
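A toy sketch of the tradeoff described above (illustrative only, not Celeborn's actual scheduler logic; all names are made up): with retained map output, a lost reducer partition can be rebuilt without rerunning upstream tasks, otherwise every mapper that contributed to it has to rerun.

```scala
// Toy model, not Celeborn code: which upstream map tasks must rerun when the
// aggregated output for one reducer partition is lost.
def mapTasksToRerun(
    lostPartition: Int,
    contributingMappers: Map[Int, Set[Int]], // partitionId -> ids of mappers that wrote into it
    mapOutputRetained: Boolean): Set[Int] =
  if (mapOutputRetained) Set.empty // re-merge the lost partition from retained map output
  else contributingMappers.getOrElse(lostPartition, Set.empty) // every contributing mapper reruns

// e.g. partition 3 written by mappers 0..99:
//   retained map output -> 0 map tasks rerun
//   no retained output  -> 100 map tasks rerun
```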

Re: Question on Celeborn workers

2023-10-16 Thread Mridul Muralidharan
With push-based shuffle in Apache Spark (magnet), we have both the map output and the reducer-oriented merged output preserved, with the reducer-oriented view chosen by default for reads and a fallback to the mapper output when the reducer output is missing or fails. That mitigates this specific issue for
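For reference, a minimal sketch of turning on magnet (push-based shuffle) from the client side on a Spark 3.2+/YARN deployment; the shuffle-service side setup (merged shuffle file manager on the node managers) is deployment specific and not shown.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Client-side settings; push-based shuffle also requires the external shuffle
// service on YARN with the merged shuffle file manager enabled.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.push.enabled", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
```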

Re: Question on Celeborn workers

2023-10-16 Thread Keyong Zhou
Hi Sungwoo, what you are pointing out is correct. Currently shuffle data is distributed across `celeborn.master.slot.assign.maxWorkers` workers, which defaults to 10000, so I believe the cascading stage rerun will definitely happen. I think setting `celeborn.master.slot.assign.maxWorkers`
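A sketch of lowering that knob, under the assumption that the Celeborn 0.3.x Spark client picks up `celeborn.*` settings through the `spark.celeborn.` prefix; depending on the version/deployment it may instead need to be set in the master's celeborn-defaults.conf, and the value 32 is just an illustration.

```scala
import org.apache.spark.SparkConf

// Sketch only: cap how many workers one shuffle's slots can spread across, so a
// single worker failure touches fewer edges. Class and config names assume the
// Celeborn 0.3.x Spark integration.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
  .set("spark.celeborn.master.endpoints", "<celeborn-master>:9097")
  .set("spark.celeborn.master.slot.assign.maxWorkers", "32")
```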

Question on Celeborn workers

2023-10-12 Thread Sungwoo Park
I have a question on how Celeborn distributes shuffle data among Celeborn workers. From our observation, it seems that whenever a Celeborn worker fails or gets killed (in a small cluster of fewer than 25 nodes), almost every edge is affected. Does this mean that an edge with multiple
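A back-of-the-envelope illustration (assumption for intuition only, not a measurement) of why spreading each edge's shuffle data across many workers makes nearly every edge sensitive to a single worker failure:

```scala
// Probability that one worker failure does NOT touch an edge whose shuffle
// data lands on k of the cluster's n workers.
def edgeUnaffectedProb(k: Int, n: Int): Double = (n - k).toDouble / n

// Data spread over all 25 workers of a 25-node cluster: every edge is hit.
//   edgeUnaffectedProb(25, 25) == 0.0
// Data confined to 4 workers: ~84% of edges survive one worker failure.
//   edgeUnaffectedProb(4, 25) == 0.84
```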