Re: Question on Celeborn workers,

2023-10-16 Thread Keyong Zhou
Yeah, retaining the map output can reduce the number of tasks that need to be recomputed for DETERMINATE stages when an output file is lost. This is one important design tradeoff. Currently Celeborn also supports MapPartition for Flink Batch, in which case partition data is not aggregated; instead one …
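The saving described above can be sketched in a few lines of Python. This is purely illustrative (not Celeborn or Spark code, and the function name is hypothetical): for a DETERMINATE stage, only the map tasks whose output was lost need to re-run, while an INDETERMINATE stage may emit different data on retry, so every map task must be re-executed.

```python
def tasks_to_recompute(num_map_tasks, lost_map_ids, deterministic):
    """Return the map task ids to re-run after a shuffle output loss.

    deterministic=True  -> only the tasks whose output was lost re-run.
    deterministic=False -> a retry may produce different data, so the
                           whole map stage must be re-executed.
    """
    if deterministic:
        return sorted(lost_map_ids)        # recompute just the lost outputs
    return list(range(num_map_tasks))      # recompute the entire stage

# Losing 2 of 100 map outputs:
print(tasks_to_recompute(100, {7, 42}, deterministic=True))        # → [7, 42]
print(len(tasks_to_recompute(100, {7, 42}, deterministic=False)))  # → 100
```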

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-16 Thread Mridul Muralidharan
On Mon, Oct 16, 2023 at 11:31 AM Erik fang wrote:
> Hi Mridul,
>
> For a),
> DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier() …

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-16 Thread Erik fang
Hi Mridul, For a), DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier() to decide whether the whole stage needs to be recomputed. I think …
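The decision discussed in this thread can be sketched as follows. Spark's actual logic lives in Scala's DAGScheduler; this Python sketch only mirrors the two checks named above (`Stage.isIndeterminate()`, `RDD.isBarrier()`), and the function name is hypothetical:

```python
def must_rerun_whole_stage(is_indeterminate, is_barrier):
    """A shuffle fetch failure forces a full stage re-run when either:
    - the stage's output is indeterminate (a retry may produce different
      data, so partial results cannot be reused), or
    - the stage uses barrier scheduling (all tasks must launch together).
    Otherwise only the tasks for the lost partitions are resubmitted."""
    return is_indeterminate or is_barrier

print(must_rerun_whole_stage(False, False))  # → False (recompute lost partitions only)
print(must_rerun_whole_stage(True, False))   # → True  (rerun the whole stage)
```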

Re: Question on Celeborn workers,

2023-10-16 Thread Mridul Muralidharan
With push-based shuffle in Apache Spark (magnet), we have both the map output and the reducer-oriented merged output preserved, with the reducer-oriented view chosen by default for reads and a fallback to the mapper output when the reducer output is missing or a fetch fails. That mitigates this specific issue for …
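The read-path fallback described above can be sketched like this. All names here are illustrative assumptions, not Spark's API; the point is only the precedence: prefer the merged reducer-oriented block, fall back to the original per-mapper blocks when it is unavailable.

```python
def read_partition(merged_store, mapper_store, reduce_id, map_ids):
    """Read one reduce partition, preferring the merged block.

    merged_store: dict mapping reduce_id -> merged block bytes (may be absent)
    mapper_store: dict mapping (map_id, reduce_id) -> original block bytes
    """
    merged = merged_store.get(reduce_id)
    if merged is not None:
        return merged                      # fast path: one merged block
    # Fallback: fetch each mapper's original output block for this partition.
    return b"".join(mapper_store[(m, reduce_id)] for m in map_ids)

merged = {5: b"MERGED"}
mappers = {(0, 5): b"a", (1, 5): b"b"}
print(read_partition(merged, mappers, 5, [0, 1]))  # → b'MERGED'
print(read_partition({}, mappers, 5, [0, 1]))      # → b'ab'
```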

Re: Question on Celeborn workers,

2023-10-16 Thread Keyong Zhou
Hi Sungwoo, What you are pointing out is very correct. Currently shuffle data is distributed across `celeborn.master.slot.assign.maxWorkers` workers, which defaults to 1, so I believe the cascading stage rerun will definitely happen. I think setting `celeborn.master.slot.assign.maxWorkers` …
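Illustratively, capping how many workers a shuffle's slots spread across is a one-line entry in the Celeborn configuration file; the value below is an example, not a recommendation:

```
# celeborn-defaults.conf (value is illustrative)
celeborn.master.slot.assign.maxWorkers 100
```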