Re: Question on Celeborn workers,

2023-10-16 Thread Keyong Zhou
Yeah, retaining the map output can reduce the number of tasks that need to be recomputed for DETERMINATE stages when an output file is lost. This is one important design tradeoff. Currently Celeborn also supports MapPartition for Flink Batch, in which case partition data is not aggregated; instead one …
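The saving described above can be sketched in a few lines of Python. This is purely illustrative (not Celeborn or Spark code, and the function name is hypothetical): for a DETERMINATE stage, only the map tasks whose output was lost need to re-run, while an INDETERMINATE stage may emit different data on retry, so every map task must be re-executed.

```python
def tasks_to_recompute(num_map_tasks, lost_map_ids, deterministic):
    """Return the map task ids to re-run after a shuffle output loss.

    deterministic=True  -> only the tasks whose output was lost re-run.
    deterministic=False -> a retry may produce different data, so the
                           whole map stage must be re-executed.
    """
    if deterministic:
        return sorted(lost_map_ids)        # recompute just the lost outputs
    return list(range(num_map_tasks))      # recompute the entire stage

# Losing 2 of 100 map outputs:
print(tasks_to_recompute(100, {7, 42}, deterministic=True))        # → [7, 42]
print(len(tasks_to_recompute(100, {7, 42}, deterministic=False)))  # → 100
```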

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-16 Thread Mridul Muralidharan
On Mon, Oct 16, 2023 at 11:31 AM Erik fang wrote:
> Hi Mridul,
>
> For a),
> DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier() …

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-16 Thread Erik fang
Hi Mridul, For a), DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier() to decide whether the whole stage needs to be recomputed. I think …
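The decision discussed in this thread can be sketched as follows. Spark's actual logic lives in Scala's DAGScheduler; this Python sketch only mirrors the two checks named above (`Stage.isIndeterminate()`, `RDD.isBarrier()`), and the function name is hypothetical:

```python
def must_rerun_whole_stage(is_indeterminate, is_barrier):
    """A shuffle fetch failure forces a full stage re-run when either:
    - the stage's output is indeterminate (a retry may produce different
      data, so partial results cannot be reused), or
    - the stage uses barrier scheduling (all tasks must launch together).
    Otherwise only the tasks for the lost partitions are resubmitted."""
    return is_indeterminate or is_barrier

print(must_rerun_whole_stage(False, False))  # → False (recompute lost partitions only)
print(must_rerun_whole_stage(True, False))   # → True  (rerun the whole stage)
```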

Re: Question on Celeborn workers,

2023-10-16 Thread Mridul Muralidharan
With push-based shuffle in Apache Spark (magnet), we have both the map output and the reducer-oriented merged output preserved, with the reducer-oriented view chosen by default for reads and a fallback to the mapper output when the reducer output is missing or a fetch fails. That mitigates this specific issue for …
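The read-path fallback described above can be sketched like this. All names here are illustrative assumptions, not Spark's API; the point is only the precedence: prefer the merged reducer-oriented block, fall back to the original per-mapper blocks when it is unavailable.

```python
def read_partition(merged_store, mapper_store, reduce_id, map_ids):
    """Read one reduce partition, preferring the merged block.

    merged_store: dict mapping reduce_id -> merged block bytes (may be absent)
    mapper_store: dict mapping (map_id, reduce_id) -> original block bytes
    """
    merged = merged_store.get(reduce_id)
    if merged is not None:
        return merged                      # fast path: one merged block
    # Fallback: fetch each mapper's original output block for this partition.
    return b"".join(mapper_store[(m, reduce_id)] for m in map_ids)

merged = {5: b"MERGED"}
mappers = {(0, 5): b"a", (1, 5): b"b"}
print(read_partition(merged, mappers, 5, [0, 1]))  # → b'MERGED'
print(read_partition({}, mappers, 5, [0, 1]))      # → b'ab'
```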

Re: Question on Celeborn workers,

2023-10-16 Thread Keyong Zhou
Hi Sungwoo, What you are pointing out is very correct. Currently shuffle data is distributed across `celeborn.master.slot.assign.maxWorkers` workers, which defaults to 1, so I believe the cascading stage rerun will definitely happen. I think setting `celeborn.master.slot.assign.maxWorkers` …
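Illustratively, capping how many workers a shuffle's slots spread across is a one-line entry in the Celeborn configuration file; the value below is an example, not a recommendation:

```
# celeborn-defaults.conf (value is illustrative)
celeborn.master.slot.assign.maxWorkers 100
```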