Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Erik fang
Hi Mridul, your explanation is clear and great. Thank you so much! On Fri, Oct 20, 2023 at 11:59 AM Keyong Zhou wrote: > Hi Mridul, thanks for the explanation, it's clear to me now, Thanks! > > On Fri, Oct 20, 2023 at 11:15, Mridul Muralidharan wrote: > > > To add my response - what I described (w.r.t

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Keyong Zhou
Hi Mridul, thanks for the explanation, it's clear to me now. Thanks! On Fri, Oct 20, 2023 at 11:15, Mridul Muralidharan wrote: > To add my response - what I described (w.r.t failing job) applies only to > ResultStage. > It walks the lineage DAG to identify all indeterminate parents to rollback. > If there

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Mridul Muralidharan
To add my response - what I described (w.r.t. failing the job) applies only to ResultStage. It walks the lineage DAG to identify all indeterminate parents to roll back. If there are only ShuffleMapStages in the set of stages to roll back, it will simply discard their output, roll back all of them, and
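
The decision described in this message and in the earlier one about ResultStage can be illustrated with a small toy model. This is only a sketch of the rule as stated in the thread, not Spark's DAGScheduler code: the `Stage` class, the `plan_rollback` function, and the `"abort_job"` / `"rollback"` labels are hypothetical names introduced for illustration, and the set of stages to roll back is assumed to have been computed already by walking the lineage DAG.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Stage:
    """Hypothetical stand-in for a scheduler stage (not a Spark class)."""
    name: str
    kind: str  # "shuffle_map" or "result"
    output_partitions: List[int] = field(default_factory=list)


def plan_rollback(stages_to_rollback: List[Stage]) -> Tuple[str, List[str]]:
    """Apply the rule from the thread to an already-computed rollback set.

    If any ResultStage is in the set, its output may already have been
    consumed (e.g. collect() to the driver), so the job is aborted.
    Otherwise every ShuffleMapStage has its output discarded and is
    marked for rollback.
    """
    result_stages = [s.name for s in stages_to_rollback if s.kind == "result"]
    if result_stages:
        # Nothing is discarded here: the job as a whole is given up.
        return ("abort_job", result_stages)

    for stage in stages_to_rollback:
        stage.output_partitions.clear()  # discard the stale shuffle output
    return ("rollback", [s.name for s in stages_to_rollback])
```

For example, a rollback set holding only two ShuffleMapStages yields a `"rollback"` decision and empties their outputs, while adding a ResultStage to the same set yields `"abort_job"` and leaves all outputs untouched.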

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Mridul Muralidharan
Good question. ResultStage is actually special-cased in Spark, as its output could have already been consumed (for example, collect() to the driver, etc.) - and so if it is one of the stages which needs to be rolled back, the job is aborted. To illustrate, see the following: -- snip -- package

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Keyong Zhou
In fact, I'm wondering whether Spark will rerun the whole reduce ShuffleMapStage if its upstream ShuffleMapStage is INDETERMINATE and is rerun. On Thu, Oct 19, 2023 at 23:00, Keyong Zhou wrote: > Thanks Erik for bringing up this question, I'm also curious about the > answer, any feedback is appreciated. > > Thanks,

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Keyong Zhou
Thanks Erik for bringing up this question. I'm also curious about the answer; any feedback is appreciated. Thanks, Keyong Zhou On Thu, Oct 19, 2023 at 22:16, Erik fang wrote: > Mridul, > > sure, I totally agree SPARK-25299 is a much better solution, as long as we > can get it from spark community > (btw,

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Erik fang
Mridul, sure, I totally agree SPARK-25299 is a much better solution, as long as we can get it from the Spark community (btw, private[spark] of RDD.outputDeterministicLevel is no big deal, Celeborn already has spark-integration code with [spark] scope). I also have a question about INDETERMINATE