I see that it is the DAGScheduler that orchestrates task resubmission. This code <https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L625-L633> is responsible for calling submitStage for any failed stages. How does spark.task.maxFailures affect this (if at all)?
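To check my understanding of how the two layers interact, here is a rough, self-contained sketch (illustrative names only, not the actual DAGScheduler/TaskSetManager code): stage resubmission deals with stages that were marked failed, while spark.task.maxFailures is enforced per task in the task-set layer; once it is exceeded, the task set is aborted and the job fails rather than going back through submitStage.

    // Rough sketch only -- illustrative names, not Spark's actual classes.
    object RetryLayersSketch {
      final case class Stage(id: Int)

      // Stage layer (DAGScheduler-like): resubmit every stage marked as failed.
      def resubmitFailedStages(failed: Set[Stage], submitStage: Stage => Unit): Unit =
        failed.foreach(submitStage)

      // Task layer (TaskSetManager-like): is one more attempt of this task allowed?
      // With maxFailures = 1 this is false after the first failure, so the task set
      // aborts and the job fails instead of the stage being resubmitted.
      def shouldRetryTask(failuresSoFar: Int, maxFailures: Int): Boolean =
        failuresSoFar < maxFailures

      def main(args: Array[String]): Unit = {
        resubmitFailedStages(Set(Stage(3)), s => println(s"resubmitting stage ${s.id}"))
        println(shouldRetryTask(failuresSoFar = 1, maxFailures = 1)) // false -> abort
      }
    }

Is that roughly the right mental model, or does the DAGScheduler also consult spark.task.maxFailures directly?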
Log on driver: https://gist.github.com/gregakespret/7874908#file-gistfile1-txt-L1045-L1062 (the selected lines are where I killed the worker JVM)

Grega
--
Grega Kešpret
Analytics engineer
Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile

On Mon, Dec 9, 2013 at 4:35 PM, Grega Kešpret <gr...@celtra.com> wrote:

> Hi!
>
> I tried this (by setting spark.task.maxFailures to 1) and it still does
> not fail fast. I started a job and, after some time, killed all JVMs
> running on one of the two workers. I was expecting the Spark job to fail;
> however, it re-fetched the tasks to the worker that was still alive and
> the job succeeded.
>
> Grega
>
> On Mon, Dec 9, 2013 at 10:43 AM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> Hi Reynold,
>>
>> I submitted a pull request here:
>> https://github.com/apache/incubator-spark/pull/245
>> Do I need to do anything else (perhaps add a ticket in JIRA)?
>>
>> Best,
>> Grega
>>
>> On Fri, Nov 29, 2013 at 6:24 PM, Reynold Xin <r...@apache.org> wrote:
>>
>>> Looks like a bug to me. Can you submit a pull request?
>>>
>>> On Fri, Nov 29, 2013 at 2:02 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>>> Looking at
>>>> http://spark.incubator.apache.org/docs/latest/configuration.html
>>>> the docs say:
>>>> "Number of individual task failures before giving up on the job. Should be
>>>> greater than or equal to 1. Number of allowed retries = this value - 1."
>>>>
>>>> However, looking at the code
>>>> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala#L532
>>>> if I set spark.task.maxFailures to 1, the job only fails after a task
>>>> fails for the second time. Shouldn't this line be corrected to
>>>> if (numFailures(index) >= MAX_TASK_FAILURES) {
>>>> ?
>>>>
>>>> I can open a pull request if this is the case.
>>>>
>>>> Thanks,
>>>> Grega
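For anyone following the thread, here is a minimal, self-contained sketch (hypothetical names, not the real ClusterTaskSetManager code) of why the current > comparison tolerates one extra failure, while the proposed >= comparison matches the documented "allowed retries = this value - 1" behaviour:

    // Illustrative only; simplified, not the actual ClusterTaskSetManager code.
    object MaxFailuresSketch {
      // How many failed attempts occur before the abort check trips,
      // for a given comparison between the failure count and maxFailures.
      def attemptsBeforeAbort(maxFailures: Int, abortWhen: (Int, Int) => Boolean): Int = {
        var failures = 0
        while (!abortWhen(failures, maxFailures)) {
          failures += 1 // one more failed attempt is allowed to run
        }
        failures
      }

      def main(args: Array[String]): Unit = {
        // Current check: numFailures(index) > MAX_TASK_FAILURES
        println(attemptsBeforeAbort(1, _ > _))  // 2 -> job only fails on the second failure
        // Proposed fix: numFailures(index) >= MAX_TASK_FAILURES
        println(attemptsBeforeAbort(1, _ >= _)) // 1 -> fails fast; retries = maxFailures - 1
      }
    }

Run standalone, the first println prints 2 and the second prints 1, which is the fail-fast behaviour I was expecting with spark.task.maxFailures set to 1.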