I see that it is the DAGScheduler that orchestrates task resubmission. This
code is responsible for calling submitStage for any failed stages:
https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L625-L633
How does spark.task.maxFailures affect this (if at all)?
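
My current mental model of how the two mechanisms interact, written out as a
simplified sketch (this is not the actual Spark source; the class and method
names below are made up for illustration): spark.task.maxFailures is enforced
per task inside the ClusterTaskSetManager, and once a single task has failed
too many times the whole task set is aborted, which fails the job, while
submitStage is a separate, stage-level resubmission path in the DAGScheduler.

// Simplified sketch only -- not the actual Spark code.
import scala.collection.mutable

class TaskSetManagerSketch(maxTaskFailures: Int) {
  // Failure count per task index within the task set.
  private val numFailures = mutable.Map[Int, Int]().withDefaultValue(0)

  // Called once for every failed attempt of the task at `index`.
  def handleFailedTask(index: Int): Unit = {
    numFailures(index) += 1
    if (numFailures(index) > maxTaskFailures) { // the comparison under discussion
      abort("Task " + index + " failed more than " + maxTaskFailures + " times")
    }
    // Otherwise the task stays pending and is offered to executors again.
  }

  // Stand-in for aborting the task set, which in turn fails the job.
  private def abort(message: String): Unit = {
    throw new RuntimeException(message)
  }
}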

Log on the driver (the lines where I killed the worker JVMs are selected):
https://gist.github.com/gregakespret/7874908#file-gistfile1-txt-L1045-L1062
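
For reference, a minimal way to reproduce the setting from the test quoted
below (a sketch only, assuming the system-property configuration mechanism;
the master URL and app name here are just placeholders):

import org.apache.spark.SparkContext

// The property has to be set before the SparkContext (and hence the
// scheduler) is created, because it is read via System.getProperty.
System.setProperty("spark.task.maxFailures", "1")

val sc = new SparkContext("spark://master:7077", "failfast-test")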


Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com <http://www.celtra.com/> | @celtramobile <http://www.twitter.com/celtramobile>


On Mon, Dec 9, 2013 at 4:35 PM, Grega Kešpret <gr...@celtra.com> wrote:

> Hi!
>
> I tried this (by setting spark.task.maxFailures to 1), and it still does
> not fail fast. I started a job and, after some time, killed all the JVMs
> running on one of the two workers. I was expecting the Spark job to fail;
> however, it resubmitted the tasks to the worker that was still alive, and
> the job succeeded.
>
> Grega
> --
> *Grega Kešpret*
> Analytics engineer
>
> Celtra — Rich Media Mobile Advertising
> celtra.com <http://www.celtra.com/> | 
> @celtramobile<http://www.twitter.com/celtramobile>
>
>
> On Mon, Dec 9, 2013 at 10:43 AM, Grega Kešpret <gr...@celtra.com> wrote:
>
>> Hi Reynold,
>>
>> I submitted a pull request here -
>> https://github.com/apache/incubator-spark/pull/245
>> Do I need to do anything else (perhaps add a ticket in JIRA)?
>>
>> Best,
>> Grega
>> --
>> *Grega Kešpret*
>>
>> Analytics engineer
>>
>> Celtra — Rich Media Mobile Advertising
>> celtra.com <http://www.celtra.com/> | 
>> @celtramobile<http://www.twitter.com/celtramobile>
>>
>>
>> On Fri, Nov 29, 2013 at 6:24 PM, Reynold Xin <r...@apache.org> wrote:
>>
>>> Looks like a bug to me. Can you submit a pull request?
>>>
>>>
>>>
>>> On Fri, Nov 29, 2013 at 2:02 AM, Grega Kešpret <gr...@celtra.com> wrote:
>>>
>>> > The docs at
>>> > http://spark.incubator.apache.org/docs/latest/configuration.html
>>> > say: "Number of individual task failures before giving up on the job.
>>> > Should be greater than or equal to 1. Number of allowed retries = this
>>> > value - 1."
>>> >
>>> > However, the code at
>>> > https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala#L532
>>> > means that if I set spark.task.maxFailures to 1, the job only fails
>>> > after a task fails for the second time. Shouldn't that line be
>>> > corrected to if (numFailures(index) >= MAX_TASK_FAILURES) { ?
>>> >
>>> > I can open a pull request if this is the case.
>>> >
>>> > Thanks,
>>> > Grega
>>> > --
>>> > *Grega Kešpret*
>>> > Analytics engineer
>>> >
>>> > Celtra — Rich Media Mobile Advertising
>>> > celtra.com <http://www.celtra.com/> | @celtramobile <http://www.twitter.com/celtramobile>
>>> >
>>>
>>
>>
>
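
PS: To make the off-by-one from the quoted 29 Nov message concrete, here is a
tiny self-contained counting sketch (plain Scala, not Spark code; all names
are made up) comparing the current numFailures(index) > MAX_TASK_FAILURES
comparison with the >= version proposed in the pull request, for
spark.task.maxFailures = 1:

object MaxFailuresCheckDemo {
  // Number of failed attempts of a single, always-failing task before the
  // job would be aborted, under the given comparison.
  def attemptsUntilAbort(maxFailures: Int, strictlyGreater: Boolean): Int = {
    var numFailures = 0
    var aborted = false
    while (!aborted) {
      numFailures += 1 // this attempt failed
      aborted =
        if (strictlyGreater) numFailures > maxFailures // comparison as it is today
        else numFailures >= maxFailures                // comparison proposed in the PR
    }
    numFailures
  }

  def main(args: Array[String]): Unit = {
    println(attemptsUntilAbort(1, strictlyGreater = true))  // 2: aborts only on the second failure
    println(attemptsUntilAbort(1, strictlyGreater = false)) // 1: matches the documented behaviour
  }
}

With the > comparison a task is still allowed one retry even when
spark.task.maxFailures is 1, which would explain why the run above could
finish on the surviving worker instead of failing fast.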
