[jira] [Updated] (SPARK-24755) Executor loss can cause task to be not resubmitted

Mridul Muralidharan (JIRA) Sat, 07 Jul 2018 00:32:15 -0700


     [ https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mridul Muralidharan updated SPARK-24755:
----------------------------------------
    Description: 
As part of SPARK-22074, when an executor is lost, TaskSetManager.executorLost 
(TSM.executorLost) currently checks 
"if (successful(index) && !killedByOtherAttempt(index))" to decide whether the 
task for a partition needs to be resubmitted.

Consider the following scenario:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being a speculative task).

T1 finishes successfully first.

This sets "killedByOtherAttempt(P1) = true", because T2 is still running.
We also end up killing task T2.

Now suppose exec-1 is lost at some later point.
executorLost will no longer reschedule a task for P1, since 
killedByOtherAttempt(P1) == true, even though the output for P1 was produced by 
T1 on exec-1 and there is no other copy of P1 around (T2 was killed when T1 
succeeded).
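
The sequence above can be replayed with a small toy model. This is only a 
sketch in Python, not Spark's TaskSetManager code: the class ToyTaskSet, its 
methods, and the replay helper are hypothetical names, and the model keeps just 
the two per-partition flags and the check quoted in the description.

```python
class ToyTaskSet:
    """Toy stand-in for the bookkeeping described in the report (not Spark code)."""

    def __init__(self, num_partitions):
        self.successful = [False] * num_partitions
        self.killed_by_other_attempt = [False] * num_partitions
        self.attempts = {}    # task id -> (partition index, executor id)
        self.running = set()  # task ids that are still running
        self.pending = []     # partition indexes queued for resubmission

    def launch(self, task_id, index, executor):
        self.attempts[task_id] = (index, executor)
        self.running.add(task_id)

    def task_succeeded(self, task_id):
        index, _ = self.attempts[task_id]
        self.running.discard(task_id)
        self.successful[index] = True
        # Kill every other running attempt for the same partition and set
        # the partition-wide flag, as described in the report.
        for other_id in sorted(self.running):
            if self.attempts[other_id][0] == index:
                self.killed_by_other_attempt[index] = True
                self.running.discard(other_id)

    def executor_lost(self, executor):
        for task_id, (index, exec_id) in self.attempts.items():
            if exec_id != executor:
                continue
            # The check quoted in the description.
            if self.successful[index] and not self.killed_by_other_attempt[index]:
                self.successful[index] = False
                self.pending.append(index)


def replay(speculative):
    """Run T1 on exec-1 (plus T2 on exec-2 if speculative), then lose exec-1."""
    ts = ToyTaskSet(num_partitions=1)
    ts.launch("T1", 0, "exec-1")
    if speculative:
        ts.launch("T2", 0, "exec-2")
    ts.task_succeeded("T1")
    ts.executor_lost("exec-1")
    return ts.pending
```

In this model, replay(speculative=True) leaves nothing pending for P1 after 
exec-1 is lost, while replay(speculative=False) queues the partition again.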


I noticed this bug while reviewing PR #21653 for SPARK-13343.

Essentially, SPARK-22074 causes a regression (which I don't usually observe due 
to the shuffle service, sigh), and as such the fix is broken IMO. I believe the 
problem was introduced during the review (the original change looked fine to 
me, but I did not look at it in detail).
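
One possible direction, stated here as an assumption rather than something the 
report proposes, is to remember which task attempt was killed instead of 
flagging the whole partition. The Python sketch below is a toy model, not Spark 
code; PerAttemptTaskSet, killed_attempts, and pending_after_loss are 
hypothetical names.

```python
class PerAttemptTaskSet:
    """Hypothetical variant: records killed attempts by task id, not by partition."""

    def __init__(self, num_partitions):
        self.successful = [False] * num_partitions
        self.killed_attempts = set()  # task ids killed because another attempt won
        self.attempts = {}            # task id -> (partition index, executor id)
        self.running = set()          # task ids that are still running
        self.pending = []             # partition indexes queued for resubmission

    def launch(self, task_id, index, executor):
        self.attempts[task_id] = (index, executor)
        self.running.add(task_id)

    def task_succeeded(self, task_id):
        index, _ = self.attempts[task_id]
        self.running.discard(task_id)
        self.successful[index] = True
        # Record each killed attempt individually.
        for other_id in sorted(self.running):
            if self.attempts[other_id][0] == index:
                self.killed_attempts.add(other_id)
                self.running.discard(other_id)

    def executor_lost(self, executor):
        for task_id, (index, exec_id) in self.attempts.items():
            if exec_id != executor:
                continue
            # Resubmit only if the attempt on the lost executor was not itself
            # one of the killed attempts.
            if self.successful[index] and task_id not in self.killed_attempts:
                self.successful[index] = False
                self.pending.append(index)


def pending_after_loss(lost_executor):
    """T1 (exec-1) wins over T2 (exec-2); then the given executor is lost."""
    ts = PerAttemptTaskSet(num_partitions=1)
    ts.launch("T1", 0, "exec-1")
    ts.launch("T2", 0, "exec-2")
    ts.task_succeeded("T1")
    ts.executor_lost(lost_executor)
    return ts.pending
```

Under these assumptions, losing exec-1 queues P1 again, while losing exec-2 
(which only hosted the killed attempt T2) does not trigger a resubmission.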

I don't have a PR handy for this, so if anyone wants to pick it up, please feel 
free!
+CC [~XuanYuan] who fixed SPARK-22074 initially.

  was:
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
if task needs to be resubmitted for partition.

Consider following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively 
(one of them being speculative task)

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
We also end up killing task T2.

Now, exec-1 if/when goes MIA.
executorLost will no longer schedule task for P1 - since 
killedByOtherAttempt(P1) == true; even though P1 was hosted on T1 and there is 
no other copy of P1 around (T2 was killed, not T1 - which was successful).


I noticed this bug as part of reviewing PR# 21653 for SPARK-13343

Essentially, SPARK-22074 causes a regression (which I dont usually observe due 
to shuffle service, sigh) - and as such the fix is broken IMO : I believe it 
got introduced as part of the review (the original change looked fine to me - 
but I did not look at it in detail).

I dont have a PR handy for this, so if anyone wants to pick it up, please do 
feel free !
+CC [~XuanYuan] who fixed SPARK-22074 initially.


> Executor loss can cause task to be not resubmitted
> --------------------------------------------------
>
>                 Key: SPARK-24755
>                 URL: https://issues.apache.org/jira/browse/SPARK-24755
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Mridul Muralidharan
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

