[ https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan updated SPARK-24755:
----------------------------------------
    Description:
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide if a task needs to be resubmitted for a partition.

Consider the following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively (one of them being a speculative task).

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" because T2 is still running.
We also end up killing task T2.

Now, if/when exec-1 goes MIA:
executorLost will no longer schedule a task for P1, since killedByOtherAttempt(P1) == true; even though P1's output was produced by T1 on exec-1 and there is no other copy of P1 around (T2 was killed when T1 succeeded).

I noticed this bug as part of reviewing PR #21653 for SPARK-13343.

Essentially, SPARK-22074 causes a regression (which I don't usually observe due to shuffle service, sigh) - and as such the fix is broken IMO: I believe it got introduced as part of the review (the original change looked fine to me - but I did not look at it in detail).

I don't have a PR handy for this, so if anyone wants to pick it up, please do feel free!

+CC [~XuanYuan] who fixed SPARK-22074 initially.

  was:
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide if a task needs to be resubmitted for a partition.

Consider the following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively (one of them being a speculative task).

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" because T2 is still running.
We also end up killing task T2.

Now, if/when exec-1 goes MIA:
executorLost will no longer schedule a task for P1, since killedByOtherAttempt(P1) == true; even though P1's output was produced by T1 on exec-1 and there is no other copy of P1 around (T2 was killed, not T1 - which was successful).

I noticed this bug as part of reviewing PR #21653 for SPARK-13343.

Essentially, SPARK-22074 causes a regression (which I don't usually observe due to shuffle service, sigh) - and as such the fix is broken IMO: I believe it got introduced as part of the review (the original change looked fine to me - but I did not look at it in detail).

I don't have a PR handy for this, so if anyone wants to pick it up, please do feel free!

+CC [~XuanYuan] who fixed SPARK-22074 initially.


> Executor loss can cause task to be not resubmitted
> --------------------------------------------------
>
>                 Key: SPARK-24755
>                 URL: https://issues.apache.org/jira/browse/SPARK-24755
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Mridul Muralidharan
>            Priority: Major
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide
> if a task needs to be resubmitted for a partition.
> Consider the following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2
> respectively (one of them being a speculative task).
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" because T2 is
> still running.
> We also end up killing task T2.
> Now, if/when exec-1 goes MIA:
> executorLost will no longer schedule a task for P1, since
> killedByOtherAttempt(P1) == true; even though P1's output was produced by T1
> on exec-1 and there is no other copy of P1 around (T2 was killed when T1
> succeeded).
> I noticed this bug as part of reviewing PR #21653 for SPARK-13343.
> Essentially, SPARK-22074 causes a regression (which I don't usually observe
> due to shuffle service, sigh) - and as such the fix is broken IMO: I believe
> it got introduced as part of the review (the original change looked fine to
> me - but I did not look at it in detail).
> I don't have a PR handy for this, so if anyone wants to pick it up, please do
> feel free!
> +CC [~XuanYuan] who fixed SPARK-22074 initially.
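
The scenario in the description can be walked through with a small toy model. This is only a sketch in Python, not Spark's TaskSetManager: the class `ToyTaskSet`, its fields, and `run_scenario` are hypothetical names made up for illustration, and the model ignores everything except the two per-partition flags and the quoted condition. It assumes, as the report describes, that the flag is set per partition whenever another attempt is still running at the time one attempt succeeds.

```python
class ToyTaskSet:
    """Toy stand-in for the per-partition state in the report (not Spark code)."""

    def __init__(self, num_partitions):
        self.successful = [False] * num_partitions
        self.killed_by_other_attempt = [False] * num_partitions
        # Executors with a running attempt, per partition.
        self.running = [set() for _ in range(num_partitions)]
        # Executor holding the successful attempt's output, per partition.
        self.output_on = [None] * num_partitions
        # Partitions that executor_lost decided to run again.
        self.resubmitted = []

    def launch(self, index, executor):
        self.running[index].add(executor)

    def task_succeeded(self, index, executor):
        self.running[index].discard(executor)
        self.successful[index] = True
        self.output_on[index] = executor
        if self.running[index]:
            # Another attempt is still running: flag the partition and kill it.
            self.killed_by_other_attempt[index] = True
            self.running[index].clear()

    def executor_lost(self, executor):
        for index, host in enumerate(self.output_on):
            if host != executor:
                continue
            # The condition quoted in the report.
            if self.successful[index] and not self.killed_by_other_attempt[index]:
                self.successful[index] = False
                self.output_on[index] = None
                self.resubmitted.append(index)


def run_scenario(speculative):
    """P1 is index 0; T1 runs on exec-1, optional speculative T2 on exec-2."""
    ts = ToyTaskSet(num_partitions=1)
    ts.launch(0, "exec-1")
    if speculative:
        ts.launch(0, "exec-2")
    ts.task_succeeded(0, "exec-1")  # T1 finishes first; T2 (if any) is killed
    ts.executor_lost("exec-1")      # exec-1 goes MIA
    return ts.resubmitted


print(run_scenario(speculative=False))  # [0]
print(run_scenario(speculative=True))   # []
```

In the toy model, the run without a second attempt resubmits P1 after exec-1 is lost, while the run with the killed attempt T2 resubmits nothing, even though the only output for P1 lived on exec-1. The per-partition flag records that some attempt was killed, but not that the killed attempt was a different one from the attempt whose output was just lost.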