[ https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
SuYan updated SPARK-10796:
--------------------------
Description:
We hit this problem in Spark 1.3.0, and having checked the latest Spark code, I believe the problem still exists.

1. A running *ShuffleMapStage* can have multiple *TaskSets*: one active TaskSet and multiple zombie TaskSets.
2. A running *ShuffleMapStage* succeeds only when all of its partitions have been processed successfully, i.e. every task's *MapStatus* has been added into *outputLocs*.
3. The *MapStatus* entries of a running *ShuffleMapStage* may be produced by Zombie TaskSet 1 / Zombie TaskSet 2 / ... / the active TaskSet, and some *MapStatus* may belong to only one TaskSet, possibly a zombie one.
4. When an executor is lost, it can happen that some of the lost executor's *MapStatus* entries were produced by a zombie TaskSet. The current logic recovers a lost *MapStatus* by having each *TaskSet* re-run the tasks that succeeded on the lost executor: those tasks are re-added into the *TaskSet's pendingTasks*, and their partitions are re-added into the *Stage's pendingPartitions*. But this is useless when the lost *MapStatus* belongs only to a zombie TaskSet: a zombie TaskSet never schedules its *pendingTasks*.
5. A stage is resubmitted only when some task throws a *FetchFailedException*. The lost executor, however, may not hold any *MapStatus* of the parent stage of any running stage, while one running stage happens to have lost a *MapStatus* that belongs only to a zombie TaskSet. In that case, once every zombie TaskSet has finished its runningTasks and the active TaskSet has finished its pendingTasks, they are all removed by *TaskSchedulerImpl*, yet that running stage's *pendingPartitions* is still non-empty: the stage hangs forever. (A sketch of this scenario follows the list.)
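To make the failure mode concrete, here is a minimal, self-contained sketch in plain Scala. It is not actual Spark scheduler code: *TaskSetLite*, the executor ids *exec-1*/*exec-2*, and the two-partition stage are all hypothetical, chosen only to replay steps 4 and 5 above.

{code:scala}
import scala.collection.mutable

// Toy model of the scheduler state involved in the hang (hypothetical names,
// not the real DAGScheduler/TaskSetManager classes).
object ZombieTaskSetHang {

  // A TaskSet attempt; once superseded it is a "zombie": it finishes its
  // running tasks but never schedules anything from pendingPartitions.
  final case class TaskSetLite(attempt: Int,
                               zombie: Boolean,
                               pendingPartitions: mutable.Set[Int])

  def main(args: Array[String]): Unit = {
    // A ShuffleMapStage with partitions 0 and 1, none finished yet.
    val stagePendingPartitions = mutable.Set(0, 1)
    val outputLocs = mutable.Map.empty[Int, String] // partition -> executor

    val zombie = TaskSetLite(attempt = 0, zombie = true,  mutable.Set.empty[Int])
    val active = TaskSetLite(attempt = 1, zombie = false, mutable.Set(1))

    // The zombie attempt had already finished partition 0 on exec-1 before
    // being superseded, so that MapStatus belongs only to the zombie (step 3).
    outputLocs(0) = "exec-1"
    stagePendingPartitions -= 0

    // The active attempt finishes partition 1 on exec-2.
    outputLocs(1) = "exec-2"
    stagePendingPartitions -= 1
    active.pendingPartitions -= 1

    // exec-1 is lost: its MapStatus is dropped, and partition 0 is re-added to
    // the stage and to the TaskSet that produced it -- the zombie (step 4).
    outputLocs -= 0
    stagePendingPartitions += 0
    zombie.pendingPartitions += 0

    // No FetchFailedException is thrown (step 5), and only a non-zombie
    // TaskSet could run the re-added task -- but the active one has no copy.
    val schedulableWork =
      Seq(zombie, active).exists(ts => !ts.zombie && ts.pendingPartitions.nonEmpty)
    println(s"stage pendingPartitions = $stagePendingPartitions, " +
      s"schedulable work = $schedulableWork")
    // Prints: stage pendingPartitions = Set(0), schedulable work = false
    // => both TaskSets are eventually removed by the scheduler, yet the
    //    stage's pendingPartitions stays non-empty, so the stage hangs.
  }
}
{code}

Per the description above, the real scheduler reaches the same state when the lost task is re-enqueued into a zombie TaskSet's pendingTasks, which are never offered resources again.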
> A stage's TaskSets may all be removed while the stage still has pending
> partitions, after losing some executors
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10796
>                 URL: https://issues.apache.org/jira/browse/SPARK-10796
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.3.0, 1.4.0, 1.5.0
>            Reporter: SuYan
>            Priority: Minor