[ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SuYan updated SPARK-10796:
--------------------------
    Description: 
We hit this problem in Spark 1.3.0. I have also checked the latest Spark code, 
and I believe the problem still exists.

1. When a task in a running stage (call it stage0) throws a 
FetchFailedException, its TaskSet (call it taskset0.0) marks itself as zombie, 
the DAGScheduler calls markStageAsFinished, then resubmitFailedStages, and a 
new TaskSet (call it taskset0.1) is submitted.

2. Meanwhile the zombie TaskSet (taskset0.0): if an executor is lost, it may 
lose the results of tasks that had already succeeded. In the current code it 
will resubmit those tasks, but that is useless, because a zombie TaskSet will 
never be scheduled again (see the sketch below).
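
For illustration only, here is a rough Scala sketch of the behaviour described 
in 1. and 2. The names (SimpleTaskSetManager, etc.) are made up; this is not 
the real TaskSetManager/TaskSchedulerImpl code, just a simplified model of how 
a zombie TaskSet's resubmitted tasks can never be launched again:

{code}
import scala.collection.mutable

// Illustrative-only stand-in for a TaskSetManager.
class SimpleTaskSetManager(val stageId: Int, val attempt: Int) {
  var isZombie = false
  val pendingTasks = mutable.Queue[Int]()   // partitions waiting to run
  val runningTasks = mutable.Set[Int]()     // partitions currently running

  // On executor loss, tasks whose output lived on that executor are
  // "resubmitted" by putting them back into pendingTasks -- this happens
  // even when the manager is already a zombie.
  def executorLost(lostPartitions: Seq[Int]): Unit = {
    pendingTasks ++= lostPartitions
  }

  // But a zombie manager never launches anything again, so the resubmitted
  // tasks just sit in pendingTasks forever.
  def resourceOffer(): Option[Int] = {
    if (isZombie || pendingTasks.isEmpty) None
    else {
      val partition = pendingTasks.dequeue()
      runningTasks += partition
      Some(partition)
    }
  }
}
{code}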

So once the active TaskSet and the zombie TaskSet have both finished every 
task in `runningTasks`, Spark considers them finished, but the running stage 
still has pending partitions. The job then hangs, because there is no logic 
left to re-run those pending partitions.
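
In terms of the same sketch, the hang condition looks roughly like this 
(again, purely illustrative):

{code}
// Once every TaskSetManager of the stage has drained its runningTasks,
// the TaskSets are removed, but nothing is left to run the partitions
// that are still pending -- the stage never completes.
def stageIsStuck(taskSets: Seq[SimpleTaskSetManager],
                 pendingPartitions: Set[Int]): Boolean = {
  val allTaskSetsDrained = taskSets.forall(_.runningTasks.isEmpty)
  allTaskSetsDrained && pendingPartitions.nonEmpty
}
{code}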

The driver logic is complicated; it would be helpful if someone could look 
into this.



  was:
We hit this problem in Spark 1.3.0. I have also checked the latest Spark code, 
and I believe the problem still exists.

1. When a fetch failure occurs in a stage, the running stage is resubmitted 
and the previous TaskSet is marked as zombie.

2. If an executor is lost, the zombie TaskSet may lose the results of tasks 
that had already succeeded. In the current code it will resubmit those tasks, 
but that is useless, because a zombie TaskSet will never be scheduled again.

So once the active TaskSet and the zombie TaskSet have both finished every 
task in `runningTasks`, Spark considers them finished, but the running stage 
still has pending partitions. The job then hangs, because there is no logic 
left to re-run those pending partitions.

The driver logic is complicated; it would be helpful if someone could look 
into this.




> A stage's TaskSets may all be removed while the stage still has pending 
> partitions after losing some executors
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10796
>                 URL: https://issues.apache.org/jira/browse/SPARK-10796
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.3.0
>            Reporter: SuYan
>            Priority: Minor
>
> We hit this problem in Spark 1.3.0. I have also checked the latest Spark 
> code, and I believe the problem still exists.
> 1. When a task in a running stage (call it stage0) throws a 
> FetchFailedException, its TaskSet (call it taskset0.0) marks itself as 
> zombie, the DAGScheduler calls markStageAsFinished, then 
> resubmitFailedStages, and a new TaskSet (call it taskset0.1) is submitted.
> 2. Meanwhile the zombie TaskSet (taskset0.0): if an executor is lost, it may 
> lose the results of tasks that had already succeeded. In the current code it 
> will resubmit those tasks, but that is useless, because a zombie TaskSet 
> will never be scheduled again.
> So once the active TaskSet and the zombie TaskSet have both finished every 
> task in `runningTasks`, Spark considers them finished, but the running stage 
> still has pending partitions. The job then hangs, because there is no logic 
> left to re-run those pending partitions.
> The driver logic is complicated; it would be helpful if someone could look 
> into this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
