SuYan created SPARK-10796:
-----------------------------

             Summary: A stage's TaskSets may all be removed while the stage 
still has pending partitions, after some executors have been lost
                 Key: SPARK-10796
                 URL: https://issues.apache.org/jira/browse/SPARK-10796
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.3.0
            Reporter: SuYan


We hit this problem in Spark 1.3.0. I have also checked the latest Spark 
code, and I believe the problem still exists.

1. When a stage hits a FetchFailed, the scheduler resubmits the running stage 
and marks the previous TaskSet attempt as a zombie.

2. If an executor is then lost, the zombie TaskSet may lose the results of 
tasks that had already succeeded. The current code re-enqueues those tasks, 
but that is useless: the TaskSet is a zombie, so it will never be scheduled 
again. A sketch of both steps follows this list.
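
For concreteness, here is a minimal, self-contained Scala sketch of those two 
steps. The names (TaskSetManager, isZombie, executorLost, resourceOffer) mirror 
Spark's scheduler concepts, but this is illustrative code, not the actual 
Spark source:

```scala
import scala.collection.mutable

// Illustrative stand-in for Spark's TaskSetManager; simplified on purpose.
class TaskSetManager {
  // Step 1: after a FetchFailed, the old attempt is marked a zombie
  // (the stage itself is resubmitted as a fresh TaskSet elsewhere).
  var isZombie: Boolean = false

  // Partitions waiting to run, and partitions that already succeeded.
  val pendingTasks = mutable.Queue.empty[Int]
  val successful = mutable.Set.empty[Int]

  def handleFetchFailed(): Unit = {
    isZombie = true
  }

  // Step 2: when an executor is lost, results of already-successful
  // tasks that lived on it are gone, so they are re-queued as pending ...
  def executorLost(lostTasks: Seq[Int]): Unit = {
    lostTasks.foreach { t =>
      successful -= t
      pendingTasks.enqueue(t)
    }
  }

  // ... but a zombie is never offered resources, so the re-queue is useless.
  def resourceOffer(): Option[Int] =
    if (isZombie) None else pendingTasks.dequeueFirst(_ => true)
}
```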

So once the active TaskSet and the zombie TaskSet have each finished every 
task in `runningTasks`, Spark considers both of them finished, but the running 
stage still has pending partitions. The job then hangs, because there is no 
logic left to re-run those pending partitions, as the example below shows.
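
Using the sketch above, the hang plays out as follows (partition numbers are 
made up for illustration): the zombie attempt finished partition 2 on an 
executor that later dies, the new attempt finishes partitions 0 and 1, and 
nobody is left to re-run partition 2:

```scala
// Continues the TaskSetManager sketch above: both attempts are "done",
// yet one partition is still pending with no TaskSet left to run it.
object HangDemo extends App {
  val stagePartitions = Set(0, 1, 2)

  val zombie = new TaskSetManager
  zombie.successful += 2          // partition 2 succeeded on executor E
  zombie.handleFetchFailed()      // FetchFailed: this attempt becomes a zombie

  val active = new TaskSetManager
  active.successful ++= Set(0, 1) // the new attempt re-runs partitions 0 and 1

  zombie.executorLost(Seq(2))     // E dies; partition 2's result is lost

  // The zombie re-queued partition 2, but it never launches anything:
  assert(zombie.resourceOffer().isEmpty)

  val finished = zombie.successful ++ active.successful
  val pending = stagePartitions -- finished
  println(s"pending partitions nobody will run: $pending") // Set(2) -> hang
}
```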

The driver logic is complicated; it would be helpful if someone could verify 
this.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
