[ https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-10796:
------------------------------
    Component/s: Scheduler

> All of a stage's TaskSets may be removed while the stage still has pending
> partitions, after losing some executors
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10796
>                 URL: https://issues.apache.org/jira/browse/SPARK-10796
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.3.0
>            Reporter: SuYan
>            Priority: Minor
>
> We hit this problem in Spark 1.3.0, and I have also checked the latest Spark
> code; I believe the problem still exists.
> 1. When a stage hits a FetchFailed, the scheduler resubmits the running stage
> and marks the previous TaskSet as a zombie.
> 2. If an executor is then lost, the zombie TaskSet may lose the results of its
> already-successful tasks. The current code resubmits those tasks, but the
> resubmission is useless: the TaskSet is a zombie and will never be scheduled
> again.
> So once the active TaskSet and the zombie TaskSet have both drained
> `runningTasks`, Spark considers them finished, but the running stage still has
> pending partitions. The job then hangs, because there is no logic left to
> re-run those pending partitions. A sketch of the sequence follows below.
> The driver logic is complicated; it would be helpful if anyone could verify
> this.
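
To make the failure sequence easier to follow, here is a minimal, self-contained
Scala sketch. It is not Spark's actual scheduler code: `SimTaskSet`,
`Spark10796Demo`, and the executor names are invented for illustration, and the
real TaskSetManager/DAGScheduler bookkeeping is far more involved. The sketch
only models the interaction described above: a zombie TaskSet loses a finished
task's output when its executor dies, the partition becomes pending again, but
no live TaskSet can ever re-run it.

```scala
import scala.collection.mutable

// One attempt (TaskSet) of a stage. Tracks which tasks are still running and
// which executor holds each successful task's output. All names are invented.
class SimTaskSet(val stageId: Int, val attempt: Int, partitions: Seq[Int]) {
  var isZombie: Boolean = false
  val runningTasks: mutable.Set[Int] = mutable.Set.from(partitions)
  val successfulOn: mutable.Map[Int, String] = mutable.Map.empty // partition -> executor

  def taskSucceeded(partition: Int, executor: String): Unit = {
    runningTasks -= partition
    successfulOn(partition) = executor
  }

  // Executor loss invalidates outputs that lived on it. The real
  // TaskSetManager re-enqueues such tasks, but a zombie TaskSet is never
  // offered resources again, so the re-enqueued work can never actually run.
  def executorLost(executor: String): Seq[Int] = {
    val lostPartitions = successfulOn.collect { case (p, e) if e == executor => p }.toSeq
    lostPartitions.foreach(successfulOn -= _)
    lostPartitions
  }

  def isFinished: Boolean = runningTasks.isEmpty
}

object Spark10796Demo {
  def main(args: Array[String]): Unit = {
    val pendingPartitions = mutable.Set(0, 1) // stage-level bookkeeping

    // Attempt 0: partition 0 succeeds on exec-1, then a FetchFailed causes
    // the stage to be resubmitted and attempt 0 becomes a zombie.
    val zombie = new SimTaskSet(stageId = 1, attempt = 0, partitions = Seq(0, 1))
    zombie.taskSucceeded(0, "exec-1")
    pendingPartitions -= 0
    zombie.isZombie = true

    // Attempt 1 only covers partition 1, since partition 0 already succeeded.
    val active = new SimTaskSet(stageId = 1, attempt = 1, partitions = Seq(1))
    active.taskSucceeded(1, "exec-2")
    pendingPartitions -= 1
    // The zombie's duplicate in-flight copy of task 1 is no longer needed
    // once attempt 1 finishes it, so the zombie's runningTasks drains too.
    zombie.runningTasks -= 1

    // exec-1 dies: partition 0 becomes pending again, but only the zombie
    // TaskSet holds that task, and zombies are never scheduled.
    pendingPartitions ++= zombie.executorLost("exec-1")

    // Both TaskSets have empty runningTasks, so the scheduler considers them
    // finished, yet the stage still has a pending partition: the hang.
    println(s"zombie finished=${zombie.isFinished}, active finished=${active.isFinished}")
    println(s"stage pending partitions: $pendingPartitions") // expect Set(0)
  }
}
```

Running the demo prints both TaskSets as finished while partition 0 is still
pending, which is exactly the hang described above: nothing remains that will
re-run the lost partition.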