SuYan created SPARK-10796:
-----------------------------

             Summary: All of a stage's TaskSets may be removed while the stage still has pending partitions, after losing some executors
                 Key: SPARK-10796
                 URL: https://issues.apache.org/jira/browse/SPARK-10796
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.3.0
            Reporter: SuYan
We hit this problem in Spark 1.3.0, and I also checked the latest Spark code; I think the problem still exists.

1. When a stage hits a FetchFailed, the DAGScheduler resubmits the running stage and marks the previous TaskSet as a zombie.
2. If an executor is then lost, the zombie TaskSet may lose the results of its already-successful tasks. The current code resubmits those tasks, but this is useless: the TaskSet is a zombie, so it will never be scheduled again.

So once the active TaskSet and the zombie TaskSet have each finished every task in their `runningTasks` sets, Spark considers both TaskSets finished, but the running stage still has pending partitions. The job then hangs, because there is no logic left to re-run those pending partitions.

The driver logic is complicated; it would be helpful if anyone could check this.
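To make the sequence concrete, below is a minimal, self-contained Scala sketch of the hang described above. It does not use the real Spark classes: `ToyTaskSet`, `stagePending`, and the method names are illustrative stand-ins for `TaskSetManager`, the stage's pending partitions, and the scheduler callbacks.

{code}
import scala.collection.mutable

// Toy model of the hang described above. All names here are
// illustrative, not the real Spark internals.
class ToyTaskSet(partitions: Set[Int], var isZombie: Boolean = false) {
  // Tasks (keyed by partition id) still running in this attempt.
  val runningTasks: mutable.Set[Int] = mutable.Set(partitions.toSeq: _*)

  def taskSucceeded(p: Int): Unit = runningTasks -= p

  // A real TaskSetManager would resubmit tasks whose output lived on a
  // lost executor, but a zombie set is never offered resources again,
  // so for a zombie the "resubmit" can never actually run.
  def executorLost(lostPartitions: Set[Int]): Unit =
    if (!isZombie) runningTasks ++= lostPartitions

  def isDone: Boolean = runningTasks.isEmpty
}

object ZombieHangDemo extends App {
  val stagePending = mutable.Set(0, 1) // partitions the stage still needs

  // Attempt 0 computes partition 0, then hits FetchFailed on partition 1:
  // the stage is resubmitted and attempt 0 becomes a zombie.
  val zombie = new ToyTaskSet(Set(0, 1))
  zombie.taskSucceeded(0); stagePending -= 0
  zombie.runningTasks -= 1            // task 1 failed, handed to the new attempt
  zombie.isZombie = true

  val active = new ToyTaskSet(Set(1)) // resubmitted attempt for partition 1

  // The executor holding partition 0's output dies: the partition is
  // pending again, but only the zombie "owns" it and cannot re-run it.
  zombie.executorLost(Set(0)); stagePending += 0

  active.taskSucceeded(1); stagePending -= 1

  // Both attempts have empty runningTasks, so the scheduler considers both
  // task sets finished and removes them, yet partition 0 is still pending:
  // nothing is left to schedule it, and the job hangs.
  println(s"zombie done=${zombie.isDone}, active done=${active.isDone}, pending=$stagePending")
  // => zombie done=true, active done=true, pending=Set(0)
}
{code}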