GitHub user squito commented on the pull request:

https://github.com/apache/spark/pull/8180#issuecomment-134294669

I've only recently looked at making changes to the scheduler, but it seems to me there is widespread agreement among committers that it is very error prone. For example, consider Andrew Or's plea in [SPARK-8987](https://issues.apache.org/jira/browse/SPARK-8987), which starts with:

> DAGScheduler is one of the most monstrous piece of code in Spark.

Other recent examples of similar sentiments are in this [discussion on backporting SPARK-8103](https://github.com/apache/spark/pull/7572) or the confusion in the earlier versions of SPARK-5945, SPARK-7308, and SPARK-8103. Even this [seemingly innocuous three-line change](https://github.com/apache/spark/commit/702aa9d7fb16c98a50e046edfd76b8a7861d0391#diff-6a9ff7fb74fd490a50462d45db2d5e11R792) inadvertently introduced SPARK-9809 (it's really lucky that somebody stumbled on that before the release).

I've been working on issues related to fault tolerance, primarily SPARK-8103 & SPARK-8029, which came from real customer escalations. Those took me a *long* time to wrap my head around: I had to painfully make sense of user logs, create a reproduction, propose a fix, convince others there really was something wrong, and get lots of help to make the right fix. I did a bit of fault-injection testing as well, and things seemed to pass consistently after my fixes, so I was hoping that would be the end of the story.

But then I dug through some existing JIRAs and found [SPARK-5259](https://issues.apache.org/jira/browse/SPARK-5259). I couldn't believe it had been open since January! A community member had discovered it and even described very clearly exactly how it happened, but we still haven't fixed it. We're about to release Spark 1.5 with it still broken, which means that's at least 3 releases where fault tolerance is knowingly broken. I find that embarrassing.

I'm not saying all of this to denigrate the effort that everyone has already put into the scheduler, but I want to be clear that I really do mean it: we are unable to deal with the complexity of the scheduler. IMO, the highest reward would come from fixing the fault-tolerance issues and focusing on testing the scheduler so we gain more confidence in it. So while this feature is interesting, I think we should proceed very cautiously.