[ https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248367#comment-14248367 ]
Harry Brundage commented on SPARK-4732:
---------------------------------------

Seems like it would; feel free to mark this as a duplicate!

> All application progress on the standalone scheduler can be halted by one systematically faulty node
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4732
>                 URL: https://issues.apache.org/jira/browse/SPARK-4732
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>         Environment: - Spark Standalone scheduler
>            Reporter: Harry Brundage
>
> We've experienced several cluster-wide outages caused by unexpected system-wide faults on one of our Spark workers when that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail for some reason outside Spark's control, such as the log directory's disk being completely out of space, or a permissions error on a file that is always read during executor launch. We screw up all the time on our team and cause things like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures.
>
> When there are more tasks to do than executors, I am pretty sure the scheduler simply waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process notices that it's dead. The master then allocates that worker's now-free cores/memory to a currently running application that is below its spark.cores.max, which in our case is usually the app that just had the executor die. A new executor gets spawned on the same worker the last one just died on, gets allocated the one task that failed, and the whole process fails again for the same systematic reason; lather, rinse, repeat. This happens 10 times, or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely.
>
> This happens to all applications in the cluster as well. We usually run roughly as many cores as we have Hadoop nodes. We also usually have many more input splits than we have tasks, which means the locality of the first few tasks, which I believe determines where our executors run, is well spread out over the cluster and often covers 90-100% of nodes. This means the likelihood of any application getting an executor scheduled on any broken node is quite high. After an old application goes through the above-mentioned process and dies, the next application to start, or to not be at its requested max capacity, gets an executor scheduled on the broken node and is promptly taken down as well. This happens over and over, to the point where none of our Spark jobs make any progress because of one tiny permissions mistake on one node.
>
> Now, I totally understand this is usually an "error between keyboard and screen" kind of situation, where it is the responsibility of the people deploying Spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk-full errors, one node not getting a new Spark jar because of a configuration error, configurations being out of sync, etc.
> That said, disks are going to fail or half-fail, fill up, node rot is going to ruin configurations, and so on, and as Hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'.
>
> I think a good, simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt-and-suspenders fix for SPARK-4498.
>
> If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important, and I think they would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard. I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine.
>
> Please let me know if I've misunderstood how the scheduler works, or if you need more information or anything like that, and I'll be happy to provide it.
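
For concreteness, here is a minimal sketch of the kind of per-worker blacklist described above, assuming the master kept a simple failure counter per worker. The names WorkerBlacklist, maxWorkerFailures, and usableWorkers are invented for illustration and are not part of Spark's standalone Master; the threshold would presumably be set lower than spark.task.maxFailures, as the proposal suggests.

    import scala.collection.mutable

    // Hypothetical helper, not actual Spark code: counts abnormal executor
    // exits per worker and hides a worker from scheduling past a threshold.
    class WorkerBlacklist(maxWorkerFailures: Int) {

      // workerId -> number of executors that have exited abnormally on that worker
      private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

      // Record an abnormal executor exit reported for the given worker, and log
      // loudly once the worker crosses the threshold, rather than failing the app.
      def recordExecutorFailure(workerId: String): Unit = {
        failures(workerId) += 1
        if (isBlacklisted(workerId)) {
          println(s"Worker $workerId blacklisted after ${failures(workerId)} executor failures")
        }
      }

      // A worker is skipped for new executors once it crosses the threshold.
      def isBlacklisted(workerId: String): Boolean =
        failures(workerId) >= maxWorkerFailures

      // The set of workers the master would still consider when launching executors.
      def usableWorkers(workers: Seq[String]): Seq[String] =
        workers.filterNot(isBlacklisted)
    }

    // Example: after three failures on a systematically broken worker, only the
    // healthy worker remains schedulable, so other applications keep making progress.
    object WorkerBlacklistExample extends App {
      val blacklist = new WorkerBlacklist(maxWorkerFailures = 3)
      (1 to 3).foreach(_ => blacklist.recordExecutorFailure("worker-broken"))
      println(blacklist.usableWorkers(Seq("worker-broken", "worker-ok"))) // List(worker-ok)
    }

The point of counting failures per worker rather than per task is that the faulty node gets excluded after a handful of failed executor launches, instead of dragging every application that lands on it up to its task-failure limit.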