[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375005#comment-16375005 ]
Imran Rashid commented on SPARK-23485:
--------------------------------------

{quote}
I think this is because the general expectation is that a failure on a given node will just cause new executors to spin up on different nodes and eventually the application will succeed.
{quote}

I think this is the part which may be particularly different in Spark. Some types of failures do not cause the executor to die -- it's just a task failure, and the executor itself is still alive. As long as Spark gets heartbeats from the executor, it figures the executor is still fine. But a bad disk can cause *tasks* to repeatedly fail. The same could be true for other resources, e.g. a bad GPU, where the GPU is only used by certain tasks.

When that happens, without Spark's internal blacklisting, an application will very quickly hit many task failures. A task fails, Spark notices, tries to find a place to reassign the failed task, and puts it back in the same place; repeat until Spark decides there have been too many failures and gives up. This can easily cause your app to fail in ~1 second. There is no communication with the cluster manager during this process; it's all just between the Spark driver & executor. In one case where this happened, YARN's own health checker discovered the problem a few minutes after it occurred -- but the Spark app had already failed by that point. All from one bad disk in a cluster with > 1000 disks.

Spark's blacklisting is really meant to be complementary to the type of node health checks you are talking about in Kubernetes. The blacklisting in Spark intentionally does not try to figure out the root cause of the problem, as we don't want to get into the game of enumerating all of the possibilities. It's a heuristic which makes it safe for Spark to keep going in the face of these un-caught errors, but then retries the resources later, when it should be safe to do so. (Discussed in more detail in the design doc on SPARK-8425.)

Anti-affinity in Kubernetes may be just the trick, though this part of the doc was a little worrisome:

{quote}
Note: Inter-pod affinity and anti-affinity require substantial amount of processing which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.
{quote}

Blacklisting is *most* important in large clusters. Inter-pod anti-affinity seems to be doing something much more complicated than a simple node blacklist, though -- maybe scheduling would already be faster with just a simple node-level rule? (A rough sketch of that wiring is below, after the issue description.)

> Kubernetes should support node blacklist
> ----------------------------------------
>
>                 Key: SPARK-23485
>                 URL: https://issues.apache.org/jira/browse/SPARK-23485
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not use for running tasks (e.g., because of bad hardware). When running on YARN, this blacklist is used to avoid ever allocating resources on blacklisted nodes:
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the Kubernetes code, so apologies if this is incorrect -- but I didn't see any references to {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}}, so it seems this is missing. Thought of this while looking at SPARK-19755, a similar issue on Mesos.
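To make the suggestion concrete, here is a minimal, hypothetical sketch of how {{KubernetesClusterSchedulerBackend}} could translate {{scheduler.nodeBlacklist()}} into a node-level affinity rule on executor pods. The case classes below are stand-ins that only mirror the shape of the Kubernetes {{nodeAffinity}} fields ({{requiredDuringSchedulingIgnoredDuringExecution}}, {{nodeSelectorTerms}}, {{matchExpressions}}); a real implementation would use the fabric8 client's model classes, and names like {{ExecutorPodAntiAffinity}} and {{fromBlacklist}} are made up for illustration.

{code:scala}
// Hypothetical stand-ins for the Kubernetes node-affinity model -- only to show the
// shape of the data. The real backend would build these via the fabric8 client instead.
case class MatchExpression(key: String, operator: String, values: Seq[String])
case class NodeSelectorTerm(matchExpressions: Seq[MatchExpression])
case class NodeAffinity(requiredDuringSchedulingIgnoredDuringExecution: Seq[NodeSelectorTerm])

object ExecutorPodAntiAffinity {

  // Standard well-known Kubernetes label carrying the node's hostname.
  private val HostnameLabel = "kubernetes.io/hostname"

  /**
   * Turn Spark's current node blacklist (the same Set[String] of hostnames that
   * YarnSchedulerBackend forwards to the YARN AM) into a hard node-affinity rule:
   * only schedule the executor pod on nodes whose hostname is NOT in the blacklist.
   */
  def fromBlacklist(blacklistedNodes: Set[String]): Option[NodeAffinity] = {
    if (blacklistedNodes.isEmpty) {
      None
    } else {
      val avoidBadNodes = MatchExpression(
        key = HostnameLabel,
        operator = "NotIn",
        values = blacklistedNodes.toSeq.sorted)
      Some(NodeAffinity(Seq(NodeSelectorTerm(Seq(avoidBadNodes)))))
    }
  }
}

// Rough idea of where the scheduler backend might apply this when it requests a new
// executor pod (pseudocode -- the real pod construction lives in the executor pod factory):
//
//   val affinity = ExecutorPodAntiAffinity.fromBlacklist(scheduler.nodeBlacklist())
//   affinity.foreach(a => /* attach to the executor pod spec before creating the pod */)
{code}

Note this is plain node affinity (a {{NotIn}} match on a node label), not inter-pod anti-affinity, so it should sidestep the scaling caveat quoted above: the scheduler only compares each node's labels against a static list, rather than evaluating pairwise relationships between pods.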