[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375005#comment-16375005 ]

Imran Rashid commented on SPARK-23485:
--------------------------------------

{quote}
I think this is because the general expectation is that a failure on a given 
node will just cause new executors to spin up on different nodes and eventually 
the application will succeed.
{quote}

I think this is the part which may be particularly different in Spark.  Some 
types of failures do not cause the executor to die -- it's just a task failure, 
and the executor itself is still alive.  As long as Spark gets heartbeats from 
the executor, it figures it's still fine.  But a bad disk can cause *tasks* to 
repeatedly fail.  The same could be true for other resources, e.g. a bad GPU, 
where maybe the GPU is only used by certain tasks.

When that happens, without Spark's internal blacklisting, an application will 
very quickly hit many task failures.  The task fails, Spark notices that, tries 
to find a place to assign the failed task, and puts it back in the same place; 
this repeats until Spark decides there are too many failures and gives up.  It 
can easily cause your app to fail in ~1 second.  There is no communication with 
the cluster manager through this process; it's all just between Spark's driver 
and executors.  In one case where this happened, YARN's own health checker 
discovered the problem a few minutes after it occurred -- but the Spark app had 
already failed by that point, all from one bad disk in a cluster with > 1000 
disks.
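To make that concrete, here is a rough sketch of why the app dies so fast (this 
is an illustration, not Spark's actual TaskSetManager code): every retry lands 
right back on the same bad node, and once the attempt count reaches 
{{spark.task.maxFailures}} (4 by default) the job is aborted, with the cluster 
manager never consulted along the way.

{code:scala}
// Rough illustration only -- not Spark's scheduler code.  Without
// blacklisting, the retried task keeps landing on the same bad node.
object FastFailSketch {
  val maxTaskFailures = 4  // mirrors the spark.task.maxFailures default

  // Pretend any task on "bad-node" fails because of its broken disk.
  def runTaskOn(node: String): Boolean = node != "bad-node"

  def main(args: Array[String]): Unit = {
    var attempts = 0
    var succeeded = false
    while (!succeeded && attempts < maxTaskFailures) {
      val chosenNode = "bad-node"  // scheduler keeps offering the same slot
      succeeded = runTaskOn(chosenNode)
      attempts += 1
    }
    if (!succeeded) {
      // This is where Spark aborts the stage and fails the job; the
      // cluster manager was never involved.
      println(s"Job aborted after $attempts failed attempts on the same node")
    }
  }
}
{code}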

Spark's blacklisting is really meant to be complementary to the type of node 
health checks you are talking about in Kubernetes.  The blacklisting in Spark 
intentionally does not try to figure out the root cause of the problem, as we 
don't want to get into the game of enumerating all of the possibilities.  It's 
a heuristic which makes it safe for Spark to keep going in the face of these 
uncaught errors, but then retries the resources when it should be safe to do 
so.  (This is discussed in more detail in the design doc on SPARK-8425.)
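For reference, the heuristic is driven entirely by configuration on the Spark 
side; enabling it looks something like the snippet below (the keys are the ones 
covered by the SPARK-8425 design doc and the configuration docs; the values 
here are only illustrative):

{code:scala}
import org.apache.spark.SparkConf

// Illustrative values; see the configuration docs for the actual defaults.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // how many times one task may fail on a single executor / node
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
  // when an executor or a whole node gets blacklisted for a stage
  .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
  // how long until a blacklisted resource is considered safe to retry
  .set("spark.blacklist.timeout", "1h")
{code}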

Anti-affinity in Kubernetes may be just the trick, though this part of the docs 
is a little worrisome:

{quote}
Note: Inter-pod affinity and anti-affinity require substantial amount of 
processing which can slow down scheduling in large clusters significantly. We 
do not recommend using them in clusters larger than several hundred nodes.
{quote}

Blacklisting is *most* important in large clusters.  It seems like inter-pod 
anti-affinity can express something much more complicated than a simple node 
blacklist, though -- maybe scheduling would already be faster with such a 
simple anti-affinity rule?
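For a plain node blacklist we may not even need the pod-to-pod comparison that 
warning seems to be about: a required node-affinity term with a NotIn match on 
kubernetes.io/hostname only looks at node labels.  A rough sketch of what that 
could look like on the driver side -- assuming the fabric8 client model classes 
the k8s backend already uses expose the usual setters; the helper name and 
wiring are hypothetical:

{code:scala}
import io.fabric8.kubernetes.api.model._
import scala.collection.JavaConverters._

// Hypothetical sketch: translate scheduler.nodeBlacklist() into a required
// node-affinity rule ("never schedule executor pods on these hostnames").
def blacklistAffinity(blacklistedNodes: Set[String]): Affinity = {
  val requirement = new NodeSelectorRequirement()
  requirement.setKey("kubernetes.io/hostname")
  requirement.setOperator("NotIn")
  requirement.setValues(blacklistedNodes.toList.asJava)

  val term = new NodeSelectorTerm()
  term.setMatchExpressions(List(requirement).asJava)

  val selector = new NodeSelector()
  selector.setNodeSelectorTerms(List(term).asJava)

  val nodeAffinity = new NodeAffinity()
  nodeAffinity.setRequiredDuringSchedulingIgnoredDuringExecution(selector)

  val affinity = new Affinity()
  affinity.setNodeAffinity(nodeAffinity)
  // attach via executorPod.getSpec.setAffinity(affinity) when building pods
  affinity
}
{code}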

> Kubernetes should support node blacklist
> ----------------------------------------
>
>                 Key: SPARK-23485
>                 URL: https://issues.apache.org/jira/browse/SPARK-23485
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (e.g., because of bad hardware).  When running on YARN, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the Kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}}, so it 
> seems this is missing.  I thought of this while looking at SPARK-19755, a 
> similar issue on Mesos.


