[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374884#comment-16374884 ]

Anirudh Ramanathan edited comment on SPARK-23485 at 2/23/18 7:36 PM:
---------------------------------------------------------------------

Stavros - we [do currently
differentiate|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L386-L398]
between Kubernetes causing an executor to disappear (e.g. a node failure) and an
exit caused by the application itself.
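
To make that concrete, the distinction essentially maps onto the
{{exitCausedByApp}} flag of {{ExecutorExited}}. A simplified sketch, not the
actual code at the linked lines (the parameter names here are illustrative):

{code:scala}
import org.apache.spark.scheduler.{ExecutorExited, ExecutorLossReason}

// Illustrative only: when Kubernetes removes the pod (e.g. its node died),
// the loss is not the application's fault; a non-zero container exit on a
// healthy node is attributed to the application.
def executorLossReason(
    podDeletedByKubernetes: Boolean,
    containerExitCode: Int): ExecutorLossReason = {
  if (podDeletedByKubernetes) {
    ExecutorExited(containerExitCode, exitCausedByApp = false,
      "Executor pod was deleted or lost, e.g. due to a node failure.")
  } else {
    ExecutorExited(containerExitCode, exitCausedByApp = true,
      s"Executor container exited with code $containerExitCode.")
  }
}
{code}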

Here's some detail on node issues and k8s:

Node-level problem detection is split between the Kubelet and the [Node
Problem Detector|https://github.com/kubernetes/node-problem-detector]. This
works for some common errors and, in the future, will taint nodes upon detecting
them. Some of these errors are listed
[here|https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L30:15].
However, there are categories of errors this setup won't detect. For example: a
node whose firewall rules/networking prevent an executor running on it from
reaching a particular external service, say to download or stream data; or a
node with a bad local disk that makes a Spark executor on it throw read/write
errors. These error conditions may affect only certain kinds of pods on that
node and not others.
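
Such errors would still surface as repeated task failures on that node, which
Spark's driver-side BlacklistTracker can already pick up. For reference, a
minimal configuration sketch (the thresholds are just example values):

{code:scala}
import org.apache.spark.SparkConf

// Illustrative configuration only: with blacklisting enabled, repeated
// failures on one node (e.g. from a bad disk or broken egress rules) get
// that node blacklisted by the driver's BlacklistTracker.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Number of distinct executors on a node that must fail before the whole
  // node is blacklisted for the application (example value).
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
{code}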

Yinan's point, I think, is that it is uncommon for applications on k8s to
incorporate reasoning about node-level conditions. I think this is because the
general expectation is that a failure on a given node will just cause new
executors to spin up on different nodes and the application will eventually
succeed. However, I can see this being an issue in large-scale production
deployments, where we'd see transient errors like the above. Given Spark's
existing blacklist mechanism and Kubernetes' anti-affinity primitives, I don't
think it would be too complex to incorporate; a sketch of what that might look
like is below.
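
For illustration, one way to wire this up would be for the executor pod
allocation path to read {{scheduler.nodeBlacklist()}} (as the YARN backend
does) and translate it into a required node-affinity term on new executor
pods. A rough sketch using the fabric8 client the k8s backend already depends
on; {{withNodeBlacklist}} is a hypothetical helper and the exact builder
method names may differ slightly by client version:

{code:scala}
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

// Hypothetical helper: exclude blacklisted nodes by adding a required
// node-affinity term ("kubernetes.io/hostname NotIn <blacklist>") to the
// executor pod spec before the pod is created.
def withNodeBlacklist(executorPod: Pod, blacklistedNodes: Set[String]): Pod = {
  if (blacklistedNodes.isEmpty) {
    executorPod
  } else {
    new PodBuilder(executorPod)
      .editOrNewSpec()
        .editOrNewAffinity()
          .editOrNewNodeAffinity()
            .editOrNewRequiredDuringSchedulingIgnoredDuringExecution()
              .addNewNodeSelectorTerm()
                .addNewMatchExpression()
                  .withKey("kubernetes.io/hostname")
                  .withOperator("NotIn")
                  .withValues(blacklistedNodes.toSeq: _*)
                .endMatchExpression()
              .endNodeSelectorTerm()
            .endRequiredDuringSchedulingIgnoredDuringExecution()
          .endNodeAffinity()
        .endAffinity()
      .endSpec()
      .build()
  }
}
{code}

Keeping the change local to pod construction like this would leave the
scheduler-side blacklist logic untouched.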

[~aash] [~mcheah], have you guys seen this in practice thus far? 



> Kubernetes should support node blacklist
> ----------------------------------------
>
>                 Key: SPARK-23485
>                 URL: https://issues.apache.org/jira/browse/SPARK-23485
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (e.g., because of bad hardware).  When running on YARN, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the Kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}}, so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on Mesos.


