[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374884#comment-16374884 ]
Anirudh Ramanathan edited comment on SPARK-23485 at 2/23/18 7:36 PM:
---------------------------------------------------------------------

Stavros - we [do currently differentiate|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L386-L398] between Kubernetes causing an executor to disappear (node failure) and an exit caused by the application itself.

Here's some detail on node issues and k8s: node-level problem detection is split between the Kubelet and the [Node Problem Detector|https://github.com/kubernetes/node-problem-detector]. This works for some common errors and, in the future, will taint nodes upon detecting them; some of the detected error conditions are listed [here|https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L30:15]. However, there are categories of errors this setup won't detect. For example: a node whose firewall rules or networking prevent an executor running on it from accessing a particular external service, say to download or stream data; or a node with local disk issues that make a Spark executor on it throw read/write errors. These error conditions may affect only certain kinds of pods on a node and leave others untouched.

Yinan's point, I think, is that it is uncommon for applications on k8s to incorporate reasoning about node-level conditions. The general expectation is that a failure on a given node will simply cause new executors to spin up on different nodes, and the application will eventually succeed. However, I can see this being an issue in large-scale production deployments, where we'd see transient errors like the above. Given the existing blacklist mechanism and k8s anti-affinity primitives, I don't think it would be too complex to incorporate. [~aash] [~mcheah], have you seen this in practice so far?
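To make the anti-affinity idea concrete, here is a minimal sketch of how the driver's blacklist could be stamped onto each new executor pod as a required node anti-affinity rule, using the fabric8 builders the k8s backend already depends on. The helper name {{applyNodeBlacklist}} is an assumption for illustration, not existing Spark code:

{code:scala}
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

// Sketch only: keep executors off blacklisted nodes by adding a required
// node-affinity rule ("NotIn" on the hostname label) to the executor pod
// spec before it is submitted to the API server.
def applyNodeBlacklist(executorPod: Pod, blacklistedNodes: Set[String]): Pod = {
  if (blacklistedNodes.isEmpty) {
    executorPod
  } else {
    new PodBuilder(executorPod)
      .editOrNewSpec()
        .editOrNewAffinity()
          .editOrNewNodeAffinity()
            .editOrNewRequiredDuringSchedulingIgnoredDuringExecution()
              .addNewNodeSelectorTerm()
                .addNewMatchExpression()
                  .withKey("kubernetes.io/hostname") // well-known node label
                  .withOperator("NotIn")
                  .withValues(blacklistedNodes.toSeq: _*)
                .endMatchExpression()
              .endNodeSelectorTerm()
            .endRequiredDuringSchedulingIgnoredDuringExecution()
          .endNodeAffinity()
        .endAffinity()
      .endSpec()
      .build()
  }
}
{code}

Because the blacklist grows as tasks fail, the rule would need to be recomputed on every executor request; executors already running on a newly blacklisted node would still be covered by the existing task-level blacklisting.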
> Kubernetes should support node blacklist
> ----------------------------------------
>
>                 Key: SPARK-23485
>                 URL: https://issues.apache.org/jira/browse/SPARK-23485
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not use for running tasks (e.g., because of bad hardware). When running in yarn, this blacklist is used to avoid ever allocating resources on blacklisted nodes: https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this is incorrect -- but I didn't see any references to {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}}, so it seems this is missing. Thought of this while looking at SPARK-19755, a similar issue on mesos.
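For comparison with the YARN path quoted above, a k8s analog of the missing hook might look like the sketch below. Only {{scheduler.nodeBlacklist()}} is existing Spark API here; the requester class, the pod-factory function, and {{applyNodeBlacklist}} (from the earlier sketch) are illustrative assumptions:

{code:scala}
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.KubernetesClient
import org.apache.spark.scheduler.TaskSchedulerImpl

// Hypothetical sketch mirroring YarnSchedulerBackend's use of
// scheduler.nodeBlacklist(): snapshot the blacklist at request time and
// apply it to each executor pod before asking the API server to create it.
class BlacklistAwareExecutorRequester(
    scheduler: TaskSchedulerImpl,
    kubernetesClient: KubernetesClient,
    buildExecutorPod: () => Pod) { // stands in for the real executor pod factory

  def requestNewExecutor(): Unit = {
    val blacklisted: Set[String] = scheduler.nodeBlacklist()
    val pod = applyNodeBlacklist(buildExecutorPod(), blacklisted) // see sketch above
    kubernetesClient.pods().create(pod)
  }
}
{code}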