[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2022-06-28 Thread Lijie Wang (Jira)
Lijie Wang commented on SPARK-23485:
--
Hi, what's the current status of this issue? Does Kubernetes mode support node
blacklists in the latest version (3.x)?

[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375005#comment-16375005
 ] 

Imran Rashid commented on SPARK-23485:
--

{quote}
I think this is because the general expectation is that a failure on a given 
node will just cause new executors to spin up on different nodes and eventually 
the application will succeed.
{quote}

I think this is the part that may be particularly different in Spark.  Some 
types of failures do not cause the executor to die -- it's just a task failure, 
and the executor itself is still alive.  As long as Spark gets heartbeats from 
the executor, it assumes the executor is still fine.  But a bad disk can cause 
*tasks* to fail repeatedly.  The same can be true for other resources, e.g. a 
bad GPU, and perhaps the GPU is only used by certain tasks.

When that happens, without Spark's internal blacklisting, an application will 
very quickly hit many task failures.  A task fails, Spark notices, tries to 
find a place to assign the failed task, and puts it back in the same place; 
this repeats until Spark decides there are too many failures and gives up.  It 
can easily cause your app to fail in about a second.  There is no communication 
with the cluster manager during this process; it is all between Spark's driver 
and executors.  In one case when this happened, YARN's own health checker 
discovered the problem a few minutes after it occurred -- but the Spark app had 
already failed by that point.  All from one bad disk in a cluster with more 
than 1,000 disks.
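
For concreteness, a minimal sketch of the configuration involved, using the 
Spark 2.x property names ({{spark.task.maxFailures}}, 
{{spark.blacklist.enabled}}, {{spark.blacklist.timeout}}); the values are 
illustrative, not recommendations:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: a task retried on the same bad node fails
// spark.task.maxFailures times (default 4) and the whole job is aborted,
// often within a second or two. With blacklisting enabled, the driver
// instead stops scheduling on the suspect executor/node and keeps going.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")       // per-task retry budget before the job fails
  .set("spark.blacklist.enabled", "true")   // track bad executors/nodes within the app
  .set("spark.blacklist.timeout", "1h")     // give the node another chance later
{code}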

Spark's blacklisting is really meant to be complementary to the kind of node 
health checks you are describing in Kubernetes.  The blacklisting in Spark 
intentionally does not try to figure out the root cause of the problem, as we 
don't want to get into the game of enumerating all of the possibilities.  It's 
a heuristic that makes it safe for Spark to keep going in the face of these 
uncaught errors, and then retries the resources once it would be safe to do 
so.  (This is discussed in more detail in the design doc on SPARK-8425.)

Anti-affinity in Kubernetes may be just the trick, though this part of the doc 
was a little worrisome:

{quote}
Note: Inter-pod affinity and anti-affinity require substantial amount of 
processing which can slow down scheduling in large clusters significantly. We 
do not recommend using them in clusters larger than several hundred nodes.
{quote}

Blacklisting is *most* important in large clusters.  It seems like inter-pod 
anti-affinity is able to do something much more complicated than a simple node 
blacklist, though -- maybe scheduling would already be fast enough with such a 
simple anti-affinity rule?
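
For reference, a sketch of what such a simple rule could look like in Scala 
with the fabric8 Kubernetes client (the client the Kubernetes backend builds 
on).  This is only an illustration: the builder classes follow fabric8's 
generated model ({{NodeSelectorRequirementBuilder}} etc.) and are not taken 
from Spark itself.

{code:scala}
import io.fabric8.kubernetes.api.model.{Affinity, AffinityBuilder,
  NodeAffinityBuilder, NodeSelectorBuilder, NodeSelectorRequirement,
  NodeSelectorRequirementBuilder, NodeSelectorTerm, NodeSelectorTermBuilder}

// Sketch: translate a per-application blacklist into a required node-affinity
// rule ("do not schedule this pod on any of these hostnames"). This is plain
// node affinity with a NotIn match on the hostname label, which should be far
// cheaper for the scheduler than the inter-pod anti-affinity the doc warns about.
def blacklistAffinity(blacklistedHosts: Set[String]): Affinity = {
  val notOnBadHosts: NodeSelectorRequirement = new NodeSelectorRequirementBuilder()
    .withKey("kubernetes.io/hostname")
    .withOperator("NotIn")
    .withValues(blacklistedHosts.toSeq: _*)
    .build()
  val term: NodeSelectorTerm = new NodeSelectorTermBuilder()
    .withMatchExpressions(notOnBadHosts)
    .build()
  new AffinityBuilder()
    .withNodeAffinity(new NodeAffinityBuilder()
      .withRequiredDuringSchedulingIgnoredDuringExecution(
        new NodeSelectorBuilder().withNodeSelectorTerms(term).build())
      .build())
    .build()
}
// The resulting Affinity would then be set on each executor pod's spec.
{code}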

> Kubernetes should support node blacklist
> 
>
> Key: SPARK-23485
> URL: https://issues.apache.org/jira/browse/SPARK-23485
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (eg., because of bad hardware).  When running in yarn, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on mesos.






[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374884#comment-16374884
 ] 

Anirudh Ramanathan commented on SPARK-23485:


Stavros - we [do currently 
differentiate|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L386-L398]
 between Kubernetes causing an executor to disappear (node failure) and an exit 
caused by the application itself.

Here's some detail on node issues and k8s:

Node-level problem detection is split between the Kubelet and the [Node 
Problem Detector|https://github.com/kubernetes/node-problem-detector].  This 
works for some common errors and, in the future, will taint nodes upon 
detecting them.  Some of these errors are listed 
[here|https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json#L30:15].
 However, there are some categories of errors this setup won't detect.  For 
example: a node whose firewall rules or networking prevent it from reaching a 
particular external service, say to download or stream data; or a node with a 
faulty local disk that throws read/write errors.  These error conditions may 
only affect certain kinds of pods on that node and not others.

Yinan's point, I think, is that it is uncommon for applications on k8s to try 
to incorporate reasoning about node-level conditions.  I think this is because 
the general expectation is that a failure on a given node will just cause new 
executors to spin up on different nodes and eventually the application will 
succeed.  However, I can see this being an issue in large-scale production 
deployments, where we'd see transient errors like the ones above.  Given the 
existence of a blacklist mechanism and anti-affinity primitives, I don't think 
it would be too complex to incorporate.

[~aash] [~mcheah], have you guys seen this in practice thus far? 




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374778#comment-16374778
 ] 

Stavros Kontopoulos commented on SPARK-23485:
-

What about locality preferences combined with a hardware problem, like the 
disk issue above?  I see code in the Spark Kubernetes scheduler related to 
locality (not sure if it is complete).  Will such a problem be detected, and 
will the Kubernetes scheduler consider the node problematic?  If so, then I 
guess there is no need for blacklisting.

 




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374757#comment-16374757
 ] 

Yinan Li commented on SPARK-23485:
--

It's not that I'm too confident in Kubernetes' capability to detect node 
problems.  I just don't see it as good practice to worry about node problems 
at the application level in a containerized environment running on a container 
orchestration system.  So yes, I don't think Spark on Kubernetes should really 
need to worry about blacklisting nodes.




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374748#comment-16374748
 ] 

Imran Rashid commented on SPARK-23485:
--

OK, the missing jar was a bad example on Kubernetes ... I still wouldn't be 
surprised if there is some app-specific failure mode we're failing to take 
into account.

I think you are too confident in Kubernetes' ability to detect problems with 
nodes -- I don't know exactly what it does, but I don't think it is possible 
for it to handle all of this.  It would be great if we really could rely on 
the separation of concerns you want; in practice that just doesn't work, 
because the app has more information.

It almost sounds like you think Spark should not even use any internal 
blacklisting with Kubernetes -- from experience with large non-Kubernetes 
deployments, I think that is a bad idea.




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374745#comment-16374745
 ] 

Anirudh Ramanathan commented on SPARK-23485:


While I mostly think that K8s would be better suited to make the decision to 
blacklist nodes, I think we will see that there are reasons to consider nodes 
problematic beyond just the kubelet health checks, so using Spark's 
blacklisting sounds like a good idea to me.

Tainting nodes isn't the right solution, given that this is one Spark 
application's notion of a blacklist and we don't want it applied at the 
cluster level.  We could, however, use [node 
anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 to communicate said blacklist and ensure that those nodes are avoided by the 
executors of that application, as sketched below.
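
A hedged sketch of how that wiring could look.  Nothing here is existing Spark 
code: {{applyNodeBlacklist}} is a hypothetical method, and 
{{blacklistAffinity}} is the illustrative helper from the node-affinity sketch 
earlier in this thread; only {{scheduler.nodeBlacklist()}} is the real 
driver-side hook that the issue description refers to.

{code:scala}
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

// Hypothetical: before creating an executor pod, attach a node anti-affinity
// rule derived from the driver's per-application blacklist, so the blacklist
// stays scoped to this app instead of tainting nodes cluster-wide.
def applyNodeBlacklist(pod: Pod, badHosts: Set[String]): Pod =
  if (badHosts.isEmpty) {
    pod
  } else {
    new PodBuilder(pod)
      .editOrNewSpec()
        .withAffinity(blacklistAffinity(badHosts))  // helper sketched earlier
      .endSpec()
      .build()
  }

// In the backend itself, badHosts would come from the driver's own view,
// i.e. the same set that the YARN backend forwards to its allocator:
//   val badHosts = scheduler.nodeBlacklist()
{code}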




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374708#comment-16374708
 ] 

Yinan Li commented on SPARK-23485:
--

In the YARN case, yes, it's possible that a node is missing a jar commonly 
needed by applications.  In Kubernetes mode, this will never be the case, 
because either all containers have a particular jar locally or none of them 
has it.  An image missing a dependency is problematic by itself; this 
consistency is one of the benefits of being containerized.  As for node 
problems, detecting them and avoiding scheduling pods onto problematic nodes 
are the concerns of the kubelets and the scheduler.  Applications should not 
need to worry about whether nodes are healthy.  Node problems that happen at 
runtime cause pods to be evicted from the problematic nodes and rescheduled 
elsewhere.  Having applications be responsible for keeping track of 
problematic nodes and maintaining a blacklist means unnecessarily stepping 
into the business of the kubelets and the scheduler.

 

[~foxish]




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374599#comment-16374599
 ] 

Imran Rashid commented on SPARK-23485:
--

Yeah, I don't think it's safe to assume that it's Kubernetes' responsibility 
to entirely figure out the equivalent of a Spark application's internal 
blacklist.  You can't guarantee that it will detect hardware issues, and the 
issue might also be specific to the Spark application (e.g. a missing jar).
 YARN has some basic detection of bad nodes as well, but without Spark's 
blacklisting we observed cases in production where one bad disk would 
effectively take out an entire application on a large cluster, as many task 
failures could pile up very quickly.

That said, the existing blacklist implementation in Spark already handles that 
case, even without the extra handling I'm proposing here.  The Spark app would 
still have its own node blacklist, and would avoid scheduling tasks on that 
node.

However, this is suboptimal because Spark isn't really getting as many 
resources as it should.  E.g., it might request 10 executors and Kubernetes 
hands it 10, but Spark can really only use 8 of them because 2 live on a node 
that is blacklisted.

I don't think this can be handled directly with taints, if I understand 
correctly.  I assume applying a taint is an admin-level operation?  That would 
mean a Spark app couldn't dynamically apply a taint when it discovers a 
problem on a node (and really, it probably shouldn't be able to, as the 
cluster shouldn't trust an arbitrary user).  Furthermore, taints don't allow 
the blacklist to be application-specific -- blacklisting is really just a 
heuristic, and you probably do not want it applied across applications.  It's 
not clear what you'd do with multiple apps, each with its own blacklist, as 
nodes enter and leave each app's blacklist at different times.




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-22 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373652#comment-16373652
 ] 

Stavros Kontopoulos commented on SPARK-23485:
-

Does the scheduler know all the reasons an app might want to mark a node as 
blacklisted?

What exactly are the node problems?

It seems to me that this is the reason taints were introduced (I might be 
wrong).




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-22 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373634#comment-16373634
 ] 

Stavros Kontopoulos commented on SPARK-23485:
-

[~liyinan926] I understand the default behavior of the Kubernetes scheduler 
(it makes the decisions, apps don't make them), but there is an alpha feature, 
taint-based evictions, to help make better or different decisions, right?

"*Taint based Evictions (alpha feature)*: A per-pod-configurable eviction 
behavior when there are node problems, which is described in the next 
section."

What is wrong with that in this case?  What if I want to limit where something 
runs?




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-22 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373620#comment-16373620
 ] 

Yinan Li commented on SPARK-23485:
--

The Kubernetes scheduler backend simply creates executor pods through the 
Kubernetes API server, and the pods are scheduled by the Kubernetes scheduler 
to run on the available nodes.  The scheduler backend is not interested in, 
nor should it know about, the mapping from pods to nodes.  Affinity and 
anti-affinity, or taints and tolerations, can be used to influence pod 
scheduling.  But it is the responsibility of the Kubernetes scheduler and the 
kubelets to keep track of node problems and avoid scheduling pods onto 
problematic nodes.




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-22 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373558#comment-16373558
 ] 

Stavros Kontopoulos commented on SPARK-23485:
-

I guess everything is covered via handleDisconnectedExecutors, which is 
scheduled at some rate and then calls removeExecutor in 
CoarseGrainedSchedulerBackend, which updates the blacklist info.

 




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-22 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373544#comment-16373544
 ] 

Yinan Li commented on SPARK-23485:
--

I'm not sure node blacklisting applies to Kubernetes.  In Kubernetes mode, 
executors run in containers, which in turn run in Kubernetes pods scheduled 
onto available cluster nodes by the Kubernetes scheduler.  The Kubernetes 
Spark scheduler backend does not keep track of, nor really care about, which 
nodes the pods run on.  That is a concern of the Kubernetes scheduler.




[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372134#comment-16372134
 ] 

Imran Rashid commented on SPARK-23485:
--

This is also related to SPARK-16630 ... if that is solved for other cluster 
managers before this one, then we should probably roll similar behavior into 
this for Kubernetes too.
