Re: Surfacing additional issues on agent host to schedulers

James Peach Tue, 20 Feb 2018 15:55:24 -0800

> On Feb 20, 2018, at 11:11 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> 
> Hi,
> 
> At a recent Mesos meetup, quite a few cluster operators complained that it
> is hard to model host issues with Mesos at the moment.
> 
> For example, in our environment, the only signal a scheduler gets is
> whether the Mesos agent has disconnected from the cluster. However, we see a
> family of other issues in production that make hosts (sometimes
> "partially") unusable. Examples include:
> - traffic routing software malfunction (e.g., haproxy): the Mesos agent does
> not require it, so the scheduler/deployment system is not aware of the
> failure, but actual workloads on the cluster will fail;
> - broken disk;
> - other long running system agent issues.
> 
> This email asks how Mesos can recommend best practices for surfacing
> these issues to schedulers, and whether we need additional primitives in
> Mesos to achieve that goal.


In the K8s world the node can publish "conditions" that describe its status:

        https://kubernetes.io/docs/concepts/architecture/nodes/#condition

A condition can automatically taint the node, which can cause pods to be
evicted automatically (i.e., if they can't tolerate that specific taint).
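
To make the condition/taint/eviction flow concrete, here is a rough sketch in
Python. It is a toy model and not the Kubernetes API: the Taint and Pod
classes, the CONDITION_TAINTS table, and both helper functions are
hypothetical names made up for illustration, and the condition-to-taint
mapping is simplified.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule" blocks new pods, "NoExecute" also evicts


@dataclass
class Pod:
    name: str
    # Taint keys this pod tolerates (simplified: real tolerations carry more fields).
    tolerations: set = field(default_factory=set)


# Assumed, simplified mapping from a node condition to the taint it triggers.
CONDITION_TAINTS = {
    "Ready=False": Taint("node.kubernetes.io/not-ready", "NoExecute"),
    "DiskPressure=True": Taint("node.kubernetes.io/disk-pressure", "NoSchedule"),
}


def taints_for(conditions):
    """Translate the conditions a node publishes into taints."""
    return [CONDITION_TAINTS[c] for c in conditions if c in CONDITION_TAINTS]


def pods_to_evict(pods, taints):
    """Return names of pods that lack a toleration for some NoExecute taint."""
    evicting = {t.key for t in taints if t.effect == "NoExecute"}
    return [p.name for p in pods if evicting - p.tolerations]


pods = [Pod("web"), Pod("daemon", {"node.kubernetes.io/not-ready"})]
taints = taints_for(["Ready=False", "DiskPressure=True"])
print(pods_to_evict(pods, taints))
```

In this model only the pod without a matching toleration is evicted. A
similar split between "stop placing new work here" and "remove running work"
might be a useful way to think about partially unusable Mesos hosts.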

J
