> On Feb 20, 2018, at 11:11 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> Hi,
>
> In one of recent Mesos meet up, quite a couple of cluster operators had
> expressed complaints that it is hard to model host issues with Mesos at the
> moment.
>
> For example, in our environment, the only signal scheduler would know is
> whether Mesos agent has disconnected from the cluster. However, we have a
> family of other issues in real production which makes the hosts (sometimes
> "partially") unusable. Examples include:
> - traffic routing software malfunction (i.e, haproxy): Mesos agent does not
> require this so scheduler/deployment system is not aware, but actual
> workload on the cluster will fail;
> - broken disk;
> - other long running system agent issues.
>
> This email is looking at how can Mesos recommend best practice to surface
> these issues to scheduler, and whether we need additional primitives in
> Mesos to achieve such goal.

In the K8s world, a node can publish "conditions" that describe its status:

https://kubernetes.io/docs/concepts/architecture/nodes/#condition

A condition can automatically taint the node, which in turn can cause pods
to be evicted automatically (i.e. if they can't tolerate that specific taint).

J
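
P.S. To make the condition -> taint -> eviction flow concrete, here is a minimal Python sketch. It is a toy model, not the Kubernetes API: the `Node` and `Pod` classes, the `CONDITION_TAINTS` table and the `evict_intolerant` helper are all hypothetical names made up for illustration, and real behaviour (toleration timeouts, other taint effects) is left out. Only the two taint keys are taken from Kubernetes itself.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified lookup: (condition type, status) -> (taint key, effect).
# The two keys mirror well-known Kubernetes taints; everything else is illustrative.
CONDITION_TAINTS = {
    ("Ready", "False"): ("node.kubernetes.io/not-ready", "NoExecute"),
    ("Ready", "Unknown"): ("node.kubernetes.io/unreachable", "NoExecute"),
}


@dataclass
class Pod:
    name: str
    # Taint keys this pod declares it can tolerate (assumption: key match only).
    tolerations: set = field(default_factory=set)


@dataclass
class Node:
    conditions: dict  # condition type -> status string, e.g. {"Ready": "False"}
    pods: list = field(default_factory=list)

    def taints(self):
        """Derive taints from the conditions the node currently publishes."""
        return [
            CONDITION_TAINTS[item]
            for item in self.conditions.items()
            if item in CONDITION_TAINTS
        ]

    def evict_intolerant(self):
        """Remove pods that do not tolerate every NoExecute taint; return their names."""
        no_execute = {key for key, effect in self.taints() if effect == "NoExecute"}
        evicted = [p for p in self.pods if not no_execute <= p.tolerations]
        self.pods = [p for p in self.pods if no_execute <= p.tolerations]
        return [p.name for p in evicted]


# Usage: a node that reports Ready=False evicts the pod without a toleration.
node = Node(
    conditions={"Ready": "False"},
    pods=[Pod("web"), Pod("daemon", {"node.kubernetes.io/not-ready"})],
)
evicted = node.evict_intolerant()
```

The same shape would fit the host issues listed in the quoted mail: something on the host publishes a condition (e.g. for a broken disk), and the scheduler decides per workload whether that condition is tolerable.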