Hi,

At a recent Mesos meetup, several cluster operators complained that it is
currently hard to model host issues with Mesos.

For example, in our environment, the only signal the scheduler gets is
whether the Mesos agent has disconnected from the cluster. However, in
production we see a family of other issues that make hosts (sometimes only
"partially") unusable. Examples include:
- traffic routing software malfunction (e.g., haproxy): the Mesos agent does
not depend on it, so the scheduler/deployment system is not aware of the
failure, but the actual workload on the cluster will fail;
- broken disks;
- issues with other long-running system agents.

This email asks how Mesos can recommend best practices for surfacing
these issues to the scheduler, and whether we need additional primitives in
Mesos to achieve that goal.

Any comments, suggestions, or questions are very welcome.

Thanks!

-- 
Cheers,

Zhitao Li
