Hi,

At a recent Mesos meetup, quite a few cluster operators complained that it is currently hard to model host issues with Mesos.
For example, in our environment, the only signal the scheduler receives is whether the Mesos agent has disconnected from the cluster. However, in real production we see a family of other issues that make hosts (sometimes only "partially") unusable. Examples include:

- traffic routing software malfunction (e.g., haproxy): the Mesos agent does not depend on it, so the scheduler/deployment system is not aware of the failure, but the actual workload on the cluster will fail;
- broken disks;
- issues with other long-running system agents.

This email asks how Mesos can recommend best practices for surfacing these issues to the scheduler, and whether we need additional primitives in Mesos to achieve that goal.

Any comments, suggestions, or questions are highly welcome.

Thanks!

--
Cheers,
Zhitao Li