David Robinson created MESOS-1028:
-------------------------------------

    Summary: expose internal metrics
    Key: MESOS-1028
    URL: https://issues.apache.org/jira/browse/MESOS-1028
    Project: Mesos
    Issue Type: Improvement
    Components: general
    Reporter: David Robinson

Mesos should export statistics that provide visibility into its internals. This would allow users to detect numerous problems without resorting to trawling through log files.

E.g.

export counters of (some of these already exist, most don't):
* cgroup create
* cgroup destroy
* cgroup destroy attempts
* resource offers made
* resource offers accepted
* tasks launched
* tasks destroyed
* tasks lost
* writes to replicated log
* queue length

export 50th, 90th, 95th, 99th percentile of time taken to:
* start mesos (reach a certain state)
* move tasks between two given states (starting -> started)
* create a cgroup
* destroy a cgroup
* send a message from slave to master
* start a task
* stop a task
* register in zookeeper
* write to the replicated log

Ideally all these metrics would be exposed via an HTTP+JSON endpoint. See [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit Java) library (or [medida|http://dln.github.io/medida/] for an unmaintained(?) C++ port).

We've previously seen problems where tasks were stuck in cgroup destroy with more than 30,000 attempts. Exposing metrics would allow us to easily detect problems like this.
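
To make the request more concrete, here is a rough sketch of what such a facility might look like. It is only an illustration under stated assumptions: `MetricsRegistry`, its methods, the metric names, and the flat JSON layout are all hypothetical and are not taken from Mesos, metrics, or medida. Percentiles use the nearest-rank method over all recorded samples, which is simple but keeps every sample in memory; a real implementation would more likely use a bounded reservoir or histogram.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: none of these names exist in Mesos.
// Not thread-safe; a real registry would need synchronization.
class MetricsRegistry {
public:
  // Bump a named counter, e.g. "cgroup_destroy_attempts".
  void increment(const std::string& name, std::uint64_t by = 1) {
    counters_[name] += by;
  }

  // Record one duration sample, in milliseconds, for a named timer.
  void record(const std::string& name, double millis) {
    timers_[name].push_back(millis);
  }

  // Current value of a counter, or 0 if it was never incremented.
  std::uint64_t counter(const std::string& name) const {
    auto it = counters_.find(name);
    return it == counters_.end() ? 0 : it->second;
  }

  // Nearest-rank percentile: the smallest sample such that at least
  // `percent` percent of all samples are <= it. Returns 0 if the timer
  // has no samples.
  double percentile(const std::string& name, unsigned percent) const {
    auto it = timers_.find(name);
    if (it == timers_.end() || it->second.empty()) return 0.0;
    if (percent > 100) percent = 100;
    std::vector<double> sorted = it->second;
    std::sort(sorted.begin(), sorted.end());
    // Integer ceiling of percent * n / 100, clamped to at least 1.
    std::size_t rank =
        (static_cast<std::size_t>(percent) * sorted.size() + 99) / 100;
    if (rank < 1) rank = 1;
    return sorted[rank - 1];
  }

  // Render counters and timer percentiles as one flat JSON object.
  // Metric names are assumed to need no JSON escaping.
  std::string toJson() const {
    std::ostringstream out;
    out << "{";
    bool first = true;
    for (const auto& kv : counters_) {
      out << (first ? "" : ",") << "\"" << kv.first << "\":" << kv.second;
      first = false;
    }
    for (const auto& kv : timers_) {
      for (unsigned p : {50u, 90u, 95u, 99u}) {
        out << (first ? "" : ",") << "\"" << kv.first << "_p" << p
            << "_ms\":" << percentile(kv.first, p);
        first = false;
      }
    }
    out << "}";
    return out.str();
  }

private:
  std::map<std::string, std::uint64_t> counters_;
  std::map<std::string, std::vector<double>> timers_;
};
```

With something along these lines, the cgroup destroy path could call `increment("cgroup_destroy_attempts")` on every attempt and `record("cgroup_destroy", elapsedMillis)` on completion, and an HTTP handler could return `toJson()`. A counter climbing past 30,000 would then be visible to a monitoring system without reading any logs.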