[ https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908729#comment-13908729 ]
David Robinson commented on MESOS-1028:
---------------------------------------

sgtm

> expose internal metrics
> -----------------------
>
>                 Key: MESOS-1028
>                 URL: https://issues.apache.org/jira/browse/MESOS-1028
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: David Robinson
>
> Mesos should export statistics that provide visibility into its internals.
> This would allow users to detect numerous problems without resorting to
> trawling through log files.
>
> E.g. export counters of (some of these already exist, most don't):
> * cgroup create
> * cgroup destroy
> * cgroup destroy attempts
> * resource offers made
> * resource offers accepted
> * tasks launched
> * tasks destroyed
> * tasks lost
> * writes to replicated log
> * queue length
>
> Export the 50th, 90th, 95th and 99th percentiles of the time taken to:
> * start mesos (reach a certain state)
> * move tasks between two given states (starting -> started)
> * create a cgroup
> * destroy a cgroup
> * send a message from slave to master
> * start a task
> * stop a task
> * register in zookeeper
> * write to the replicated log
>
> Ideally all these metrics would be exposed via an HTTP+JSON endpoint. See
> [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit
> Java) library (or [medida|http://dln.github.io/medida/] for an
> unmaintained(?) C++ port).
>
> We've previously seen problems where tasks were stuck in cgroup destroy with
> more than 30,000 attempts. Exposing metrics would allow us to easily detect
> problems like this.
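
To make the request concrete, here is a minimal sketch in Python of the two kinds of metric described in the issue: named counters, and timers reported as 50th/90th/95th/99th percentiles, serialized as JSON. This is not Mesos code and not the API of either library linked in the issue. The class `MetricsRegistry`, its methods, the metric names and the choice of nearest-rank percentiles are all assumptions made for illustration.

```python
import json
import threading
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-process registry of counters and timers (illustration only)."""

    PERCENTILES = (50, 90, 95, 99)

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)   # metric name -> running count
        self._timings = defaultdict(list)   # metric name -> recorded durations

    def increment(self, name, amount=1):
        """Add `amount` to the counter `name`, e.g. "cgroup_destroy_attempts"."""
        with self._lock:
            self._counters[name] += amount

    def record_duration(self, name, millis):
        """Record one duration sample (milliseconds) for the timer `name`."""
        with self._lock:
            self._timings[name].append(millis)

    @staticmethod
    def _percentile(sorted_samples, pct):
        # Nearest-rank percentile: the sample at rank ceil(pct/100 * n),
        # computed with integer arithmetic to avoid floating-point rounding.
        rank = max(1, (pct * len(sorted_samples) + 99) // 100)
        return sorted_samples[rank - 1]

    def snapshot(self):
        """Return counters and timer percentiles as a plain dict."""
        with self._lock:
            counters = dict(self._counters)
            timings = {name: sorted(s) for name, s in self._timings.items()}
        timers = {
            name: {"p%d" % p: self._percentile(samples, p)
                   for p in self.PERCENTILES}
            for name, samples in timings.items() if samples
        }
        return {"counters": counters, "timers": timers}

    def to_json(self):
        """Serialize the snapshot; an HTTP handler could return this body."""
        return json.dumps(self.snapshot(), sort_keys=True)


if __name__ == "__main__":
    registry = MetricsRegistry()
    registry.increment("cgroup_destroy_attempts")
    for millis in (12, 15, 11, 240, 13):
        registry.record_duration("cgroup_destroy", millis)
    print(registry.to_json())
```

The sketch keeps every sample in memory, which would not be acceptable for a long-running master or slave; a real implementation would more likely use a bounded reservoir or histogram. Serving the JSON over HTTP is deliberately left out.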