[ 
https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908729#comment-13908729
 ] 

David Robinson commented on MESOS-1028:
---------------------------------------

sgtm

> expose internal metrics
> -----------------------
>
>                 Key: MESOS-1028
>                 URL: https://issues.apache.org/jira/browse/MESOS-1028
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>            Reporter: David Robinson
>
> Mesos should export statistics that provide visibility into its internals. 
> This would allow users to detect numerous problems without resorting to 
> trawling through log files.
> For example, export counters for (some of these already exist, most don't):
> * cgroup create
> * cgroup destroy
> * cgroup destroy attempts
> * resource offers made
> * resource offers accepted
> * tasks launched
> * tasks destroyed
> * tasks lost
> * writes to replicated log
> * queue length
> Export the 50th, 90th, 95th, and 99th percentiles of the time taken to:
> * start Mesos (reach a certain state)
> * move tasks between two given states (starting -> started)
> * create a cgroup
> * destroy a cgroup
> * send a message from slave to master
> * start a task
> * stop a task
> * register in ZooKeeper
> * write to the replicated log
> Ideally all these metrics would be exposed via an HTTP+JSON endpoint. See 
> [metrics|http://metrics.codahale.com/getting-started/] for an example library 
> (albeit in Java), or [medida|http://dln.github.io/medida/] for an 
> unmaintained(?) C++ port.
> We've previously seen problems where tasks were stuck in cgroup destroy with 
> more than 30,000 attempts. Exposing metrics would allow us to easily detect 
> problems like this.
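
For illustration only, the kind of export described in the issue could be sketched roughly as below. This is not Mesos code: the `Metrics` class, the metric names, the nearest-rank percentile method, and the JSON layout are all assumptions made for the example, and it is written in Python rather than C++ purely for brevity. The handler uses the standard-library `http.server` module.

```python
import json
import math
import threading
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer


class Metrics:
    """Hypothetical in-process registry: named counters plus timing samples."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._timings = defaultdict(list)

    def increment(self, name, amount=1):
        with self._lock:
            self._counters[name] += amount

    def record_seconds(self, name, seconds):
        with self._lock:
            self._timings[name].append(seconds)

    @staticmethod
    def percentile(samples, p):
        """Nearest-rank percentile: smallest sample with at least p% at or below it."""
        ordered = sorted(samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self):
        """Return counters and p50/p90/p95/p99 per timer as a plain dict."""
        with self._lock:
            timers = {
                name: {f"p{p}": self.percentile(samples, p) for p in (50, 90, 95, 99)}
                for name, samples in self._timings.items()
                if samples
            }
            return {"counters": dict(self._counters), "timers": timers}


def make_handler(metrics):
    """Build a handler class that serves the current snapshot as JSON on GET."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(metrics.snapshot()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


# Made-up sample data, loosely modelled on the stuck cgroup destroy case.
metrics = Metrics()
metrics.increment("cgroup_destroy_attempts", 30000)
metrics.increment("tasks_launched")
for ms in range(1, 101):
    metrics.record_seconds("cgroup_destroy", ms / 1000)

snapshot = metrics.snapshot()
# To serve it (address and port are arbitrary):
# HTTPServer(("127.0.0.1", 8080), make_handler(metrics)).serve_forever()
```

With something along these lines, a monitoring system could poll the endpoint and alert when a counter such as the hypothetical `cgroup_destroy_attempts` grows without a matching number of completed destroys. Keeping every raw sample in memory is a simplification; a real implementation would more likely use a bounded reservoir or histogram.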



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
