RE: Question on Monitoring a Mesos Cluster

Andras Kerekes Mon, 09 Mar 2015 08:20:59 -0700

We use the same monitoring script from rayrod2030. However, instead of 
master_cpus_percent, we use master_cpus_used and master_cpus_total to 
calculate a percentage. This gives the allocated percentage of CPUs in the 
cluster; the actual utilization is measured by collectd.
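
As a rough sketch of that calculation (not the actual collectd plugin code):
the function below assumes a dict shaped like the JSON from /metrics/snapshot,
with keys such as master/cpus_used and master/cpus_total. The function name
and the sample values are made up for illustration.

```python
def allocation_percent(snapshot, resource="cpus"):
    """Return the allocated share of a resource as a percentage (0-100).

    `snapshot` is assumed to be a dict parsed from /metrics/snapshot.
    This measures allocation (what Mesos has handed out to tasks),
    not actual utilization on the hosts.
    """
    used = float(snapshot["master/%s_used" % resource])
    total = float(snapshot["master/%s_total" % resource])
    if total == 0:
        # No resources registered yet; avoid dividing by zero.
        return 0.0
    return 100.0 * used / total


# Illustrative values only, not real cluster output.
sample = {"master/cpus_used": 510.0, "master/cpus_total": 540.0}
print(round(allocation_percent(sample), 1))  # prints 94.4
```

The same function works for mem and disk by passing resource="mem" or
resource="disk", assuming those keys are present in the snapshot.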


-----Original Message-----
From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick 
Davies
Sent: Saturday, March 07, 2015 2:15 PM
To: user@mesos.apache.org
Subject: Re: Question on Monitoring a Mesos Cluster

Yeah, that confused me too - I think that figure is specific to the 
master/slave polled (and that'll just be the active one, since you're only 
reporting when master/elected is true).

I'm using this one: https://github.com/rayrod2030/collectd-mesos - not sure if 
that's the same as yours?
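
To illustrate the "only report when master/elected is true" behaviour, here is
a minimal sketch. It assumes each polled master returns a snapshot dict
containing a master/elected gauge (1 for the leader, 0 otherwise); the helper
names are hypothetical and not taken from the plugin.

```python
def is_elected(snapshot):
    """True only if the polled master reports itself as the elected leader."""
    return snapshot.get("master/elected", 0) == 1


def leader_snapshot(snapshots):
    """Return the snapshot from the elected master, or None if there is none."""
    for snap in snapshots:
        if is_elected(snap):
            return snap
    return None


# Illustrative data: three masters polled, only one is the leader.
polled = [
    {"master/elected": 0, "master/cpus_percent": 0.0},
    {"master/elected": 1, "master/cpus_percent": 0.94},
    {"master/elected": 0, "master/cpus_percent": 0.0},
]
leader = leader_snapshot(polled)
print(leader["master/cpus_percent"])  # prints 0.94
```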


On 7 March 2015 at 18:56, Jeff Schroeder <jeffschroe...@computer.org> wrote:
> Responses inline
>
> On Sat, Mar 7, 2015 at 12:48 PM, CCAAT <cc...@tampabay.rr.com> wrote:
>>
>> ... snip ...
>>>
>>> After getting everything working, I built a few dashboards, one of
>>> which displays these stats from http://master:5050/metrics/snapshot:
>>>
>>> master/disk_percent
>>> master/cpus_percent
>>> master/mem_percent
>>>
>>> I had assumed that this was something like aggregate cluster
>>> utilization, but this seems incorrect in practice. I have a small
>>> cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had
>>> a dozen or so small tasks running, and launched 500 tasks with 1G of
>>> memory and 1 CPU each.
>>>
>>> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
>>> considerably. I did notice that cpus_percent went to around 0.94.
>>>
>>> What is the correct way to measure overall cluster utilization for
>>> capacity planning? We can have the NOC watch this and simply add
>>> more hardware when the number starts getting low.
>>
>>
>> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
>> development group has more accurate information if not some vague
>> roadmap on resource/process monitoring. Sooner or later, this is
>> going to become a quintessential need; so I hope the "deep thinkers"
>> are all over this need both in the user and dev groups.
>>
>> In fact, the monitoring can easily create a significant load on the
>> cluster/cloud if one is not judicious in how it is architected,
>> implemented and dynamically tuned.
>
> Monitoring via passive metrics gathering and application "telemetry"
> is one of the best ways to do it. That is how I've implemented things.
>
> The beauty of the rest api is that it isn't heavyweight, and every
> master has it on port 5050 (by default) and every slave has it on port
> 5051 (by default). Since I'm throwing this all into graphite (well
> technically cassandra fronted by cyanite fronted by graphite-api...
> but same difference), I found a reasonable way to do capacity
> planning. Collectd will poll the master/slave on each mesos host every
> 10 seconds (localhost:5050 on masters and localhost:5051 on slaves).
> This gets put into graphite via collectd's write_graphite plugin.
> These 3 graphite targets give me percentages of utilization for nice graphs:
>
> alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
> collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
>
> alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
> collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
>
> alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
> collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")
>
> With that data, you can have your monitoring tools such as
> nagios/icinga poll graphite. Using the native graphite render api, you can 
> do things like:
>
>     * "if the cpu usage is over 80% for 24 hours, send a warning event"
>     * "if the cpu usage is over 95% for 6 hours, send a critical event"
>
> This allows mostly no-impact monitoring since the monitoring tools are
> hitting graphite.
>
> Anyways, back to the original questions:
>
> How does everyone do proper monitoring and capacity planning for large
> mesos clusters? I expect my cluster to grow beyond what it currently
> is by quite a bit.
>
> --
> Jeff Schroeder
>
> Don't drink and derive, alcohol and analysis don't mix.
> http://www.digitalprognosis.com
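
The two alert rules quoted above ("over 80% for 24 hours" and "over 95% for 6
hours") could be evaluated roughly as follows. This is only a sketch: it
assumes the datapoints were already fetched from graphite's render API with
format=json, where each point is a [value, timestamp] pair and value may be
null, and that the series covers the whole window. The function names are
hypothetical; the return codes follow the usual nagios plugin convention
(0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2


def sustained_above(datapoints, threshold, window_seconds, now):
    """True if every non-null value within the window exceeds the threshold.

    `datapoints` is a list of [value, timestamp] pairs, as found in the
    "datapoints" field of a graphite JSON render response.
    """
    recent = [
        value
        for value, ts in datapoints
        if value is not None and ts > now - window_seconds
    ]
    # An empty window is treated as "not sustained" rather than as an alert.
    return bool(recent) and all(value > threshold for value in recent)


def check_cpu(datapoints, now):
    """Map the two quoted rules onto nagios-style return codes."""
    if sustained_above(datapoints, 95.0, 6 * 3600, now):
        return CRITICAL
    if sustained_above(datapoints, 80.0, 24 * 3600, now):
        return WARNING
    return OK


# Illustrative data: one point per hour for the last 24 hours, all at 85%.
now = 10 * 24 * 3600
hourly = [[85.0, now - h * 3600] for h in range(24)]
print(check_cpu(hourly, now))  # prints 1
```

A single dip below the threshold inside the window resets the alert, which
matches the "for 24 hours" wording but may be stricter than wanted in
practice.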
