Yeah, that confused me too - I think that figure is specific to the
master/slave being polled (and that'll just be the active master, since
you're only reporting when master/elected is true).
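
If you want to sanity-check which master you're actually scraping, here's a
rough, untested sketch (plain urllib2 against the /metrics/snapshot endpoint;
the function name and defaults are my own) that only returns the utilization
figures when the polled master is the elected one:

    import json
    import urllib2

    def elected_master_utilization(host="localhost", port=5050):
        """Return utilization fractions from /metrics/snapshot, or None
        if this master is not the elected (active) one."""
        url = "http://%s:%d/metrics/snapshot" % (host, port)
        snapshot = json.load(urllib2.urlopen(url))

        # Standby masters report master/elected == 0; skip them so their
        # zeroed-out gauges don't get mixed into the graphs.
        if snapshot.get("master/elected") != 1:
            return None

        return {
            "cpus_percent": snapshot["master/cpus_percent"],
            "mem_percent": snapshot["master/mem_percent"],
            "disk_percent": snapshot["master/disk_percent"],
        }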

I'm using this one: https://github.com/rayrod2030/collectd-mesos - not
sure if that's the same as yours?


On 7 March 2015 at 18:56, Jeff Schroeder <jeffschroe...@computer.org> wrote:
> Responses inline
>
> On Sat, Mar 7, 2015 at 12:48 PM, CCAAT <cc...@tampabay.rr.com> wrote:
>>
>> ... snip ...
>>>
>>> After getting everything working, I built a few dashboards, one of which
>>> displays these stats from http://master:5050/metrics/snapshot:
>>>
>>> master/disk_percent
>>> master/cpus_percent
>>> master/mem_percent
>>>
>>> I had assumed that this was something like aggregate cluster
>>> utilization, but this seems incorrect in practice. I have a small
>>> cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had a
>>> dozen or so small tasks running, and launched 500 tasks with 1G of
>>> memory and 1 CPU each.
>>>
>>> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
>>> considerably. I did notice that cpus_percent went to around 0.94.
>>>
>>> What is the correct way to measure overall cluster utilization for
>>> capacity planning? We can have the NOC watch this and simply add more
>>> hardware when the number starts getting low.
>>
>>
>> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
>> development group has more accurate information, if not some vague roadmap,
>> on resource/process monitoring. Sooner or later this is going to become a
>> quintessential need, so I hope the "deep thinkers" in both the user and dev
>> groups are all over it.
>>
>> In fact, the monitoring itself can easily create significant load on the
>> cluster/cloud if one is not judicious in how it is architected, implemented,
>> and dynamically tuned.
>
> Monitoring via passive metrics gathering and application "telemetry" is one
> of the best ways to do it. That is how I've implemented things.
>
> The beauty of the REST API is that it isn't heavyweight, and every master
> has it on port 5050 (by default) and every slave has it on port 5051 (by
> default). Since I'm throwing this all into graphite (well technically
> cassandra fronted by cyanite fronted by graphite-api... but same
> difference), I found a reasonable way to do capacity planning. Collectd will
> poll the master/slave on each mesos host every 10 seconds (localhost:5050 on
> masters and localhost:5051 on slaves). This gets put into graphite via
> collectd's write_graphite plugin. These 3 graphite targets give me
> percentages of utilization for nice graphs:
>
> alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
> collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
> alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
> collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
> alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
> collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")
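
FWIW, here's roughly the shape of the read callback inside a collectd Python
plugin like that one - this is my own untested sketch, not the actual
collectd-mesos code, and the plugin/type-instance names are guesses matched
to the gauge names in those targets:

    import json
    import urllib2

    import collectd  # only available when run inside collectd's python plugin

    MESOS_HOST = "localhost"
    MESOS_PORT = 5050          # 5051 when pointed at a slave
    CLUSTER = "clustername"    # shows up as collectd.mesos.clustername.* paths

    def read_callback():
        url = "http://%s:%d/metrics/snapshot" % (MESOS_HOST, MESOS_PORT)
        snapshot = json.load(urllib2.urlopen(url))

        # Map a handful of snapshot fields onto the gauge names used in the
        # graphite targets above.
        gauges = {
            "master_cpu_used": snapshot.get("master/cpus_used"),
            "master_cpu_total": snapshot.get("master/cpus_total"),
            "master_mem_used": snapshot.get("master/mem_used"),
            "master_mem_total": snapshot.get("master/mem_total"),
            "master_disk_used": snapshot.get("master/disk_used"),
            "master_disk_total": snapshot.get("master/disk_total"),
        }

        for name, value in gauges.items():
            if value is None:
                continue
            val = collectd.Values(plugin="mesos", plugin_instance=CLUSTER,
                                  type="gauge", type_instance=name)
            val.dispatch(values=[value])

    collectd.register_read(read_callback, 10)  # poll every 10 seconds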
>
> With that data, you can have your monitoring tools such as nagios/icinga
> poll graphite. Using the native graphite render api, you can do things like:
>
>     * "if the cpu usage is over 80% for 24 hours, send a warning event"
>     * "if the cpu usage is over 95% for 6 hours, send a critical event"
>
> This allows mostly no-impact monitoring since the monitoring tools are
> hitting graphite.
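
If you'd rather script those rules than click them together, a rough sketch
of a nagios-style check against the graphite render API could look like this
(the graphite hostname, thresholds, and exit codes here are assumptions on my
part; the target is the CPU one from above):

    #!/usr/bin/env python
    # Rough nagios/icinga-style check hitting the graphite render API.
    import json
    import sys
    import urllib
    import urllib2

    GRAPHITE = "http://graphite.example.com"   # assumed graphite-api endpoint
    TARGET = ("asPercent(collectd.mesos.clustername.gauge-master_cpu_used,"
              "collectd.mesos.clustername.gauge-master_cpu_total)")

    def sustained_min(window):
        """Lowest non-null datapoint of TARGET over a window like '24hours'."""
        url = "%s/render?target=%s&from=-%s&format=json" % (
            GRAPHITE, urllib.quote(TARGET), window)
        series = json.load(urllib2.urlopen(url))
        if not series:
            return 0
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]
        return min(points) if points else 0

    if __name__ == "__main__":
        # "over 95% for 6 hours" -> critical, "over 80% for 24 hours" -> warning
        if sustained_min("6hours") > 95:
            print "CRITICAL: cluster CPU over 95% for the last 6 hours"
            sys.exit(2)
        if sustained_min("24hours") > 80:
            print "WARNING: cluster CPU over 80% for the last 24 hours"
            sys.exit(1)
        print "OK: cluster CPU within capacity thresholds"
        sys.exit(0)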
>
> Anyways, back to the original questions:
>
> How does everyone do proper monitoring and capacity planning for large mesos
> clusters? I expect my cluster to grow beyond what it currently is by quite a
> bit.
>
> --
> Jeff Schroeder
>
> Don't drink and derive, alcohol and analysis don't mix.
> http://www.digitalprognosis.com
