Responses inline

On Sat, Mar 7, 2015 at 12:48 PM, CCAAT <cc...@tampabay.rr.com> wrote:

> ... snip ...
>
>> After getting everything working, I built a few dashboards, one of which
>> displays these stats from http://master:5051/metrics/snapshot:
>>
>> master/disk_percent
>> master/cpus_percent
>> master/mem_percent
>>
>> I had assumed that this was something like aggregate cluster
>> utilization, but this seems incorrect in practice. I have a small
>> cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had a
>> dozen or so small tasks running, and launched 500 tasks with 1G of
>> memory and 1 CPU each.
>>
>> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
>> considerably. I did notice that cpus_percent went to around 0.94.
>>
>> What is the correct way to measure overall cluster utilization for
>> capacity planning? We can have the NOC watch this and simply add more
>> hardware when the number starts getting low.
>>
>
> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
> development group has more accurate information if not some vague roadmap
> on resource/process monitoring. Sooner or later, this is going to become a
> quintessential need; so I hope the "deep thinkers" are all over this need
> both in the user and dev groups.
>
> In fact, the monitoring itself can easily create a significant load on
> the cluster/cloud, if one is not judicious in how it is architected,
> implemented, and dynamically tuned.
>



Monitoring via passive metrics gathering and application "telemetry" is one
of the best ways to do it. That is how I've implemented things.



The beauty of the REST API is that it isn't heavyweight, and every master
has it on port 5050 (by default) and every slave has it on port 5051 (by
default). Since I'm throwing this all into graphite (well, technically
cassandra fronted by cyanite fronted by graphite-api... but same
difference), I found a reasonable way to do capacity planning. Collectd
polls the master/slave on each mesos host every 10 seconds (localhost:5050
on masters and localhost:5051 on slaves). This gets put into graphite via
collectd's write_graphite plugin. These three graphite targets give me
percentages of utilization for nice graphs:

alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")
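
If you don't have collectd handy, the same idea is easy to sketch by hand.
This is only a rough outline (the graphite host and metric prefix are
placeholders for whatever naming scheme you use), reading the standard
master/cpus_used-style keys from /metrics/snapshot and writing to graphite's
plaintext listener on port 2003:

#!/usr/bin/env python3
# Rough sketch: poll a Mesos master's /metrics/snapshot and push the
# used/total gauges to graphite's plaintext listener. The snapshot keys
# (master/cpus_used, ...) are standard; the graphite host and the metric
# prefix below are placeholders for your own setup.
import json
import socket
import time
from urllib.request import urlopen

MESOS_MASTER = "http://localhost:5050/metrics/snapshot"
GRAPHITE = ("graphite.example.com", 2003)   # assumed plaintext listener
PREFIX = "collectd.mesos.clustername.gauge-master_"

KEYS = {
    "cpu_used": "master/cpus_used",   "cpu_total": "master/cpus_total",
    "mem_used": "master/mem_used",    "mem_total": "master/mem_total",
    "disk_used": "master/disk_used",  "disk_total": "master/disk_total",
}

snapshot = json.load(urlopen(MESOS_MASTER, timeout=5))
now = int(time.time())
lines = ["{}{} {} {}".format(PREFIX, name, snapshot[key], now)
         for name, key in KEYS.items() if key in snapshot]

with socket.create_connection(GRAPHITE, timeout=5) as sock:
    sock.sendall(("\n".join(lines) + "\n").encode())

Run that from cron (or a collectd exec plugin) at whatever interval you
like; the gauge names line up with the asPercent() targets above.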

With that data, you can have your monitoring tools, such as nagios/icinga,
poll graphite. Using the native graphite render API, you can do things like:

    * "if the cpu usage is over 80% for 24 hours, send a warning event"
    * "if the cpu usage is over 95% for 6 hours, send a critical event"

This keeps the monitoring essentially zero-impact on the cluster itself,
since the monitoring tools only ever hit graphite.
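
A minimal sketch of such a check (the graphite host is a placeholder, and
I'm approximating "over X% for N hours" with the average over that window)
could look something like this:

#!/usr/bin/env python3
# Nagios-style check sketch: pull the cluster CPU percentage from graphite's
# render API as JSON and map it onto the two example rules above
# (>95% over 6 hours -> critical, >80% over 24 hours -> warning).
import json
import sys
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPHITE = "http://graphite.example.com/render"   # placeholder
TARGET = ("asPercent(collectd.mesos.clustername.gauge-master_cpu_used,"
          "collectd.mesos.clustername.gauge-master_cpu_total)")

def average(window):
    """Average the target's datapoints over a window like '-6hours'."""
    query = urlencode({"target": TARGET, "from": window, "format": "json"})
    series = json.load(urlopen("{}?{}".format(GRAPHITE, query), timeout=10))
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    return sum(values) / len(values) if values else 0.0

if average("-6hours") > 95:
    print("CRITICAL: cluster cpu usage over 95% for the last 6 hours")
    sys.exit(2)
if average("-24hours") > 80:
    print("WARNING: cluster cpu usage over 80% for the last 24 hours")
    sys.exit(1)
print("OK: cluster cpu usage within thresholds")
sys.exit(0)

Exit codes 0/1/2 are the usual nagios OK/WARNING/CRITICAL convention, so the
same script drops into icinga unchanged.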

Anyways, back to the original questions:

How does everyone do proper monitoring and capacity planning for large
mesos clusters? I expect my cluster to grow beyond what it currently is by
quite a bit.

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
