I wrote a Python collectd plugin which pulls master stats (only when
master/elected == 1) and slave stats from the REST API, under
/metrics/snapshot and /slave(1)/stats.json respectively, and ships them
to Graphite.
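
The read callback is roughly the minimal sketch below (master half only;
the master URL is a placeholder for my setup, and I'm assuming the
endpoint returns a flat JSON map of metric name -> number):

    # Minimal collectd read callback sketch for the master metrics.
    import json
    import urllib2

    import collectd

    MASTER_URL = 'http://master:5050/metrics/snapshot'  # placeholder URL

    def read_callback():
        try:
            snapshot = json.load(urllib2.urlopen(MASTER_URL, timeout=5))
        except Exception as exc:
            collectd.warning('mesos plugin: fetch failed: %s' % exc)
            return
        # Only ship master metrics while this node is the elected leader.
        if snapshot.get('master/elected') != 1:
            return
        for name, value in snapshot.iteritems():
            val = collectd.Values(plugin='mesos', type='gauge')
            # Graphite-friendly path, e.g. mesos.master.cpus_percent
            val.type_instance = name.replace('/', '.')
            val.values = [float(value)]
            val.dispatch()

    collectd.register_read(read_callback)

The slave half does the same thing against each slave's
/slave(1)/stats.json.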

After getting everything working, I built a few dashboards, one of which
displays these stats from http://master:5051/metrics/snapshot:

master/disk_percent
master/cpus_percent
master/mem_percent
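
For reference, pulling those three values straight off the endpoint
(again assuming the flat name -> number JSON layout) looks something
like:

    import json
    import urllib2

    snapshot = json.load(urllib2.urlopen('http://master:5051/metrics/snapshot'))
    for key in ('master/disk_percent', 'master/cpus_percent', 'master/mem_percent'):
        print key, snapshot.get(key)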

I had assumed this was something like aggregate cluster utilization, but
in practice that doesn't seem to be the case. I have a small cluster with
~1T of memory, ~25T of disk, and ~540 CPU cores. I had a dozen or so small
tasks running, and launched 500 tasks with 1G of memory and 1 CPU each.

Now I'd expect to see the disk/cpu/mem percentage metrics above go up
considerably. I did notice that cpus_percent went to around 0.94.
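
Rough back-of-the-envelope math, assuming those percentages track
allocated resources rather than actual usage and treating ~1T of memory
as 1024 GB:

    # Expected percentages after launching 500 tasks of 1 CPU / 1 GB each
    # (assumption: *_percent reflects allocation, ignoring the dozen
    # small tasks already running).
    total_cpus = 540.0
    total_mem_gb = 1024.0

    new_cpus = 500 * 1.0   # 500 tasks x 1 CPU
    new_mem = 500 * 1.0    # 500 tasks x 1 GB

    print 'expected cpus_percent ~ %.2f' % (new_cpus / total_cpus)   # ~0.93
    print 'expected mem_percent  ~ %.2f' % (new_mem / total_mem_gb)  # ~0.49

The CPU number is roughly what I saw, so perhaps these are allocation
percentages rather than usage.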

What is the correct way to measure overall cluster utilization for capacity
planning? We can have the NOC watch that number and simply add more hardware
when the remaining headroom starts getting low.

Thanks

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
