Yeah, that confused me too - I think that figure is specific to the master/slave being polled (and that'll just be the active one, since you're only reporting when master/elected is true).
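For anyone else reading along, the gating doesn't amount to much more than this (a rough Python 3 sketch; the localhost:5050 URL and the choice of the three gauges are just illustrative):

    # Poll the local master's snapshot and only emit gauges when it is
    # the elected leader, so standby masters stay quiet.
    import json
    from urllib.request import urlopen

    SNAPSHOT_URL = "http://localhost:5050/metrics/snapshot"

    def poll():
        snapshot = json.load(urlopen(SNAPSHOT_URL))
        # Standby masters report master/elected == 0.
        if snapshot.get("master/elected") != 1:
            return
        for key in ("master/cpus_percent",
                    "master/mem_percent",
                    "master/disk_percent"):
            print(key, snapshot.get(key))

    if __name__ == "__main__":
        poll()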
I'm using this one https://github.com/rayrod2030/collectd-mesos , not sure
if that's the same as yours?

On 7 March 2015 at 18:56, Jeff Schroeder <jeffschroe...@computer.org> wrote:

> Responses inline
>
> On Sat, Mar 7, 2015 at 12:48 PM, CCAAT <cc...@tampabay.rr.com> wrote:
>>
>> ... snip ...
>>>
>>> After getting everything working, I built a few dashboards, one of which
>>> displays these stats from http://master:5051/metrics/snapshot:
>>>
>>> master/disk_percent
>>> master/cpus_percent
>>> master/mem_percent
>>>
>>> I had assumed that this was something like aggregate cluster
>>> utilization, but this seems incorrect in practice. I have a small
>>> cluster with ~1T of memory, ~25T of disk, and ~540 CPU cores. I had a
>>> dozen or so small tasks running, and launched 500 tasks with 1G of
>>> memory and 1 CPU each.
>>>
>>> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
>>> considerably. I did notice that cpus_percent went to around 0.94.
>>>
>>> What is the correct way to measure overall cluster utilization for
>>> capacity planning? We can have the NOC watch this and simply add more
>>> hardware when the number starts getting low.
>>
>> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
>> development group has more accurate information, if not some vague
>> roadmap on resource/process monitoring. Sooner or later, this is going
>> to become a quintessential need, so I hope the "deep thinkers" are all
>> over this need in both the user and dev groups.
>>
>> In fact, the monitoring can easily create significant load on the
>> cluster/cloud if one is not judicious in how this is architected,
>> implemented, and dynamically tuned.
>
> Monitoring via passive metrics gathering and application "telemetry" is
> one of the best ways to do it. That is how I've implemented things.
>
> The beauty of the REST API is that it isn't heavyweight, and every master
> has it on port 5050 (by default) and every slave has it on port 5051 (by
> default). Since I'm throwing this all into graphite (well, technically
> cassandra fronted by cyanite fronted by graphite-api... but same
> difference), I found a reasonable way to do capacity planning. Collectd
> polls the master/slave on each mesos host every 10 seconds
> (localhost:5050 on masters and localhost:5051 on slaves). This gets put
> into graphite via collectd's write_graphite plugin. These 3 graphite
> targets give me percentages of utilization for nice graphs:
>
> alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
> collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
> alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
> collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
> alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
> collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")
>
> With that data, you can have monitoring tools such as nagios/icinga poll
> graphite. Using the native graphite render API, you can do things like:
>
> * "if the cpu usage is over 80% for 24 hours, send a warning event"
> * "if the cpu usage is over 95% for 6 hours, send a critical event"
>
> This allows mostly no-impact monitoring, since the monitoring tools are
> hitting graphite rather than the masters themselves.
>
> Anyways, back to the original question:
>
> How does everyone do proper monitoring and capacity planning for large
> mesos clusters? I expect my cluster to grow beyond what it currently is
> by quite a bit.
>
> --
> Jeff Schroeder
>
> Don't drink and derive, alcohol and analysis don't mix.
> http://www.digitalprognosis.com
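Both halves of that pipeline are easy to prototype by hand, by the way. If
you aren't running collectd, the write_graphite leg is just graphite's
plaintext protocol on port 2003 (a minimal Python sketch; the graphite host,
metric path, and sample value are all illustrative, not from anyone's real
setup):

    # Hand-rolled stand-in for the write_graphite leg: push one gauge
    # to graphite's plaintext listener (port 2003 by default).
    import socket
    import time

    GRAPHITE_HOST = "graphite.example.com"
    GRAPHITE_PORT = 2003

    def send_gauge(path, value):
        # Plaintext protocol: "metric.path value unix_timestamp\n"
        line = "%s %f %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    # Illustrative value only.
    send_gauge("collectd.mesos.clustername.gauge-master_cpu_used", 507.0)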
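And the alerting leg stays read-only against graphite, as Jeff says. A rough
sketch of the "over 80% for 24 hours" warning check against the render API
(the graphite host is illustrative; in the json output each series carries a
list of [value, timestamp] datapoints):

    # Exit 1 (nagios WARNING) if CPU usage stayed above the threshold
    # for the whole lookback window, exit 0 (OK) otherwise.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    TARGET = ("asPercent(collectd.mesos.clustername.gauge-master_cpu_used,"
              "collectd.mesos.clustername.gauge-master_cpu_total)")

    def check(threshold=80.0, window="-24hours"):
        url = "http://graphite.example.com/render?" + urlencode(
            {"target": TARGET, "from": window, "format": "json"})
        series = json.load(urlopen(url))
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]
        if points and min(points) > threshold:
            print("WARNING: cpu over %.0f%% for the whole window" % threshold)
            return 1
        print("OK")
        return 0

    if __name__ == "__main__":
        raise SystemExit(check())

The "over 95% for 6 hours, critical" case is the same check with
threshold=95.0, window="-6hours", and exit code 2.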