We use the same monitoring script from rayrod2030. However, instead of master_cpus_percent, we use master_cpus_used and master_cpus_total to calculate a percentage. This gives the percentage of CPUs allocated in the cluster; the actual utilization is measured by collectd.
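For anyone who wants to reproduce the calculation outside of graphite, here is a minimal sketch in Python. It assumes the snapshot has already been fetched and parsed into a dict, and that the key names follow the same pattern as master/cpus_percent (master/cpus_used, master/cpus_total); the function name and the sample values are made up for illustration.

```python
def allocation_percent(snapshot, resource="cpus"):
    """Return allocated/total for one resource as a percentage.

    `snapshot` is a dict parsed from a metrics snapshot. The key names
    below are an assumption based on the master/cpus_percent naming.
    """
    used = snapshot[f"master/{resource}_used"]
    total = snapshot[f"master/{resource}_total"]
    if total == 0:
        # Nothing registered yet; avoid dividing by zero.
        return 0.0
    return 100.0 * used / total


# Illustrative values only, not real cluster data.
sample = {"master/cpus_used": 135.0, "master/cpus_total": 540.0}
print(allocation_percent(sample))
```

Note that this is allocation (what tasks have reserved), not what the hosts are actually burning, which is why we still look at collectd's own CPU metrics alongside it.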
-----Original Message-----
From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick Davies
Sent: Saturday, March 07, 2015 2:15 PM
To: user@mesos.apache.org
Subject: Re: Question on Monitoring a Mesos Cluster

Yeah, that confused me too - I think that figure is specific to the
master/slave polled (and that'll just be the active one since you're
only reporting when master/elected is true).

I'm using this one https://github.com/rayrod2030/collectd-mesos , not
sure if that's the same as yours?

On 7 March 2015 at 18:56, Jeff Schroeder <jeffschroe...@computer.org> wrote:
> Responses inline
>
> On Sat, Mar 7, 2015 at 12:48 PM, CCAAT <cc...@tampabay.rr.com> wrote:
>>
>> ... snip ...
>>>
>>> After getting everything working, I built a few dashboards, one of
>>> which displays these stats from http://master:5051/metrics/snapshot:
>>>
>>> master/disk_percent
>>> master/cpus_percent
>>> master/mem_percent
>>>
>>> I had assumed that this was something like aggregate cluster
>>> utilization, but this seems incorrect in practice. I have a small
>>> cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had
>>> a dozen or so small tasks running, and launched 500 tasks with 1G of
>>> memory and 1 CPU each.
>>>
>>> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
>>> considerably. I did notice that cpus_percent went to around 0.94.
>>>
>>> What is the correct way to measure overall cluster utilization for
>>> capacity planning? We can have the NOC watch this and simply add
>>> more hardware when the number starts getting low.
>>
>>
>> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
>> development group has more accurate information if not some vague
>> roadmap on resource/process monitoring. Sooner or later, this is
>> going to become a quintessential need; so I hope the "deep thinkers"
>> are all over this need both in the user and dev groups.
>>
>> In fact the monitoring can easily create a significant loading on the
>> cluster/cloud, if one is not judicious in how this is architected,
>> implemented and dynamically tuned.
>
> Monitoring via passive metrics gathering and application "telemetry"
> is one of the best ways to do it. That is how I've implemented things
>
> The beauty of the rest api is that it isn't heavyweight, and every
> master has it on port 5050 (by default) and every slave has it on port
> 5051 (by default). Since I'm throwing this all into graphite (well
> technically cassandra fronted by cyanite fronted by graphite-api...
> but same difference), I found a reasonable way to do capacity
> planning. Collectd will poll the master/slave on each mesos host every
> 10 seconds (localhost:5050 on masters and localhost:5051 on slaves).
> This gets put into graphite via collectd's write_graphite plugin.
> These 3 graphite targets give me percentages of utilization for nice graphs:
>
> alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
> collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
> alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
> collectd.mesos.clustername.gauge-master_mem_total), "Total Memory
> Usage")
> alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
> collectd.mesos.clustername.gauge-master_disk_total), "Total Disk
> Usage")
>
> With that data, you can have your monitoring tools such as
> nagios/icinga poll graphite. Using the native graphite render api, you can
> do things like:
>
> * "if the cpu usage is over 80% for 24 hours, send a warning event"
> * "if the cpu usage is over 95% for 6 hours, send a critical event"
>
> This allows mostly no-impact monitoring since the monitoring tools are
> hitting graphite.
>
> Anyways, back to the original questions:
>
> How does everyone do proper monitoring and capacity planning for large
> mesos clusters? I expect my cluster to grow beyond what it currently
> is by quite a bit.
>
> --
> Jeff Schroeder
>
> Don't drink and derive, alcohol and analysis don't mix.
> http://www.digitalprognosis.com
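
The two alert rules quoted above ("over 80% for 24 hours", "over 95% for 6 hours") could be evaluated roughly like this once the percentage series has been pulled from graphite. This is only a sketch: it assumes the datapoints arrive as [value, timestamp] pairs (the shape the render API returns with format=json), and the function names are made up. It does no network I/O; fetching the series is left to whatever the nagios/icinga check already uses.

```python
def sustained_above(datapoints, threshold, window_seconds, now):
    """True if every non-null value in the trailing window exceeds threshold.

    `datapoints` is a list of [value, timestamp] pairs; null (None) values
    are skipped. An empty window counts as "not sustained".
    """
    recent = [value for value, ts in datapoints
              if value is not None and ts >= now - window_seconds]
    return bool(recent) and all(value > threshold for value in recent)


def severity(datapoints, now):
    """Map the two quoted rules onto a nagios-style state string."""
    hour = 3600
    if sustained_above(datapoints, 95.0, 6 * hour, now):
        return "critical"
    if sustained_above(datapoints, 80.0, 24 * hour, now):
        return "warning"
    return "ok"
```

One design caveat: the check only looks at the points it is given, so it does not verify that the series actually covers the full window. A real check would probably also want to alert on missing data.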