We are collecting system metrics through sysstat every minute and shipping them to OpenTSDB via Sensu. We have a plethora of metrics, but I am finding it difficult to create meaningful visualizations. We have alerting for things like individual OSDs reaching capacity thresholds and memory spikes on OSD or MON hosts. I am trying to come up with visualizations that could serve as solid indicators that something is wrong with the cluster in general, or with a particular host (besides CPU or memory utilization).

This morning, I have thought of things like:

- Stddev of bytes used across all disks in the cluster and on individual OSD hosts
- 1st and 2nd derivatives of bytes used across all disks in the cluster and on individual OSD hosts
- Bytes used in the entire cluster
- % usage of cluster capacity

Stddev should help us identify hotspots. Velocity and acceleration of bytes used should help us with capacity planning. Bytes used in general is a neat thing to see, but doesn't tell us all that much on its own. % usage of cluster capacity is another thing that's just kind of neat to see.
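For the stddev and derivative pieces, what I have in mind is roughly the following against OpenTSDB's 2.x HTTP API. It's an untested sketch: the metric name ceph.osd.bytes_used is hypothetical (substitute whatever Sensu actually emits), "dev" is OpenTSDB's standard-deviation aggregator, and "rate" gives the first derivative. The second derivative would still have to be computed outside OpenTSDB, since sub-queries don't nest.

    # Untested sketch. Assumes a hypothetical per-host metric
    # "ceph.osd.bytes_used" and an OpenTSDB 2.x /api/query endpoint.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    TSDB = "http://opentsdb.example.com:4242/api/query"
    METRIC = "ceph.osd.bytes_used"  # hypothetical; use the real metric name

    def query(subquery, start="1h-ago"):
        # subquery is an OpenTSDB "m" expression, e.g. "dev:metric{host=*}"
        url = TSDB + "?" + urlencode({"start": start, "m": subquery})
        with urlopen(url) as resp:
            return json.load(resp)

    # Stddev of bytes used across every series in the cluster -> hotspot indicator.
    cluster_spread = query("dev:%s" % METRIC)

    # Per-host first derivative (velocity of growth) for capacity planning.
    velocity = query("sum:rate:%s{host=*}" % METRIC)

    for series in cluster_spread + velocity:
        print(series["metric"], series.get("tags"), len(series["dps"]), "points")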
What would you suggest looking for in dump_historic_ops? Maybe get regular metrics on things like total transaction length? The only problem is that dump_historic_ops may not always contain relevant or recent data, and it is not as easily translated into time series data as some other things.
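The best idea I have so far is to poll the admin socket on a schedule and boil whatever dump_historic_ops returns down to a few gauges (op count, max and total duration) pushed to OpenTSDB. A rough, untested sketch; the JSON field names ("ops" vs. "Ops", "duration") vary between Ceph releases, so treat those as assumptions to check against your own output:

    # Untested sketch: reduce dump_historic_ops on one OSD to a few gauges
    # emitted in OpenTSDB's telnet "put" format. Verify the JSON field
    # names against your Ceph release before relying on this.
    import json
    import subprocess
    import time

    OSD_SOCKET = "/var/run/ceph/ceph-osd.0.asok"  # adjust per OSD

    out = subprocess.check_output(
        ["ceph", "--admin-daemon", OSD_SOCKET, "dump_historic_ops"])
    dump = json.loads(out)

    # Some releases use "Ops", others "ops"; durations are in seconds.
    ops = dump.get("ops") or dump.get("Ops") or []
    durations = [float(op.get("duration", 0)) for op in ops]
    now = int(time.time())

    print("put ceph.osd.historic_ops.count %d %d osd=0" % (now, len(ops)))
    if durations:
        print("put ceph.osd.historic_ops.max_duration %d %f osd=0" % (now, max(durations)))
        print("put ceph.osd.historic_ops.total_duration %d %f osd=0" % (now, sum(durations)))

It stays lossy, since the daemon only keeps a small window of recent ops, but as a "this OSD is getting slow" indicator it might be enough.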
On Sat, Apr 12, 2014 at 9:23 AM, Mark Nelson <mark.nel...@inktank.com> wrote:

> One thing I do right now for ceph performance testing is run a copy of
> collectl during every test. This gives you a TON of information about CPU
> usage, network stats, disk stats, etc. It's pretty easy to import the
> output data into gnuplot. Mark Seger (the creator of collectl) also has
> some tools to gather aggregate statistics across multiple nodes. Beyond
> collectl, you can get a ton of useful data out of the ceph admin socket. I
> especially like dump_historic_ops as it sometimes is enough to avoid
> having to parse through debug 20 logs.
>
> While the following tools have too much overhead to be really useful for
> general system monitoring, they are really useful for specific performance
> investigations:
>
> 1) perf with the dwarf/unwind support
> 2) blktrace (optionally with seekwatcher)
> 3) valgrind (cachegrind, callgrind, massif)
>
> Beyond that, there are some collectd plugins for Ceph, and last time I
> checked DreamHost was using Graphite for a lot of visualizations. There's
> always ganglia too. :)
>
> Mark
>
> On 04/12/2014 09:41 AM, Jason Villalta wrote:
>
>> I know ceph throws some warnings if there is high write latency. But I
>> would be most interested in the delay for IO requests, linking directly
>> to IOPS. If IOPS start to drop because the disks are overwhelmed, then
>> latency for requests would be increasing. This would tell me that I
>> need to add more OSDs/nodes. I am not sure there is a specific metric
>> in ceph for this, but it would be awesome if there was.
>>
>> On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier <greg.poir...@opower.com> wrote:
>>
>> Curious as to how you define cluster latency.
>>
>> On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta <ja...@rubixnet.com> wrote:
>>
>> Hi, I have not done anything with metrics yet, but the only ones
>> I personally would be interested in are total capacity
>> utilization and cluster latency.
>>
>> Just my 2 cents.
>>
>> On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier <greg.poir...@opower.com> wrote:
>>
>> I'm in the process of building a dashboard for our Ceph
>> nodes. I was wondering if anyone out there had instrumented
>> their OSD / MON clusters and found particularly useful
>> visualizations.
>>
>> At first, I was trying to do ridiculous things (like
>> graphing % used for every disk in every OSD host), but I
>> realized quickly that that is simply too many metrics and
>> far too visually dense to be useful. I am attempting to put
>> together a few simpler, more dense visualizations like
>> overall cluster utilization, aggregate CPU and memory
>> utilization per OSD host, etc.
>>
>> Just looking for some suggestions. Thanks!
>>
>> --
>> Jason Villalta
>> Co-founder
>> 800.799.4407 x1230 | www.RubixTechnology.com