Re: [ceph-users] Useful visualizations / metrics
Hi, I haven't done anything with metrics yet, but the only ones I personally would be interested in are total capacity utilization and cluster latency. Just my 2 cents.

On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier greg.poir...@opower.com wrote:
> I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations. At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, denser visualizations: overall cluster utilization, aggregate CPU and memory utilization per OSD host, etc. Just looking for some suggestions. Thanks!

--
Jason Villalta
Co-founder
800.799.4407x1230 | www.RubixTechnology.com

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Useful visualizations / metrics
Curious as to how you define cluster latency.
Re: [ceph-users] Useful visualizations / metrics
I know Ceph throws some warnings if there is high write latency, but I would be most interested in the delay for I/O requests, which ties directly to IOPS. If IOPS start to drop because the disks are overwhelmed, then latency for requests would be increasing. That would tell me that I need to add more OSDs/nodes. I am not sure there is a specific metric in Ceph for this, but it would be awesome if there was.

--
Jason Villalta
Co-founder
800.799.4407x1230 | www.RubixTechnology.com
Re: [ceph-users] Useful visualizations / metrics
One thing I do right now for Ceph performance testing is run a copy of collectl during every test. This gives you a TON of information about CPU usage, network stats, disk stats, etc. It's pretty easy to import the output data into gnuplot. Mark Seger (the creator of collectl) also has some tools to gather aggregate statistics across multiple nodes.

Beyond collectl, you can get a ton of useful data out of the Ceph admin socket. I especially like dump_historic_ops, as it is sometimes enough to avoid having to parse through debug 20 logs.

While the following tools have too much overhead to be really useful for general system monitoring, they are really useful for specific performance investigations:

1) perf with the dwarf/unwind support
2) blktrace (optionally with seekwatcher)
3) valgrind (cachegrind, callgrind, massif)

Beyond that, there are some collectd plugins for Ceph, and last time I checked DreamHost was using Graphite for a lot of visualizations. There's always ganglia too. :)

Mark
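The dump_historic_ops suggestion is easy to script against, since the admin socket returns JSON. A minimal Python sketch that pulls out the slowest recent ops, assuming output shaped roughly like the embedded sample (the exact field names such as "ops" and "duration" vary across Ceph versions, so the sample here is hypothetical):

```python
import json

# Hypothetical sample shaped like dump_historic_ops output; adjust the
# key names to whatever your Ceph version actually emits.
SAMPLE = json.dumps({
    "num_ops": 3,
    "ops": [
        {"description": "osd_op(client.4123.0:1 rb.0.1 [write 0~4096])",
         "duration": 0.145},
        {"description": "osd_op(client.4123.0:2 rb.0.2 [read 0~4096])",
         "duration": 0.012},
        {"description": "osd_op(client.4123.0:3 rb.0.3 [write 4096~4096])",
         "duration": 1.203},
    ],
})

def slowest_ops(dump_json, n=2):
    """Return the n slowest ops as (description, duration) pairs from a
    dump_historic_ops JSON blob."""
    data = json.loads(dump_json)
    ops = sorted(data.get("ops", []),
                 key=lambda op: float(op["duration"]), reverse=True)
    return [(op["description"], float(op["duration"])) for op in ops[:n]]

if __name__ == "__main__":
    # In practice you would capture the JSON with something like:
    #   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
    for desc, dur in slowest_ops(SAMPLE):
        print("%.3fs  %s" % (dur, desc))
```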
Re: [ceph-users] Useful visualizations / metrics
We are collecting system metrics through sysstat every minute and getting those to OpenTSDB via Sensu. We have a plethora of metrics, but I am finding it difficult to create meaningful visualizations. We have alerting for things like individual OSDs reaching capacity thresholds and memory spikes on OSD or MON hosts. I am just trying to come up with some visualizations that could become solid indicators that something is wrong with the cluster in general, or with a particular host (besides CPU or memory utilization).

This morning, I have thought of things like:

- Stddev of bytes used on all disks in the cluster and individual OSD hosts
- 1st and 2nd derivative of bytes used on all disks in the cluster and individual OSD hosts
- Bytes used in the entire cluster
- % usage of cluster capacity

Stddev should help us identify hotspots. Velocity and acceleration of bytes used should help us with capacity planning. Bytes used in general is just a neat thing to see, but doesn't tell us all that much. % usage of cluster capacity is another thing that's just kind of neat to see.

What would you suggest looking for in dump_historic_ops? Maybe get regular metrics on things like total transaction length? The only problem is that dump_historic_ops may not always contain relevant/recent data. It is not as easily translated into time series data as some other things.
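The stddev and derivative metrics discussed above reduce to simple arithmetic once the samples are in hand. A sketch, with hypothetical numbers standing in for real per-OSD and cluster-wide samples:

```python
import math

def stddev(values):
    """Population standard deviation of per-OSD bytes-used samples;
    a high value suggests data is unevenly placed (a hotspot)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def derivative(series, interval_s):
    """First derivative (bytes/sec) between consecutive samples taken
    interval_s seconds apart; apply twice to get acceleration."""
    return [(b - a) / interval_s for a, b in zip(series, series[1:])]

# Hypothetical samples: bytes used per OSD at one instant...
per_osd = [400e9, 410e9, 395e9, 900e9]      # one OSD is a hotspot
# ...and cluster-wide bytes used sampled once a minute.
cluster = [10.0e12, 10.1e12, 10.3e12, 10.6e12]

velocity = derivative(cluster, 60)           # growth rate, bytes/sec
acceleration = derivative(velocity, 60)      # change in growth rate
```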
Re: [ceph-users] Useful visualizations / metrics
I've been graphing disk latency, OSD latency, and RGW latency. It's a bit tricky to pull out of ceph --admin-daemon ceph-osd.0.asok perf dump, though. perf dump gives you the total ops and total op time. You have to track the delta of those two values, then divide the deltas to get the average latency over your sample interval.

I had some alerting on those values, but it was too noisy. The graphs are helpful, though, especially the graphs that have all of a single node's disks (one graph) and OSDs (second graph) on them. Viewing both graphs helped me identify several problems, including a failing disk and a bad write cache battery.

I'm not getting much out of the RGW latency graph, though. It's pretty much just the sum of all the OSD latency graphs during that sample interval.

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com
Central Desktop. Work together in ways you never thought possible.
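The delta arithmetic Craig describes can be sketched as a small function. It assumes a perf dump counter shaped like {"avgcount": total_ops, "sum": total_op_time_seconds}; counter names and layout vary by Ceph version, so treat the field names as assumptions:

```python
def interval_latency(prev, curr):
    """Average op latency over one sample interval, computed from two
    successive perf dump samples of a cumulative counter shaped like
    {"avgcount": total_ops, "sum": total_op_time_seconds}."""
    d_ops = curr["avgcount"] - prev["avgcount"]
    d_time = curr["sum"] - prev["sum"]
    if d_ops <= 0:
        return None  # no ops completed this interval (or counter reset)
    return d_time / d_ops

# Hypothetical samples one polling interval apart:
prev = {"avgcount": 1000, "sum": 12.5}
curr = {"avgcount": 1600, "sum": 15.5}
# 600 ops took 3.0 s in total -> 5 ms average latency this interval.
```

Feeding the result into a time series gives the per-interval latency graph rather than the lifetime average, which is what makes the delta step necessary.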