We are collecting system metrics through sysstat every minute and shipping
them to OpenTSDB via Sensu. We have a plethora of metrics, but I am finding
it difficult to create meaningful visualizations. We have alerting for
things like individual OSDs reaching capacity thresholds and memory spikes
on OSD or MON hosts. I am trying to come up with visualizations that could
serve as solid indicators that something is wrong with the cluster in
general, or with a particular host (beyond CPU or memory utilization).
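
For context, the raw series underneath most of these ideas is per-filesystem
bytes used, tagged by host and mount. Here is a minimal sketch of what a
metric check could print in OpenTSDB's put format; the metric name, tags,
and mount points are made up, and Sensu handles the actual transport in our
setup:

    # Sketch only: print per-OSD-filesystem "bytes used" as OpenTSDB-style
    # "put" lines. The metric name, tags, and mount points are made up.
    import os
    import socket
    import time

    MOUNTPOINTS = ["/var/lib/ceph/osd/ceph-0", "/var/lib/ceph/osd/ceph-1"]

    def bytes_used(path):
        """Bytes used on the filesystem backing `path` (via statvfs)."""
        st = os.statvfs(path)
        return (st.f_blocks - st.f_bfree) * st.f_frsize

    def main():
        host = socket.gethostname()
        now = int(time.time())
        for mount in MOUNTPOINTS:
            # put <metric> <unix timestamp> <value> <tag=value> ...
            print("put ceph.osd.disk.bytes_used %d %d host=%s mount=%s"
                  % (now, bytes_used(mount), host, mount))

    if __name__ == "__main__":
        main()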

This morning, I have thought of things like:

- Stddev of bytes used across all disks, cluster-wide and per OSD host
- 1st and 2nd derivatives of bytes used, cluster-wide and per OSD host
- Bytes used across the entire cluster
- % of cluster capacity used

Stddev should help us identify hotspots. Velocity and acceleration of bytes
used should help with capacity planning. Total bytes used is a neat thing to
see, but doesn't tell us all that much on its own; % of cluster capacity used
is similar, just nice to have.
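
For concreteness, here is a minimal sketch of those derived series, assuming
the per-OSD and cluster-wide bytes-used samples can be pulled back out as
plain (timestamp, value) pairs; the numbers below are made up.

    # Sketch of the derived series above: stddev across OSD disks at one
    # sample time (hotspots), and first/second differences of cluster bytes
    # used over time (velocity/acceleration for capacity planning).
    import statistics

    def hotspot_stddev(bytes_used_per_osd):
        """Stddev of bytes used across OSD disks at a single sample time."""
        return statistics.pstdev(bytes_used_per_osd)

    def differences(samples):
        """First differences of (timestamp, value) samples, per second."""
        return [(t1, (v1 - v0) / float(t1 - t0))
                for (t0, v0), (t1, v1) in zip(samples, samples[1:])]

    # Made-up numbers: cluster bytes used, sampled once a minute.
    cluster_used = [(0, 10.0e12), (60, 10.1e12), (120, 10.3e12), (180, 10.6e12)]
    velocity = differences(cluster_used)       # 1st derivative (bytes/s)
    acceleration = differences(velocity)       # 2nd derivative (bytes/s^2)

    print(hotspot_stddev([0.9e12, 1.1e12, 1.0e12, 1.4e12]))
    print(velocity)
    print(acceleration)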

What would you suggest looking for in dump_historic_ops? Maybe regular
metrics on things like total transaction length? The only problem is that
dump_historic_ops may not always contain relevant or recent data, so it
doesn't translate into time series as easily as other sources do.
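
The crude approach I had in mind looks something like the sketch below: poll
the admin socket on an interval and emit a summary of whatever happens to be
in the historic buffer at that moment. The socket path and the JSON field
names are assumptions and vary by release.

    # Rough sketch: poll dump_historic_ops on the OSD admin socket and emit
    # a few summary points per poll. The socket path and the JSON field
    # names ("ops" vs "Ops", "duration") vary between releases.
    import json
    import subprocess
    import time

    def historic_op_durations(osd_id):
        out = subprocess.check_output([
            "ceph", "--admin-daemon",
            "/var/run/ceph/ceph-osd.%d.asok" % osd_id,
            "dump_historic_ops",
        ])
        data = json.loads(out.decode("utf-8"))
        ops = data.get("ops", data.get("Ops", []))
        return sorted(float(op["duration"]) for op in ops)

    def summarize(osd_id):
        now = int(time.time())
        durations = historic_op_durations(osd_id)
        if not durations:
            return []  # buffer may be empty or stale; emit nothing this round
        return [
            ("ceph.osd.historic_ops.count", now, len(durations)),
            ("ceph.osd.historic_ops.max_duration", now, durations[-1]),
            ("ceph.osd.historic_ops.median_duration", now,
             durations[len(durations) // 2]),
        ]

    if __name__ == "__main__":
        for metric, ts, value in summarize(0):
            print("put %s %d %s host=osd-host-01 osd=0" % (metric, ts, value))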




On Sat, Apr 12, 2014 at 9:23 AM, Mark Nelson <mark.nel...@inktank.com> wrote:

> One thing I do right now for ceph performance testing is run a copy of
> collectl during every test.  This gives you a TON of information about CPU
> usage, network stats, disk stats, etc.  It's pretty easy to import the
> output data into gnuplot.  Mark Seger (the creator of collectl) also has
> some tools to gather aggregate statistics across multiple nodes. Beyond
> collectl, you can get a ton of useful data out of the ceph admin socket.  I
> especially like dump_historic_ops as it sometimes is enough to avoid
> having to parse through debug 20 logs.
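
The perf counters on that same admin socket seem like the easiest thing to
turn into latency time series (which may also be close to what Jason is
after below). A rough sketch, assuming the osd section exposes op_w_latency
as an avgcount/sum pair; counter names differ across releases, so
double-check what your build actually exposes:

    # Sketch: average op write latency over an interval, from two reads of
    # "perf dump" on the OSD admin socket. The counter layout assumed here
    # ("osd" -> "op_w_latency" -> avgcount/sum) differs across releases.
    import json
    import subprocess
    import time

    SOCK = "/var/run/ceph/ceph-osd.0.asok"  # hypothetical socket path

    def perf_dump(sock):
        out = subprocess.check_output(["ceph", "--admin-daemon", sock,
                                       "perf", "dump"])
        return json.loads(out.decode("utf-8"))

    def avg_latency(prev, curr, section="osd", counter="op_w_latency"):
        """Mean latency (seconds) over the interval, from avgcount/sum deltas."""
        p, c = prev[section][counter], curr[section][counter]
        ops = c["avgcount"] - p["avgcount"]
        if ops <= 0:
            return None  # no write ops completed in the interval
        return (c["sum"] - p["sum"]) / float(ops)

    if __name__ == "__main__":
        first = perf_dump(SOCK)
        time.sleep(60)
        second = perf_dump(SOCK)
        latency = avg_latency(first, second)
        if latency is not None:
            print("put ceph.osd.op_w_latency %d %f host=osd-host-01 osd=0"
                  % (int(time.time()), latency))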
>
> While the following tools have too much overhead to be really useful for
> general system monitoring, they are really useful for specific performance
> investigations:
>
> 1) perf with the dwarf/unwind support
> 2) blktrace (optionally with seekwatcher)
> 3) valgrind (cachegrind, callgrind, massif)
>
> Beyond that, there are some collectd plugins for Ceph and last time I
> checked DreamHost was using Graphite for a lot of visualizations. There's
> always ganglia too. :)
>
> Mark
>
>
> On 04/12/2014 09:41 AM, Jason Villalta wrote:
>
>> I know Ceph throws some warnings if there is high write latency.  But I
>> would be most interested in the delay for IO requests, linking directly
>> to IOPS.  If IOPS start to drop because the disks are overwhelmed, then
>> latency for requests would be increasing.  This would tell me that I
>> need to add more OSDs/nodes.  I am not sure there is a specific metric
>> in Ceph for this, but it would be awesome if there was.
>>
>>
>> On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier <greg.poir...@opower.com> wrote:
>>
>>     Curious as to how you define cluster latency.
>>
>>
>>     On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta <ja...@rubixnet.com> wrote:
>>
>>         Hi, I have not done anything with metrics yet, but the only ones
>>         I personally would be interested in are total capacity
>>         utilization and cluster latency.
>>
>>         Just my 2 cents.
>>
>>
>>         On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier <greg.poir...@opower.com> wrote:
>>
>>             I'm in the process of building a dashboard for our Ceph
>>             nodes. I was wondering if anyone out there had instrumented
>>             their OSD / MON clusters and found particularly useful
>>             visualizations.
>>
>>             At first, I was trying to do ridiculous things (like
>>             graphing % used for every disk in every OSD host), but I
>>             realized quickly that that is simply too many metrics and
>>             far too visually dense to be useful. I am attempting to put
>>             together a few simpler, more dense visualizations like...
>>             overall cluster utilization, aggregate CPU and memory
>>             utilization per OSD host, etc.
>>
>>             Just looking for some suggestions.  Thanks!
>>
>>
>>
>>
>>
>>         --
>>         Jason Villalta
>>         Co-founder
>>         800.799.4407x1230 | www.RubixTechnology.com
>>
>>
>>
>>
>>
>> --
>> Jason Villalta
>> Co-founder
>> 800.799.4407x1230 | www.RubixTechnology.com
>>
>>
>>
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
