Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Jason Villalta
Hi, I haven't done anything with metrics yet, but the only ones I personally
would be interested in are total capacity utilization and cluster latency.

Just my 2 cents.


On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier greg.poir...@opower.com wrote:

 I'm in the process of building a dashboard for our Ceph nodes. I was
 wondering if anyone out there had instrumented their OSD / MON clusters and
 found particularly useful visualizations.

 At first, I was trying to do ridiculous things (like graphing % used for
 every disk in every OSD host), but I realized quickly that that is simply
 too many metrics and far too visually dense to be useful. I am attempting
 to put together a few simpler, denser visualizations like... overall
 cluster utilization, aggregate CPU and memory utilization per OSD host, etc.

 Just looking for some suggestions. Thanks!





-- 
*Jason Villalta*
Co-founder
800.799.4407 x1230 | www.RubixTechnology.com


Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Greg Poirier
Curious as to how you define cluster latency.




Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Jason Villalta
I know Ceph throws some warnings if there is high write latency, but I
would be most interested in the delay for IO requests, which links directly
to IOPS. If IOPS start to drop because the disks are overwhelmed, then
latency for requests would be increasing. That would tell me that I need to
add more OSDs/nodes. I am not sure there is a specific metric in Ceph for
this, but it would be awesome if there was.
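
A rough sketch of that heuristic, on made-up samples: flag intervals where
IOPS falls while per-request latency rises, which is the saturation signal
described above:

# Each sample is (iops, average request latency in seconds); the values are
# hypothetical and would come from whatever collects your disk metrics.
samples = [(1200, 0.004), (1150, 0.006), (900, 0.015)]

# Compare each sample to the previous one: falling IOPS with rising latency
# suggests the disks are saturated and more OSDs/nodes may be needed.
for (iops0, lat0), (iops1, lat1) in zip(samples, samples[1:]):
    if iops1 < iops0 and lat1 > lat0:
        print("possible saturation: iops %d -> %d, latency %.3fs -> %.3fs"
              % (iops0, iops1, lat0, lat1))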




Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Mark Nelson
One thing I do right now for Ceph performance testing is run a copy of 
collectl during every test.  This gives you a TON of information about 
CPU usage, network stats, disk stats, etc.  It's pretty easy to import 
the output data into gnuplot.  Mark Seger (the creator of collectl) also 
has some tools to gather aggregate statistics across multiple nodes. 
Beyond collectl, you can get a ton of useful data out of the Ceph admin 
socket.  I especially like dump_historic_ops, as it is sometimes enough 
to avoid having to parse through debug 20 logs.
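
For anyone wiring this into a dashboard, here is a minimal sketch of polling
dump_historic_ops through the admin socket. The asok path is a hypothetical
example, and the top-level key of the JSON has varied across releases, so
verify both against your own cluster:

import json
import subprocess

# Pull the recent slow ops this OSD has retained and print the five
# longest-running ones. ASOK is a hypothetical path; adjust per deployment.
ASOK = "/var/run/ceph/ceph-osd.0.asok"

out = subprocess.check_output(
    ["ceph", "--admin-daemon", ASOK, "dump_historic_ops"])
dump = json.loads(out)
ops = dump.get("Ops") or dump.get("ops") or []  # key name varies by release

for op in sorted(ops, key=lambda o: float(o["duration"]), reverse=True)[:5]:
    print("%8.3fs  %s" % (float(op["duration"]), op["description"]))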


While the following tools have too much overhead to be really useful for 
general system monitoring, they are really useful for specific 
performance investigations:


1) perf with the dwarf/unwind support
2) blktrace (optionally with seekwatcher)
3) valgrind (cachegrind, callgrind, massif)

Beyond that, there are some collectd plugins for Ceph, and last time I 
checked DreamHost was using Graphite for a lot of visualizations. 
There's always Ganglia too. :)


Mark






Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Greg Poirier
We are collecting system metrics through sysstat every minute and getting
those to OpenTSDB via Sensu. We have a plethora of metrics, but I am
finding it difficult to create meaningful visualizations. We have alerting
for things like individual OSDs reaching capacity thresholds and memory
spikes on OSD or MON hosts. I am just trying to come up with some
visualizations that could become solid indicators that something is wrong
with the cluster in general, or with a particular host (besides CPU or
memory utilization).

This morning, I have thought of things like:

- Stddev of bytes used on all disks in the cluster and individual OSD hosts
- 1st and 2nd derivative of bytes used on all disks in the cluster and
individual OSD hosts
- bytes used in the entire cluster
- % usage of cluster capacity

Stddev should help us identify hotspots. Velocity and acceleration of bytes
used should help us with capacity planning. Bytes used in general is just a
neat thing to see, but doesn't tell us all that much. % usage of cluster
capacity is another thing that's just kind of neat to see.
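
As a concrete illustration, a minimal sketch of those two signals computed
from per-OSD and cluster-wide samples (all numbers here are made up; in
practice they would come out of OpenTSDB):

import statistics

# Stddev of bytes used across OSDs at one instant: a rising value means
# data is concentrating on a few disks, i.e. hotspots.
osd_bytes_used = {"osd.0": 510e9, "osd.1": 498e9, "osd.2": 702e9}
hotspot_signal = statistics.stdev(osd_bytes_used.values())

# First and second differences of cluster-wide bytes used (one sample per
# minute): velocity for capacity planning, acceleration for trend changes.
totals = [1.70e12, 1.71e12, 1.73e12, 1.76e12]
velocity = [b - a for a, b in zip(totals, totals[1:])]
acceleration = [b - a for a, b in zip(velocity, velocity[1:])]

print(hotspot_signal, velocity, acceleration)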

What would you suggest looking for in dump_historic_ops? Maybe get regular
metrics on things like total transaction length? The only problem is that
dump_historic_ops may not always contain relevant/recent data. It is not as
easily translated into time series data as some other things.





Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Craig Lewis

  
  
I've been graphing disk latency, OSD latency, and RGW latency. It's a bit
tricky to pull out of ceph --admin-daemon ceph-osd.0.asok perf dump, though.
perf dump gives you the total ops and total op time. You have to track the
delta of those two values, then divide the deltas to get the average latency
over your sample interval.
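
A minimal sketch of that delta/divide, assuming the asok path below and the
osd/op_latency counter names (check both against your own perf dump output):

import json
import subprocess
import time

ASOK = "/var/run/ceph/ceph-osd.0.asok"  # hypothetical path; adjust it

def op_totals():
    # perf dump reports cumulative counters: total ops and total op time.
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", ASOK, "perf", "dump"])
    c = json.loads(out)["osd"]["op_latency"]
    return c["avgcount"], c["sum"]

count0, sum0 = op_totals()
time.sleep(60)  # one sample interval
count1, sum1 = op_totals()

delta_ops = count1 - count0
avg_latency = (sum1 - sum0) / delta_ops if delta_ops else 0.0
print("average op latency this interval: %.4f s" % avg_latency)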
  
I had some alerting on those values, but it was too noisy. The graphs are
helpful, though, especially the graphs that have all of a single node's
disks (one graph) and OSDs (second graph) on it. Viewing both graphs helped
me identify several problems, including a failing disk and a bad write-cache
battery.

I'm not getting much out of the RGW latency graph, though. It's pretty much
just the sum of all the OSD latency graphs during that sample interval.
  
  
  
  
 
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com

Central Desktop. Work together in ways you never thought possible.

  

  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com