Re: [Ganglia-general] Random blank timeslots in graphs

2014-05-19 Thread Vladimir Vuksan

The "Error 1 sending" messages are a red herring.

If you are seeing gaps, it's most likely that the storage system is not
keeping up. What version of Ganglia are you using, and are you using
rrdcached?

Vladimir

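To put a rough number on "not keeping up", here is a minimal back-of-the-envelope sketch. It assumes each metric of each host is stored in its own RRD file that is updated once per polling step. The host count comes from the original post; the metric count per host and the 15-second step are assumptions for illustration, not measurements from this installation.

```python
# Rough estimate of how many RRD file updates per second gmetad has to write.
# All inputs are illustrative assumptions, not measured values.

def rrd_updates_per_second(hosts: int, metrics_per_host: int, step_seconds: int) -> float:
    """One RRD file per host metric, each updated once per polling step."""
    return hosts * metrics_per_host / step_seconds


if __name__ == "__main__":
    hosts = 500            # from the original post: about 500 VMs per headnode
    metrics_per_host = 30  # assumption: roughly the default gmond metric set
    step_seconds = 15      # assumption: gmetad polling interval left at its default
    rate = rrd_updates_per_second(hosts, metrics_per_host, step_seconds)
    print(f"~{rate:.0f} RRD file updates per second")
```

Under these assumptions the headnode has to sustain on the order of a thousand small writes per second, spread over thousands of files, which is the kind of load a single spinning disk may struggle with.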
  
On 05/19/2014 10:20 AM, Cristovao Jose Domingues Cordeiro wrote:

Hi,

this is happening in two completely different (but with the same
deployment method) Ganglia headnodes.

I'm monitoring about 500 VMs (on each headnode), separated by
clusters of different sizes. From time to time, the summary
graphs over some cluster stop reporting, showing zero activity,
and then suddenly after a while they come back up again.

This is very undesirable since I end up with several white
"holes" per day on each cluster.

The information I can give you so far is the following:

  *   The attached image shows what happens
  *   I have a master-slave type of configuration, where the
collector gmonds are sitting in the same machine (the
headnode) as gmetad and ganglia-web, and where all the gmond
nodes are reporting their metrics through unicast to the
headnode.
  *   I have the latest Ganglia versions running (both core and
web)
  *   All VMs are based on SL6
  *   When I look at /var/log/messages I see a lot of this:

May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for pkts_out#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_user#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_system#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_idle#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_nice#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_aidle#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_wio#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_steal#012
May 19 16:14:37 gangliamon gmond[22304]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_user#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_system#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_idle#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_nice#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_aidle#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_wio#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_steal#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_free#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_shared#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_buffers#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_cached#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for swap_free#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for bytes_out#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for bytes_in#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for pkts_in#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for pkts_out#012
May 19 16:14:40 gangliamon gmond[10560]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:42 gangliamon gmond[22304]: Error 1 sending the modular data for disk_free#012
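
One way to pin down when the blank timeslots occur is to look for runs of NaN samples in the RRD files themselves. The following is a minimal sketch, assuming the text output of `rrdtool fetch <file> AVERAGE --start -1d` has been captured into a string; the sample data below is made up for illustration, and the function names are hypothetical helpers rather than part of Ganglia or rrdtool.

```python
import math

# Made-up sample in the layout that `rrdtool fetch` prints:
# a header line with the data source names, a blank line, then "timestamp: values".
SAMPLE = """\
                            sum

1400508000: 1.2000000000e+01
1400508015: -nan
1400508030: nan
1400508045: 1.3000000000e+01
1400508060: nan
"""


def parse_fetch(text):
    """Parse `rrdtool fetch` text output into a list of (timestamp, [values])."""
    rows = []
    for line in text.splitlines():
        ts, sep, rest = line.partition(":")
        ts = ts.strip()
        if not sep or not ts.isdigit():
            continue  # header line with data source names, or blank line
        rows.append((int(ts), [float(v) for v in rest.split()]))
    return rows


def find_gaps(rows, column=0):
    """Return (first_ts, last_ts) for every run of consecutive NaN samples."""
    gaps, start, last = [], None, None
    for ts, values in rows:
        if math.isnan(values[column]):
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            gaps.append((start, last))
            start = None
    if start is not None:
        gaps.append((start, last))  # gap still open at the end of the data
    return gaps


if __name__ == "__main__":
    for first, last in find_gaps(parse_fetch(SAMPLE)):
        print(f"gap from {first} to {last}")
```

Comparing the gap timestamps against disk I/O statistics for the same period may help confirm or rule out the storage explanation.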

Re: [Ganglia-general] Random blank timeslots in graphs

2014-05-19 Thread Cristovao Jose Domingues Cordeiro
Hi,

I am using Ganglia Web Frontend version 3.5.12 and Ganglia Web Backend (gmetad) 
version 3.6.0. The gmond version on the nodes is not consistent, since they are 
set up by different users in different environments, but I believe none of them 
is older than 3.1.7.

No, I am not using rrdcached. All of my Ganglia configurations are the default 
ones. I'll try to set that up.

Since you believe it is a scaling problem, should I try to store the DB on a 
ramdisk?

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro


On 05/19/2014 10:20 AM, Cristovao Jose Domingues Cordeiro wrote:
Hi,

this is happening in two completely different (but with the same deployment 
method) Ganglia headnodes.

I'm monitoring about 500 VMs (on each headnode), separated by clusters of 
different sizes. From time to time, the summary graphs over some cluster stop 
reporting, showing zero activity, and then suddenly after a while they come 
back up again.

This is very undesirable since I end up with several white holes per day on 
each cluster.

The information I can give you so far is the following:


  *   The attached image shows what happens
  *   I have a master-slave type of configuration, where the collector gmonds 
are sitting in the same machine (the headnode) as gmetad and ganglia-web, and 
where all the gmond nodes are reporting their metrics through unicast to the 
headnode.
  *   I have the latest Ganglia versions running (both core and web)
  *   All VMs are based on SL6
  *   When I look at /var/log/messages I see a lot of this:

May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for pkts_out#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_user#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_system#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_idle#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_nice#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_aidle#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_wio#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_steal#012
May 19 16:14:37 gangliamon gmond[22304]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_user#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_system#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_idle#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_nice#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_aidle#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_wio#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_steal#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_free#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_shared#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_buffers#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_cached#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for swap_free#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for bytes_out#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for bytes_in#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for pkts_in#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for pkts_out#012
May 19 16:14:40 gangliamon gmond[10560]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:42 gangliamon gmond[22304]: Error 1 sending the modular data for disk_free#012



Which I understand is a known unsolved issue, judging by other discussions like 
https://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg06602.html.


Does anyone know how to solve this?

Thanks

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro




Re: [Ganglia-general] Random blank timeslots in graphs

2014-05-19 Thread Vladimir Vuksan

I would definitely consider rrdcached backed by some SSDs. That is what I use.

3.7.0, which is in testing, has some additional performance enhancements, but
I think your issue really is I/O.

Vladimir
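
Once rrdcached is in place, its `STATS` command gives a quick view of whether writes are still backing up. The following is a minimal sketch, assuming rrdcached listens on a Unix socket and answers `STATS` with a status line whose leading number is the count of `Key: value` lines that follow. The helper names and any socket path passed in are hypothetical; use whatever path your rrdcached was started with.

```python
import socket


def parse_stats(response: str) -> dict:
    """Parse an rrdcached STATS reply into a dict of integer counters."""
    lines = response.splitlines()
    count = int(lines[0].split(" ", 1)[0])  # number of lines that follow
    if count < 0:
        raise RuntimeError(f"rrdcached returned an error: {lines[0]}")
    stats = {}
    for line in lines[1:1 + count]:
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value)
    return stats


def query_stats(socket_path: str) -> dict:
    """Send STATS to rrdcached over a Unix socket and return the parsed reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(b"STATS\n")
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
            lines = buf.split(b"\n")
            if len(lines) > 1:
                expected = int(lines[0].split()[0])
                # Stop once the status line plus all announced lines have arrived.
                if expected < 0 or len(lines) >= expected + 2:
                    break
    return parse_stats(buf.decode())
```

Calling `query_stats()` with the daemon's socket path every few seconds and watching `QueueLength` is one way to use this: a value that keeps growing would suggest the disks still cannot keep up, while a value that stays near zero would point elsewhere.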
  