Re: [Ganglia-general] Error 1 sending the modular data

2012-08-15 Thread Chris Burroughs
Unfortunately there appear to be several problems, but they don't all
correlate with Error 1 sending the modular data.  There were reporting
problems with approximately 75 gmond instances around midnight local
time last night (which is still suspicious). Some examples:

One case, box #82:
 - Sending to 3 udp unicast channels, which are polled by 2 gmetads.
 - Nothing in /var/log/messages near when it stopped reporting.
 - Both gmetads agree it was down (they poll the same aggregator).
 - Host did not appear in 2/3 gmond aggregators.
 - Restart of gmond on #82 did not fix problem.
 - Restart of the relevant aggregator did not fix the problem.
 - Restart of aggregator with debug  0 did fix this problem.  Obviously
I missed something.
 - The metadata send interval is 120 seconds.  Does that affect when HOST
NAME= is updated?
 - The aggregator (for old memory leak fighting) has a cron to restart
at midnight.
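
The "host did not appear in 2/3 aggregators" check above can be sketched as a small script run against each aggregator's XML dump. This is only a sketch: the sample XML, host names, and the helper names stale_hosts and missing_hosts are hypothetical, and it assumes the usual gmond XML layout in which each HOST element carries NAME and TN (seconds since the host was last heard from) attributes. The 600-second threshold mirrors the timeout mentioned later in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the XML a gmond aggregator serves; the host names
# and values are made up for illustration only.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.4.0" SOURCE="gmond">
<CLUSTER NAME="Cluster 1" LOCALTIME="1344988800" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
<HOST NAME="host-a" IP="10.0.0.1" REPORTED="1344988790" TN="10" TMAX="20" DMAX="0"/>
<HOST NAME="host-b" IP="10.0.0.2" REPORTED="1344988000" TN="800" TMAX="20" DMAX="0"/>
</CLUSTER>
</GANGLIA_XML>"""


def stale_hosts(xml_text, max_age=600):
    """Return (name, TN) pairs for hosts not heard from within max_age seconds."""
    root = ET.fromstring(xml_text)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        if tn > max_age:
            stale.append((host.get("NAME"), tn))
    return stale


def missing_hosts(xml_text, expected):
    """Return expected host names that have no HOST element at all."""
    root = ET.fromstring(xml_text)
    present = {host.get("NAME") for host in root.iter("HOST")}
    return sorted(set(expected) - present)
```

Running both functions against each of the three aggregators would separate "host is listed but stale" from "host is not listed at all", which may point at different causes.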


In another case (for the other 75) a gmetad instance stopped updating
after getting an error like:
Aug 14 23:59:01 host /usr/sbin/gmetad[5999]: Process XML (Cluster 1):
XML_ParseBuffer() error at line 22388:#012no element found#012

Restarting that gmetad caused it to start polling again.
I found
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=189  but
that seems to refer to an un-escaped name, as opposed to a transient but
not recovered error.
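
For reference, Expat reports "no element found" when the input ends before the root element is closed, so one possible (unconfirmed) reading of the log line above is that the XML stream from the aggregator was cut short. A minimal sketch for testing that by hand follows. The function names fetch_xml and check_xml are hypothetical, and port 8649 is only assumed because it is the usual gmond tcp_accept_channel default; adjust it to match gmond.conf.

```python
import socket
import xml.parsers.expat


def fetch_xml(host, port=8649, timeout=10.0):
    """Read the full XML dump from a gmond TCP accept channel until EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def check_xml(payload):
    """Return None if payload is well-formed XML, else the Expat error text."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(payload, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None
```

Calling check_xml(fetch_xml("aggregator-host")) in a loop around midnight (the host name is a placeholder) would show whether the aggregator ever serves a truncated document. It would not explain why gmetad fails to recover afterwards.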




On 08/13/2012 01:29 PM, Chris Burroughs wrote:
 So for background, my original problem is that load_one will not be
 updated by gmetad for a period of over 600 seconds (an arbitrary timeout
 signifying that gmond/the host is probably down).  It occurs a few
 times/day across hundreds of hosts, and often occurs near midnight
 localtime. This *appears* to correlate with messages along the lines of
 the following (I didn't see anything else suspicious in syslog):
 
 Aug 12 23:53:26 adq82 /usr/sbin/gmond[28637]: Error 1 sending the
 modular data for entropy_avail#012
 Aug 12 23:59:00 adq82 /usr/sbin/gmond[28637]: Error 1 sending the
 modular data for mem_cached#012
 Aug 12 23:59:10 adq82 /usr/sbin/gmond[28637]: Error 1 sending the
 modular data for diskstat_sda_write_bytes_per_sec#012
 
 
 Since it occurs infrequently, running in debug mode on every server is
 not a good option.  But false positives that keep people from sleeping
 are bad. First of all, does a correlation between these messages and all
 metrics not reporting for a period of time make sense?  If not what
 should I be looking at?
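
One way to test the suspected correlation is to bucket the syslog messages by host and hour and compare the result with the gaps seen by gmetad. The following is a rough sketch, not a Ganglia tool: the names parse_errors and errors_per_hour are hypothetical, the year is assumed because syslog timestamps omit it, and the pattern expects each message on a single line (the excerpts above are wrapped by the mail client).

```python
import re
from collections import Counter
from datetime import datetime

# Matches the single-line syslog form of the message quoted above.
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) \S*gmond\[\d+\]: "
    r"Error 1 sending the modular data for (?P<metric>[^\s#]+)"
)


def parse_errors(lines, year=2012):
    """Yield (timestamp, host, metric) for each matching syslog line."""
    for line in lines:
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            yield ts, m.group("host"), m.group("metric")


def errors_per_hour(lines, year=2012):
    """Count errors per (host, hour of day) to expose any midnight clustering."""
    counts = Counter()
    for ts, host, _metric in parse_errors(lines, year):
        counts[(host, ts.hour)] += 1
    return counts
```

If the counts cluster in hour 23 or 0 on the same hosts that gmetad marks as down, the correlation is at least plausible; if the errors are spread evenly, they are more likely a red herring.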
 
 Second, if this is anything other than a red herring, I'm totally
 confused how to debug it. Even if debug was enabled the debug message
 [1] doesn't seem to include any additional information.  Also, error 1
 seems like it could be two different errors [2] [3].
 
 System information:
  - gmond 3.4.0
  - centos6
  - using send channels
 
 [1]
 https://github.com/ganglia/monitor-core/blob/release/3.4/gmond/gmond.c#L2735
 [2]
 https://github.com/ganglia/monitor-core/blob/release/3.4/lib/libgmond.c#L575
 [3]
 https://github.com/ganglia/monitor-core/blob/release/3.4/lib/libgmond.c#L517
 


___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] Error 1 sending the modular data

2011-05-03 Thread Bernard Li
Hello:

On Tue, May 3, 2011 at 7:39 AM, Iban Cabrillo cabri...@ifca.unican.es wrote:

    Does anybody know how I can delete/fix thousands of errors in syslog
 like:

     /usr/sbin/gmond[4556]: Error 1 sending the modular data for  

   I see a hundred of them per second.

Can you please provide additional info regarding your setup?

- OS
- Multicast or Unicast
- Are you getting the error message for *all gmonds* or just some?
- Are you getting the error for all metrics or just some?  And if so,
which ones?

Can you please post your full gmond.conf somewhere like
http://www.pastebin.com and reference it here?

Thanks,

Bernard
