We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
Everything is working correctly (metrics are flowing, etc.), except that
occasionally the gmetad process will segfault out of the blue. The gmetad
process is running on an m3.medium EC2, and is monitoring about 50 servers.
The servers are arranged into groups, each one having a bastion EC2 where
the metrics are gathered. gmetad is configured to grab the metrics from
those bastions - about 10 of them.
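
For reference, a minimal sketch of what a gmetad.conf collecting from
bastions like ours might look like (cluster names and hostnames below are
made up for illustration):

```
# /etc/ganglia/gmetad.conf -- illustrative fragment only
data_source "group-a" 60 bastion-a.internal:8649
data_source "group-b" 60 bastion-b.internal:8649
# ...one data_source line per bastion group, ~10 in total
```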

Some useful facts:

   - We are running Debian Wheezy on all the EC2s
   - Sometimes the crash will happen multiple times in a day, sometimes
   it'll be a day or two before it crashes
   - In normal operation the crash produces no logs other than a kernel
   segfault line such as "gmetad[11291]: segfault at 71 ip 000000000040547c sp
   00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad manually
   with debug logging, the crash appears to happen while gmetad is doing a
   cleanup.
   - When we realised that the cleanup process might be to blame, we dug
   into that further and found that our disk IO was far too high, so we
   added rrdcached to reduce it. Disk IO is now much lower, and the crash
   occurs less often, but still about once a day on average.
   - We have two systems (dev and production). Both exhibit this crash, but
   the dev system, which is monitoring a much smaller group of servers,
   crashes significantly less often.
   - The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2.
   We've upgraded ganglia in the dev systems to ganglia
   3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the
   crash.
   - We have monit running on both systems configured to restart gmetad if
   it dies. It restarts immediately with no issues.
   - The production system is storing its data on a magnetic disk; the dev
   system is using an SSD. That doesn't appear to have changed the frequency
   of the crash.
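
In case it helps anyone compare setups: we are running rrdcached as a
separate daemon and pointing gmetad at it via the RRDCACHED_ADDRESS
environment variable, which librrd honours. The paths and timings below
are illustrative, not our exact values:

```shell
# Start rrdcached with a 5-minute write delay (tune -w/-z to taste).
rrdcached -p /var/run/rrdcached.pid \
          -l unix:/var/run/rrdcached.sock \
          -j /var/lib/rrdcached/journal \
          -b /var/lib/ganglia/rrds -B \
          -w 300 -z 150

# gmetad (via librrd) picks up the daemon from this variable:
export RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock
gmetad
```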
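
Since monit restarts gmetad immediately, we get no backtrace from the
crash. A rough sketch of how we plan to capture one (run as root; the core
file name and gmetad paths are examples, not our actual values):

```shell
# Allow core dumps and give them a predictable location:
ulimit -c unlimited
echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# Run gmetad in the foreground with debug output, so the last action
# before the crash is captured:
gmetad -d 2 2>&1 | tee /var/tmp/gmetad-debug.log

# After a crash, pull a backtrace from the core file (with debug
# symbols installed if available; the pid suffix is an example):
gdb /usr/sbin/gmetad /var/tmp/core.gmetad.12345 -batch -ex 'bt full'
```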

Has anyone experienced this kind of crash, especially on Amazon hardware?
We're at our wits' end trying to find a solution!
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
