I've finally managed to generate a core dump (the VM wasn't set up to do it
yet), but it's 214 MB and doesn't seem to contain anything helpful,
especially as I don't have debug symbols. The backtrace shows:
#0  0x000000000040547c in ?? ()
#1  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#2  0x00000000004054e1 in ?? ()
#3  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#4  0x00000000004054e1 in ?? ()
#5  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#6  0x0000000000405436 in ?? ()
#7  0x000000000040530d in ?? ()
#8  0x00000000004058fa in ?? ()
#9  0x00007f6008ef9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()
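
For what it's worth, this is roughly how I enabled core dumps in the first
place. The init-script path and core location below are from memory, so
treat it as a sketch of the idea rather than our exact setup:

  # added near the top of the gmetad init script (/etc/init.d/gmetad on our
  # boxes, if I remember the path right) so the daemon may write a core file
  ulimit -c unlimited

  # tell the kernel where to put cores; %e = executable name, %p = pid
  sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p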

Is there a way for me to get more useful information out of it?
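
In case it clarifies what I'm after, this is what I was planning to try next
to put names on those ?? frames. I haven't verified the Debian source
package name or the binary path, so this is only a sketch of the approach,
not something I've run yet:

  # rebuild the package with debug info left in (I couldn't find a
  # ready-made -dbg package for gmetad, and the 'ganglia' source package
  # name is my assumption)
  apt-get source gmetad
  apt-get build-dep ganglia
  cd ganglia-*
  DEB_BUILD_OPTIONS="nostrip noopt" dpkg-buildpackage -us -uc

  # then open the core against the unstripped binary
  gdb /usr/sbin/gmetad /path/to/core
  (gdb) thread apply all bt full  # backtraces for every thread
  (gdb) frame 0
  (gdb) x/i $pc                   # faulting instruction; ip 0x40547c here
                                  # matches the kernel's segfault line
  (gdb) info registers
  (gdb) info symbol 0x40547c      # should name the function once symbols exist

My understanding is that "nostrip" keeps the debug info and "noopt" builds
with -O0 so the frames line up with the source, but if there's a quicker way
to get something useful out of the existing core I'd love to hear it.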

On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell <devon.od...@gmail.com>
wrote:

> Are you able to share a core file?
>
> 2014-09-11 14:32 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
> > We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
> > Everything is working correctly (metrics are flowing, etc.), except that
> > occasionally the gmetad process will segfault out of the blue. The gmetad
> > process is running on an m3.medium EC2 and is monitoring about 50 servers.
> > The servers are arranged into groups, each one having a bastion EC2 where
> > the metrics are gathered. gmetad is configured to grab the metrics from
> > those bastions - about 10 of them.
> >
> > Some useful facts:
> >
> > - We are running Debian Wheezy on all the EC2s.
> > - Sometimes the crash will happen multiple times in a day; sometimes it'll
> >   be a day or two before it crashes.
> > - The crash creates no logs in normal operation other than a segfault log
> >   something like "gmetad[11291]: segfault at 71 ip 000000000040547c sp
> >   00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad
> >   manually with debug logging, it appears that the crash is related to
> >   gmetad doing a cleanup.
> > - When we realised that the cleanup process might be to blame, we did more
> >   research around that. We realised that our disk IO was way too high and
> >   added rrdcached in order to reduce it. The disk IO is now much lower, and
> >   the crash is occurring less often, but still an average of once a day or
> >   so.
> > - We have two systems (dev and production). Both exhibit this crash, but
> >   the dev system, which is monitoring a much smaller group of servers,
> >   crashes significantly less often.
> > - The production system is running ganglia 3.3.8-1+nmu1 / rrdtool 1.4.7-2.
> >   We've upgraded ganglia on the dev system to 3.6.0-2~bpo70+1 / rrdtool
> >   1.4.7-2. That doesn't seem to have helped with the crash.
> > - We have monit running on both systems, configured to restart gmetad if it
> >   dies. It restarts immediately with no issues.
> > - The production system is storing its data on a magnetic disk; the dev
> >   system is using SSD. That doesn't appear to have changed the frequency of
> >   the crash.
> >
> > Has anyone experienced this kind of crash, especially on Amazon hardware?
> > We're at our wits' end trying to find a solution!
