This is the prologue of some function, and its second argument is NULL when
it shouldn't be. Unfortunately, the binary does appear to be stripped, so it
will be somewhat hard to figure out which function it is. Your previous email
with the backtrace shows that it is walking the hash tree (probably to
aggregate), so it's possible that some probe is returning data that can't be
parsed or meaningfully interpreted. However, since it is a nested walk, it
might be possible to guess which metric is nested that deeply.

But not easily.
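
To make the failure mode concrete, here is a minimal C sketch of what I
suspect is happening. The names and types (datum_t, walk_fn, fake_walk,
print_cb) are illustrative stand-ins, not Ganglia's actual definitions; the
only point is that a hash_foreach-style walk hands each entry to a callback,
and the callback faults the moment it dereferences a NULL second argument,
i.e. within the first few instructions of the function:

  #include <stddef.h>
  #include <stdio.h>

  /* Illustrative stand-ins only -- not Ganglia's real types or API. */
  typedef struct datum {
      void  *data;   /* pointer to the stored key/value bytes */
      size_t size;   /* number of bytes stored                */
  } datum_t;

  typedef int (*walk_fn)(datum_t *key, datum_t *val, void *arg);

  /* Stand-in for a hash_foreach-style walk over two entries; the
   * second entry has no value, as if a probe returned nothing.    */
  static int fake_walk(walk_fn cb, void *arg)
  {
      datum_t k1 = { "cpu_user",  9 };
      datum_t v1 = { "12.5",      5 };
      datum_t k2 = { "bad_probe", 10 };

      if (cb(&k1, &v1, arg) != 0)
          return 1;
      return cb(&k2, NULL, arg);       /* second argument is NULL */
  }

  static int print_cb(datum_t *key, datum_t *val, void *arg)
  {
      (void)arg;
      /* The first thing the callback does is dereference val, so a
       * NULL val faults right here, near the top of the function.  */
      printf("%s = %s\n", (char *)key->data, (char *)val->data);
      return 0;                        /* 0 = keep walking */
  }

  int main(void)
  {
      return fake_walk(print_cb, NULL);
  }

The real callback obviously does more than print, but the shape of the crash
would be the same.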

This also means running under gdb is probably pointless. Do you have the
ability to run a build with debugging symbols? If so, that will probably get
you to a solution faster than anything I can surmise from digging through
the assembly.
On Sep 15, 2014 6:57 PM, "Sam Barham" <s.bar...@adinstruments.com> wrote:

> I can't read assembly, so this doesn't mean much to me, but hopefully
> it'll mean something to you :)
>
>
>   40540e:       e9 fc fe ff ff          jmpq   40530f <openlog@plt+0x242f>
>   405413:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   405418:       48 89 de                mov    %rbx,%rsi
>   40541b:       e8 b0 fd ff ff          callq  4051d0 <openlog@plt+0x22f0>
>   405420:       48 8b 7b 18             mov    0x18(%rbx),%rdi
>   405424:       48 85 ff                test   %rdi,%rdi
>   405427:       74 0d                   je     405436 <openlog@plt+0x2556>
>   405429:       4c 89 e2                mov    %r12,%rdx
>   40542c:       be 60 54 40 00          mov    $0x405460,%esi
>   405431:       e8 ca d3 ff ff          callq  402800 <hash_foreach@plt>
>   405436:       31 c0                   xor    %eax,%eax
>   405438:       e9 f8 fe ff ff          jmpq   405335 <openlog@plt+0x2455>
>   40543d:       0f 1f 00                nopl   (%rax)
>   405440:       31 c9                   xor    %ecx,%ecx
>   405442:       4c 89 ea                mov    %r13,%rdx
>   405445:       31 f6                   xor    %esi,%esi
>   405447:       4c 89 e7                mov    %r12,%rdi
>   40544a:       4c 89 04 24             mov    %r8,(%rsp)
>   40544e:       e8 3d fe ff ff          callq  405290 <openlog@plt+0x23b0>
>   405453:       4c 8b 04 24             mov    (%rsp),%r8
>   405457:       89 c5                   mov    %eax,%ebp
>   405459:       eb ab                   jmp    405406 <openlog@plt+0x2526>
>   40545b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   405460:       48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
>   405465:       4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
>   40546a:       49 89 fc                mov    %rdi,%r12
>   40546d:       48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
>   405472:       48 83 ec 18             sub    $0x18,%rsp
>   405476:       8b 7a 18                mov    0x18(%rdx),%edi
>   405479:       48 89 d5                mov    %rdx,%rbp
>   40547c:       48 8b 1e                mov    (%rsi),%rbx
>   40547f:       85 ff                   test   %edi,%edi
>   405481:       74 0c                   je     40548f <openlog@plt+0x25af>
>   405483:       48 89 de                mov    %rbx,%rsi
>   405486:       e8 15 fd ff ff          callq  4051a0 <openlog@plt+0x22c0>
>   40548b:       85 c0                   test   %eax,%eax
>   40548d:       74 12                   je     4054a1 <openlog@plt+0x25c1>
>   40548f:       31 c9                   xor    %ecx,%ecx
>   405491:       48 89 ea                mov    %rbp,%rdx
>   405494:       4c 89 e6                mov    %r12,%rsi
>   405497:       48 89 df                mov    %rbx,%rdi
>   40549a:       ff 53 08                callq  *0x8(%rbx)
>   40549d:       85 c0                   test   %eax,%eax
>   40549f:       74 1f                   je     4054c0 <openlog@plt+0x25e0>
>   4054a1:       b8 01 00 00 00          mov    $0x1,%eax
>   4054a6:       48 8b 1c 24             mov    (%rsp),%rbx
>   4054aa:       48 8b 6c 24 08          mov    0x8(%rsp),%rbp
>   4054af:       4c 8b 64 24 10          mov    0x10(%rsp),%r12
>   4054b4:       48 83 c4 18             add    $0x18,%rsp
>   4054b8:       c3                      retq
>   4054b9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
>   4054c0:       48 89 ef                mov    %rbp,%rdi
>   4054c3:       48 89 de                mov    %rbx,%rsi
>   4054c6:       e8 05 fd ff ff          callq  4051d0 <openlog@plt+0x22f0>
>   4054cb:       48 8b 7b 18             mov    0x18(%rbx),%rdi
>
>
> On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell <devon.od...@gmail.com> wrote:
>
>> If you can install the dbg or dbgsym package for this, you can get
>> more information. If you cannot do this, running:
>>
>> objdump -d `which gmetad` | less
>>
>> in less:
>>
>> /40547c
>>
>> Paste a little context of the disassembly before and after that
>> address, then scroll up and paste which function it's in. (That might
>> still be too little information or even bad information if the binary
>> is stripped. But it's something.)
>>
>> --dho
>>
>> 2014-09-14 18:09 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
>> > I've finally managed to generate a core dump (the VM wasn't set up to do it
>> > yet), but it's 214Mb and doesn't seem to contain anything helpful -
>> > especially as I don't have debug symbols. The backtrace shows:
>> > #0  0x000000000040547c in ?? ()
>> > #1  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
>> > #2  0x00000000004054e1 in ?? ()
>> > #3  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
>> > #4  0x00000000004054e1 in ?? ()
>> > #5  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
>> > #6  0x0000000000405436 in ?? ()
>> > #7  0x000000000040530d in ?? ()
>> > #8  0x00000000004058fa in ?? ()
>> > #9  0x00007f6008ef9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>> > #10 0x00007f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> > #11 0x0000000000000000 in ?? ()
>> >
>> > Is there a way for me to get more useful information out of it?
>> >
>> > On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell <devon.od...@gmail.com> wrote:
>> >>
>> >> Are you able to share a core file?
>> >>
>> >> 2014-09-11 14:32 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
>> >> > We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
>> >> > Everything is working correctly (metrics are flowing etc.), except that
>> >> > occasionally the gmetad process will segfault out of the blue. The gmetad
>> >> > process is running on an m3.medium EC2, and is monitoring about 50 servers.
>> >> > The servers are arranged into groups, each one having a bastion EC2 where
>> >> > the metrics are gathered. gmetad is configured to grab the metrics from
>> >> > those bastions - about 10 of them.
>> >> >
>> >> > Some useful facts:
>> >> >
>> >> > - We are running Debian Wheezy on all the EC2s.
>> >> > - Sometimes the crash will happen multiple times in a day, sometimes it'll
>> >> >   be a day or two before it crashes.
>> >> > - The crash creates no logs in normal operation other than a segfault log
>> >> >   something like "gmetad[11291]: segfault at 71 ip 000000000040547c sp
>> >> >   00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad
>> >> >   manually with debug logging, it appears that the crash is related to
>> >> >   gmetad doing a cleanup.
>> >> > - When we realised that the cleanup process might be to blame, we did more
>> >> >   research around that. We realised that our disk IO was way too high and
>> >> >   added rrdcached in order to reduce it. The disk IO is now much lower, and
>> >> >   the crash is occurring less often, but still an average of once a day or
>> >> >   so.
>> >> > - We have two systems (dev and production). Both exhibit this crash, but
>> >> >   the dev system, which is monitoring a much smaller group of servers,
>> >> >   crashes significantly less often.
>> >> > - The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2.
>> >> >   We've upgraded ganglia on the dev system to ganglia 3.6.0-2~bpo70+1/
>> >> >   rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
>> >> > - We have monit running on both systems, configured to restart gmetad if it
>> >> >   dies. It restarts immediately with no issues.
>> >> > - The production system is storing its data on a magnetic disk, the dev
>> >> >   system is using SSD. That doesn't appear to have changed the frequency of
>> >> >   the crash.
>> >> >
>> >> > Has anyone experienced this kind of crash, especially on Amazon hardware?
>> >> > We're at our wits' end trying to find a solution!
>>
>
>
