Re: [Ganglia-general] gmetad segmentation fault
Unfortunately, without a coredump or backtrace where debug symbols are present, I'm not going to be able to offer any additional insight. Are you running any C and / or Python modules with gmetad? --dho 2015-12-11 5:54 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Hi guys, > > just to update on this: > - I've removed my ganglia-gmetad/gmond and libganglia from everywhere and > installed the most recent versions from the epel repository. The error is > still there. > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > From: Cristovao Cordeiro > Sent: 08 December 2015 11:49 > To: Marcello Morgotti > Cc: ganglia-general@lists.sourceforge.net > > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi everyone, > > sorry for the late reply. > @Devon > thanks for looking into it. > I do have .so.0 and .so.0.0.0 in my system and I am not using any custom > modules. The Ganglia deployment is however a bit different from the > standard: > - in one single VM, gmetad is running (always) and several gmond daemons > are running in the background (daemon gmond -c /etc/ganglia/gmond_N.conf), > all receiving metrics through unicast. > The Ganglia package is built by me as well, from the source code. I am > currently building and using Ganglia 3.7.1 (taken from > http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.7.1/). > I build the Ganglia RPM myself for 2 reasons: > 1 - have Ganglia available in YUM > 2 - minor changes to ganglia-web's apache.conf > > I have other monitors running 3.6.0 and no errors there. But on those I have > installed Ganglia manually and directly without building an RPM. > > I also see 3.7.2 already available in the epel repository so I might try > this. > > Regarding the compilation with debug symbols… > > @Marcello > did you get a chance to do it?
> > > Best regards, > Cristóvão José Domingues Cordeiro > > > > > On 24 Nov 2015, at 18:51, Marcello Morgotti <m.morgo...@cineca.it> wrote: > > Hello, > > I'd like to join the discussion because this problem is affecting us as > well. We have the problem on two different installations: > > 2 servers in active-active HA configuration, each with CentOS 7.1 + > ganglia 3.7.2 + rrdcached monitoring systems A,B,C,D > 2 servers in active-active HA configuration, each with RedHat 6.5 + > ganglia 3.7.2 + rrdcached monitoring systems E,F,G,H > > In both cases the ganglia rpm packages are taken from EPEL repository. > The curious thing is that every time that the segfault happens it happens > almost at the same time. > I.e. for Centos7 systems: > > Nov 15 12:27:35 rp02 kernel: traps: gmetad[2620] general protection > ip:7fd70d62f82c sp:7fd6fdcb3af0 error:0 in > libganglia.so.0.0.0[7fd70d624000+14000] > Nov 15 12:27:35 rp02 systemd: gmetad.service: main process exited, > code=killed, status=11/SEGV > Nov 15 12:27:35 rp02 systemd: Unit gmetad.service entered failed state. > Nov 15 12:27:41 rp01 kernel: traps: gmetad[6977] general protection > ip:7fc1bdde582c sp:7fc1ae469af0 error:0 in > libganglia.so.0.0.0[7fc1bddda000+14000] > Nov 15 12:27:41 rp01 systemd: gmetad.service: main process exited, > code=killed, status=11/SEGV > Nov 15 12:27:41 rp01 systemd: Unit gmetad.service entered failed state. > > > Hope this helps and adds information, I will try to build a debug > version of gmetad to see if it's possible to generate a core dump. > > Best Regards, > Marcello > > On 23/11/2015 17:30, Devon H. O'Dell wrote: > > It's just a system versioning thing for shared libraries. Usually .so > is a soft link to .so.0 which is a soft link to .so.0.0.0. This is > intended to be an ABI versioning interface, but it's not super > frequently used. Are these legitimately different files on your > system? > > The crash is in hash_delete: > > 003b2c00b780 <hash_delete>: > ... 
> 3b2c00b797: 48 8b 07              mov    (%rdi),%rax
> 3b2c00b79a: 48 8d 34 30           lea    (%rax,%rsi,1),%rsi
> 3b2c00b79e: 48 39 f0              cmp    %rsi,%rax
> 3b2c00b7a1: 73 37                 jae    3b2c00b7da <hash_delete+0x5a>
> 3b2c00b7a3: 48 bf b3 01 00 00 00  movabs $0x10001b3,%rdi
> 3b2c00b7aa: 01 00 00
> 3b2c00b7ad: 0f 1f 00              nopl   (%rax)
> >>> 3b2c00b7b0: 0f b6 08          movzbl (%rax),%ecx
> 3b2c00b7b3: 48 83 c0 01           add    $0x1,%rax
> 3b2c00b7b7: 48 31 ca              xor    %rcx,%rdx
> 3b2c00b7ba: 48 0f af d7           imul   %rdi,%rdx
> 3b2c00b7be: 48 39 c6              cmp    %rax,%rsi
> 3b2c00b7c1: 77 ed
Re: [Ganglia-general] gmetad segmentation fault
It's just a system versioning thing for shared libraries. Usually .so is a soft link to .so.0 which is a soft link to .so.0.0.0. This is intended to be an ABI versioning interface, but it's not super frequently used. Are these legitimately different files on your system? The crash is in hash_delete:

003b2c00b780 <hash_delete>:
...
3b2c00b797: 48 8b 07              mov    (%rdi),%rax
3b2c00b79a: 48 8d 34 30           lea    (%rax,%rsi,1),%rsi
3b2c00b79e: 48 39 f0              cmp    %rsi,%rax
3b2c00b7a1: 73 37                 jae    3b2c00b7da <hash_delete+0x5a>
3b2c00b7a3: 48 bf b3 01 00 00 00  movabs $0x10001b3,%rdi
3b2c00b7aa: 01 00 00
3b2c00b7ad: 0f 1f 00              nopl   (%rax)
>>> 3b2c00b7b0: 0f b6 08          movzbl (%rax),%ecx
3b2c00b7b3: 48 83 c0 01           add    $0x1,%rax
3b2c00b7b7: 48 31 ca              xor    %rcx,%rdx
3b2c00b7ba: 48 0f af d7           imul   %rdi,%rdx
3b2c00b7be: 48 39 c6              cmp    %rax,%rsi
3b2c00b7c1: 77 ed                 ja     3b2c00b7b0 <hash_delete+0x30>
...

%rdi is the first argument to the function, so %rax is the datum_t *key, and (%rax) is key->data. hash_key has been inlined here. Unfortunately, what appears to be happening is that some key has already been removed from the hash table and freed, and based on your description of the problem, that was attempted concurrently. Your kernel crash shows that we were trying to dereference a NULL pointer, so it would appear that key->data is NULL. Unfortunately, it is not clear without a backtrace what sort of key specifically is in question here, but perhaps someone else might have some context based on recent changes. (I don't think this is related to my work on the hashes). Are you running any custom modules (either in C or Python)? Would it be possible for you to build gmond and libganglia with debugging symbols and generate a core dump? --dho 2015-11-23 1:29 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Hi Devon, > > thanks for the help. > Attached follows the binary file. > > btw, what is the difference between so.0 and so.0.0.0? 
> > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > On 17 Nov 2015, at 19:16, Devon H. O'Dell <devon.od...@gmail.com> wrote: > > Hi! Very sorry about this, I had a draft that I thought I had sent. > > Could you email me your libganglia.so binary off-list? Alternatively, > do you have the ability to compile libganglia with debugging symbols? > > 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > > Hi everyone, > > any news on this? > Another symptom is that this happens quite as often as the cluster changes, > meaning that the more activity there is in the cluster (delete machines, > create...) the more this issue happens. Could it be related with the > deletion of old hosts by gmond causing gmetad to try to access files that > are already gone? > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > ____ > From: Cristovao Cordeiro [cristovao.corde...@cern.ch] > Sent: 09 November 2015 13:40 > To: Devon H. O'Dell > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi Devon, > > thanks! > > * I don't think there was a core dump. At least that is not stated in > /var/log/messages and I don't find anything relevant in /var/spool/abrt/ > * I am running 3.7.1 > * The addr2line returns ??:0. Also with gdb: > > gdb /usr/lib64/libganglia.so.0.0.0 > > ... > Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging > symbols found)...done. > > Some more information about my setup: > - I am running several gmonds in the same machine, so all my data_sources > are to localhost. > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > > From: Devon H. O'Dell [devon.od...@gmail.com] > Sent: 09 November 2015 13:12 > To: Cristovao Cordeiro > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi! 
> > I have a couple of initial questions that might help figure out the problem: > > * Did you get a core dump? > * What version of ganglia are you running? > * This crash happened within libganglia.so at offset 0xb7b0. Can you run: > > $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0 > > and paste the output? If that does not work, there are a couple other > things we can try to get information about the fault, but hopefully we > can just work from there. > > Kind regards, > > Devon H. O'Dell > > 2
Re: [Ganglia-general] gmetad segmentation fault
Hi! Very sorry about this, I had a draft that I thought I had sent. Could you email me your libganglia.so binary off-list? Alternatively, do you have the ability to compile libganglia with debugging symbols? 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Hi everyone, > > any news on this? > Another symptom is that this happens quite as often as the cluster changes, > meaning that the more activity there is in the cluster (delete machines, > create...) the more this issue happens. Could it be related with the deletion > of old hosts by gmond causing gmetad to try to access files that are already > gone? > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > > From: Cristovao Cordeiro [cristovao.corde...@cern.ch] > Sent: 09 November 2015 13:40 > To: Devon H. O'Dell > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi Devon, > > thanks! > > * I don't think there was a core dump. At least that is not stated in > /var/log/messages and I don't find anything relevant in /var/spool/abrt/ > * I am running 3.7.1 > * The addr2line returns ??:0. Also with gdb: >> gdb /usr/lib64/libganglia.so.0.0.0 >... >Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging > symbols found)...done. > > Some more information about my setup: > - I am running several gmonds in the same machine, so all my data_sources > are to localhost. > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > > From: Devon H. O'Dell [devon.od...@gmail.com] > Sent: 09 November 2015 13:12 > To: Cristovao Cordeiro > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi! > > I have a couple of initial questions that might help figure out the problem: > > * Did you get a core dump? > * What version of ganglia are you running? > * This crash happened within libganglia.so at offset 0xb7b0. 
Can you run: > > $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0 > > and paste the output? If that does not work, there are a couple other > things we can try to get information about the fault, but hopefully we > can just work from there. > > Kind regards, > > Devon H. O'Dell > > 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: >> Dear all, >> >> I have several Ganglia monitors running with similar configurations in >> different machines (VMs) and for a long time now I have been experiencing >> segmentation faults at random times. It seems to happen more on gmetads that >> are monitoring a larger number of nodes. >> >> In /var/log/messages I see: >> >> kernel: gmetad[3948]: segfault at 0 ip 003630c0b7b0 sp 7f0ecbffebc0 >> error 4 in libganglia.so.0.0.0[3630c0+15000] >> >> >> and in the console output there's only this: >> >> /bin/bash: line 1: 30375 Terminated /usr/sbin/gmetad >> >>[FAILED] >> >> >> gmetad does not have any special configuration besides the RRD location >> which is on a 4Gb ramdisk. >> >> >> Cumprimentos / Best regards, >> Cristóvão José Domingues Cordeiro >> >> ___ >> Ganglia-general mailing list >> Ganglia-general@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] gmetad segmentation fault
Hi! I have a couple of initial questions that might help figure out the problem:

* Did you get a core dump?
* What version of ganglia are you running?
* This crash happened within libganglia.so at offset 0xb7b0. Can you run:

$ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0

and paste the output? If that does not work, there are a couple other things we can try to get information about the fault, but hopefully we can just work from there. Kind regards, Devon H. O'Dell 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Dear all, > > I have several Ganglia monitors running with similar configurations in > different machines (VMs) and for a long time now I have been experiencing > segmentation faults at random times. It seems to happen more on gmetads that > are monitoring a larger number of nodes. > > In /var/log/messages I see: > > kernel: gmetad[3948]: segfault at 0 ip 003630c0b7b0 sp 7f0ecbffebc0 > error 4 in libganglia.so.0.0.0[3630c0+15000] > > > and in the console output there's only this: > > /bin/bash: line 1: 30375 Terminated /usr/sbin/gmetad > >[FAILED] > > > gmetad does not have any special configuration besides the RRD location > which is on a 4Gb ramdisk. > > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
If you can install the dbg or dbgsym package for this, you can get more information. If you cannot do this, running: objdump -d `which gmond` | less in less: /40547c Paste a little context of the disassembly before and after that address, then scroll up and paste which function it's in. (That might still be too little information or even bad information if the binary is stripped. But it's something.) --dho 2014-09-14 18:09 GMT-07:00 Sam Barham s.bar...@adinstruments.com: I've finally managed to generate a core dump (the VM wasn't set up to do it yet), but it's 214Mb and doesn't seem to contain anything helpful - especially as I don't have debug symbols. The backtrace shows:

#0  0x0040547c in ?? ()
#1  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#2  0x004054e1 in ?? ()
#3  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#4  0x004054e1 in ?? ()
#5  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#6  0x00405436 in ?? ()
#7  0x0040530d in ?? ()
#8  0x004058fa in ?? ()
#9  0x7f6008ef9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x7f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x in ?? ()

Is there a way for me to get more useful information out of it? On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell devon.od...@gmail.com wrote: Are you able to share a core file? 2014-09-11 14:32 GMT-07:00 Sam Barham s.bar...@adinstruments.com: We are using Ganglia to monitor our cloud infrastructure on Amazon AWS. Everything is working correctly (metrics are flowing etc), except that occasionally the gmetad process will segfault out of the blue. The gmetad process is running on an m3.medium EC2, and is monitoring about 50 servers. The servers are arranged into groups, each one having a bastion EC2 where the metrics are gathered. gmetad is configured to grab the metrics from those bastions - about 10 of them. 
Some useful facts:
- We are running Debian Wheezy on all the EC2s
- Sometimes the crash will happen multiple times in a day, sometimes it'll be a day or two before it crashes
- The crash creates no logs in normal operation other than a segfault log something like gmetad[11291]: segfault at 71 ip 0040547c sp 7ff2d6572260 error 4 in gmetad[40+e000]. If we run gmetad manually with debug logging, it appears that the crash is related to gmetad doing a cleanup. When we realised that the cleanup process might be to blame we did more research around that. We realised that our disk IO was way too high and added rrdcached in order to reduce it. The disk IO is now much lower, and the crash is occurring less often, but still an average of once a day or so.
- We have two systems (dev and production). Both exhibit this crash, but the dev system, which is monitoring a much smaller group of servers, crashes significantly less often. The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2. We've upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
- We have monit running on both systems configured to restart gmetad if it dies. It restarts immediately with no issues.
- The production system is storing its data on a magnetic disk, the dev system is using ssd. That doesn't appear to have changed the frequency of the crash.

Has anyone experienced this kind of crash, especially on Amazon hardware? We're at our wits' end trying to find a solution!
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
This is the prologue of some function and the second argument is NULL when it shouldn't be. Unfortunately, the binary does appear to be stripped, so it will be slightly hard to figure out which function it is. Your previous email with the backtrace shows that it is walking the hash tree (probably to aggregate), so it's possible that some probe is returning data that can't be parsed or meaningfully interpreted. However, since it is a nested walk, it might be possible to guess which metric is that deeply nested. But not easily. This also means running under gdb is probably pointless. Do you have the ability to run a version with debugging symbols? If so, that is probably faster for reaching a solution than I can surmise from digging through the assembler. On Sep 15, 2014 6:57 PM, Sam Barham s.bar...@adinstruments.com wrote: I can't read assembly, so this doesn't mean much to me, but hopefully it'll mean something to you :)

40540e: e9 fc fe ff ff        jmpq   40530f <openlog@plt+0x242f>
405413: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
405418: 48 89 de              mov    %rbx,%rsi
40541b: e8 b0 fd ff ff        callq  4051d0 <openlog@plt+0x22f0>
405420: 48 8b 7b 18           mov    0x18(%rbx),%rdi
405424: 48 85 ff              test   %rdi,%rdi
405427: 74 0d                 je     405436 <openlog@plt+0x2556>
405429: 4c 89 e2              mov    %r12,%rdx
40542c: be 60 54 40 00        mov    $0x405460,%esi
405431: e8 ca d3 ff ff        callq  402800 <hash_foreach@plt>
405436: 31 c0                 xor    %eax,%eax
405438: e9 f8 fe ff ff        jmpq   405335 <openlog@plt+0x2455>
40543d: 0f 1f 00              nopl   (%rax)
405440: 31 c9                 xor    %ecx,%ecx
405442: 4c 89 ea              mov    %r13,%rdx
405445: 31 f6                 xor    %esi,%esi
405447: 4c 89 e7              mov    %r12,%rdi
40544a: 4c 89 04 24           mov    %r8,(%rsp)
40544e: e8 3d fe ff ff        callq  405290 <openlog@plt+0x23b0>
405453: 4c 8b 04 24           mov    (%rsp),%r8
405457: 89 c5                 mov    %eax,%ebp
405459: eb ab                 jmp    405406 <openlog@plt+0x2526>
40545b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
405460: 48 89 6c 24 f0        mov    %rbp,-0x10(%rsp)
405465: 4c 89 64 24 f8        mov    %r12,-0x8(%rsp)
40546a: 49 89 fc              mov    %rdi,%r12
40546d: 48 89 5c 24 e8        mov    %rbx,-0x18(%rsp)
405472: 48 83 ec 18           sub    $0x18,%rsp
405476: 8b 7a 18              mov    0x18(%rdx),%edi
405479: 48 89 d5              mov    %rdx,%rbp
40547c: 48 8b 1e              mov    (%rsi),%rbx
40547f: 85 ff                 test   %edi,%edi
405481: 74 0c                 je     40548f <openlog@plt+0x25af>
405483: 48 89 de              mov    %rbx,%rsi
405486: e8 15 fd ff ff        callq  4051a0 <openlog@plt+0x22c0>
40548b: 85 c0                 test   %eax,%eax
40548d: 74 12                 je     4054a1 <openlog@plt+0x25c1>
40548f: 31 c9                 xor    %ecx,%ecx
405491: 48 89 ea              mov    %rbp,%rdx
405494: 4c 89 e6              mov    %r12,%rsi
405497: 48 89 df              mov    %rbx,%rdi
40549a: ff 53 08              callq  *0x8(%rbx)
40549d: 85 c0                 test   %eax,%eax
40549f: 74 1f                 je     4054c0 <openlog@plt+0x25e0>
4054a1: b8 01 00 00 00        mov    $0x1,%eax
4054a6: 48 8b 1c 24           mov    (%rsp),%rbx
4054aa: 48 8b 6c 24 08        mov    0x8(%rsp),%rbp
4054af: 4c 8b 64 24 10        mov    0x10(%rsp),%r12
4054b4: 48 83 c4 18           add    $0x18,%rsp
4054b8: c3                    retq
4054b9: 0f 1f 80 00 00 00 00  nopl   0x0(%rax)
4054c0: 48 89 ef              mov    %rbp,%rdi
4054c3: 48 89 de              mov    %rbx,%rsi
4054c6: e8 05 fd ff ff        callq  4051d0 <openlog@plt+0x22f0>
4054cb: 48 8b 7b 18           mov    0x18(%rbx),%rdi

On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell devon.od...@gmail.com wrote: If you can install the dbg or dbgsym package for this, you can get more information. If you cannot do this, running: objdump -d `which gmond` | less in less: /40547c Paste a little context of the disassembly before and after that address, then scroll up and paste which function it's in. (That might still be too little information or even bad information if the binary is stripped. But it's something.) --dho 2014-09-14 18:09 GMT-07:00 Sam Barham s.bar...@adinstruments.com: I've finally managed to generate a core
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
Are you able to share a core file? 2014-09-11 14:32 GMT-07:00 Sam Barham s.bar...@adinstruments.com: We are using Ganglia to monitor our cloud infrastructure on Amazon AWS. Everything is working correctly (metrics are flowing etc), except that occasionally the gmetad process will segfault out of the blue. The gmetad process is running on an m3.medium EC2, and is monitoring about 50 servers. The servers are arranged into groups, each one having a bastion EC2 where the metrics are gathered. gmetad is configured to grab the metrics from those bastions - about 10 of them.

Some useful facts:
- We are running Debian Wheezy on all the EC2s
- Sometimes the crash will happen multiple times in a day, sometimes it'll be a day or two before it crashes
- The crash creates no logs in normal operation other than a segfault log something like gmetad[11291]: segfault at 71 ip 0040547c sp 7ff2d6572260 error 4 in gmetad[40+e000]. If we run gmetad manually with debug logging, it appears that the crash is related to gmetad doing a cleanup. When we realised that the cleanup process might be to blame we did more research around that. We realised that our disk IO was way too high and added rrdcached in order to reduce it. The disk IO is now much lower, and the crash is occurring less often, but still an average of once a day or so.
- We have two systems (dev and production). Both exhibit this crash, but the dev system, which is monitoring a much smaller group of servers, crashes significantly less often. The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2. We've upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
- We have monit running on both systems configured to restart gmetad if it dies. It restarts immediately with no issues.
- The production system is storing its data on a magnetic disk, the dev system is using ssd. That doesn't appear to have changed the frequency of the crash.
Has anyone experienced this kind of crash, especially on Amazon hardware? We're at our wits' end trying to find a solution!
[Ganglia-general] Gmetad Platform Poll
Hi all, I'm intending to continue working on performance improvements for gmetad. I'm curious if anybody uses gmetad on architectures that are not: * ARM * PPC * PPC64 * SPARCv9 * i386 * amd64 or on systems that are not: * Linux * ${any}BSD * Solaris (I'd also be interested in hearing if people are using gmond on architectures other than those mentioned above; less interested about the operating systems for that one.) Kind regards, --dho
Re: [Ganglia-general] Gmetad Platform Poll
Thanks. I think the work I'm doing should work with AIX on POWER. Would anybody with a builder be able to test and verify this? 2013/12/11 Morten Torstensen morten.torsten...@evry.com: We are using ganglia for aix on power, and possibly linux on power too in the close future. We use binaries from Michael Perzl, http://www.perzl.org/ganglia/ Best regards Morten Torstensen Chief Solution Architect, BA Nordic Open Systems Future Proof Service Development morten.torsten...@evry.com M +47 46819584 -Original Message- From: Devon H. O'Dell [mailto:devon.od...@gmail.com] Sent: Wednesday, 11 December, 2013 16:49 To: ganglia-develop...@lists.sourceforge.net; ganglia-general@lists.sourceforge.net Subject: [Ganglia-general] Gmetad Platform Poll Hi all, I'm intending to continue working on performance improvements for gmetad. I'm curious if anybody uses gmetad on architectures that are not: * ARM * PPC * PPC64 * SPARCv9 * i386 * amd64 or on systems that are not: * Linux * ${any}BSD * Solaris (I'd also be interested in hearing if people are using gmond on architectures other than those mentioned above; less interested about the operating systems for that one.) Kind regards, --dho