[Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
We are using Ganglia to monitor our cloud infrastructure on Amazon AWS. Everything is working correctly (metrics are flowing etc.), except that occasionally the gmetad process will segfault out of the blue. The gmetad process is running on an m3.medium EC2 instance and is monitoring about 50 servers. The servers are arranged into groups, each one having a bastion EC2 instance where the metrics are gathered. gmetad is configured to grab the metrics from those bastions - about 10 of them.

Some useful facts:

- We are running Debian Wheezy on all the EC2 instances.
- Sometimes the crash will happen multiple times in a day; sometimes it'll be a day or two before it crashes.
- The crash creates no logs in normal operation other than a segfault log something like "gmetad[11291]: segfault at 71 ip 0040547c sp 7ff2d6572260 error 4 in gmetad[40+e000]". If we run gmetad manually with debug logging, it appears that the crash is related to gmetad doing a cleanup.
- When we realised that the cleanup process might be to blame, we did more research around that. We realised that our disk IO was way too high and added rrdcached to reduce it. The disk IO is now much lower, and the crash is occurring less often, but still an average of once a day or so.
- We have two systems (dev and production). Both exhibit this crash, but the dev system, which is monitoring a much smaller group of servers, crashes significantly less often.
- The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2. We've upgraded ganglia on the dev system to ganglia 3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
- We have monit running on both systems, configured to restart gmetad if it dies. It restarts immediately with no issues.
- The production system is storing its data on a magnetic disk; the dev system is using SSD. That doesn't appear to have changed the frequency of the crash.

Has anyone experienced this kind of crash, especially on Amazon hardware?
We're at our wits' end trying to find a solution!

--
Want excitement? Manually upgrade your production database. When you want reliability, choose Perforce. Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
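A note on the rrdcached change described in the report above: one common way to wire gmetad to rrdcached is through librrd's RRDCACHED_ADDRESS environment variable, set where the init script will see it. A sketch, with the file path and socket location as assumptions (the socket must match the address rrdcached was started with via its -l flag):

```
# /etc/default/gmetad - sketch; the socket path is an assumption and
# must match the address rrdcached listens on (its -l flag).
RRDCACHED_ADDRESS="unix:/var/run/rrdcached.sock"
export RRDCACHED_ADDRESS
```

With this in place, librrd routes gmetad's RRD updates through the caching daemon instead of hitting disk on every write, which is what reduced the disk IO described above.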
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
I've finally managed to generate a core dump (the VM wasn't set up to do it yet), but it's 214MB and doesn't seem to contain anything helpful - especially as I don't have debug symbols. The backtrace shows:

#0  0x0040547c in ?? ()
#1  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#2  0x004054e1 in ?? ()
#3  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#4  0x004054e1 in ?? ()
#5  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#6  0x00405436 in ?? ()
#7  0x0040530d in ?? ()
#8  0x004058fa in ?? ()
#9  0x7f6008ef9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x7f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x in ?? ()

Is there a way for me to get more useful information out of it?

On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell wrote:
> Are you able to share a core file?
>
> [original report quoted in full - snipped]
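Since the VM originally wasn't set up to produce cores, it may be worth pinning down where future cores land before the next crash. Besides raising the core size limit in the daemon's init environment (`ulimit -c unlimited`), the kernel can be told to give cores a predictable name. A sketch with example values (paths are assumptions, not from this thread):

```
# /etc/sysctl.d/60-core.conf - example values, load with "sysctl --system"
# %e = executable name, %p = pid
kernel.core_pattern = /var/tmp/core.%e.%p
```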
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
I can't read assembly, so this doesn't mean much to me, but hopefully it'll mean something to you :)

  40540e: e9 fc fe ff ff       jmpq   40530f
  405413: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
  405418: 48 89 de             mov    %rbx,%rsi
  40541b: e8 b0 fd ff ff       callq  4051d0
  405420: 48 8b 7b 18          mov    0x18(%rbx),%rdi
  405424: 48 85 ff             test   %rdi,%rdi
  405427: 74 0d                je     405436
  405429: 4c 89 e2             mov    %r12,%rdx
  40542c: be 60 54 40 00       mov    $0x405460,%esi
  405431: e8 ca d3 ff ff       callq  402800
  405436: 31 c0                xor    %eax,%eax
  405438: e9 f8 fe ff ff       jmpq   405335
  40543d: 0f 1f 00             nopl   (%rax)
  405440: 31 c9                xor    %ecx,%ecx
  405442: 4c 89 ea             mov    %r13,%rdx
  405445: 31 f6                xor    %esi,%esi
  405447: 4c 89 e7             mov    %r12,%rdi
  40544a: 4c 89 04 24          mov    %r8,(%rsp)
  40544e: e8 3d fe ff ff       callq  405290
  405453: 4c 8b 04 24          mov    (%rsp),%r8
  405457: 89 c5                mov    %eax,%ebp
  405459: eb ab                jmp    405406
  40545b: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
  405460: 48 89 6c 24 f0       mov    %rbp,-0x10(%rsp)
  405465: 4c 89 64 24 f8       mov    %r12,-0x8(%rsp)
  40546a: 49 89 fc             mov    %rdi,%r12
  40546d: 48 89 5c 24 e8       mov    %rbx,-0x18(%rsp)
  405472: 48 83 ec 18          sub    $0x18,%rsp
  405476: 8b 7a 18             mov    0x18(%rdx),%edi
  405479: 48 89 d5             mov    %rdx,%rbp
  40547c: 48 8b 1e             mov    (%rsi),%rbx
  40547f: 85 ff                test   %edi,%edi
  405481: 74 0c                je     40548f
  405483: 48 89 de             mov    %rbx,%rsi
  405486: e8 15 fd ff ff       callq  4051a0
  40548b: 85 c0                test   %eax,%eax
  40548d: 74 12                je     4054a1
  40548f: 31 c9                xor    %ecx,%ecx
  405491: 48 89 ea             mov    %rbp,%rdx
  405494: 4c 89 e6             mov    %r12,%rsi
  405497: 48 89 df             mov    %rbx,%rdi
  40549a: ff 53 08             callq  *0x8(%rbx)
  40549d: 85 c0                test   %eax,%eax
  40549f: 74 1f                je     4054c0
  4054a1: b8 01 00 00 00       mov    $0x1,%eax
  4054a6: 48 8b 1c 24          mov    (%rsp),%rbx
  4054aa: 48 8b 6c 24 08       mov    0x8(%rsp),%rbp
  4054af: 4c 8b 64 24 10       mov    0x10(%rsp),%r12
  4054b4: 48 83 c4 18          add    $0x18,%rsp
  4054b8: c3                   retq
  4054b9: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
  4054c0: 48 89 ef             mov    %rbp,%rdi
  4054c3: 48 89 de             mov    %rbx,%rsi
  4054c6: e8 05 fd ff ff       callq  4051d0
  4054cb: 48 8b 7b 18          mov    0x18(%rbx),%rdi

On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell wrote:
> If you can install the dbg or dbgsym package for this, you can get
> more information. If you cannot do this, running:
>
> objdump -d `which gmond` | less
>
> in less:
>
> /40547c
>
> Paste a little context of the disassembly before and after that
> address, then scroll up and paste which function it's in. (That might
> still be too little information or even bad information if the binary
> is stripped. But it's something.)
>
> --dho
>
> [earlier backtrace quoted in full - snipped]
[Ganglia-general] Help understanding tmax and dmax
I'm having trouble understanding what values to use for dmax and tmax in my gmetric calls, and how those values match up to actual behaviour.

The situation is that I have several cron scripts that each run once a minute, finding various custom metrics and passing them into Ganglia. I then have the ganglia-alert script running, alerting on various metrics. When using the default values, I often get false alerts because a metric would appear to have disappeared for a moment. That makes sense, as the scripts sometimes take a few seconds to run, so there is a window for a metric's age to go slightly over the 60-second mark.

After some experimentation, it seems the only way I've found to not drop any metrics unnecessarily is to set BOTH dmax and tmax to something over the default of 60 - I'm using 120. But I don't understand why I should have to set tmax at all in this situation, and I don't really understand what these values are actually controlling. Can anyone shed more light on this?
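For what it's worth, my reading of the semantics is: tmax is the maximum time gmond expects between reports of a metric (beyond which consumers may treat it as absent), while dmax is how long an un-refreshed metric lives before it is deleted outright (0 meaning never delete). A toy model of that reading - this is an illustration of the semantics, not gmond's actual code:

```python
# Toy model of how a consumer might treat a metric's age relative to
# tmax (expected max seconds between reports) and dmax (delete-after age).
# Illustration of the semantics discussed above, not gmond's code.

def metric_state(age, tmax, dmax):
    """Classify a metric by its age in seconds."""
    if dmax and age > dmax:
        return "deleted"   # past dmax: the metric is dropped entirely
    if age > tmax:
        return "stale"     # past tmax: may look 'disappeared' to alerters
    return "fresh"

# A cron job that nominally reports every 60s but sometimes runs a few
# seconds late: with tmax=60 the metric goes briefly stale - the
# false-alert window - while tmax=120 absorbs the jitter.
assert metric_state(65, tmax=60, dmax=0) == "stale"
assert metric_state(65, tmax=120, dmax=120) == "fresh"
assert metric_state(130, tmax=120, dmax=120) == "deleted"
```

Under that model, raising both values to 120 absorbs the cron jitter, at the cost of slower detection of a metric that has genuinely stopped reporting.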
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
The debug build of 3.6.0 finally crashed over the weekend. The backtrace is:

#0 0x7f042e4ba38c in hash_insert (key=0x7f0425bcc440, val=0x7f0425bcc430, hash=0x7239d0) at hash.c:233
#1 0x00408551 in startElement_METRIC (data=0x7f0425bcc770, el=0x733930 "METRIC", attr=0x709270) at process_xml.c:677
#2 0x004092b2 in start (data=0x7f0425bcc770, el=0x733930 "METRIC", attr=0x709270) at process_xml.c:1036
#3 0x7f042d55b5fb in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#4 0x7f042d55c84e in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#5 0x7f042d55e36e in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#6 0x7f042d55eb1b in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#7 0x7f042d560b5d in XML_ParseBuffer () from /lib/x86_64-linux-gnu/libexpat.so.1
#8 0x00409953 in process_xml (d=0x618900, buf=0x792360 "\n\n \n ...

Devon H. O'Dell wrote:
> Regardless of whether this is 3.3.8 or 3.6.0, the offending line is:
>
> WRITE_LOCK(hash, i);
>
> I was going to guess this was 3.6.0 because it's a different
> backtrace, however the line number in process_xml.c doesn't make sense
> unless it is 3.3.8. What this implies is that the hash table is not
> properly protected by its mutex.
>
> There are 339 commits between 3.3.8 and the current master branch. I'd
> like to heavily suggest updating because I unfortunately do not have
> time to look through all the commit messages to see if this has been
> solved by work others have done.
>
> --dho
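Devon's diagnosis - a hash table not properly protected by its mutex - can be illustrated with a small stand-in. gmetad's hash table takes a per-bucket lock (the WRITE_LOCK(hash, i) in the crashing frame), and the invariant is that every path touching a bucket, reader or writer, must hold that bucket's lock; the crash pattern here (one thread in hash_insert while others walk the table via hash_foreach) is what a skipped lock produces. A Python sketch of the invariant, purely as illustration, not gmetad's code:

```python
# Sketch of the per-bucket locking invariant: every thread that touches
# a bucket must hold that bucket's lock, whether inserting or iterating.
# Python stand-in for illustration, not gmetad's actual hash table.
import threading

class LockedHash:
    def __init__(self, nbuckets=8):
        self.buckets = [dict() for _ in range(nbuckets)]
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def _idx(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, val):
        i = self._idx(key)
        with self.locks[i]:          # analogue of WRITE_LOCK(hash, i)
            self.buckets[i][key] = val

    def foreach(self, fn):
        for i, bucket in enumerate(self.buckets):
            with self.locks[i]:      # iterators must take the lock too
                for k, v in list(bucket.items()):
                    fn(k, v)

h = LockedHash()
h.insert("host1", 1)
h.insert("host2", 2)
seen = {}
h.foreach(lambda k, v: seen.__setitem__(k, v))
assert seen == {"host1": 1, "host2": 2}
```

If any path skips the lock, a concurrent insert can modify a bucket mid-iteration - in C, that is exactly the use-after-free/garbage-pointer territory these segfaults live in.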
[Ganglia-general] gmond occasionally doesn't connect up in unicast
We've got about 100 machines running on AWS EC2 instances, with Ganglia for monitoring. Because we are on Amazon, we can't use multicast, so the architecture we have is: each cluster has a bastion machine, and every other machine in the cluster has gmond send its data to the bastion, which gmetad then queries. All standard and sensible, and it works just fine.

Except that occasionally, when I redeploy the machines in a cluster (but not the bastion - that stays running through this operation), just one of the machines will not send data through to the bastion. All I can say for sure is that gmond is running OK on the problem machine, and there are no error logs on the problem machine, the bastion or the gmetad machine, but the machine doesn't appear in gmetad. If I go into the problem machine and restart gmond, it reconnects just fine and appears in gmetad.

Which machine has the error is random - it's not a particular type of machine or anything. Because the error only shows up rarely, and only at deployment time, I can't really turn on debug_level to investigate.

Also, some of the configuration values in gmond.conf are filled in when the userdata is run. I've edited /etc/init.d/ganglia-monitor so that it starts up immediately after the userdata has run, just in case that matters.

Any ideas?

Sam
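For reference, the unicast topology described above is usually expressed as a send channel on the leaf nodes and a receive/accept pair on the bastion. A sketch - the host and port are placeholder values, not taken from this thread:

```
/* Leaf-node gmond.conf sketch: unicast everything to the bastion.
   host and port are placeholders. */
udp_send_channel {
  host = bastion.cluster.internal
  port = 8649
}

/* Bastion gmond.conf sketch: receive from the leaves, and let the
   gmetad box poll the aggregated state over TCP. */
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}
```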
Re: [Ganglia-general] gmond occasionally doesn't connect up in unicast
Until recently I wasn't controlling the start order of ec2-run-user-data and ganglia-monitor, so they were starting at the same 'time'. Yesterday I fixed that, so that now ec2-run-user-data starts at S02 and ganglia-monitor at S03. I thought the issue might be exactly what you describe - ganglia-monitor starting before ec2-run-user-data has finished altering the gmond.conf - but the error still happened today. Also, I suspect (but don't know for sure) that the gmond.conf will actually be invalid before ec2-run-user-data has run - I've altered it to have flags that get replaced with valid values.

On Thu, Nov 13, 2014 at 12:20 PM, Joe Gracyk wrote:
> Hi, Sam -
>
> We've got a similar deployment (EC2 instances unicasting to a per-AZ
> gmetad) that we're managing with Puppet, and I can't say we've seen
> anything like that.
>
> How are you automating your redeployments and gmond configurations? Could
> your gmond instances be starting up before their unicast configurations
> have been applied? If you had some sort of race condition where gmond could
> be installed and started, and *then* getting the conf file written, I'd
> expect gmond to merrily chug along, fruitlessly trying to multicast into
> the void.
>
> Good luck!
>
> [original message quoted in full - snipped]
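Given that gmond.conf is known to be invalid until ec2-run-user-data has replaced the placeholder flags, one defensive option is to make the gmond startup path refuse to proceed until the placeholders are gone from the file. A sketch of that check - the @BASTION_HOST@ marker, paths, and timeouts are hypothetical examples, not values from this thread:

```python
# Sketch: don't start gmond while gmond.conf still contains the template
# placeholders that user-data is supposed to replace. The "@BASTION_HOST@"
# marker is a hypothetical example; use whatever token your templates use.
import time

def config_is_rendered(path, marker="@BASTION_HOST@"):
    """True once the placeholder marker is gone from the config file."""
    with open(path) as f:
        return marker not in f.read()

def wait_for_config(path, marker="@BASTION_HOST@", timeout=300, poll=2.0):
    """Poll until the config is rendered, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if config_is_rendered(path, marker):
            return True
        time.sleep(poll)
    return False  # timed out: user-data never finished rendering the config
```

Calling something like this from the init script before launching gmond would turn the silent race into either a clean start or a visible timeout.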
Re: [Ganglia-general] segfault on gmetad making Ganglia unusable.
I can't help unfortunately, but I can say that I've been having exactly the same issue, although less frequently (crashes anything from several times a day to once every couple of days). What is your gmetad hosted on? Mine is on Amazon Debian EC2s.

Cheers
Sam

On Sun, Feb 8, 2015 at 11:21 AM, jayadevan Chembakassery <jayadev...@gmail.com> wrote:
> Hi,
> My gmetad is going down every 20-30 min with a segfault.
> Seeing the below message in /var/log/messages:
>
> gmetad[2383]: segfault at 7f81ffe30df0 ip 7f7fa0a313a1 sp
> 7f7f98734400 error 4 in libganglia-3.6.1.so.0.0.0[7f7fa0a26000+14000]
>
> Env details:
> O/S: Redhat EL 6.2
> Ganglia Web Frontend version 3.6.2
> Ganglia Web Backend (gmetad) version 3.6.1
>
> I had the issue with gmetad 3.6.0, upgraded to 3.6.1 with no luck.
> Managed to get the core file.
> Not a gdb expert, but could see the below info:
>
> $ gdb gmetad core.28985
> ...
> Program terminated with signal 11, Segmentation fault.
> #0 0x7fbf1660d3a1 in hash_insert (key=0x7fbf0e310470,
> val=0x7fbf0e310480, hash=0x7fbf08087780) at hash.c:233
> 233 WRITE_LOCK(hash, i);
> ...
>
> Can someone help?
>
> Thanks,
> Jay