Re: [Ganglia-general] gmetad segmentation fault
Unfortunately, without a coredump or backtrace where debug symbols are present, I'm not going to be able to offer any additional insight. Are you running any C and / or Python modules with gmetad? --dho 2015-12-11 5:54 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Hi guys, > > just to update on this: > - I've removed my ganglia-gmetad/gmond and libganglia from everywhere and > installed the most recent versions from the epel repository. The error is > still there. > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > From: Cristovao Cordeiro > Sent: 08 December 2015 11:49 > To: Marcello Morgotti > Cc: ganglia-general@lists.sourceforge.net > > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi everyone, > > sorry for the late reply. > @Devon > thanks for looking into it. > I do have .so.0 and .so.0.0.0 in my system and I am not using any custom > modules. The Ganglia deployment is however a bit different from the > standard: > - in one single VM, gmetad is running (always) and several gmond daemons > are running in the background (daemon gmond -c /etc/ganglia/gmond_N.conf), > all receiving metrics through unicast. > The Ganglia package is built by me as well, from the source code. I am > currently building and using Ganglia 3.7.1 (taken from > http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.7.1/). > I build the Ganglia RPM myself for 2 reasons: > 1 - have Ganglia available in YUM > 2 - minor changes to ganglia-web's apache.conf > > I have other monitors running 3.6.0 and no errors there. But on those I have > installed Ganglia manually and directly without building an RPM. > > I also see 3.7.2 already available in the epel repository so I might try > this. > > Regarding the compilation with debug symbols… > > @Marcello > did you get a chance to do it?
> > > Best regards, > Cristóvão José Domingues Cordeiro > > > > > On 24 Nov 2015, at 18:51, Marcello Morgotti <m.morgo...@cineca.it> wrote: > > Hello, > > I'd like to join the discussion because this problem is affecting us as > well. We have the problem on two different installations: > > 2 servers in active-active HA configuration, each with CentOS 7.1 + > ganglia 3.7.2 + rrdcached monitoring systems A,B,C,D > 2 servers in active-active HA configuration, each with RedHat 6.5 + > ganglia 3.7.2 + rrdcached monitoring systems E,F,G,H > > In both cases the ganglia rpm packages are taken from EPEL repository. > The curious thing is that every time that the segfault happens it happens > almost at the same time. > I.e. for Centos7 systems: > > Nov 15 12:27:35 rp02 kernel: traps: gmetad[2620] general protection > ip:7fd70d62f82c sp:7fd6fdcb3af0 error:0 in > libganglia.so.0.0.0[7fd70d624000+14000] > Nov 15 12:27:35 rp02 systemd: gmetad.service: main process exited, > code=killed, status=11/SEGV > Nov 15 12:27:35 rp02 systemd: Unit gmetad.service entered failed state. > Nov 15 12:27:41 rp01 kernel: traps: gmetad[6977] general protection > ip:7fc1bdde582c sp:7fc1ae469af0 error:0 in > libganglia.so.0.0.0[7fc1bddda000+14000] > Nov 15 12:27:41 rp01 systemd: gmetad.service: main process exited, > code=killed, status=11/SEGV > Nov 15 12:27:41 rp01 systemd: Unit gmetad.service entered failed state. > > > Hope this helps and adds information, I will try to build a debug > version of gmetad to see if it's possible to generate a core dump. > > Best Regards, > Marcello > > On 23/11/2015 17:30, Devon H. O'Dell wrote: > > It's just a system versioning thing for shared libraries. Usually .so > is a soft link to .so.0 which is a soft link to .so.0.0.0. This is > intended to be an ABI versioning interface, but it's not super > frequently used. Are these legitimately different files on your > system? > > The crash is in hash_delete: > > 003b2c00b780 <hash_delete>: > ... 
> 3b2c00b797: 48 8b 07              mov    (%rdi),%rax
> 3b2c00b79a: 48 8d 34 30           lea    (%rax,%rsi,1),%rsi
> 3b2c00b79e: 48 39 f0              cmp    %rsi,%rax
> 3b2c00b7a1: 73 37                 jae    3b2c00b7da <hash_delete+0x5a>
> 3b2c00b7a3: 48 bf b3 01 00 00 00  movabs $0x10001b3,%rdi
> 3b2c00b7aa: 01 00 00
> 3b2c00b7ad: 0f 1f 00              nopl   (%rax)
> >>> 3b2c00b7b0: 0f b6 08          movzbl (%rax),%ecx
> 3b2c00b7b3: 48 83 c0 01           add    $0x1,%rax
> 3b2c00b7b7: 48 31 ca              xor    %rcx,%rdx
> 3b2c00b7ba: 48 0f af d7           imul   %rdi,%rdx
> 3b2c00b7be: 48 39 c6              cmp    %rax,%rsi
> 3b2c00b7c1: 77 ed
Re: [Ganglia-general] gmetad segmentation fault
It's just a system versioning thing for shared libraries. Usually .so is a soft link to .so.0 which is a soft link to .so.0.0.0. This is intended to be an ABI versioning interface, but it's not super frequently used. Are these legitimately different files on your system? The crash is in hash_delete:

003b2c00b780 <hash_delete>:
...
3b2c00b797: 48 8b 07              mov    (%rdi),%rax
3b2c00b79a: 48 8d 34 30           lea    (%rax,%rsi,1),%rsi
3b2c00b79e: 48 39 f0              cmp    %rsi,%rax
3b2c00b7a1: 73 37                 jae    3b2c00b7da <hash_delete+0x5a>
3b2c00b7a3: 48 bf b3 01 00 00 00  movabs $0x10001b3,%rdi
3b2c00b7aa: 01 00 00
3b2c00b7ad: 0f 1f 00              nopl   (%rax)
>>> 3b2c00b7b0: 0f b6 08          movzbl (%rax),%ecx
3b2c00b7b3: 48 83 c0 01           add    $0x1,%rax
3b2c00b7b7: 48 31 ca              xor    %rcx,%rdx
3b2c00b7ba: 48 0f af d7           imul   %rdi,%rdx
3b2c00b7be: 48 39 c6              cmp    %rax,%rsi
3b2c00b7c1: 77 ed                 ja     3b2c00b7b0 <hash_delete+0x30>
...

%rdi is the first argument to the function, so %rax is the datum_t *key, and (%rax) is key->data. hash_key has been inlined here. Unfortunately, what appears to be happening is that some key has already been removed from the hash table and freed, and based on your description of the problem, that was attempted concurrently. Your kernel crash shows that we were trying to dereference a NULL pointer, so it would appear that key->data is NULL. Unfortunately, it is not clear without a backtrace what sort of key specifically is in question here, but perhaps someone else might have some context based on recent changes. (I don't think this is related to my work on the hashes). Are you running any custom modules (either in C or Python)? Would it be possible for you to build gmond and libganglia with debugging symbols and generate a core dump? --dho 2015-11-23 1:29 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Hi Devon, > > thanks for the help. > Attached follows the binary file. > > btw, what is the difference between so.0 and so.0.0.0? 
> > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > On 17 Nov 2015, at 19:16, Devon H. O'Dell <devon.od...@gmail.com> wrote: > > Hi! Very sorry about this, I had a draft that I thought I had sent. > > Could you email me your libganglia.so binary off-list? Alternatively, > do you have the ability to compile libganglia with debugging symbols? > > 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > > Hi everyone, > > any news on this? > Another symptom is that this happens quite as often as the cluster changes, > meaning that the more activity there is in the cluster (delete machines, > create...) the more this issue happens. Could it be related with the > deletion of old hosts by gmond causing gmetad to try to access files that > are already gone? > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > ____ > From: Cristovao Cordeiro [cristovao.corde...@cern.ch] > Sent: 09 November 2015 13:40 > To: Devon H. O'Dell > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi Devon, > > thanks! > > * I don't think there was a core dump. At least that is not stated in > /var/log/messages and I don't find anything relevant in /var/spool/abrt/ > * I am running 3.7.1 > * The addr2line returns ??:0. Also with gdb: > > gdb /usr/lib64/libganglia.so.0.0.0 > > ... > Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging > symbols found)...done. > > Some more information about my setup: > - I am running several gmonds in the same machine, so all my data_sources > are to localhost. > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > > From: Devon H. O'Dell [devon.od...@gmail.com] > Sent: 09 November 2015 13:12 > To: Cristovao Cordeiro > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi! 
> > I have a couple of initial questions that might help figure out the problem: > > * Did you get a core dump? > * What version of ganglia are you running? > * This crash happened within libganglia.so at offset 0xb7b0. Can you run: > > $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0 > > and paste the output? If that does not work, there are a couple other > things we can try to get information about the fault, but hopefully we > can just work from there. > > Kind regards, > > Devon H. O'Dell > > 2
Re: [Ganglia-general] gmetad segmentation fault
Hi! Very sorry about this, I had a draft that I thought I had sent. Could you email me your libganglia.so binary off-list? Alternatively, do you have the ability to compile libganglia with debugging symbols? 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Hi everyone, > > any news on this? > Another symptom is that this happens quite as often as the cluster changes, > meaning that the more activity there is in the cluster (delete machines, > create...) the more this issue happens. Could it be related with the deletion > of old hosts by gmond causing gmetad to try to access files that are already > gone? > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > > From: Cristovao Cordeiro [cristovao.corde...@cern.ch] > Sent: 09 November 2015 13:40 > To: Devon H. O'Dell > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi Devon, > > thanks! > > * I don't think there was a core dump. At least that is not stated in > /var/log/messages and I don't find anything relevant in /var/spool/abrt/ > * I am running 3.7.1 > * The addr2line returns ??:0. Also with gdb: >> gdb /usr/lib64/libganglia.so.0.0.0 >... >Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging > symbols found)...done. > > Some more information about my setup: > - I am running several gmonds in the same machine, so all my data_sources > are to localhost. > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro > > > > From: Devon H. O'Dell [devon.od...@gmail.com] > Sent: 09 November 2015 13:12 > To: Cristovao Cordeiro > Cc: Ganglia-general@lists.sourceforge.net > Subject: Re: [Ganglia-general] gmetad segmentation fault > > Hi! > > I have a couple of initial questions that might help figure out the problem: > > * Did you get a core dump? > * What version of ganglia are you running? > * This crash happened within libganglia.so at offset 0xb7b0. 
Can you run: > > $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0 > > and paste the output? If that does not work, there are a couple other > things we can try to get information about the fault, but hopefully we > can just work from there. > > Kind regards, > > Devon H. O'Dell > > 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: >> Dear all, >> >> I have several Ganglia monitors running with similar configurations in >> different machines (VMs) and for a long time now I have been experiencing >> segmentation faults at random times. It seems to happen more on gmetads that >> are monitoring a larger number of nodes. >> >> In /var/log/messages I see: >> >> kernel: gmetad[3948]: segfault at 0 ip 003630c0b7b0 sp 7f0ecbffebc0 >> error 4 in libganglia.so.0.0.0[3630c0+15000] >> >> >> and in the console output there's only this: >> >> /bin/bash: line 1: 30375 Terminated /usr/sbin/gmetad >> >>[FAILED] >> >> >> gmetad does not have any special configuration besides the RRD location >> which is on a 4Gb ramdisk. >> >> >> Cumprimentos / Best regards, >> Cristóvão José Domingues Cordeiro >> >> ___ >> Ganglia-general mailing list >> Ganglia-general@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] gmetad segmentation fault
Hi! I have a couple of initial questions that might help figure out the problem:

* Did you get a core dump?
* What version of ganglia are you running?
* This crash happened within libganglia.so at offset 0xb7b0. Can you run:

$ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0

and paste the output? If that does not work, there are a couple other things we can try to get information about the fault, but hopefully we can just work from there. Kind regards, Devon H. O'Dell 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>: > Dear all, > > I have several Ganglia monitors running with similar configurations in > different machines (VMs) and for a long time now I have been experiencing > segmentation faults at random times. It seems to happen more on gmetads that > are monitoring a larger number of nodes. > > In /var/log/messages I see: > > kernel: gmetad[3948]: segfault at 0 ip 003630c0b7b0 sp 7f0ecbffebc0 > error 4 in libganglia.so.0.0.0[3630c0+15000] > > > and in the console output there's only this: > > /bin/bash: line 1: 30375 Terminated /usr/sbin/gmetad > >[FAILED] > > > gmetad does not have any special configuration besides the RRD location > which is on a 4Gb ramdisk. > > > Cumprimentos / Best regards, > Cristóvão José Domingues Cordeiro
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
If you can install the dbg or dbgsym package for this, you can get more information. If you cannot do this, running: objdump -d `which gmond` | less in less: /40547c Paste a little context of the disassembly before and after that address, then scroll up and paste which function it's in. (That might still be too little information or even bad information if the binary is stripped. But it's something.) --dho 2014-09-14 18:09 GMT-07:00 Sam Barham s.bar...@adinstruments.com: I've finally managed to generate a core dump (the VM wasn't set up to do it yet), but it's 214Mb and doesn't seem to contain anything helpful - especially as I don't have debug symbols. The backtrace shows:

#0  0x0040547c in ?? ()
#1  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#2  0x004054e1 in ?? ()
#3  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#4  0x004054e1 in ?? ()
#5  0x7f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
#6  0x00405436 in ?? ()
#7  0x0040530d in ?? ()
#8  0x004058fa in ?? ()
#9  0x7f6008ef9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x7f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x in ?? ()

Is there a way for me to get more useful information out of it? On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell devon.od...@gmail.com wrote: Are you able to share a core file? 2014-09-11 14:32 GMT-07:00 Sam Barham s.bar...@adinstruments.com: We are using Ganglia to monitor our cloud infrastructure on Amazon AWS. Everything is working correctly (metrics are flowing etc), except that occasionally the gmetad process will segfault out of the blue. The gmetad process is running on an m3.medium EC2, and is monitoring about 50 servers. The servers are arranged into groups, each one having a bastion EC2 where the metrics are gathered. gmetad is configured to grab the metrics from those bastions - about 10 of them. 
Some useful facts:
- We are running Debian Wheezy on all the EC2s
- Sometimes the crash will happen multiple times in a day, sometimes it'll be a day or two before it crashes
- The crash creates no logs in normal operation other than a segfault log something like gmetad[11291]: segfault at 71 ip 0040547c sp 7ff2d6572260 error 4 in gmetad[40+e000]. If we run gmetad manually with debug logging, it appears that the crash is related to gmetad doing a cleanup. When we realised that the cleanup process might be to blame we did more research around that. We realised that our disk IO was way too high and added rrdcached in order to reduce it. The disk IO is now much lower, and the crash is occurring less often, but still an average of once a day or so.
- We have two systems (dev and production). Both exhibit this crash, but the dev system, which is monitoring a much smaller group of servers, crashes significantly less often. The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2. We've upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
- We have monit running on both systems configured to restart gmetad if it dies. It restarts immediately with no issues.
- The production system is storing its data on a magnetic disk, the dev system is using ssd. That doesn't appear to have changed the frequency of the crash.

Has anyone experienced this kind of crash, especially on Amazon hardware? We're at our wits' end trying to find a solution!
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
This is the prologue of some function and the second argument is NULL when it shouldn't be. Unfortunately, the binary does appear to be stripped, so it will be slightly hard to figure out which function it is. Your previous email with the backtrace shows that it is walking the hash tree (probably to aggregate), so it's possible that some probe is returning data that can't be parsed or meaningfully interpreted. However, since it is a nested walk, it might be possible to guess which metric is that deeply nested. But not easily. This also means running under gdb is probably pointless. Do you have the ability to run a version with debugging symbols? If so, that is probably faster for reaching a solution than I can surmise from digging through the assembler. On Sep 15, 2014 6:57 PM, Sam Barham s.bar...@adinstruments.com wrote: I can't read assembly, so this doesn't mean much to me, but hopefully it'll mean something to you :)

40540e: e9 fc fe ff ff        jmpq   40530f <openlog@plt+0x242f>
405413: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
405418: 48 89 de              mov    %rbx,%rsi
40541b: e8 b0 fd ff ff        callq  4051d0 <openlog@plt+0x22f0>
405420: 48 8b 7b 18           mov    0x18(%rbx),%rdi
405424: 48 85 ff              test   %rdi,%rdi
405427: 74 0d                 je     405436 <openlog@plt+0x2556>
405429: 4c 89 e2              mov    %r12,%rdx
40542c: be 60 54 40 00        mov    $0x405460,%esi
405431: e8 ca d3 ff ff        callq  402800 <hash_foreach@plt>
405436: 31 c0                 xor    %eax,%eax
405438: e9 f8 fe ff ff        jmpq   405335 <openlog@plt+0x2455>
40543d: 0f 1f 00              nopl   (%rax)
405440: 31 c9                 xor    %ecx,%ecx
405442: 4c 89 ea              mov    %r13,%rdx
405445: 31 f6                 xor    %esi,%esi
405447: 4c 89 e7              mov    %r12,%rdi
40544a: 4c 89 04 24           mov    %r8,(%rsp)
40544e: e8 3d fe ff ff        callq  405290 <openlog@plt+0x23b0>
405453: 4c 8b 04 24           mov    (%rsp),%r8
405457: 89 c5                 mov    %eax,%ebp
405459: eb ab                 jmp    405406 <openlog@plt+0x2526>
40545b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
405460: 48 89 6c 24 f0        mov    %rbp,-0x10(%rsp)
405465: 4c 89 64 24 f8        mov    %r12,-0x8(%rsp)
40546a: 49 89 fc              mov    %rdi,%r12
40546d: 48 89 5c 24 e8        mov    %rbx,-0x18(%rsp)
405472: 48 83 ec 18           sub    $0x18,%rsp
405476: 8b 7a 18              mov    0x18(%rdx),%edi
405479: 48 89 d5              mov    %rdx,%rbp
40547c: 48 8b 1e              mov    (%rsi),%rbx
40547f: 85 ff                 test   %edi,%edi
405481: 74 0c                 je     40548f <openlog@plt+0x25af>
405483: 48 89 de              mov    %rbx,%rsi
405486: e8 15 fd ff ff        callq  4051a0 <openlog@plt+0x22c0>
40548b: 85 c0                 test   %eax,%eax
40548d: 74 12                 je     4054a1 <openlog@plt+0x25c1>
40548f: 31 c9                 xor    %ecx,%ecx
405491: 48 89 ea              mov    %rbp,%rdx
405494: 4c 89 e6              mov    %r12,%rsi
405497: 48 89 df              mov    %rbx,%rdi
40549a: ff 53 08              callq  *0x8(%rbx)
40549d: 85 c0                 test   %eax,%eax
40549f: 74 1f                 je     4054c0 <openlog@plt+0x25e0>
4054a1: b8 01 00 00 00        mov    $0x1,%eax
4054a6: 48 8b 1c 24           mov    (%rsp),%rbx
4054aa: 48 8b 6c 24 08        mov    0x8(%rsp),%rbp
4054af: 4c 8b 64 24 10        mov    0x10(%rsp),%r12
4054b4: 48 83 c4 18           add    $0x18,%rsp
4054b8: c3                    retq
4054b9: 0f 1f 80 00 00 00 00  nopl   0x0(%rax)
4054c0: 48 89 ef              mov    %rbp,%rdi
4054c3: 48 89 de              mov    %rbx,%rsi
4054c6: e8 05 fd ff ff        callq  4051d0 <openlog@plt+0x22f0>
4054cb: 48 8b 7b 18           mov    0x18(%rbx),%rdi

On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell devon.od...@gmail.com wrote: If you can install the dbg or dbgsym package for this, you can get more information. If you cannot do this, running: objdump -d `which gmond` | less in less: /40547c Paste a little context of the disassembly before and after that address, then scroll up and paste which function it's in. (That might still be too little information or even bad information if the binary is stripped. But it's something.) --dho 2014-09-14 18:09 GMT-07:00 Sam Barham s.bar...@adinstruments.com: I've finally managed to generate a core
Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)
Are you able to share a core file? 2014-09-11 14:32 GMT-07:00 Sam Barham s.bar...@adinstruments.com: We are using Ganglia to monitor our cloud infrastructure on Amazon AWS. Everything is working correctly (metrics are flowing etc), except that occasionally the gmetad process will segfault out of the blue. The gmetad process is running on an m3.medium EC2, and is monitoring about 50 servers. The servers are arranged into groups, each one having a bastion EC2 where the metrics are gathered. gmetad is configured to grab the metrics from those bastions - about 10 of them.

Some useful facts:
- We are running Debian Wheezy on all the EC2s
- Sometimes the crash will happen multiple times in a day, sometimes it'll be a day or two before it crashes
- The crash creates no logs in normal operation other than a segfault log something like gmetad[11291]: segfault at 71 ip 0040547c sp 7ff2d6572260 error 4 in gmetad[40+e000]. If we run gmetad manually with debug logging, it appears that the crash is related to gmetad doing a cleanup. When we realised that the cleanup process might be to blame we did more research around that. We realised that our disk IO was way too high and added rrdcached in order to reduce it. The disk IO is now much lower, and the crash is occurring less often, but still an average of once a day or so.
- We have two systems (dev and production). Both exhibit this crash, but the dev system, which is monitoring a much smaller group of servers, crashes significantly less often. The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2. We've upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
- We have monit running on both systems configured to restart gmetad if it dies. It restarts immediately with no issues.
- The production system is storing its data on a magnetic disk, the dev system is using ssd. That doesn't appear to have changed the frequency of the crash.
Has anyone experienced this kind of crash, especially on Amazon hardware? We're at our wits' end trying to find a solution!
[Ganglia-general] Gmetad Platform Poll
Hi all, I'm intending to continue working on performance improvements for gmetad. I'm curious if anybody uses gmetad on architectures that are not: * ARM * PPC * PPC64 * SPARCv9 * i386 * amd64 or on systems that are not: * Linux * ${any}BSD * Solaris (I'd also be interested in hearing if people are using gmond on architectures other than those mentioned above; less interested about the operating systems for that one.) Kind regards, --dho
Re: [Ganglia-general] Gmetad Platform Poll
Thanks. I think the work I'm doing should work with AIX on POWER. Would anybody with a builder be able to test and verify this? 2013/12/11 Morten Torstensen morten.torsten...@evry.com: We are using ganglia for aix on power, and possibly linux on power too in the close future. We use binaries from Michael Perzl, http://www.perzl.org/ganglia/ Best regards Morten Torstensen Chief Solution Architect, BA Nordic Open Systems Future Proof Service Development morten.torsten...@evry.com M +47 46819584 -Original Message- From: Devon H. O'Dell [mailto:devon.od...@gmail.com] Sent: Wednesday, 11 December, 2013 16:49 To: ganglia-develop...@lists.sourceforge.net; ganglia-general@lists.sourceforge.net Subject: [Ganglia-general] Gmetad Platform Poll Hi all, I'm intending to continue working on performance improvements for gmetad. I'm curious if anybody uses gmetad on architectures that are not: * ARM * PPC * PPC64 * SPARCv9 * i386 * amd64 or on systems that are not: * Linux * ${any}BSD * Solaris (I'd also be interested in hearing if people are using gmond on architectures other than those mentioned above; less interested about the operating systems for that one.) Kind regards, --dho