Re: [Ganglia-general] gmetad segmentation fault

2015-12-11 Thread Cristovao Cordeiro
Hi guys,

just to update on this:
 - I've removed my ganglia-gmetad/gmond and libganglia from everywhere and 
installed the most recent versions from the EPEL repository. The error is 
still there.
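
For completeness, what I ran was roughly this (package names from memory):

  $ yum remove ganglia-gmetad ganglia-gmond libganglia
  $ yum install --enablerepo=epel ganglia-gmetad ganglia-gmond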

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro


From: Cristovao Cordeiro
Sent: 08 December 2015 11:49
To: Marcello Morgotti
Cc: ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] gmetad segmentation fault

Hi everyone,

sorry for the late reply.
@Devon
thanks for looking into it.
I do have .so.0 and .so.0.0.0 on my system and I am not using any custom 
modules. The Ganglia deployment is, however, a bit different from the standard:
  - on a single VM, gmetad is running (always) and several gmond daemons are 
running in the background (daemon gmond -c /etc/ganglia/gmond_N.conf), all 
receiving metrics through unicast.
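Each gmond_N.conf is along these lines (cluster names and port numbers here 
are made up for illustration):

  # /etc/ganglia/gmond_N.conf -- one gmond instance per cluster
  cluster {
    name = "cluster_N"
  }
  udp_recv_channel {
    port = 8650   # unique per instance; nodes send their metrics here via unicast
  }
  tcp_accept_channel {
    port = 8651   # gmetad polls this instance on this port
  }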
The Ganglia package is also built by me, from source. I am 
currently building and using Ganglia 3.7.1 (taken from 
http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.7.1/).
I build the Ganglia RPM myself for two reasons:
1 - have Ganglia available in YUM
2 - minor changes to ganglia-web's apache.conf
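
(The build itself is nothing exotic; roughly, it just uses the spec file 
shipped in the tarball:

  $ rpmbuild -tb ganglia-3.7.1.tar.gz   # -tb builds binary RPMs from the embedded ganglia.spec

with my small apache.conf change applied on top.)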

I have other monitors running 3.6.0 and no errors there, but on those I 
installed Ganglia manually and directly, without building an RPM.

I also see that 3.7.2 is already available in the EPEL repository, so I 
might try that.

Regarding the compilation with debug symbols…
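What I have in mind is roughly the following, assuming the stock autotools 
build (the exact flags are just a suggestion):

  $ ./configure --with-gmetad CFLAGS="-g -O0"   # keep symbols, disable optimisation
  $ make && sudo make install
  $ ulimit -c unlimited                         # allow a core dump before starting gmetad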

@Marcello
did you get a chance to do it?


Best regards,
Cristóvão José Domingues Cordeiro




On 24 Nov 2015, at 18:51, Marcello Morgotti wrote:

Hello,

I'd like to join the discussion because this problem is affecting us as
well. We have the problem on two different installations:

- 2 servers in an active-active HA configuration, each with CentOS 7.1 +
  ganglia 3.7.2 + rrdcached, monitoring systems A, B, C, D
- 2 servers in an active-active HA configuration, each with RedHat 6.5 +
  ganglia 3.7.2 + rrdcached, monitoring systems E, F, G, H

In both cases the ganglia RPM packages are taken from the EPEL repository.
The curious thing is that every time the segfault happens, it happens on
both servers at almost the same time. E.g., on the CentOS 7 systems:

Nov 15 12:27:35 rp02 kernel: traps: gmetad[2620] general protection
ip:7fd70d62f82c sp:7fd6fdcb3af0 error:0 in
libganglia.so.0.0.0[7fd70d624000+14000]
Nov 15 12:27:35 rp02 systemd: gmetad.service: main process exited,
code=killed, status=11/SEGV
Nov 15 12:27:35 rp02 systemd: Unit gmetad.service entered failed state.
Nov 15 12:27:41 rp01 kernel: traps: gmetad[6977] general protection
ip:7fc1bdde582c sp:7fc1ae469af0 error:0 in
libganglia.so.0.0.0[7fc1bddda000+14000]
Nov 15 12:27:41 rp01 systemd: gmetad.service: main process exited,
code=killed, status=11/SEGV
Nov 15 12:27:41 rp01 systemd: Unit gmetad.service entered failed state.


Hope this helps and adds information. I will try to build a debug
version of gmetad to see if it's possible to generate a core dump.
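
In case it is useful to others, the plan for the CentOS 7 pair is roughly
this (the drop-in path is just where I intend to put it):

  # /etc/systemd/system/gmetad.service.d/core.conf -- allow the service to dump core
  [Service]
  LimitCORE=infinity

  # make the cores land somewhere findable, then pick up the drop-in
  $ sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
  $ systemctl daemon-reload && systemctl restart gmetad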

Best Regards,
Marcello

On 23/11/2015 17:30, Devon H. O'Dell wrote:
It's just a system versioning thing for shared libraries. Usually .so
is a soft link to .so.0, which is a soft link to .so.0.0.0. This is
intended to be an ABI versioning interface, but it's not used all that
often. Are these legitimately different files on your
system?
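On a typical install the chain looks something like this (paths hypothetical,
output trimmed):

  $ ls -l /usr/lib64/libganglia*
  libganglia.so -> libganglia.so.0         # dev symlink, used at link time
  libganglia.so.0 -> libganglia.so.0.0.0   # soname symlink, used at run time
  libganglia.so.0.0.0                      # the actual shared object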

The crash is in hash_delete:

0000003b2c00b780 <hash_delete>:
...
  3b2c00b797:   48 8b 07                mov    (%rdi),%rax
  3b2c00b79a:   48 8d 34 30             lea    (%rax,%rsi,1),%rsi
  3b2c00b79e:   48 39 f0                cmp    %rsi,%rax
  3b2c00b7a1:   73 37                   jae    3b2c00b7da <hash_delete+0x5a>
  3b2c00b7a3:   48 bf b3 01 00 00 00    movabs $0x100000001b3,%rdi
  3b2c00b7aa:   01 00 00
  3b2c00b7ad:   0f 1f 00                nopl   (%rax)
  3b2c00b7b0:   0f b6 08                movzbl (%rax),%ecx
  3b2c00b7b3:   48 83 c0 01             add    $0x1,%rax
  3b2c00b7b7:   48 31 ca                xor    %rcx,%rdx
  3b2c00b7ba:   48 0f af d7             imul   %rdi,%rdx
  3b2c00b7be:   48 39 c6                cmp    %rax,%rsi
  3b2c00b7c1:   77 ed                   ja     3b2c00b7b0 <hash_delete+0x30>
...

%rdi is the first argument to the function, so it is the datum_t *key;
the mov loads key->data into %rax, and the movzbl loop reads bytes from
it. hash_key has been inlined here. Unfortunately, what appears to be
happening is that some key has already been removed from the hash table
and freed, and based on your description of the problem, that removal
was attempted concurrently. Your kernel crash shows that we were trying
to dereference a NULL pointer, so it would appear that key->data is NULL.
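
For the curious: that inlined loop is an FNV-style xor/multiply over the
key bytes ($0x100000001b3 is the 64-bit FNV prime). A C sketch of what the
code is doing, not the actual libganglia source, with the datum_t shape
assumed from the disassembly:

  #include <stddef.h>
  #include <stdint.h>

  typedef struct { void *data; size_t size; } datum_t;   /* shape assumed */

  static uint64_t hash_key_bytes(const datum_t *key, uint64_t h)
  {
      const unsigned char *p   = key->data;       /* mov (%rdi),%rax */
      const unsigned char *end = p + key->size;   /* lea (%rax,%rsi,1),%rsi */

      while (p < end) {                           /* cmp %rax,%rsi ; ja */
          h ^= *p++;                              /* movzbl (%rax),%ecx -- faults if key->data is NULL or freed */
          h *= 0x100000001b3ULL;                  /* imul %rdi,%rdx: 64-bit FNV prime */
      }
      return h;
  }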

Unfortunately, it is not clear without a backtrace what sort of key
specifically is in question here, but perhaps someone else might have
some context based on recent changes. (I don't think this is related
to my work on the hashes).

Are you running any custom modules (either in C or Python)? Would it
be possible for you to build 

Re: [Ganglia-general] gmetad segmentation fault

2015-12-11 Thread Devon H. O'Dell
Unfortunately, without a coredump or backtrace where debug symbols are
present, I'm not going to be able to offer any additional insight.
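
If you do manage to catch a core, roughly this is all it takes to get what
I need (binary and core paths will vary):

  $ gdb /usr/sbin/gmetad /var/tmp/core.gmetad.12345
  (gdb) thread apply all bt full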

Are you running any C and / or Python modules with gmetad?

--dho


[Ganglia-general] Gweb cluster page stopped working after adding 500 custom metrics per server

2015-12-11 Thread Sergey
Hi All!

We added ~500 custom metrics per server in one cluster, and now that 
cluster's page has stopped working.
All other clusters are working properly.
It looks like some timeout value needs to be increased in Gweb, because 
the data retrieval time has gone up.
Do you know how to fix this?
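
If it's a PHP-side limit, I guess these would be the knobs to try in
php.ini (values made up; that PHP limits are the cause is only our guess):

  max_execution_time = 120   ; default 30s may be too short to fetch ~500 metrics/host
  memory_limit = 512M        ; building the cluster view over many metrics uses memory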

The mobile page is still showing all data from this cluster.

Thanks!
Sergey