Hi, I am working on a cluster with 752 nodes at the University of Freiburg (Germany).
In this setup gmetad is crashing with a segfault.

/var/log/messages:

  Aug 16 20:07:47 monitor kernel: gmetad[38792]: segfault at 0 ip 00007f9b5122d82c sp 00007f9b38a31af0 error 4 in libganglia.so.0.0.0[7f9b51222000+14000]
  Aug 16 20:07:47 monitor systemd: gmetad.service: main process exited, code=killed, status=11/SEGV

System: CentOS Linux release 7.2.1511

Ganglia-Versions:

  ganglia-web-3.7.1-2.el7.x86_64
  ganglia-3.7.2-2.el7.x86_64
  ganglia-gmond-3.7.2-2.el7.x86_64
  ganglia-debuginfo-3.7.2-2.el7.x86_64
  ganglia-gmetad-3.7.2-2.el7.x86_64

The ganglia configuration files are attached to this email.

The crash always happens when gmetad removes nodes that disappeared:

  $ journalctl -u gmetad.service
  ...
  Aug 23 16:58:30 monitor gmetad: Updating host n4262.nemo.privat, metric disk_total
  Aug 23 16:58:30 monitor gmetad: Updating host n4262.nemo.privat, metric mem_shared
  Aug 23 16:59:00 monitor journal: Suppressed 48289 messages from /system.slice/gmetad.service
  Aug 23 16:59:00 monitor gmetad: Cleanup thread running...
  Aug 23 16:59:00 monitor gmetad: Cleanup deleting host "n4385.nemo.privat"
  Aug 23 16:59:00 monitor gmetad: Cleanup deleting host "n4385.nemo.privat"
  Aug 23 16:59:00 monitor kernel: gmetad[68347]: segfault at 160 ip 00007ffb0e9de9a6 sp 00007ffaf6b3da60 error 4 in libganglia.so.0.0.0[7ffb0e9d2000+16000]
  Aug 23 16:59:00 monitor systemd: gmetad.service: main process exited, code=killed, status=11/SEGV
  Aug 23 16:59:00 monitor systemd: Unit gmetad.service entered failed state.

I started gmetad with gdb to get more information:

  gdb /usr/sbin/gmetad
  (gdb) run -d 10 -c /etc/ganglia/gmetad.conf

The crash looks like this:

  ...
  Writing Root Summary data for metric mem_shared
  Writing Root Summary data for metric proc_run
  Cleanup thread running...
  Cleanup deleting host "n4527.nemo.privat"
  Cleanup deleting host "n4527.nemo.privat"

  Program received signal SIGSEGV, Segmentation fault.
  [Switching to Thread 0x7fffdfb00700 (LWP 175149)]
  0x00007ffff799f82c in hash_key (seed=0, len=<optimized out>, key=<optimized out>) at hash.c:182
  182             seed ^= (uint64_t)*bp++;

Additional gdb information:

  (gdb) where
  #0  0x00007ffff799f82c in hash_key (seed=0, len=<optimized out>, key=<optimized out>) at hash.c:182
  #1  hashval (hash=0x7fffd9a5f290, key=<optimized out>) at hash.c:195
  #2  hash_delete (key=<optimized out>, hash=hash@entry=0x7fffd9a5f290) at hash.c:335
  #3  0x00007ffff799f927 in hash_destroy (hash=0x7fffd9a5f290) at hash.c:145
  #4  0x000000000040f35c in cleanup_source (key=0x7fffd92140d0, val=0x7fffd92143d0, arg=0x7fffdfaffc10) at cleanup.c:170
  #5  0x00007ffff799f9d9 in hash_walkfrom (hash=0x62ffc0, from=<optimized out>, func=0x40f219 <cleanup_source>, arg=0x7fffdfaffc10) at hash.c:402
  #6  0x000000000040f50b in cleanup_thread (arg=0x0) at cleanup.c:206
  #7  0x00007ffff635ddc5 in start_thread (arg=0x7fffdfb00700) at pthread_create.c:308
  #8  0x00007ffff608aced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

  (gdb) list
  177         unsigned char *be = bp + len;  /* beyond end of buffer */
  178
  179         /* FNV-1a hash; assume we have stdint.h available */
  180         while (bp < be) {
  181             /* xor the bottom with the current octet */
  182             seed ^= (uint64_t)*bp++;
  183             /* multiply by the 64 bit FNV magic prime mod 2^64 */
  184             seed *= FNV_64_PRIME;
  185         }
  186

  (gdb) print bp
  $1 = (unsigned char *) 0x1 <Address 0x1 out of bounds>
  (gdb) print be
  $2 = (unsigned char *) 0x161 <Address 0x161 out of bounds>
  (gdb) print seed
  $3 = 0

best regards,
Konrad Meier
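
For anyone who wants to experiment with the failing loop outside of gmetad, here is a minimal standalone sketch of the FNV-1a loop from the gdb listing above. It is not the libganglia code: the function name fnv1a_64_checked, its return convention, and the NULL check are assumptions made for illustration, and FNV_64_PRIME is set to the standard 64-bit FNV prime, which I assume matches the value in hash.c. The check only catches a NULL key; it cannot detect a pointer to memory that has already been freed, so it does not address whatever is handing hash_delete a bad key in the first place.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 64-bit FNV prime (assumed to match FNV_64_PRIME in hash.c). */
#define FNV_64_PRIME UINT64_C(0x100000001b3)

/*
 * Hypothetical helper, not part of libganglia.
 * Computes a 64-bit FNV-1a hash over key[0..len) starting from `seed`
 * and stores it in *out. Returns 0 on success, or -1 when there is
 * nothing valid to read (NULL key with len > 0, or NULL out).
 */
static int fnv1a_64_checked(const void *key, size_t len,
                            uint64_t seed, uint64_t *out)
{
    const unsigned char *bp = key;
    const unsigned char *be;

    if (out == NULL || (key == NULL && len > 0))
        return -1;              /* refuse to dereference a NULL key */

    if (len == 0) {             /* empty key: hash is the seed itself */
        *out = seed;
        return 0;
    }

    be = bp + len;              /* one past the end of the buffer */
    while (bp < be) {
        seed ^= (uint64_t)*bp++;    /* xor in the current octet */
        seed *= FNV_64_PRIME;       /* multiply mod 2^64 */
    }

    *out = seed;
    return 0;
}
```

Called with a NULL key and len = 0x160, this sketch returns -1 instead of faulting, whereas the loop at hash.c:180-185 dereferences the pointer unconditionally.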
Attachment: ganglia-conf.tar.gz (application/gzip)