After upgrading Slurm from 2.5.4 to 2.6.5, we are seeing slurmstepd segfaults.

This is what we see in the logs:
slurmstepd[8459]: segfault at 0 ip 00007ffff713d7a5 sp 00007ffff7fa6af8 error 4 in libc-2.12.so[7ffff7013000+18b000]
slurmstepd[32345]: segfault at 0 ip 00007ffff709dd96 sp 00007ffff7fa68f8 error 4 in libc-2.12.so[7ffff7013000+18b000]
slurmstepd[36025]: segfault at 0 ip 00007ffff709dd96 sp 00007ffff7fa68f8 error 4 in libc-2.12.so[7ffff7013000+18b000]
slurmstepd[37942]: segfault at 0 ip 00007ffff709dd96 sp 00007ffff7fa68f8 error 4 in libc-2.12.so[7ffff7013000+18b000]
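
For reference, here is a minimal sketch of how these kernel messages can be decoded. It assumes the usual x86-64 "segfault at ... ip ... error N in lib[base+size]" layout and the standard page-fault error-code bits (bit 0 = page present, bit 1 = write, bit 2 = user mode). The helper name decode_segfault is made up for illustration and is not part of Slurm or any other tool.

```python
import re

# Pattern for the kernel segfault message format seen in the log lines above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split one segfault line into its fields."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"])
    return {
        "pid": int(m["pid"]),
        # Address whose access faulted; 0 means a NULL pointer was used.
        "fault_addr": int(m["addr"], 16),
        "lib": m["lib"],
        # Instruction pointer relative to the library's load address.
        "lib_offset": int(m["ip"], 16) - int(m["base"], 16),
        # Assumed x86 page-fault error-code bits.
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    sample = (
        "slurmstepd[8459]: segfault at 0 ip 00007ffff713d7a5 "
        "sp 00007ffff7fa6af8 error 4 in libc-2.12.so[7ffff7013000+18b000]"
    )
    info = decode_segfault(sample)
    print(info["lib"], hex(info["lib_offset"]), info["access"])
```

Read this way, "segfault at 0 ... error 4" is a user-mode read through a NULL pointer inside libc. With matching glibc debuginfo installed, the resulting offset could be resolved to a function name with a tool such as addr2line, which may help narrow down which libc call slurmstepd is making with a NULL argument.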

Strace shows:
connect(9, {sa_family=AF_FILE, path="/var/run/munge/munge.socket.2"}, 110) = 0
writev(9, [{"\0`mK\4\2\0\0\0\0\24", 11}, {"\1\1\1\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 20}], 2) = 31
read(9, 0x7fff8a4f59c0, 11) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=9, events=POLLIN}], 1, 3000) = 1 ([{fd=9, revents=POLLIN|POLLHUP}])
read(9, "\0`mK\4\3\0\0\0\0\206", 11)    = 11
read(9, "\0\0\0\0\0\200MUNGE:AwQDAAASHb46boKmiwmh"..., 134) = 134
close(9)                                = 0
rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x7f065eacf710}, {SIG_DFL, [ALRM], SA_RESTORER, 0x7f065eacf710}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f065eacf710}, {SIG_IGN, [], 0}, 8) = 0
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
poll([{fd=6, events=POLLOUT}], 1, 60000) = 1 ([{fd=6, revents=POLLOUT}])
recvfrom(6, 0x7fff8a4f5ac0, 1, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(6, "\0\0\1C", 4, 0, NULL, 0)     = 4
fcntl(6, F_SETFL, O_RDWR)               = 0
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
poll([{fd=6, events=POLLOUT}], 1, 60000) = 1 ([{fd=6, revents=POLLOUT}])
recvfrom(6, 0x7fff8a4f5ac0, 1, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)

There are also a number of errors in the munge log:
2014-01-26 11:53:05 Info:      Decode retry #2 for client uid=0 gid=0
2014-01-26 11:53:05 Info:      Unable to send message: Broken pipe

I have run slurmctld and slurmdbd in debug mode and didn't notice anything abnormal. Since a lot of jobs are being killed by these segfaults, I would really appreciate any help or advice on solving the problem. Could it be related to OpenSSL 1.0.1?
Let me know if you need additional info.

Cheers,
Barbara
