I have a bug report describing a node in the "down" state with a reason of
"Not responding". The node turns out to have a deadlock between two pairs of
threads, one thread of each pair in slurmd and the other in slurmstepd. The
original stack traces and analysis from the site support, taken from the bug
report, are attached below.
Briefly, the deadlock occurs because slurmd has received two requests, a
REQUEST_PING and a REQUEST_STEP_COMPLETE, and has started a thread to handle
each of them. The REQUEST_PING code takes the opportunity to perform an
"_enforce_job_mem_limit" call, which requires gathering some job accounting
information. To get this information, it opens a socket to slurmstepd, sends
a REQUEST_STEP_STAT, and eventually calls "jobacct_gather_g_getinfo" to
receive the response. This routine locks "g_jobacct_gather_context_lock"
before calling "safe_read" to read from the socket.
Meanwhile the REQUEST_STEP_COMPLETE code opens a socket to slurmstepd, sends
a REQUEST_STEP_COMPLETION, and eventually calls "jobacct_gather_g_setinfo".
This would lead to sending the data via "safe_write", except that the thread
first attempts to lock "g_jobacct_gather_context_lock" and so waits behind
the REQUEST_PING thread.
Since both slurmd threads issued their requests to slurmstepd over their
respective sockets before attempting to take the lock, slurmstepd has kicked
off two threads to process them. Each of these threads needs to call some
job_account_gather routines, which attempt to lock
"g_jobacct_gather_context_lock". But since slurmstepd is a separate process
from slurmd, its copy of the lock is not the same as slurmd's. One of the
slurmstepd threads gets the lock first and the other waits. In the deadlock
case, the thread that needs to respond to the REQUEST_STEP_STAT is stuck
behind the lock, while the other slurmstepd thread is attempting to receive
the REQUEST_STEP_COMPLETE information and cannot, because its corresponding
slurmd thread is waiting on the slurmd copy of
"g_jobacct_gather_context_lock".
Looking at the code in "slurm_jobacct_gather.c", where all the
"jobacct_gather_g_<xxx>" routines reside, each of these routines locks
"g_jobacct_gather_context_lock" before calling into the job_account_gather
plugin routines. This lock ostensibly protects the
"g_jobacct_gather_context" data structure, but it seems to me that nothing
there needs protecting except the initialization of the structure. The
routines all call "_slurm_jobacct_gather_init" first, which checks for the
existence of the structure and, if necessary, creates it and populates the
"ops" table by loading the appropriate plugin. Once this is done, there
doesn't seem to be any reason to hold this particular mutex. Removing the
lock/unlock calls from these routines would prevent the deadlock described
above.
But not being the author of this code, I am not sure whether removing this
locking is really safe, or whether it might lead to other problems.
-Don Albert-
Detailed description
The failing node is in the "down*" state with a reason of "Not responding":
[root@node3066 ~] # sinfo -V
slurm 2.2.7
[root@node3066 ~] # sinfo -Rln node3066
Tue Oct 25 16:44:52 2011
REASON USER TIMESTAMP STATE NODELIST
Not responding root(0) 2011-10-21T01:48:14 down* node3066
>> On the node, the slurmd has reached its maximum number of threads. Slurmd
>> is trying in _rpc_step_complete to write stats data to the step in thread
>> 130, but needs the lock that is owned by thread 131, which is waiting in
>> _rpc_ping for stepd_stat_jobacct to complete.
[root@node3066 ~] # ps -f -p 25398
UID PID PPID C STIME TTY TIME CMD
root 25398 1 0 Oct04 ? 00:01:45 /usr/sbin/slurmd
[root@node3066 ~] #
[root@node3066 ~] # gdb -p 25398
...
(gdb) thread apply 131 130 bt
Thread 131 (Thread 0x2b2e3d6aa700 (LWP 41978)):
#0  0x0000003fa520e50d in read () from /lib64/libpthread.so.0
#1  0x0000000000497d86 in read (jobacct=<value optimized out>, type=<value optimized out>, data=0x2b2e3d6a9bbc) at /usr/include/bits/unistd.h:45
#2  jobacct_common_getinfo (jobacct=<value optimized out>, type=<value optimized out>, data=0x2b2e3d6a9bbc) at jobacct_common.c:209
#3  0x000000000049b21b in jobacct_gather_g_getinfo (jobacct=0x2b2e74001a98, type=JOBACCT_DATA_PIPE, data=0x2b2e3d6a9bbc) at slurm_jobacct_gather.c:365
#4  0x00000000004ad87f in stepd_stat_jobacct (fd=10, sent=<value optimized out>, resp=0x2b2e740014d8) at stepd_api.c:915
#5  0x000000000042605f in _enforce_job_mem_limit () at req.c:1691
#6  0x000000000042652c in _rpc_ping (msg=0x2b2e740028f8) at req.c:1787
#7  0x00000000004281f8 in slurmd_req (msg=0x2b2e740028f8) at req.c:320
#8  0x000000000041ec68 in _service_connection (arg=<value optimized out>) at slurmd.c:538
#9  0x0000003fa52077e1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003fa4ee151d in clone () from /lib64/libc.so.6
Thread 130 (Thread 0x2b2e64100700 (LWP 42012)):
#0  0x0000003fa520e034 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003fa5209345 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x0000003fa5209217 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000049b2fb in jobacct_gather_g_setinfo (jobacct=0x2b2e44002d38, type=JOBACCT_DATA_PIPE, data=0x2b2e640ffc2c) at slurm_jobacct_gather.c:347
#4  0x00000000004adbc3 in stepd_completion (fd=12, sent=<value optimized out>) at stepd_api.c:884
#5  0x0000000000424f0b in _rpc_step_complete (msg=0x2b2e440029c8) at req.c:2019
#6  0x0000000000427b02 in slurmd_req (msg=0x2b2e440029c8) at req.c:339
#7  0x000000000041ec68 in _service_connection (arg=<value optimized out>) at slurmd.c:538
#8  0x0000003fa52077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003fa4ee151d in clone () from /lib64/libc.so.6
...
[root@node3066 ~] # lsof -p 25398 | egrep -w "10u|12u"
slurmd 25398 root 10u unix 0xffff8804ecf30c80 0t0 5501148 socket
slurmd 25398 root 12u unix 0xffff880c955dc980 0t0 5501147 socket
[root@node3066 ~] #
>> On the node, the completing slurmstepd is waiting for stats data in thread
>> 53 in order to finish the completion stage, but owns the lock required by
>> thread 52, which is waiting in _handle_stat_jobacct. As a result, thread 52
>> of slurmstepd cannot answer thread 131 of slurmd, which cannot release the
>> lock required by thread 130 of slurmd, which is needed to send the data to
>> thread 53 of slurmstepd... a deadlock situation.
[root@node3066 ~] # ps -f -p 39636
UID PID PPID C STIME TTY TIME CMD
root 39636 1 0 Oct20 ? 00:00:07 slurmstepd: [1005797.2]
[root@node3066 ~] # gdb -p 39636
...
(gdb) thread apply 52 53 bt
Thread 52 (Thread 0x2b762c302700 (LWP 42017)):
#0  0x0000003fa520e034 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003fa5209345 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x0000003fa5209217 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000049dedb in _slurm_jobacct_gather_init () at slurm_jobacct_gather.c:261
#4  0x000000000049f090 in jobacct_gather_g_create (jobacct_id=0x0) at slurm_jobacct_gather.c:314
#5  0x000000000042bd57 in _handle_stat_jobacct (fd=36, job=0x112d468, uid=0, gid=<value optimized out>) at req.c:1366
#6  _handle_request (fd=36, job=0x112d468, uid=0, gid=<value optimized out>) at req.c:528
#7  0x000000000042c9dc in _handle_accept (arg=0x0) at req.c:422
#8  0x0000003fa52077e1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003fa4ee151d in clone () from /lib64/libc.so.6
Thread 53 (Thread 0x2b762c201700 (LWP 42016)):
#0  0x0000003fa520e50d in read () from /lib64/libpthread.so.0
#1  0x000000000049b966 in read (jobacct=<value optimized out>, type=<value optimized out>, data=0x2b762c200d5c) at /usr/include/bits/unistd.h:45
#2  jobacct_common_getinfo (jobacct=<value optimized out>, type=<value optimized out>, data=0x2b762c200d5c) at jobacct_common.c:209
#3  0x000000000049edfb in jobacct_gather_g_getinfo (jobacct=0x2b7634004948, type=JOBACCT_DATA_PIPE, data=0x2b762c200d5c) at slurm_jobacct_gather.c:365
#4  0x000000000042aaec in _handle_completion (fd=35, job=<value optimized out>, uid=<value optimized out>) at req.c:1293
#5  0x000000000042b9f9 in _handle_request (fd=35, job=0x112d468, uid=0, gid=<value optimized out>) at req.c:520
#6  0x000000000042c9dc in _handle_accept (arg=0x0) at req.c:422
#7  0x0000003fa52077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003fa4ee151d in clone () from /lib64/libc.so.6
...
[root@node3066 ~] # ls -l /proc/39636/fd/{35,36}
lrwx------ 1 root root 64 Oct 25 14:55 /proc/39636/fd/35 -> socket:[5501150]
lrwx------ 1 root root 64 Oct 25 14:55 /proc/39636/fd/36 -> socket:[5501151]
[root@node3066 ~] #
With crash, we can confirm that the involved sockets are connected to each
other:
[root@node3066 ~] # crash
...
crash> net -s 25398 -R 10
PID: 25398 TASK: ffff88104c3f63c0 CPU: 7 COMMAND: "slurmd"
FD SOCKET SOCK FAMILY:TYPE SOURCE-PORT DESTINATION-PORT
10 ffff880508b17c40 ffff8804ecf30c80 UNIX:STREAM
crash> net -s 25398 -R 12
PID: 25398 TASK: ffff88104c3f63c0 CPU: 7 COMMAND: "slurmd"
FD SOCKET SOCK FAMILY:TYPE SOURCE-PORT DESTINATION-PORT
12 ffff88107b428400 ffff880c955dc980 UNIX:STREAM
crash> net -s 39636 -R 35
PID: 39636 TASK: ffff88031e6e16c0 CPU: 1 COMMAND: "slurmstepd"
FD SOCKET SOCK FAMILY:TYPE SOURCE-PORT DESTINATION-PORT
35 ffff8809f8fcfd00 ffff880d6680f3c0 UNIX:STREAM
crash> net -s 39636 -R 36
PID: 39636 TASK: ffff88031e6e16c0 CPU: 1 COMMAND: "slurmstepd"
FD SOCKET SOCK FAMILY:TYPE SOURCE-PORT DESTINATION-PORT
36 ffff8809f8fcf780 ffff8804ecf30980 UNIX:STREAM
crash> net -S 25398 -R 10 | grep pprev
pprev = 0xffff8804ecf30980
pprev = 0xffff8804ecf30980
pprev = 0x0
crash> net -S 25398 -R 12 | grep pprev
pprev = 0xffff880d6680f3c0
pprev = 0xffff880d6680f3c0
pprev = 0x0
crash>
12/35 and 10/36 are the two pairs of FDs involved in the deadlock.