I tried to reproduce the deadlock described in the previous message on my local system by increasing the frequency at which REQUEST_PING was issued and the frequency of calls to the job accounting routines. I did this by setting "SlurmdTimeout=5", which raised the rate of PINGs to one every 2-3 seconds, and by setting "--acctg-freq=1" on the job submissions (see the script below) to increase the number of attempts to get job accounting statistics. I also added a couple of "info" displays where the PING and STEP_COMPLETE requests are processed, so I could see them in the log file without turning on all the detailed traces.
I then ran this script:

    #!/bin/bash
    count=0
    until [ $count = 10000 ]
    do
        srun --acctg-freq=1 -N4 -n4 sleep 8 &
        srun --acctg-freq=1 -N4 -n4 sleep 7 &
        sleep 5
        count=`expr $count + 1`
    done

I had to play around with the sleep times a bit, because if I submitted jobs too quickly, the PING requests never reached the node: it was *too* busy. But with the above script, I could get the PING to occur fairly close to the step complete requests. For example:

    [2011-12-16T11:46:44] launch task 22945.0 request from [email protected] (port 37606)
    [2011-12-16T11:46:44] launch task 22944.0 request from [email protected] (port 37094)
    [2011-12-16T11:46:46] Ping request received
    [2011-12-16T11:46:46] Step complete request received
    [2011-12-16T11:46:46] Step complete request received
    [2011-12-16T11:46:46] Step complete request received
    [2011-12-16T11:46:47] Step complete request received
    [2011-12-16T11:46:47] Step complete request received
    [2011-12-16T11:46:47] Step complete request received
    [2011-12-16T11:46:47] [22943.0] done with job
    [2011-12-16T11:46:47] [22942.0] done with job

I ran more than 10000 iterations of jobs and never encountered the deadlock. This is possibly because I only have a four-node cluster to try it on, so the probability of hitting the deadlock is low.

As I described below, I believe the deadlock is caused by excessive locking/unlocking of the "g_jobacct_gather_context_lock" when calling the job accounting plugin routines. I removed all the lock/unlock pairs around the calls and recompiled. I then ran the same set of jobs again to see whether removing the lock/unlock calls caused any problems; none were apparent. Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that removes the lock calls.
-Don Albert-

[email protected] wrote on 12/07/2011 03:06:44 PM:

> From: [email protected]
> To: [email protected],
> Date: 12/07/2011 03:07 PM
> Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock on "g_jobacct_gather_context_lock"
> Sent by: [email protected]
>
> I have a bug report which describes a node in the "down" state with a reason of "Not responding". The node turns out to have a deadlock between two pairs of threads, one half of each pair being in slurmd, and the other half in slurmstepd. The original stack traces and analysis by the site support, taken from the bug report, are attached below.
>
> Briefly, the deadlock occurs because slurmd has received two requests, a REQUEST_PING and a REQUEST_STEP_COMPLETE, and has started a thread to handle each of them. The REQUEST_PING code takes the opportunity to perform an "_enforce_job_mem_limit" call, which requires getting some job accounting information. To get this information, it opens a socket to slurmstepd and sends a REQUEST_STEP_STAT, and then eventually calls "jobacct_gather_g_getinfo" to receive the response. This routine locks the "g_jobacct_gather_context_lock" before eventually calling "safe_read" to read from the socket.
>
> Meanwhile, the REQUEST_STEP_COMPLETE code opens a socket to slurmstepd and sends a REQUEST_STEP_COMPLETION request, then eventually calls "jobacct_gather_g_setinfo". This would lead to sending the data via "safe_write", except that the thread attempts to lock the "g_jobacct_gather_context_lock" and waits behind the REQUEST_PING thread.
>
> Since both of the slurmd threads issued their requests to slurmstepd over their respective sockets before attempting to take that lock, slurmstepd has kicked off two threads to process them. Each of these threads needs to call some job_account_gather routines, which attempt to lock "g_jobacct_gather_context_lock".
> But since slurmstepd is a separate process from slurmd, this copy of the lock is not the same as the slurmd one. One of the threads in slurmstepd gets the lock first and the other waits. In the deadlock case, the thread that needs to respond to the REQUEST_STEP_STAT is stuck behind the lock, while the other slurmstepd thread is attempting to receive the REQUEST_STEP_COMPLETE information and cannot, because its corresponding slurmd thread is waiting on the slurmd copy of the "g_jobacct_gather_context_lock".
>
> Looking at the code in "slurm_jobacct_gather.c", where all the "jobacct_gather_g_<xxx>" routines reside, each of these routines locks the "g_jobacct_gather_context_lock" before proceeding to call the job_account_gather plugin routines. This lock ostensibly protects the "g_jobacct_gather_context" data structure, but it seems to me that nothing there needs protecting except the initialization of the structure. The routines all call "_slurm_jobacct_gather_init" first, which checks for the existence of the structure and, if necessary, creates it and populates the "ops" table by loading the appropriate plugin. Once this is done, there does not seem to be any reason to hold this particular mutex. Removing the lock/unlock calls from these routines would prevent the deadlock described above.
>
> But not being the author of this code, I am not sure whether removing this locking/unlocking is really safe, or might lead to other problems.
>
> -Don Albert-
>
> [attachment "analysis.txt" deleted by Don Albert/US/BULL]
jobacct_gather_lock.patch
Description: Binary data
