I tried to reproduce the deadlock described in the previous message on my 
local system by increasing the frequency at which REQUEST_PING messages 
were issued and the frequency of calls to the job accounting routines. I 
did this by setting "SlurmdTimeout=5" to increase the PING rate to every 
2-3 seconds, and by setting "--acctg-freq=1" in the job submissions (see 
script below) to increase the rate of job accounting statistics requests. 
I also added a couple of "info" displays where the PING and STEP_COMPLETE 
requests are processed, so I could see them in the log file without 
turning on all the detailed traces.
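For reference, the configuration side of this setup amounts to a one-line 
change (a sketch; every other slurm.conf value was left at its site 
default):

```
# slurm.conf fragment: a short SlurmdTimeout makes the controller
# ping each slurmd every 2-3 seconds instead of the default interval
SlurmdTimeout=5
```

The "--acctg-freq=1" part is per-job and appears in the script below.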

I then ran this script:

#!/bin/bash
count=0
until [ "$count" -eq 10000 ]
do
   srun --acctg-freq=1 -N4 -n4 sleep 8 &
   srun --acctg-freq=1 -N4 -n4 sleep 7 &
   sleep 5
   count=$((count + 1))
done

I had to play around with the sleep times a bit, because if I submitted 
jobs too quickly, I would not get the PING requests to the node because 
it was *too* busy. But with the above script, I could get the PINGs to 
occur fairly close to the step complete requests. For example:

[2011-12-16T11:46:44] launch task 22945.0 request from 
[email protected] (port 37606)
[2011-12-16T11:46:44] launch task 22944.0 request from 
[email protected] (port 37094)
[2011-12-16T11:46:46] Ping request received
[2011-12-16T11:46:46] Step complete request received
[2011-12-16T11:46:46] Step complete request received
[2011-12-16T11:46:46] Step complete request received
[2011-12-16T11:46:47] Step complete request received
[2011-12-16T11:46:47] Step complete request received
[2011-12-16T11:46:47] Step complete request received
[2011-12-16T11:46:47] [22943.0] done with job
[2011-12-16T11:46:47] [22942.0] done with job

I ran more than 10000 iterations of jobs and never encountered the 
deadlock. This may be because I only have a four-node cluster to try it 
on, so the probability of hitting the deadlock is low.

As I described below, I believe the deadlock is caused by excessive 
locking/unlocking of the "g_jobacct_gather_context_lock" when calling the 
job accounting plugin routines. I removed all the lock/unlock pairs 
around those calls and recompiled. I then ran the same set of jobs again, 
to see if removing the lock/unlock calls caused any problems. None were 
apparent.

Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that 
removes the lock calls.

  -Don Albert-

[email protected] wrote on 12/07/2011 03:06:44 PM:

> From: [email protected]
> To: [email protected], 
> Date: 12/07/2011 03:07 PM
> Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock 
> on "g_jobacct_gather_context_lock"
> Sent by: [email protected]
> 
> I have a bug report which describes a node in the "down" state with 
> a reason of "Not responding".  The node turns out to have a 
> deadlock between two pairs of threads, one half of each pair being 
> in slurmd and the other half in slurmstepd.  The original stack 
> traces and analysis by the site support staff, taken from the bug 
> report, are attached below. 
> 
> Briefly, the deadlock occurs because slurmd has received two 
> requests:  a REQUEST_PING, and a REQUEST_STEP_COMPLETE, and has 
> started a thread to handle each of them.   The REQUEST_PING code 
> takes the opportunity to perform an "_enforce_job_mem_limit" call, 
> which requires getting some job accounting information.  To get this
> information, it opens a socket to slurmstepd and sends a 
> REQUEST_STEP_STAT,  and then eventually calls 
> "jobacct_gather_g_getinfo" to receive the response.  This routine 
> locks the "g_jobacct_gather_context_lock" before eventually calling 
> "safe_read" to read from the socket. 
> 
> Meanwhile the REQUEST_STEP_COMPLETE code opens a socket to 
> slurmstepd and sends a REQUEST_STEP_COMPLETION request,  then 
> eventually calls "jobacct_gather_g_setinfo".  This would lead to 
> sending the data via "safe_write" except that the thread attempts to
> lock the "g_jobacct_gather_context_lock" and waits behind the 
> REQUEST_PING thread. 
> 
> Since both of the slurmd threads issued their requests to slurmstepd 
> over their respective sockets before attempting to lock the above 
> lock,  slurmstepd has kicked off two threads to process them.  Each 
> of these threads needs to call some job_account_gather routines, 
> which attempt to lock "g_jobacct_gather_context_lock".   But since 
> slurmstepd is a separate process from slurmd,  this copy of the lock
> is not the same as the slurmd one.   One of the threads in 
> slurmstepd gets the lock first and the other waits.  In the deadlock
> case, we have the one that needs to respond to the REQUEST_STEP_STAT
> stuck behind the lock,  while the other slurmstepd thread is 
> attempting to receive the REQUEST_STEP_COMPLETE information,  and 
> can't because its corresponding slurmd thread is waiting on the 
> slurmd copy of the  "g_jobacct_gather_context_lock" lock. 
> 
> Looking at the code in "slurm_jobacct_gather.c",  where all the 
> "jobacct_gather_g_<xxx>" routines reside,  it seems that each of 
> these routines locks the "g_jobacct_gather_context_lock" lock before
> proceeding and calling the job_account_gather plugin routines. 
> This lock ostensibly protects the "g_jobacct_gather_context" data 
> structure,  but it seems to me that there isn't anything there that 
> needs protecting, except for the initialization of the structure. 
> The routines all call "_slurm_jobacct_gather_init" first,  which 
> checks for the existence of the structure, and if necessary creates 
> it and populates the "ops" table by causing the loading of the 
> appropriate plugin.  Once this is done,  there doesn't seem to me to
> be any reason to lock this particular mutex.  Removing the lock/
> unlock calls from these routines would prevent the deadlock 
> described above. 
> 
> But not being the author of this code,  I am not sure if removing 
> this locking/unlocking is really safe, or might lead to other problems. 
> 
>   -Don Albert- 
> 
> 
> 
> [attachment "analysis.txt" deleted by Don Albert/US/BULL] 

Attachment: jobacct_gather_lock.patch
Description: Binary data
