I have been struggling with this problem for some weeks now. Once the two pairs of threads in the two processes are in the deadlock, it is easy enough to see how it happened and why they are stuck waiting on each other. The root cause is that both slurmd and slurmstepd use the "job accounting gather plugin" and its associated common code within their own processes, but in certain cases need to transfer data between the processes. The functions within the job accounting gather code each lock a common mutex on entry. But since slurmd and slurmstepd are separate processes, each process has its own copy of this mutex. So when a thread in slurmd invokes a job accounting gather function that needs to communicate with slurmstepd, and the associated thread in slurmstepd also has to invoke a job accounting gather function, two separate mutexes must be locked. This creates an ordering problem, and in the case where one of the threads in slurmd is doing a "getinfo" and the other a "setinfo", the current arrangement of the code ends up locking the two processes' mutexes in opposite order, leading to a race condition and deadlock.
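To make the pattern concrete, here is a minimal sketch of the problem. It is not SLURM's actual code: my_getinfo()/my_setinfo() are made-up stand-ins for the jobacct_gather_g_getinfo()/jobacct_gather_g_setinfo() routines in "slurm_jobacct_gather.c".

#include <pthread.h>
#include <unistd.h>

/* Each process (slurmd and slurmstepd) has its OWN copy of this mutex,
 * so locking it serializes threads within one process only.  Error
 * handling is elided for brevity. */
static pthread_mutex_t context_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reader side: lock, then block until the peer process writes. */
void my_getinfo(int fd, void *buf, size_t len)
{
        pthread_mutex_lock(&context_lock);
        read(fd, buf, len);
        pthread_mutex_unlock(&context_lock);
}

/* Writer side: lock, then write to the peer process. */
void my_setinfo(int fd, const void *buf, size_t len)
{
        pthread_mutex_lock(&context_lock);
        write(fd, buf, len);
        pthread_mutex_unlock(&context_lock);
}

/* The deadlocked interleaving across the two processes:
 *
 *   slurmd thread A (step stat):      my_getinfo() holds slurmd's lock,
 *                                     blocked in read() waiting for D
 *   slurmd thread B (completion):     my_setinfo() waits on slurmd's
 *                                     lock behind A, so never writes
 *   slurmstepd thread C (completion): my_getinfo() holds slurmstepd's
 *                                     lock, blocked in read() waiting
 *                                     for B
 *   slurmstepd thread D (step stat):  my_setinfo() waits on slurmstepd's
 *                                     lock behind C, so never writes
 *
 * A waits for D, D waits for C, C waits for B, B waits for A. */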
When I first looked at the job accounting gather code, the mutex in question seemed to me to be unnecessary; my reasoning is in my original message, quoted at the end of this one. I therefore proposed that the mutex simply be eliminated, which would avoid the deadlock. In private email, Moe suggested that this mutex might in fact be needed to protect some structures inside the plugin itself, and Matthieu has shown that eliminating the mutex calls does not prevent a deadlock further down in the code. So that approach isn't the solution.

If I understand it correctly, Moe has proposed a change which delays the "getinfo" call in slurmd by adding a test and wait on the pipe to slurmstepd, to make sure that there is something to read before calling the code that locks the mutex and reads the data. This avoids the deadlock by ensuring that the slurmstepd side has already acquired its copy of the mutex and written the data (i.e., is not waiting on its side). It resolves the ordering problem by always having the sender lock its mutex first. This should fix the deadlock that we have already seen, but I am concerned that if the race on the mutex goes the other way, we could end up with both sides trying to write instead of trying to read, and still have a deadlock. If we can be sure that "safe_write" will always complete immediately (i.e., that it will never block on the write to the pipe) and return to release its side of the mutex, then this may not be a problem. (A sketch of the wait-before-locking idea appears at the end of this message.)

Again, if I understand correctly, Matthieu's change takes a different approach and solves the deadlock by eliminating the locking on the "setinfo" call. It does its own packing/unpacking of the data, and the calls that write the data over the pipe are made outside of the job account gather routines, so the mutex lock/unlock calls for this direction of data movement never happen. Since his system is the one where the original problem showed up, and with this change the problem has gone away, I am leaning toward this solution. My only comment is that eliminating the "jobacct_gather_g_setinfo" and "jobacct_gather_g_getinfo" calls in this case means that the job account gather plugin is not invoked, so any special handling for this case in the plugin is not possible (e.g., doing something different for AIX vs. Linux). But that probably isn't a big issue.
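To make that concrete, here is a rough sketch of the direction-splitting idea as I understand it. The helper names are hypothetical (the real change is in Matthieu's patch, attached to his message below); the point is that the completion data is serialized and moved over the pipe directly, so neither daemon enters the jobacct_gather routines, and neither copy of the mutex is taken, for this direction of transfer.

#include <stdint.h>
#include <unistd.h>

/* slurmd side of stepd_completion: write <length><bytes> straight down
 * the pipe.  No jobacct_gather routine, and thus no mutex, is
 * involved.  Short reads/writes on the pipe are ignored for brevity. */
static int send_acct_data(int fd, const void *data, uint32_t len)
{
        if (write(fd, &len, sizeof(len)) != sizeof(len))
                return -1;
        if (write(fd, data, len) != (ssize_t) len)
                return -1;
        return 0;
}

/* slurmstepd side: read the length and payload directly, then merge
 * the data into the local accounting state.  slurmd holds no
 * accounting lock while this runs, so the cross-process ordering cycle
 * cannot form. */
static int recv_acct_data(int fd, void *buf, uint32_t buflen)
{
        uint32_t len;

        if (read(fd, &len, sizeof(len)) != sizeof(len) || len > buflen)
                return -1;
        if (read(fd, buf, len) != (ssize_t) len)
                return -1;
        return (int) len;
}

With this split, a thread blocked in "getinfo" only ever waits on a "setinfo" writer in the other process, and writers only wait on other writers, so the circular wait cannot arise (assuming, again, that writes to the pipe complete promptly).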
-Don Albert-

[email protected] wrote on 01/24/2012 03:52:10 PM:

> From: Moe Jette <[email protected]>
> To: [email protected],
> Date: 01/24/2012 03:52 PM
> Subject: Re: [slurm-dev] Slurmd/slurmstepd race condition and
> deadlock on "g_jobacct_gather_context_lock"
> Sent by: [email protected]
>
> Hi Matthieu,
>
> I had the same concern and approached the problem somewhat differently
> than you by waiting until there is data ready to read before calling
> jobacct_gather_g_getinfo(), see:
> https://github.com/SchedMD/slurm/commit/4c0eea7b8c20ccb1cacad51838a1ea8257cc637d
>
> I was hoping to avoid making this change to SLURM version 2.3, but if
> you have also seen the problem then perhaps either your patch or mine
> should be made to v2.3. I need to look at your patch more closely, but
> it seems a better approach than my patch.
>
> Moe
>
> Quoting Matthieu Hautreux <[email protected]>:
>
> > Don,
> >
> > having encountered the exact same problem... and applying this patch,
> > it does not work as expected.
> >
> > Indeed, there is another layer of locking in
> > jobacct_common-{setinfo,getinfo} that triggers the same exact
> > deadlock.
> >
> > I think that the reason you did not see the problem is the size of
> > your jobs. The cause of the deadlock is the use of both
> > jobacct_gather_g_setinfo and jobacct_gather_g_getinfo over a pipe in
> > both slurmd and slurmstepd in their communications. As long as the
> > two are not used simultaneously in each of the daemons, you do not
> > have the problem. jobacct_gather_g_getinfo over a pipe in slurmd is
> > mostly used to collect statistics from the slurmstepds (like when
> > enforcing memory limits). jobacct_gather_g_setinfo over a pipe in
> > slurmd is used in stepd_completion when slurmd receives a completion
> > message from another node of the same job and pushes the accounting
> > data to the related local slurmstepd for aggregation. This does not
> > occur with small jobs.
> >
> > I think that it is harmful to use both setinfo and getinfo over a
> > pipe when communicating between slurmd and slurmstepd. I modified the
> > code to ensure that slurmd only does getinfo calls and slurmstepd
> > only does setinfo calls. I have installed a patched version of slurm
> > that does not reproduce the problem for now. You will find it
> > enclosed. It does not require the patch that removes the locks.
> >
> > Note that before applying this patch on a node, the node must be
> > drained. Because the communication protocol is modified to avoid the
> > deadlock, both slurmd and slurmstepd must be aware of the
> > modification. A hot swap of the rpms could result in an old
> > slurmstepd trying to reply to a new slurmd in the old way, resulting
> > in a failure in job termination.
> >
> > Moe, Danny, what is your view concerning this deadlock scenario? Do
> > you think that this patch is sufficient?
> >
> > Thanks in advance for your feedback.
> >
> > Regards,
> > Matthieu
> >
> >
> > 2011/12/16 <[email protected]>:
> >> I tried to reproduce the deadlock described in the previous message on my
> >> local system by increasing the frequency at which the REQUEST_PING was
> >> issued and the frequency of calls to the job accounting routines. I did
> >> this by setting "SlurmdTimeout=5" to increase the rate of PINGs to every
> >> 2-3 seconds, and also setting "--acctg-freq=1" in the job submissions
> >> (see script below) to increase attempts to get job accounting statistics.
> >> I also added a couple of "info" displays where the PING and STEP_COMPLETE
> >> requests were processed so I could see them in the log file without
> >> turning on all the detailed traces.
> >>
> >> I then ran this script:
> >>
> >> #!/bin/bash
> >> count=0
> >> until [ $count = 10000 ]
> >> do
> >>     srun --acctg-freq=1 -N4 -n4 sleep 8 &
> >>     srun --acctg-freq=1 -N4 -n4 sleep 7 &
> >>     sleep 5
> >>     count=`expr $count + 1`
> >> done
> >>
> >> I had to play around with the sleep times a bit, because if I submitted
> >> jobs too quickly, I would not get the PING requests to the node because
> >> it was *too* busy. But with the above script, I could get the PING to
> >> occur fairly close to the step complete requests.
> >> For example:
> >>
> >> [2011-12-16T11:46:44] launch task 22945.0 request from
> >> [email protected] (port 37606)
> >> [2011-12-16T11:46:44] launch task 22944.0 request from
> >> [email protected] (port 37094)
> >> [2011-12-16T11:46:46] Ping request received
> >> [2011-12-16T11:46:46] Step complete request received
> >> [2011-12-16T11:46:46] Step complete request received
> >> [2011-12-16T11:46:46] Step complete request received
> >> [2011-12-16T11:46:47] Step complete request received
> >> [2011-12-16T11:46:47] Step complete request received
> >> [2011-12-16T11:46:47] Step complete request received
> >> [2011-12-16T11:46:47] [22943.0] done with job
> >> [2011-12-16T11:46:47] [22942.0] done with job
> >>
> >> I ran more than 10000 iterations of jobs and never encountered the
> >> deadlock. This is possibly because I only have a four-node cluster to
> >> try it on, so the probability of hitting the deadlock is low.
> >>
> >> As I described below, I believe the deadlock is caused by excessive
> >> locking/unlocking of the "g_jobacct_gather_context_lock" when calling
> >> the job accounting plugin routines. I removed all the lock/unlock pairs
> >> around the calls and recompiled. I then ran the same set of jobs again,
> >> to see if removing the lock/unlock calls caused any problems. None were
> >> apparent.
> >>
> >> Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that
> >> removes the lock calls.
> >>
> >> -Don Albert-
> >>
> >> [email protected] wrote on 12/07/2011 03:06:44 PM:
> >>
> >>> From: [email protected]
> >>> To: [email protected],
> >>> Date: 12/07/2011 03:07 PM
> >>> Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock
> >>> on "g_jobacct_gather_context_lock"
> >>> Sent by: [email protected]
> >>>
> >>> I have a bug report which describes a node in the "down" state with
> >>> a reason of "Not responding". The node turns out to have a
> >>> deadlock between two pairs of threads, one half of each pair being
> >>> in slurmd, and the other half in slurmstepd. The original stack
> >>> traces and analysis by the site support, taken from the bug report,
> >>> are attached below.
> >>>
> >>> Briefly, the deadlock occurs because slurmd has received two
> >>> requests, a REQUEST_PING and a REQUEST_STEP_COMPLETE, and has
> >>> started a thread to handle each of them. The REQUEST_PING code
> >>> takes the opportunity to perform an "_enforce_job_mem_limit" call,
> >>> which requires getting some job accounting information. To get this
> >>> information, it opens a socket to slurmstepd and sends a
> >>> REQUEST_STEP_STAT, and then eventually calls
> >>> "jobacct_gather_g_getinfo" to receive the response. This routine
> >>> locks the "g_jobacct_gather_context_lock" before eventually calling
> >>> "safe_read" to read from the socket.
> >>>
> >>> Meanwhile, the REQUEST_STEP_COMPLETE code opens a socket to
> >>> slurmstepd and sends a REQUEST_STEP_COMPLETION request, then
> >>> eventually calls "jobacct_gather_g_setinfo". This would lead to
> >>> sending the data via "safe_write", except that the thread attempts
> >>> to lock the "g_jobacct_gather_context_lock" and waits behind the
> >>> REQUEST_PING thread.
> >>>
> >>> Since both of the slurmd threads issued their requests to slurmstepd
> >>> over their respective sockets before attempting to lock the above
> >>> lock, slurmstepd has kicked off two threads to process them.
> >>> Each of these threads needs to call some job_account_gather
> >>> routines, which attempt to lock "g_jobacct_gather_context_lock".
> >>> But since slurmstepd is a separate process from slurmd, this copy
> >>> of the lock is not the same as the slurmd one. One of the threads
> >>> in slurmstepd gets the lock first and the other waits. In the
> >>> deadlock case, we have the one that needs to respond to the
> >>> REQUEST_STEP_STAT stuck behind the lock, while the other slurmstepd
> >>> thread is attempting to receive the REQUEST_STEP_COMPLETE
> >>> information, and can't because its corresponding slurmd thread is
> >>> waiting on the slurmd copy of the "g_jobacct_gather_context_lock"
> >>> lock.
> >>>
> >>> Looking at the code in "slurm_jobacct_gather.c", where all the
> >>> "jobacct_gather_g_<xxx>" routines reside, it seems that each of
> >>> these routines locks the "g_jobacct_gather_context_lock" lock
> >>> before proceeding and calling the job_account_gather plugin
> >>> routines. This lock ostensibly protects the
> >>> "g_jobacct_gather_context" data structure, but it seems to me that
> >>> there isn't anything there that needs protecting, except for the
> >>> initialization of the structure. The routines all call
> >>> "_slurm_jobacct_gather_init" first, which checks for the existence
> >>> of the structure and, if necessary, creates it and populates the
> >>> "ops" table by causing the loading of the appropriate plugin. Once
> >>> this is done, there doesn't seem to me to be any reason to lock
> >>> this particular mutex. Removing the lock/unlock calls from these
> >>> routines would prevent the deadlock described above.
> >>>
> >>> But not being the author of this code, I am not sure if removing
> >>> this locking/unlocking is really safe, or might lead to other
> >>> problems.
> >>>
> >>> -Don Albert-
> >>>
> >>> [attachment "analysis.txt" deleted by Don Albert/US/BULL]
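P.S. For reference, the core of the approach Moe describes above (wait until the pipe is readable before calling the routine that takes the lock) can be sketched as follows. wait_for_data() is a hypothetical helper; the real change is in the commit linked in his message.

#include <errno.h>
#include <poll.h>

/* Block until fd has data to read, retrying if interrupted by a
 * signal.  Returns 0 when readable, -1 on error. */
static int wait_for_data(int fd)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        for (;;) {
                int rc = poll(&pfd, 1, -1);

                if (rc == 1)
                        return (pfd.revents & POLLIN) ? 0 : -1;
                if (rc == -1 && errno != EINTR)
                        return -1;
                /* rc == 0 cannot happen with an infinite timeout, and
                 * EINTR simply retries. */
        }
}

/* slurmd's step-stat path then becomes, schematically:
 *
 *     if (wait_for_data(stepd_fd) == 0)
 *             jobacct_gather_g_getinfo(...);
 *
 * By the time slurmd locks its copy of the mutex, slurmstepd has
 * already locked its own copy and written the reply, so the safe_read
 * inside the locked region cannot block indefinitely. */

As noted above, this is safe provided "safe_write" never blocks, which should hold as long as the reply fits within the pipe's buffer (typically 64 KB on Linux).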
