I have been struggling with this problem for some weeks now.  Once the
two pairs of threads in the two processes are deadlocked, it is easy
enough to see how it happened and why they are stuck waiting on each
other.  The root cause is that both slurmd and slurmstepd use the "job
accounting gather plugin" and its associated common code within their
own processes, but in certain cases need to transfer data between the
processes.  The functions within the job accounting gather code lock a
mutex on entry.  But since slurmd and slurmstepd are separate
processes, each process has its own copy of this mutex.  So when a
thread in slurmd invokes a job accounting gather function that needs
to communicate with slurmstepd, and the associated thread in
slurmstepd also has to invoke a job accounting gather function, two
separate mutexes must be locked.  This creates an ordering problem,
and in the case where one of the threads in slurmd is doing a
"getinfo" and the other a "setinfo", the current arrangement of the
code ends up locking the mutexes in opposite order, leading to a race
condition and deadlock.

When I first looked at the job accounting gather code, the mutex in
question seemed to me to be unnecessary.  You can see my reasoning in
my original message, quoted at the end of this one.  I therefore
proposed that the mutex simply be eliminated, which would avoid the
deadlock.  In private email, Moe suggested that this mutex might in
fact be needed to protect some structures inside the plugin itself,
and Matthieu has shown that eliminating the mutex calls does not
prevent a deadlock further down in the code.  So that approach isn't
the solution.

If I understand it correctly, Moe has proposed a change which delays
the "getinfo" call in slurmd by adding a test-and-wait on the pipe
from slurmstepd, to make sure that there is something to read before
calling the code that locks the mutex and reads the data.  This avoids
the deadlock by ensuring that the slurmstepd side has successfully
obtained its copy of the mutex and written the data (i.e., is not
waiting on its side).  It resolves the ordering problem by always
having the sender lock its mutex first.  This should fix the deadlock
that we have already seen, but I am concerned that if the race on the
mutex goes the other way, we could end up with both sides trying to
write instead of one trying to read, and still have a deadlock.  If we
can be sure that "safe_write" will always complete immediately (i.e.,
that it will never block on the write to the pipe) and return to
unlock its side of the mutex, then this may not be a problem.
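
The "wait until there is data before locking and reading" idea can be
sketched as follows.  This is only an illustration of the pattern, not
Slurm's actual code: a plain pipe stands in for the slurmd/slurmstepd
channel, and select() stands in for the readiness test added before
the locked read path.

```python
import os
import select

r_fd, w_fd = os.pipe()

# Simulate the slurmstepd side having already locked its own mutex,
# written its reply, and unlocked (the sender locks first).
os.write(w_fd, b"stats")

# slurmd side: do not touch the mutex until data is known to be there.
ready, _, _ = select.select([r_fd], [], [], 5.0)
if r_fd in ready:
    # Only now would slurmd lock its copy of the mutex and run the
    # getinfo path that does the safe_read.
    data = os.read(r_fd, 5)
    print(data)  # b'stats'

os.close(r_fd)
os.close(w_fd)
```

The remaining question raised above is whether the sender's write can
itself block once the pipe buffer is full; if it cannot, the sender
always gets to unlock and the ordering stays safe.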

Again, if I understand correctly, Matthieu's change takes a different
approach and solves the deadlock by eliminating the locking on the
"setinfo" call.  It does this by packing/unpacking the data and
writing it over the pipe outside of the job accounting gather
routines, so the mutex lock/unlock calls for this direction of data
movement are never made.  Since his system is the one where the
original problem showed up, and with this change the problem has gone
away, I am leaning toward this solution.  My only comment is that
eliminating the "jobacct_gather_g_setinfo" and
"jobacct_gather_g_getinfo" calls in this case means that the job
accounting gather plugin is not invoked, so any special handling for
this case in the plugin is not possible (e.g., doing something
different for AIX vs. Linux).  But that probably isn't a big issue.
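
The shape of that approach, as I understand it, is roughly the
following.  The record layout and field names here are invented for
illustration; Slurm's actual packing code is different.

```python
import os
import struct

def pack_stats(job_id, max_rss):
    # Pack the accounting record directly, without going through the
    # plugin's setinfo entry point (and therefore without its mutex).
    return struct.pack("!II", job_id, max_rss)

def unpack_stats(buf):
    return struct.unpack("!II", buf)

r_fd, w_fd = os.pipe()

# slurmd side: pack and write; no plugin lock is taken in this direction.
os.write(w_fd, pack_stats(1234, 4096))

# slurmstepd side: read and unpack, then aggregate locally.
job_id, max_rss = unpack_stats(os.read(r_fd, 8))
print(job_id, max_rss)  # 1234 4096

os.close(r_fd)
os.close(w_fd)
```

Since only one direction of data movement ever takes the mutex, the
opposite-order locking that caused the deadlock can no longer arise.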

  -Don Albert-

[email protected] wrote on 01/24/2012 03:52:10 PM:

> From: Moe Jette <[email protected]>
> To: [email protected], 
> Date: 01/24/2012 03:52 PM
> Subject: Re: [slurm-dev] Slurmd/slurmstepd race condition and 
> deadlock on "g_jobacct_gather_context_lock"
> Sent by: [email protected]
> 
> Hi Matthieu,
> 
> I had the same concern and approached the problem somewhat differently 
> than you by waiting until there is data ready to read before calling 
> jobacct_gather_g_getinfo(), see:
> https://github.com/SchedMD/slurm/commit/
> 4c0eea7b8c20ccb1cacad51838a1ea8257cc637d
> 
> I was hoping to avoid making this change to SLURM version 2.3, but if 
> you have also seen the problem then perhaps either your patch or mine 
> should be applied to v2.3. I need to look at your patch more closely, 
> but it seems a better approach than mine.
> 
> Moe
> 
> Quoting Matthieu Hautreux <[email protected]>:
> 
> > Don,
> >
> > having encountered the exact same problem... and having applied this
> > patch, I can report that it does not work as expected.
> >
> > Indeed, there is another layer of locking in
> > jobacct_common_{setinfo,getinfo} that triggers the exact same
> > deadlock.
> >
> > I think the reason you did not see the problem is the size of your
> > jobs.  The cause of the deadlock is the use of both
> > jobacct_gather_g_setinfo and jobacct_gather_g_getinfo over a pipe in
> > the communications between slurmd and slurmstepd.  As long as the two
> > are not used simultaneously in each daemon, the problem does not
> > appear.  jobacct_gather_g_getinfo over a pipe in slurmd is mostly
> > used to collect statistics from the slurmstepds (e.g., when enforcing
> > memory limits).  jobacct_gather_g_setinfo over a pipe in slurmd is
> > used in stepd_completion when slurmd receives a completion message
> > from another node of the same job and pushes the accounting data to
> > the related local slurmstepd for aggregation.  This does not occur
> > with small jobs.
> >
> > I think that it is harmful to use both setinfo and getinfo over a
> > pipe when communicating between slurmd and slurmstepd.  I modified
> > the code to ensure that slurmd only makes getinfo calls and
> > slurmstepd only makes setinfo calls.  I have installed a patched
> > version of SLURM that no longer reproduces the problem.  You will
> > find the patch enclosed.  It does not require the patch that removes
> > the locks.
> >
> > Note that before applying this patch on a node, the node must be
> > drained.  Indeed, since the communication protocol is modified to
> > avoid the deadlock, both slurmd and slurmstepd must be aware of the
> > modification.  A hot swap of the rpms could result in an old
> > slurmstepd trying to reply to a new slurmd in the old way, resulting
> > in a failure at job termination.
> >
> > Moe, Danny, what is your view concerning this deadlock scenario?  Do
> > you think that this patch is sufficient?
> >
> > Thanks in advance for your feedback
> >
> > Regards,
> > Matthieu
> >
> >
> >
> > 2011/12/16  <[email protected]>:
> >> I tried to reproduce the deadlock described in the previous message
> >> on my local system by increasing the frequency at which the
> >> REQUEST_PING was issued and the frequency of calls to the job
> >> accounting routines.  I did this by setting "SlurmdTimeout=5" to
> >> increase the rate of PINGs to every 2-3 seconds, and also setting
> >> "--acctg-freq=1" in the job submissions (see script below) to
> >> increase attempts to get job accounting statistics.  I also added a
> >> couple of "info" displays where the PING and STEP_COMPLETE requests
> >> were processed so I could see them in the log file without turning
> >> on all the detailed traces.
> >>
> >> I then ran this script:
> >>
> >> #!/bin/bash
> >> count=0
> >> until [ $count = 10000 ]
> >> do
> >>    srun --acctg-freq=1 -N4 -n4 sleep 8 &
> >>    srun --acctg-freq=1 -N4 -n4 sleep 7 &
> >>    sleep 5
> >>    count=`expr $count + 1`
> >> done
> >>
> >> I had to play around with the sleep times a bit, because if I
> >> submitted jobs too quickly, I would not get the PING requests to the
> >> node because it was *too* busy.  But with the above script, I could
> >> get the PING to occur fairly close to the step complete requests.
> >> For example:
> >>
> >> [2011-12-16T11:46:44] launch task 22945.0 request from
> >> [email protected] (port 37606)
> >> [2011-12-16T11:46:44] launch task 22944.0 request from
> >> [email protected] (port 37094)
> >> [2011-12-16T11:46:46] Ping request received
> >> [2011-12-16T11:46:46] Step complete request received
> >> [2011-12-16T11:46:46] Step complete request received
> >> [2011-12-16T11:46:46] Step complete request received
> >> [2011-12-16T11:46:47] Step complete request received
> >> [2011-12-16T11:46:47] Step complete request received
> >> [2011-12-16T11:46:47] Step complete request received
> >> [2011-12-16T11:46:47] [22943.0] done with job
> >> [2011-12-16T11:46:47] [22942.0] done with job
> >>
> >> I ran more than 10000 iterations of jobs and never encountered the
> >> deadlock.  This is possibly because I only have a four-node cluster
> >> to try it on, so the probability of hitting the deadlock is low.
> >>
> >> As I described below, I believe the deadlock is caused by excessive
> >> locking/unlocking of the "g_jobacct_gather_context_lock" when
> >> calling the job accounting plugin routines.  I removed all the
> >> lock/unlock pairs around the calls and recompiled.  I then ran the
> >> same set of jobs again, to see if removing the lock/unlock calls
> >> caused any problems.  None were apparent.
> >>
> >> Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that
> >> removes the lock calls.
> >>
> >>   -Don Albert-
> >>
> >>
> >>
> >>
> >>
> >>
> >> [email protected] wrote on 12/07/2011 03:06:44 PM:
> >>
> >>> From: [email protected]
> >>> To: [email protected],
> >>> Date: 12/07/2011 03:07 PM
> >>> Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock
> >>> on "g_jobacct_gather_context_lock"
> >>> Sent by: [email protected]
> >>>
> >>> I have a bug report which describes a node in the "down" state with
> >>> a reason of "Not responding".  The node turns out to have a
> >>> deadlock between two pairs of threads, one half of each pair being
> >>> in slurmd, and the other half in slurmstepd.  The original stack
> >>> traces and analysis by the site support, taken from the bug report,
> >>> are attached below.
> >>>
> >>> Briefly, the deadlock occurs because slurmd has received two
> >>> requests:  a REQUEST_PING, and a REQUEST_STEP_COMPLETE, and has
> >>> started a thread to handle each of them.   The REQUEST_PING code
> >>> takes the opportunity to perform an "_enforce_job_mem_limit" call,
> >>> which requires getting some job accounting information.  To get this
> >>> information, it opens a socket to slurmstepd and sends a
> >>> REQUEST_STEP_STAT,  and then eventually calls
> >>> "jobacct_gather_g_getinfo" to receive the response.  This routine
> >>> locks the "g_jobacct_gather_context_lock" before eventually calling
> >>> "safe_read" to read from the socket.
> >>>
> >>> Meanwhile the REQUEST_STEP_COMPLETE code opens a socket to
> >>> slurmstepd and sends a REQUEST_STEP_COMPLETION request,  then
> >>> eventually calls "jobacct_gather_g_setinfo".  This would lead to
> >>> sending the data via "safe_write" except that the thread attempts to
> >>> lock the "g_jobacct_gather_context_lock" and waits behind the
> >>> REQUEST_PING thread.
> >>>
> >>> Since both of the slurmd threads issued their requests to slurmstepd
> >>> over their respective sockets before attempting to lock the above
> >>> lock,  slurmstepd has kicked off two threads to process them.  Each
> >>> of these threads needs to call some job_account_gather routines,
> >>> which attempt to lock "g_jobacct_gather_context_lock".   But since
> >>> slurmstepd is a separate process from slurmd,  this copy of the lock
> >>> is not the same as the slurmd one.   One of the threads in
> >>> slurmstepd gets the lock first and the other waits.  In the deadlock
> >>> case, we have the one that needs to respond to the REQUEST_STEP_STAT
> >>> stuck behind the lock,  while the other slurmstepd thread is
> >>> attempting to receive the REQUEST_STEP_COMPLETE information,  and
> >>> can't because its corresponding slurmd thread is waiting on the
> >>> slurmd copy of the  "g_jobacct_gather_context_lock" lock.
> >>>
> >>> Looking at the code in "slurm_jobacct_gather.c",  where all the
> >>> "jobacct_gather_g_<xxx>" routines reside,  it seems that each of
> >>> these routines locks the "g_jobacct_gather_context_lock" lock before
> >>> proceeding and calling the job_account_gather plugin routines.
> >>> This lock ostensibly protects the "g_jobacct_gather_context" data
> >>> structure,  but it seems to me that there isn't anything there that
> >>> needs protecting, except for the initialization of the structure.
> >>> The routines all call "_slurm_jobacct_gather_init" first,  which
> >>> checks for the existence of the structure, and if necessary creates
> >>> it and populates the "ops" table by causing the loading of the
> >>> appropriate plugin.  Once this is done,  there doesn't seem to me to
> >>> be any reason to lock this particular mutex.  Removing the lock/
> >>> unlock calls from these routines would prevent the deadlock
> >>> described above.
> >>>
> >>> But not being the author of this code,  I am not sure if removing
> >>> this locking/unlocking is really safe, or might lead to other
> >>> problems.
> >>>
> >>>   -Don Albert-
> >>>
> >>>
> >>>
> >>> [attachment "analysis.txt" deleted by Don Albert/US/BULL]
> >
> 
> 
> 
