The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, 
which is only ever raised by slurm_send_timeout() and slurm_recv_timeout().  
Those functions raise that error when a generic socket-based send/receive 
operation exceeds the time limit imposed by the caller.  They use 
gettimeofday() to grab an initial timestamp, then on each iteration of their 
poll() loop call gettimeofday() again and subtract the delta between the two 
timestamps from the timeout budget.
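In outline, the pattern looks something like the following.  This is a 
simplified sketch of the technique, not Slurm's actual code; the function 
name and signature are mine.

```c
/*
 * Simplified sketch of the timeout pattern described above -- NOT
 * Slurm's actual implementation.  Wait until fd is ready for the
 * given poll() events, recomputing the remaining budget from
 * gettimeofday() deltas on each pass.
 * Returns 1 if ready, 0 on timeout, -1 on error.
 */
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/time.h>

int wait_ready(int fd, short events, int budget_ms)
{
    struct timeval start, now;
    struct pollfd pfd = { .fd = fd, .events = events };
    int remaining = budget_ms;

    gettimeofday(&start, NULL);
    while (remaining > 0) {
        int rc = poll(&pfd, 1, remaining);
        if (rc > 0)
            return 1;                   /* fd is ready */
        if (rc < 0 && errno != EINTR)
            return -1;                  /* real poll() failure */
        /* Interrupted or timed out: recompute the budget from
         * wall-clock deltas.  If the system time jumps here, this
         * arithmetic is skewed and the call can time out early or
         * linger far longer than intended. */
        gettimeofday(&now, NULL);
        remaining = budget_ms
                  - (int)((now.tv_sec  - start.tv_sec)  * 1000
                        + (now.tv_usec - start.tv_usec) / 1000);
    }
    return 0;                           /* timed out */
}
```

The key point is the last step: the remaining budget is derived from the 
wall clock, so any step in the system time is indistinguishable from real 
elapsed time.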


Do you have any reason to suspect that the system clocks on your cluster 
nodes are fluctuating?  Using gettimeofday() to calculate elapsed-time 
deltas is discouraged for exactly that reason:


NOTES
       The time returned by gettimeofday() is affected by discontinuous
       jumps in the system time (e.g., if the system administrator
       manually changes the system time).  If you need a monotonically
       increasing clock, see clock_gettime(2).


> On Jun 13, 2019, at 10:47 AM, Christopher Harrop - NOAA Affiliate 
> <christopher.w.har...@noaa.gov> wrote:
> 
> Hi,
> 
> My group is struggling with this also.  
> 
> The worst part of this, which no one has brought up yet, is that the sbatch 
> command does not necessarily fail to submit the job in this situation.  In 
> fact, most of the time (for us), it succeeds.  There appears to be some sort 
> of race condition or something else going on.  The job is often (maybe most 
> of the time?) submitted just fine, but sbatch returns a non-zero status 
> (meaning the submission failed) and reports the error message.  
> 
> From a workflow management perspective this is an absolute disaster that 
> leads to workflow corruption and messes that are difficult to clean up.  
> Workflow management systems rely on the exit status of sbatch to tell the truth 
> about whether a job submission succeeded or not.  If submission fails the 
> workflow manager will resubmit the job, and if it succeeds it expects a jobid 
> to be returned.  Because sbatch usually lies about the failure of job 
> submission when these events happen, workflow management systems think the 
> submission failed and then resubmit the job.  This causes two copies of the 
> same job to be running at the same time, each trampling over the other and 
> causing a cascade of other failures that become difficult to deal with.
> 
> The problem is that the job submission request has already been received by 
> the time sbatch dies with that error.  So, the timeout happens after the job 
> request has already been made.  I don’t know how one would solve this 
> problem.  In my experience in interfacing various batch schedulers to 
> workflow management systems I’ve learned that attempting to time out 
> qsub/sbatch/bsub/etc commands always leads to a race condition. You can’t 
> time it out (barring ridiculously long timeouts to catch truly pathological 
> scenarios) because the request has already been sent and received; it’s the 
> response that never makes it back to you.  Because of the race condition 
> there is probably no way to guarantee that failure really means failure and 
> success really means success and use a timeout that guarantees failure.  The 
> best option that I know of is to never (this means a finite, but long, time) 
> time out a job submission command; just wait for the response.  That’s the 
> only way to get the correct response.
> 
> One way I’m using to work around this is to inject a long random string into 
> the --comment option.  Then, if I see the socket timeout, I use squeue to look 
> for that job and retrieve its ID.  It’s not ideal, but it can work.
> 
> Chris
> 


::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::