> Hi Chris
> 
> You are right in pointing out that the job actually runs, despite the error from 
> sbatch. The customer mentions that:
> === start ===
> The problem had the usual scenario - the job script was submitted and executed, but 
> the sbatch command returned a non-zero exit status to ecflow, which therefore assumed 
> the job to be dead.
> === end ===
> 
> Which version of Slurm are you using? I'm using 17.02.4-1, and we are 
> wondering about the possibility of upgrading to a newer version; that is, I 
> hope there was a bug and SchedMD has fixed the problem.

Sorry I missed that.  I am not the admin of the system, but I believe we are 
using 18.08.7.  I believe we have a ticket open with SchedMD and our admin team 
is working with them; the approach being taken is to capture statistics with 
sdiag and use that information to tune configuration parameters.  My 
understanding is that they view the problem as a configuration issue rather 
than a bug in the scheduler, which to me means the timeouts can only be 
minimized, not eliminated.  And because workflow corruption is such a 
disastrous event, I have built in attempts to work around it even though 
occurrences are “rare”.
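
As a rough illustration of the kind of guard I mean (not necessarily what we 
actually run), a submission wrapper could check squeue when sbatch exits 
non-zero before reporting failure back to ecflow.  The sketch below assumes 
each task is submitted with a unique --job-name inside the job script; the 
retry count and wait time are arbitrary:

    #!/usr/bin/env python3
    # Rough sketch only: guard an sbatch submission against the case where
    # sbatch exits non-zero even though slurmctld accepted the job.
    # Assumes each task sets a unique --job-name in its job script;
    # the retry count and wait time below are arbitrary.
    import getpass
    import subprocess
    import sys
    import time

    def submit(job_script, job_name, retries=3, wait=30):
        for attempt in range(retries):
            result = subprocess.run(["sbatch", job_script],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                # Normal case: "Submitted batch job <jobid>"
                return result.stdout.strip()

            # sbatch reported failure, but the job may have been queued
            # anyway - look for it by name before retrying or giving up.
            time.sleep(wait)
            check = subprocess.run(
                ["squeue", "--noheader", "--format=%i",
                 "--name", job_name, "--user", getpass.getuser()],
                capture_output=True, text=True)
            if check.returncode == 0 and check.stdout.strip():
                return "Found queued job %s" % check.stdout.split()[0]

        sys.exit("sbatch failed %d times and no job named %r is queued"
                 % (retries, job_name))

    if __name__ == "__main__":
        # e.g. submit_guard.py my_task.job1 my_task   (hypothetical names)
        print(submit(sys.argv[1], sys.argv[2]))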

Chris
