Hi,

My group is struggling with this also.  

The worst part of this, which no one has brought up yet, is that sbatch does 
not necessarily fail to submit the job in this situation.  In fact, most of 
the time (for us) the submission succeeds.  There appears to be some sort of 
race condition or something similar going on: the job is often (maybe most of 
the time?) submitted just fine, but sbatch still returns a non-zero exit 
status (indicating that the submission failed) and prints the error message.  

From a workflow management perspective this is an absolute disaster that 
leads to workflow corruption and messes that are difficult to clean up.  
Workflow management systems rely on sbatch’s exit status to tell the truth 
about whether a job submission succeeded.  If the submission fails, the 
workflow manager resubmits the job; if it succeeds, it expects a jobid to be 
returned.  Because sbatch usually reports failure even though the submission 
actually went through, the workflow manager concludes the submission failed 
and resubmits the job.  That leaves two copies of the same job running at the 
same time, each trampling over the other and causing a cascade of further 
failures that are difficult to deal with.

The problem is that the job submission request has already been received by 
the time sbatch dies with that error.  So the timeout happens after the job 
request has already been made, and I don’t know how one would solve that.  In 
my experience interfacing various batch schedulers to workflow management 
systems, I’ve learned that attempting to time out qsub/sbatch/bsub/etc. 
commands always leads to a race condition.  You can’t time it out (barring 
ridiculously long timeouts to catch truly pathological scenarios) because the 
request has already been sent and received; it’s the response that never 
makes it back to you.  Because of that race condition, there is probably no 
way to pick a timeout that still guarantees a reported failure really means 
failure and a reported success really means success.  The best option I know 
of is to never time out a job submission command (or only after a finite but 
very long time); just wait for the response.  That’s the only way to get the 
correct answer.
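
For what it’s worth, by "never time out" I mean the submitting side should 
not impose its own deadline on sbatch at all.  A minimal Python sketch of 
that idea is below (the --parsable flag and the subprocess wrapper are just 
my framing for illustration, not anything specific to our setup):

    import subprocess

    def submit(script_path):
        # Deliberately no timeout= here: we wait for sbatch's own response,
        # which is the only point at which success or failure is truly known.
        result = subprocess.run(
            ["sbatch", "--parsable", script_path],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            raise RuntimeError("sbatch reported failure: " + result.stderr.strip())
        # --parsable prints "jobid" or "jobid;cluster"
        return result.stdout.strip().split(";")[0]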

One workaround I’m using is to inject a long random string into the 
--comment option.  Then, if I see the socket timeout, I use squeue to look 
for a job carrying that comment and retrieve its ID.  It’s not ideal, but it 
can work.
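
In case it helps, here is a rough Python sketch of that workaround.  The 
squeue format string (%i for jobid, %k for comment) and the exact flags are 
from memory, so treat this as an illustration and double-check it against 
your Slurm version:

    import getpass
    import subprocess
    import uuid

    def submit_with_marker(script_path):
        # Unique marker we can search for if sbatch's reply never arrives.
        marker = uuid.uuid4().hex
        result = subprocess.run(
            ["sbatch", "--parsable", "--comment=" + marker, script_path],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return result.stdout.strip().split(";")[0]

        # sbatch claims failure, but the job may have been accepted anyway.
        # Look for our marker in the comment field before trusting the exit code.
        squeue = subprocess.run(
            ["squeue", "--noheader", "-u", getpass.getuser(), "-o", "%i %k"],
            capture_output=True, text=True,
        )
        for line in squeue.stdout.splitlines():
            jobid, _, comment = line.strip().partition(" ")
            if comment == marker:
                return jobid  # the submission actually went through
        raise RuntimeError("submission really failed: " + result.stderr.strip())

One caveat: this only catches jobs that squeue can still see, so a very short 
job could finish before the check runs.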

Chris
