Dear All,

We are noticing some strange behavior. We have jobs that, within a single run,
launch multiple parallel steps in sequence, each one only after verifying that
the previous step completed successfully.

In short:

#!/bin/bash
#SBATCH ...
...
srun namd2 xyz
# check that the step went well; if so continue, else fail
srun namd2 abc
# check that the step went well; if so continue, else fail
# ...continue this for 5 different configs...
# end
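A minimal runnable sketch of the pattern above: a helper runs each command and aborts the whole batch job on the first non-zero exit status. The `run_step` helper is a name of my own, not part of the original script, and checking only the `srun` exit code is an assumption about what "checks to make sure all went well" does.

```shell
#!/bin/bash
#SBATCH ...

# run_step executes its arguments as one command and aborts the batch
# job if the command (srun, and hence namd2) exits non-zero.
run_step () {
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "step '$*' failed with exit code $rc" >&2
        exit "$rc"
    fi
}

# In the real batch script:
# run_step srun namd2 xyz
# run_step srun namd2 abc
# ...and so on for the five configs...
```

Because `exit` propagates the failing step's code, the batch job itself ends FAILED, which also makes the failure visible to any downstream dependency.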
Alternatively, we could split this into separate jobs chained with dependencies,
but the sheer volume of jobs makes that unappealing, and we cannot manually
verify that every dependency is satisfied.
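For what it's worth, the dependency-based alternative can be scripted so no manual checking is needed: submit one batch script per config with `sbatch --parsable` and chain each on the previous job id with `afterok`, so Slurm itself enforces the ordering and skips later steps if an earlier one fails. This is only a sketch; `chain_submit`, the `SUBMIT` override, and the `cfg_*.sh` script names are my own inventions.

```shell
#!/bin/bash
# SUBMIT defaults to sbatch; it is overridable so the chaining logic
# can be exercised off-cluster.
SUBMIT=${SUBMIT:-sbatch}

# chain_submit submits each script in order, making every job depend
# (afterok) on the previous one's job id.
chain_submit () {
    local prev="" jid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jid=$($SUBMIT --parsable "$script")
        else
            jid=$($SUBMIT --parsable --dependency=afterok:"$prev" "$script")
        fi
        echo "submitted $script as job $jid"
        prev=$jid
    done
}

# On a real cluster (hypothetical per-config scripts):
# chain_submit cfg1.sh cfg2.sh cfg3.sh cfg4.sh cfg5.sh
```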

My issue is that we randomly see srun fail, or its tasks get killed, at one of
the intermediate steps above. Since every step runs on the same set of nodes, I
don't see why a subsequent launch would fail. I have confirmed it is not
application related: I am rerunning an input that has already completed
successfully, and we still see this behavior. Could I be hitting some timeout
between one launch and the next?

Any thoughts will be greatly appreciated.
Regards,
Amit
