Dear All,

We are noticing some rather strange behavior. We have jobs that, within a single batch run, launch multiple parallel job steps one after another, after making sure all dependencies are met.
In short:

    #!/bin/bash
    #SBATCH ...
    ...
    srun namd2 xyz
    # check that all went well; if true, continue, else fail
    srun namd2 abc
    # check that all went well; if true, continue, else fail
    ... continue this for 5 different configs ...
    # end

Alternatively we could do this with job dependencies, but the sheer volume of jobs makes that unattractive, and we cannot manually verify that the dependencies are satisfied.

My issue is that we are randomly seeing the launch of tasks by srun fail or get killed at one of the intermediate steps above. Since we are running the tasks on the same set of nodes, I wonder why they would fail on a subsequent launch. I have confirmed it is not application related: I am repeatedly rerunning an example that has already completed successfully, and we still see this behavior. Could I be running into some timeout between one launch and the next?

Any thoughts will be greatly appreciated.

Regards,
Amit
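P.S. For concreteness, here is a minimal sketch of the pattern described above. The job name, node count, and configuration names are placeholders rather than our real settings, and the fallback to the shell no-op ':' is only there so the control flow can be dry-run outside the cluster:

```shell
#!/bin/bash
#SBATCH --job-name=namd_chain   # hypothetical directives; substitute real ones
#SBATCH --nodes=2

# Use srun when it is available; otherwise fall back to the shell
# no-op ':' so the script's control flow can be tested off-cluster.
if command -v srun >/dev/null 2>&1; then
    LAUNCH=srun
else
    LAUNCH=:
fi

run_step () {
    cfg=$1
    # Launch one NAMD step and capture its output per config.
    $LAUNCH namd2 "$cfg" > "${cfg}.log" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # Fail the whole batch job as soon as one step fails.
        echo "step $cfg failed with exit code $rc" >&2
        exit "$rc"
    fi
    echo "step $cfg ok"
}

# Run the five configurations in sequence (names are placeholders).
for cfg in xyz abc def ghi jkl; do
    run_step "$cfg"
done
```

Each step's exit code is checked before the next srun is issued, which is the "check all went well, else fail" logic from the outline above.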
