Thank you, Moe. Do I also need to apply the following patch, which is part of the description in "Fix gres tracking for multiple steps"?
https://github.com/SchedMD/slurm/commit/dd842d7997773088087bb5e9beb4f1ed306ba708

Please advise.

Regards,
Amit

________________________________________
From: Moe Jette [[email protected]]
Sent: Friday, September 11, 2015 12:35 PM
To: slurm-dev
Subject: [slurm-dev] Re: Multiple srun commands within a job script

That patch only relates to running multiple simultaneous steps which use GRES.

Quoting "Kumar, Amit" <[email protected]>:

> Hi Bob,
>
> Interesting! Although I don't fully understand this. Just so I
> understand: GitHub points to "gres tracking for multiple steps", and
> I am not scheduling any GPUs or other special resources. I understood
> that GRES was designed to handle those special kinds of resources, or
> perhaps I have that wrong?
>
> I will patch gres.c as pointed out on GitHub and see if that solves
> my problem.
>
> Thank you,
> Amit
>
> ________________________________________
> From: Bob Moench [[email protected]]
> Sent: Friday, September 11, 2015 10:28 AM
> To: slurm-dev
> Subject: [slurm-dev] Re: Multiple srun commands within a job script
>
> Amit,
>
> This sounds a fair amount like something I reported. I
> believe that the problem is described at this link:
>
> https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe
>
> Bob
>
> On Fri, 11 Sep 2015, Kumar, Amit wrote:
>
>> Dear All,
>>
>> We are noticing somewhat strange behavior. We have some jobs that,
>> within a single run, launch multiple parallel steps after making
>> sure all dependencies are met.
>>
>> In short:
>>
>> #!/bin/bash
>> #SBATCH ...
>> ...
>> srun namd2 xyz
>> check that all went well; if true, continue, else fail
>> srun namd2 abc
>> check that all went well; if true, continue, else fail
>> ...continue this for 5 different configs...
>> //end
>>
>> Alternatively, we could do this by adding job dependencies, but the
>> volume of jobs makes that impractical, and we cannot manually check
>> whether the dependencies are satisfied.
>>
>> My issue is that we randomly see the launching of tasks by srun
>> fail (or get killed) in one of the intermediate steps above. Since
>> we are running the tasks on the same set of nodes, I wonder why
>> they would fail on the next launch. I have confirmed it is not
>> application related: I am repeatedly using an example that has
>> already run successfully, and we still see this behavior. Could I
>> be running into a timeout between launches?
>>
>> Any thoughts will be greatly appreciated.
>>
>> Regards,
>> Amit
>
> --
> Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227

--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support

===============================================================
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html
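[Editor's note: the step-chaining pattern in Amit's sketch can be written out with explicit exit-status checks between srun calls. This is a minimal sketch only, not the poster's actual script: the SBATCH options, NAMD config-file names, and log paths are placeholders, and it must be submitted with sbatch on a Slurm cluster.]

```shell
#!/bin/bash
#SBATCH --job-name=namd_chain   # placeholder resource requests;
#SBATCH --nodes=4               # substitute your own #SBATCH options
#SBATCH --time=04:00:00

# Run each NAMD configuration as its own job step. If any srun step
# exits non-zero, report it and abort the whole job, mirroring the
# "check that all went well; if true, continue, else fail" logic
# in the original sketch. Config names here are placeholders.
for conf in xyz abc def ghi jkl; do
    srun namd2 "${conf}.conf" > "${conf}.log" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "srun step for ${conf} failed with exit code ${rc}" >&2
        exit "$rc"
    fi
done
```

Chaining the steps inside one allocation this way avoids the separate dependent jobs (`--dependency=afterok:...`) that Amit mentions as the alternative, since each step only starts after the previous one has exited cleanly.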
