Thank you, Moe. Do I also need to apply the following patch, which is part of the description in "Fix gres tracking for multiple steps"?
https://github.com/SchedMD/slurm/commit/dd842d7997773088087bb5e9beb4f1ed306ba708

Please advise.

Regards,
Amit

________________________________________
From: Moe Jette [[email protected]]
Sent: Friday, September 11, 2015 12:35 PM
To: slurm-dev
Subject: [slurm-dev] Re: Multiple srun commands within a job script

That patch only relates to running multiple simultaneous steps which use GRES.

Quoting "Kumar, Amit" <[email protected]>:

> Hi Bob,
>
> Interesting! Although I don't fully understand this. Just so I
> understand: GitHub points to "gres tracking for multiple steps", and
> I am not scheduling any GPUs or other special resources. I understood
> that GRES was designed to handle those special kinds of resources, or
> perhaps I have that wrong?
>
> I will patch gres.c as pointed out on GitHub and see if that solves
> my problem.
>
> Thank you,
> Amit
>
> ________________________________________
> From: Bob Moench [[email protected]]
> Sent: Friday, September 11, 2015 10:28 AM
> To: slurm-dev
> Subject: [slurm-dev] Re: Multiple srun commands within a job script
>
> Amit,
>
> This sounds a fair amount like something I reported. I
> believe that the problem is described at this link:
>
> https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe
>
> Bob
>
> On Fri, 11 Sep 2015, Kumar, Amit wrote:
>
>> Dear All,
>>
>> We are noticing somewhat strange behavior. We have some jobs that,
>> within a single run, launch multiple parallel steps after making
>> sure all dependencies are met.
>>
>> In short:
>>
>> #!/bin/bash
>> #SBATCH ...
>> ...
>> srun namd2 xyz
>> check that all went well; if true, continue, else fail
>> srun namd2 abc
>> check that all went well; if true, continue, else fail
>> ...continue this for 5 different configs...
>> //end
>>
>> Alternatively, we could do this by adding job dependencies, but the
>> volume of jobs makes that impractical, and we cannot manually check
>> whether the dependencies are satisfied.
>>
>> My issue is that we randomly see the launching of tasks by srun
>> fail (or get killed) in one of the intermediate steps above. Since
>> we are running the tasks on the same set of nodes, I wonder why
>> they would fail on the next launch. I have confirmed it is not
>> application related: I am repeatedly using an example that has
>> already run successfully, and we still see this behavior. Could I
>> be running into a timeout between launches?
>>
>> Any thoughts will be greatly appreciated.
>>
>> Regards,
>> Amit
>
> --
> Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227

--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support

===============================================================
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html
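[Editor's note: the step-chaining pattern in Amit's sketch can be written out with explicit exit-status checks between srun calls. This is a minimal sketch only, not the poster's actual script: the SBATCH options, NAMD config-file names, and log paths are placeholders, and it must be submitted with sbatch on a Slurm cluster.]

```shell
#!/bin/bash
#SBATCH --job-name=namd_chain   # placeholder resource requests;
#SBATCH --nodes=4               # substitute your own #SBATCH options
#SBATCH --time=04:00:00

# Run each NAMD configuration as its own job step. If any srun step
# exits non-zero, report it and abort the whole job, mirroring the
# "check that all went well; if true, continue, else fail" logic
# in the original sketch. Config names here are placeholders.
for conf in xyz abc def ghi jkl; do
    srun namd2 "${conf}.conf" > "${conf}.log" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "srun step for ${conf} failed with exit code ${rc}" >&2
        exit "$rc"
    fi
done
```

Chaining the steps inside one allocation this way avoids the separate dependent jobs (`--dependency=afterok:...`) that Amit mentions as the alternative, since each step only starts after the previous one has exited cleanly.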
