Re: [slurm-users] good practices

2019-11-26 Thread Eli V
Inline below

On Tue, Nov 26, 2019 at 5:50 AM Loris Bennett
 wrote:
>
> Hi Nigella,
>
> Nigella Sanders  writes:
>
> > Thank you all for such interesting replies.
> >
> > The --dependency option is quite useful, but in practice it has some
> > inconveniences. Firstly, all 20 jobs are instantly queued, which some
> > users may interpret as an abusive use of common resources.
>
> This doesn't seem like a problem to me, since no common resources are being
> used by jobs in the queue.  It only becomes a problem if a single person
> can queue enough jobs to consume all the resources *and* you are not using
> any form of fairshare.  Otherwise a job submitted later, but with a higher
> priority, will start earlier if the resources become available.
>
> That is not to say that users won't *think* that a large number of jobs
> belonging to other users automatically means that later jobs will be
> disadvantaged.  However, that is more an issue of educating your users.
>
> > Even worse, if a job fails, the rest will stay queued forever (?),
> > the first of them tagged as "DependencyNeverSatisfied" and the rest
> > just as "Dependency".
>
> This is just a consequence of your requirement that "each job ... needs
> the previous one to be completed", but it also isn't a problem, because
> pending jobs don't consume resources for which users compete.

Also, using kill_invalid_depend in your slurm.conf's
SchedulerParameters will automatically remove the jobs from the queue
once their dependency can't be satisfied.
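
For example, something like this in slurm.conf (just a sketch; what else
belongs on that SchedulerParameters line depends on your site's existing
config):

# slurm.conf (excerpt)
SchedulerParameters=kill_invalid_depend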


>
> Regards
>
> Loris
>
> > PS: Yarom, with queue time I meant the total run time allowed. In my case,
> > after a job starts running it will be killed if it takes more than 10 hours
> > of execution time. If the partition queue time limit were 10 days,
> > for instance, I guess I could use a single sbatch to launch a script
> > containing the 20 executions as steps with srun.
> >
> > Regards,
> > Nigella
> >
> > On Mon, 25 Nov 2019 at 15:08, Yair Yarom ()
> > wrote:
> >
> >  Hi,
> >
> >  I'm not sure what the queue time limit of 10 hours is. If you can't have jobs
> > waiting for more than 10 hours, then it seems very small for 8-hour
> > jobs.
> >  Generally, a few options:
> >  a. The --dependency option (either afterok or singleton)
> >  b. The --array option of sbatch with a limit of 1 job at a time (instead of
> > the for loop): sbatch --array=1-20%1
> >  c. At the end of the script of each job, call the sbatch line of the next 
> > job (this is probably the only option if indeed I understood the queue time 
> > limit correctly).
> >
> >  And indeed, srun should probably be reserved for strictly interactive jobs.
> >
> >  Regards,
> >  Yair.
> >
> >  On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders 
> >  wrote:
> >
> >  Hi all,
> >
> >  I guess this is a simple matter but I still find it confusing.
> >
> >  I have to run 20 jobs on our supercomputer.
> >  Each job takes about 8 hours and each one needs the previous one to be
> > completed.
> >  The queue time limit for jobs is 10 hours.
> >
> >  So my first approach is serially launching them in a loop using srun:
> >
> >  #!/bin/bash
> >  for i in {1..20};do
> >  srun  --time 08:10:00  [options]
> >  done
> >
> >  However, the SLURM literature keeps saying that 'srun' should only be used for
> > short command-line tests, so some sysadmins would consider this a bad
> > practice (see this).
> >
> >  My second approach switched to sbatch:
> >
> >  #!/bin/bash
> >  for i in {1..20};do
> >  sbatch  --time 08:10:00 [options]
> >  [polling to queue to see if job is done]
> >  done
> >
> >  But since sbatch returns the prompt immediately, I had to add code to check for
> > job termination. Polling makes use of the sleep command and is prone to race
> > conditions, so sysadmins don't like it either.
> >
> >  I guess there must be a --wait option in some recent versions of SLURM 
> > (see this). Not yet available in our system though.
> >
> >  Is there any preferable/canonical/friendly way to do this?
> >  Any thoughts would be really appreciated,
> >
> >  Regards,
> >  Nigella.
> >
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>



Re: [slurm-users] good practices

2019-11-26 Thread Loris Bennett
Hi Nigella,

Nigella Sanders  writes:

> Thank you all for such interesting replies.
>
> The --dependency option is quite useful, but in practice it has some
> inconveniences. Firstly, all 20 jobs are instantly queued, which some
> users may interpret as an abusive use of common resources.

This doesn't seem like a problem to me, since no common resources are being
used by jobs in the queue.  It only becomes a problem if a single person
can queue enough jobs to consume all the resources *and* you are not using
any form of fairshare.  Otherwise a job submitted later, but with a higher
priority, will start earlier if the resources become available.

That is not to say that users won't *think* that a large number of jobs
belonging to other users automatically means that later jobs will be
disadvantaged.  However, that is more an issue of educating your users.

> Even worse, if a job fails, the rest will stay queued forever (?),
> the first of them tagged as "DependencyNeverSatisfied" and the rest
> just as "Dependency".

This is just a consequence of your requirement that "each job ... needs
the previous one to be completed", but it also isn't a problem, because
pending jobs don't consume resources for which users compete.

Regards

Loris

> PS: Yarom, with queue time I meant the total run time allowed. In my case,
> after a job starts running it will be killed if it takes more than 10 hours
> of execution time. If the partition queue time limit were 10 days,
> for instance, I guess I could use a single sbatch to launch a script
> containing the 20 executions as steps with srun.
>
> Regards,
> Nigella
>
> On Mon, 25 Nov 2019 at 15:08, Yair Yarom ()
> wrote:
>
>  Hi,
>
>  I'm not sure what the queue time limit of 10 hours is. If you can't have jobs
> waiting for more than 10 hours, then it seems very small for 8-hour
> jobs.
>  Generally, a few options:
>  a. The --dependency option (either afterok or singleton)
>  b. The --array option of sbatch with a limit of 1 job at a time (instead of
> the for loop): sbatch --array=1-20%1
>  c. At the end of the script of each job, call the sbatch line of the next 
> job (this is probably the only option if indeed I understood the queue time 
> limit correctly).
>
>  And indeed, srun should probably be reserved for strictly interactive jobs.
>
>  Regards,
>  Yair.
>
>  On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders  
> wrote:
>
>  Hi all,
>
>  I guess this is a simple matter but I still find it confusing.
>
>  I have to run 20 jobs on our supercomputer. 
>  Each job takes about 8 hours and each one needs the previous one to be
> completed.
>  The queue time limit for jobs is 10 hours.
>
>  So my first approach is serially launching them in a loop using srun:
>
>  #!/bin/bash
>  for i in {1..20};do
>  srun  --time 08:10:00  [options]
>  done
>
>  However, the SLURM literature keeps saying that 'srun' should only be used for
> short command-line tests, so some sysadmins would consider this a bad
> practice (see this).
>
>  My second approach switched to sbatch:
>
>  #!/bin/bash 
>  for i in {1..20};do
>  sbatch  --time 08:10:00 [options]
>  [polling to queue to see if job is done]
>  done
>
>  But since sbatch returns the prompt immediately, I had to add code to check for
> job termination. Polling makes use of the sleep command and is prone to race
> conditions, so sysadmins don't like it either.
>
>  I guess there must be a --wait option in some recent versions of SLURM (see 
> this). Not yet available in our system though.
>
>  Is there any preferable/canonical/friendly way to do this?
>  Any thoughts would be really appreciated,
>
>  Regards,
>  Nigella.
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] good practices

2019-11-26 Thread Nigella Sanders
Thank you all for such interesting replies.

The --dependency option is quite useful, but in practice it has some
inconveniences. Firstly, all 20 jobs are *instantly queued*, which some users
may interpret as an abusive use of common resources. Even worse, if a
job fails, the rest will stay queued forever (?), the first of them tagged
as "DependencyNeverSatisfied" and the rest just as "Dependency".

PS: Yarom, with queue time I meant the total run time allowed. In my case,
after a job starts running it will be killed if it takes more than 10 hours
of execution time. If the partition queue time limit were 10 days, for
instance, I guess I could use a single sbatch to launch a script containing
the 20 executions as steps with srun.
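
Something like this, I suppose (only a sketch; myapp stands in for the real
program and the usual #SBATCH [options] are left out):

#!/bin/bash
#SBATCH --time 160:10:00       # 20 x 8 h, would fit within a 10-day limit
for i in {1..20}; do
    # run each of the 20 executions as a job step inside one allocation
    srun ./myapp "$i" || exit 1   # stop the chain if a step fails
done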

Regards,
Nigella







On Mon, 25 Nov 2019 at 15:08, Yair Yarom ()
wrote:

> Hi,
>
> I'm not sure what the queue time limit of 10 hours is. If you can't have jobs
> waiting for more than 10 hours, then it seems very small for 8-hour
> jobs.
> Generally, a few options:
> a. The --dependency option (either afterok or singleton)
> b. The --array option of sbatch with a limit of 1 job at a time (instead of
> the for loop): sbatch --array=1-20%1
> c. At the end of the script of each job, call the sbatch line of the next
> job (this is probably the only option if indeed I understood the queue time
> limit correctly).
>
> And indeed, srun should probably be reserved for strictly interactive jobs.
>
> Regards,
> Yair.
>
> On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders <
> nigella.sand...@gmail.com> wrote:
>
>>
>> Hi all,
>>
>> I guess this is a simple matter but I still find it confusing.
>>
>> I have to run 20 jobs on our supercomputer.
>> Each job takes about 8 hours and each one needs the previous one to be
>> completed.
>> The queue time limit for jobs is 10 hours.
>>
>> So my first approach is serially launching them in a loop using srun:
>>
>>
>> #!/bin/bash
>> for i in {1..20};do
>> srun  --time 08:10:00  [options]
>> done
>>
>> However, the SLURM literature keeps saying that 'srun' should only be used for
>> short command-line tests, so some sysadmins would consider this a bad
>> practice (see this).
>>
>> My second approach switched to sbatch:
>>
>> #!/bin/bash
>> for i in {1..20};do
>> sbatch  --time 08:10:00 [options]
>> [polling to queue to see if job is done]
>> done
>>
>> But since sbatch returns the prompt immediately, I had to add code to check for
>> job termination. Polling makes use of the sleep command and is prone to race
>> conditions, so sysadmins don't like it either.
>>
>> I guess there must be a --wait option in some recent versions of SLURM (see
>> this). Not yet available
>> in our system though.
>>
>> Is there any preferable/canonical/friendly way to do this?
>> Any thoughts would be really appreciated,
>>
>> Regards,
>> Nigella.
>>
>>
>>
>


Re: [slurm-users] good practices

2019-11-25 Thread Yair Yarom
Hi,

I'm not sure what the queue time limit of 10 hours is. If you can't have jobs
waiting for more than 10 hours, then it seems very small for 8-hour
jobs.
Generally, a few options:
a. The --dependency option (either afterok or singleton)
b. The --array option of sbatch with a limit of 1 job at a time (instead of
the for loop): sbatch --array=1-20%1
c. At the end of the script of each job, call the sbatch line of the next
job (this is probably the only option if indeed I understood the queue time
limit correctly).

And indeed, srun should probably be reserved for strictly interactive jobs.
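
For (b), a minimal sketch (myapp is just a placeholder for the real program):

#!/bin/bash
#SBATCH --time 08:10:00
#SBATCH --array=1-20%1         # 20 tasks, at most one running at a time
# SLURM_ARRAY_TASK_ID tells this task which of the 20 runs it performs
srun ./myapp "${SLURM_ARRAY_TASK_ID}"

Note that %1 only limits concurrency; it won't cancel the remaining tasks if
an earlier one fails, so a strict "previous run must have completed
successfully" requirement still needs --dependency.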

Regards,
Yair.

On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders 
wrote:

>
> Hi all,
>
> I guess this is a simple matter but I still find it confusing.
>
> I have to run 20 jobs on our supercomputer.
> Each job takes about 8 hours and each one needs the previous one to be
> completed.
> The queue time limit for jobs is 10 hours.
>
> So my first approach is serially launching them in a loop using srun:
>
>
> #!/bin/bash
> for i in {1..20};do
> srun  --time 08:10:00  [options]
> done
>
> However, the SLURM literature keeps saying that 'srun' should only be used for
> short command-line tests, so some sysadmins would consider this a bad
> practice (see this).
>
> My second approach switched to sbatch:
>
> #!/bin/bash
> for i in {1..20};do
> sbatch  --time 08:10:00 [options]
> [polling to queue to see if job is done]
> done
>
> But since sbatch returns the prompt immediately, I had to add code to check for
> job termination. Polling makes use of the sleep command and is prone to race
> conditions, so sysadmins don't like it either.
>
> I guess there must be a --wait option in some recent versions of SLURM (see
> this). Not yet available
> in our system though.
>
> Is there any preferable/canonical/friendly way to do this?
> Any thoughts would be really appreciated,
>
> Regards,
> Nigella.
>
>
>


Re: [slurm-users] good practices

2019-11-25 Thread Gennaro Oliva
Hi Nigella,

On Mon, Nov 25, 2019 at 09:12:17AM +, Nigella Sanders wrote:
> I have to run 20 jobs on our supercomputer.
> Each job takes about 8 hours and each one needs the previous one to be
> completed.
> ...
> Is there any preferable/canonical/friendly way to do this?

I would use the --dependency option.
Regards,
-- 
Gennaro Oliva



Re: [slurm-users] good practices

2019-11-25 Thread Huda, Zia Ul
Hi,

I would recommend using the dependency (-d) option available in sbatch and srun.

You need -d afterok:jobid. Hopefully it works.
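
A rough sketch of such a chain (job.sh is just a placeholder for your batch
script; --parsable makes sbatch print only the job id):

#!/bin/bash
# first job has no dependency
jobid=$(sbatch --parsable --time 08:10:00 job.sh)
# every following job waits for the previous one to finish successfully
for i in {2..20}; do
    jobid=$(sbatch --parsable --time 08:10:00 -d "afterok:${jobid}" job.sh)
done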

Best


Zia Ul Huda
Forschungszentrum Jülich GmbH
Institute for Advanced Simulation (IAS)
Jülich Supercomputing Centre (JSC)
Wilhelm-Johnen-Straße
52425 Jülich, Germany

Phone: +49 2461 61 96905
E-mail:  z.h...@fz-juelich.de

WWW: http://www.fz-juelich.de/ias/jsc/


JSC is the coordinator of the
John von Neumann Institute for Computing
and member of the
Gauss Centre for Supercomputing



On 25. Nov 2019, at 10:12, Nigella Sanders <nigella.sand...@gmail.com> wrote:


Hi all,

I guess this is a simple matter but I still find it confusing.

I have to run 20 jobs on our supercomputer.
Each job takes about 8 hours and each one needs the previous one to be
completed.
The queue time limit for jobs is 10 hours.

So my first approach is serially launching them in a loop using srun:

#!/bin/bash
for i in {1..20};do
srun  --time 08:10:00  [options]
done

However, the SLURM literature keeps saying that 'srun' should only be used for
short command-line tests, so some sysadmins would consider this a bad practice
(see this).

My second approach switched to sbatch:

#!/bin/bash
for i in {1..20};do
sbatch  --time 08:10:00 [options]
[polling to queue to see if job is done]
done

But since sbatch returns the prompt immediately, I had to add code to check for
job termination. Polling makes use of the sleep command and is prone to race
conditions, so sysadmins don't like it either.

I guess there must be a --wait option in some recent versions of SLURM (see 
this). Not yet available in our 
system though.

Is there any preferable/canonical/friendly way to do this?
Any thoughts would be really appreciated,

Regards,
Nigella.







Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt