Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-31 Thread Steven Dick
Probably the ideal solution would be a mix of array jobs and QOS.
I'd at least use the array jobs within a single set of regressions,
with or without per-set run limits on the array.
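For illustration only, combining the two could look something like the sketch
below; the QOS name and script names are placeholders, not anything defined in
this thread, and assume the admin has created a QOS capping the user's GPUs:

    # one array per regression script, all counted against one user-level QOS
    sbatch --qos=gpu90 --array=1-40 regression_set1.sbatch
    sbatch --qos=gpu90 --array=1-40 regression_set2.sbatch
    # ... the QOS caps the user's total concurrent usage across all the arrays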

On Sat, Aug 31, 2019 at 11:17 AM Guillaume Perrault Archambault
 wrote:
>
> Hi Steven,
>
> Thanks for your help.
>
> Looks like QOS is the way to go if I want both job arrays + user limits on 
> jobs/resources (in the context of a regression-test).
>
> Regards,
> Guillaume.
>
> On Fri, Aug 30, 2019 at 6:11 PM Steven Dick  wrote:
>>
>> On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault
>>  wrote:
>> > My problem with that though, is what if each script (the 9 scripts in my 
>> > earlier example) each require different requirements? For example, run on 
>> > a different partition, or set a different time limit? My understanding is 
>> > that for a single job array, each job will get the same job requirements.
>>
>> That's a little messier and may be less suitable for an array job.
>> However, some of that can be accomplished.   You can for instance,
>> submit a job to multiple partitions and then use srun within the job
>> to allocate resources to individual tasks within the job.
>> But you get a lot less control over how the resources are spread, so
>> it might not be workable.
>>
>> > The other problem is that with the way I've implemented it, I can change 
>> > the max jobs dynamically.
>>
>> Others have indicated in this thread that qos can be dynamically
>> changed; I don't recall trying that, but if you did, I think you'd do
>> it with scontrol.
>>



Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-31 Thread Guillaume Perrault Archambault
Hi Steven,

Thanks for your help.

Looks like QOS is the way to go if I want both job arrays + user limits on
jobs/resources (in the context of a regression-test).

Regards,
Guillaume.

On Fri, Aug 30, 2019 at 6:11 PM Steven Dick  wrote:

> On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault
>  wrote:
> > My problem with that though, is what if each script (the 9 scripts in my
> earlier example) each require different requirements? For example, run on a
> different partition, or set a different time limit? My understanding is
> that for a single job array, each job will get the same job requirements.
>
> That's a little messier and may be less suitable for an array job.
> However, some of that can be accomplished.   You can for instance,
> submit a job to multiple partitions and then use srun within the job
> to allocate resources to individual tasks within the job.
> But you get a lot less control over how the resources are spread, so
> it might not be workable.
>
> > The other problem is that with the way I've implemented it, I can change
> the max jobs dynamically.
>
> Others have indicated in this thread that qos can be dynamically
> changed; I don't recall trying that, but if you did, I think you'd do
> it with scontrol.
>
>


Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault
 wrote:
> My problem with that though, is what if each script (the 9 scripts in my 
> earlier example) each require different requirements? For example, run on a 
> different partition, or set a different time limit? My understanding is that 
> for a single job array, each job will get the same job requirements.

That's a little messier and may be less suitable for an array job.
However, some of that can be accomplished. You can, for instance,
submit a job to multiple partitions and then use srun within the job
to allocate resources to individual tasks. But you get a lot less
control over how the resources are spread, so it might not be workable.
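As a rough, untested sketch of that idea (partition names, GPU counts and the
task scripts are hypothetical; the step options vary by Slurm version):

    #!/bin/bash
    #SBATCH --partition=gpu-short,gpu-long   # job may start in whichever partition frees up first
    #SBATCH --gres=gpu:4
    #SBATCH --ntasks=4

    # run four independent regression tasks inside the single allocation,
    # each step taking one task's share of the resources
    for i in 1 2 3 4; do
        srun --ntasks=1 --gres=gpu:1 --exclusive ./regression_${i}.sh &
    done
    wait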

> The other problem is that with the way I've implemented it, I can change the 
> max jobs dynamically.

Others have indicated in this thread that qos can be dynamically
changed; I don't recall trying that, but if you did, I think you'd do
it with scontrol.
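If it does work, the user-side command would presumably be something along
these lines (job ID and QOS name are placeholders):

    # move a still-pending job to a different QOS
    scontrol update JobId=123456 QOS=gpu45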



Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Thank you Paul. If the admin does agree to creating various QOS job limits or
GPU limits (e.g. 5, 10, 15, 20, ...), then that could be a powerful solution.
This would allow me to use job arrays.

I still prefer a user-side solution if possible because I'd like my script
to be as cluster-agnostic as possible; not having to ask the admin on each
cluster to create QOSes would make it easier to port these scripts across
clusters.

That said it may well end up being the best solution.
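For reference, a rough sketch of what the admin-plus-user workflow could look
like, assuming the admin is willing; the QOS names, limits and the user name
are made up:

    # admin side, done once:
    sacctmgr add qos gpu40
    sacctmgr modify qos gpu40 set MaxTRESPerUser=gres/gpu=40
    sacctmgr add qos gpu90
    sacctmgr modify qos gpu90 set MaxTRESPerUser=gres/gpu=90
    sacctmgr modify user name=user2 set qos+=gpu40,gpu90

    # user side, picking whichever allocation was agreed on that day:
    sbatch --qos=gpu90 --array=1-40 regression.sbatch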

Regards,
Guillaume.

On Fri, Aug 30, 2019 at 3:16 PM Paul Edmon  wrote:

> Yes, QoS's are dynamic.
>
> -Paul Edmon-
> On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:
>
> Hi Paul,
>
> Thanks for your pointers.
>
> I'll looking into QOS and MCS after my paper deadline (Sept 5). Re QOS, as
> expressed to Peter in the reply I just now sent, I wonder if it the QOS of
> a job can be change while it's pending (submitted but not yet running).
>
> Regards,
> Guillaume.
>
> On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon 
> wrote:
>
>> A QoS is probably your best bet.  Another variant might be MCS, which
>> you can use to help reduce resource fragmentation.  For limits though
>> QoS will be your best bet.
>>
>> -Paul Edmon-
>>
>> On 8/30/19 7:33 AM, Steven Dick wrote:
>> > It would still be possible to use job arrays in this situation, it's
>> > just slightly messy.
>> > So the way a job array works is that you submit a single script, and
>> > that script is provided an integer for each subjob.  The integer is in
>> > a range, with a possible step (default=1).
>> >
>> > To run the situation you describe, you would have to predetermine how
>> > many of each test you want to run (i.e., you coudln't dynamically
>> > change the number of jobs that run within one array)., and a master
>> > script would map the integer range to the job that was to be started.
>> >
>> > The most trivial way to do it would be to put the list of regressions
>> > in a text file and the master script would index it by line number and
>> > then run the appropriate command.
>> > A more complex way would be to do some math (a divide?) to get the
>> > script name and subindex (modulus?) for each regression.
>> >
>> > Both of these would require some semi-advanced scripting, but nothing
>> > that couldn't be cut and pasted with some trivial modifications for
>> > each job set.
>> >
>> > As to the unavailability of the admin ...
>> > An alternate approach that would require the admin's help would be to
>> > come up with a small set of alocations (e.g., 40 gpus, 80 gpus, 100
>> > gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
>> > maxtrespu=gpu=40 ) Then the user would assign that QOS to the job when
>> > starting it to set the overall allocation for all the jobs.  The admin
>> > woudln't need to tweak this except once, you just pick which tweak to
>> > use.
>> >
>> > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
>> >  wrote:
>> >> Hi Steven,
>> >>
>> >> Thanks for taking the time to reply to my post.
>> >>
>> >> Setting a limit on the number of jobs for a single array isn't
>> sufficient because regression-tests need to launch multiple arrays, and I
>> would need a job limit that would take effect over all launched jobs.
>> >>
>> >> It's very possible I'm not understand something. I'll lay out a very
>> specific example in the hopes you can correct me if I've gone wrong
>> somewhere.
>> >>
>> >> Let's take the small cluster with 140 GPUs and no fairshare as an
>> example, because it's easier for me to explain.
>> >>
>> >> The users, who all know each other personally and interact via chat,
>> decide on a daily basis how many jobs each user can run at a time.
>> >>
>> >> Let's say today is Sunday (hypothetically). Nobody is actively
>> developing today, except that user 1 has 10 jobs running for the entire
>> weekend. That leaves 130 GPUs unused.
>> >>
>> >> User 2, whose jobs all run on 1 GPU decides to run a regression test.
>> The regression test comprises of 9 different scripts each run 40 times, for
>> a grand total of 360 jobs. The duration of the scripts vary from 1 and 5
>> hours to complete, and the jobs take on average 4 hours to complete.
>> >>
>> >> User 2 gets the user group's approval (via chat) to use 90 GPUs (so
>> that 40 GPUs will remain for anyone else wanting to work that day).
>> >>
>> >> The problem I'm trying to solve is this: how do I ensure that user 2
>> launches his 360 jobs in such a way that 90 jobs are in the run state
>> consistently until the regression test is finished?
>> >>
>> >> Keep in mind that:
>> >>
>> >> limiting each job array to 10 jobs is inefficient: when the first job
>> array finishes (long before the last one), only 80 GPUs will be used, and
>> so on as other arrays finish
>> >> the admin is not available, he cannot be asked to set a hard limit of
>> 90 jobs for user 2 just for today
>> >>
>> >> I would be happy to use job arrays if they allow me to set an
>> overarching job limit across

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon

Yes, QoS's are dynamic.

-Paul Edmon-

On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:

Hi Paul,

Thanks for your pointers.

I'll looking into QOS and MCS after my paper deadline (Sept 5). Re 
QOS, as expressed to Peter in the reply I just now sent, I wonder if 
it the QOS of a job can be change while it's pending (submitted but 
not yet running).


Regards,
Guillaume.

On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon wrote:


A QoS is probably your best bet.  Another variant might be MCS, which
you can use to help reduce resource fragmentation.  For limits though
QoS will be your best bet.

-Paul Edmon-

On 8/30/19 7:33 AM, Steven Dick wrote:
> It would still be possible to use job arrays in this situation, it's
> just slightly messy.
> So the way a job array works is that you submit a single script, and
> that script is provided an integer for each subjob.  The integer
is in
> a range, with a possible step (default=1).
>
> To run the situation you describe, you would have to
predetermine how
> many of each test you want to run (i.e., you coudln't dynamically
> change the number of jobs that run within one array)., and a master
> script would map the integer range to the job that was to be
started.
>
> The most trivial way to do it would be to put the list of
regressions
> in a text file and the master script would index it by line
number and
> then run the appropriate command.
> A more complex way would be to do some math (a divide?) to get the
> script name and subindex (modulus?) for each regression.
>
> Both of these would require some semi-advanced scripting, but
nothing
> that couldn't be cut and pasted with some trivial modifications for
> each job set.
>
> As to the unavailability of the admin ...
> An alternate approach that would require the admin's help would
be to
> come up with a small set of alocations (e.g., 40 gpus, 80 gpus, 100
> gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> maxtrespu=gpu=40 ) Then the user would assign that QOS to the
job when
> starting it to set the overall allocation for all the jobs.  The
admin
> woudln't need to tweak this except once, you just pick which
tweak to
> use.
>
> On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
> <gperr...@uottawa.ca> wrote:
>> Hi Steven,
>>
>> Thanks for taking the time to reply to my post.
>>
>> Setting a limit on the number of jobs for a single array isn't
sufficient because regression-tests need to launch multiple
arrays, and I would need a job limit that would take effect over
all launched jobs.
>>
>> It's very possible I'm not understand something. I'll lay out a
very specific example in the hopes you can correct me if I've gone
wrong somewhere.
>>
>> Let's take the small cluster with 140 GPUs and no fairshare as
an example, because it's easier for me to explain.
>>
>> The users, who all know each other personally and interact via
chat, decide on a daily basis how many jobs each user can run at a
time.
>>
>> Let's say today is Sunday (hypothetically). Nobody is actively
developing today, except that user 1 has 10 jobs running for the
entire weekend. That leaves 130 GPUs unused.
>>
>> User 2, whose jobs all run on 1 GPU decides to run a regression
test. The regression test comprises of 9 different scripts each
run 40 times, for a grand total of 360 jobs. The duration of the
scripts vary from 1 and 5 hours to complete, and the jobs take on
average 4 hours to complete.
>>
>> User 2 gets the user group's approval (via chat) to use 90 GPUs
(so that 40 GPUs will remain for anyone else wanting to work that
day).
>>
>> The problem I'm trying to solve is this: how do I ensure that
user 2 launches his 360 jobs in such a way that 90 jobs are in the
run state consistently until the regression test is finished?
>>
>> Keep in mind that:
>>
>> limiting each job array to 10 jobs is inefficient: when the
first job array finishes (long before the last one), only 80 GPUs
will be used, and so on as other arrays finish
>> the admin is not available, he cannot be asked to set a hard
limit of 90 jobs for user 2 just for today
>>
>> I would be happy to use job arrays if they allow me to set an
overarching job limit across multiple arrays. Perhaps this is
doable. Admttedly I'm working on a paper to be submitted in a few
days, so I don't have time to test jobs arrays thoroughly, but I
will try out job arrays more thoroughly once I've submitted my
paper (ie after sept 5).
>>
>> My solution, for now, is to not use job arrays. Instead, I
launch each job individually, and I use sing

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Paul,

Thanks for your pointers.

I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as
expressed to Peter in the reply I just sent, I wonder if the QOS of
a job can be changed while it's pending (submitted but not yet running).

Regards,
Guillaume.

On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon  wrote:

> A QoS is probably your best bet.  Another variant might be MCS, which
> you can use to help reduce resource fragmentation.  For limits though
> QoS will be your best bet.
>
> -Paul Edmon-
>
> On 8/30/19 7:33 AM, Steven Dick wrote:
> > It would still be possible to use job arrays in this situation, it's
> > just slightly messy.
> > So the way a job array works is that you submit a single script, and
> > that script is provided an integer for each subjob.  The integer is in
> > a range, with a possible step (default=1).
> >
> > To run the situation you describe, you would have to predetermine how
> > many of each test you want to run (i.e., you coudln't dynamically
> > change the number of jobs that run within one array)., and a master
> > script would map the integer range to the job that was to be started.
> >
> > The most trivial way to do it would be to put the list of regressions
> > in a text file and the master script would index it by line number and
> > then run the appropriate command.
> > A more complex way would be to do some math (a divide?) to get the
> > script name and subindex (modulus?) for each regression.
> >
> > Both of these would require some semi-advanced scripting, but nothing
> > that couldn't be cut and pasted with some trivial modifications for
> > each job set.
> >
> > As to the unavailability of the admin ...
> > An alternate approach that would require the admin's help would be to
> > come up with a small set of alocations (e.g., 40 gpus, 80 gpus, 100
> > gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> > maxtrespu=gpu=40 ) Then the user would assign that QOS to the job when
> > starting it to set the overall allocation for all the jobs.  The admin
> > woudln't need to tweak this except once, you just pick which tweak to
> > use.
> >
> > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
> >  wrote:
> >> Hi Steven,
> >>
> >> Thanks for taking the time to reply to my post.
> >>
> >> Setting a limit on the number of jobs for a single array isn't
> sufficient because regression-tests need to launch multiple arrays, and I
> would need a job limit that would take effect over all launched jobs.
> >>
> >> It's very possible I'm not understand something. I'll lay out a very
> specific example in the hopes you can correct me if I've gone wrong
> somewhere.
> >>
> >> Let's take the small cluster with 140 GPUs and no fairshare as an
> example, because it's easier for me to explain.
> >>
> >> The users, who all know each other personally and interact via chat,
> decide on a daily basis how many jobs each user can run at a time.
> >>
> >> Let's say today is Sunday (hypothetically). Nobody is actively
> developing today, except that user 1 has 10 jobs running for the entire
> weekend. That leaves 130 GPUs unused.
> >>
> >> User 2, whose jobs all run on 1 GPU decides to run a regression test.
> The regression test comprises of 9 different scripts each run 40 times, for
> a grand total of 360 jobs. The duration of the scripts vary from 1 and 5
> hours to complete, and the jobs take on average 4 hours to complete.
> >>
> >> User 2 gets the user group's approval (via chat) to use 90 GPUs (so
> that 40 GPUs will remain for anyone else wanting to work that day).
> >>
> >> The problem I'm trying to solve is this: how do I ensure that user 2
> launches his 360 jobs in such a way that 90 jobs are in the run state
> consistently until the regression test is finished?
> >>
> >> Keep in mind that:
> >>
> >> limiting each job array to 10 jobs is inefficient: when the first job
> array finishes (long before the last one), only 80 GPUs will be used, and
> so on as other arrays finish
> >> the admin is not available, he cannot be asked to set a hard limit of
> 90 jobs for user 2 just for today
> >>
> >> I would be happy to use job arrays if they allow me to set an
> overarching job limit across multiple arrays. Perhaps this is doable.
> Admttedly I'm working on a paper to be submitted in a few days, so I don't
> have time to test jobs arrays thoroughly, but I will try out job arrays
> more thoroughly once I've submitted my paper (ie after sept 5).
> >>
> >> My solution, for now, is to not use job arrays. Instead, I launch each
> job individually, and I use singleton (by launching all jobs with the same
> 90 unique names) to ensure that exactly 90 jobs are run at a time (in this
> case, corresponding to 90 GPUs in use).
> >>
> >> Side note: the unavailability of the admin might sound contrived by
> picking Sunday as an example, but it's in fact very typical. The admin is
> not available:
> >>
> >> on weekends (the present example)
> >> at any time ou

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Steven,

Those both sound like potentially good solutions.

So basically, you're saying that if I script it properly, I can use a
single job array to launch multiple scripts by using a master sbatch script.

My problem with that, though, is what if the scripts (the 9 scripts in my
earlier example) each have different requirements? For example, running on a
different partition, or with a different time limit? My understanding is
that for a single job array, each job gets the same requirements.

The other problem is that with the way I've implemented it, I can change
the max jobs dynamically.

I'll illustrate this using my earlier example. Suppose user 2 launches his
360 jobs with a 90 job limit (leaving 40 unused GPUs), and then user 3
realizes he needs to use 45 GPUs.

User 2 decides to drop his usage to 45 max jobs.

He can simply change the names of his pending singleton jobs to use 45
unique names, so that his max jobs drop from 90 to 45 (I wrote a script to
do that, so it's a one-liner for user 2).
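Presumably that rename is done with scontrol; a hypothetical one-liner (job ID
and name are placeholders, and the exact option spelling may differ by
version):

    # fold a pending singleton job into one of the 45 remaining name slots
    scontrol update JobId=123456 Name=regress7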

Can the max job limit be modified after submission time using one big job
array?

In the docs, the '%' separator limits the number of concurrent jobs, e.g.
"--array=0-15%4". I could be wrong, but this sounds like a submit-time-only
option that cannot be changed after submission.
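For what it's worth, recent Slurm versions appear to let you change that
throttle on a pending array with scontrol (untested here; the job ID is a
placeholder):

    # drop a submitted array's concurrency limit from 90 to 45
    scontrol update JobId=123456 ArrayTaskThrottle=45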

I also kind of like the idea of various QOSes for different job limits. I'm not
sure I'll be able to get the admin on board, but I'll bring it up. Even if I do
get them on board, will I have the same problem of locking in the max limit at
submit time?

Can you change the QOS of a job when it's still pending?

Thanks a lot for your help!

Regards,
Guillaume

On Fri, Aug 30, 2019 at 7:36 AM Steven Dick  wrote:

> It would still be possible to use job arrays in this situation, it's
> just slightly messy.
> So the way a job array works is that you submit a single script, and
> that script is provided an integer for each subjob.  The integer is in
> a range, with a possible step (default=1).
>
> To run the situation you describe, you would have to predetermine how
> many of each test you want to run (i.e., you coudln't dynamically
> change the number of jobs that run within one array)., and a master
> script would map the integer range to the job that was to be started.
>
> The most trivial way to do it would be to put the list of regressions
> in a text file and the master script would index it by line number and
> then run the appropriate command.
> A more complex way would be to do some math (a divide?) to get the
> script name and subindex (modulus?) for each regression.
>
> Both of these would require some semi-advanced scripting, but nothing
> that couldn't be cut and pasted with some trivial modifications for
> each job set.
>
> As to the unavailability of the admin ...
> An alternate approach that would require the admin's help would be to
> come up with a small set of alocations (e.g., 40 gpus, 80 gpus, 100
> gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> maxtrespu=gpu=40 ) Then the user would assign that QOS to the job when
> starting it to set the overall allocation for all the jobs.  The admin
> woudln't need to tweak this except once, you just pick which tweak to
> use.
>
> On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
>  wrote:
> >
> > Hi Steven,
> >
> > Thanks for taking the time to reply to my post.
> >
> > Setting a limit on the number of jobs for a single array isn't
> sufficient because regression-tests need to launch multiple arrays, and I
> would need a job limit that would take effect over all launched jobs.
> >
> > It's very possible I'm not understand something. I'll lay out a very
> specific example in the hopes you can correct me if I've gone wrong
> somewhere.
> >
> > Let's take the small cluster with 140 GPUs and no fairshare as an
> example, because it's easier for me to explain.
> >
> > The users, who all know each other personally and interact via chat,
> decide on a daily basis how many jobs each user can run at a time.
> >
> > Let's say today is Sunday (hypothetically). Nobody is actively
> developing today, except that user 1 has 10 jobs running for the entire
> weekend. That leaves 130 GPUs unused.
> >
> > User 2, whose jobs all run on 1 GPU decides to run a regression test.
> The regression test comprises of 9 different scripts each run 40 times, for
> a grand total of 360 jobs. The duration of the scripts vary from 1 and 5
> hours to complete, and the jobs take on average 4 hours to complete.
> >
> > User 2 gets the user group's approval (via chat) to use 90 GPUs (so that
> 40 GPUs will remain for anyone else wanting to work that day).
> >
> > The problem I'm trying to solve is this: how do I ensure that user 2
> launches his 360 jobs in such a way that 90 jobs are in the run state
> consistently until the regression test is finished?
> >
> > Keep in mind that:
> >
> > limiting each job array to 10 jobs is inefficient: when the first job
> array finishe

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon
A QoS is probably your best bet.  Another variant might be MCS, which 
you can use to help reduce resource fragmentation.  For limits though 
QoS will be your best bet.


-Paul Edmon-

On 8/30/19 7:33 AM, Steven Dick wrote:

It would still be possible to use job arrays in this situation, it's
just slightly messy.
So the way a job array works is that you submit a single script, and
that script is provided an integer for each subjob.  The integer is in
a range, with a possible step (default=1).

To run the situation you describe, you would have to predetermine how
many of each test you want to run (i.e., you coudln't dynamically
change the number of jobs that run within one array)., and a master
script would map the integer range to the job that was to be started.

The most trivial way to do it would be to put the list of regressions
in a text file and the master script would index it by line number and
then run the appropriate command.
A more complex way would be to do some math (a divide?) to get the
script name and subindex (modulus?) for each regression.

Both of these would require some semi-advanced scripting, but nothing
that couldn't be cut and pasted with some trivial modifications for
each job set.

As to the unavailability of the admin ...
An alternate approach that would require the admin's help would be to
come up with a small set of alocations (e.g., 40 gpus, 80 gpus, 100
gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
maxtrespu=gpu=40 ) Then the user would assign that QOS to the job when
starting it to set the overall allocation for all the jobs.  The admin
woudln't need to tweak this except once, you just pick which tweak to
use.

On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
 wrote:

Hi Steven,

Thanks for taking the time to reply to my post.

Setting a limit on the number of jobs for a single array isn't sufficient 
because regression-tests need to launch multiple arrays, and I would need a job 
limit that would take effect over all launched jobs.

It's very possible I'm not understand something. I'll lay out a very specific 
example in the hopes you can correct me if I've gone wrong somewhere.

Let's take the small cluster with 140 GPUs and no fairshare as an example, 
because it's easier for me to explain.

The users, who all know each other personally and interact via chat, decide on 
a daily basis how many jobs each user can run at a time.

Let's say today is Sunday (hypothetically). Nobody is actively developing 
today, except that user 1 has 10 jobs running for the entire weekend. That 
leaves 130 GPUs unused.

User 2, whose jobs all run on 1 GPU decides to run a regression test. The 
regression test comprises of 9 different scripts each run 40 times, for a grand 
total of 360 jobs. The duration of the scripts vary from 1 and 5 hours to 
complete, and the jobs take on average 4 hours to complete.

User 2 gets the user group's approval (via chat) to use 90 GPUs (so that 40 
GPUs will remain for anyone else wanting to work that day).

The problem I'm trying to solve is this: how do I ensure that user 2 launches 
his 360 jobs in such a way that 90 jobs are in the run state consistently until 
the regression test is finished?

Keep in mind that:

limiting each job array to 10 jobs is inefficient: when the first job array 
finishes (long before the last one), only 80 GPUs will be used, and so on as 
other arrays finish
the admin is not available, he cannot be asked to set a hard limit of 90 jobs 
for user 2 just for today

I would be happy to use job arrays if they allow me to set an overarching job 
limit across multiple arrays. Perhaps this is doable. Admttedly I'm working on 
a paper to be submitted in a few days, so I don't have time to test jobs arrays 
thoroughly, but I will try out job arrays more thoroughly once I've submitted 
my paper (ie after sept 5).

My solution, for now, is to not use job arrays. Instead, I launch each job 
individually, and I use singleton (by launching all jobs with the same 90 
unique names) to ensure that exactly 90 jobs are run at a time (in this case, 
corresponding to 90 GPUs in use).

Side note: the unavailability of the admin might sound contrived by picking 
Sunday as an example, but it's in fact very typical. The admin is not available:

on weekends (the present example)
at any time outside of 9am to 5pm (keep in mind, this is a cluster used by 
students in different time zones)
any time he is on vacation
anytime the he is looking after his many other responsibilities. Constantly 
setting user limits that change on a daily basis would be too much too ask.


I'd be happy if you corrected my misunderstandings, especially if you could 
show me how to set a job limit that takes effect over multiple job arrays.

I may have very glaring oversights as I don't necessarily have a big picture 
view of things (I've never been an admin, most notably), so feel free to poke 
holes at the way I've constructed things.

Regards,
Gu

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
It would still be possible to use job arrays in this situation, it's
just slightly messy.
So the way a job array works is that you submit a single script, and
that script is provided an integer for each subjob.  The integer is in
a range, with a possible step (default=1).

To run the situation you describe, you would have to predetermine how
many of each test you want to run (i.e., you couldn't dynamically
change the number of jobs that run within one array), and a master
script would map the integer range to the job that was to be started.

The most trivial way to do it would be to put the list of regressions
in a text file and the master script would index it by line number and
then run the appropriate command.
A more complex way would be to do some math (a divide?) to get the
script name and subindex (modulus?) for each regression.

Both of these would require some semi-advanced scripting, but nothing
that couldn't be cut and pasted with some trivial modifications for
each job set.
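A minimal sketch of the text-file variant (the file name regressions.txt and
the array bounds are made up for illustration):

    #!/bin/bash
    #SBATCH --array=1-360%90          # 360 regressions, at most 90 running at once
    # regressions.txt holds one regression command per line
    cmd=$(sed -n "${SLURM_ARRAY_TASK_ID}p" regressions.txt)
    eval "$cmd"

    # the arithmetic variant would instead do something like:
    #   script=$(( (SLURM_ARRAY_TASK_ID - 1) / 40 ))
    #   rep=$(( (SLURM_ARRAY_TASK_ID - 1) % 40 ))
    # and dispatch on $script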

As to the unavailability of the admin ...
An alternate approach that would require the admin's help would be to
come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
maxtrespu=gpu=40). Then the user would assign that QOS to the job when
starting it to set the overall allocation for all the jobs.  The admin
wouldn't need to tweak this more than once; the user just picks which one to
use.

On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
 wrote:
>
> Hi Steven,
>
> Thanks for taking the time to reply to my post.
>
> Setting a limit on the number of jobs for a single array isn't sufficient 
> because regression-tests need to launch multiple arrays, and I would need a 
> job limit that would take effect over all launched jobs.
>
> It's very possible I'm not understand something. I'll lay out a very specific 
> example in the hopes you can correct me if I've gone wrong somewhere.
>
> Let's take the small cluster with 140 GPUs and no fairshare as an example, 
> because it's easier for me to explain.
>
> The users, who all know each other personally and interact via chat, decide 
> on a daily basis how many jobs each user can run at a time.
>
> Let's say today is Sunday (hypothetically). Nobody is actively developing 
> today, except that user 1 has 10 jobs running for the entire weekend. That 
> leaves 130 GPUs unused.
>
> User 2, whose jobs all run on 1 GPU decides to run a regression test. The 
> regression test comprises of 9 different scripts each run 40 times, for a 
> grand total of 360 jobs. The duration of the scripts vary from 1 and 5 hours 
> to complete, and the jobs take on average 4 hours to complete.
>
> User 2 gets the user group's approval (via chat) to use 90 GPUs (so that 40 
> GPUs will remain for anyone else wanting to work that day).
>
> The problem I'm trying to solve is this: how do I ensure that user 2 launches 
> his 360 jobs in such a way that 90 jobs are in the run state consistently 
> until the regression test is finished?
>
> Keep in mind that:
>
> limiting each job array to 10 jobs is inefficient: when the first job array 
> finishes (long before the last one), only 80 GPUs will be used, and so on as 
> other arrays finish
> the admin is not available, he cannot be asked to set a hard limit of 90 jobs 
> for user 2 just for today
>
> I would be happy to use job arrays if they allow me to set an overarching job 
> limit across multiple arrays. Perhaps this is doable. Admttedly I'm working 
> on a paper to be submitted in a few days, so I don't have time to test jobs 
> arrays thoroughly, but I will try out job arrays more thoroughly once I've 
> submitted my paper (ie after sept 5).
>
> My solution, for now, is to not use job arrays. Instead, I launch each job 
> individually, and I use singleton (by launching all jobs with the same 90 
> unique names) to ensure that exactly 90 jobs are run at a time (in this case, 
> corresponding to 90 GPUs in use).
>
> Side note: the unavailability of the admin might sound contrived by picking 
> Sunday as an example, but it's in fact very typical. The admin is not 
> available:
>
> on weekends (the present example)
> at any time outside of 9am to 5pm (keep in mind, this is a cluster used by 
> students in different time zones)
> any time he is on vacation
> anytime the he is looking after his many other responsibilities. Constantly 
> setting user limits that change on a daily basis would be too much too ask.
>
>
> I'd be happy if you corrected my misunderstandings, especially if you could 
> show me how to set a job limit that takes effect over multiple job arrays.
>
> I may have very glaring oversights as I don't necessarily have a big picture 
> view of things (I've never been an admin, most notably), so feel free to poke 
> holes at the way I've constructed things.
>
> Regards,
> Guillaume.
>
>
> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick  wrote:
>>
>> This makes no sense and

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Guillaume Perrault Archambault
Hi Steven,

Thanks for taking the time to reply to my post.

Setting a limit on the number of jobs for a single array isn't sufficient
because regression-tests need to launch multiple arrays, and I would need a
job limit that would take effect over all launched jobs.

It's very possible I'm not understanding something. I'll lay out a very
specific example in the hopes you can correct me if I've gone wrong
somewhere.

Let's take the small cluster with 140 GPUs and no fairshare as an example,
because it's easier for me to explain.

The users, who all know each other personally and interact via chat, decide
on a daily basis how many jobs each user can run at a time.

Let's say today is Sunday (hypothetically). Nobody is actively developing
today, except that user 1 has 10 jobs running for the entire weekend. That
leaves 130 GPUs unused.

User 2, whose jobs all run on 1 GPU, decides to run a regression test. The
regression test consists of 9 different scripts each run 40 times, for a
grand total of 360 jobs. The scripts take between 1 and 5 hours to complete,
about 4 hours on average.

User 2 gets the user group's approval (via chat) to use 90 GPUs (so that 40
GPUs will remain for anyone else wanting to work that day).

The problem I'm trying to solve is this: how do I ensure that user 2
launches his 360 jobs in such a way that 90 jobs are in the run state
consistently until the regression test is finished?

Keep in mind that:

   - limiting each job array to 10 jobs is inefficient: when the first job
   array finishes (long before the last one), only 80 GPUs will be used, and
   so on as other arrays finish
   - the admin is not available, he cannot be asked to set a hard limit of
   90 jobs for user 2 just for today

I would be happy to use job arrays if they allow me to set an overarching
job limit across multiple arrays. Perhaps this is doable. Admittedly I'm
working on a paper to be submitted in a few days, so I don't have time to
test job arrays thoroughly, but I will try out job arrays more thoroughly
once I've submitted my paper (i.e. after Sept 5).

My solution, for now, is to not use job arrays. Instead, I launch each job
individually, and I use singleton (by launching all jobs with the same 90
unique names) to ensure that exactly 90 jobs are run at a time (in this
case, corresponding to 90 GPUs in use).
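A stripped-down sketch of that launch loop (the wrapper script name and the
GPU option are placeholders):

    # submit 360 single-GPU jobs but keep only 90 running at once,
    # by cycling through 90 unique job names with --dependency=singleton
    for i in $(seq 0 359); do
        sbatch --job-name="regress$(( i % 90 ))" \
               --dependency=singleton \
               --gres=gpu:1 run_one.sbatch "$i"
    done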

Side note: the unavailability of the admin might sound contrived by picking
Sunday as an example, but it's in fact very typical. The admin is not
available:

   - on weekends (the present example)
   - at any time outside of 9am to 5pm (keep in mind, this is a cluster
   used by students in different time zones)
   - any time he is on vacation
   - anytime he is looking after his many other responsibilities.
   Constantly setting user limits that change on a daily basis would be too
   much to ask.


I'd be happy if you corrected my misunderstandings, especially if you could
show me how to set a job limit that takes effect over multiple job arrays.

I may have very glaring oversights as I don't necessarily have a big
picture view of things (I've never been an admin, most notably), so feel
free to poke holes at the way I've constructed things.

Regards,
Guillaume.


On Fri, Aug 30, 2019 at 1:22 AM Steven Dick  wrote:

> This makes no sense and seems backwards to me.
>
> When you submit an array job, you can specify how many jobs from the
> array you want to run at once.
> So, an administrator can create a QOS that explicitly limits the user.
> However, you keep saying that they probably won't modify the system
> for just you...
>
> That seems to me to be the perfect case to use array jobs and tell it
> how many elements of the array to run at once.
> You're not using array jobs for exactly the wrong reason.
>
> On Tue, Aug 27, 2019 at 1:19 PM Guillaume Perrault Archambault
>  wrote:
> > The reason I don't use job arrays is to be able limit the number of jobs
> per users
>
>


Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Steven Dick
This makes no sense and seems backwards to me.

When you submit an array job, you can specify how many jobs from the
array you want to run at once.
So, an administrator can create a QOS that explicitly limits the user.
However, you keep saying that they probably won't modify the system
for just you...

That seems to me to be the perfect case to use array jobs and tell it
how many elements of the array to run at once.
You're not using array jobs for exactly the wrong reason.

On Tue, Aug 27, 2019 at 1:19 PM Guillaume Perrault Archambault
 wrote:
> The reason I don't use job arrays is to be able limit the number of jobs per 
> users



Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Mark Hahn

Here's an example on how to do so from the Compute Canada docs:
https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes


[name@server ~]$ parallel --jobs 32 --sshloginfile
./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program

To me it looks like you're circumventing the scheduler when you do this;
maybe I'm missing something?


our (ComputeCanada) setup includes slurm_adopt, so if a user sshes to a 
node on which they have resources, any processes get put into the job's 
cgroup.  we don't really care how the user consumes the resources, as long

as it's only what's allocated to their jobs, doesn't interfere with other
users, and is hopefully reasonably efficient.  heck, we configure clusters
with hostbased trust, so it's easy for users to ssh among nodes.

regards,
--
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca




Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Jarno van der Kolk
On 8/29/19 12:48 PM, Goetz, Patrick G wrote:
> On 8/29/19 9:38 AM, Jarno van der Kolk wrote:
> > Here's an example on how to do so from the Compute Canada docs:
> > 
> https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes
> >
> 
> [name@server ~]$ parallel --jobs 32 --sshloginfile
> ./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program
> 
> 
> To me it looks like you're circumventing the scheduler when you do this;
> maybe I'm missing something?
> 
> Also, where are these environment variables:
> 
>    SLURM_JOB_NODELIST, SLURM_JOB_ID
> 
> being set?
> 

I guess you kind of are. The advantage of this over array jobs is that you can
provide a list of jobs instead of depending on SLURM_ARRAY_TASK_ID, while still
only doing one submission to the scheduler. So instead of submitting hundreds
or even thousands of little jobs and waiting for the scheduler to accept them
all, you submit once and are done. So parallel functions as a subscheduler, if
you will.

Those environment variables are set when the job starts.
See also 
https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
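For example, a trivial batch script can just print them once it starts (a
sketch, nothing cluster-specific assumed):

    #!/bin/bash
    #SBATCH --ntasks=4
    # both variables are populated by Slurm when the job starts
    echo "job id:    $SLURM_JOB_ID"
    echo "node list: $SLURM_JOB_NODELIST"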

Regards,
Jarno

Jarno van der Kolk, PhD Phys.
Analyste principal en informatique scientifique | Senior Scientific Computing 
Specialist
Solutions TI | IT Solutions
Université d’Ottawa | University of Ottawa



Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Goetz, Patrick G
On 8/29/19 9:38 AM, Jarno van der Kolk wrote:
> Here's an example on how to do so from the Compute Canada docs:
> https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes
> 

[name@server ~]$ parallel --jobs 32 --sshloginfile 
./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program


To me it looks like you're circumventing the scheduler when you do this; 
maybe I'm missing something?

Also, where are these environment variables:

   SLURM_JOB_NODELIST, SLURM_JOB_ID

being set?





Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Jarno van der Kolk
On 8/29/19 10:15 AM, Goetz, Patrick G wrote:
> On 8/27/19 11:47 AM, Brian Andrus wrote:
> > 1) If you can, either use xargs or parallel to do the forking so you can
> > limit the number of simultaneous submissions
> >
> 
> Sorry if this is a naive question, but I'm not following how you would
> use parallel with Slurm (unless you're talking about using it on a
> single node).  Parallel is what my non-Slurm users use to
> parallelize/distribute jobs.

Here's an example on how to do so from the Compute Canada docs:
https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes

It uses the --sshlogin parameter for parallel combined with SLURM_JOB_NODELIST.
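The node-list file referenced in that example is typically generated inside the
job itself; a plausible sketch (the file name just mirrors the docs example):

    # expand the compact node list Slurm provides into one hostname per line,
    # which is the format parallel's --sshloginfile expects
    scontrol show hostnames "$SLURM_JOB_NODELIST" > "./node_list_${SLURM_JOB_ID}"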

Jarno van der Kolk, PhD Phys.
Analyste principal en informatique scientifique | Senior Scientific Computing 
Specialist
Solutions TI | IT Solutions
Université d’Ottawa | University of Ottawa



Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Goetz, Patrick G
On 8/27/19 11:47 AM, Brian Andrus wrote:
> 1) If you can, either use xargs or parallel to do the forking so you can 
> limit the number of simultaneous submissions
> 

Sorry if this is a naive question, but I'm not following how you would 
use parallel with Slurm (unless you're talking about using it on a 
single node).  Parallel is what my non-Slurm users use to 
parallelize/distribute jobs.






Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Hi Brian,

Thanks a lot for your recommendations.

I'll do my best to address your three points inline. I hope I've
understood you correctly; please correct me if I've misunderstood parts.

"1) If you can, either use xargs or parallel to do the forking so you can
limit the number of simultaneous submissions  "
I think that's a great idea! I will look into it.

"2) I have yet to see where it is a good idea to have many separate jobs
when using an array can work.

If you can prep up a proper input file for a script, a single
submission is all it takes. Then you can control how many are currently
running (MaxArrayTask) and can change that to scale up/down"

The reason I don't use job arrays is to be able to limit the number of jobs
per user (as explained, not sure how well, in my reply to Paul's second
message).

Perhaps my case doesn't count as an exception to the "good idea" rule because
there are good methods to limit the jobs run per user at the job array level;
I'm still trying to figure out whether there are.

"Here is where you may want to look into slurmdbd and sacct

Then you can create a qos that has MaxJobsPerUser to limit the total number
running on a per-user basis: https://slurm.schedmd.com/resource_limits.html"

Correct me if I'm wrong, but this is an admin-side solution, right? I
cannot create a QOS as a user?

I'm trying to implement a user-side solution so that the user can limit the
number of jobs.

I use two clusters (or two sets of clusters to be exact), and this is
valuable on both for (slightly) different reasons:

1) A small-ish cluster of about 140 GPUs that does not use fairshare, and
where they do not want to set a hard limit on the number of jobs per user
because it may change over time, especially around conference deadlines.
But they do want users to have the ability to self-manage the number of
jobs they run, hence my script doing this from the user side.

2) A large cluster that uses fairshare. But the fairshare is at the
association level. So within an association, we want users to be able to
limit the number of jobs they run, similar to cluster 1.

I hope I've understood your suggestions and replied on-topic. My apologies
if I've misunderstood anything.

Regards,

Guillaume.

On Tue, Aug 27, 2019 at 12:53 PM Brian Andrus  wrote:

> Here is where you may want to look into slurmdbd and sacct
>
> Then you can create a qos that has MaxJobsPerUser to limit the total
> number running on a per-user basis:
> https://slurm.schedmd.com/resource_limits.html
>
> Brian Andrus
> On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote:
>
> Hi Paul,
>
> Your comment confirms my worst fear, that I should either implement job
> arrays or stick to a sequential for loop.
>
> My problem with job arrays is that, as far as I understand them, they
> cannot be used with singleton to set a max job limit.
>
> I use singleton to limit the number of jobs a user can be running at a
> time. For example if the limit is 3 jobs per user and the user launches 10
> jobs, the sbatch submissions via my scripts may look this:
> sbatch --job-name=job1 [OPTIONS SET1] Dependency=singleton my.sbatch
> sbatch --job-name=job2 [OTHER  SET1] Dependency=singleton my.sbatch
> sbatch --job-name=job3 [OTHER SET1] Dependency=singleton my.sbatch
> sbatch --job-name=job1 [OTHER SET1 Dependency=singleton my.sbatch
> sbatch --job-name=job2 [OTHER  SET1 ] Dependency=singleton my.sbatch
> sbatch --job-name=job3 [OTHER  SET2] Dependency=singleton my.sbatch2
> sbatch --job-name=job1 [OTHER  SET2] Dependency=singleton my.sbatch2
> sbatch --job-name=job2 [OTHER  SET2 ] Dependency=singleton my.sbatch2
> sbatch --job-name=job2 [OTHER  SET2 ] Dependency=singleton my.sbatch2
> sbatch --job-name=job1 [OTHER  SET2 ] Dependency=singleton my.sbatch 2
>
> This way, at most 3 jobs will run at a time (ie a job with name job1, a
> job with name job2, and job with name job3).
>
> Notice that my example has two option sets provided to sbatch, so the
> example would be suitable for conversion to two Job Arrays.
>
> This is the problem I can't obercome.
>
> In the job array documentation, I see
> A maximum number of simultaneously running tasks from the job array may be
> specified using a "%" separator. For example "--array=0-15%4" will limit
> the number of simultaneously running tasks from this job array to 4.
>
> But this '%' separator cannot specify a max number of tasks over two (or
> more) separate job arrays, as far as I can tell.
>
> And the job array element names cannot be made to modulo rotate in the way
> they do in my above example.
>
> Perhaps I need to play more with job arrays, and try harder to find a
> solution to limit number of jobs across multiple arrays. Or ask this
> question in a separate post, since it's a bit off topic.
>
> In any case, thanks so much for answer my question. I think it answer my
> original post perfectly :)
>
> Regards,
> Guillaume.
>
> On Tue, Aug 27, 2019 at 10:08 AM Paul Edmon 
> wrote:
>
>> At least for our cluster we generally recomme

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Brian Andrus

Here is where you may want to look into slurmdbd and sacct

Then you can create a qos that has MaxJobsPerUser to limit the total 
number running on a per-user basis: 
https://slurm.schedmd.com/resource_limits.html
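A rough sketch of that setup (QOS name, limit and user name are illustrative
only):

    # admin side:
    sacctmgr add qos cap90
    sacctmgr modify qos cap90 set MaxJobsPerUser=90
    sacctmgr modify user name=user2 set qos+=cap90

    # user side:
    sbatch --qos=cap90 my.sbatch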


Brian Andrus

On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote:

Hi Paul,

Your comment confirms my worst fear, that I should either implement 
job arrays or stick to a sequential for loop.


My problem with job arrays is that, as far as I understand them, they 
cannot be used with singleton to set a max job limit.


I use singleton to limit the number of jobs a user can be running at a 
time. For example if the limit is 3 jobs per user and the user 
launches 10 jobs, the sbatch submissions via my scripts may look this:

sbatch --job-name=job1 [OPTIONS SET1] Dependency=singleton my.sbatch
sbatch --job-name=job2 [OTHER  SET1] Dependency=singleton my.sbatch
sbatch --job-name=job3 [OTHER SET1] Dependency=singleton my.sbatch
sbatch --job-name=job1 [OTHER SET1 Dependency=singleton my.sbatch
sbatch --job-name=job2 [OTHER SET1 ] Dependency=singleton my.sbatch
sbatch --job-name=job3 [OTHER SET2] Dependency=singleton my.sbatch2
sbatch --job-name=job1 [OTHER SET2] Dependency=singleton my.sbatch2
sbatch --job-name=job2 [OTHER SET2 ] Dependency=singleton my.sbatch2
sbatch --job-name=job2 [OTHER SET2 ] Dependency=singleton my.sbatch2
sbatch --job-name=job1 [OTHER SET2 ] Dependency=singleton my.sbatch 2

This way, at most 3 jobs will run at a time (ie a job with name job1, 
a job with name job2, and job with name job3).


Notice that my example has two option sets provided to sbatch, so the 
example would be suitable for conversion to two Job Arrays.


This is the problem I can't obercome.

In the job array documentation, I see
A maximum number of simultaneously running tasks from the job array 
may be specified using a "%" separator. For example "--array=0-15%4" 
will limit the number of simultaneously running tasks from this job 
array to 4.


But this '%' separator cannot specify a max number of tasks over two 
(or more) separate job arrays, as far as I can tell.


And the job array element names cannot be made to modulo rotate in the 
way they do in my above example.


Perhaps I need to play more with job arrays, and try harder to find a 
solution to limit number of jobs across multiple arrays. Or ask this 
question in a separate post, since it's a bit off topic.


In any case, thanks so much for answer my question. I think it answer 
my original post perfectly :)


Regards,
Guillaume.

On Tue, Aug 27, 2019 at 10:08 AM Paul Edmon wrote:


At least for our cluster we generally recommend that if you are
submitting large numbers of jobs you either use a job array or you
just for loop over the jobs you want to submit.  A fork bomb is
definitely not recommended.  For highest throughput submission a
job array is your best bet as in one submission it will generate
thousands of jobs which then the scheduler can handle sensibly. 
So I highly recommend using job arrays.

-Paul Edmon-

On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote:

Hi Paul,

Thanks a lot for your suggestion.

The cluster I'm using has thousands of users, so I'm doubtful the
admins will change this setting just for me. But I'll mention it
to the support team I'm working with.

I was hoping more for something that can be done on the user end.

Is there some way for the user to measure whether the scheduler
is in RPC saturation? And then if it is, I could make sure my
script doesn't launch too many jobs in parallel.

Sorry if my question is too vague, I don't understand the backend
of the SLURM scheduler too well, so my questions are using the
limited terminology of a user.

My concern is just to make sure that my scripts don't send out
more commands (simultaneously) than the scheduler can handle.

For example, as an extreme scenario, suppose a user forks off
1000 sbatch commands in parallel, is that more than the scheduler
can handle? As a user, how can I know whether it is?

Regards,
Guillaume.



On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon
<ped...@cfa.harvard.edu> wrote:

We've hit this before due to RPC saturation.  I highly
recommend using max_rpc_cnt and/or defer for scheduling. 
That should help alleviate this problem.

-Paul Edmon-

On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:

Hello,

I wrote a regression-testing toolkit to manage large numbers
of SLURM jobs and their output (the toolkit can be found
here 
if anyone is interested).

To make job launching faster, sbatch commands are forked, so
that numerous jobs may be submitted in parallel.

We (the cluster admin and myself) are concerned that t

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Brian Andrus

Just a couple comments from experience in general:

1) If you can, either use xargs or parallel to do the forking so you can 
limit the number of simultaneous submissions (see the sketch after these points)


2) I have yet to see where it is a good idea to have many separate jobs 
when using an array can work.


    If you can prep up a proper input file for a script, a single 
submission is all it takes. Then you can control how many are currently 
running (MaxArrayTask) and can change that to scale up/down.
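As a sketch of point 1 (the directory and concurrency are made up), either of
these keeps at most four sbatch calls in flight at a time:

    printf '%s\n' jobs/*.sbatch | xargs -n 1 -P 4 sbatch
    # or, with GNU parallel:
    parallel -j 4 sbatch {} ::: jobs/*.sbatch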



Brian Andrus


On 8/25/2019 11:12 PM, Guillaume Perrault Archambault wrote:

Hello,

I wrote a regression-testing toolkit to manage large numbers of SLURM 
jobs and their output (the toolkit can be found here 
 if anyone is 
interested).


To make job launching faster, sbatch commands are forked, so that 
numerous jobs may be submitted in parallel.


We (the cluster admin and myself) are concerned that this may cause 
unresponsiveness for other users.


I cannot say for sure since I don't have visibility over all users of 
the cluster, but unresponsiveness doesn't seem to have occurred so 
far. That being said, the fact that it hasn't occurred yet doesn't 
mean it won't in the future. So I'm treating this as a ticking time 
bomb to be fixed asap.


My questions are the following:
1) Does anyone have experience with large numbers of jobs submitted in 
parallel? What are the limits that can be hit? For example is there 
some hard limit on how many jobs a SLURM scheduler can handle before 
blacking out / slowing down?

2) Is there a way for me to find/measure/ping this resource limit?
3) How can I make sure I don't hit this resource limit?

From what I've observed, parallel submission can improve submission 
time by a factor at least 10x. This can make a big difference in 
users' workflows.


For that reason I would like to keep the option of launching jobs 
sequentially as a last resort.


Thanks in advance.

Regards,
Guillaume.


Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Hi Paul,

Your comment confirms my worst fear, that I should either implement job
arrays or stick to a sequential for loop.

My problem with job arrays is that, as far as I understand them, they
cannot be used with singleton to set a max job limit.

I use singleton to limit the number of jobs a user can be running at a
time. For example, if the limit is 3 jobs per user and the user launches 10
jobs, the sbatch submissions via my scripts may look like this:
sbatch --job-name=job1 [OPTION SET 1] --dependency=singleton my.sbatch
sbatch --job-name=job2 [OPTION SET 1] --dependency=singleton my.sbatch
sbatch --job-name=job3 [OPTION SET 1] --dependency=singleton my.sbatch
sbatch --job-name=job1 [OPTION SET 1] --dependency=singleton my.sbatch
sbatch --job-name=job2 [OPTION SET 1] --dependency=singleton my.sbatch
sbatch --job-name=job3 [OPTION SET 2] --dependency=singleton my.sbatch2
sbatch --job-name=job1 [OPTION SET 2] --dependency=singleton my.sbatch2
sbatch --job-name=job2 [OPTION SET 2] --dependency=singleton my.sbatch2
sbatch --job-name=job2 [OPTION SET 2] --dependency=singleton my.sbatch2
sbatch --job-name=job1 [OPTION SET 2] --dependency=singleton my.sbatch2

This way, at most 3 jobs will run at a time (i.e. one job named job1, one
named job2, and one named job3).

Notice that my example has two option sets provided to sbatch, so the
example would be suitable for conversion to two Job Arrays.

This is the problem I can't overcome.

In the job array documentation, I see:
A maximum number of simultaneously running tasks from the job array may be
specified using a "%" separator. For example "--array=0-15%4" will limit
the number of simultaneously running tasks from this job array to 4.

But this '%' separator cannot specify a max number of tasks over two (or
more) separate job arrays, as far as I can tell.

And the job array task names cannot be made to rotate modulo-style the way
the job names do in my example above.

Perhaps I need to play more with job arrays, and try harder to find a
solution to limit the number of jobs across multiple arrays. Or ask this
question in a separate post, since it's a bit off topic.
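
For what it's worth, within a single array the throttle can at least be set
per array and adjusted after submission; a minimal sketch (the array sizes and
the %3 limit are only illustrative):

    # two arrays, each limited to 3 simultaneously running tasks
    jid1=$(sbatch --parsable --array=0-9%3 my.sbatch)
    jid2=$(sbatch --parsable --array=0-9%3 my.sbatch2)

    # raise or lower the throttle on a running array later
    scontrol update JobId=$jid1 ArrayTaskThrottle=5

But as noted above, that still limits each array separately rather than
capping the total across both.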

In any case, thanks so much for answering my question. I think it answers my
original post perfectly :)

Regards,
Guillaume.

On Tue, Aug 27, 2019 at 10:08 AM Paul Edmon  wrote:

> At least for our cluster we generally recommend that if you are submitting
> large numbers of jobs you either use a job array or you just for loop over
> the jobs you want to submit.  A fork bomb is definitely not recommended.
> For highest throughput submission a job array is your best bet as in one
> submission it will generate thousands of jobs which then the scheduler can
> handle sensibly.  So I highly recommend using job arrays.
>
> -Paul Edmon-
> On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote:
>
> Hi Paul,
>
> Thanks a lot for your suggestion.
>
> The cluster I'm using has thousands of users, so I'm doubtful the admins
> will change this setting just for me. But I'll mention it to the support
> team I'm working with.
>
> I was hoping more for something that can be done on the user end.
>
> Is there some way for the user to measure whether the scheduler is in RPC
> saturation? And then if it is, I could make sure my script doesn't launch
> too many jobs in parallel.
>
> Sorry if my question is too vague, I don't understand the backend of the
> SLURM scheduler too well, so my questions are using the limited terminology
> of a user.
>
> My concern is just to make sure that my scripts don't send out more
> commands (simultaneously) than the scheduler can handle.
>
> For example, as an extreme scenario, suppose a user forks off 1000 sbatch
> commands in parallel, is that more than the scheduler can handle? As a
> user, how can I know whether it is?
>
> Regards,
> Guillaume.
>
>
>
> On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon 
> wrote:
>
>> We've hit this before due to RPC saturation.  I highly recommend using
>> max_rpc_cnt and/or defer for scheduling.  That should help alleviate this
>> problem.
>>
>> -Paul Edmon-
>> On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:
>>
>> Hello,
>>
>> I wrote a regression-testing toolkit to manage large numbers of SLURM
>> jobs and their output (the toolkit can be found here
>>  if anyone is
>> interested).
>>
>> To make job launching faster, sbatch commands are forked, so that
>> numerous jobs may be submitted in parallel.
>>
>> We (the cluster admin and myself) are concerned that this may cause
>> unresponsiveness for other users.
>>
>> I cannot say for sure since I don't have visibility over all users of the
>> cluster, but unresponsiveness doesn't seem to have occurred so far. That
>> being said, the fact that it hasn't occurred yet doesn't mean it won't in
>> the future. So I'm treating this as a ticking time bomb to be fixed asap.
>>
>> My questions are the following:
>> 1) Does anyone have experience with large numbers of jobs submitted in
>> parallel? What are the limits that can be hit? For example is there some
>> hard limit on how many jobs a SLURM scheduler can handle before blacking
>> out / slowing down?

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Thanks Ole for giving so much thought to my question. I'll pass along
these suggestions. Unfortunately as a user there's not a whole lot I can do
about the choice of hardware.

Thanks for the link to the guide, I'll have a look at it. Even as a user
it's helpful to be well informed on the admin side :)

Regards,
Guillaume.

On Tue, Aug 27, 2019 at 4:26 AM Ole Holm Nielsen 
wrote:

> Hi Guillaume,
>
> The performance of the slurmctld server depends strongly on the server
> hardware on which it is running!  This should be taken into account when
> considering your question.
>
> SchedMD recommends that the slurmctld server should have only a few, but
> very fast CPU cores, in order to ensure the best responsiveness.
>
> The file system for /var/spool/slurmctld/ should be mounted on the
> fastest possible disks (SSD or NVMe if possible).
>
> You should also read the Large Cluster Administration Guide at
> https://slurm.schedmd.com/big_sys.html
>
> Furthermore, it may perhaps be a good idea to have the MySQL database
> server installed on a separate server so that it doesn't slow down the
> slurmctld.
>
> Best regards,
> Ole
>
> On 8/27/19 9:45 AM, Guillaume Perrault Archambault wrote:
> > Hi Paul,
> >
> > Thanks a lot for your suggestion.
> >
> > The cluster I'm using has thousands of users, so I'm doubtful the admins
> > will change this setting just for me. But I'll mention it to the support
> > team I'm working with.
> >
> > I was hoping more for something that can be done on the user end.
> >
> > Is there some way for the user to measure whether the scheduler is in
> > RPC saturation? And then if it is, I could make sure my script doesn't
> > launch too many jobs in parallel.
> >
> > Sorry if my question is too vague, I don't understand the backend of the
> > SLURM scheduler too well, so my questions are using the limited
> > terminology of a user.
> >
> > My concern is just to make sure that my scripts don't send out more
> > commands (simultaneously) than the scheduler can handle.
> >
> > For example, as an extreme scenario, suppose a user forks off 1000
> > sbatch commands in parallel, is that more than the scheduler can handle?
> > As a user, how can I know whether it is?
>
>


Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Paul Edmon
At least for our cluster we generally recommend that if you are 
submitting large numbers of jobs you either use a job array or you just 
for loop over the jobs you want to submit.  A fork bomb is definitely 
not recommended.  For highest throughput submission a job array is your 
best bet as in one submission it will generate thousands of jobs which 
then the scheduler can handle sensibly. So I highly recommend using job 
arrays.


-Paul Edmon-

On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote:

Hi Paul,

Thanks a lot for your suggestion.

The cluster I'm using has thousands of users, so I'm doubtful the 
admins will change this setting just for me. But I'll mention it to 
the support team I'm working with.


I was hoping more for something that can be done on the user end.

Is there some way for the user to measure whether the scheduler is in 
RPC saturation? And then if it is, I could make sure my script doesn't 
launch too many jobs in parallel.


Sorry if my question is too vague, I don't understand the backend of 
the SLURM scheduler too well, so my questions are using the limited 
terminology of a user.


My concern is just to make sure that my scripts don't send out more 
commands (simultaneously) than the scheduler can handle.


For example, as an extreme scenario, suppose a user forks off 1000 
sbatch commands in parallel, is that more than the scheduler can 
handle? As a user, how can I know whether it is?


Regards,
Guillaume.



On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon wrote:


We've hit this before due to RPC saturation.  I highly recommend
using max_rpc_cnt and/or defer for scheduling. That should help
alleviate this problem.

-Paul Edmon-

On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:

Hello,

I wrote a regression-testing toolkit to manage large numbers of
SLURM jobs and their output (the toolkit can be found here
 if anyone
is interested).

To make job launching faster, sbatch commands are forked, so that
numerous jobs may be submitted in parallel.

We (the cluster admin and myself) are concerned that this may
cause unresponsiveness for other users.

I cannot say for sure since I don't have visibility over all
users of the cluster, but unresponsiveness doesn't seem to have
occurred so far. That being said, the fact that it hasn't
occurred yet doesn't mean it won't in the future. So I'm treating
this as a ticking time bomb to be fixed asap.

My questions are the following:
1) Does anyone have experience with large numbers of jobs
submitted in parallel? What are the limits that can be hit? For
example is there some hard limit on how many jobs a SLURM
scheduler can handle before blacking out / slowing down?
2) Is there a way for me to find/measure/ping this resource limit?
3) How can I make sure I don't hit this resource limit?

From what I've observed, parallel submission can improve
submission time by a factor of at least 10x. This can make a big
difference in users' workflows.

For that reason I would like to keep the option of launching jobs
sequentially as a last resort.

Thanks in advance.

Regards,
Guillaume.




Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Ole Holm Nielsen

Hi Guillaume,

The performance of the slurmctld server depends strongly on the server 
hardware on which it is running!  This should be taken into account when 
considering your question.


SchedMD recommends that the slurmctld server should have only a few, but 
very fast CPU cores, in order to ensure the best responsiveness.


The file system for /var/spool/slurmctld/ should be mounted on the 
fastest possible disks (SSD or NVMe if possible).


You should also read the Large Cluster Administration Guide at 
https://slurm.schedmd.com/big_sys.html


Furthermore, it may perhaps be a good idea to have the MySQL database 
server installed on a separate server so that it doesn't slow down the 
slurmctld.
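
As a rough sketch of that layout (hostnames and paths are placeholders),
slurmctld keeps its state directory on fast local storage while accounting
goes through slurmdbd and MySQL on a separate machine:

    # slurm.conf on the slurmctld host
    StateSaveLocation=/var/spool/slurmctld      # on local SSD/NVMe
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbd-host.example.org  # slurmdbd runs elsewhere

    # slurmdbd.conf on dbd-host.example.org
    StorageType=accounting_storage/mysql
    StorageHost=localhost                       # MySQL next to slurmdbd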


Best regards,
Ole

On 8/27/19 9:45 AM, Guillaume Perrault Archambault wrote:

Hi Paul,

Thanks a lot for your suggestion.

The cluster I'm using has thousands of users, so I'm doubtful the admins 
will change this setting just for me. But I'll mention it to the support 
team I'm working with.


I was hoping more for something that can be done on the user end.

Is there some way for the user to measure whether the scheduler is in 
RPC saturation? And then if it is, I could make sure my script doesn't 
launch too many jobs in parallel.


Sorry if my question is too vague, I don't understand the backend of the 
SLURM scheduler too well, so my questions are using the limited 
terminology of a user.


My concern is just to make sure that my scripts don't send out more 
commands (simultaneously) than the scheduler can handle.


For example, as an extreme scenario, suppose a user forks off 1000 
sbatch commands in parallel, is that more than the scheduler can handle? 
As a user, how can I know whether it is?




Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Hi Paul,

Thanks a lot for your suggestion.

The cluster I'm using has thousands of users, so I'm doubtful the admins
will change this setting just for me. But I'll mention it to the support
team I'm working with.

I was hoping more for something that can be done on the user end.

Is there some way for the user to measure whether the scheduler is in RPC
saturation? And then if it is, I could make sure my script doesn't launch
too many jobs in parallel.

Sorry if my question is too vague, I don't understand the backend of the
SLURM scheduler too well, so my questions are using the limited terminology
of a user.

My concern is just to make sure that my scripts don't send out more
commands (simultaneously) than the scheduler can handle.

For example, as an extreme scenario, suppose a user forks off 1000 sbatch
commands in parallel, is that more than the scheduler can handle? As a
user, how can I know whether it is?

Regards,
Guillaume.



On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon  wrote:

> We've hit this before due to RPC saturation.  I highly recommend using
> max_rpc_cnt and/or defer for scheduling.  That should help alleviate this
> problem.
>
> -Paul Edmon-
> On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:
>
> Hello,
>
> I wrote a regression-testing toolkit to manage large numbers of SLURM jobs
> and their output (the toolkit can be found here
>  if anyone is
> interested).
>
> To make job launching faster, sbatch commands are forked, so that numerous
> jobs may be submitted in parallel.
>
> We (the cluster admin and myself) are concerned that this may cause
> unresponsiveness for other users.
>
> I cannot say for sure since I don't have visibility over all users of the
> cluster, but unresponsiveness doesn't seem to have occurred so far. That
> being said, the fact that it hasn't occurred yet doesn't mean it won't in
> the future. So I'm treating this as a ticking time bomb to be fixed asap.
>
> My questions are the following:
> 1) Does anyone have experience with large numbers of jobs submitted in
> parallel? What are the limits that can be hit? For example is there some
> hard limit on how many jobs a SLURM scheduler can handle before blacking
> out / slowing down?
> 2) Is there a way for me to find/measure/ping this resource limit?
> 3) How can I make sure I don't hit this resource limit?
>
> From what I've observed, parallel submission can improve submission time
by a factor of at least 10x. This can make a big difference in users'
> workflows.
>
> For that reason I would like to keep the option of launching jobs
> sequentially as a last resort.
>
> Thanks in advance.
>
> Regards,
> Guillaume.
>
>


Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-26 Thread Paul Edmon
We've hit this before due to RPC saturation.  I highly recommend using 
max_rpc_cnt and/or defer for scheduling.  That should help alleviate 
this problem.


-Paul Edmon-

On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:

Hello,

I wrote a regression-testing toolkit to manage large numbers of SLURM 
jobs and their output (the toolkit can be found here 
 if anyone is 
interested).


To make job launching faster, sbatch commands are forked, so that 
numerous jobs may be submitted in parallel.


We (the cluster admin and myself) are concerned that this may cause 
unresponsiveness for other users.


I cannot say for sure since I don't have visibility over all users of 
the cluster, but unresponsiveness doesn't seem to have occurred so 
far. That being said, the fact that it hasn't occurred yet doesn't 
mean it won't in the future. So I'm treating this as a ticking time 
bomb to be fixed asap.


My questions are the following:
1) Does anyone have experience with large numbers of jobs submitted in 
parallel? What are the limits that can be hit? For example is there 
some hard limit on how many jobs a SLURM scheduler can handle before 
blacking out / slowing down?

2) Is there a way for me to find/measure/ping this resource limit?
3) How can I make sure I don't hit this resource limit?

From what I've observed, parallel submission can improve submission 
time by a factor of at least 10x. This can make a big difference in 
users' workflows.


For that reason I would like to keep the option of launching jobs 
sequentially as a last resort.


Thanks in advance.

Regards,
Guillaume.