Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault wrote:
> My problem with that, though, is what if each script (the 9 scripts in my
> earlier example) has different requirements? For example, running on a
> different partition, or with a different time limit? My understanding is that
> for a single job array, each job will get the same job requirements.

That's a little messier and may be less suitable for an array job.
However, some of that can be accomplished. You can, for instance,
submit a job to multiple partitions and then use srun within the job
to allocate resources to the individual tasks.
But you get a lot less control over how the resources are spread, so
it might not be workable.
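Roughly (untested, and the partition names, GPU counts, and script name
below are invented for illustration), that pattern looks like:

    #!/bin/bash
    #SBATCH --partition=gpu_short,gpu_long   # whichever partition can start it first
    #SBATCH --gres=gpu:4
    #SBATCH --ntasks=4
    #SBATCH --time=05:00:00                  # one time limit covers every task

    # carve the allocation into one-GPU steps and run them side by side
    for i in 1 2 3 4; do
        srun --exclusive --ntasks=1 --gres=gpu:1 ./regression_"$i".sh &
    done
    wait

The catch is exactly the one you raise: every task inherits the same
partition list and time limit, so per-script requirements get lost.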

> The other problem is that with the way I've implemented it, I can change the 
> max jobs dynamically.

Others have indicated in this thread that a QOS can be changed
dynamically; I don't recall trying that, but if you did, I think you'd
do it with scontrol.
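Untested on my end, but it would presumably be something like this (the
job ID and QOS name are placeholders, and the target QOS has to exist
and be one you're allowed to use):

    scontrol update JobId=123456 QOS=gpu45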



Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Thank you Paul. If the admins do agree to creating various QOS job limits or
GPU limits (e.g. 5, 10, 15, 20, ...), then that could be a powerful solution.
This would allow me to use job arrays.

I still prefer a user-side solution if possible, because I'd like my scripts
to be as cluster-agnostic as possible; not having to task the admins on each
cluster with QOS creation would make porting these scripts across clusters
much easier.

That said, it may well end up being the best solution.

Regards,
Guillaume.

On Fri, Aug 30, 2019 at 3:16 PM Paul Edmon  wrote:

> Yes, QoS's are dynamic.
>
> -Paul Edmon-
> On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:
>
> Hi Paul,
>
> Thanks for your pointers.
>
> I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as
> expressed to Peter in the reply I just sent, I wonder whether the QOS of
> a job can be changed while it's pending (submitted but not yet running).
>
> Regards,
> Guillaume.
>
> On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon 
> wrote:
>
>> A QoS is probably your best bet.  Another variant might be MCS, which
>> you can use to help reduce resource fragmentation.  For limits though
>> QoS will be your best bet.
>>
>> -Paul Edmon-
>>
>> On 8/30/19 7:33 AM, Steven Dick wrote:
>> > It would still be possible to use job arrays in this situation, it's
>> > just slightly messy.
>> > So the way a job array works is that you submit a single script, and
>> > that script is provided an integer for each subjob.  The integer is in
>> > a range, with a possible step (default=1).
>> >
>> > To run the situation you describe, you would have to predetermine how
>> > many of each test you want to run (i.e., you couldn't dynamically
>> > change the number of jobs that run within one array), and a master
>> > script would map the integer range to the job that was to be started.
>> >
>> > The most trivial way to do it would be to put the list of regressions
>> > in a text file and the master script would index it by line number and
>> > then run the appropriate command.
>> > A more complex way would be to do some math (a divide?) to get the
>> > script name and subindex (modulus?) for each regression.
>> >
>> > Both of these would require some semi-advanced scripting, but nothing
>> > that couldn't be cut and pasted with some trivial modifications for
>> > each job set.
>> >
>> > As to the unavailability of the admin ...
>> > An alternate approach that would require the admin's help would be to
>> > come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
>> > gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
>> > maxtrespu=gpu=40). Then the user would assign that QOS to the job when
>> > starting it to set the overall allocation for all the jobs.  The admin
>> > wouldn't need to tweak this more than once; you just pick which QOS to
>> > use.
>> >
>> > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
>> >  wrote:
>> >> Hi Steven,
>> >>
>> >> Thanks for taking the time to reply to my post.
>> >>
>> >> Setting a limit on the number of jobs for a single array isn't
>> sufficient because regression-tests need to launch multiple arrays, and I
>> would need a job limit that would take effect over all launched jobs.
>> >>
>> >> It's very possible I'm not understanding something. I'll lay out a very
>> specific example in the hopes you can correct me if I've gone wrong
>> somewhere.
>> >>
>> >> Let's take the small cluster with 140 GPUs and no fairshare as an
>> example, because it's easier for me to explain.
>> >>
>> >> The users, who all know each other personally and interact via chat,
>> decide on a daily basis how many jobs each user can run at a time.
>> >>
>> >> Let's say today is Sunday (hypothetically). Nobody is actively
>> developing today, except that user 1 has 10 jobs running for the entire
>> weekend. That leaves 130 GPUs unused.
>> >>
>> >> User 2, whose jobs each run on 1 GPU, decides to run a regression test.
>> The regression test comprises 9 different scripts, each run 40 times, for
>> a grand total of 360 jobs. The duration of the scripts varies from 1 to 5
>> hours, and the jobs take on average 4 hours to complete.
>> >>
>> >> User 2 gets the user group's approval (via chat) to use 90 GPUs (so
>> that 40 GPUs will remain for anyone else wanting to work that day).
>> >>
>> >> The problem I'm trying to solve is this: how do I ensure that user 2
>> launches his 360 jobs in such a way that 90 jobs are in the run state
>> consistently until the regression test is finished?
>> >>
>> >> Keep in mind that:
>> >>
>> >> limiting each job array to 10 jobs is inefficient: when the first job
>> array finishes (long before the last one), only 80 GPUs will be used, and
>> so on as other arrays finish
>> >> the admin is not available, he cannot be asked to set a hard limit of
>> 90 jobs for user 2 just for today
>> >>
>> >> I would be happy to use job arrays if they allow me to set an
>> overarching job limit across

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon

Yes, QoS's are dynamic.

-Paul Edmon-

On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:

Hi Paul,

Thanks for your pointers.

I'll look into QOS and MCS after my paper deadline (Sept 5). Re
QOS, as expressed to Peter in the reply I just sent, I wonder whether
the QOS of a job can be changed while it's pending (submitted but
not yet running).


Regards,
Guillaume.

On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon wrote:


A QoS is probably your best bet.  Another variant might be MCS, which
you can use to help reduce resource fragmentation.  For limits though
QoS will be your best bet.

-Paul Edmon-

On 8/30/19 7:33 AM, Steven Dick wrote:
> It would still be possible to use job arrays in this situation, it's
> just slightly messy.
> So the way a job array works is that you submit a single script, and
> that script is provided an integer for each subjob.  The integer is in
> a range, with a possible step (default=1).
>
> To run the situation you describe, you would have to predetermine how
> many of each test you want to run (i.e., you couldn't dynamically
> change the number of jobs that run within one array), and a master
> script would map the integer range to the job that was to be started.
>
> The most trivial way to do it would be to put the list of regressions
> in a text file and the master script would index it by line number and
> then run the appropriate command.
> A more complex way would be to do some math (a divide?) to get the
> script name and subindex (modulus?) for each regression.
>
> Both of these would require some semi-advanced scripting, but nothing
> that couldn't be cut and pasted with some trivial modifications for
> each job set.
>
> As to the unavailability of the admin ...
> An alternate approach that would require the admin's help would be to
> come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
> gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> maxtrespu=gpu=40). Then the user would assign that QOS to the job when
> starting it to set the overall allocation for all the jobs.  The admin
> wouldn't need to tweak this more than once; you just pick which QOS to
> use.
>
> On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
> <gperr...@uottawa.ca> wrote:
>> Hi Steven,
>>
>> Thanks for taking the time to reply to my post.
>>
>> Setting a limit on the number of jobs for a single array isn't
>> sufficient because regression-tests need to launch multiple arrays,
>> and I would need a job limit that would take effect over all
>> launched jobs.
>>
>> It's very possible I'm not understanding something. I'll lay out a
>> very specific example in the hopes you can correct me if I've gone
>> wrong somewhere.
>>
>> Let's take the small cluster with 140 GPUs and no fairshare as an
>> example, because it's easier for me to explain.
>>
>> The users, who all know each other personally and interact via chat,
>> decide on a daily basis how many jobs each user can run at a time.
>>
>> Let's say today is Sunday (hypothetically). Nobody is actively
>> developing today, except that user 1 has 10 jobs running for the
>> entire weekend. That leaves 130 GPUs unused.
>>
>> User 2, whose jobs each run on 1 GPU, decides to run a regression
>> test. The regression test comprises 9 different scripts, each run
>> 40 times, for a grand total of 360 jobs. The duration of the scripts
>> varies from 1 to 5 hours, and the jobs take on average 4 hours to
>> complete.
>>
>> User 2 gets the user group's approval (via chat) to use 90 GPUs
>> (so that 40 GPUs will remain for anyone else wanting to work that
>> day).
>>
>> The problem I'm trying to solve is this: how do I ensure that
>> user 2 launches his 360 jobs in such a way that 90 jobs are in the
>> run state consistently until the regression test is finished?
>>
>> Keep in mind that:
>>
>> limiting each job array to 10 jobs is inefficient: when the first
>> job array finishes (long before the last one), only 80 GPUs will be
>> used, and so on as other arrays finish
>> the admin is not available, he cannot be asked to set a hard limit
>> of 90 jobs for user 2 just for today
>>
>> I would be happy to use job arrays if they allow me to set an
>> overarching job limit across multiple arrays. Perhaps this is
>> doable. Admittedly I'm working on a paper to be submitted in a few
>> days, so I don't have time to test job arrays thoroughly, but I
>> will try out job arrays more thoroughly once I've submitted my
>> paper (i.e. after Sept 5).
>>
>> My solution, for now, is to not use job arrays. Instead, I
>> launch each job individually, and I use sing

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Paul,

Thanks for your pointers.

I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as
expressed to Peter in the reply I just sent, I wonder whether the QOS of
a job can be changed while it's pending (submitted but not yet running).

Regards,
Guillaume.

On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon  wrote:

> A QoS is probably your best bet.  Another variant might be MCS, which
> you can use to help reduce resource fragmentation.  For limits though
> QoS will be your best bet.
>
> -Paul Edmon-
>
> On 8/30/19 7:33 AM, Steven Dick wrote:
> > It would still be possible to use job arrays in this situation, it's
> > just slightly messy.
> > So the way a job array works is that you submit a single script, and
> > that script is provided an integer for each subjob.  The integer is in
> > a range, with a possible step (default=1).
> >
> > To run the situation you describe, you would have to predetermine how
> > many of each test you want to run (i.e., you couldn't dynamically
> > change the number of jobs that run within one array), and a master
> > script would map the integer range to the job that was to be started.
> >
> > The most trivial way to do it would be to put the list of regressions
> > in a text file and the master script would index it by line number and
> > then run the appropriate command.
> > A more complex way would be to do some math (a divide?) to get the
> > script name and subindex (modulus?) for each regression.
> >
> > Both of these would require some semi-advanced scripting, but nothing
> > that couldn't be cut and pasted with some trivial modifications for
> > each job set.
> >
> > As to the unavailability of the admin ...
> > An alternate approach that would require the admin's help would be to
> > come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
> > gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> > maxtrespu=gpu=40). Then the user would assign that QOS to the job when
> > starting it to set the overall allocation for all the jobs.  The admin
> > wouldn't need to tweak this more than once; you just pick which QOS to
> > use.
> >
> > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
> >  wrote:
> >> Hi Steven,
> >>
> >> Thanks for taking the time to reply to my post.
> >>
> >> Setting a limit on the number of jobs for a single array isn't
> sufficient because regression-tests need to launch multiple arrays, and I
> would need a job limit that would take effect over all launched jobs.
> >>
> >> It's very possible I'm not understanding something. I'll lay out a very
> specific example in the hopes you can correct me if I've gone wrong
> somewhere.
> >>
> >> Let's take the small cluster with 140 GPUs and no fairshare as an
> example, because it's easier for me to explain.
> >>
> >> The users, who all know each other personally and interact via chat,
> decide on a daily basis how many jobs each user can run at a time.
> >>
> >> Let's say today is Sunday (hypothetically). Nobody is actively
> developing today, except that user 1 has 10 jobs running for the entire
> weekend. That leaves 130 GPUs unused.
> >>
> >> User 2, whose jobs each run on 1 GPU, decides to run a regression test.
> The regression test comprises 9 different scripts, each run 40 times, for
> a grand total of 360 jobs. The duration of the scripts varies from 1 to 5
> hours, and the jobs take on average 4 hours to complete.
> >>
> >> User 2 gets the user group's approval (via chat) to use 90 GPUs (so
> that 40 GPUs will remain for anyone else wanting to work that day).
> >>
> >> The problem I'm trying to solve is this: how do I ensure that user 2
> launches his 360 jobs in such a way that 90 jobs are in the run state
> consistently until the regression test is finished?
> >>
> >> Keep in mind that:
> >>
> >> limiting each job array to 10 jobs is inefficient: when the first job
> array finishes (long before the last one), only 80 GPUs will be used, and
> so on as other arrays finish
> >> the admin is not available, he cannot be asked to set a hard limit of
> 90 jobs for user 2 just for today
> >>
> >> I would be happy to use job arrays if they allow me to set an
> overarching job limit across multiple arrays. Perhaps this is doable.
> Admittedly I'm working on a paper to be submitted in a few days, so I don't
> have time to test job arrays thoroughly, but I will try out job arrays
> more thoroughly once I've submitted my paper (i.e. after Sept 5).
> >>
> >> My solution, for now, is to not use job arrays. Instead, I launch each
> job individually, and I use singleton (by launching all jobs with the same
> 90 unique names) to ensure that exactly 90 jobs are run at a time (in this
> case, corresponding to 90 GPUs in use).
> >>
> >> Side note: the unavailability of the admin might sound contrived by
> picking Sunday as an example, but it's in fact very typical. The admin is
> not available:
> >>
> >> on weekends (the present example)
> >> at any time ou

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Steven,

Those both sound like potentially good solutions.

So basically, you're saying that if I script it properly, I can use a
single job array to launch multiple scripts by using a master sbatch script.

My problem with that, though, is what if each script (the 9 scripts in my
earlier example) has different requirements? For example, running on a
different partition, or with a different time limit? My understanding is
that for a single job array, each job will get the same job requirements.

The other problem is that with the way I've implemented it, I can change
the max jobs dynamically.

I'll illustrate this using my earlier example. Suppose user 2 launches his
360 jobs with a 90 job limit (leaving 40 unused GPUs), and then user 3
realizes he needs to use 45 GPUs.

User 2 decides to drop his usage to 45 max jobs.

He can simply change the names of his pending singleton jobs to use 45
unique names, so that he reduces his max jobs to 45 instead of 90 (I
wrote a script to do that, so it's a one-liner for user 2).
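The renaming pass itself is just a loop over scontrol; a rough sketch
(the "slotNN" naming scheme and the squeue filter are assumptions about
how my script is set up):

    # remap every pending job onto a pool of 45 singleton names
    i=0
    for jobid in $(squeue -u "$USER" -t PENDING -h -o %A); do
        scontrol update JobId="$jobid" Name="slot$(( i % 45 ))"
        i=$(( i + 1 ))
    done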

Can the max job limit be modified after submission time using one big job
array?

The docs give the '%' separator to limit the number of concurrently running
jobs, e.g. "--array=0-15%4". I could be wrong, but this sounds like a
submit-time-only option that cannot be changed after submission.
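(Reading the scontrol man page, it looks like that throttle can in fact
be adjusted on an already-submitted array, though I haven't tried it
myself; the job ID below is a placeholder:

    scontrol update JobId=123456 ArrayTaskThrottle=45

If that works, it would address this particular concern.)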

I also kind of like the idea of various QOSes for different job limits. I'm
not sure I'll be able to get the admin on board, but I'll bring it up. Even
if I do get them on board, will I have the same problem of locking the max
limit at submit time?

Can you change the QOS of a job when it's still pending?

Thanks a lot for your help!

Regards,
Guillaume

On Fri, Aug 30, 2019 at 7:36 AM Steven Dick  wrote:

> It would still be possible to use job arrays in this situation, it's
> just slightly messy.
> So the way a job array works is that you submit a single script, and
> that script is provided an integer for each subjob.  The integer is in
> a range, with a possible step (default=1).
>
> To run the situation you describe, you would have to predetermine how
> many of each test you want to run (i.e., you couldn't dynamically
> change the number of jobs that run within one array), and a master
> script would map the integer range to the job that was to be started.
>
> The most trivial way to do it would be to put the list of regressions
> in a text file and the master script would index it by line number and
> then run the appropriate command.
> A more complex way would be to do some math (a divide?) to get the
> script name and subindex (modulus?) for each regression.
>
> Both of these would require some semi-advanced scripting, but nothing
> that couldn't be cut and pasted with some trivial modifications for
> each job set.
>
> As to the unavailability of the admin ...
> An alternate approach that would require the admin's help would be to
> come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
> gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> maxtrespu=gpu=40). Then the user would assign that QOS to the job when
> starting it to set the overall allocation for all the jobs.  The admin
> wouldn't need to tweak this more than once; you just pick which QOS to
> use.
>
> On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
>  wrote:
> >
> > Hi Steven,
> >
> > Thanks for taking the time to reply to my post.
> >
> > Setting a limit on the number of jobs for a single array isn't
> sufficient because regression-tests need to launch multiple arrays, and I
> would need a job limit that would take effect over all launched jobs.
> >
> > It's very possible I'm not understanding something. I'll lay out a very
> specific example in the hopes you can correct me if I've gone wrong
> somewhere.
> >
> > Let's take the small cluster with 140 GPUs and no fairshare as an
> example, because it's easier for me to explain.
> >
> > The users, who all know each other personally and interact via chat,
> decide on a daily basis how many jobs each user can run at a time.
> >
> > Let's say today is Sunday (hypothetically). Nobody is actively
> developing today, except that user 1 has 10 jobs running for the entire
> weekend. That leaves 130 GPUs unused.
> >
> > User 2, whose jobs each run on 1 GPU, decides to run a regression test.
> The regression test comprises 9 different scripts, each run 40 times, for
> a grand total of 360 jobs. The duration of the scripts varies from 1 to 5
> hours, and the jobs take on average 4 hours to complete.
> >
> > User 2 gets the user group's approval (via chat) to use 90 GPUs (so that
> 40 GPUs will remain for anyone else wanting to work that day).
> >
> > The problem I'm trying to solve is this: how do I ensure that user 2
> launches his 360 jobs in such a way that 90 jobs are in the run state
> consistently until the regression test is finished?
> >
> > Keep in mind that:
> >
> > limiting each job array to 10 jobs is inefficient: when the first job
> array finishe

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Brian Andrus

After you restart slurmctld, do "scontrol reconfigure".
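The whole sequence is something like this (assuming systemd-managed
daemons and pdsh to fan out to the nodes; substitute whatever your site
actually uses):

    # head node, after editing slurm.conf
    systemctl restart slurmctld

    # every compute node, so they pick up the new RealMemory value
    pdsh -w node[001-003] systemctl restart slurmd

    # tell the running daemons to re-read the config
    scontrol reconfigure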

Brian Andrus

On 8/30/2019 6:57 AM, Robert Kudyba wrote:
I had set RealMemory to a really high number as I misinterpreted the
recommendation:
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092  Sockets=2 Gres=gpu:1


But now I set it to:
RealMemory=191000

I restarted slurmctld. And according to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a 
symlink directly to the slurm.conf on the head node. This means that 
any changes made to the file on the head node will automatically be 
available to the compute nodes. All they would need in that case is to 
have slurmd restarted"


But now I see these errors:

mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a 
different slurm.conf than the slurmctld.  This could cause issues with 
communication and functionality.  Please review both files and make 
sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a 
different slurm.conf than the slurmctld.  This could cause issues with 
communication and functionality.  Please review both files and make 
sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a 
different slurm.conf than the slurmctld.  This could cause issues with 
communication and functionality.  Please review both files and make 
sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.

[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449 
InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449 
NodeList=node[001-003] #CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for 
JobID=449 is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3 
WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005 
NodeCnt=3 done


Is this another option that needs to be set?

On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko wrote:


Sounds like maybe you didn't correctly roll out / update your
slurm.conf everywhere as your RealMemory value is back to your
large wrong number.  You need to update your slurm.conf everywhere
and restart all the slurm daemons.

I recommend the "safe procedure" from here:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes


Your Bright manual may have a similar process for updating SLURM
config "the Bright way".

On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba
<rkud...@fordham.edu> wrote:

I thought I had taken care of this a while back but it appears
the issue has returned. A very simply sbatch slurmhello.sh:
 cat slurmhello.sh
#!/bin/sh
#SBATCH -o my.stdout
#SBATCH -N 3
#SBATCH --ntasks=16
module add shared openmpi/gcc/64/1.10.7 slurm
mpirun hello

sbatch slurmhello.sh
Submitted batch job 419

squeue
  JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
    419      defq slurmhel root PD  0:00     3 (Resources)

In /etc/slurm/slurm.conf:
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092
Sockets=2 Gres=gpu:1

Logs show:
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
node=node001: Invalid argument
[2019-08-29T14:24:40.025] error: Node node002 has low
real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-08-29T14:24:40.026] error: Node node003 has low
real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration
node=node003: Invalid argument

scontrol show jobid -

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon
A QoS is probably your best bet.  Another variant might be MCS, which 
you can use to help reduce resource fragmentation.  For limits though 
QoS will be your best bet.


-Paul Edmon-

On 8/30/19 7:33 AM, Steven Dick wrote:

It would still be possible to use job arrays in this situation, it's
just slightly messy.
So the way a job array works is that you submit a single script, and
that script is provided an integer for each subjob.  The integer is in
a range, with a possible step (default=1).

To run the situation you describe, you would have to predetermine how
many of each test you want to run (i.e., you couldn't dynamically
change the number of jobs that run within one array), and a master
script would map the integer range to the job that was to be started.

The most trivial way to do it would be to put the list of regressions
in a text file and the master script would index it by line number and
then run the appropriate command.
A more complex way would be to do some math (a divide?) to get the
script name and subindex (modulus?) for each regression.

Both of these would require some semi-advanced scripting, but nothing
that couldn't be cut and pasted with some trivial modifications for
each job set.

As to the unavailability of the admin ...
An alternate approach that would require the admin's help would be to
come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
maxtrespu=gpu=40). Then the user would assign that QOS to the job when
starting it to set the overall allocation for all the jobs.  The admin
wouldn't need to tweak this more than once; you just pick which QOS to
use.

On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault wrote:

Hi Steven,

Thanks for taking the time to reply to my post.

Setting a limit on the number of jobs for a single array isn't sufficient 
because regression-tests need to launch multiple arrays, and I would need a job 
limit that would take effect over all launched jobs.

It's very possible I'm not understanding something. I'll lay out a very specific
example in the hopes you can correct me if I've gone wrong somewhere.

Let's take the small cluster with 140 GPUs and no fairshare as an example, 
because it's easier for me to explain.

The users, who all know each other personally and interact via chat, decide on 
a daily basis how many jobs each user can run at a time.

Let's say today is Sunday (hypothetically). Nobody is actively developing 
today, except that user 1 has 10 jobs running for the entire weekend. That 
leaves 130 GPUs unused.

User 2, whose jobs each run on 1 GPU, decides to run a regression test. The
regression test comprises 9 different scripts, each run 40 times, for a grand
total of 360 jobs. The duration of the scripts varies from 1 to 5 hours,
and the jobs take on average 4 hours to complete.

User 2 gets the user group's approval (via chat) to use 90 GPUs (so that 40 
GPUs will remain for anyone else wanting to work that day).

The problem I'm trying to solve is this: how do I ensure that user 2 launches 
his 360 jobs in such a way that 90 jobs are in the run state consistently until 
the regression test is finished?

Keep in mind that:

limiting each job array to 10 jobs is inefficient: when the first job array 
finishes (long before the last one), only 80 GPUs will be used, and so on as 
other arrays finish
the admin is not available, he cannot be asked to set a hard limit of 90 jobs 
for user 2 just for today

I would be happy to use job arrays if they allow me to set an overarching job 
limit across multiple arrays. Perhaps this is doable. Admittedly I'm working on
a paper to be submitted in a few days, so I don't have time to test job arrays
thoroughly, but I will try out job arrays more thoroughly once I've submitted
my paper (i.e. after Sept 5).

My solution, for now, is to not use job arrays. Instead, I launch each job 
individually, and I use singleton (by launching all jobs with the same 90 
unique names) to ensure that exactly 90 jobs are run at a time (in this case, 
corresponding to 90 GPUs in use).
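As a rough sketch of what that launcher does (the slotNN naming scheme,
the regressions.txt file, and the resource options are stand-ins for my
actual script):

    # 360 commands, a pool of 90 singleton names -> at most 90 jobs running
    n=0
    while read -r cmd; do
        sbatch --gres=gpu:1 --dependency=singleton \
               --job-name="slot$(( n % 90 ))" --wrap="$cmd"
        n=$(( n + 1 ))
    done < regressions.txt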

Side note: the unavailability of the admin might sound contrived by picking 
Sunday as an example, but it's in fact very typical. The admin is not available:

on weekends (the present example)
at any time outside of 9am to 5pm (keep in mind, this is a cluster used by 
students in different time zones)
any time he is on vacation
any time he is looking after his many other responsibilities. Constantly
setting user limits that change on a daily basis would be too much to ask.


I'd be happy if you corrected my misunderstandings, especially if you could 
show me how to set a job limit that takes effect over multiple job arrays.

I may have very glaring oversights as I don't necessarily have a big picture 
view of things (I've never been an admin, most notably), so feel free to poke 
holes at the way I've constructed things.

Regards,
Gu

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Robert Kudyba
I had set RealMemory to a really high number as I misinterpreted the
recommendation:
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092  Sockets=2
Gres=gpu:1

But now I set it to:
RealMemory=191000
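(To sanity-check that value, slurmd -C on a compute node prints the
hardware the node actually detects, including RealMemory in megabytes:

    slurmd -C

which should line up with the 191840 reported in the registration
errors quoted below.)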

I restarted slurmctld. And according to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a symlink
directly to the slurm.conf on the head node. This means that any changes
made to the file on the head node will automatically be available to the
compute nodes. All they would need in that case is to have slurmd restarted"

But now I see these errors:

mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a different
slurm.conf than the slurmctld.  This could cause issues with communication
and functionality.  Please review both files and make sure they are the
same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449
InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449 NodeList=node[001-003]
#CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for JobID=449
is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3
WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005 NodeCnt=3
done

Is this another option that needs to be set?

On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko  wrote:

> Sounds like maybe you didn't correctly roll out / update your slurm.conf
> everywhere as your RealMemory value is back to your large wrong number.
> You need to update your slurm.conf everywhere and restart all the slurm
> daemons.
>
> I recommend the "safe procedure" from here:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
> 
> Your Bright manual may have a similar process for updating SLURM config
> "the Bright way".
>
> On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba 
> wrote:
>
>> I thought I had taken care of this a while back but it appears the issue
>> has returned. A very simply sbatch slurmhello.sh:
>>  cat slurmhello.sh
>> #!/bin/sh
>> #SBATCH -o my.stdout
>> #SBATCH -N 3
>> #SBATCH --ntasks=16
>> module add shared openmpi/gcc/64/1.10.7 slurm
>> mpirun hello
>>
>> sbatch slurmhello.sh
>> Submitted batch job 419
>>
>> squeue
>>   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
>>     419      defq slurmhel root PD  0:00     3 (Resources)
>>
>> In /etc/slurm/slurm.conf:
>> # Nodes
>> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
>> Gres=gpu:1
>>
>> Logs show:
>> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
>> node=node001: Invalid argument
>> [2019-08-29T14:24:40.025] error: Node node002 has low real_memory size
>> (191840 < 196489092)
>> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
>> node=node002: Invalid argument
>> [2019-08-29T14:24:40.026] error: Node node003 has low real_memory size
>> (191840 < 196489092)
>> [2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration
>> node=node003: Invalid argument
>>
>> scontrol show jobid -dd 419
>> JobId=419 JobName=slurmhello.sh
>>UserId=root(0) GroupId=root(0) MCS_label=N/A
>>Priority=4294901759 Nice=0 Account=root QOS=normal
>>JobState=PENDING Reason=Resources Dependency=(null)
>>Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>DerivedExitCode=0:0
>>RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>Submi

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
It would still be possible to use job arrays in this situation, it's
just slightly messy.
So the way a job array works is that you submit a single script, and
that script is provided an integer for each subjob.  The integer is in
a range, with a possible step (default=1).

To run the situation you describe, you would have to predetermine how
many of each test you want to run (i.e., you couldn't dynamically
change the number of jobs that run within one array), and a master
script would map the integer range to the job that was to be started.

The most trivial way to do it would be to put the list of regressions
in a text file and the master script would index it by line number and
then run the appropriate command.
A more complex way would be to do some math (a divide?) to get the
script name and subindex (modulus?) for each regression.

Both of these would require some semi-advanced scripting, but nothing
that couldn't be cut and pasted with some trivial modifications for
each job set.
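A minimal sketch of the text-file approach (the file name, resource
options, and the 360/90 numbers are just for illustration; each line of
regressions.txt holds one complete command):

    #!/bin/bash
    #SBATCH --array=1-360%90      # 360 regressions, at most 90 running at once
    #SBATCH --gres=gpu:1
    #SBATCH --time=05:00:00

    # pick the command on line $SLURM_ARRAY_TASK_ID of the list and run it
    cmd=$(sed -n "${SLURM_ARRAY_TASK_ID}p" regressions.txt)
    eval "$cmd"

The divide/modulus variant just computes script=$((SLURM_ARRAY_TASK_ID / 40))
and rep=$((SLURM_ARRAY_TASK_ID % 40)) instead of reading a file.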

As to the unavailability of the admin ...
An alternate approach that would require the admin's help would be to
come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
maxtrespu=gpu=40). Then the user would assign that QOS to the job when
starting it to set the overall allocation for all the jobs.  The admin
wouldn't need to tweak this more than once; you just pick which QOS to
use.
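Roughly, the one-time admin setup and the per-job usage would look like
this (QOS names, limits, and the user name are placeholders, and the
exact sacctmgr syntax may differ a bit by Slurm version):

    # admin, once per limit tier
    sacctmgr add qos gpu40
    sacctmgr modify qos gpu40 set MaxTRESPerUser=gres/gpu=40
    sacctmgr add qos gpu80
    sacctmgr modify qos gpu80 set MaxTRESPerUser=gres/gpu=80
    # let the user pick these QOSes
    sacctmgr modify user someuser set qos+=gpu40,gpu80

    # user, at submit time
    sbatch --qos=gpu40 --gres=gpu:1 some_regression.sh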

On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault wrote:
>
> Hi Steven,
>
> Thanks for taking the time to reply to my post.
>
> Setting a limit on the number of jobs for a single array isn't sufficient 
> because regression-tests need to launch multiple arrays, and I would need a 
> job limit that would take effect over all launched jobs.
>
> It's very possible I'm not understanding something. I'll lay out a very specific
> example in the hopes you can correct me if I've gone wrong somewhere.
>
> Let's take the small cluster with 140 GPUs and no fairshare as an example, 
> because it's easier for me to explain.
>
> The users, who all know each other personally and interact via chat, decide 
> on a daily basis how many jobs each user can run at a time.
>
> Let's say today is Sunday (hypothetically). Nobody is actively developing 
> today, except that user 1 has 10 jobs running for the entire weekend. That 
> leaves 130 GPUs unused.
>
> User 2, whose jobs each run on 1 GPU, decides to run a regression test. The
> regression test comprises 9 different scripts, each run 40 times, for a
> grand total of 360 jobs. The duration of the scripts varies from 1 to 5 hours,
> and the jobs take on average 4 hours to complete.
>
> User 2 gets the user group's approval (via chat) to use 90 GPUs (so that 40 
> GPUs will remain for anyone else wanting to work that day).
>
> The problem I'm trying to solve is this: how do I ensure that user 2 launches 
> his 360 jobs in such a way that 90 jobs are in the run state consistently 
> until the regression test is finished?
>
> Keep in mind that:
>
> limiting each job array to 10 jobs is inefficient: when the first job array 
> finishes (long before the last one), only 80 GPUs will be used, and so on as 
> other arrays finish
> the admin is not available, he cannot be asked to set a hard limit of 90 jobs 
> for user 2 just for today
>
> I would be happy to use job arrays if they allow me to set an overarching job 
> limit across multiple arrays. Perhaps this is doable. Admittedly I'm working
> on a paper to be submitted in a few days, so I don't have time to test job
> arrays thoroughly, but I will try out job arrays more thoroughly once I've
> submitted my paper (i.e. after Sept 5).
>
> My solution, for now, is to not use job arrays. Instead, I launch each job 
> individually, and I use singleton (by launching all jobs with the same 90 
> unique names) to ensure that exactly 90 jobs are run at a time (in this case, 
> corresponding to 90 GPUs in use).
>
> Side note: the unavailability of the admin might sound contrived by picking 
> Sunday as an example, but it's in fact very typical. The admin is not 
> available:
>
> on weekends (the present example)
> at any time outside of 9am to 5pm (keep in mind, this is a cluster used by 
> students in different time zones)
> any time he is on vacation
> any time he is looking after his many other responsibilities. Constantly
> setting user limits that change on a daily basis would be too much to ask.
>
>
> I'd be happy if you corrected my misunderstandings, especially if you could 
> show me how to set a job limit that takes effect over multiple job arrays.
>
> I may have very glaring oversights as I don't necessarily have a big picture 
> view of things (I've never been an admin, most notably), so feel free to poke 
> holes at the way I've constructed things.
>
> Regards,
> Guillaume.
>
>
> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick  wrote:
>>
>> This makes no sense and

[slurm-users] Usage splitting

2019-08-30 Thread Stefan Staeglich
Hi,

we have some compute nodes paid for by different project owners: 10% are owned
by project A and 90% are owned by project B.

We want to implement the following policy for every time period of a certain
length (e.g. two weeks):
- Project A doesn't use more than 10% of the cluster in this time period
- But project B is allowed to use more than 90%

What's the best way to enforce this?

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb. 74,  79110 Freiburg, Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-54217