Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Angelos Ching
Hi Timo,

We have faced a similar problem and our solution was to run an hourly cron job 
that sets a random node weight for each node. It works pretty well for us.
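
For illustration, a minimal sketch of what such an hourly job could look like 
(the node names and the weight range are made up; the script simply calls 
"scontrol update" for every node):

#!/usr/bin/env python3
"""Hourly cron job: assign a random scheduling weight to each node.
Sketch only -- node names and weight range are examples; requires
permission to run "scontrol update"."""
import random
import subprocess

# Hypothetical node list; could also be built from "sinfo -hN -o %N".
NODES = [f"node{i:02d}" for i in range(1, 41)]

for node in NODES:
    # Slurm prefers nodes with the lowest weight, so randomizing the
    # weights each hour moves the "first choice" around the cluster.
    weight = random.randint(1, 1000)
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", f"Weight={weight}"],
        check=True,
    )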

Best regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)

> On 2020/07/03 at 2:24, Timo Rothenpieler wrote:
> 
> Hello,
> 
> Our cluster is very rarely fully utilized; often only a handful of jobs are 
> running.
> This has the effect that the first couple of nodes get used a whole lot more 
> frequently than the ones nearer the end of the list.
> 
> This is primarily a problem because of the SSDs in the nodes. They already 
> show a significant difference in wear level between the first couple of nodes 
> and all the remaining ones.
> 
> Is there some way to nicely assign jobs to nodes such that all nodes get 
> roughly equal amounts of jobs/work?
> I looked through the possible options for SelectType, but nothing looks like 
> it does anything like that.




Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler

On 02.07.2020 20:28, Luis Huang wrote:
You can look into the CR_LLN feature. It works fairly well in our 
environment and jobs are distributed evenly.


SelectTypeParameters=CR_Core_Memory,CR_LLN


From how I understand it, CR_LLN will schedule jobs to the least-loaded 
node. But if there are hardly any jobs running, it will still only use the 
first few nodes all the time, and unless enough jobs come in to fill all 
nodes, it will never touch node 20+.


My current idea for a workaround is to write a cron script that 
periodically collects the amount of data written on each node 
and assigns a Weight to the nodes accordingly.
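
Roughly something like this, as a sketch (the data_written_gb() helper is a 
placeholder, because how the per-node write counters are collected, e.g. by 
parsing "smartctl -A" for the particular SSD model, is site-specific):

#!/usr/bin/env python3
"""Sketch: derive node Weight from SSD wear so Slurm fills the
least-worn nodes first. Node names are hypothetical."""
import subprocess

def data_written_gb(node):
    """Placeholder: return total GB written on the node's SSD, e.g. from
    SMART data gathered via ssh, Prometheus or a shared file."""
    raise NotImplementedError

NODES = [f"node{i:02d}" for i in range(1, 41)]
wear = {node: data_written_gb(node) for node in NODES}

# The most-worn node gets the highest weight; Slurm prefers lower
# weights, so the least-worn nodes are filled first.
for rank, node in enumerate(sorted(wear, key=wear.get), start=1):
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", f"Weight={rank}"],
        check=True,
    )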






Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Chris Samuel
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote:

> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill
> event count: 1

We get that line for pretty much every job; I don't think it reflects the OOM 
killer being invoked on something in the extern step.

OOM killer invocations should be recorded in the kernel logs on the node; 
check with "dmesg -T" to see if it's being invoked (or whether the messages are 
being logged via syslog, in case they've been dropped from the ring buffer by 
later messages).
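
For example, a quick filter along these lines could be run on the suspect 
nodes (a sketch; the substring match is only a heuristic, and reading the ring 
buffer may need root depending on kernel.dmesg_restrict):

#!/usr/bin/env python3
"""Print kernel log lines that look like OOM-killer activity."""
import subprocess

dmesg = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=True)
for line in dmesg.stdout.splitlines():
    if "oom" in line.lower() or "Out of memory" in line:
        print(line)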

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] [External] Re: Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
Not 100%, which is why I'm asking here. I searched the log files, and that 
line was only present after a handful of jobs, including the ones I'm 
investigating, so it's not something happening after/to every job. 
However, this is happening on nodes with plenty of RAM, so if the OOM 
killer is being invoked, something odd is definitely going on.


On 7/2/20 10:20 AM, Ryan Novosielski wrote:
Are you sure that the OOM killer is involved? I can get you specifics 
later, but if it’s that one line about OOM events, you may see it 
after successful jobs too. I just had a SLURM bug where this came up.


--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
`'


On Jul 2, 2020, at 09:53, Prentice Bisbal  wrote:

I maintain a very heterogeneous cluster (different processors, 
different amounts of RAM, etc.). I have a user reporting the following 
problem.


He's running the same job multiple times with different input 
parameters. The jobs run fine unless they land on specific nodes. 
He's specifying --mem=2G in his sbatch files. On the nodes where the 
jobs fail, I see that the OOM killer is invoked, so I asked him to 
specify more RAM, so he did. He set --mem=4G, and still the jobs fail 
on these 2 nodes. However, they run just fine on other nodes with 
--mem=2G.


When I look at the slurm log file on the nodes, I see something like 
this for a failing job (in this case, --mem=4G was set)


[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 
801777 ran for 0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777/step_extern: alloc=4096MB 
mem.limit=4096MB memsw.limit=unlimited

[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending 
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0

[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: 
oom-kill event count: 1

[2020-07-01T16:19:19.508] [801777.extern] done with job

Any ideas why the jobs are failing on just these two nodes, while 
they run just fine on many other nodes?


For now, the user is excluding these two nodes using the -x option to 
sbatch, but I'd really like to understand what's going on here.


--

Prentice



--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Luis Huang
You can look into the CR_LLN feature. It works fairly well in our environment 
and jobs are distributed evenly.

SelectTypeParameters=CR_Core_Memory,CR_LLN

https://slurm.schedmd.com/slurm.conf.html

From: slurm-users  on behalf of Timo 
Rothenpieler 
Sent: Thursday, July 2, 2020 2:21:48 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Evenly use all nodes

Hello,

Our cluster is very rarely fully utilized; often only a handful of jobs
are running.
This has the effect that the first couple of nodes get used a whole lot
more frequently than the ones nearer the end of the list.

This is primarily a problem because of the SSDs in the nodes. They
already show a significant difference in wear level between the
first couple of nodes and all the remaining ones.

Is there some way to nicely assign jobs to nodes such that all nodes get
roughly equal amounts of jobs/work?
I looked through the possible options for SelectType, but nothing looks
like it does anything like that.




[slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler

Hello,

Our cluster is very rarely fully utilized; often only a handful of jobs 
are running.
This has the effect that the first couple of nodes get used a whole lot 
more frequently than the ones nearer the end of the list.


This is primarily a problem because of the SSDs in the nodes. They 
already show a significant difference in wear level between the 
first couple of nodes and all the remaining ones.


Is there some way to nicely assign jobs to nodes such that all nodes get 
roughly equal amounts of jobs/work?
I looked through the possible options for SelectType, but nothing looks 
like it does anything like that.






[slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
I maintain a very heterogeneous cluster (different processors, different 
amounts of RAM, etc.). I have a user reporting the following problem.


He's running the same job multiple times with different input 
parameters. The jobs run fine unless they land on specific nodes. He's 
specifying --mem=2G in his sbatch files. On the nodes where the jobs 
fail, I see that the OOM killer is invoked, so I asked him to specify 
more RAM, so he did. He set --mem=4G, and still the jobs fail on these 2 
nodes. However, they run just fine on other nodes with --mem=2G.


When I look at the slurm log file on the nodes, I see something like 
this for a failing job (in this case, --mem=4G was set)


[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 
ran for 0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited

[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending 
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0

[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill 
event count: 1

[2020-07-01T16:19:19.508] [801777.extern] done with job

Any ideas why the jobs are failing on just these two nodes, while they 
run just fine on many other nodes?


For now, the user is excluding these two nodes using the -x option to 
sbatch, but I'd really like to understand what's going on here.


--

Prentice




Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Ryan Novosielski
Are you sure that the OOM killer is involved? I can get you specifics later, 
but if it’s that one line about OOM events, you may see it after successful 
jobs too. I just had a SLURM bug where this came up.

--

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
`'

On Jul 2, 2020, at 09:53, Prentice Bisbal  wrote:

I maintain a very heterogeneous cluster (different processors, different 
amounts of RAM, etc.). I have a user reporting the following problem.

He's running the same job multiple times with different input parameters. The 
jobs run fine unless they land on specific nodes. He's specifying --mem=2G in 
his sbatch files. On the nodes where the jobs fail, I see that the OOM killer 
is invoked, so I asked him to specify more RAM, so he did. He set --mem=4G, and 
still the jobs fail on these 2 nodes. However, they run just fine on other 
nodes with --mem=2G.

When I look at the slurm log file on the nodes, I see something like this for a 
failing job (in this case, --mem=4G was set)

[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 
0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: 
/slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: 
/slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB 
memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, 
error:0 status:0
[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event 
count: 1
[2020-07-01T16:19:19.508] [801777.extern] done with job

Any ideas why the jobs are failing on just these two nodes, while they run just 
fine on many other nodes?

For now, the user is excluding these two nodes using the -x option to sbatch, 
but I'd really like to understand what's going on here.

--

Prentice




[slurm-users] Automatically cancel jobs not utilizing their GPUs

2020-07-02 Thread Stephan Roth

Hi all,

Does anyone have ideas or suggestions on how to automatically cancel 
jobs which don't utilize the GPUs allocated to them?


The Slurm version in use is 19.05.

I'm thinking about collecting GPU utilization per process on all nodes 
with NVML/nvidia-smi, updating a mean value of the collected utilization 
per GPU, and cancelling a job if the mean value is below a to-be-defined 
threshold after a to-be-defined number of minutes.
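
A rough per-node sketch of that loop (the threshold, window and poll interval 
below are placeholder numbers, and gpu_to_jobid() is left as a stub because 
mapping a GPU index to the Slurm job holding it is site-specific):

#!/usr/bin/env python3
"""Sketch: poll GPU utilization, keep a running mean per GPU, and
scancel the owning job once the mean stays below a threshold."""
import subprocess
import time
from collections import defaultdict, deque

THRESHOLD = 5        # percent utilization regarded as "idle" (placeholder)
WINDOW_MINUTES = 30  # how long the mean must stay below the threshold (placeholder)
POLL_SECONDS = 60

def gpu_utilization():
    """Return current utilization (%) per GPU index via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return [int(v) for v in out.stdout.split()]

def gpu_to_jobid(gpu_index):
    """Stub: return the Slurm job ID currently allocated this GPU, or None.
    Replace with a site-specific lookup (job cgroups, scontrol, ...)."""
    return None

history = defaultdict(lambda: deque(maxlen=WINDOW_MINUTES * 60 // POLL_SECONDS))

while True:
    for idx, util in enumerate(gpu_utilization()):
        samples = history[idx]
        samples.append(util)
        if len(samples) == samples.maxlen and sum(samples) / len(samples) < THRESHOLD:
            jobid = gpu_to_jobid(idx)
            if jobid is not None:
                subprocess.run(["scancel", str(jobid)], check=True)
                samples.clear()
    time.sleep(POLL_SECONDS)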


Thank you for any input,

Cheers,
Stephan