Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Chris Samuel
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote:

> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill
> event count: 1

We get that line for pretty much every job; I don't think it reflects the OOM 
killer being invoked on anything in the extern step.

OOM killer invocations should be recorded in the kernel logs on the node; 
check with "dmesg -T" to see whether it is being invoked (or look in syslog, 
in case those messages have already been dropped from the kernel ring buffer 
by later ones).
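
For example, something along these lines on an affected node (the journalctl 
variant assumes a systemd-based distro):

# look for OOM killer activity in the kernel ring buffer
dmesg -T | grep -iE 'out of memory|oom-kill'

# if the ring buffer has already wrapped, the kernel log should still be
# in the journal (or in /var/log/messages, depending on the distro)
journalctl -k | grep -i oom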

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Prentice Bisbal
I maintain a very heterogeneous cluster (different processors, different 
amounts of RAM, etc.). I have a user reporting the following problem.


He's running the same job multiple times with different input 
parameters. The jobs run fine unless they land on specific nodes. He's 
specifying --mem=2G in his sbatch files. On the nodes where the jobs 
fail, I see that the OOM killer is invoked, so I asked him to request 
more RAM, which he did. He set --mem=4G, and the jobs still fail on 
these two nodes. However, they run just fine on other nodes with --mem=2G.
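
For context, the sbatch script looks more or less like this (the job name, 
executable, and input file below are made up; the only real detail is the 
--mem request):

#!/bin/bash
#SBATCH --job-name=param_sweep   # hypothetical job name
#SBATCH --mem=4G                 # was 2G originally; raised to 4G after the failures
#SBATCH --time=01:00:00          # placeholder time limit

./model --input params_01.dat    # hypothetical executable and input file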


When I look at the slurm log file on the nodes, I see something like 
this for a failing job (in this case, --mem=4G was set):

[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
[2020-07-01T16:19:19.508] [801777.extern] done with job

Any ideas why the jobs are failing on just these two nodes, while they 
run just fine on many other nodes?


For now, the user is excluding these two nodes using the -x option to 
sbatch, but I'd really like to understand what's going on here.
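
i.e. something like this (the node names here are just placeholders):

sbatch --exclude=node017,node023 job.sh

# or equivalently in the batch script itself:
#SBATCH --exclude=node017,node023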


--

Prentice




Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.

2020-07-02 Thread Ryan Novosielski
Are you sure that the OOM killer is involved? I can get you specifics later, 
but if it’s that one line about OOM events, you may see it after successful 
jobs too. I just had a SLURM bug where this came up.
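
One quick way to check is to ask Slurm what it thinks happened to the job, 
e.g. (job ID taken from your log; the exact columns available vary a bit by 
Slurm version):

sacct -j 801777 --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem

If it really had been OOM-killed you'd expect to see an OUT_OF_MEMORY state 
(or at least a non-zero exit code) rather than COMPLETED.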

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
