Hi ~
Still struggling with a subtle version of this memory-size
issue in version 15.08.5. I don’t get explicit errors, but I
believe it’s the root cause of the seg faults in parallel codes.
SLURM seems to “intercept” the max memory size limit (-m) and
ignore / override any system settings.
On the master node:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515700
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
On the node (interactively):
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127880
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Now run through SLURM on the same node:
$ srun -n1 bash -c "ulimit -a"
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127880
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 1024
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 127880
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
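For anyone puzzled by why the limits under srun differ from the login shell: slurmd launches each step with its own resource limits, and children simply inherit whatever the parent set before exec. A minimal sketch of that inheritance mechanism, with no SLURM involved at all (plain bash):

```shell
# A parent that lowers the soft RSS limit before spawning a child determines
# what the child reports, regardless of the system-wide / login default.
( ulimit -S -m 1024; bash -c 'ulimit -m' )   # child reports: 1024
bash -c 'ulimit -m'                          # outer shell is unaffected
```

This is why a `ulimit -m unlimited` in your own shell or an init script cannot undo a limit that slurmd itself applies when it launches the step.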
I have added a "ulimit -m unlimited" to all of the SLURM
init.d scripts, but no change. Any advice on how to eliminate this
permanently? I’ve tried various PropagateResourceLimitsExcept
settings in slurm.conf without any effect on the max memory size.
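For reference, the propagation settings being referred to live in slurm.conf. A sketch of the two variants (illustrative only; as noted above, the poster reports they did not change the -m value seen under srun):

```
# slurm.conf — resource-limit propagation (illustrative)
# Propagate no limits from the submitting shell to the job:
PropagateResourceLimits=NONE
# ...or propagate everything except RSS, leaving -m to the slurmd environment:
PropagateResourceLimitsExcept=RSS
```

These options only control whether the *submitting shell's* limits are copied to the job; they do not govern limits that SLURM itself imposes from a job's memory allocation.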
Thanks,
~MC
From: Colonno, Michael Richard
Sent: Thursday, April 9, 2015 11:10 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: default memory limit (14.11.5)?
Nope – my slurm.conf is very basic (been using it for
several versions).
# COMPUTE NODES
NodeName=node[1-8] Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=IDLE
PartitionName=all Nodes=node[1-8] Default=YES MaxTime=INFINITE State=UP
Perhaps a system-level limit or something not set in the
slurm init.d script? This all looks pretty normal:
# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256422
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 256422
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Thanks,
~Mike C.
From: Morris Jette [mailto:[email protected]]
Sent: Thursday, April 09, 2015 10:59 AM
To: slurm-dev
Subject: [slurm-dev] Re: default memory limit (14.11.5)?
Do you have a DefMemPerCPU or DefMemPerNode configured in slurm.conf?
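For readers unfamiliar with these options: they set a default memory allocation for jobs that don't request one explicitly, and a small default would cap every step. An illustrative fragment (values hypothetical, not taken from the poster's config):

```
# slurm.conf — hypothetical per-job memory defaults
DefMemPerCPU=1024     # MB per allocated CPU
# or, mutually exclusive with the above:
#DefMemPerNode=1024   # MB per allocated node
```

The effective values on a running cluster can be checked with `scontrol show config | grep -i DefMem`.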
On April 9, 2015 10:52:37 AM PDT, Michael Colonno
<[email protected]> wrote:
Hi ~
I just upgraded my cluster to SLURM 14.11.5. Everything
went smoothly but when I run a test case it seems there is now a
(very small) memory limit on jobs:
$ srun -n4 date
slurmstepd: Step 19293.0 exceeded memory limit (3324 > 1024), being killed
srun: Exceeded job memory limit
slurmstepd: *** STEP 19293.0 CANCELLED AT 2015-04-09T10:46:17 *** on node6
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: node6: tasks 0-3: Killed
How can I disable / fix this?
Thanks,
~Mike C.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.