See FAQ:
http://slurm.schedmd.com/faq.html#rlimit
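
The mechanism behind that FAQ entry is plain rlimit inheritance: srun captures the resource limits of the shell it is launched from, and slurmstepd re-applies them to the remote step, so whatever the submitting shell has is what the job sees. A minimal local sketch of the inheritance itself (no SLURM needed; Python's stdlib `resource` module wraps getrlimit/setrlimit, and `RLIMIT_RSS` is the limit that `ulimit -a` reports as "max memory size (-m)"):

```python
import os
import resource

# "max memory size (kbytes, -m)" in `ulimit -a` corresponds to RLIMIT_RSS.
soft, hard = resource.getrlimit(resource.RLIMIT_RSS)

# rlimits are inherited across fork()/exec(): a child starts with exactly
# its parent's limits -- the same way a step starts with the limits srun sends.
pid = os.fork()
if pid == 0:
    inherited = resource.getrlimit(resource.RLIMIT_RSS)
    os._exit(0 if inherited == (soft, hard) else 1)
_, status = os.waitpid(pid, 0)
print("child inherited RSS limit:", os.WEXITSTATUS(status) == 0)
```

This is why the interactive `ulimit -a` on the compute node is not the relevant number: the limits that reach the step are the ones in effect in the shell where `srun` runs.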

Quoting Michael Richard Colonno <[email protected]>:

Hi ~

Still struggling with a subtle version of this memory-size issue in version 15.08.5. I don't get explicit errors, but I believe it's the root cause of the segfaults in parallel codes: SLURM seems to "intercept" the max memory limit (-m) and override any system settings.

            On the master node:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515700
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

            On the node (interactively):

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127880
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

            Now run through SLURM on the same node:

$ srun -n1 bash -c "ulimit -a"
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127880
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 1024
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127880
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I have added a "ulimit -m unlimited" to all the SLURM init.d scripts, but nothing changed. Any advice on how to eliminate this permanently? I've tried various PropagateResourceLimitsExcept settings in slurm.conf without any effect on the max memory size.
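
For reference, these are the slurm.conf knobs that control limit propagation (names from slurm.conf(5); the values shown are an illustrative sketch, not a tested fix):

```
# slurm.conf -- sketch only
# Propagate none of the submitting shell's limits...
#PropagateResourceLimits=NONE
# ...or propagate everything except RSS ("max memory size", ulimit -m):
PropagateResourceLimitsExcept=RSS
```

Note that srun propagates the limits of the shell it is invoked from, not slurmd's, so raising limits in the init.d scripts (slurmd's environment) would not be expected to change what the step sees.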

            Thanks,
            ~MC

From: Colonno, Michael Richard
Sent: Thursday, April 9, 2015 11:10 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: default memory limit (14.11.5)?

Nope – my slurm.conf is very basic (been using it for several versions).

# COMPUTE NODES
NodeName=node[1-8] Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 State=IDLE
PartitionName=all Nodes=node[1-8] Default=YES MaxTime=INFINITE State=UP

Perhaps a system-level limit or something not set in the slurm init.d script? This all looks pretty normal:

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256422
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256422
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

            Thanks,
            ~Mike C.

From: Morris Jette [mailto:[email protected]]
Sent: Thursday, April 09, 2015 10:59 AM
To: slurm-dev
Subject: [slurm-dev] Re: default memory limit (14.11.5)?

Do you have a DefMemPerCPU or DefMemPerNode configured in slurm.conf?
On April 9, 2015 10:52:37 AM PDT, Michael Colonno <[email protected]> wrote:

            Hi ~



I just upgraded my cluster to SLURM 14.11.5. Everything went smoothly, but when I run a test case it seems there is now a (very small) memory limit on jobs:

$ srun -n4 date
slurmstepd: Step 19293.0 exceeded memory limit (3324 > 1024), being killed
srun: Exceeded job memory limit
slurmstepd: *** STEP 19293.0 CANCELLED AT 2015-04-09T10:46:17 *** on node6
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: node6: tasks 0-3: Killed



            How can I disable / fix this?
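
The 1024 in that message is a per-step memory limit in megabytes; when srun is given no --mem or --mem-per-cpu, such a limit typically comes from a default in slurm.conf. An illustrative fragment (values are assumptions, not a recommendation):

```
# slurm.conf -- illustrative only
#DefMemPerCPU=1024   # default MB per allocated CPU when a job specifies none
# Removing it (the default is 0, i.e. no limit) leaves jobs without a
# default memory limit.
```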



            Thanks,

            ~Mike C.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
