Thanks, in the end option 1 did the trick (switching to 
SelectTypeParameters=CR_Core). scontrol shows…

$ scontrol show config | grep Mem
DefMemPerNode           = UNLIMITED
MaxMemPerNode           = UNLIMITED
MemLimitEnforce         = Yes

Even after setting all of

DefMemPerNode           = UNLIMITED
MaxMemPerNode           = UNLIMITED
DefMemPerCPU           = UNLIMITED
MaxMemPerCPU           = UNLIMITED

and restarting Slurm, scontrol show config still shows MemLimitEnforce = Yes. 
However, I'm able to launch jobs now without immediate failure. Still testing.
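
One quick sanity check (just a sketch; it assumes python is available on the compute 
nodes, and the 2 GB figure is arbitrary, only meant to be comfortably above the old 
1024 limit):

$ srun --ntasks=1 python -c "x = bytearray(2 * 1024**3); print('ok')"

With CR_Core, memory is no longer a consumable resource, so this should run without 
being killed with an "exceeded memory limit" error like the ones below.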

From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: Thursday, October 13, 2016 11:13 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: ulimit issue I'm sure someone has seen before

Mike, I would suggest that the limit is a SLURM limit rather than a ulimit.
What is the result of

scontrol show config | grep Mem

?

Because you have set
SelectTypeParameters=CR_Core_Memory

the Memory part means jobs will fail if they go over the default memory limit. The 
SLURM head will kill jobs that eat too many resources.
There are a number of ways to solve this:
1. Change
SelectTypeParameters=CR_Core_Memory
to
SelectTypeParameters=CR_Core

noting that Memory will no longer be a measured resource when distributing 
resources.
2. Tell everyone that their jobs will fail if they don't use --mem=X or 
--mem-per-cpu=X (see the sketch after this list)
3. Set one of:

DefMemPerNode           = UNLIMITED
MaxMemPerNode           = UNLIMITED
DefMemPerCPU           = UNLIMITED
MaxMemPerCPU           = UNLIMITED
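
For option 2, a job script would just request memory up front, for example (a sketch 
only; the 2G figure and the script contents are illustrative):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2G
srun ./my_program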




Cheers
L.


------
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

On 14 October 2016 at 14:21, Mike Cammilleri <mi...@stat.wisc.edu> wrote:

OK, should be easy but I am stumped



Built Slurm 16.05.0 on Ubuntu 14.04 LTS. It worked fine until I decided I wanted 
to get email notifications going using smail and seff, then realized that would 
need slurmdbd, which we weren't using because we have no need for accounting.



Fast forward to getting slurmdbd built and job accounting going. Now for some 
reason when I submit any job I get:



slurmstepd: error: Job 3049 exceeded memory limit (1336 > 1024), being killed



And of course I see lots of stuff on slurm-dev about setting /etc/default/slurm 
to have 'ulimit -m unlimited' etc. I also see suggestions to put it in 
/etc/security/limits.conf. I also see suggestions to set ulimit limits at the 
top of my slurmd init scripts.
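
For reference, those suggestions amount to roughly the following (paths and exact 
syntax as I understood them from those threads; as noted below, none of it helped here):

# /etc/default/slurm (sourced by the slurmd init script)
ulimit -m unlimited
ulimit -l unlimited

# /etc/security/limits.conf
*    soft    rss    unlimited
*    hard    rss    unlimited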



I've done all these tricks and restarted services to no avail. Users have 
ulimits of unlimited (when checking with ulimit -a) but running sbatch or srun 
results in a cap on memory size. Memory lock seems to be ok.


[mikec@lunchbox] (34)$ srun bash -c "ulimit -a"
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030449
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 1024   <=====!!!!
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514377
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Here is an example error when just trying to sbatch a script to run 
/bin/hostname

[mikec@lunchbox] (38)$ cat slurm-3053.out

slurmstepd: error: Job 3053 exceeded memory limit (1096 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 3053 ON marzano01 CANCELLED AT 2016-10-13T22:17:08 ***
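
(The submit script itself is as minimal as it gets, roughly along these lines; the 
exact script isn't shown here and the job name is made up:)

#!/bin/bash
#SBATCH --job-name=hostname-test
srun /bin/hostname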





Where can I set ulimit -m so that it will actually take effect?



Here is my slurm.conf. I tried using PropagateResourceLimitsExcept=MEMLOCK,RSS 
but it did not alleviate the max memory size issue.


#
ClusterName=marzano
ControlMachine=lunchbox
ControlAddr=xxx.xxx.xxx.xxx
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/slurm.state
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=/etc/slurm/plugstack.conf
#PropagatePrioProcess=
#PropagateResourceLimits=
PropagateResourceLimitsExcept=MEMLOCK,RSS
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
MailProg=/s/slurm/bin/smail
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurmd/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lunchbox
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=auth/munge
AccountingStorageUser=slurm
#
# COMPUTE NODES
NodeName=marzano0[1-6] CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN

PartitionName=debug Nodes=marzano0[1-6] Default=YES MaxTime=INFINITE State=UP

