Thanks! In the end, option 1 did the trick (switching to SelectTypeParameters=CR_Core).
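For anyone else hitting this, the change amounts to one line in slurm.conf (the neighboring SelectType line is copied from my config below for context; I restarted slurmctld and the slurmds afterwards):

    SelectType=select/cons_res
    # was CR_Core_Memory: with CR_Core, memory is no longer a consumable
    # resource, so jobs are not killed for exceeding the default memory limit
    SelectTypeParameters=CR_Core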
scontrol shows:

    $ scontrol show config | grep Mem
    DefMemPerNode   = UNLIMITED
    MaxMemPerNode   = UNLIMITED
    MemLimitEnforce = Yes

Even though I set any and all of

    DefMemPerNode = UNLIMITED
    MaxMemPerNode = UNLIMITED
    DefMemPerCPU  = UNLIMITED
    MaxMemPerCPU  = UNLIMITED

and restarted Slurm, scontrol show config still shows MemLimitEnforce = Yes. However, I'm able to launch jobs now without immediate failure. Still testing.

From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: Thursday, October 13, 2016 11:13 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: ulimit issue I'm sure someone has seen before

Mike,

I would suggest that the limit is a SLURM limit rather than a ulimit. What is the result of scontrol show config | grep Mem?

Because you have set SelectTypeParameters=CR_Core_Memory, memory is a consumable resource, and jobs that go over the default memory limit will fail: the SLURM head will kill jobs that eat too many resources. There are a number of ways to solve this:

1. Change SelectTypeParameters=CR_Core_Memory to SelectTypeParameters=CR_Core, noting that memory will no longer be a measured resource when distributing resources.

2. Tell everyone that their jobs will fail if they don't use --mem=X or --mem-per-cpu=X.

3. Set one of:

       DefMemPerNode = UNLIMITED
       MaxMemPerNode = UNLIMITED
       DefMemPerCPU = UNLIMITED
       MaxMemPerCPU = UNLIMITED

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."
- Grace Hopper

On 14 October 2016 at 14:21, Mike Cammilleri <mi...@stat.wisc.edu> wrote:

OK, this should be easy, but I am stumped.

I built Slurm 16.05.0 on Ubuntu 14.04 LTS. It worked fine until I decided I wanted to get email notifications going using smail and seff, then realized that would need slurmdbd, which we weren't using because we have no need for accounting. Fast forward to getting slurmdbd built and job accounting going. Now, for some reason, when I submit any job I get:

    slurmstepd: error: Job 3049 exceeded memory limit (1336 > 1024), being killed

And of course I see lots of stuff on slurm-dev about setting /etc/default/slurm to have 'ulimit -m unlimited', etc. I also see suggestions to put it in /etc/security/limits.conf, and to set ulimit limits at the top of my slurmd init scripts. I've done all these tricks and restarted services to no avail. Users have ulimits of unlimited (when checking with ulimit -a), but running sbatch or srun results in a cap on max memory size. Memory lock seems to be OK.

    [mikec@lunchbox] (34)$ srun bash -c "ulimit -a"
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 1030449
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) 1024    <=====!!!!
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 514377
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

Here is an example error when just trying to sbatch a script that runs /bin/hostname:

    [mikec@lunchbox] (38)$ cat slurm-3053.out
    slurmstepd: error: Job 3053 exceeded memory limit (1096 > 1024), being killed
    slurmstepd: error: Exceeded job memory limit
    slurmstepd: error: *** JOB 3053 ON marzano01 CANCELLED AT 2016-10-13T22:17:08 ***

Where can I set ulimit -m so that it will actually take effect?
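For reference, the variants of the ulimit trick I tried look roughly like this (a sketch of what those suggestions amount to; ulimit -m corresponds to the rss item in limits.conf, and the domains/values shown are illustrative):

    # in /etc/default/slurm (sourced by the slurmd init script)
    ulimit -m unlimited

    # in /etc/security/limits.conf: raise the RSS limit for all users
    *    soft    rss    unlimited
    *    hard    rss    unlimited

None of these changed what srun reports.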
I tried using PropagateResourceLimitsExcept=MEMLOCK,RSS as well, but it did not alleviate the max memory size issue. Here is my slurm.conf:

    #
    ClusterName=marzano
    ControlMachine=lunchbox
    ControlAddr=xxx.xxx.xxx.xxx
    #BackupController=
    #BackupAddr=
    #
    SlurmUser=slurm
    #SlurmdUser=root
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    #JobCredentialPrivateKey=
    #JobCredentialPublicCertificate=
    StateSaveLocation=/slurm.state
    SlurmdSpoolDir=/tmp/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurm/slurmctld.pid
    SlurmdPidFile=/var/run/slurm/slurmd.pid
    ProctrackType=proctrack/pgid
    #PluginDir=
    #FirstJobId=
    ReturnToService=2
    #MaxJobCount=
    #PlugStackConfig=/etc/slurm/plugstack.conf
    #PropagatePrioProcess=
    #PropagateResourceLimits=
    PropagateResourceLimitsExcept=MEMLOCK,RSS
    #Prolog=
    #Epilog=
    #SrunProlog=
    #SrunEpilog=
    #TaskProlog=
    #TaskEpilog=
    #TaskPlugin=
    #TrackWCKey=no
    #TreeWidth=50
    #TmpFS=
    #UsePAM=
    MailProg=/s/slurm/bin/smail
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    #SchedulerAuth=
    #SchedulerPort=
    #SchedulerRootFilter=
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    FastSchedule=1
    #PriorityType=priority/multifactor
    #PriorityDecayHalfLife=14-0
    #PriorityUsageResetPeriod=14-0
    #PriorityWeightFairshare=100000
    #PriorityWeightAge=1000
    #PriorityWeightPartition=10000
    #PriorityWeightJobSize=1000
    #PriorityMaxAge=1-0
    #
    # LOGGING
    SlurmctldDebug=6
    SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
    SlurmdDebug=6
    SlurmdLogFile=/var/log/slurmd/slurmd.log
    JobCompType=jobcomp/none
    #JobCompLoc=
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    #
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=lunchbox
    AccountingStorageLoc=slurm_acct_db
    AccountingStoragePass=auth/munge
    AccountingStorageUser=slurm
    #
    # COMPUTE NODES
    NodeName=marzano0[1-6] CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=debug Nodes=marzano0[1-6] Default=YES MaxTime=INFINITE State=UP
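For completeness, options 2 and 3 from Lachlan's reply above would look roughly like the following. The values here are illustrative, not from the thread; in 16.05 the --mem values are in megabytes:

    # Option 2: users state their memory needs at submission time
    sbatch --mem=2048 job.sh
    # or inside the batch script itself:
    #SBATCH --mem-per-cpu=1024

    # Option 3: raise the defaults cluster-wide in slurm.conf instead
    DefMemPerNode=UNLIMITED
    MaxMemPerNode=UNLIMITED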