Mike, I would suggest that the limit is a SLURM limit rather than a ulimit.

What is the result of

scontrol show config | grep Mem
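
If that output is noisy, a narrower filter along these lines may help. It is only a sketch: the function name show_mem_settings is made up for illustration, and it assumes the settings appear as "Key = value" lines, as they normally do in scontrol show config output.

```shell
# Hypothetical helper: reads `scontrol show config` output on stdin and keeps
# only the memory-related settings discussed in this thread.
show_mem_settings() {
    grep -E '^(DefMemPer(CPU|Node)|MaxMemPer(CPU|Node)|SelectType(Parameters)?)[[:space:]]*='
}

# Usage on a machine with Slurm installed:
#   scontrol show config | show_mem_settings
```

Including the SelectType lines in the filter shows at a glance whether memory is being treated as a consumable resource.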


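Because your slurm.conf (quoted below) sets

SelectTypeParameters=CR_Core_Memory

memory is tracked as a consumable resource, so jobs will fail if they go
over the default memory limit. Slurm kills jobs that exceed their memory
allocation, which is what the slurmstepd errors in your output show.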

There are a number of ways to solve this:

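1. Change SelectTypeParameters so that it no longer includes memory,
noting that memory will no longer be a measured resource when distributing
jobs.

2. Tell everyone that their jobs will fail if they don't request memory
explicitly with --mem=X or --mem-per-cpu=X.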

3. Set one of:

DefMemPerNode           = UNLIMITED
MaxMemPerNode           = UNLIMITED
DefMemPerCPU           = UNLIMITED
MaxMemPerCPU           = UNLIMITED
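
For option 2, a batch script that requests memory explicitly might look like the sketch below. The job name and the 2048 MB figure are placeholders, not recommendations, and uname -n stands in for the /bin/hostname test from your message. --mem and --mem-per-cpu are standard sbatch options.

```shell
#!/bin/bash
#SBATCH --job-name=mem-test   # placeholder job name
#SBATCH --mem=2048            # memory per node in MB; placeholder value
# Alternative, per allocated CPU instead of per node:
#   #SBATCH --mem-per-cpu=1024

# Trivial payload: print the name of the node the job landed on.
uname -n
```

Submitted with sbatch, the job should then be held to the requested figure rather than the cluster default.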


The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper

On 14 October 2016 at 14:21, Mike Cammilleri <mi...@stat.wisc.edu> wrote:

> OK, should be easy but I am stumped
> Built Slurm 16.05.0 on Ubuntu-14.04 LTS. Worked fine until I decided I
> wanted to get email notifications going using smail and seff, then realized
> that would need slurmdbd which we weren't using because we have no need for
> accounting.
> Fast forward to getting slurmdbd built and job accounting going. Now for
> some reason when I submit any job I get:
> slurmstepd: error: Job 3049 exceeded memory limit (1336 > 1024), being
> killed
> And of course I see lots of stuff on slurm-dev about setting
> /etc/default/slurm to have 'ulimit -m unlimited' etc. I also see
> suggestions to put it in /etc/security/limits.conf. I also see suggestions
> to set ulimit limits at the top of my slurmd init scripts.
> I've done all these tricks and restarted services to no avail. Users have
> ulimits of unlimited (when checking with ulimit -a) but running sbatch or
> srun results in a cap on memory size. Memory lock seems to be ok.
> [mikec@lunchbox] (34)$ srun bash -c "ulimit -a"
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1030449
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) 1024   <=====!!!!
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 514377
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> Here is an example error when just trying to sbatch a script to run
> /bin/hostname
> [mikec@lunchbox] (38)$ cat slurm-3053.out
> slurmstepd: error: Job 3053 exceeded memory limit (1096 > 1024), being
> killed
> slurmstepd: error: Exceeded job memory limit
> slurmstepd: error: *** JOB 3053 ON marzano01 CANCELLED AT
> 2016-10-13T22:17:08 ***
> Where can I set ulimit -m so that it will actually take effect?
> Here is my slurm.conf. I tried using PropagateResourceLimitsExcept=MEMLOCK,RSS
> but did not alleviate max memory size issue.
> #
> ClusterName=marzano
> ControlMachine=lunchbox
> ControlAddr=xxx.xxx.xxx.xxx
> #BackupController=
> #BackupAddr=
> #
> SlurmUser=slurm
> #SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> StateSaveLocation=/slurm.state
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> SlurmdPidFile=/var/run/slurm/slurmd.pid
> ProctrackType=proctrack/pgid
> #PluginDir=
> #FirstJobId=
> ReturnToService=2
> #MaxJobCount=
> #PlugStackConfig=/etc/slurm/plugstack.conf
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> PropagateResourceLimitsExcept=MEMLOCK,RSS
> #Prolog=
> #Epilog=
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> #TaskPlugin=
> #TrackWCKey=no
> #TreeWidth=50
> #TmpFS=
> #UsePAM=
> MailProg=/s/slurm/bin/smail
> #
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
> SchedulerType=sched/backfill
> #SchedulerAuth=
> #SchedulerPort=
> #SchedulerRootFilter=
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> FastSchedule=1
> #PriorityType=priority/multifactor
> #PriorityDecayHalfLife=14-0
> #PriorityUsageResetPeriod=14-0
> #PriorityWeightFairshare=100000
> #PriorityWeightAge=1000
> #PriorityWeightPartition=10000
> #PriorityWeightJobSize=1000
> #PriorityMaxAge=1-0
> #
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
> SlurmdDebug=6
> SlurmdLogFile=/var/log/slurmd/slurmd.log
> JobCompType=jobcomp/none
> #JobCompLoc=
> #
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> #
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=lunchbox
> AccountingStorageLoc=slurm_acct_db
> AccountingStoragePass=auth/munge
> AccountingStorageUser=slurm
> #
> NodeName=marzano0[1-6] CPUs=48 Sockets=2 CoresPerSocket=12
> ThreadsPerCore=2 State=UNKNOWN
> PartitionName=debug Nodes=marzano0[1-6] Default=YES MaxTime=INFINITE
> State=UP

