Mike, I would suggest that the limit is a SLURM limit rather than a ulimit.

What is the result of

scontrol show config | grep Mem

?
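
On a cluster hitting this it will typically print something along these
lines (the values below are purely illustrative, not taken from your
site):

DefMemPerCPU            = 1024
MaxMemPerNode           = UNLIMITED
MaxMemPerCPU            = UNLIMITED

The Def*/Max* memory lines are the ones to look at.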

Because you have set

SelectTypeParameters=CR_Core_Memory

memory is a consumable resource, so any job that goes over its memory
allocation (the default limit, if it did not request one) will fail:
slurmstepd on the compute node kills jobs that exceed their allocation.
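
To make that concrete (the 1024 is an assumption about your config,
check the scontrol output above): if the default memory for a job works
out to 1024, a job that requests no memory gets an allocation of 1024,
and the two numbers slurmstepd prints, "(1336 > 1024)", are simply the
job's measured usage versus that allocation.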

There are a number of ways to solve this:

1. Change
SelectTypeParameters=CR_Core_Memory
to
SelectTypeParameters=CR_Core

noting that memory will then no longer be tracked as a consumable
resource when allocating jobs.

2. Tell everyone that their jobs will fail if they do not request memory
with --mem=X or --mem-per-cpu=X (see the example job script after this
list).

3. Set one of the following in slurm.conf:

DefMemPerNode = UNLIMITED
MaxMemPerNode = UNLIMITED
DefMemPerCPU  = UNLIMITED
MaxMemPerCPU  = UNLIMITED
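
For option 2, the job script only needs an explicit memory request,
e.g. (the 4096 MB figure is just an example; pick whatever suits your
jobs):

#!/bin/bash
#SBATCH --job-name=memtest
#SBATCH --mem=4096        # MB for the whole job; or use --mem-per-cpu
/bin/hostname

For options 1 and 3, remember to copy the edited slurm.conf to the
controller and all compute nodes and then restart slurmctld/slurmd (or
at least run "scontrol reconfigure"; a change to SelectTypeParameters
is safest with a full restart).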





Cheers
L.



------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper

On 14 October 2016 at 14:21, Mike Cammilleri <mi...@stat.wisc.edu> wrote:

> OK, should be easy but I am stumped
>
>
> Built Slurm 16.05.0 on Ubuntu-14.04 LTS. Worked fine until I decided I
> wanted to get email notifications going using smail and seff, then realized
> that would need slurmdbd which we weren't using because we have no need for
> accounting.
>
>
> Fast forward to getting slurmdbd built and job accounting going. Now for
> some reason when I submit any job I get:
>
>
> slurmstepd: error: Job 3049 exceeded memory limit (1336 > 1024), being
> killed
>
>
> And of course I see lots of stuff on slurm-dev about setting
> /etc/default/slurm to have 'ulimit -m unlimited' etc. I also see
> suggestions to put it in /etc/security/limits.conf. I also see suggestions
> to set ulimit limits at the top of my slurmd init scripts.
>
>
> I've done all these tricks and restarted services to no avail. Users have
> ulimits of unlimited (when checking with ulimit -a) but running sbatch or
> srun results in a cap on memory size. Memory lock seems to be ok.
>
>
> [mikec@lunchbox] (34)$ srun bash -c "ulimit -a"
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1030449
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) 1024   <=====!!!!
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 514377
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Here is an example error when just trying to sbatch a script to run
> /bin/hostname
>
> [mikec@lunchbox] (38)$ cat slurm-3053.out
>
> slurmstepd: error: Job 3053 exceeded memory limit (1096 > 1024), being
> killed
> slurmstepd: error: Exceeded job memory limit
> slurmstepd: error: *** JOB 3053 ON marzano01 CANCELLED AT
> 2016-10-13T22:17:08 ***
>
>
>
> Where can I set ulimit -m so that it will actually take effect?
>
>
> Here is my slurm.conf. I tried using PropagateResourceLimitsExcept=MEMLOCK,RSS
> but that did not alleviate the max memory size issue.
>
>
> #
> ClusterName=marzano
> ControlMachine=lunchbox
> ControlAddr=xxx.xxx.xxx.xxx
> #BackupController=
> #BackupAddr=
> #
> SlurmUser=slurm
> #SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> StateSaveLocation=/slurm.state
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> SlurmdPidFile=/var/run/slurm/slurmd.pid
> ProctrackType=proctrack/pgid
> #PluginDir=
> #FirstJobId=
> ReturnToService=2
> #MaxJobCount=
> #PlugStackConfig=/etc/slurm/plugstack.conf
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> PropagateResourceLimitsExcept=MEMLOCK,RSS
> #Prolog=
> #Epilog=
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> #TaskPlugin=
> #TrackWCKey=no
> #TreeWidth=50
> #TmpFS=
> #UsePAM=
> MailProg=/s/slurm/bin/smail
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> #SchedulerAuth=
> #SchedulerPort=
> #SchedulerRootFilter=
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> FastSchedule=1
> #PriorityType=priority/multifactor
> #PriorityDecayHalfLife=14-0
> #PriorityUsageResetPeriod=14-0
> #PriorityWeightFairshare=100000
> #PriorityWeightAge=1000
> #PriorityWeightPartition=10000
> #PriorityWeightJobSize=1000
> #PriorityMaxAge=1-0
> #
> # LOGGING
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
> SlurmdDebug=6
> SlurmdLogFile=/var/log/slurmd/slurmd.log
> JobCompType=jobcomp/none
> #JobCompLoc=
> #
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> #
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=lunchbox
> AccountingStorageLoc=slurm_acct_db
> AccountingStoragePass=auth/munge
> AccountingStorageUser=slurm
> #
> # COMPUTE NODES
> NodeName=marzano0[1-6] CPUs=48 Sockets=2 CoresPerSocket=12
> ThreadsPerCore=2 State=UNKNOWN
>
> PartitionName=debug Nodes=marzano0[1-6] Default=YES MaxTime=INFINITE
> State=UP
>
>
