Dear list,

this week I updated from 15.08.12 to 16.05.6. Together with this upgrade I also changed some of the configuration options to allow shared (user-exclusive) usage of nodes. Since then, some of my users report that their jobs get killed when they allocate more than half of the installed memory (on the exclusive-usage partition).
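A typical failing case looks roughly like this (an illustrative sketch only; the exact commands, task counts and memory requests differ between users):

    sbatch -p TS-shared -n 20 --mem=150000 job.sh   # hypothetical job; --mem is in MB, so ~146 GB, more than half of RealMemory=258290

The step is later killed by slurmstepd with an "exceeded memory limit" error (exact messages at the end of this mail).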
Did anyone experience the same? Any advice?

Cluster information:
* all nodes are defined as
  "NodeName=node001 RealMemory=258290 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2"
* previously there was only one partition that allocated nodes exclusively
* a new partition was added to allow shared (user-exclusive) usage

Selected config parameters that were not changed:
EnforcePartLimits=YES
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/linux

Changed configuration items:
old                       | new
--------------------------+-------------------------
DefMemPerCPU=6144         | #DefMemPerCPU
#MaxMemPerNode            | MaxMemPerNode=245760
SelectType=select/linear  | SelectType=select/cons_res
#SelectTypeParameters     | SelectTypeParameters=CR_Core_Memory

The old partition

PartitionName=TS Nodes=ALL Default=YES MaxTime=70-00:00:00 State=UP Shared=EXCLUSIVE

was changed to

PartitionName=TS Nodes=ALL Default=YES MaxTime=70-00:00:00 State=UP OverSubscribe=EXCLUSIVE

and a new partition was added:

PartitionName=TS-shared Nodes=ALL Default=YES MaxTime=70-00:00:00 State=UP OverSubscribe=FORCE ExclusiveUser=YES

This is my cgroup.conf (no changes were made):

###### cgroup.conf ########
CgroupMountpoint=/dev/shm/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/opt/system/slurm/default/etc/cgroup"
MaxRAMPercent=98
AllowedRAMSpace=100
ConstrainRAMSpace=yes
MaxSwapPercent=0
AllowedSwapSpace=0
ConstrainSwapSpace=yes
#ConstrainCores=no
ConstrainCores=yes
TaskAffinity=yes
###########################

Error from user 1:

slurmstepd: error: Step 89204.0 exceeded memory limit (271364344 > 264488960), being killed

Error from user 2 running VASP:

srun: error: node001: task 5: Bus error

Thank you,
Uwe
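P.S. In case the numbers help: the limit in the first error message works out to exactly the node's RealMemory (264488960 KB / 1024 = 258290 MB), and the reported usage (271364344 KB, about 265004 MB) is above that.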