Dear list,

This week I updated from 15.8.12 to 16.05.6. Together with this upgrade I
also changed some of the configuration options to allow shared
(user-exclusive) usage of nodes. Since then, some of my users report that
their jobs get killed when they allocate more than half of the installed
memory (on the exclusive-usage partition).

Has anyone experienced the same? Any advice?


Cluster information:

* all nodes are defined as "NodeName=node001 RealMemory=258290 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2"
* previously there was only one partition, which allocated nodes exclusively
* a new partition was added to allow shared usage (user exclusive)


Selected config parameters that were not changed:

EnforcePartLimits=YES
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/linux


Changed configuration items:

 old                     | new
-------------------------+-------------------------
DefMemPerCPU=6144        | #DefMemPerCPU
#MaxMemPerNode           | MaxMemPerNode=245760
SelectType=select/linear | SelectType=select/cons_res
#SelectTypeParameters    | SelectTypeParameters=CR_Core_Memory

Old Partition:
  PartitionName=TS Nodes=ALL Default=YES MaxTime=70-00:00:00 State=UP Shared=EXCLUSIVE
was changed to
  PartitionName=TS Nodes=ALL Default=YES MaxTime=70-00:00:00 State=UP OverSubscribe=EXCLUSIVE

New Partition was added:
  PartitionName=TS-shared Nodes=ALL Default=YES MaxTime=70-00:00:00 State=UP OverSubscribe=FORCE ExclusiveUser=YES


This is my cgroup.conf (no changes were made).

###### cgroup.conf ########
CgroupMountpoint=/dev/shm/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/opt/system/slurm/default/etc/cgroup"
MaxRAMPercent=98
AllowedRAMSpace=100
ConstrainRAMSpace=yes
MaxSwapPercent=0
AllowedSwapSpace=0
ConstrainSwapSpace=yes
#ConstrainCores=no
ConstrainCores=yes
TaskAffinity=yes
###########################
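
For orientation, here is a rough sketch of how AllowedRAMSpace and
MaxRAMPercent are documented to interact: the cgroup RAM limit is a
percentage of the allocated memory, capped at a percentage of the node's
total RAM. This is not Slurm's actual code. The function name is
hypothetical, and using RealMemory (258290 MB) as a stand-in for the total
RAM the kernel reports is an assumption.

```python
def cgroup_ram_limit_mb(alloc_mb, total_ram_mb,
                        allowed_ram_space=100.0, max_ram_percent=98.0):
    """Approximate RAM limit (MB) for a job cgroup (hypothetical helper).

    allowed_ram_space and max_ram_percent default to the values in the
    cgroup.conf above (AllowedRAMSpace=100, MaxRAMPercent=98).
    """
    # Limit requested by the allocation, scaled by AllowedRAMSpace.
    requested = alloc_mb * allowed_ram_space / 100.0
    # Upper bound derived from total RAM, scaled by MaxRAMPercent.
    ceiling = total_ram_mb * max_ram_percent / 100.0
    return min(requested, ceiling)


REAL_MEMORY_MB = 258290       # RealMemory from the node definition
MAX_MEM_PER_NODE_MB = 245760  # MaxMemPerNode from the new configuration

# A job allocated MaxMemPerNode stays below the 98% ceiling.
limit_at_max = cgroup_ram_limit_mb(MAX_MEM_PER_NODE_MB, REAL_MEMORY_MB)
# A job allocated the whole node's RealMemory is capped by MaxRAMPercent.
limit_at_full = cgroup_ram_limit_mb(REAL_MEMORY_MB, REAL_MEMORY_MB)

print(f"limit at MaxMemPerNode: {limit_at_max:.1f} MB")
print(f"limit at RealMemory:    {limit_at_full:.1f} MB")
```

Under these assumptions, only allocations above roughly 98% of the node's
RAM would be affected by the MaxRAMPercent cap.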


Error from user 1:
slurmstepd: error: Step 89204.0 exceeded memory limit (271364344 > 264488960), being killed
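
A quick arithmetic check on the numbers in this message may help when
reading it. The sketch below assumes the two values are reported in KB (the
message does not state the unit); under that assumption the limit works out
to exactly the configured RealMemory, not to MaxMemPerNode.

```python
# Values copied from the slurmstepd message; the KB unit is an assumption.
used_kb = 271364344
limit_kb = 264488960

real_memory_mb = 258290       # RealMemory from the node definition
max_mem_per_node_mb = 245760  # MaxMemPerNode from the new configuration

# Convert to MB so the values can be compared with the config settings.
limit_mb = limit_kb / 1024
used_mb = used_kb / 1024
overshoot_mb = used_mb - limit_mb

print(f"limit: {limit_mb:.0f} MB")
print(f"used:  {used_mb:.2f} MB")
print(f"over:  {overshoot_mb:.2f} MB")
print("limit equals RealMemory:   ", limit_mb == real_memory_mb)
print("limit equals MaxMemPerNode:", limit_mb == max_mem_per_node_mb)
```

If the unit assumption holds, the step was accounted as using more memory
than the node's entire RealMemory when it was killed.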

Error from user 2 running VASP:
srun: error: node001: task 5: Bus error


Thank you,

  Uwe
