Hi Slurm community,

We recently ran into a small problem triggered by one of our jobs. We have *MaxMemPerNode*=*532000* set for our compute node in slurm.conf, yet a job that started with --mem=65536 was able to grow its memory usage over several hours of execution up to ~650 GB. We expected *MaxMemPerNode* to stop any job exceeding the 532000 limit. Did we miss something in slurm.conf? We were trying to avoid setting up a QOS for each group of users. Any help is welcome.
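One thing we were unsure about is whether runtime enforcement needs cgroup.conf constraints in addition to slurm.conf. This is a sketch of what we suspect may be required (parameter names are from the cgroup.conf man page; we have not yet confirmed which of these our own file sets):

```
# cgroup.conf sketch -- hypothetical, not our actual file
CgroupAutomount=yes
# Constrain each job's RAM to its allocation (e.g. --mem=65536)
ConstrainRAMSpace=yes
# Also constrain swap so jobs cannot spill past the limit
ConstrainSwapSpace=yes
```

Our understanding is that TaskPlugin=task/cgroup (which we do set, below) only takes effect for memory if cgroup.conf enables these constraints, but we'd appreciate confirmation.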
Here is the node definition in the conf file:

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu

And here is the full slurm.conf file:

# node health check
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

## Timeouts
SlurmctldTimeout=600
SlurmdTimeout=600

GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres

## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0

## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill

## Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStorageHost=10.10.0.1
AccountingStorageEnforce=limits
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc

## MPI
MpiDefault=pmi2

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu

## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01

--
Cristóbal A. Navarro