[slurm-users] job_container/tmpfs and autofs

2023-01-11 Thread Hagdorn, Magnus Karl Moritz
Hi there,
we were excited to find the job_container/tmpfs plugin, which neatly lets
us provide local scratch space and ensures that /dev/shm gets cleaned up
after a job finishes. Unfortunately, it does not play nicely with autofs,
which we use to provide networked project and scratch directories. This
turns out to be a known issue [1]. Has it been solved in the meantime? In
any case, it would be really useful to have a warning about this issue in
the documentation for the job_container/tmpfs plugin.
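
For context, this is roughly the kind of configuration we are talking
about (BasePath here is just a placeholder; adjust for your site):

# slurm.conf
JobContainerType=job_container/tmpfs
PrologFlags=Contain

# job_container.conf
AutoBasePath=true
BasePath=/local/scratch

The plugin then gives each job a private /tmp under BasePath and a
private /dev/shm, both of which are cleaned up when the job ends.
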
Regards
magnus

[1]
https://cernvm-forum.cern.ch/t/intermittent-client-failures-too-many-levels-of-symbolic-links/156/4
-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de



Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-11 Thread Rodrigo Santibáñez
Hi Cristóbal,

I would guess you need to set up a cgroup.conf file, for example:

###
# Slurm cgroup support configuration file
###
# Confine each job to the memory it requested; allow no extra swap
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=0
#ConstrainDevices=yes
MemorySwappiness=0
TaskAffinity=no
CgroupAutomount=yes
# Also confine each job to its allocated cores
ConstrainCores=yes
#
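
As a quick sanity check once the cgroup limits are in place, you could
run something like this (hypothetical example; assumes stress-ng is
installed on the node):

# request 1G, then deliberately try to use 2G
$ sbatch --mem=1G --wrap="stress-ng --vm 1 --vm-bytes 2G --vm-keep --timeout 60"

With ConstrainRAMSpace=yes the step should be OOM-killed once it exceeds
its 1G allocation instead of growing past the request.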

Best,
Rodrigo

On Wed, Jan 11, 2023 at 10:50 PM Cristóbal Navarro <
cristobal.navarr...@gmail.com> wrote:

> Hi Slurm community,
> Recently we found a small problem triggered by one of our jobs. We have
> *MaxMemPerNode*=*532000* set for our compute node in the slurm.conf file;
> however, a job that started with mem=65536 was able, over hours of
> execution, to grow its memory usage to ~650GB. We expected that
> *MaxMemPerNode* would stop any job exceeding the limit of 532000. Did we
> miss something in the slurm.conf file? We were trying to avoid setting up
> a QOS for each group of users.
> Any help is welcome.
>
> Here is the node definition in the conf file
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
> RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
> Feature=gpu
>
>
> And here is the full slurm.conf file
> # node health check
> HealthCheckProgram=/usr/sbin/nhc
> HealthCheckInterval=300
>
> ## Timeouts
> SlurmctldTimeout=600
> SlurmdTimeout=600
>
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
>
> ## We don't want a node to go back in pool without sys admin acknowledgement
> ReturnToService=0
>
> ## Basic scheduling
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> ## Accounting
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreJobComment=YES
> AccountingStorageHost=10.10.0.1
> AccountingStorageEnforce=limits
>
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## scripts
> Epilog=/etc/slurm/epilog
> Prolog=/etc/slurm/prolog
> PrologFlags=Alloc
>
> ## MPI
> MpiDefault=pmi2
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
> RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
> Feature=gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556
> DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=3-12:00:00
> State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384
> MaxMemPerNode=42 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01
>
>
> --
> Cristóbal A. Navarro
>


[slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-11 Thread Cristóbal Navarro
Hi Slurm community,
Recently we found a small problem triggered by one of our jobs. We have
*MaxMemPerNode*=*532000* set for our compute node in the slurm.conf file;
however, a job that started with mem=65536 was able, over hours of
execution, to grow its memory usage to ~650GB. We expected that
*MaxMemPerNode* would stop any job exceeding the limit of 532000. Did we
miss something in the slurm.conf file? We were trying to avoid setting up
a QOS for each group of users.
Any help is welcome.

Here is the node definition in the conf file
## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
Feature=gpu


And here is the full slurm.conf file
# node health check
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

## Timeouts
SlurmctldTimeout=600
SlurmdTimeout=600

GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres

## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0

## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill

## Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStorageHost=10.10.0.1
AccountingStorageEnforce=limits

JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc

## MPI
MpiDefault=pmi2

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
Feature=gpu

## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556
DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=3-12:00:00
State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384
MaxMemPerNode=42 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01
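
For reference, the memory usage of a running job can be checked with
sstat, and after completion with sacct (the job id below is just a
placeholder):

$ sstat -j 123456 --format=JobID,MaxRSS,AveRSS
$ sacct -j 123456 --format=JobID,ReqMem,MaxRSS,State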


-- 
Cristóbal A. Navarro