Re: [slurm-users] job_container/tmpfs and autofs
In my opinion, the problem is with autofs, not with tmpfs. Autofs simply doesn't work well when you are using detached filesystem namespaces and bind mounting. We ran into this problem years ago (with an in-house SPANK plugin doing more or less what tmpfs does) and ended up simply not using autofs.

I guess you could try using systemd's auto-mounting features, but I have no idea whether they work better than autofs in situations like this.

We ended up using a system where the prolog script mounts any needed file systems, and then the healthcheck script unmounts file systems that are no longer needed.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
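[For anyone wanting to copy that setup, a minimal sketch of the two pieces might look like the following. The filesystem list and the fstab-driven mounts are assumptions on my part, not Bjørn-Helge's actual scripts.]

Prolog fragment (runs as root on each allocated node before the job starts):

    #!/bin/bash
    # Hypothetical prolog: mount filesystems the job may need.
    # Assumes the mount points are listed in /etc/fstab with "noauto".
    for fs in /projects /scratch; do
        mountpoint -q "$fs" || mount "$fs"
    done
    exit 0

Healthcheck fragment (unmounts filesystems with no remaining users):

    #!/bin/bash
    # Hypothetical healthcheck: unmount filesystems nothing is using.
    # "fuser -s -m" exits 0 if any process still has files open there.
    for fs in /projects /scratch; do
        if mountpoint -q "$fs" && ! fuser -s -m "$fs"; then
            umount "$fs"
        fi
    done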
Re: [slurm-users] job_container/tmpfs and autofs
We had the same issue when we switched to the job_container plugin. We ended up running "cvmfs_config probe" as part of the health check tool so that the CVMFS repos stay mounted. However, after we switched on power saving we ran into some race conditions (a job landed on a node before CVMFS was mounted). We ended up switching to static mounts for the CVMFS repos on the compute nodes.

Best
Ümit

On Thu, Jan 12, 2023, 09:17 Bjørn-Helge Mevik wrote:
> In my opinion, the problem is with autofs, not with tmpfs. Autofs
> simply doesn't work well when you are using detached fs name spaces and
> bind mounting. We ran into this problem years ago (with an inhouse
> spank plugin doing more or less what tmpfs does), and ended up simply
> not using autofs.
>
> I guess you could try using systemd's auto-mounting features, but I have
> no idea if they work better than autofs in situations like this.
>
> We ended up using a system where the prolog script mounts any needed
> file systems, and then the healthcheck script unmounts file systems that
> are no longer needed.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>
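[The health-check addition Ümit describes can be as small as the sketch below; the wrapper script and its error handling are my assumptions, not his actual check. "cvmfs_config probe" contacts every configured repository, which triggers the autofs mounts and keeps them alive.]

    #!/bin/bash
    # Probe all configured CVMFS repositories so autofs (re)mounts them.
    if ! cvmfs_config probe >/dev/null 2>&1; then
        echo "ERROR: cvmfs_config probe failed on $(hostname)" >&2
        exit 1
    fi
    exit 0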
Re: [slurm-users] job_container/tmpfs and autofs
Hi Magnus,

We had the same challenge some time ago. A long description of solutions is in my Wiki page at
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories

The issue may have been solved in https://bugs.schedmd.com/show_bug.cgi?id=12567 which will be in Slurm 23.02. At this time, the auto_tmpdir SPANK plugin seems to be the best solution.

IHTH,
Ole

On 1/12/23 08:49, Hagdorn, Magnus Karl Moritz wrote:
> Hi there,
> we excitedly found the job_container/tmpfs plugin which neatly allows
> us to provide local scratch space and a way of ensuring that /dev/shm
> gets cleaned up after a job finishes. Unfortunately we found that it
> does not play nicely with autofs which we use to provide networked
> project and scratch directories. We found that this is a known issue
> [1]. I was wondering if that has been solved? I think it would be
> really useful to have a warning about this issue in the documentation
> for the job_container/tmpfs plugin.
> Regards
> magnus
>
> [1]
> https://cernvm-forum.cern.ch/t/intermittent-client-failures-too-many-levels-of-symbolic-links/156/4

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
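[For anyone going the auto_tmpdir route: SPANK plugins are loaded via plugstack.conf, so enabling it amounts to something like the sketch below. The install path is an assumption; check the plugin's README for its actual options.]

    # /etc/slurm/plugstack.conf
    # Path is an assumption; point it at wherever auto_tmpdir.so was installed.
    required /usr/lib64/slurm/auto_tmpdir.so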
Re: [slurm-users] job_container/tmpfs and autofs
Hello,

another workaround could be to use the InitScript=/path/to/script.sh option of the plugin. For example, if the user's home directory is under autofs:

script.sh:

    #!/bin/bash
    # cd into the user's home so autofs mounts it for the job.
    uid=$(squeue -h -O username -j "$SLURM_JOB_ID")
    cd /home/$uid

Best regards
Gizo

> Hi there,
> we excitedly found the job_container/tmpfs plugin which neatly allows
> us to provide local scratch space and a way of ensuring that /dev/shm
> gets cleaned up after a job finishes. Unfortunately we found that it
> does not play nicely with autofs which we use to provide networked
> project and scratch directories. We found that this is a known issue
> [1]. I was wondering if that has been solved? I think it would be
> really useful to have a warning about this issue in the documentation
> for the job_container/tmpfs plugin.
> Regards
> magnus
>
> [1]
> https://cernvm-forum.cern.ch/t/intermittent-client-failures-too-many-levels-of-symbolic-links/156/4
> --
> Magnus Hagdorn
> Charité – Universitätsmedizin Berlin
> Geschäftsbereich IT | Scientific Computing
>
> Campus Charité Virchow Klinikum
> Forum 4 | Ebene 02 | Raum 2.020
> Augustenburger Platz 1
> 13353 Berlin
>
> magnus.hagd...@charite.de
> https://www.charite.de
> HPC Helpdesk: sc-hpc-helpd...@charite.de

--
___
Dr. Gizo Nanava
Group Leader, Scientific Computing
Leibniz Universität IT Services
Leibniz Universität Hannover
Schlosswender Str. 5
D-30159 Hannover
Tel +49 511 762 7919085
http://www.luis.uni-hannover.de
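[For completeness, the script is wired up in job_container.conf; a minimal sketch, with an assumed BasePath and script location:]

    # job_container.conf
    AutoBasePath=true
    BasePath=/local/scratch          # assumption: node-local scratch disk
    InitScript=/etc/slurm/script.sh  # the script shown above

[The idea is that InitScript runs as part of setting up the job's container, so the cd into the user's home triggers the automount where the job will actually see it, rather than only in the host namespace.]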
[slurm-users] Regression from slurm-22.05.2 to slurm-22.05.7 when using "--gpus=N" option.
Hello,

I have a small 2-compute-node GPU cluster, where each node has 2 GPUs.

$ sinfo -o "%20N %10c %10m %25f %30G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
o186i[126-127]       128        64000      (null)                    gpu:nvidia_a40:2(S:0-1)

In my batch script, I request 4 GPUs and let Slurm decide how many nodes to allocate automatically. I also tell it I want 1 task per node.

$ cat rig_batch.sh
#!/usr/bin/env bash
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1-9
#SBATCH --gpus=4
#SBATCH --error=/home/corujor/slurm-error.log
#SBATCH --output=/home/corujor/slurm-output.log
bash -c 'echo $(hostname):SLURM_JOBID=${SLURM_JOBID}:SLURM_PROCID=${SLURM_PROCID}:CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'

I submit my batch script on slurm-22.05.2.

$ sbatch rig_batch.sh
Submitted batch job 7

I get the expected results. That is, since each compute node has 2 GPUs and I requested 4 GPUs, Slurm allocated 2 nodes, and 1 task per node.

$ cat slurm-output.log
o186i126:SLURM_JOBID=7:SLURM_PROCID=0:CUDA_VISIBLE_DEVICES=0,1
o186i127:SLURM_JOBID=7:SLURM_PROCID=1:CUDA_VISIBLE_DEVICES=0,1

However, when I try to submit the same batch script on slurm-22.05.7, it fails.

$ sbatch rig_batch.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available

Here is my configuration.

$ scontrol show config
Configuration data as of 2023-01-12T21:38:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = (null)
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2023-01-12T17:17:11
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = grenoble_test
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = Gres
DefMemPerNode = UNLIMITED
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = ANY
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = /apps/slurm/etc/.slurm.key
JobCredentialPublicCertificate = /apps/slurm/etc/slurm.cert
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = use_interactive_step
LaunchType = launch/slurm
Licenses = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxDBDMsgs = 20008
MaxJobCount = 1
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxNodeCount = 2
MaxStepCount = 4
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDef
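[A diagnostic worth trying until this is pinned down: express the same request with an explicit per-node GPU count. This is an untested sketch, not a confirmed workaround; if 22.05.7 accepts it, the regression is specific to how --gpus=N is translated into a node requirement.]

    #!/usr/bin/env bash
    #SBATCH --ntasks-per-node=1
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=2
    #SBATCH --error=/home/corujor/slurm-error.log
    #SBATCH --output=/home/corujor/slurm-output.log
    bash -c 'echo $(hostname):SLURM_JOBID=${SLURM_JOBID}:SLURM_PROCID=${SLURM_PROCID}:CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'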
[slurm-users] Cannot enable Gang scheduling
Hi,

I am trying to enable gang scheduling on a server with a 32-core CPU and 4 GPUs. However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not being preempted after the time slice, which is set to 30 secs. Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The first 2 jobs launched are never preempted. The 3rd job is forever starving (or at least until one of the other 2 ends):

JOBID PARTITION     NAME    USER ST TIME NODES NODELIST(REASON)
  313  asimov01 cpu-only hdaniel PD 0:00     1 (Resources)
  311  asimov01 cpu-only hdaniel  R 1:52     1 asimov
  312  asimov01 cpu-only hdaniel  R 1:49     1 asimov

The same happens with GPU jobs: if I launch 5 jobs requiring one GPU each, the 5th job will never run. Preemption is not working with the specified time slice. I tried several combinations:

SchedulerType=sched/builtin and backfill
SelectType=select/cons_tres and linear

I'll appreciate any help and suggestions. The slurm.conf is below. Thanks

ClusterName=asimov
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc # proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none # task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
#FastSchedule=1 #obsolete
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core #CR_Core_Memory
# let's only one job run at a time
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30 #in seconds, default 30
#
# LOGGING AND ACCOUNTING
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageEnforce=associations
#ClusterName=bip-cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
# Partitions
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
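[One thing that stands out in that slurm.conf: per the Slurm gang scheduling guide, time-slicing only happens in partitions configured with OverSubscribe=FORCE, and the asimov01 partition doesn't set it. A sketch of the relevant lines; the :2 share count is my choice and caps time-slicing at 2 jobs per resource, so pick a value matching how many jobs should share.]

    # Gang time-slicing requires the partition to oversubscribe resources.
    PreemptMode=SUSPEND,GANG
    SchedulerTimeSlice=30
    PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 OverSubscribe=FORCE:2 State=UP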
Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode
Hello Cristóbal,

I think you might have a slight misunderstanding of how Slurm works, which can cause this difference in expectation. MaxMemPerNode is there to allow the scheduler to plan job placement according to resources. It does not enforce limits during job execution; it only governs placement, on the assumption that the job will not use more than the resources it requested.

One option to limit the job during execution is through cgroups; another might be JobAcctGatherParams=OverMemoryKill, but I would suspect cgroups would indeed be the better option for your use case. From the slurm.conf man page:

  Kill processes that are being detected to use more memory than requested by steps every time accounting information is gathered by the JobAcctGather plugin. This parameter should be used with caution because a job exceeding its memory allocation may affect other processes and/or machine health.

  NOTE: If available, it is recommended to limit memory by enabling task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the cgroup.conf instead of using this JobAcctGather mechanism for memory enforcement. Using JobAcctGather is polling based and there is a delay before a job is killed, which could lead to system Out of Memory events.

  NOTE: When using OverMemoryKill, if the combined memory used by all the processes in a step exceeds the memory limit, the entire step will be killed/cancelled by the JobAcctGather plugin. This differs from the behavior when using ConstrainRAMSpace, where processes in the step will be killed, but the step will be left active, possibly with other processes left running.

On 12/01/2023 03:47:53, Cristóbal Navarro wrote:
> Hi Slurm community,
> Recently we found a small problem triggered by one of our jobs. We have a
> MaxMemPerNode=532000 setting in our compute node in slurm.conf, however
> we found out that a job that started with mem=65536 was able, after hours
> of execution, to grow its memory usage up to ~650GB. We expected that
> MaxMemPerNode would stop any job exceeding the limit of 532000; did we
> miss something in the slurm.conf file? We were trying to avoid going into
> setting QOS for each group of users. Any help is welcome.
>
> Here is the node definition in the conf file
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> And here is the full slurm.conf file
>
> # node health check
> HealthCheckProgram=/usr/sbin/nhc
> HealthCheckInterval=300
>
> ## Timeouts
> SlurmctldTimeout=600
> SlurmdTimeout=600
>
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
>
> ## We don't want a node to go back in pool without sys admin acknowledgement
> ReturnToService=0
>
> ## Basic scheduling
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> ## Accounting
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreJobComment=YES
> AccountingStorageHost=10.10.0.1
> AccountingStorageEnforce=limits
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## scripts
> Epilog=/etc/slurm/epilog
> Prolog=/etc/slurm/prolog
> PrologFlags=Alloc
>
> ## MPI
> MpiDefault=pmi2
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556
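[Since the slurm.conf above already sets TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup, the missing piece may simply be the cgroup.conf setting the man page refers to. A minimal sketch; ConstrainSwapSpace is an optional extra, my assumption of a sensible default:]

    # cgroup.conf -- enforce the memory each step actually requested
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

[A quick way to verify enforcement is a throwaway job that deliberately overshoots its request; a hypothetical test, adjust sizes to taste:]

    $ sbatch --mem=1G --wrap "python3 -c 'x = bytearray(2 * 1024**3)'"

[With ConstrainRAMSpace=yes the 2 GiB allocation should be killed by the cgroup limit instead of the job quietly growing past its 1 GB request.]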