Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Bjørn-Helge Mevik
In my opinion, the problem is with autofs, not with tmpfs.  Autofs
simply doesn't work well when you are using detached filesystem namespaces
and bind mounting.  We ran into this problem years ago (with an in-house
SPANK plugin doing more or less what tmpfs does), and ended up simply
not using autofs.

I guess you could try using systemd's auto-mounting features, but I have
no idea if they work better than autofs in situations like this.
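
If you want to try that route, a unit pair along these lines is what I have
in mind (the NFS server and paths are made up for the example):

  # /etc/systemd/system/projects.mount
  [Mount]
  What=nfsserver:/export/projects
  Where=/projects
  Type=nfs

  # /etc/systemd/system/projects.automount
  [Automount]
  Where=/projects

  [Install]
  WantedBy=multi-user.target

enabled with "systemctl enable --now projects.automount".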

We ended up using a system where the prolog script mounts any needed
file systems, and then the healthcheck script unmounts file systems that
are no longer needed.
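
As a rough sketch of the idea (the paths, the NFS server and the "which
filesystems does this job need" logic are placeholders here):

  #!/bin/bash
  # prolog sketch: make sure the filesystems this job needs are mounted
  for fs in /projects /scratch; do
      mountpoint -q "$fs" && continue
      mount -t nfs "nfs-server:$fs" "$fs" || exit 1
  done
  exit 0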

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Ümit Seren
We had the same issue when we switched to the job_container plugin. We ended up
running cvmfs_config probe as part of the health check tool so that the
CVMFS repos stay mounted. However, after we switched on power saving we ran
into some race conditions (a job landed on a node before CVMFS was
mounted). We ended up switching to static mounts for the CVMFS repos on the
compute nodes.
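
For reference, a minimal version of that health-check addition could look like
this (the repository names are just examples):

  # keep the CVMFS repos mounted; fail the health check if a probe fails
  if ! cvmfs_config probe atlas.cern.ch cms.cern.ch >/dev/null 2>&1; then
      echo "ERROR: CVMFS probe failed"
      exit 1
  fi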

Best
Ümit

On Thu, Jan 12, 2023, 09:17 Bjørn-Helge Mevik  wrote:

> In my opinion, the problem is with autofs, not with tmpfs.  Autofs
> simply doesn't work well when you are using detached filesystem namespaces
> and bind mounting.  We ran into this problem years ago (with an in-house
> SPANK plugin doing more or less what tmpfs does), and ended up simply
> not using autofs.
>
> I guess you could try using systemd's auto-mounting features, but I have
> no idea if they work better than autofs in situations like this.
>
> We ended up using a system where the prolog script mounts any needed
> file systems, and then the healthcheck script unmounts file systems that
> are no longer needed.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>


Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Ole Holm Nielsen

Hi Magnus,

We had the same challenge some time ago.  A long description of solutions
is on my wiki page at
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories

The issue may have been solved in
https://bugs.schedmd.com/show_bug.cgi?id=12567, which will be included in
Slurm 23.02.


At this time, the auto_tmpdir SPANK plugin seems to be the best solution.

IHTH,
Ole

On 1/12/23 08:49, Hagdorn, Magnus Karl Moritz wrote:

Hi there,
we excitedly found the job_container/tmpfs plugin which neatly allows
us to provide local scratch space and a way of ensuring that /dev/shm
gets cleaned up after a job finishes. Unfortunately we found that it
does not play nicely with autofs which we use to provide networked
project and scratch directories. We found that this is a known issue
[1]. I was wondering if that has been solved? I think it would be
really useful to have a warning about this issue in the documentation
for the job_container/tmpfs plugin.
Regards
magnus

[1]
https://cernvm-forum.cern.ch/t/intermittent-client-failures-too-many-levels-of-symbolic-links/156/4


--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Gizo Nanava
Hello, 

Another workaround could be to use the InitScript=/path/to/script.sh option of
the plugin.

For example, if the user's home directory is under autofs, script.sh can simply
step into it, which triggers the automount inside the job's namespace:

script.sh:
# look up the job owner, then cd into their home to trigger the autofs mount
user=$(squeue -h -O username -j "$SLURM_JOB_ID" | tr -d ' ')
cd "/home/$user"
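
If it helps, the corresponding job_container.conf might then look roughly like
this (the paths are only examples, and option availability depends on your
Slurm version):

# /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/local/scratch
InitScript=/etc/slurm/container_init.sh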

Best regards 
Gizo

> Hi there,
> we excitedly found the job_container/tmpfs plugin which neatly allows
> us to provide local scratch space and a way of ensuring that /dev/shm
> gets cleaned up after a job finishes. Unfortunately we found that it
> does not play nicely with autofs which we use to provide networked
> project and scratch directories. We found that this is a known issue
> [1]. I was wondering if that has been solved? I think it would be
> really useful to have a warning about this issue in the documentation
> for the job_container/tmpfs plugin.
> Regards
> magnus
> 
> [1]
> https://cernvm-forum.cern.ch/t/intermittent-client-failures-too-many-levels-of-symbolic-links/156/4
> -- 
> Magnus Hagdorn
> Charité – Universitätsmedizin Berlin
> Geschäftsbereich IT | Scientific Computing
>  
> Campus Charité Virchow Klinikum
> Forum 4 | Ebene 02 | Raum 2.020
> Augustenburger Platz 1
> 13353 Berlin
>  
> magnus.hagd...@charite.de
> https://www.charite.de
> HPC Helpdesk: sc-hpc-helpd...@charite.de



-- 
___
Dr. Gizo Nanava
Group Leader, Scientific Computing 
Leibniz Universität IT Services
Leibniz Universität Hannover
Schlosswender Str. 5
D-30159 Hannover
Tel +49 511 762 7919085
http://www.luis.uni-hannover.de



[slurm-users] Regression from slurm-22.05.2 to slurm-22.05.7 when using "--gpus=N" option.

2023-01-12 Thread Rigoberto Corujo
Hello,

I have a small 2-compute-node GPU cluster, where each node has 2 GPUs.


$ sinfo -o "%20N  %10c  %10m  %25f  %30G "

NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
o186i[126-127]        128         64000       (null)                     gpu:nvidia_a40:2(S:0-1)


In my batch script, I request 4 GPUs and let Slurm decide how many nodes to
allocate automatically.  I also tell it I want 1 task per node.


$ cat rig_batch.sh
#!/usr/bin/env bash

#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1-9
#SBATCH --gpus=4
#SBATCH --error=/home/corujor/slurm-error.log
#SBATCH --output=/home/corujor/slurm-output.log

bash -c 'echo $(hostname):SLURM_JOBID=${SLURM_JOBID}:SLURM_PROCID=${SLURM_PROCID}:CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}'



I submit my batch script on slurm-22.05.2:

$ sbatch rig_batch.sh
Submitted batch job 7

I get the expected results.  That is, since each compute node has 2 GPUs and I
requested 4 GPUs, Slurm allocated 2 nodes, and 1 task per node.

$ cat slurm-output.log
o186i126:SLURM_JOBID=7:SLURM_PROCID=0:CUDA_VISIBLE_DEVICES=0,1
o186i127:SLURM_JOBID=7:SLURM_PROCID=1:CUDA_VISIBLE_DEVICES=0,1


However, when I try to submit the same batch script on slurm-22.05.7, it fails:

$ sbatch rig_batch.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available


Here is my configuration:

$ scontrol show config
Configuration data as of 2023-01-12T21:38:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = (null)
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq  = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes    = (null)
AuthAltParameters   = (null)
AuthInfo    = (null)
AuthType    = auth/munge
BatchStartTimeout   = 10 sec
BcastExclude    = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME   = 2023-01-12T17:17:11
BurstBufferType = (null)
CliFilterPlugins    = (null)
ClusterName = grenoble_test
CommunicationParameters = (null)
CompleteWait    = 0 sec
CoreSpecPlugin  = core_spec/none
CpuFreqDef  = Unknown
CpuFreqGovernors    = OnDemand,Performance,UserSpace
CredType    = cred/munge
DebugFlags  = Gres
DefMemPerNode   = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs = Yes
EioTimeout  = 60
EnforcePartLimits   = ANY
Epilog  = (null)
EpilogMsgTime   = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType  = ext_sensors/none
ExtSensorsFreq  = 0 sec
FederationParameters    = (null)
FirstJobId  = 1
GetEnvTimeout   = 2 sec
GresTypes   = gpu
GpuFreqDef  = high,memory=high
GroupUpdateForce    = 1
GroupUpdateTime = 600 sec
HASH_VAL    = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram  = (null)
InactiveLimit   = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType   = jobacct_gather/none
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc  = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType    = job_container/none
JobCredentialPrivateKey = /apps/slurm/etc/.slurm.key
JobCredentialPublicCertificate = /apps/slurm/etc/slurm.cert
JobDefaults = (null)
JobFileAppend   = 0
JobRequeue  = 1
JobSubmitPlugins    = (null)
KillOnBadExit   = 0
KillWait    = 30 sec
LaunchParameters    = use_interactive_step
LaunchType  = launch/slurm
Licenses    = (null)
LogTimeFormat   = iso8601_ms
MailDomain  = (null)
MailProg    = /bin/mail
MaxArraySize    = 1001
MaxDBDMsgs  = 20008
MaxJobCount = 1
MaxJobId    = 67043328
MaxMemPerNode   = UNLIMITED
MaxNodeCount    = 2
MaxStepCount    = 4
MaxTasksPerNode = 512
MCSPlugin   = mcs/none
MCSParameters   = (null)
MessageTimeout  = 10 sec
MinJobAge   = 300 sec
MpiDef

[slurm-users] Cannot enable Gang scheduling

2023-01-12 Thread Helder Daniel
Hi,

I am trying to enable gang scheduling on a server with a 32-core CPU and
4 GPUs.

However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not being
preempted after the time slice, which is set to 30 seconds.

Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The
first 2 jobs launched are never preempted, and the 3rd job starves forever (or
at least until one of the other 2 ends):

 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
   313  asimov01 cpu-only  hdaniel PD   0:00      1 (Resources)
   311  asimov01 cpu-only  hdaniel  R   1:52      1 asimov
   312  asimov01 cpu-only  hdaniel  R   1:49      1 asimov

The same happens with GPU jobs: if I launch 5 jobs requiring one GPU each, the
5th job will never run. Preemption is simply not happening at the specified
time slice.

I tried several combinations:

SchedulerType=sched/builtin   and backfill
SelectType=select/cons_tres   and linear

I'd appreciate any help and suggestions.
The slurm.conf is below.
Thanks

ClusterName=asimov
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc # proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none # task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
#FastSchedule=1 #obsolete
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core  # CR_Core_Memory lets only one job run at a time
PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG
SchedulerTimeSlice=30   #in seconds, default 30
#
# LOGGING AND ACCOUNTING
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageEnforce=associations
#ClusterName=bip-cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Partitions
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP


Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-12 Thread Daniel Letai

  
  
Hello Cristóbal,

I think you might have a slight misunderstanding of how Slurm works, which can
cause this difference in expectation.

MaxMemPerNode is there to allow the scheduler to plan job placement according
to resources. It does not enforce limitations during job execution, only
placement, with the assumption that the job will not use more than the
resources it requested.

One option to limit the job during execution is through cgroups; another might
be setting OverMemoryKill in JobAcctGatherParams, but I suspect cgroups would
indeed be the better option for your use case. From the slurm.conf man page:



  
    Kill processes that are being detected to use more memory than requested
    by steps every time accounting information is gathered by the
    JobAcctGather plugin. This parameter should be used with caution because a
    job exceeding its memory allocation may affect other processes and/or
    machine health.

    NOTE: If available, it is recommended to limit memory by enabling
    task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the
    cgroup.conf instead of using this JobAcctGather mechanism for memory
    enforcement. Using JobAcctGather is polling based and there is a delay
    before a job is killed, which could lead to system Out of Memory events.

    NOTE: When using OverMemoryKill, if the combined memory used by all the
    processes in a step exceeds the memory limit, the entire step will be
    killed/cancelled by the JobAcctGather plugin. This differs from the
    behavior when using ConstrainRAMSpace, where processes in the step will be
    killed, but the step will be left active, possibly with other processes
    left running.
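
As a rough sketch of that route (the values below are only examples; check the
cgroup.conf man page for your version), it would look something like:

# slurm.conf
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# cgroup.conf
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes    # optional: also cap swap usage
AllowedRAMSpace=100       # percent of the job's requested memory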


On 12/01/2023 03:47:53, Cristóbal Navarro wrote:

Hi Slurm community,
Recently we found a small problem triggered by one of our jobs. We have a
MaxMemPerNode=532000 setting for our compute node in the slurm.conf file, but
we found out that a job that started with mem=65536 was able to grow its memory
usage during execution, after hours of running, up to ~650GB. We expected that
MaxMemPerNode would stop any job exceeding the limit of 532000; did we miss
something in the slurm.conf file? We were trying to avoid setting up a QOS for
each group of users.

Any help is welcome.

Here is the node definition in the conf file:

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu

And here is the full slurm.conf file:
# node health check
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

## Timeouts
SlurmctldTimeout=600
SlurmdTimeout=600

GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres

## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0

## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill

## Accounting 
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStorageHost=10.10.0.1
AccountingStorageEnforce=limits

JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc

## MPI
MpiDefault=pmi2

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu

## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556