Thanks for all your help, Kevin. I really did miss the OverSubscribe option in the docs :-( But CPU job scheduling is working now, and I have a clearer picture of the GPU job scheduling problem to dig into further :-)
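For anyone finding this thread later, here is a minimal sketch of the slurm.conf lines that matter for the CPU case, pieced together from the configuration quoted in this thread. It shows only the scheduling-related settings and assumes the rest of the file stays as it was. It does not make GPU jobs time-slice, which is still the open problem.

```conf
# Gang scheduling with time slicing (values taken from this thread's slurm.conf)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30   # seconds

# Node and partition; OverSubscribe=FORCE on the partition was the missing piece
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
```

Once slurmctld has picked up the change, "scontrol show partition asimov01" should no longer report OverSubscribe=NO.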
On Fri, 13 Jan 2023 at 13:01, Kevin Broch <kbr...@rivosinc.com> wrote:

> Sorry to hear that. Hopefully others in the group have some
> ideas/explanations. I haven't had to deal with GPU resources in Slurm.
>
> On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel <hdan...@ualg.pt> wrote:
>
>> Oh, ok.
>> I guess I was expecting the GPU job to be suspended by copying GPU
>> memory to RAM.
>>
>> I also tried: REQUEUE,GANG and CANCEL,GANG.
>>
>> None of these options seems to be able to preempt GPU jobs.
>>
>> On Fri, 13 Jan 2023 at 12:30, Kevin Broch <kbr...@rivosinc.com> wrote:
>>
>>> My guess is that this isn't possible with GANG,SUSPEND. GPU memory
>>> isn't managed by Slurm, so the idea of suspending GPU memory for
>>> another job to use the rest simply isn't possible.
>>>
>>> On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <hdan...@ualg.pt> wrote:
>>>
>>>> Hi Kevin
>>>>
>>>> I did a "scontrol show partition".
>>>> OverSubscribe was not enabled.
>>>> I enabled it in slurm.conf with:
>>>>
>>>> (...)
>>>> GresTypes=gpu
>>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>>> PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>>
>>>> but now it is working only with CPU jobs. It does not preempt GPU jobs.
>>>> Launching 3 CPU-only jobs, each requiring 32 out of 64 cores, it
>>>> preempts after the time slice as expected:
>>>>
>>>> sbatch --cpus-per-task=32 test-cpu.sh
>>>>
>>>> JOBID PARTITION     NAME    USER ST  TIME NODES NODELIST(REASON)
>>>>   352  asimov01 cpu-only hdaniel  R  0:58     1 asimov
>>>>   353  asimov01 cpu-only hdaniel  R  0:25     1 asimov
>>>>   351  asimov01 cpu-only hdaniel  S  0:36     1 asimov
>>>>
>>>> But launching 3 GPU jobs, each requiring 2 out of 4 GPUs, it does not
>>>> preempt the first 2 that start running.
>>>> It reports the 3rd job as pending on resources:
>>>>
>>>> JOBID PARTITION     NAME    USER ST  TIME NODES NODELIST(REASON)
>>>>   356  asimov01      gpu hdaniel PD  0:00     1 (Resources)
>>>>   354  asimov01      gpu hdaniel  R  3:05     1 asimov
>>>>   355  asimov01      gpu hdaniel  R  3:02     1 asimov
>>>>
>>>> Do I need to change anything else in the configuration to also
>>>> support GPU gang scheduling?
>>>> Thanks
>>>>
>>>> ============================================================================
>>>> scontrol show partition asimov01
>>>> PartitionName=asimov01
>>>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>>>    AllocNodes=ALL Default=YES QoS=N/A
>>>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>>>    MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>>>    Nodes=asimov
>>>>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>>>    OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
>>>>    State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
>>>>    JobDefaults=DefCpuPerGPU=2
>>>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>>>
>>>> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbr...@rivosinc.com> wrote:
>>>>
>>>>> Problem might be that OverSubscribe is not enabled? Without it, I
>>>>> don't believe the time-slicing can be GANG scheduled.
>>>>>
>>>>> Can you do a "scontrol show partition" to verify that it is?
>>>>>
>>>>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdan...@ualg.pt> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to enable gang scheduling on a server with a CPU with 32
>>>>>> cores and 4 GPUs.
>>>>>>
>>>>>> However, using gang scheduling, the CPU jobs (or GPU jobs) are not
>>>>>> being preempted after the time slice, which is set to 30 secs.
>>>>>>
>>>>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32
>>>>>> cores. The first 2 jobs launched are never preempted. The 3rd job
>>>>>> starves forever (or at least until one of the other 2 ends):
>>>>>>
>>>>>> JOBID PARTITION     NAME    USER ST  TIME NODES NODELIST(REASON)
>>>>>>   313  asimov01 cpu-only hdaniel PD  0:00     1 (Resources)
>>>>>>   311  asimov01 cpu-only hdaniel  R  1:52     1 asimov
>>>>>>   312  asimov01 cpu-only hdaniel  R  1:49     1 asimov
>>>>>>
>>>>>> The same happens with GPU jobs. If I launch 5 jobs, requiring one GPU
>>>>>> each, the 5th job will never run. The preemption is not working with
>>>>>> the specified time slice.
>>>>>>
>>>>>> I tried several combinations:
>>>>>>
>>>>>> SchedulerType=sched/builtin and backfill
>>>>>> SelectType=select/cons_tres and linear
>>>>>>
>>>>>> I'd appreciate any help and suggestions.
>>>>>> The slurm.conf is below.
>>>>>> Thanks
>>>>>>
>>>>>> ClusterName=asimov
>>>>>> SlurmctldHost=localhost
>>>>>> MpiDefault=none
>>>>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>>>>> ReturnToService=2
>>>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>>>> SlurmctldPort=6817
>>>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>>>> SlurmdPort=6818
>>>>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>>>>> SlurmUser=slurm
>>>>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>>>>> SwitchType=switch/none
>>>>>> TaskPlugin=task/none # task/cgroup
>>>>>> #
>>>>>> # TIMERS
>>>>>> InactiveLimit=0
>>>>>> KillWait=30
>>>>>> MinJobAge=300
>>>>>> SlurmctldTimeout=120
>>>>>> SlurmdTimeout=300
>>>>>> Waittime=0
>>>>>> #
>>>>>> # SCHEDULING
>>>>>> #FastSchedule=1 #obsolete
>>>>>> SchedulerType=sched/builtin #backfill
>>>>>> SelectType=select/cons_tres
>>>>>> SelectTypeParameters=CR_Core #CR_Core_Memory lets only one job run at a time
>>>>>> PreemptType = preempt/partition_prio
>>>>>> PreemptMode = SUSPEND,GANG
>>>>>> SchedulerTimeSlice=30 #in seconds, default 30
>>>>>> #
>>>>>> # LOGGING AND ACCOUNTING
>>>>>> #AccountingStoragePort=
>>>>>> AccountingStorageType=accounting_storage/none
>>>>>> #AccountingStorageEnforce=associations
>>>>>> #ClusterName=bip-cluster
>>>>>> JobAcctGatherFrequency=30
>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>> SlurmctldDebug=info
>>>>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>>>> SlurmdDebug=info
>>>>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>>>> #
>>>>>> #
>>>>>> # COMPUTE NODES
>>>>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>>>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>>>>
>>>>>> # Partitions
>>>>>> GresTypes=gpu
>>>>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>>>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>>
>>>> --
>>>> With best regards,
>>>>
>>>> Helder Daniel
>>>> Universidade do Algarve
>>>> Faculdade de Ciências e Tecnologia
>>>> Departamento de Engenharia Electrónica e Informática
>>>> https://www.ualg.pt/pt/users/hdaniel