Hi everyone, we have recently enabled sharding to allow a GPU to be shared by multiple jobs. According to the Slurm documentation: "once a GPU has been allocated as a gres/gpu resource it will not be available as a gres/shard (and vice versa)."
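As background, here is a tiny plain-Python sketch (the parse_tres helper is my own, purely illustrative) that parses a node's AllocTRES string of the kind reported by scontrol and flags exactly the coexistence described below — on a single-GPU node this combination should, per the documentation, never occur:

```python
def parse_tres(tres):
    """Parse a Slurm TRES string like 'cpu=40,gres/gpu=1' into a dict."""
    out = {}
    for item in tres.split(","):
        key, _, val = item.partition("=")
        out[key] = val
    return out

# AllocTRES of the affected node, copied from the scontrol output below:
alloc = parse_tres(
    "cpu=40,mem=154880M,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1,gres/shard=1"
)

# Both a whole-GPU allocation and a shard allocation on the same node:
conflict = "gres/gpu" in alloc and "gres/shard" in alloc
print("gpu/shard coexistence:", conflict)
```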
However, we have encountered a situation where, on nodes with a single GPU, a job allocating gres/shard and a job allocating gres/gpu were running simultaneously. Has anyone encountered the same case?

An example of this gres/gpu and gres/shard coexistence can be seen in the following:

$ squeue -w s-sc-gpu017
    JOBID PARTITION       NAME USER STAT       TIME  TIME_LIMI NODES NODELIST(REASON) CPUS MIN_MEMORY
  1288220       gpu spawner-ju  xx1 RUNN 1-20:49:59 2-00:00:00     1 s-sc-gpu017        32       120G
  1291298       gpu interactiv  xx2 RUNN      13:40    8:00:00     1 s-sc-gpu017         8     32000M

$ scontrol show job 1288220 | grep TRES
    ReqTRES=cpu=32,mem=120G,node=1,billing=62,gres/shard=1
    AllocTRES=cpu=32,mem=120G,node=1,billing=62,gres/shard=1

$ scontrol show job 1291298 | grep TRES
    ReqTRES=cpu=1,mem=32000M,node=1,billing=136,gres/gpu=1
    AllocTRES=cpu=8,mem=32000M,node=1,billing=143,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1

The node status likewise shows both gres/gpu and gres/shard among the allocated TRES:

$ scontrol show node s-sc-gpu017 | grep TRES
    CfgTRES=cpu=128,mem=500000M,billing=378,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1,gres/shard=4
    AllocTRES=cpu=40,mem=154880M,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1,gres/shard=1

We are running Slurm 23.02.4 on Rocky Linux 8.5, and the shard-related configuration in slurm.conf is as follows:

GresTypes=gpu,shard,gpu/gfx90a,gpu/nvidia_a100-pcie-40gb,gpu/nvidia_a100-sxm4-40gb,gpu/nvidia_a100-sxm4-80gb,gpu/nvidia_a100_80gb_pcie
AccountingStorageTRES=gres/gpu,gres/shard,gres/gpu:gfx90a,gres/gpu:nvidia_a100-pcie-40gb,gres/gpu:nvidia_a100-sxm4-40gb,gres/gpu:nvidia_a100-sxm4-80gb,gres/gpu:nvidia_a100_80gb_pcie
NodeName=s-sc-gpu003 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu017 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu018 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu019 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu021 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1

Kind Regards,
Andreas

-----------
Dr. Andreas Reppas
Geschäftsbereich IT | Scientific Computing
Charité – Universitätsmedizin Berlin
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
andreas.rep...@charite.de
https://www.charite.de