Hi, I just upgraded from 20.11 to 21.08.2.
Now it seems that slurmd cannot handle my custom GRES. I have defined the VRAM of the GPUs as a custom GRES so that users can select a GPU with enough VRAM for their jobs.
I defined the VRAM in gres.conf:
NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly
NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly
NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly
NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly
and in slurm.conf:
AccountingStorageTRES=gres/gpu,gres/gpu:p6000,gres/gpu:titan,gres/VRAM,gres/gpu:rtx_5000,gres/gpu:rtx_6000,gres/gpu:rtx_8000,gres/gpu:rtx_a6000
GresTypes=gpu,VRAM
NodeName=node1 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=230000 Weight=30 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000 Gres=gpu:p6000:4,VRAM:no_consume:24G
NodeName=node2 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=20 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan Gres=gpu:titan:7,VRAM:no_consume:12G
NodeName=node3 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=21 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan Gres=gpu:titan:8,VRAM:no_consume:12G
NodeName=node4 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=22 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan Gres=gpu:titan:8,VRAM:no_consume:12G
NodeName=node5 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=23 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan Gres=gpu:titan:8,VRAM:no_consume:12G
NodeName=node6 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=24 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan Gres=gpu:titan:8,VRAM:no_consume:12G
NodeName=node7 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000 Weight=31 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000 Gres=gpu:p6000:8,VRAM:no_consume:24G
NodeName=node8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000 Weight=40 Feature=CPU_GEN:SKYL,CPU_SKU=GOLD-61,rtx_5000 Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
NodeName=node9 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000 Weight=50 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_6000 Gres=gpu:rtx_6000:9,VRAM:no_consume:24G
NodeName=node10 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000 Weight=41 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_5000 Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
NodeName=node11 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=60 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000 Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
NodeName=node12 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=61 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000 Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
NodeName=node13 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=62 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000 Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=63 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000 Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G
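To make the intended per-node VRAM values explicit, here is a minimal Python sketch that expands the gres.conf entries into a node-to-VRAM mapping. It is only an illustration, not how Slurm itself parses the file: the helper names (parse_count, expand_hostlist, vram_per_node) are hypothetical, the hostlist handling covers only the simple bracket forms used here, and it assumes the K/M/G/T suffixes multiply by powers of 1024.

```python
import re

# The four VRAM lines from gres.conf, one per line.
GRES_CONF = """\
NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly
NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly
NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly
NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly
"""

# Assumption: suffixes are binary multipliers (powers of 1024).
SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a count such as '24G' into a plain integer."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unsupported count: {text!r}")
    number, suffix = match.groups()
    return int(number) * SUFFIX.get(suffix, 1)


def expand_hostlist(expr):
    """Expand simple expressions like 'node[1,7,9]' or 'node[2-6]'."""
    match = re.fullmatch(r"(\w+?)\[([\d,\-]+)\]", expr)
    if match is None:
        return [expr]  # plain host name without brackets
    prefix, body = match.groups()
    names = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        for index in range(int(low), int(high or low) + 1):
            names.append(f"{prefix}{index}")
    return names


def vram_per_node(conf_text):
    """Map each node name to its configured VRAM count."""
    result = {}
    for line in conf_text.splitlines():
        fields = dict(token.split("=", 1) for token in line.split())
        if fields.get("Name") != "VRAM":
            continue
        for node in expand_hostlist(fields["NodeName"]):
            result[node] = parse_count(fields["Count"])
    return result


vram = vram_per_node(GRES_CONF)
print(vram["node12"] // 1024**3)  # 48
```

The resulting values match the VRAM:no_consume:<size> entries on the NodeName lines in slurm.conf, so the two files agree with each other.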
If I run a job specifying only --gpus=1, it gets executed on node2. If I add --gres=VRAM:32G, it gets scheduled to node12, but is then terminated with "Invalid generic resource (gres) specification".
So, as I understand it, the scheduler knows about gres/VRAM, but slurmd does not.
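The placement on node12 is consistent with the configuration. As a rough illustration, the following Python sketch treats VRAM as a simple threshold (it is declared no_consume, so nothing is subtracted) and lists the nodes that would qualify for a given request. This is a simplified model, not Slurm's actual selection logic; the function name eligible_nodes is hypothetical, and the table is copied from the Gres= entries in slurm.conf.

```python
# VRAM per node in GiB, taken from the VRAM:no_consume:<size> entries.
NODE_VRAM_GIB = {
    "node1": 24, "node2": 12, "node3": 12, "node4": 12, "node5": 12,
    "node6": 12, "node7": 24, "node8": 16, "node9": 24, "node10": 16,
    "node11": 48, "node12": 48, "node13": 48, "node14": 48,
}


def eligible_nodes(request_gib, node_vram=NODE_VRAM_GIB):
    """Return the nodes whose VRAM is at least request_gib, in node order."""
    matching = [name for name, gib in node_vram.items() if gib >= request_gib]
    # Sort by the numeric part so node10 comes after node9.
    return sorted(matching, key=lambda name: int(name[len("node"):]))


print(eligible_nodes(32))  # ['node11', 'node12', 'node13', 'node14']
```

Under this model only node11 to node14 can satisfy --gres=VRAM:32G, which fits the scheduler having picked node12 before the step creation failed.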
Was there any change to this, and how can I get the old behaviour back?

Thanks in advance
Quirin Lohr
srun: defined options
srun: -------------------- --------------------
srun: gpus                : 1
srun: gres                : gres:VRAM:32G
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: Waiting for nodes to boot (delay looping 4650 times @ 0.100000 secs x index)
srun: Nodes node12 are ready for job
srun: jobid 571261: nodes(1):`node12', cpu counts: 1(x1)
srun: error: Unable to create step for job 571261: Invalid generic resource (gres) specification
sacctmgr show tres:
Type     Name            ID
-------- --------------- ------
cpu                      1
mem                      2
energy                   3
node                     4
billing                  5
fs       disk            6
vmem                     7
pages                    8
gres     gpu             1001
gres     gpu:p6000       1002
gres     gpu:titanxp     1003
gres     vram            1004
gres     gpu:titanxpasc+ 1005
gres     cudacores       1006
gres     gpu:rtx5000     1007
gres     gpu:rtx6000     1008
gres     mps             1009
gres     mps:rtx5000     1010
gres     mps:rtx6000     1011
gres     gpu:rtx8000     1012
gres     gpu:titan       1013
gres     gpu:rtx_5000    1014
gres     gpu:rtx_6000    1015
gres     gpu:rtx_8000    1016
gres     gpu:rtx_a6000   1017
--
Quirin Lohr
Systemadministration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
Boltzmannstrasse 3
85748 Garching
Tel. +49 89 289 17769
Fax +49 89 289 17757
quirin.l...@in.tum.de
www.vision.in.tum.de