Hi Quirin, maybe you are hitting this gres issue:

https://bugs.schedmd.com/show_bug.cgi?id=12642#c27
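
If it is not that bug, it may help to check whether the slurmd on the node parses gres/VRAM from its local gres.conf at all. A minimal check, assuming slurmd and scontrol are in the PATH on the node (node12 is just taken from your log):

  # print the GRES configuration as slurmd parses it on this node, then exit
  slurmd -G

  # compare with what the controller believes the node offers
  scontrol show node node12 | grep -i gres

If slurmd -G does not list gres/VRAM, fixing gres.conf and restarting slurmd would be the first thing to try; that would also match your observation that the scheduler places the job but the step is rejected.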

--
Bas van der Vlies


> On 17 Oct 2021, at 16:32, Quirin Lohr <quirin.l...@in.tum.de> wrote:
> 
> Hi,
> 
> I just upgraded from 20.11 to 21.08.2.
> 
> Now it seems that slurmd cannot handle my custom GRES.
> I have defined the VRAM size of the GPUs as a custom GRES, so that users can
> select a GPU with enough VRAM for their jobs.
> 
> I defined the VRAM in gres.conf:
> 
>> NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly
>> NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly
>> NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly
>> NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly
> 
> 
> 
> and in slurm.conf:
>> AccountingStorageTRES=gres/gpu,gres/gpu:p6000,gres/gpu:titan,gres/VRAM,gres/gpu:rtx_5000,gres/gpu:rtx_6000,gres/gpu:rtx_8000,gres/gpu:rtx_a6000
>> GresTypes=gpu,VRAM
>> NodeName=node1  CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=230000  Weight=30 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:4,VRAM:no_consume:24G
>> NodeName=node2  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=20 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:7,VRAM:no_consume:12G
>> NodeName=node3  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=21 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node4  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=22 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node5  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=23 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node6  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=24 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node7  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=31 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:8,VRAM:no_consume:24G
>> NodeName=node8  CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=40 Feature=CPU_GEN:SKYL,CPU_SKU=GOLD-61,rtx_5000 Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
>> NodeName=node9  CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=50 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_6000   Gres=gpu:rtx_6000:9,VRAM:no_consume:24G
>> NodeName=node10 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=41 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_5000   Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
>> NodeName=node11 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=60 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
>> NodeName=node12 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=61 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
>> NodeName=node13 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=62 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
>> NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=63 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000  Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G
> 
> 
> If I run a job specifying only --gpus=1, it gets executed on node2; if I add
> --gres=VRAM:32G, it gets scheduled on node12 but then terminated with
> "Invalid generic resource (gres) specification".
> 
> So it seems the scheduler knows about gres/VRAM, but the slurmd on the node
> does not.
> Did this behaviour change in 21.08, and how can I get the old behaviour back?
> 
> Thanks in advance
> Quirin Lohr
> 
>> srun: defined options
>> srun: -------------------- --------------------
>> srun: gpus                : 1
>> srun: gres                : gres:VRAM:32G
>> srun: verbose             : 1
>> srun: -------------------- --------------------
>> srun: end of defined options
>> srun: Waiting for nodes to boot (delay looping 4650 times @ 0.100000 secs x index)
>> srun: Nodes node12 are ready for job
>> srun: jobid 571261: nodes(1):`node12', cpu counts: 1(x1)
>> srun: error: Unable to create step for job 571261: Invalid generic resource (gres) specification
> 
> 
> 
> 
> sacctmgr show tres:
>>    Type            Name     ID
>> -------- --------------- ------
>>     cpu                      1
>>     mem                      2
>>  energy                      3
>>    node                      4
>> billing                      5
>>      fs            disk      6
>>    vmem                      7
>>   pages                      8
>>    gres             gpu   1001
>>    gres       gpu:p6000   1002
>>    gres     gpu:titanxp   1003
>>    gres            vram   1004
>>    gres gpu:titanxpasc+   1005
>>    gres       cudacores   1006
>>    gres     gpu:rtx5000   1007
>>    gres     gpu:rtx6000   1008
>>    gres             mps   1009
>>    gres     mps:rtx5000   1010
>>    gres     mps:rtx6000   1011
>>    gres     gpu:rtx8000   1012
>>    gres       gpu:titan   1013
>>    gres    gpu:rtx_5000   1014
>>    gres    gpu:rtx_6000   1015
>>    gres    gpu:rtx_8000   1016
>>    gres   gpu:rtx_a6000   1017
> 
> 
> 
> -- 
> Quirin Lohr
> System Administration
> Technische Universität München
> Fakultät für Informatik
> Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
> 
> Boltzmannstrasse 3
> 85748 Garching
> 
> Tel. +49 89 289 17769
> Fax +49 89 289 17757
> 
> quirin.l...@in.tum.de
> www.vision.in.tum.de
> 
