[slurm-users] sinfo gresused shard count wrong/incomplete

Reed Dier via slurm-users Wed, 07 Feb 2024 09:20:55 -0800

I have a bash script that grabs current statistics from sinfo to ship into a 
time series database to use for Grafana dashboards.


We recently began using shards with our GPUs, and I’m seeing some unexpected 
behavior in the data reported by sinfo.

> $ sinfo -h -O "NodeHost:5 ,GresUsed:100 ,Gres:100" | grep gpu03
> gpu03 gpu:p40:0(IDX:N/A),gpu:rtx:0(IDX:N/A),shard:p40:1(IDX:1),shard:rtx:0(IDX:N/A) gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)
> 
> $ scontrol show node gpu03
> NodeName=gpu03 Arch=x86_64 CoresPerSocket=22 
> CPUAlloc=52 CPUEfctv=88 CPUTot=88 CPULoad=43.67
> AvailableFeatures=avx,avx512,largeMemory,matlab 
> ActiveFeatures=avx,avx512,largeMemory,matlab
> Gres=gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)
> NodeAddr=gpu03 NodeHostName=gpu03 Version=22.05.8 
> OS=Linux 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 
> RealMemory=768000 AllocMem=425984 FreeMem=116850 Sockets=2 Boards=1
> State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=gpu
> BootTime=2024-02-02T14:17:07 SlurmdStartTime=2024-02-05T12:33:50
> LastBusyTime=2024-02-07T09:48:50
> CfgTRES=cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48
> AllocTRES=cpu=52,mem=416G,gres/shard=21
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Specifically, we can see there are 48 total shards on this node: 36 of type p40 
and 12 of type rtx.
From scontrol, I can see that 21 shards are in use, and from nvidia-smi I can 
see that the breakdown of shards per GPU index is:

[0]p40:4/6
[1]p40:6/6
[2]rtx:4/6
[3]rtx:3/6
[4]p40:4/6
[5]p40:0/6
[6]p40:0/6
[7]p40:0/6
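
As a quick arithmetic check that this breakdown adds up to the 21 shards in AllocTRES, here is a minimal sketch that sums the used counts per GPU type. The function name sum_shards_by_type is made up for this example, and it assumes lines shaped exactly like the list above (index, type, used/capacity).

```bash
# Hypothetical helper: sum "used" shard counts per GPU type from lines
# shaped like "[0]p40:4/6" (index, type, used/capacity), read on stdin.
sum_shards_by_type() {
    # Keep only well-formed lines and reduce them to "<type> <used>".
    sed -n 's|^\[[0-9]*]\([^:]*\):\([0-9]*\)/[0-9]*$|\1 \2|p' |
    awk '{ used[$1] += $2; total += $2 }
         END {
             for (t in used) print t "=" used[t]
             print "total=" (total + 0)
         }' |
    sort    # the awk for-in order is unspecified; sort for stable output
}

sum_shards_by_type <<'EOF'
[0]p40:4/6
[1]p40:6/6
[2]rtx:4/6
[3]rtx:3/6
[4]p40:4/6
[5]p40:0/6
[6]p40:0/6
[7]p40:0/6
EOF
# prints:
# p40=14
# rtx=7
# total=21
```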

So, I can tell that what “GresUsed” reports is the number of GPUs whose shards 
are entirely consumed, along with the index of each such GPU.
However, what it doesn’t show is the number of gres/shard actually in use; put 
another way, I can’t derive from it that there are 21 shards in use.
Right now my script says that 1 shard out of 48 is used, because I’m scraping 
for the number of shards in use, but the value reported is the number of entire 
GPUs consumed by shards, and anything less than 100% of a GPU is rounded down 
to 0.
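
To make the mismatch concrete, here is a minimal sketch of this kind of scrape. It is not the actual script: the function name sum_gres_used is made up for this example, and it assumes entries of the form name:type:count(...) or name:count(...), as in the sinfo output above.

```bash
# Hypothetical helper: sum the counts for one GRES name in a GresUsed
# (or Gres) string such as "shard:p40:1(IDX:1),shard:rtx:0(IDX:N/A)".
sum_gres_used() {
    # $1 = GresUsed or Gres string, $2 = GRES name (e.g. "shard")
    printf '%s\n' "$1" | tr ',' '\n' |
    awk -F: -v name="$2" '
        {
            sub(/\(.*/, "")   # drop "(IDX:...)" or "(S:...)"; fields are re-split
            if ($1 == name && NF >= 2) total += $NF
        }
        END { print total + 0 }'
}

gres_used='gpu:p40:0(IDX:N/A),gpu:rtx:0(IDX:N/A),shard:p40:1(IDX:1),shard:rtx:0(IDX:N/A)'
gres='gpu:p40:6(S:0),gpu:rtx:2(S:0),shard:p40:36(S:0),shard:rtx:12(S:0)'

echo "shards used: $(sum_gres_used "$gres_used" shard) of $(sum_gres_used "$gres" shard)"
# prints: shards used: 1 of 48
```

Summing the shard entries this way yields 1 of 48, while AllocTRES for the same node reports gres/shard=21.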

So, given that I’m on 22.05.8, maybe it’s better in 23.02.X, which I’m hoping 
to move to within the next month.
However, I can’t seem to find anything in the release notes for 23.02 or 23.11 
that would imply that sinfo reports (my) expected value of the actual count of 
shards used.

Does anyone have any ideas for how I might be able to achieve what I’m looking 
for using sinfo, or should I instead try to use the sinfo JSON output, which 
has a “tres_used” field that doesn’t appear to be accessible outside of the 
JSON output?

>     {
>       "architecture": "x86_64",
>       "burstbuffer_network_address": "",
>       "boards": 1,
>       "boot_time": 1706901428,
>       "comment": "",
>       "cores": 22,
>       "cpu_binding": 0,
>       [SNIP]
>       "tres": "cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48",
>       "slurmd_version": "22.05.8",
>       "alloc_memory": 284380,
>       "alloc_cpus": 62,
>       "idle_cpus": 26,
>       "tres_used": "cpu=62,mem=284380M,gres/gpu=7,gres/gpu:p40=5,gres/gpu:rtx=2,gres/shard=17",
>       "tres_weighted": 62
>     },
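
If the JSON route is the way to go, pulling one value out of a TRES string is simple enough. Below is a minimal sketch; the function name tres_value is made up for this example, and how the string gets extracted from the JSON is left out. The same comma-separated key=value format shows up in AllocTRES and CfgTRES in the scontrol output, so it would apply there as well.

```bash
# Hypothetical helper: print the value for one key in a TRES string such
# as "cpu=62,mem=284380M,gres/shard=17". Prints 0 if the key is absent.
tres_value() {
    # $1 = TRES string, $2 = key (e.g. "gres/shard")
    printf '%s\n' "$1" | tr ',' '\n' |
    awk -F= -v key="$2" '
        $1 == key { print $2; found = 1; exit }
        END { if (!found) print 0 }'
}

tres='cpu=88,mem=750G,billing=88,gres/gpu=8,gres/gpu:p40=6,gres/gpu:rtx=2,gres/shard=48'
tres_used='cpu=62,mem=284380M,gres/gpu=7,gres/gpu:p40=5,gres/gpu:rtx=2,gres/shard=17'

echo "shards used: $(tres_value "$tres_used" gres/shard) of $(tres_value "$tres" gres/shard)"
# prints: shards used: 17 of 48
```

Values are returned verbatim, so a key like mem comes back with its unit suffix (for example 284380M) and would need its own handling.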
 
Any ideas appreciated.
Thanks,
Reed


