[slurm-users] Re: File-less NVIDIA GeForce 4070 Ti being removed from GRES list

2024-04-02 Thread Reed Dier via slurm-users
Assuming that you have the CUDA drivers installed correctly (nvidia-smi works, for instance), you should create a gres.conf with just this line: > AutoDetect=nvml If that doesn’t automagically begin working, you can increase the verbosity of slurmd with > SlurmdDebug=debug2 It should then print a
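As a minimal sketch of what that advice translates to (the node name and GPU count here are illustrative assumptions, not details from the thread):

    # gres.conf -- ask slurmd to detect GPUs through the NVIDIA NVML library
    AutoDetect=nvml

    # slurm.conf -- the GRES still has to be declared for the node; raise
    # slurmd logging to debug2 while verifying that autodetection finds the card
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:1
    SlurmdDebug=debug2

With debug2 set, the slurmd log should list the devices NVML reports when the daemon starts, which is the quickest way to confirm the card is being picked up.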

[slurm-users] Re: GPU shards not exclusive

2024-02-29 Thread Reed Dier via slurm-users
Hi Will, I appreciate your corroboration. After we upgraded to 23.02.$latest, the issue seemed easier to reproduce than before. However, it appears to have subsided, and the only change I can potentially attribute that to is turning on > SlurmctldParameters=rl_enable in
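For reference, a hedged sketch of the change being described (the surrounding slurm.conf content is assumed, not quoted from the thread):

    # slurm.conf -- enable slurmctld's client RPC rate limiting (added in 23.02)
    SlurmctldParameters=rl_enable

A reconfigure or restart of slurmctld is needed for the parameter to take effect.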

[slurm-users] GPU shards not exclusive

2024-02-14 Thread Reed Dier via slurm-users
I seem to have run into an edge case where I’m able to oversubscribe a specific subset of GPUs on one host in particular. Environment: Slurm 22.05.8, Ubuntu 20.04, cgroups v1 (ProctrackType=proctrack/cgroup). It seems to be partly a corner case with a couple of caveats. This host has 2 different GPU types in
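The message is cut off in the archive, but a node like the one described would be configured roughly as in this sketch (GPU type names, counts, and device paths are illustrative assumptions, not details from the thread):

    # gres.conf on the mixed-GPU node
    Name=gpu Type=typeA File=/dev/nvidia0
    Name=gpu Type=typeB File=/dev/nvidia1
    # 16 shards total, spread evenly across the GPUs (8 per device)
    Name=shard Count=16

    # matching slurm.conf node definition
    GresTypes=gpu,shard
    NodeName=gpunode02 Gres=gpu:typeA:1,gpu:typeB:1,shard:16

Jobs then request slices with --gres=shard:N (or whole devices with --gres=gpu:...), and the scheduler is expected to keep the shards in use on each device at or below its configured count; the oversubscription described above would be a violation of that expectation.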

[slurm-users] sinfo gresused shard count wrong/incomplete

2024-02-07 Thread Reed Dier via slurm-users
I have a bash script that grabs current statistics from sinfo to ship into a time-series database that feeds Grafana dashboards. We recently began using shards with our GPUs, and I’m seeing some unexpected behavior in the data reported by sinfo. > $ sinfo -h -O "NodeHost:5 ,GresUsed:100
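A hedged sketch of the kind of collection being described (the field widths, the awk parsing, and the output shape are my assumptions, not the poster's actual script):

    #!/usr/bin/env bash
    # Pull per-node GRES usage from sinfo and print "host shard-field" pairs,
    # e.g. for shipping into a time-series database behind Grafana.
    sinfo -h -N -O "NodeHost:25,GresUsed:100" |
      awk '{
        host = $1
        # keep the first shard entry of the GresUsed column, if present
        if (match($0, /shard:[^ ,]*/))
          print host, substr($0, RSTART, RLENGTH)
      }'

If the shard counts printed this way do not match what is actually allocated on the node, that is the kind of wrong/incomplete reporting the subject line refers to.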