[slurm-users] sinfo gresused shard count wrong/incomplete

2024-02-07 Thread Reed Dier via slurm-users
I have a bash script that grabs current statistics from sinfo to ship into a time series database to use for Grafana dashboards. We recently began using shards with our GPUs, and I’m seeing some unexpected behavior with the data reported from sinfo. > $ sinfo -h -O "NodeHost:5 ,GresUsed:100

[slurm-users] Errors upgrading to 23.11.0

2024-02-07 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I got this error: Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code. See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details. but in slurmctld.service I see

[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Jeffrey T Frey via slurm-users
The native job_container/tmpfs would certainly have access to the job record, so modification to it (or a forked variant) would be possible. A SPANK plugin should be able to fetch the full job record [1] and is then able to inspect the "gres" list (as a C string), which means I could modify

[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Tim Schneider via slurm-users
Hey Jeffrey, thanks for this suggestion! This is probably the way to go if one can find a way to access GRES in the prolog. I read somewhere that people were calling scontrol to get this information, but this seems a bit unclean. Anyway, if I find some time I will try it out. Best, Tim On
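
The scontrol approach mentioned in this message could look roughly like the sketch below. It is only an illustration, not something taken from the thread: it assumes the prolog can run `scontrol -o show job <id>` and that the one-line output is a series of whitespace-separated Key=Value fields. The helper names, the `TresPerNode` field in the sample, and the `tmp` GRES name are hypothetical and may not match a given site's configuration or Slurm version.

```python
import subprocess


def parse_scontrol_record(text):
    """Split one-line `scontrol -o show job` output into a dict.

    Assumes whitespace-separated Key=Value fields. Values that themselves
    contain spaces (for example a job name with blanks) are not handled here.
    """
    fields = {}
    for token in text.split():
        # Split on the first "=" only, so values such as "cpu=1,mem=1G" stay whole.
        key, sep, value = token.partition("=")
        if sep:  # skip fragments that are not Key=Value pairs
            fields[key] = value
    return fields


def job_fields(job_id):
    """Run scontrol for one job and return its fields (needs a Slurm client)."""
    result = subprocess.run(
        ["scontrol", "-o", "show", "job", str(job_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_scontrol_record(result.stdout)


# Hypothetical sample record; real output has many more fields.
sample = "JobId=42 JobName=test TresPerNode=gres/tmp:100G"
record = parse_scontrol_record(sample)
print(record.get("TresPerNode"))
```

In a prolog, `job_fields` would presumably be called with the job ID taken from the environment. Whether the requested GRES actually shows up in a field like the one above should be checked against real scontrol output on the cluster in question.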