[slurm-users] Releasing stale allocated TRES
Hi there,

I have a recurring problem with allocated TRES that are not released after all jobs on a node have finished. The TRES remain marked as allocated, and no new jobs can be scheduled on that node using those TRES.

$ scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUTot=256 CPULoad=0.11
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:tesla:8
   NodeAddr=node2 NodeHostName=node2 Version=21.08.5
   OS=Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023
   RealMemory=1025593 AllocMem=0 FreeMem=1025934 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=AMPERE
   BootTime=2023-11-23T09:01:28 SlurmdStartTime=2023-11-23T09:02:09
   LastBusyTime=2023-11-23T09:03:19
   CfgTRES=cpu=256,mem=1025593M,billing=256,gres/gpu=8,gres/gpu:tesla=8
   AllocTRES=gres/gpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Previously the allocation was gone after the server had been turned off for a couple of hours (power conservation), but the issue occurred again and this time it persists even after the server was off overnight.

Is there any way to release the allocation manually?

Regards,
Gerald Schneider

--
Gerald Schneider
Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Tel. +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
gerald.schnei...@igd-r.fraunhofer.de | www.igd.fraunhofer.de
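[Editor's note: before restarting any daemons, it can be worth confirming that nothing is still charged against the node and then cycling its state by hand. This is a hedged sketch only; the node name node2 comes from the output above, and whether a DOWN/RESUME cycle actually clears a stale AllocTRES value depends on the slurmctld version.]

```shell
# Check that no running or completing jobs are still charged to the node
squeue --nodelist=node2 --states=all

# Inspect the node's allocation as slurmctld currently sees it
scontrol show node node2 | grep -E 'State=|AllocTRES'

# Drain the node out of service and bring it back; on some versions
# this prompts slurmctld to recompute the node's allocated TRES
scontrol update NodeName=node2 State=DOWN Reason="clear stale TRES"
scontrol update NodeName=node2 State=RESUME
```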
Re: [slurm-users] Releasing stale allocated TRES
Hi,

On 23.11.23 10:56, Schneider, Gerald wrote:
> I have a recurring problem with allocated TRES that are not released
> after all jobs on that node are finished. The TRES remain marked as
> allocated and no new jobs can be scheduled on that node using those
> TRES.

Remove the node from slurm.conf and restart slurmctld, then re-add it and restart again. Remove it from the Partition definitions as well.

Best regards
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security
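[Editor's note: the suggested sequence could look roughly like the sketch below, assuming a systemd-managed controller and the common /etc/slurm/slurm.conf path (both are assumptions, not stated in the thread).]

```shell
# 1. Edit slurm.conf: remove the NodeName= entry for the affected node
#    and drop it from any PartitionName= ... Nodes= lists
sudo vi /etc/slurm/slurm.conf

# 2. Restart the controller so it forgets the node
sudo systemctl restart slurmctld

# 3. Re-add the node and partition entries, then restart once more
sudo vi /etc/slurm/slurm.conf
sudo systemctl restart slurmctld
```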
Re: [slurm-users] Releasing stale allocated TRES
On 11/23/23 11:50, Markus Kötter wrote:
> On 23.11.23 10:56, Schneider, Gerald wrote:
>> I have a recurring problem with allocated TRES, which are not released
>> after all jobs on that node are finished. The TRES are still marked as
>> allocated and no new jobs can be scheduled on that node using those TRES.
> Remove the node from slurm.conf and restart slurmctld, re-add, restart.
> Remove from Partition definitions as well.

Just my 2 cents: Do NOT remove a node from slurm.conf as just described! When adding or removing nodes, both slurmctld and all slurmd's must be restarted! See the SchedMD presentation https://slurm.schedmd.com/SLUG23/Field-Notes-7.pdf, slides 51-56, for the recommended procedure.

/Ole
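[Editor's note: the key point is that every daemon gets restarted, not only the controller. A rough sketch, assuming systemd units and a pdsh-style parallel shell; the host list node[1-8] is a placeholder for the actual cluster.]

```shell
# After editing slurm.conf and distributing the same copy to all nodes:

# restart the controller first
sudo systemctl restart slurmctld

# then restart slurmd on every compute node
pdsh -w node[1-8] sudo systemctl restart slurmd

# verify that all nodes re-registered cleanly
sinfo -N -l
```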
Re: [slurm-users] Releasing stale allocated TRES
"Schneider, Gerald" writes:

> Is there any way to release the allocation manually?

I've only seen this once on our clusters, and that time just restarting slurmctld helped. If this is a recurring problem, perhaps it would help to upgrade Slurm. You are running quite an old version.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo