[slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Schneider, Gerald
Hi there,

I have a recurring problem with allocated TRES that are not released after
all jobs on a node have finished. The TRES remain marked as allocated, and
no new jobs can be scheduled on that node using those TRES.

$ scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUTot=256 CPULoad=0.11
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:tesla:8
   NodeAddr=node2 NodeHostName=node2 Version=21.08.5
   OS=Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023
   RealMemory=1025593 AllocMem=0 FreeMem=1025934 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=AMPERE
   BootTime=2023-11-23T09:01:28 SlurmdStartTime=2023-11-23T09:02:09
   LastBusyTime=2023-11-23T09:03:19
   CfgTRES=cpu=256,mem=1025593M,billing=256,gres/gpu=8,gres/gpu:tesla=8
   AllocTRES=gres/gpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Previously the allocation disappeared after the server had been powered off for
a couple of hours (power conservation), but the issue has occurred again, and
this time it persists even after the server was off overnight.

Is there any way to release the allocation manually?
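
Something along these lines is what I would try, sketched here with our node
name; I have not verified that a DOWN/RESUME cycle actually clears the stale
gres/gpu counter:

$ squeue -w node2    # confirm that no jobs are left on the node
$ scontrol update NodeName=node2 State=DOWN Reason="stale AllocTRES"
$ scontrol update NodeName=node2 State=RESUME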

Regards,
Gerald Schneider

--
Gerald Schneider

Fraunhofer-Institut für Graphische Datenverarbeitung IGD 
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany 
Tel. +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199 
gerald.schnei...@igd-r.fraunhofer.de | www.igd.fraunhofer.de




Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Markus Kötter

Hi,

On 23.11.23 10:56, Schneider, Gerald wrote:
> I have a recurring problem with allocated TRES that are not released
> after all jobs on a node have finished. The TRES remain marked as
> allocated, and no new jobs can be scheduled on that node using those
> TRES.


Remove the node from slurm.conf and restart slurmctld, then re-add it and
restart again. Remove it from the Partition definitions as well.
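
Sketched below with the node and partition names from your output; the
NodeName parameters are illustrative, keep whatever your slurm.conf actually
contains:

# slurm.conf: comment out the node and take it out of the partition
#NodeName=node2 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1025593 Gres=gpu:tesla:8
PartitionName=AMPERE Nodes=...   # node2 removed from the list

$ systemctl restart slurmctld

Then restore both lines and restart slurmctld again.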


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Ole Holm Nielsen

On 11/23/23 11:50, Markus Kötter wrote:
> On 23.11.23 10:56, Schneider, Gerald wrote:
>> I have a recurring problem with allocated TRES that are not released
>> after all jobs on a node have finished. The TRES remain marked as
>> allocated, and no new jobs can be scheduled on that node using those
>> TRES.
>
> Remove the node from slurm.conf and restart slurmctld, then re-add it and
> restart again. Remove it from the Partition definitions as well.


Just my 2 cents: do NOT remove a node from slurm.conf as just described!

When adding or removing nodes, both slurmctld and all slurmd daemons must be
restarted! See the SchedMD presentation
https://slurm.schedmd.com/SLUG23/Field-Notes-7.pdf, slides 51-56, for the
recommended procedure.
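
In short, something like the following; pdsh and the node range are
assumptions, use whatever parallel shell your site provides:

$ systemctl restart slurmctld                  # on the controller
$ pdsh -w node[1-8] systemctl restart slurmd   # on all compute nodes (range is a placeholder)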


/Ole



Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Bjørn-Helge Mevik
"Schneider, Gerald"  writes:

> Is there any way to release the allocation manually?

I've only seen this once on our clusters, and that time it was fixed simply
by restarting slurmctld.

If this is a recurring problem, perhaps it will help to upgrade Slurm.
You are running quite an old version.
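
If you want to try the restart first, roughly this, assuming slurmctld runs
under systemd:

$ systemctl restart slurmctld
$ scontrol show node node2 | grep AllocTRES   # check whether the stale allocation is gone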

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

