This is not unusual. Set RealMemory in slurm.conf a bit below what slurmd reports.

Kernel modules that get upgraded may use a little more memory, which causes exactly this situation. There are other causes as well, but giving the kernel and system some wiggle room avoids most of these problems.

It also helps in OOM killer situations.
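
As a rough illustration of the "a bit below" advice, here is a minimal sketch that reads the RealMemory value from "slurmd -C" output and subtracts some headroom. The function name, the 1024 MB headroom and the rounding to 100 MB are assumptions chosen for the example, not values recommended by Slurm; pick whatever margin suits your nodes.

```python
import re


def suggested_real_memory(slurmd_c_output: str,
                          headroom_mb: int = 1024,
                          round_to_mb: int = 100) -> int:
    """Return a RealMemory value (in MB) somewhat below what slurmd detected.

    headroom_mb and round_to_mb are illustrative defaults (assumptions),
    not Slurm recommendations.
    """
    # "slurmd -C" prints a line containing "RealMemory=<megabytes>".
    match = re.search(r"\bRealMemory=(\d+)", slurmd_c_output)
    if match is None:
        raise ValueError("no RealMemory= field found in the given text")
    detected_mb = int(match.group(1))

    target_mb = detected_mb - headroom_mb
    if target_mb <= 0:
        raise ValueError("headroom is larger than the detected memory")

    # Round down so the configured value never exceeds the target.
    return (target_mb // round_to_mb) * round_to_mb


if __name__ == "__main__":
    # Sample line taken from the "slurmd -C" output quoted in this thread.
    sample = ("NodeName=str957-bl0-01 CPUs=24 Boards=1 SocketsPerBoard=2 "
              "CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64378")
    print(suggested_real_memory(sample))
```

The result would go into the RealMemory= field of the node's line in slurm.conf.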

Brian Andrus

On 10/1/2021 1:22 AM, Diego Zuccato wrote:
Hello all.

I just upgraded to Debian 11, which brings Slurm 21.08, and the newer nodes upgraded without too many issues (just minor config changes, one being the RealMemory value in slurm.conf, since for some reason the new slurmd seems to detect about 12MB less memory than before).

But the older nodes are still marked IDLE+DRAIN:
-8<--
NodeName=str957-bl0-01 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=24 CPULoad=0.39
   AvailableFeatures=ib,blade,intel,avx
   ActiveFeatures=ib,blade,intel,avx
   Gres=(null)
   NodeAddr=str957-bl0-01 NodeHostName=str957-bl0-01 Version=20.11.4
   OS=Linux 5.10.0-8-amd64 #1 SMP Debian 5.10.46-5 (2021-09-23)
   RealMemory=64000 AllocMem=0 FreeMem=63518 Sockets=2 Boards=1
   MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=2 Owner=N/A MCS_label=N/A
   Partitions=b1
   BootTime=2021-10-01T09:35:42 SlurmdStartTime=2021-10-01T09:36:15
   CfgTRES=cpu=24,mem=62.50G,billing=182
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2021-10-01T08:08:18]
   Comment=(null)
-8<--
I already reduced the RealMemory value in slurm.conf and restarted both slurmctld and slurmd (in case "scontrol reconfigure" was not enough; that is not really clear from the docs).

The relevant lines in slurm.conf are:
-8<--
NodeName=DEFAULT            Sockets=2 ThreadsPerCore=2  State=UNKNOWN  MemSpecLimit=2048
NodeName=str957-bl0-0[1-2]            CoresPerSocket=6  RealMemory=64000  Weight=2 Feature=ib,blade,intel,avx
-8<--

And the node says:
-8<--
root@str957-bl0-01:~# slurmd -C
NodeName=str957-bl0-01 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64378
UpTime=0-00:37:17
-8<--

I also tried lowering the RealMemory setting to 60000, in case MemSpecLimit interfered, but the result remains the same.

Any ideas?

TIA!

