Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Paul Edmon
They fixed this in newer versions of Slurm.  We had the same issue with 
older versions, so we had to run with the config_overrides option on to 
keep the logs quiet.  They changed the way logging is done in the more 
recent releases and it's not as chatty.


-Paul Edmon-

On 5/12/22 7:35 AM, Per Lönnborg wrote:

Greetings,

is there a way to lower the log rate on error messages in slurmctld 
for nodes with hardware errors?


We see for example this for a node that has DIMM errors:

[2022-05-12T07:07:34.757] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:35.760] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:36.763] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:37.766] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:38.769] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:39.773] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:40.776] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:41.779] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:42.781] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:45.143] error: Node node37 has low real_memory size (257642 < 257660)


The log warning is correct, the node has DIMM errors, but that's one 
log entry per second. Is such a high log rate really intended?
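
While the node stays in this state, the repeated entries can at least be condensed when reading the log. The following is only a minimal sketch, assuming each message occupies a single line in the slurmctld log file and follows exactly the format quoted above; the function name summarize_low_memory is hypothetical and not part of Slurm.

```python
import re
from collections import Counter

# Message format copied from the log excerpt above (an assumption: other
# Slurm versions may word this message differently).
LOW_MEM = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: Node (?P<node>\S+) has low real_memory size "
    r"\((?P<found>\d+) < (?P<configured>\d+)\)$"
)


def summarize_low_memory(lines):
    """Collapse repeated low real_memory errors.

    Returns {node: (count, found, configured)}, where found and configured
    are the two numbers from the most recent matching message for that node.
    """
    counts = Counter()
    sizes = {}
    for line in lines:
        match = LOW_MEM.match(line.strip())
        if match is None:
            continue  # not a low real_memory message
        node = match["node"]
        counts[node] += 1
        sizes[node] = (int(match["found"]), int(match["configured"]))
    return {node: (counts[node], *sizes[node]) for node in counts}


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < slurmctld.log
    for node, (count, found, configured) in summarize_low_memory(sys.stdin).items():
        print(f"{node}: {count} messages, {configured - found} short of {configured}")
```

For node37 above, the two numbers differ by 257660 - 257642 = 18; RealMemory is given in megabytes in slurm.conf, so the node reports 18 MB less than configured.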


Thanks,
/ Per Lonnborg







Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg  writes:

> I "forgot" to tell you our version because it's a bit embarrassing - 19.05.8...

Haha! :D

-- 
B/H




Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg  writes:

> Greetings,

Good day!

> is there a way to lower the log rate on error messages in slurmctld for nodes 
> with hardware errors? 

You don't say which version of Slurm you are running, but I think this
was changed in 21.08, so the node will only try to register once if it
has too little memory, thus only giving one such message.  (The node
will then have state "inval" in sinfo.)

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


