Hi Magnus,

On 8/28/23 10:16, Hagdorn, Magnus Karl Moritz wrote:
> we recently enabled the energy gathering plugin on using the IPMI
> gatherer with libfreeipmi. We are running the latest slurm 23.02.4 on
> rocky 8.5. We are getting sporadic buffer overflows in slurmd when it
> is trying to query the IPMI interface. We have the feeling this occurs
> when a lot of jobs are getting started on the node. Has anybody come
> across this issue and even better found a solution?

I'm curious to learn about your energy gathering method: how do you extract node power over IPMI with FreeIPMI (or some other toolset), and how do you configure Slurm for this?

In our cluster I picked a Dell node and obtained this IPMI power reading from its BMC using a FreeIPMI tool:

$ ipmi-dcmi -D LAN_2_0 --username=root --password=<secret> --hostname=c190b \
    --get-system-power-statistics
Current Power                        : 151 Watts
Minimum Power over sampling duration : 6 watts
Maximum Power over sampling duration : 293 watts
Average Power over sampling duration : 153 watts
Time Stamp                           : 08/29/2023 - 08:54:03
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
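
For comparing such readings across nodes, the output above can be parsed with a few lines of Python. This is only a minimal sketch: the function name parse_power_statistics and the SAMPLE constant are my own illustrative names, and it assumes the output keeps the "label : value watts" layout shown above.

```python
import re

# Sample text copied from the ipmi-dcmi output shown above.
SAMPLE = """\
Current Power                        : 151 Watts
Minimum Power over sampling duration : 6 watts
Maximum Power over sampling duration : 293 watts
Average Power over sampling duration : 153 watts
Time Stamp                           : 08/29/2023 - 08:54:03
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
"""


def parse_power_statistics(text):
    """Return {label: watts} for every line that reports a value in watts.

    Hypothetical helper: lines such as the time stamp or the reporting
    period do not end in "watts" and are skipped.
    """
    readings = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?)\s*:\s*(\d+)\s+watts\s*$", line, re.IGNORECASE)
        if match:
            readings[match.group(1).strip()] = int(match.group(2))
    return readings


if __name__ == "__main__":
    for label, watts in parse_power_statistics(SAMPLE).items():
        print(f"{label}: {watts} W")
```

In practice the text would come from running ipmi-dcmi itself rather than from a pasted sample.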

However, the node's iDRAC BMC web GUI presents a somewhat different reading, which I assume must be reliable: 168 W.

I'm also using Slurm with AcctGatherEnergyType=acct_gather_energy/rapl, see [1]. With RAPL, "scontrol show node c190" reports CurrentWatts=177, which covers only CPU+DIMM power.
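
To line the Slurm figure up against the BMC readings, the CurrentWatts field can be pulled out of the scontrol output. Again just a sketch: current_watts is a hypothetical helper name, and the sample string is an abbreviated stand-in for the real "scontrol show node" output, which contains many more fields.

```python
import re


def current_watts(scontrol_output):
    """Return the CurrentWatts value as an int, or None if it is absent
    or not numeric (hypothetical helper for illustration only)."""
    match = re.search(r"\bCurrentWatts=(\d+)\b", scontrol_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Abbreviated stand-in for the output of "scontrol show node c190".
    sample = "NodeName=c190 CurrentWatts=177"
    print(current_watts(sample))
```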

Thanks for sharing any insights.

Best regards,
Ole

[1] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#power-monitoring-and-management
