Hi

I'm running Slurm 2.6.0 and MWM (Moab Workload Manager) 7.2.4 in our test
cluster at the moment.  I happened to notice that node load reporting wasn't
consistent: periodically you'd see a "sane" load reported in Moab, but most
of the time the reported load was zero, despite an accurate CPULoad value
reported by "scontrol show node".

Finally got to digging into this.  It appears that the only time load was
being reported properly was in the Moab scheduling cycle directly after
slurmctld did a node ping.  In subsequent scheduling cycles the load
(again, as reported by Moab) was back to zero.

The node ping is significant because that is the only time the node record
is updated.  Since the wiki2 interface only reports records that have
changed, and the load record hasn't changed after that, the load isn't
reported in the queries that follow the node ping.

Judging from this behavior, I'm guessing that Moab does not store the load
value: every time it queries resources in Slurm and gets no load back, it
sets the node's load back to zero.

I've altered src/plugins/sched/wiki2/get_nodes.c slightly: basically I
moved the section that reports CPULOAD above the check for updated info
(update_time > last_node_update), so the load is sent on every query.
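
To illustrate the reordering, here is a minimal, self-contained sketch.  It
is not the attached patch and not the actual Slurm code: struct demo_node,
demo_append(), demo_dump_node() and the load_always switch are hypothetical
names made up for this example, the STATE value is hard-coded, and the
fields shown are only illustrative.  It just models the idea that fields
written before the "unchanged since last query" early return go out on
every query, while fields written after it go out only for changed records.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a node record (not Slurm's struct). */
struct demo_node {
    const char *name;
    double cpu_load;      /* load as last recorded by the controller */
    int cpus;             /* static configuration field */
    time_t last_update;   /* when this record last changed */
};

/* Append formatted text at *off; returns 0 on success, -1 if it won't fit. */
static int demo_append(char *buf, size_t len, size_t *off, const char *fmt, ...)
{
    va_list ap;
    int w;

    va_start(ap, fmt);
    w = vsnprintf(buf + *off, len - *off, fmt, ap);
    va_end(ap);
    if (w < 0 || (size_t)w >= len - *off)
        return -1;
    *off += (size_t)w;
    return 0;
}

/*
 * Build one node's reply for a GETNODES-style query.
 * update_time: timestamp sent by the scheduler with the query.
 * load_always: 0 = load reported only for changed records (original order),
 *              1 = load reported before the change check (reordered).
 * Returns the number of characters written, or -1 if buf is too small.
 */
int demo_dump_node(const struct demo_node *n, time_t update_time,
                   int load_always, char *buf, size_t len)
{
    size_t off = 0;

    assert(n != NULL && buf != NULL && len > 0);

    /* Always reported; the state is hard-coded for this sketch. */
    if (demo_append(buf, len, &off, "%s:STATE=Idle;", n->name))
        return -1;

    /* Reordered variant: the load goes out before the change check. */
    if (load_always &&
        demo_append(buf, len, &off, "CPULOAD=%.2f;", n->cpu_load))
        return -1;

    /* Record unchanged since the scheduler's last query: stop here. */
    if (update_time > n->last_update)
        return (int)off;

    /* Original variant: the load goes out only for changed records. */
    if (!load_always &&
        demo_append(buf, len, &off, "CPULOAD=%.2f;", n->cpu_load))
        return -1;

    /* Remaining fields are sent only when the record has changed. */
    if (demo_append(buf, len, &off, "CPROC=%d;", n->cpus))
        return -1;

    return (int)off;
}
```

Calling demo_dump_node() with an update_time later than the record's
last_update and load_always set to 0 leaves CPULOAD out of the reply, which
matches what I was seeing in the cycles after the node ping.  With
load_always set to 1 the same call still includes it.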

I don't know if this is the appropriate way to fix it.  The wiki
specification that Adaptive has published doesn't seem to indicate how this
should function.  Either MWM should assume the last value reported is still
accurate, or Slurm needs to report it for every wiki GETNODES command.

Anyway, the patch is attached, it seems to be working for me, and I've
rolled it into our debian build directory.  YMMV.

Michael

Attachment: loadreport.patch
Description: Binary data
