Hello,

After adding nodes to nodes.conf and simultaneously removing them from nodes_down.conf, where they had been marked with "State=FUTURE", followed by an 'scontrol reconfigure' and a restart of slurmctld, several of the added nodes were reported as "not responding" with a very regular time pattern. This happened both for nodes added in 'drain' state and for nodes added directly to active partitions: for a short while sinfo showed them in 'partition*', then for about half an hour in 'partition', then in 'partition*' again, and so on; at times the controller set them to 'down'. All network tests on those nodes came back fine at the very moments the controller was marking them as unresponsive.
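
For reference, the change was roughly of this shape (node names and hardware values below are placeholders, not our actual configuration):

    # nodes_down.conf (before): nodes defined but not yet active
    NodeName=node[101-110] State=FUTURE

    # nodes.conf (after): the same nodes moved into the active definitions
    NodeName=node[101-110] CPUs=64 RealMemory=256000 State=UNKNOWN

    # then, on the controller
    scontrol reconfigure
    systemctl restart slurmctld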

To better understand the problem, does anyone know how the controller decides whether a node is responding or not? In case the problem reappears, I would like to be able to reproduce on the command line the conditions that led the controller to mark some nodes as not responding.
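
As far as I understand (corrections welcome), slurmctld periodically pings the slurmd daemons, fanning the pings out through a communication tree whose width is set by TreeWidth, and flags a node as not responding when no answer comes back within the configured timeouts. A rough manual approximation from the controller host, with 'node105' standing in for a real node name, would be:

    # timing and fan-out parameters involved in the decision
    scontrol show config | grep -E 'SlurmdTimeout|SlurmdPort|TreeWidth'

    # can the controller itself reach slurmd on its port (default 6818)?
    nc -zv node105 6818

    # state, reason and slurmd start time as recorded by the controller
    scontrol show node node105 | grep -E 'State=|Reason=|SlurmdStartTime'

    # reasons recorded for down/drained/failing nodes
    sinfo -R --nodes=node105

Of course a plain TCP check from the controller does not follow exactly the same path as the real ping, which may be forwarded through other slurmd daemons according to TreeWidth, so the two can disagree.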

Does anyone know what could cause the issue? Could it be related to the activation of 'FUTURE' nodes? In our case it was probably solved by increasing the TreeWidth parameter from its default of 50 to more than the number of nodes, and in one case by undraining the nodes.
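
Concretely, the workaround amounted to something like the following (600 is just a placeholder for "more than the number of nodes", node105 for a drained node):

    # slurm.conf: widen the fan-out tree so the controller contacts every slurmd directly
    TreeWidth=600

    # undrain a node / clear its not-responding state
    scontrol update NodeName=node105 State=RESUME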

Our Slurm version: 21.08.8-2

Thanks, cheers,

    Raffaele

