Hello,

After adding nodes to nodes.conf and simultaneously removing them from nodes_down.conf, where they had been marked with "State=FUTURE", followed by an 'scontrol reconfigure' and a restart of slurmctld, several of the added nodes were reported as "not responding" with a very regular time pattern. This happened both for nodes added in 'drain' state and for nodes added directly to active partitions: for a short while sinfo showed them in 'partition*', then for about half an hour in 'partition', then in 'partition*' again, and so on; at times the controller set them to 'down'. All network tests on those nodes came back fine at the very moments the controller was marking them as unresponsive.
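
For reference, the change was roughly of this shape (node names and hardware values below are placeholders, not our actual configuration):

    # nodes_down.conf (before): nodes defined but not yet active
    NodeName=node[101-110] State=FUTURE

    # nodes.conf (after): the same nodes moved into the active definitions
    NodeName=node[101-110] CPUs=64 RealMemory=256000 State=UNKNOWN

    # then, on the controller
    scontrol reconfigure
    systemctl restart slurmctld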

To better understand the problem, does anyone know how the controller decides whether a node is responding or not? In case the problem reappears, I would like to be able to reproduce on the command line the conditions that led the controller to mark some nodes as not responding.
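
As far as I understand (corrections welcome), slurmctld periodically pings the slurmd daemons, fanning the pings out through a communication tree whose width is set by TreeWidth, and flags a node as not responding when no answer comes back within the configured timeouts. A rough manual approximation from the controller host, with 'node105' standing in for a real node name, would be:

    # timing and fan-out parameters involved in the decision
    scontrol show config | grep -E 'SlurmdTimeout|SlurmdPort|TreeWidth'

    # can the controller itself reach slurmd on its port (default 6818)?
    nc -zv node105 6818

    # state, reason and slurmd start time as recorded by the controller
    scontrol show node node105 | grep -E 'State=|Reason=|SlurmdStartTime'

    # reasons recorded for down/drained/failing nodes
    sinfo -R --nodes=node105

Of course a plain TCP check from the controller does not follow exactly the same path as the real ping, which may be forwarded through other slurmd daemons according to TreeWidth, so the two can disagree.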

Does anyone know what could cause the issue? Could it be related to the activation of 'FUTURE' nodes? In our case it was probably solved by increasing the TreeWidth parameter from its default of 50 to more than the number of nodes, and in one case by undraining the nodes.
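
Concretely, the workaround amounted to something like the following (600 is just a placeholder for "more than the number of nodes", node105 for a drained node):

    # slurm.conf: widen the fan-out tree so the controller contacts every slurmd directly
    TreeWidth=600

    # undrain a node / clear its not-responding state
    scontrol update NodeName=node105 State=RESUME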

Our Slurm version: 21.08.8-2

Thanks, cheers,

    Raffaele

