On 7/11/22 09:32, taleinterve...@sjtu.edu.cn wrote:
> Recently we found some strange log messages in slurmctld.log about nodes
> not responding, such as:
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
> [2022-07-09T03:23:58.098] Node node171 now responding
> [2022-07-09T03:23:58.099] Node node165 now responding
> [2022-07-09T03:23:58.099] Node node163 now responding
> [2022-07-09T03:23:58.099] Node node172 now responding
> [2022-07-09T03:23:58.099] Node node170 now responding
> [2022-07-09T03:23:58.099] Node node175 now responding
> [2022-07-09T03:23:58.099] Node node164 now responding
> [2022-07-09T03:23:58.099] Node node178 now responding
> [2022-07-09T03:23:58.099] Node node177 now responding
> Meanwhile, checking slurmd.log and nhc.log on those nodes, everything
> seems to be fine at the reported time.
> So we guess that slurmctld launches some detection towards those compute
> nodes and does not get a response, which leads slurmctld to consider
> those nodes as not responding.
Such node warnings could be caused by a broken network. Or by your DNS
servers not responding to DNS lookups so that "node177" is unknown, for
example.
> Then the question is: what detection does slurmctld perform? How does it
> determine whether a node is responsive or not responding?
> And is it possible to customize slurmctld's behavior for this detection,
> for example the wait timeout or the retry count, before it determines a
> node to be not responding?
As far as I know, slurmctld periodically pings the slurmd daemons, and a
node that has not answered within SlurmdTimeout is set to the DOWN state.
See the slurm.conf timeout parameters displayed by:
# scontrol show config | grep Timeout
We normally use this:
SlurmctldTimeout = 600 sec
SlurmdTimeout = 300 sec
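In slurm.conf the corresponding entries would look like this (the values
are simply what we use at our site, not a recommendation):

SlurmctldTimeout=600
SlurmdTimeout=300

Raising SlurmdTimeout makes the controller more tolerant before it sets a
non-responding node to DOWN. After editing slurm.conf, run "scontrol
reconfigure" (or restart slurmctld) to apply the change.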
/Ole