Hello,

I know this does not quite answer your question, but you could additionally check (and maybe you already did :)) whether this is a network problem:
* Are the nodes reachable outside of Slurm during that time (SSH, ping)?
* If you have a monitoring system (Prometheus, Icinga, etc.), does it report any issues?

And lastly, did you try setting the log level to "debug" for "slurmd" and "slurmctld"?

Kind Regards
--
On 11.07.2022 at 09:32, taleinterve...@sjtu.edu.cn wrote:
Hi all,

Recently we found some strange entries in slurmctld.log about nodes not responding, such as:

[2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
[2022-07-09T03:23:58.098] Node node171 now responding
[2022-07-09T03:23:58.099] Node node165 now responding
[2022-07-09T03:23:58.099] Node node163 now responding
[2022-07-09T03:23:58.099] Node node172 now responding
[2022-07-09T03:23:58.099] Node node170 now responding
[2022-07-09T03:23:58.099] Node node175 now responding
[2022-07-09T03:23:58.099] Node node164 now responding
[2022-07-09T03:23:58.099] Node node178 now responding
[2022-07-09T03:23:58.099] Node node177 now responding

Meanwhile, slurmd.log and nhc.log on those nodes all look fine at the reported time.

So we guess that slurmctld launched some kind of check against those compute nodes and did not get a response, which led it to consider those nodes as not responding.

The questions are then: what check does slurmctld perform? How does it determine whether a node is responsive or not?

And is it possible to customize slurmctld's behavior for this check, for example the wait timeout or the retry count before a node is declared not responding?
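To line up log entries like the ones quoted above with ping results or monitoring data, it can help to extract per-node gaps from slurmctld.log. The following is only a minimal Python sketch and not part of Slurm: the function names `expand_hostlist` and `outage_durations` are made up for illustration, the regular expressions assume exactly the message wording shown in the excerpt, and the hostlist expansion handles only a single `prefix[a-b,c]` group (on a real cluster, `scontrol show hostnames` is the more reliable way to expand a hostlist).

```python
import re
from datetime import datetime

# Patterns assume the exact wording seen in the quoted log excerpt.
LINE_RE = re.compile(r"\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)")
DOWN_RE = re.compile(r"error: Nodes (?P<nodes>\S+) not responding")
UP_RE = re.compile(r"Node (?P<node>\S+) now responding")


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'node[128-130,170]'.

    Hypothetical helper: supports one prefix with one bracket group only.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", expr)
    if not m:
        return [expr]  # plain hostname, nothing to expand
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # keep zero padding, e.g. node[001-010]
            hosts.extend(
                f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)
            )
        else:
            hosts.append(prefix + part)
    return hosts


def outage_durations(lines):
    """Return {node: seconds} between 'not responding' and 'now responding'."""
    down_since = {}
    durations = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.fromisoformat(m["ts"])
        down = DOWN_RE.search(m["msg"])
        if down:
            for host in expand_hostlist(down["nodes"]):
                # Keep the earliest report if a node is listed repeatedly.
                down_since.setdefault(host, ts)
            continue
        up = UP_RE.search(m["msg"])
        if up and up["node"] in down_since:
            start = down_since.pop(up["node"])
            durations[up["node"]] = (ts - start).total_seconds()
    return durations


if __name__ == "__main__":
    sample = [
        "[2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding",
        "[2022-07-09T03:23:58.098] Node node171 now responding",
        "[2022-07-09T03:23:58.099] Node node165 now responding",
    ]
    for node, seconds in sorted(outage_durations(sample).items()):
        print(f"{node}: {seconds:.3f} s")
```

Feeding it the whole slurmctld.log (for example `outage_durations(open(path))`, with `path` pointing at your log file) gives one duration per node, which can then be compared against the ping or monitoring history for the same time window.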
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski
ul. Banacha 2
02-097 Warszawa
Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/