We had a situation recently where a desktop was turned off for a week. When we brought it back online (in a different part of the network with a different IP), everything came up fine (slurmd and munge).
But it kept going into DOWN* for no apparent reason (nothing obvious from either the daemons or the logs). As part of another issue, we ran "scontrol reconfigure" (and, as it turned out, restarted slurmctld as well). THAT seems to have stopped it going to DOWN*: it switched to IDLE and stayed there.

Not that this necessarily has anything to do with your issue, but it does sound similar. A rough sketch of the commands we would try on a stuck node is at the bottom of this message, below the quoted text.

- Bill

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Bill Benedetto <bbenede...@goodyear.com>     The Goodyear Tire & Rubber Co.
I don't speak for Goodyear and they don't speak for me.  We're both happy.
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

>>> Herc Silverstein writes:

Herc> We have a cluster (in Google GCP) which has a few partitions set up to
Herc> auto-scale, but one partition is set up not to autoscale. The desired
Herc> state is for all of the nodes in this non-autoscaled partition
Herc> (SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
Herc> However, we are finding that nodes periodically end up in the down*
Herc> state and that we cannot get them back into a usable state. This is
Herc> using Slurm 19.05.7.
Herc>
Herc> We have a script that runs periodically, checks the state of the
Herc> nodes, and takes action based on that state. If a node is in a down
Herc> state, it gets terminated, and if it is successfully terminated its
Herc> state is set to power_down. After a short one-second pause, the nodes
Herc> that are in the POWERING_DOWN state and not drained are set to RESUME.
Herc>
Herc> Sometimes, after we start a node back up and slurmd is running on it,
Herc> we cannot get it back into a usable Slurm state even after manually
Herc> fiddling with its state. It seems to bounce between idle* and down*,
Herc> but the node is up and we can log into it.
Herc>
Herc> Does anyone have an idea of what might be going on? And what we can do
Herc> to get these nodes back into a usable (I guess "idle") state?
Herc>
Herc> Thanks,
Herc>
Herc> Herc
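P.S. For what it's worth, this is roughly the sequence I would try on one of your stuck nodes. The node name below is just a placeholder, the systemd unit name may differ on your systems, and whether "scontrol reconfigure" alone is enough I can't say -- in our case restarting slurmctld turned out to be part of the fix:

    # See what slurmctld currently thinks of the node, and why it was downed
    scontrol show node gpu-t4-4x-ondemand-0
    sinfo -R

    # Re-read slurm.conf on the controller (this is what we ran)
    scontrol reconfigure

    # If the node is still stuck in DOWN*, restart the controller as well
    # (run on the slurmctld host)
    systemctl restart slurmctld

    # Then clear the node state by hand
    scontrol update NodeName=gpu-t4-4x-ondemand-0 State=RESUME

If the node keeps flapping back to down*, the Reason field from "sinfo -R" or "scontrol show node" is usually the first place to look.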