Hi Xaver,
On 12/6/23 09:28, Xaver Stiensmeier wrote:
> using https://slurm.schedmd.com/power_save.html we had one case out of
> many (>242) node starts that resulted in
> |slurm_update error: Invalid node state specified|
> when we called:
> |scontrol update NodeName="$1" state=RESUME reason=FailedStartup|
> in the Fail script. We run this to make 100% sure that the instances -
> that are created on demand - are again `~idle` after being removed by the
> fail program. They are set to RESUME before the actual instance gets
> destroyed. I remember that I had this case manually before, but I don't
> remember when it occurs.
> Maybe someone has a great idea how to tackle this problem.
You probably can't assign a "reason" when you update a node with
state=RESUME, so try the update without reason=FailedStartup.
The scontrol manual page says:
Reason=<reason> Identify the reason the node is in a "DOWN", "DRAINED",
"DRAINING", "FAILING" or "FAIL" state.
Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
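For the fail script itself, a minimal sketch of the update without a reason might look like the following. I have not verified that the reason is really what triggers the error, so treat this as an experiment. The function name resume_nodes is made up for illustration, and the sketch assumes the node list arrives as the first argument, as in your command.

```shell
#!/bin/bash
# Hypothetical sketch of a fail script helper, not a tested production script.

# Return the given node(s) to service. No Reason= is passed, on the
# assumption that a reason only applies to DOWN/DRAINED/DRAINING/FAILING/FAIL.
resume_nodes() {
    local nodes="$1"
    if ! scontrol update NodeName="$nodes" State=RESUME; then
        # Log the failure locally instead of attaching a reason to the node.
        echo "$(date '+%F %T') RESUME failed for $nodes" >&2
        return 1
    fi
}

# Only act when a node list was passed as the first argument.
if [ "$#" -gt 0 ]; then
    resume_nodes "$1"
fi
```

If the update still fails now and then, the message on stderr at least tells you which node was affected and when, so you can check the node's state at that moment with scontrol show node.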
IHTH,
Ole