Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
Using https://slurm.schedmd.com/power_save.html, we had one case out of many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make sure that the instances, which are created on demand, are `~idle` again after being removed by the fail program. They are set to RESUME before the actual instance is destroyed. I remember hitting this case manually before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with state=RESUME. The scontrol manual page says:

Reason=<reason> Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING", "FAILING" or "FAIL" state.
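
If the reason really is the culprit, the call in your Fail script could be reduced to something like the sketch below. This is untested and only an assumption on my part: the function name resume_node and the SCONTROL override variable are made up for illustration, and I have not verified that dropping the reason makes the error go away.

```shell
# Hypothetical sketch of the Fail script call (names are illustrative).
# SCONTROL can be overridden, e.g. SCONTROL=echo, for a dry run.
SCONTROL="${SCONTROL:-scontrol}"

resume_node() {
    # $1: node name as passed to the Fail script.
    # No Reason= is given here, since the man page documents Reason only
    # for the DOWN, DRAINED, DRAINING, FAILING and FAIL states.
    "$SCONTROL" update NodeName="$1" State=RESUME
}
```

In the Fail script you would then call resume_node "$1" at the point where you currently call scontrol directly.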

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole

