[slurm-users] Issues with orphaned jobs after update

2023-12-06 Thread Jeffrey McDonald
Hi, Yesterday, an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I ended up losing a number of jobs on the compute nodes. Ultimately, the installation seems to have succeeded, but I now appear to have some issues with job remnants. About once per minute (per job), the slurmctld

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
Hi Ole, for multiple reasons we build it ourselves. I am not really involved in that process, but I will contact the person who is. Thanks for the recommendation! We should probably implement a regular check for new Slurm versions. I am not 100% sure whether this will fix our issues

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
On 12/6/23 11:51, Xaver Stiensmeier wrote: Good idea. Here's our current version: ``` sinfo -V slurm 22.05.7 ``` Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading. There are nice bug fixes in 23.02 mentioned in my

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
Hi Ole, Good idea. Here's our current version: ``` sinfo -V slurm 22.05.7 ``` Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading. Xaver On 06.12.23 11:09, Ole Holm Nielsen wrote: Hi Xaver, Your version of Slurm may

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
Hi Xaver, Your version of Slurm may matter for your power saving experience. Do you run an updated version? /Ole On 12/6/23 10:54, Xaver Stiensmeier wrote: Hi Ole, I will double check, but I am very sure that giving a reason is possible as it has been done at least 20 other times without

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
Hi Ole, I will double check, but I am very sure that giving a reason is possible as it has been done at least 20 other times without error during that exact run. It might be ignored though. You can also give a reason when defining the states POWER_UP and POWER_DOWN. Slurm's documentation is not

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
Hi Xaver, On 12/6/23 09:28, Xaver Stiensmeier wrote: using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in |slurm_update error: Invalid node state specified| when we called: |scontrol update NodeName="$1" state=RESUME

Re: [slurm-users] Disabling SWAP space will it effect SLURM working

2023-12-06 Thread Hans van Schoot
Hi Joseph, This might depend on the rest of your configuration, but in general swap should not be needed for anything on Linux. BUT: you might get OOM killer messages in your system logs, and SLURM might fall victim to the OOM killer (OOM = Out Of Memory) if you run applications on the

[slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier
Dear Slurm User list, using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in |slurm_update error: Invalid node state specified| when we called: |scontrol update NodeName="$1" state=RESUME reason=FailedStartup| in the Fail script. We