[slurm-users] Issues with orphaned jobs after update

2023-12-06 Thread Jeffrey McDonald
Hi,
Yesterday, an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I
ended up losing a number of jobs on the compute nodes. Ultimately, the
installation seems to have been successful, but it appears I am now left
with some job remnants. About once per minute (per job), the slurmctld
daemon is logging:

[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39104]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39106]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:32.792] error: slurm_receive_msg [146.57.133.38:54722]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:34.189] error: slurm_receive_msg [146.57.133.49:59058]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:34.197] error: slurm_receive_msg [146.57.133.49:58232]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48856]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48860]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:36.329] error: slurm_receive_msg [146.57.133.46:50848]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:59.827] error: slurm_receive_msg [146.57.133.14:60328]:
Zero Bytes were transmitted or received
[2023-12-06T08:16:59.828] error: slurm_receive_msg [146.57.133.37:37734]:
Zero Bytes were transmitted or received
[2023-12-06T08:17:03.285] error: slurm_receive_msg [146.57.133.35:41426]:
Zero Bytes were transmitted or received
[2023-12-06T08:17:13.244] error: slurm_receive_msg [146.57.133.105:34416]:
Zero Bytes were transmitted or received
[2023-12-06T08:17:13.726] error: slurm_receive_msg [146.57.133.15:60164]:
Zero Bytes were transmitted or received

The controller also shows orphaned jobs:

[2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node
amd03
[2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node
amd03
[2023-12-06T07:47:42.011] error: Orphan StepId=8862.extern reported on node
amd12
[2023-12-06T07:47:42.011] error: Orphan StepId=9065.extern reported on node
amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=9066.extern reported on node
amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=8987.extern reported on node
amd09
[2023-12-06T07:47:42.012] error: Orphan StepId=9068.extern reported on node
amd08
[2023-12-06T07:47:42.012] error: Orphan StepId=8862.extern reported on node
amd13
[2023-12-06T07:47:42.012] error: Orphan StepId=8774.extern reported on node
amd10
[2023-12-06T07:47:42.012] error: Orphan StepId=9051.extern reported on node
amd10
[2023-12-06T07:49:22.009] error: Orphan StepId=9071.extern reported on node
aslab01
[2023-12-06T07:49:22.010] error: Orphan StepId=8699.extern reported on node
gpu05


On the compute nodes, I see a corresponding error message like this one:

[2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin
with id:0 not exist or is not loaded
[2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg:
hash_g_compute: REQUEST_STEP_COMPLETE has error



The error always seems to reference a job that was canceled, e.g.
9052:

# sacct -j 9052
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0
9052.batch        batch                               24  CANCELLED      0:0
9052.extern      extern                               24  CANCELLED      0:0

These jobs were running at the start of the update but were subsequently
canceled because of the slurmd or slurmctld timeouts during the update.
How can I clean this up? I've tried canceling the jobs, but nothing seems
to work to remove them.
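
For reference, the kind of thing I have tried (canceling) or am considering
next (restarting slurmd on the affected nodes) looks roughly like this; the
job and node names are just taken from the log excerpts above:

```
# from the controller: cancel the jobs the orphan messages refer to
scancel 9050 9055 9052

# on an affected compute node (e.g. amd03): look for leftover step daemons
pgrep -a slurmstepd

# restart slurmd so the node re-registers with the upgraded slurmctld
systemctl restart slurmd
```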

Thanks in advance,
Jeff


Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Hi Ole,

For multiple reasons we build it ourselves. I am not really involved
in that process, but I will contact the person who is. Thanks for the
recommendation! We should probably implement a regular check for new
Slurm versions. I am not 100% sure whether this will fix our issues,
but it's worth a try.
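
As a first step that check could be something as simple as the sketch
below (the version string to compare against would still have to be
maintained by hand, so this is only a rough idea):

```
#!/bin/bash
# warn if the installed Slurm is older than the release we want to track
expected="23.11"
installed=$(sinfo -V | awk '{print $2}')
if [ "$(printf '%s\n' "$expected" "$installed" | sort -V | head -n1)" != "$expected" ]; then
    echo "Slurm $installed is older than $expected - check for a newer release"
fi
```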

Best regards
Xaver

On 06.12.23 12:03, Ole Holm Nielsen wrote:

On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.


There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
Power with Slurm" at https://slurm.schedmd.com/publications.html

For reasons of security and functionality it is recommended to follow
Slurm's releases (maybe not the first few minor versions of new major
releases like 23.11).  FYI, I've collected information about upgrading
Slurm in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole





Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.


There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving 
Power with Slurm" at https://slurm.schedmd.com/publications.html


For reasons of security and functionality it is recommended to follow 
Slurm's releases (maybe not the first few minor versions of new major 
releases like 23.11).  FYI, I've collected information about upgrading 
Slurm in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm


/Ole



Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Hi Ole,

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.

Xaver

On 06.12.23 11:09, Ole Holm Nielsen wrote:

Hi Xaver,

Your version of Slurm may matter for your power saving experience.  Do
you run an updated version?

/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there's a general problem with
the command (as in something that necessarily occurs). But I will take
a closer look. I really feel like it has to be something more
conditional, though, as otherwise the error would have occurred more
often (i.e. every time a fail is handled and the command is executed).








Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

Hi Xaver,

Your version of Slurm may matter for your power saving experience.  Do you 
run an updated version?


/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there's a general problem with
the command (as in something that necessarily occurs). But I will take
a closer look. I really feel like it has to be something more
conditional, though, as otherwise the error would have occurred more
often (i.e. every time a fail is handled and the command is executed).

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have already
implemented most of the things you mention there. I will take a look
at `DebugFlags=Power`, though. `PrivateData=cloud` was an annoying
thing to find out; SLURM plans/planned to change that in the future
(the cloud key behaves differently from any other key in PrivateData).
Of course our setup differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances
- that are created on demand - are again `~idle` after being removed
by the fail program. They are set to RESUME before the actual
instance gets destroyed. I remember that I had this case manually
before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with
state=RESUME.  The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN",
"DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving

and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save




Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there's a general problem with
the command (as in something that necessarily occurs). But I will take
a closer look. I really feel like it has to be something more
conditional, though, as otherwise the error would have occurred more
often (i.e. every time a fail is handled and the command is executed).

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have already
implemented most of the things you mention there. I will take a look
at `DebugFlags=Power`, though. `PrivateData=cloud` was an annoying
thing to find out; SLURM plans/planned to change that in the future
(the cloud key behaves differently from any other key in PrivateData).
Of course our setup differs a little in the details.
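
For anyone finding this thread later, the two settings I mean are plain
slurm.conf entries, roughly like this (illustrative excerpt, not our
full configuration):

```
# slurm.conf excerpt (illustrative)
DebugFlags=Power       # extra slurmctld logging for the power saving logic
PrivateData=cloud      # unlike the other PrivateData flags, this one *shows*
                       # powered-down cloud nodes instead of hiding data
```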

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances
- that are created on demand - are again `~idle` after being removed
by the fail program. They are set to RESUME before the actual
instance gets destroyed. I remember that I had this case manually
before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with
state=RESUME.  The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN",
"DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving

and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole






Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
using https://slurm.schedmd.com/power_save.html we had one case out of 
many (>242) node starts that resulted in


|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances - 
that are created on demand - are again `~idle` after being removed by the 
fail program. They are set to RESUME before the actual instance gets 
destroyed. I remember that I had this case manually before, but I don't 
remember when it occurs.


Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with 
state=RESUME.  The scontrol manual page says:


Reason= Identify the reason the node is in a "DOWN", "DRAINED", 
"DRAINING", "FAILING" or "FAIL" state.


Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole




Re: [slurm-users] Disabling SWAP space will it effect SLURM working

2023-12-06 Thread Hans van Schoot

Hi Joseph,

This might depend on the rest of your configuration, but in general swap
should not be needed for anything on Linux.
BUT: you might get OOM killer messages in your system logs, and SLURM
might fall victim to the OOM killer (OOM = Out Of Memory) if you run
applications on the compute node that eat up all your RAM.
Swap does not prevent this, but it makes it less likely to happen.
I've seen the OOM killer kill slurm daemon processes on compute nodes
with swap; usually slurm recovers just fine after the application that
ate up all the RAM gets killed by the OOM killer. My compute nodes are
not configured to monitor the memory usage of jobs. If you have memory
configured as a managed resource in your SLURM setup and you leave a
bit of headroom for the OS itself (e.g. only hand out a maximum of 250GB
RAM to jobs on your 256GB RAM nodes), you should be fine.
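
Just to illustrate what I mean by headroom (the node names and numbers
are made up, check the slurm.conf man page for your own setup):

```
# slurm.conf excerpt (illustrative)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory              # make memory a managed resource
NodeName=node[01-04] CPUs=64 RealMemory=250000   # ~250 GB usable for jobs on a 256 GB node
```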


cheers,
Hans


ps. I'm just a happy slurm user/admin, not an expert, so I might be 
wrong about everything :-)




On 06-12-2023 05:57, John Joseph wrote:

Dear All,
Good morning
We have a 4-node SLURM instance [256 GB RAM in each node] which we
installed and which is working fine.
We have 2 GB of SWAP space on each node. To make full use of the system,
we want to disable the SWAP memory.


I would like to know: if I disable the SWAP partition, will it affect
SLURM functionality?
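
For clarity, by disabling SWAP I mean the usual steps on each node,
something like:

```
swapoff -a   # turn off swap immediately
# and comment out / remove the swap entry in /etc/fstab so it stays off after a reboot
```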


Advice requested
Thanks
Joseph John



[slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Dear Slurm User list,

Using https://slurm.schedmd.com/power_save.html, we had one case out of
many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances
(which are created on demand) are `~idle` again after being removed by
the fail program. They are set to RESUME before the actual instance gets
destroyed. I remember having hit this case manually before, but I don't
remember when it occurs.
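
For context, the relevant part of our Fail script is essentially this
(simplified sketch; the actual teardown of the instance is specific to
our cloud setup):

```
#!/bin/bash
# Fail script: called by slurmctld with the failed node name(s) in $1
scontrol update NodeName="$1" state=RESUME reason=FailedStartup  # return the node to ~idle
# ... afterwards the on-demand instance behind the node is destroyed (site-specific) ...
```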

Maybe someone has a great idea how to tackle this problem.

Best regards
Xaver Stiensmeier