[slurm-users] Issues with orphaned jobs after update
Hi,

Yesterday, an upgrade of Slurm from 22.05.4 to 23.11.0 went sideways and I ended up losing a number of jobs on the compute nodes. Ultimately, the installation seems to have been successful, but it appears I now have some issues with job remnants.

About once per minute (per job), the slurmctld daemon is logging:

[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39104]: Zero Bytes were transmitted or received
[2023-12-06T08:16:32.505] error: slurm_receive_msg [146.57.133.18:39106]: Zero Bytes were transmitted or received
[2023-12-06T08:16:32.792] error: slurm_receive_msg [146.57.133.38:54722]: Zero Bytes were transmitted or received
[2023-12-06T08:16:34.189] error: slurm_receive_msg [146.57.133.49:59058]: Zero Bytes were transmitted or received
[2023-12-06T08:16:34.197] error: slurm_receive_msg [146.57.133.49:58232]: Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48856]: Zero Bytes were transmitted or received
[2023-12-06T08:16:35.757] error: slurm_receive_msg [146.57.133.39:48860]: Zero Bytes were transmitted or received
[2023-12-06T08:16:36.329] error: slurm_receive_msg [146.57.133.46:50848]: Zero Bytes were transmitted or received
[2023-12-06T08:16:59.827] error: slurm_receive_msg [146.57.133.14:60328]: Zero Bytes were transmitted or received
[2023-12-06T08:16:59.828] error: slurm_receive_msg [146.57.133.37:37734]: Zero Bytes were transmitted or received
[2023-12-06T08:17:03.285] error: slurm_receive_msg [146.57.133.35:41426]: Zero Bytes were transmitted or received
[2023-12-06T08:17:13.244] error: slurm_receive_msg [146.57.133.105:34416]: Zero Bytes were transmitted or received
[2023-12-06T08:17:13.726] error: slurm_receive_msg [146.57.133.15:60164]: Zero Bytes were transmitted or received

The controller also shows orphaned jobs:

[2023-12-06T07:47:42.010] error: Orphan StepId=9050.extern reported on node amd03
[2023-12-06T07:47:42.010] error: Orphan StepId=9055.extern reported on node amd03
[2023-12-06T07:47:42.011] error: Orphan StepId=8862.extern reported on node amd12
[2023-12-06T07:47:42.011] error: Orphan StepId=9065.extern reported on node amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=9066.extern reported on node amd07
[2023-12-06T07:47:42.011] error: Orphan StepId=8987.extern reported on node amd09
[2023-12-06T07:47:42.012] error: Orphan StepId=9068.extern reported on node amd08
[2023-12-06T07:47:42.012] error: Orphan StepId=8862.extern reported on node amd13
[2023-12-06T07:47:42.012] error: Orphan StepId=8774.extern reported on node amd10
[2023-12-06T07:47:42.012] error: Orphan StepId=9051.extern reported on node amd10
[2023-12-06T07:49:22.009] error: Orphan StepId=9071.extern reported on node aslab01
[2023-12-06T07:49:22.010] error: Orphan StepId=8699.extern reported on node gpu05

On the compute nodes, I see a corresponding error message like this one:

[2023-12-06T08:18:03.292] [9052.extern] error: hash_g_compute: hash plugin with id:0 not exist or is not loaded
[2023-12-06T08:18:03.292] [9052.extern] error: slurm_send_node_msg: hash_g_compute: REQUEST_STEP_COMPLETE has error

The errors always reference a job that was canceled, e.g. 9052:

# sacct -j 9052
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
9052         sys/dashb+     a40gpu                    24  CANCELLED      0:0
9052.batch        batch                               24  CANCELLED      0:0
9052.extern      extern                               24  CANCELLED      0:0

These jobs were running at the start of the update but were subsequently canceled because of the slurmd or slurmctld timeouts during the update.
How can I clean this up? I've tried canceling the jobs, but nothing seems to work to remove them.

Thanks in advance,
Jeff
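For anyone hitting the same symptom, a minimal cleanup sketch, assuming the orphaned .extern steps are held by leftover slurmstepd processes from before the upgrade; the job IDs and node names are placeholders pulled from the logs above, and whether a slurmd restart actually clears the stale steps is an assumption, not a confirmed fix.

```
#!/bin/bash
# Hedged sketch: attempt to clear orphaned .extern steps left over from
# an upgrade. Job IDs and node names are placeholders; adapt before use.
ORPHANED_JOBS="9050 9051 9052 9055"
NODES="amd03 amd07 amd08 amd10"

for job in $ORPHANED_JOBS; do
    scancel "$job" 2>/dev/null || true   # may be a no-op if slurmctld already lost the job
done

for node in $NODES; do
    # scontrol listpids runs on the node itself and shows slurmstepd
    # processes still tracking steps; restarting slurmd makes the node
    # re-register with the (new) controller.
    ssh "$node" 'scontrol listpids; systemctl restart slurmd'
done
```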
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

for multiple reasons we build it ourselves, but I am not really involved in that process; I will contact the person who is. Thanks for the recommendation! We should probably implement a regular check for whether there is a new Slurm version. I am not 100% sure whether this will fix our issues, but it's worth a try.

Best regards
Xaver

On 06.12.23 12:03, Ole Holm Nielsen wrote:

On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:
```
sinfo -V
slurm 22.05.7
```
Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading.

There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving Power with Slurm" at https://slurm.schedmd.com/publications.html

For reasons of security and functionality it is recommended to follow Slurm's releases (maybe not the first few minor versions of new major releases like 23.11).

FYI, I've collected information about upgrading Slurm in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole
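A minimal sketch of the kind of regular version check mentioned above, assuming an admin keeps a target-version string updated by hand after reading release announcements; the file path and the use of local mail are hypothetical, and this does not query SchedMD automatically.

```
#!/bin/bash
# Hedged sketch: warn when the installed Slurm version differs from a
# target version maintained by hand. /etc/slurm/target-version is a
# hypothetical path.
installed=$(sinfo -V | awk '{print $2}')
target=$(cat /etc/slurm/target-version)

if [ "$installed" != "$target" ]; then
    echo "Slurm version drift: installed=$installed, target=$target" \
      | mail -s "Slurm version check" root    # assumes local mail delivery works
fi
```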
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:
```
sinfo -V
slurm 22.05.7
```
Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading.

There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving Power with Slurm" at https://slurm.schedmd.com/publications.html

For reasons of security and functionality it is recommended to follow Slurm's releases (maybe not the first few minor versions of new major releases like 23.11).

FYI, I've collected information about upgrading Slurm in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole
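Since the thread keeps coming back to upgrading, a hedged sketch of the upgrade order SchedMD's documentation describes: accounting daemon first, then the controller, then the compute nodes. The package installation steps, node names, and the use of pdsh are placeholders, not a site-specific recipe.

```
# Hedged sketch of the documented Slurm upgrade order.
systemctl stop slurmdbd        # 1. upgrade slurmdbd first
# ... install the new slurmdbd package here ...
systemctl start slurmdbd

systemctl stop slurmctld       # 2. then the controller
# ... install the new slurmctld package here ...
systemctl start slurmctld

# 3. finally the compute nodes, often rolling; pdsh is one hypothetical
#    way to fan the restart out across nodes.
pdsh -w 'amd[01-13],gpu[01-05]' 'systemctl restart slurmd'
```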
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

Good idea. Here's our current version:
```
sinfo -V
slurm 22.05.7
```
Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading.

Xaver

On 06.12.23 11:09, Ole Holm Nielsen wrote:

Hi Xaver,

Your version of Slurm may matter for your power saving experience. Do you run an updated version?

/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible, as it has been done at least 20 other times without error during that exact run. It might be ignored, though. You can also give a reason when defining the states POWER_UP and POWER_DOWN. Slurm's documentation does not always give all the information. We have been running our solution for about a year now, so I don't think there's a general problem (as in something that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional, as otherwise the error would have occurred more often (i.e. every time a failure is handled and the command is executed).
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Xaver,

Your version of Slurm may matter for your power saving experience. Do you run an updated version?

/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible, as it has been done at least 20 other times without error during that exact run. It might be ignored, though. You can also give a reason when defining the states POWER_UP and POWER_DOWN. Slurm's documentation does not always give all the information. We have been running our solution for about a year now, so I don't think there's a general problem (as in something that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional, as otherwise the error would have occurred more often (i.e. every time a failure is handled and the command is executed).

Your repository would've been really helpful for me when we started implementing the cloud scheduling, but I feel like we have implemented most things you mention there already. But I will take a look at `DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find out; SLURM plans/planned to change that in the future (the cloud key behaves differently than any other key in PrivateData). Of course our setup differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in `slurm_update error: Invalid node state specified` when we called:

`scontrol update NodeName="$1" state=RESUME reason=FailedStartup`

in the Fail script. We run this to make 100% sure that the instances - which are created on demand - are again `~idle` after being removed by the fail program. They are set to RESUME before the actual instance gets destroyed. I remember that I had this case manually before, but I don't remember when it occurs. Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with state=RESUME. The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving and in my power saving tools at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH, Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

I will double check, but I am very sure that giving a reason is possible, as it has been done at least 20 other times without error during that exact run. It might be ignored, though. You can also give a reason when defining the states POWER_UP and POWER_DOWN. Slurm's documentation does not always give all the information. We have been running our solution for about a year now, so I don't think there's a general problem (as in something that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional, as otherwise the error would have occurred more often (i.e. every time a failure is handled and the command is executed).

Your repository would've been really helpful for me when we started implementing the cloud scheduling, but I feel like we have implemented most things you mention there already. But I will take a look at `DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find out; SLURM plans/planned to change that in the future (the cloud key behaves differently than any other key in PrivateData). Of course our setup differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in `slurm_update error: Invalid node state specified` when we called:

`scontrol update NodeName="$1" state=RESUME reason=FailedStartup`

in the Fail script. We run this to make 100% sure that the instances - which are created on demand - are again `~idle` after being removed by the fail program. They are set to RESUME before the actual instance gets destroyed. I remember that I had this case manually before, but I don't remember when it occurs. Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with state=RESUME. The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving and in my power saving tools at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH, Ole
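For context on the settings discussed above, a hedged slurm.conf excerpt showing where `DebugFlags=Power` and `PrivateData=cloud` sit among typical power-saving parameters; the script paths, timeout values, and comments are placeholder assumptions, not the poster's actual configuration.

```
# Hedged slurm.conf sketch; paths and timeouts are placeholders.
SuspendProgram=/usr/local/sbin/suspend_node.sh
ResumeProgram=/usr/local/sbin/resume_node.sh
ResumeFailProgram=/usr/local/sbin/fail_node.sh   # the "Fail script" discussed here
SuspendTime=600        # idle seconds before a node is powered down (assumed)
ResumeTimeout=900      # seconds a node may take to boot (assumed)
DebugFlags=Power       # extra power-saving detail in the slurmctld log
PrivateData=cloud      # make powered-down cloud nodes visible in sinfo
```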
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in `slurm_update error: Invalid node state specified` when we called:

`scontrol update NodeName="$1" state=RESUME reason=FailedStartup`

in the Fail script. We run this to make 100% sure that the instances - which are created on demand - are again `~idle` after being removed by the fail program. They are set to RESUME before the actual instance gets destroyed. I remember that I had this case manually before, but I don't remember when it occurs. Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with state=RESUME. The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving and in my power saving tools at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH, Ole
Re: [slurm-users] Disabling SWAP space will it effect SLURM working
Hi Joseph,

This might depend on the rest of your configuration, but in general swap should not be needed for anything on Linux. BUT: you might get OOM killer messages in your system logs, and SLURM might fall victim to the OOM killer (OOM = Out Of Memory) if you run applications on the compute node that eat up all your RAM. Swap does not prevent this, but it makes it less likely to happen.

I've seen OOM kill Slurm daemon processes on compute nodes with swap; usually Slurm recovers just fine after the application that ate up all the RAM ends up getting killed by the OOM killer. My compute nodes are not configured to monitor memory usage of jobs.

If you have memory configured as a managed resource in your SLURM setup, and you leave a bit of headroom for the OS itself (e.g. only hand out a maximum of 250 GB RAM to jobs on your 256 GB RAM nodes), you should be fine.

cheers,
Hans

ps. I'm just a happy slurm user/admin, not an expert, so I might be wrong about everything :-)

On 06-12-2023 05:57, John Joseph wrote:

Dear All,
Good morning
We have a 4-node [256 GB RAM in each node] SLURM instance which we installed and which is working fine. We have 2 GB of swap space on each node. To make full use of the system, we want to disable the swap memory. I would like to know whether disabling the swap partition will affect SLURM functionality.
Advice requested
Thanks
Joseph John
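A hedged sketch of what the "memory as a managed resource with headroom" suggestion above can look like in slurm.conf; the node names and exact figures are assumptions for illustration only.

```
# Hedged slurm.conf sketch: schedule memory as a resource and keep
# headroom for the OS. Node names and figures are placeholders.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Advertise ~250 GB of the physical 256 GB so jobs can never claim it all
# (RealMemory is given in megabytes):
NodeName=node[01-04] RealMemory=250000
# Alternatively, MemSpecLimit reserves memory for slurmd/OS explicitly.
```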
[slurm-users] Power Save: When is RESUME an invalid node state?
Dear Slurm User list,

using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in

`slurm_update error: Invalid node state specified`

when we called:

`scontrol update NodeName="$1" state=RESUME reason=FailedStartup`

in the Fail script. We run this to make 100% sure that the instances - which are created on demand - are again `~idle` after being removed by the fail program. They are set to RESUME before the actual instance gets destroyed. I remember that I had this case manually before, but I don't remember when it occurs. Maybe someone has a great idea how to tackle this problem.

Best regards
Xaver Stiensmeier
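A hedged sketch of a more defensive Fail script, assuming the intermittent error comes from the node being in a state that cannot transition to RESUME at that moment; logging the state, omitting reason= on the RESUME update (per the scontrol man page quoted earlier in this thread), and the single retry are all assumptions, not a confirmed fix.

```
#!/bin/bash
# Hedged sketch of a defensive Fail script: log the node's current state,
# update to RESUME without reason=, and retry once on failure.
node="$1"

state=$(scontrol -o show node "$node" | grep -o 'State=[^ ]*')
echo "Fail handler: $node currently has $state"

if ! scontrol update NodeName="$node" state=RESUME; then
    sleep 5    # give slurmctld a moment, then retry once
    scontrol update NodeName="$node" state=RESUME
fi
```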