Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB
Hi Richard,

Slurmctld caches the updates until slurmdbd comes back online. You can see how many records are pending for the database by running the “sdiag” command and looking for “DBD Agent queue size”. If this number grows significantly, it means that slurmdbd isn’t available.

-Greg

On 01/11/2022, 07:23, "slurm-users" wrote:

> Hi,
>
> Just for my info, I would like to know what happens when SlurmDBD loses its connection to the backend database, for example MariaDB. Does it cache the accounting info and keep it until the DB comes back up, or does it panic and shut down?
>
> Thank you,
> RC.
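The sdiag check described above could be scripted for monitoring. The following is a minimal sketch, assuming sdiag is on the PATH and prints a line of the form "DBD Agent queue size: N". The function names and the threshold of 1000 are hypothetical and not part of Slurm.

```python
import re
import subprocess
from typing import Optional

# Pattern for the line sdiag is assumed to print for the accounting agent queue.
_QUEUE_RE = re.compile(r"DBD Agent queue size:\s*(\d+)")


def dbd_agent_queue_size(sdiag_output: str) -> Optional[int]:
    """Return the 'DBD Agent queue size' value found in sdiag output, or None."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def queue_is_backing_up(sdiag_output: str, threshold: int = 1000) -> bool:
    """True if the pending record count exceeds a site-chosen (hypothetical) threshold."""
    size = dbd_agent_queue_size(sdiag_output)
    return size is not None and size > threshold


def read_sdiag() -> str:
    """Run sdiag and return its text output (requires a working Slurm install)."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return result.stdout
```

On a host with Slurm installed, `queue_is_backing_up(read_sdiag())` could be called from a cron job or monitoring probe. A suitable threshold depends on how busy the cluster is, so the default here is only a placeholder.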
Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB
It caches up to a point. As I understand it, that is about an hour (depending on the size of the cluster and how busy it is, as well as available memory, etc.).

Brian Andrus

On 10/31/2022 9:20 PM, Richard Chang wrote:

> Hi,
>
> Just for my info, I would like to know what happens when SlurmDBD loses its connection to the backend database, for example MariaDB. Does it cache the accounting info and keep it until the DB comes back up, or does it panic and shut down?
>
> Thank you,
> RC.
[slurm-users] SlurmDBD losing connection to the backend MariaDB
Hi,

Just for my info, I would like to know what happens when SlurmDBD loses its connection to the backend database, for example MariaDB. Does it cache the accounting info and keep it until the DB comes back up, or does it panic and shut down?

Thank you,
RC.
Re: [slurm-users] Prolog and job_submit
On 10/31/22 5:46 am, Davide DelVento wrote:

> Thanks for helping me find workarounds.

No worries!

>> My only other thought is that you might be able to use node features &
>> job constraints to communicate this without the user realising.
>
> I am not sure I understand this approach.

I was just trying to think of things that could get into the Prolog that runs as root that you could use as a signal to it. Job constraints seemed the most reasonable choice.

> Are you saying that the job_submit.lua can't directly add an environment variable that the prolog can see, but can add a constraint, which will become an environment variable that the prolog can see?

That's correct - the difference being that Slurm, not the user, is in control of its presence and the possible values it can have (as it's constrained by what you've chosen for the name of the node feature).

> Would that work if that feature is available in all nodes?

Yes, that should work just fine I believe.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Prolog and job_submit
Thanks for helping me find workarounds.

> My only other thought is that you might be able to use node features &
> job constraints to communicate this without the user realising.

I am not sure I understand this approach.

> For instance you could declare the nodes where the software is installed
> to have "Feature=mysoftware" and then your job submit could spot users
> requesting the license and add the constraint "mysoftware" to their job.
> The (root privileged) Prolog can see that via the SLURM_JOB_CONSTRAINTS
> environment variable and so could react to it.

Are you saying that the job_submit.lua can't directly add an environment variable that the prolog can see, but can add a constraint, which will become an environment variable that the prolog can see?

Would that work if that feature is available in all nodes?
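To make the Prolog side of this suggestion concrete, here is a minimal sketch of how a Prolog written in Python might test for the "mysoftware" feature in SLURM_JOB_CONSTRAINTS. The function name is hypothetical, and the tokenising rule is an assumption: it treats anything other than letters, digits, underscore, dot, and hyphen as an operator in the constraint expression. Sites whose feature names use other characters would need to adjust it.

```python
import os
import re
from typing import Mapping

FEATURE = "mysoftware"  # feature name taken from the example quoted above

# Assumption: feature names consist only of these characters; everything else
# (such as &, |, brackets, or *count) is treated as a separator.
_SEPARATORS = re.compile(r"[^A-Za-z0-9_.\-]+")


def job_requests_feature(env: Mapping[str, str], feature: str = FEATURE) -> bool:
    """True if `feature` appears as a whole token in SLURM_JOB_CONSTRAINTS."""
    constraints = env.get("SLURM_JOB_CONSTRAINTS", "")
    tokens = _SEPARATORS.split(constraints)
    return feature in tokens
```

In a real Prolog this would be called as `job_requests_feature(os.environ)`, with the site-specific reaction placed in the branch where it returns true. Matching whole tokens rather than substrings avoids a false positive on a feature such as "mysoftware2".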
Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmcrld node
On 10/31/22 10:13, Richard Chang wrote:

> This is 21.08

As I have written to you previously, switch/hpe_slingshot is only supported from Slurm 22.05!

/Ole

> On 10/31/2022 11:05 AM, Chris Samuel wrote:
>> On 27/10/22 11:30 pm, Richard Chang wrote:
>>> Yes, the system is a HPE Cray EX, and I am trying to use switch/hpe_slingshot.
>>
>> Which version of Slurm are you using Richard?
>>
>> All the best,
>> Chris
Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmcrld node
This is 21.08

Thank you,
RC

On 10/31/2022 11:05 AM, Chris Samuel wrote:

> On 27/10/22 11:30 pm, Richard Chang wrote:
>> Yes, the system is a HPE Cray EX, and I am trying to use switch/hpe_slingshot.
>
> Which version of Slurm are you using Richard?
>
> All the best,
> Chris