Re: [slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Greg Wickham
Hi Richard,

Slurmctld caches the updates until slurmdbd comes back online.

You can see how many records are pending for the database by using the “sdiag” 
command and looking for “DBD Agent queue size”.

If this number grows significantly it means that slurmdbd isn’t available.
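
As a rough illustration, the check could be scripted along the lines below. This is only a sketch: it assumes sdiag is on the PATH and prints a line of the form "DBD Agent queue size: N", and the function names and the threshold of 1000 are hypothetical placeholders to tune per site.

```python
import re
import subprocess

# Hypothetical alert threshold; choose a value that suits your cluster.
QUEUE_WARN_THRESHOLD = 1000


def parse_dbd_queue_size(sdiag_output):
    """Return the 'DBD Agent queue size' value from sdiag output, or None."""
    match = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_output)
    return int(match.group(1)) if match else None


def current_dbd_queue_size():
    """Run sdiag (assumed to be on PATH) and return the pending record count."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return parse_dbd_queue_size(result.stdout)


def queue_status(size, threshold=QUEUE_WARN_THRESHOLD):
    """Turn a queue size into a one-line status message."""
    if size is None:
        return "UNKNOWN: DBD Agent queue size not found in sdiag output"
    if size > threshold:
        return f"WARNING: {size} records are waiting for slurmdbd"
    return f"OK: {size} records pending"
```

Printing queue_status(current_dbd_queue_size()) from a cron job or a monitoring probe would then flag a growing backlog.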

   -Greg

On 01/11/2022, 07:23, "slurm-users" wrote:

> Hi,
>
> Just for my info, I would like to know what happens when SlurmDBD loses
> connection to the backend database, e.g. MariaDB.
>
> Does it cache the accounting info and keep it until the DB comes back
> up, or does it panic and shut down?
>
> Thank you,
>
> RC.



Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Brian Andrus
It caches up to a point. As I understand it, that is about an hour 
(depending on size and how busy the cluster is, as well as available 
memory, etc).


Brian Andrus


On 10/31/2022 9:20 PM, Richard Chang wrote:

> Hi,
>
> Just for my info, I would like to know what happens when SlurmDBD
> loses connection to the backend database, e.g. MariaDB.
>
> Does it cache the accounting info and keep it until the DB comes back
> up, or does it panic and shut down?
>
> Thank you,
>
> RC.






[slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Richard Chang

Hi,

Just for my info, I would like to know what happens when SlurmDBD loses 
connection to the backend database, e.g. MariaDB.


Does it cache the accounting info and keep it until the DB comes back 
up, or does it panic and shut down?


Thank you,

RC.




Re: [slurm-users] Prolog and job_submit

2022-10-31 Thread Christopher Samuel

On 10/31/22 5:46 am, Davide DelVento wrote:


> Thanks for helping me find workarounds.

No worries!

>> My only other thought is that you might be able to use node features &
>> job constraints to communicate this without the user realising.
>
> I am not sure I understand this approach.

I was just trying to think of things that could get into the Prolog that 
runs as root that you could use as a signal to it. Job constraints 
seemed the most reasonable choice.

> Are you saying that job_submit.lua can't directly add an
> environment variable that the prolog can see, but can add a
> constraint, which will then become an environment variable that the
> prolog can see?

That's correct - the difference being that Slurm, not the user, is in 
control of its presence and the possible values it can have (as it's 
constrained by what you've chosen for the name of the node feature).

> Would that work if that feature is available on all nodes?


Yes, that should work just fine I believe.
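
For what it's worth, the Prolog side of this could look something like the sketch below. It assumes the Prolog is (or calls) a Python script, that the feature is called "mysoftware" as in the earlier example, and that the constraint arrives in SLURM_JOB_CONSTRAINTS; the helper names are made up for illustration.

```python
import os
import re


def constraint_features(constraints):
    """Return the set of feature names mentioned in a constraint expression.

    Splits on the operators Slurm allows in constraints (&, |, commas,
    brackets, parentheses) and drops any "*count" suffix.
    """
    names = set()
    for token in re.split(r"[&|,\[\]()]", constraints or ""):
        name = token.split("*", 1)[0].strip()
        if name:
            names.add(name)
    return names


def prolog_should_act(environ, feature="mysoftware"):
    """True if the job's constraints mention the (hypothetical) feature."""
    return feature in constraint_features(environ.get("SLURM_JOB_CONSTRAINTS", ""))


def main():
    # In a real Prolog this is where the root-privileged setup would go.
    if prolog_should_act(os.environ):
        print("job requested mysoftware: running extra setup")
```

Note that this only checks whether the feature name appears in the expression; a job that asks for "mysoftware|other" would also match, so a stricter test may be wanted if users can add their own constraints.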

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Prolog and job_submit

2022-10-31 Thread Davide DelVento
Thanks for helping me find workarounds.

> My only other thought is that you might be able to use node features &
> job constraints to communicate this without the user realising.

I am not sure I understand this approach.

> For instance you could declare the nodes where the software is installed
> to have "Feature=mysoftware" and then your job submit could spot users
> requesting the license and add the constraint "mysoftware" to their job.
> The (root privileged) Prolog can see that via the SLURM_JOB_CONSTRAINTS
> environment variable and so could react to it.

Are you saying that job_submit.lua can't directly add an
environment variable that the prolog can see, but can add a
constraint, which will then become an environment variable that the
prolog can see?
Would that work if that feature is available on all nodes?



Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

2022-10-31 Thread Ole Holm Nielsen

On 10/31/22 10:13, Richard Chang wrote:

> This is 21.08


As I have written to you previously, switch/hpe_slingshot is only 
supported from Slurm 22.05!
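
A quick way to guard against this before enabling the plugin might be a version check like the sketch below. It assumes the version string (for example from "slurmctld -V") looks like "slurm 21.08.8"; the function names are hypothetical.

```python
import re

# First release with switch/hpe_slingshot, per the statement above.
MIN_SLINGSHOT_VERSION = (22, 5)


def parse_slurm_version(text):
    """Extract (major, minor) from a string such as 'slurm 21.08.8'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Slurm version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def supports_hpe_slingshot(version_text):
    """True if the given version is 22.05 or newer."""
    return parse_slurm_version(version_text) >= MIN_SLINGSHOT_VERSION
```

With this, supports_hpe_slingshot("slurm 21.08.8") is false and supports_hpe_slingshot("slurm 22.05.3") is true.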


/Ole


> On 10/31/2022 11:05 AM, Chris Samuel wrote:
>> On 27/10/22 11:30 pm, Richard Chang wrote:
>>> Yes, the system is a HPE Cray EX, and I am trying to use 
>>> switch/hpe_slingshot.
>>
>> Which version of Slurm are you using Richard?
>>
>> All the best,
>> Chris




Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

2022-10-31 Thread Richard Chang

This is 21.08

Thank you,

RC

On 10/31/2022 11:05 AM, Chris Samuel wrote:

> On 27/10/22 11:30 pm, Richard Chang wrote:
>> Yes, the system is a HPE Cray EX, and I am trying to use 
>> switch/hpe_slingshot.
>
> Which version of Slurm are you using Richard?
>
> All the best,
> Chris