Re: [slurm-users] salloc problem

2022-10-30 Thread Chris Samuel

On 27/10/22 4:18 am, Gizo Nanava wrote:


we run into another issue when using salloc interactively on a cluster where
Slurm power saving is enabled. The problem seems to be caused by the
job_container plugin and occurs when the job starts on a node which boots
from a power down state. If I resubmit a job immediately after the failure
to the same node, it always works. I can't find any other way to reproduce
the issue other than booting a reserved node from a power down state.


Looking at this:


slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory


I'm wondering if /scratch is a separate filesystem and, if so, whether it
could be getting mounted only _after_ slurmd has started on the node?
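
For context, that .ns file is created by the job_container/tmpfs plugin under
its configured BasePath, so a setup along these lines seems to be implied (a
sketch only; the values are assumptions inferred from the path in the error):

    # slurm.conf (compute nodes)
    JobContainerType=job_container/tmpfs

    # job_container.conf
    AutoBasePath=true
    BasePath=/scratch/job_containers    # assumption, based on the error path

If that BasePath lives on /scratch and /scratch isn't mounted yet when the
job starts, the plugin has nowhere to create the namespace file.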


If that's the case then it would explain the error and why it works 
immediately after.


On our systems we always try and ensure that slurmd is the very last 
thing to start on a node, and it only starts if everything has succeeded 
up to that point.
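
One way to express that ordering with systemd, assuming /scratch is a normal
systemd mount unit on these nodes, is a drop-in for the slurmd service (a
sketch, not our exact setup):

    # /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
    [Unit]
    # Adds Requires= and After= on the mount unit(s) backing /scratch,
    # so slurmd only starts once the filesystem is actually mounted.
    RequiresMountsFor=/scratch

followed by "systemctl daemon-reload" on the node.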


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node

2022-10-30 Thread Chris Samuel

On 27/10/22 11:30 pm, Richard Chang wrote:

Yes, the system is a HPE Cray EX, and I am trying to use 
switch/hpe_slingshot.


Which version of Slurm are you using, Richard?

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Prolog and job_submit

2022-10-30 Thread Chris Samuel

On 30/10/22 12:27 pm, Davide DelVento wrote:


But if I understand correctly your Prolog vs TaskProlog distinction,
the latter would have the environment variable and run as the user,
whereas the former runs as root and doesn't get the environment,


That's correct. My personal view is that injecting arbitrary input from 
a user (such as these environment variables) would make life hazardous 
from a security point of view for a root privileged process such as a 
prolog.



not even from the job_submit script.


That is correct; all job_submit will do is inject the environment
variable into the job's environment, just as if the user had done so.



The problem with a TaskProlog
approach is that what I want to do (making an otherwise inaccessible
file available) would work best as root. As a workaround, I could make
the file merely obscure but still user-accessible. Not ideal, but better
than nothing, which is what I have now.

Alternatively, I could use some other way to let the job_submit lua
script communicate with the Prolog; I'm not sure exactly what (a temp
directory on the shared filesystem, writable only by root?)


My only other thought is that you might be able to use node features & 
job constraints to communicate this without the user realising.


For instance, you could declare the nodes where the software is installed
to have "Feature=mysoftware", and then your job_submit could spot users
requesting the license and add the constraint "mysoftware" to their job.
The (root-privileged) Prolog can then see that via the SLURM_JOB_CONSTRAINTS
environment variable and react to it.
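
A rough sketch of what that job_submit.lua logic might look like (untested,
and assuming the license is literally called "mysoftware"):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- If the user asked for the mysoftware license, pin the job to
        -- nodes carrying the matching feature so the Prolog can spot it.
        if job_desc.licenses ~= nil and string.find(job_desc.licenses, "mysoftware") then
            if job_desc.features == nil or job_desc.features == "" then
                job_desc.features = "mysoftware"
            else
                job_desc.features = job_desc.features .. "&mysoftware"
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end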


Then when 23.02 comes out you could use the new SLURM_JOB_LICENSES 
environment variable in addition and retire the old way once jobs using 
the old method have completed.
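
On the Prolog side the check could be as simple as this sketch (the actual
action on the file is left as a comment, since that part is site-specific):

    #!/bin/bash
    # Prolog: runs as root on the allocated node(s) before the job starts.
    if [[ "${SLURM_JOB_CONSTRAINTS:-}" == *mysoftware* ]]; then
        : # make the restricted file available here (bind mount, copy, chmod, ...)
    fi
    # From 23.02 onwards the license request itself should also be visible
    # here as SLURM_JOB_LICENSES, at which point this constraint-based
    # detection can be retired.
    exit 0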



Thanks for pointing to that commit. A bit too far down the road, but good to know.


No worries, best of luck!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] What happens if slurmdbd loses connection to mysql

2022-10-30 Thread Richard Chang

Hi,

I have two dedicated nodes for slurm, node1 and node2.

I have created the following.

Role       SlurmCTLD   SlurmDBD   MariaDB server (accounting storage)
Primary    Node1       Node2      Node2
Backup     Node2       Node1      -

Shared NFS Storage from an NFS Server, for StateSaveLocation.
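
In configuration terms the layout above would look roughly like this (a
sketch; the StateSaveLocation path is just a placeholder):

    # slurm.conf (shared by node1 and node2)
    SlurmctldHost=node1
    SlurmctldHost=node2                   # backup controller
    StateSaveLocation=/nfs/slurm/state    # placeholder path on the shared NFS
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=node2
    AccountingStorageBackupHost=node1

    # slurmdbd.conf
    DbdHost=node2
    DbdBackupHost=node1
    StorageType=accounting_storage/mysql
    StorageHost=node2                     # the only MariaDB server; no backup DB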

I want to know what happens if Node2 goes down. I have read in the
documentation that if slurmdbd goes down, slurmctld can still hold on to the
accounting info, and when slurmdbd is back up it gets passed on and written
to the backend database (not the exact words, but in that vein).

So what happens if Node2 goes down and the backup slurmdbd on Node1 takes
over? Will the backup slurmdbd fail immediately (since the MariaDB server on
Node2 is also down), or will it keep the data in memory and write it back to
the database once Node2 is back up?


I hope I have managed to explain what I mean.

Thanks & regards,

Richard.


Re: [slurm-users] Prolog and job_submit

2022-10-30 Thread Davide DelVento
Hi Chris,

> Unfortunately it looks like the license request information doesn't get
> propagated into any prologs from what I see from a scan of the
> documentation. :-(

Thanks. If I am reading you right, I did notice the same thing, and in
fact that's why I wrote the job_submit lua script which gets the
license information and sets an environment variable, in the hope
that the variable would be inherited by the prolog script.

But if I understand correctly your Prolog vs TaskProlog distinction,
the latter would have the environment variable and run as the user,
whereas the former runs as root and doesn't get the environment, not
even from the job_submit script. The problem with a TaskProlog
approach is that what I want to do (making an otherwise inaccessible
file available) would work best as root. As a workaround, I could make
the file merely obscure but still user-accessible. Not ideal, but better
than nothing, which is what I have now.

Alternatively, I could use some other way to let the job_submit lua
script communicate with the Prolog; I'm not sure exactly what (a temp
directory on the shared filesystem, writable only by root?)

Thanks for pointing to that commit. A bit too far down the road, but good to know.

Cheers,
Davide



Re: [slurm-users] Prolog and job_submit

2022-10-30 Thread Chris Samuel

On 30/10/22 10:23 am, Chris Samuel wrote:

Unfortunately it looks like the license request information doesn't get
propagated into any of the prologs, from what I can see in a scan of the
documentation. 🙁


This _may_ be fixed in the next major Slurm release (February) if I'm 
reading this right:


https://github.com/SchedMD/slurm/commit/3c6c4c08d8deb89aa2c992a65964f53663097d26

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Prolog and job_submit

2022-10-30 Thread Chris Samuel

On 29/10/22 7:37 am, Davide DelVento wrote:


So either I misinterpreted that "same environment as the user tasks"
or there is something else that I am doing wrong.


Slurm has a number of different prologs that can run, which can cause
confusion, and I suspect that's what's happening here.


The "Prolog" in your configuration runs as root, but its the 
"TaskProlog" that runs as the user and so has access to the jobs 
environment (including the environment variable you are setting).
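
As a tiny illustration of the TaskProlog side (a sketch only; the variable
name is made up and would be whatever your job_submit sets), a TaskProlog can
read the job's environment and add new variables by printing "export" lines:

    #!/bin/bash
    # TaskProlog: runs as the job's user, with the job's environment available.
    # Lines printed as "export NAME=value" are added to the task's environment.
    if [ -n "$MY_LICENSE_FLAG" ]; then    # hypothetical variable set by job_submit
        echo "export MYSOFTWARE_READY=1"
    fi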


Unfortunately it looks like the license request information doesn't get
propagated into any of the prologs, from what I can see in a scan of the
documentation. :-(


Best of luck,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA