On 27/10/22 4:18 am, Gizo Nanava wrote:

we run into another issue when using salloc interactively on a cluster where Slurm power saving is enabled. The problem seems to be caused by the job_container plugin and occurs when the job starts on a node which boots from a power down state. If I resubmit a job immediately after the failure to the same node, it always works. I can't find any other way to reproduce the issue other than booting a reserved node from a power down state.

Looking at this:

slurmstepd: error: container_p_join: open failed for 
/scratch/job_containers/791670/.ns: No such file or directory

I'm wondering: is /scratch a separate filesystem, and, if so, could it be getting mounted only _after_ slurmd has started on the node?

If that's the case then it would explain the error and why it works immediately after.

On our systems we always try to ensure that slurmd is the very last thing to start on a node, and that it only starts if everything has succeeded up to that point.
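One way to enforce that ordering under systemd is a drop-in for the slurmd unit. This is a minimal sketch, assuming /scratch is a systemd-managed mount (e.g. listed in /etc/fstab) and that the distro's stock slurmd.service is in use; adjust the path to match your actual job_container BasePath:

```ini
# /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
# Hypothetical drop-in: keep slurmd from starting before /scratch is mounted.
[Unit]
# RequiresMountsFor adds Requires= and After= dependencies on scratch.mount,
# so slurmd is held back until the mount is up, and fails rather than
# starting against an empty mount point if the mount itself fails.
RequiresMountsFor=/scratch
```

After dropping this in, run `systemctl daemon-reload` so systemd picks up the override; `systemctl list-dependencies slurmd` should then show scratch.mount among slurmd's requirements.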

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

