On 27/10/22 4:18 am, Gizo Nanava wrote:
We ran into another issue when using salloc interactively on a cluster where Slurm power saving is enabled. The problem seems to be caused by the job_container plugin and occurs when the job starts on a node that boots from a powered-down state. If I resubmit a job to the same node immediately after the failure, it always works. I can't find any way to reproduce the issue other than booting a reserved node from a powered-down state.
Looking at this:
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
I'm wondering if /scratch is a separate filesystem and, if so, whether it could be getting mounted only _after_ slurmd has started on the node?
If that's the case then it would explain the error and why it works immediately after.
On our systems we always try to ensure that slurmd is the very last thing to start on a node, and that it only starts if everything has succeeded up to that point.
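One way to express that ordering, assuming the node uses systemd and that /scratch is a locally mounted filesystem, is a drop-in for the slurmd unit. This is only a sketch of the idea, not a description of how any particular site actually does it; the drop-in path and mount point are illustrative:

```
# /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
# Hypothetical drop-in: delay slurmd until /scratch (and other remote
# filesystems) are mounted, so job_container paths under /scratch exist
# before the first job step tries to join its namespace.
[Unit]
RequiresMountsFor=/scratch
After=remote-fs.target
```

After `systemctl daemon-reload`, slurmd will fail to start (rather than start early) if the /scratch mount is not available, which matches the "only start if everything succeeded" approach described above.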
All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA