Sorry for this very late response.

The directory where the job containers are to be created is of course already
there - it is on the local filesystem.
We also start slurmd as the very last process, once a node is ready to
accept jobs (one way to enforce that ordering is sketched below).
The failure seems to be either a feature of salloc or a bug in Slurm,
presumably caused by some race condition -
in very rare cases salloc works without this issue.
I see that the documentation on Slurm power saving mentions salloc, but not
the case of interactive use.
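
For reference: on systems where slurmd is managed by systemd, one way to
express "start slurmd only once the node is ready" is a drop-in unit that
orders the service after the relevant mounts and checks the base path used
by the job_container plugin before starting. This is only a sketch - the
/scratch path is taken from the error quoted below, and the drop-in file
name and targets are assumptions about a typical systemd setup, not
necessarily how our nodes are configured:

  # /etc/systemd/system/slurmd.service.d/zz-start-last.conf (hypothetical name)
  [Unit]
  # Pull in and order slurmd after whatever mount unit backs /scratch, if any
  RequiresMountsFor=/scratch
  After=local-fs.target remote-fs.target

  [Service]
  # Refuse to start slurmd if the filesystem holding the job containers
  # is not present yet
  ExecStartPre=/usr/bin/test -d /scratch

With something like this, slurmd simply refuses to start until the
filesystem is there, which is roughly what our boot sequence already
guarantees.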

Thank you & best regards
Gizo

> On 27/10/22 4:18 am, Gizo Nanava wrote:
> 
> > we run into another issue when using salloc interactively on a cluster
> > where Slurm power saving is enabled. The problem seems to be caused by
> > the job_container plugin and occurs when the job starts on a node which
> > boots from a power down state.
> > If I resubmit a job immediately after the failure to the same node, it
> > always works.
> > I can't find any other way to reproduce the issue other than booting a
> > reserved node from a power down state.
> 
> Looking at this:
> 
> > slurmstepd: error: container_p_join: open failed for 
> > /scratch/job_containers/791670/.ns: No such file or directory
> 
> I'm wondering if that is a separate filesystem and, if so, could /scratch
> be getting mounted only _after_ slurmd has started on the node?
> 
> If that's the case then it would explain the error and why it works 
> immediately after.
> 
> On our systems we always try and ensure that slurmd is the very last 
> thing to start on a node, and it only starts if everything has succeeded 
> up to that point.
> 
> All the best,
> Chris
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
