On 27/10/22 4:18 am, Gizo Nanava wrote:
We ran into another issue when using salloc interactively on a cluster where Slurm power saving is enabled. The problem seems to be caused by the job_container plugin and occurs when the job starts on a node that boots from a powered-down state. If I resubmit a job to the same node immediately after the failure, it always works. I can't find any way to reproduce the issue other than booting a reserved node from a powered-down state.
Looking at this:
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
I'm wondering if /scratch is a separate filesystem and, if so, whether it could be getting mounted only _after_ slurmd has started on the node?
If that's the case then it would explain the error and why it works immediately after.
On our systems we always try to ensure that slurmd is the very last thing to start on a node, and that it only starts if everything has succeeded up to that point.
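One way to express that ordering, assuming the node uses systemd and that /scratch is a locally mounted filesystem, is a drop-in for the slurmd unit. This is only a sketch of the idea, not a description of how any particular site actually does it; the drop-in path and mount point are illustrative:

```
# /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
# Hypothetical drop-in: delay slurmd until /scratch (and other remote
# filesystems) are mounted, so job_container paths under /scratch exist
# before the first job step tries to join its namespace.
[Unit]
RequiresMountsFor=/scratch
After=remote-fs.target
```

After `systemctl daemon-reload`, slurmd will fail to start (rather than start early) if the /scratch mount is not available, which matches the "only start if everything succeeded" approach described above.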
All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA