Re: [slurm-users] salloc problem
Sorry for this very late response. The directory where job containers are
to be created is of course already there - it is on the local filesystem.
We also start slurmd as the very last process once a node is ready to
accept jobs. This seems to be either a feature of salloc or a bug in
Slurm, presumably caused by some race condition - in very rare cases,
salloc works without this issue. I see that the documentation on Slurm
power saving mentions salloc, but not for the case of interactive use.

Thank you & best regards
Gizo

> On 27/10/22 4:18 am, Gizo Nanava wrote:
>
> > we run into another issue when using salloc interactively on a cluster
> > where Slurm power saving is enabled. The problem seems to be caused by
> > the job_container plugin and occurs when the job starts on a node
> > which boots from a power down state. If I resubmit a job immediately
> > after the failure to the same node, it always works. I can't find any
> > other way to reproduce the issue other than booting a reserved node
> > from a power down state.
>
> Looking at this:
>
> > slurmstepd: error: container_p_join: open failed for
> > /scratch/job_containers/791670/.ns: No such file or directory
>
> I'm wondering if /scratch is a separate filesystem and, if so, could it
> only be getting mounted _after_ slurmd has started on the node?
>
> If that's the case then it would explain the error and why it works
> immediately after.
>
> On our systems we always try and ensure that slurmd is the very last
> thing to start on a node, and it only starts if everything has succeeded
> up to that point.
>
> All the best,
> Chris
>
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] salloc problem
On 27/10/22 4:18 am, Gizo Nanava wrote:

> we run into another issue when using salloc interactively on a cluster
> where Slurm power saving is enabled. The problem seems to be caused by
> the job_container plugin and occurs when the job starts on a node which
> boots from a power down state. If I resubmit a job immediately after
> the failure to the same node, it always works. I can't find any other
> way to reproduce the issue other than booting a reserved node from a
> power down state.

Looking at this:

> slurmstepd: error: container_p_join: open failed for
> /scratch/job_containers/791670/.ns: No such file or directory

I'm wondering if /scratch is a separate filesystem and, if so, could it
only be getting mounted _after_ slurmd has started on the node?

If that's the case then it would explain the error and why it works
immediately after.

On our systems we always try and ensure that slurmd is the very last
thing to start on a node, and it only starts if everything has succeeded
up to that point.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
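On systemd-based nodes, one way to enforce that ordering is a drop-in for
the slurmd unit. This is only a sketch under the assumption that /scratch
is an ordinary fstab-managed mount; the drop-in file name is mine:

```
# /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
[Unit]
# Do not start slurmd until the /scratch mount is active;
# if mounting /scratch fails, slurmd is not started at all.
RequiresMountsFor=/scratch
```

RequiresMountsFor adds Requires= and After= dependencies on the mount
unit systemd generates for /scratch, so slurmd can only come up once the
filesystem is actually there.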
[slurm-users] salloc problem
Hello,

we run into another issue when using salloc interactively on a cluster
where Slurm power saving is enabled. The problem seems to be caused by
the job_container plugin and occurs when the job starts on a node which
boots from a power down state. If I resubmit a job immediately after the
failure to the same node, it always works. I can't find any other way to
reproduce the issue other than booting a reserved node from a power down
state.

Is this a known issue? srun and sbatch don't have the problem. We use
Slurm 22.05.3.

> salloc --nodelist=isu-n001
salloc: Granted job allocation 791670
salloc: Waiting for resource configuration
salloc: Nodes isu-n001 are ready for job
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
slurmstepd: error: container_g_join failed: 791670
slurmstepd: error: write to unblock task 0 failed: Broken pipe
srun: error: isu-n001: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=791670.interactive
salloc: Relinquishing job allocation 791670

# Slurm controller configs
> cat /etc/slurm/slurm.conf
..
JobContainerType=job_container/tmpfs
..
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l"

# Job_container
> cat /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/scratch/job_containers

Thank you & kind regards
Gizo
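As a defensive measure while debugging this, a node health check or an
ExecStartPre= step for slurmd could verify that the filesystem backing
BasePath is ready before jobs land on the node. A minimal sketch (the
helper name is mine, not part of Slurm; note that with AutoBasePath=true
slurmd creates BasePath itself, so the meaningful check is on the parent
filesystem /scratch):

```shell
#!/bin/sh
# Hypothetical pre-start check: succeed only if the given directory
# (e.g. the /scratch mount that will hold BasePath) exists and is
# writable, so slurmd does not start before the filesystem is ready.
check_parent_fs() {
    dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir present and writable"
        return 0
    else
        echo "not ready: $dir" >&2
        return 1
    fi
}
```

Called as `check_parent_fs /scratch || exit 1` before starting slurmd,
this turns the silent race into an explicit startup failure.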