Re: [slurm-users] salloc problem

2022-11-30 Thread Gizo Nanava
Sorry for this very late response.

The directory where job containers are to be created is of course already there -
it is on the local filesystem.
We also start slurmd as the very last process, once a node is ready to accept jobs.
This seems to be either a feature of salloc or a bug in Slurm, presumably caused
by some race condition - in very rare cases, salloc works without this issue.
I see that the documentation on Slurm power saving mentions salloc, but not the
case of using it interactively.
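
(Illustrative workaround only, not part of the original report: since a resubmission
to the same node immediately after the failure always works, the interactive
allocation can simply be retried once when the first attempt exits non-zero.
The node name is the one from the example further down in the thread.)

  salloc --nodelist=isu-n001 || salloc --nodelist=isu-n001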

Thank you & best regards
Gizo

> On 27/10/22 4:18 am, Gizo Nanava wrote:
> 
> > we run into another issue when using salloc interactively on a cluster where
> > Slurm power saving is enabled. The problem seems to be caused by the
> > job_container plugin and occurs when the job starts on a node which boots
> > from a powered-down state. If I resubmit a job immediately after the failure
> > to the same node, it always works. I can't find any other way to reproduce
> > the issue other than booting a reserved node from a powered-down state.
> 
> Looking at this:
> 
> > slurmstepd: error: container_p_join: open failed for 
> > /scratch/job_containers/791670/.ns: No such file or directory
> 
> I'm wondering if that is a separate filesystem and, if so, whether /scratch could
> be getting mounted only _after_ slurmd has started on the node?
> 
> If that's the case then it would explain the error and why it works 
> immediately after.
> 
> On our systems we always try and ensure that slurmd is the very last 
> thing to start on a node, and it only starts if everything has succeeded 
> up to that point.
> 
> All the best,
> Chris
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> 
>



Re: [slurm-users] salloc problem

2022-10-30 Thread Chris Samuel

On 27/10/22 4:18 am, Gizo Nanava wrote:


we run into another issue when using salloc interactively on a cluster where
Slurm power saving is enabled. The problem seems to be caused by the
job_container plugin and occurs when the job starts on a node which boots from
a powered-down state. If I resubmit a job immediately after the failure to the
same node, it always works. I can't find any other way to reproduce the issue
other than booting a reserved node from a powered-down state.


Looking at this:


slurmstepd: error: container_p_join: open failed for 
/scratch/job_containers/791670/.ns: No such file or directory


I'm wondering if that is a separate filesystem and, if so, whether /scratch could
be getting mounted only _after_ slurmd has started on the node?


If that's the case then it would explain the error and why it works 
immediately after.


On our systems we always try and ensure that slurmd is the very last 
thing to start on a node, and it only starts if everything has succeeded 
up to that point.
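
For illustration only (not something from this thread): on a systemd-based node
image, and assuming /scratch really is a separate mount, that ordering can be
enforced with a drop-in for the slurmd unit. The file name below is hypothetical:

# /etc/systemd/system/slurmd.service.d/after-scratch.conf
[Unit]
# Hold slurmd back until the filesystem holding the job_container BasePath is mounted
RequiresMountsFor=/scratch

Whether the ordering actually held on a given boot can then be checked with,
e.g., "systemd-analyze critical-chain slurmd.service" or "findmnt /scratch".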


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] salloc problem

2022-10-27 Thread Gizo Nanava
Hello, 

we run into another issue when using salloc interactively on a cluster where
Slurm power saving is enabled. The problem seems to be caused by the
job_container plugin and occurs when the job starts on a node which boots from
a powered-down state. If I resubmit a job immediately after the failure to the
same node, it always works. I can't find any other way to reproduce the issue
other than booting a reserved node from a powered-down state.

Is this a known issue?

srun and sbatch don't have the problem.
We use Slurm 22.05.3.

> salloc --nodelist=isu-n001
salloc: Granted job allocation 791670
salloc: Waiting for resource configuration
salloc: Nodes isu-n001 are ready for job
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
slurmstepd: error: container_g_join failed: 791670
slurmstepd: error: write to unblock task 0 failed: Broken pipe
srun: error: isu-n001: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=791670.interactive
salloc: Relinquishing job allocation 791670
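
For reference, the state of the BasePath on the node right after such a failure can
be inspected with commands like these (hypothetical, run from a login node; job ID
and node name are the ones from the output above):

ssh isu-n001 'ls -ld /scratch /scratch/job_containers /scratch/job_containers/791670'
ssh isu-n001 'findmnt /scratch'

If the per-job directory or its .ns file is missing at that moment, the
container_p_join error shown above follows directly.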

# Slurm controller configs
#
> cat /etc/slurm/slurm.conf
..
JobContainerType=job_container/tmpfs
..
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l"
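
The power-saving side of the configuration is not shown above. For context, these
are the kind of slurm.conf settings involved when nodes boot from a powered-down
state (placeholder values and program paths, not the actual settings of this
cluster):

SuspendTime=600
SuspendProgram=/usr/local/sbin/node_suspend.sh
ResumeProgram=/usr/local/sbin/node_resume.sh
ResumeTimeout=900
SuspendTimeout=120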
  
# Job_container
#
> cat /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/scratch/job_containers

Thank you & kind regards
Gizo