Hello Everyone,

I'm seeing an odd issue with the latest version of Slurm (22.05.0) when 
submitting jobs to the queue from a compute node. Not every job triggers it 
reliably, but I have a few that do. Here's one case that consistently fails at 
launch. I have not been able to reproduce the issue when submitting jobs from 
the login node.

Has anyone seen anything like this?

##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/
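
Since the failure only shows up when sbatch is run from inside an srun session, 
one thing that may be worth comparing between the two cases is the environment. 
srun exports SLURM_CPU_BIND* variables to the tasks it launches, and sbatch 
passes the submitting shell's environment to the job by default. Whether that 
is what causes the error here is an assumption, not something confirmed. The 
helper name show_cpu_bind_env is made up for this sketch:

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): print any CPU-binding
# variables present in the current shell's environment.
show_cpu_bind_env() {
    env | grep '^SLURM_CPU_BIND' || echo "no SLURM_CPU_BIND* variables set"
}

# Run once on the login node and once inside the srun session, then compare.
show_cpu_bind_env
```

If the two outputs differ, submitting again after unsetting those variables in 
the interactive shell would show whether they matter.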

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job 
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel

srun ./hpcc

mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out 
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$
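
To make the "allocated CPUs" mask in the error output easier to read, here is a 
small sketch that lists the bit positions set in a hex mask. The function name 
mask_to_cpus is made up for this example and is not part of Slurm. It works one 
hex digit at a time, so masks wider than 64 bits are handled without overflow:

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): list the bit positions that are
# set in a hexadecimal CPU mask.
mask_to_cpus() {
    local mask=${1#0x}          # drop an optional 0x prefix
    local len=${#mask}
    local cpus=()
    local i digit bit
    # Walk the hex digits from least significant (rightmost) to most
    # significant; each digit covers four consecutive bit positions.
    for (( i = 0; i < len; i++ )); do
        digit=$(( 16#${mask:len-1-i:1} ))
        for (( bit = 0; bit < 4; bit++ )); do
            if (( (digit >> bit) & 1 )); then
                cpus+=( $(( i * 4 + bit )) )
            fi
        done
    done
    echo "${cpus[*]}"
}

# Mask taken from the srun error message
mask_to_cpus 0x000000000001000000000001
```

For the mask in the error this prints "0 48", meaning only bits 0 and 48 are set 
in the step allocation reported for that node.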
