Hello everyone,

I'm having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue from a compute node. Not every job reproduces the issue, but I've got a few that will. Here's one case that consistently errors when trying to launch. I've not been able to reproduce the issue when submitting jobs from the login node.
Anyone seen anything like this?

##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel
srun ./hpcc
mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$
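
In case it helps anyone reading the error above: the "allocated CPUs" value that srun prints looks like a hexadecimal bitmask, and it can be turned into a list of CPU IDs with a small helper along these lines. This is only a sketch. The function name decode_cpu_mask is made up for illustration (it is not a Slurm tool), and it assumes that bit N of the mask corresponds to CPU N on the node. The mask is wider than 64 bits, so the helper walks it one hex digit at a time rather than relying on bash integer arithmetic for the whole value.

```shell
#!/bin/bash
# Hypothetical helper: decode a hexadecimal CPU mask into a list of CPU IDs.
# Assumption: bit N of the mask stands for CPU N.
decode_cpu_mask() {
    local mask=${1#0x}          # strip an optional leading 0x
    local len=${#mask}
    local cpus=()
    local i digit bit

    # Walk from the least significant (rightmost) hex digit to the most
    # significant one, so masks wider than 64 bits are handled safely.
    for (( i = 0; i < len; i++ )); do
        digit=$(( 16#${mask:len-1-i:1} ))
        for (( bit = 0; bit < 4; bit++ )); do
            if (( (digit >> bit) & 1 )); then
                cpus+=( $(( i * 4 + bit )) )
            fi
        done
    done

    # Print the CPU IDs separated by spaces (empty line if no bits are set).
    echo "${cpus[*]}"
}

# Mask copied from the srun error message in slurm-8533.out
decode_cpu_mask 0x000000000001000000000001
```

Comparing the decoded CPU IDs with what `#SBATCH -n 48` asked for may show how much of the node the failing step was actually given on gpu-5-2.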