I added "env | grep SLURM" to the script just before the
srun and after the for loop. The outputs were all identical:
SLURM_CHECKPOINT_IMAGE_DIR=/cray/css/u17/rwm/atp/tst90/CC_ATP_Cpp_san_save.ar_x86_tigerRun1/tst90.dir
SLURM_CLUSTER_NAME=tiger
SLURM_CPUS_ON_NODE=48
SLURM_CPUS_PER_TASK=1
SLURM_GTIDS=0
SLURM_JOBID=39774
SLURM_JOB_CPUS_PER_NODE=48(x5)
SLURM_JOB_ID=39774
SLURM_JOB_NAME=test2
SLURM_JOB_NODELIST=nid000[13-15,20-21]
SLURM_JOB_NUM_NODES=5
SLURM_JOB_PARTITION=workq
SLURM_JOB_UID=25032
SLURM_JOB_USER=rwm
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=2007
SLURM_NNODES=5
SLURM_NODEID=0
SLURM_NODELIST=nid000[13-15,20-21]
SLURM_NODE_ALIASES=(null)
SLURM_NPROCS=5
SLURM_NTASKS=5
SLURM_NTASKS_PER_NODE=1
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cray/css/u17/rwm/atp/tst90/CC_ATP_Cpp_san_save.ar_x86_tigerRun1/tst90.dir
SLURM_SUBMIT_HOST=tiger
SLURM_TASKS_PER_NODE=1(x5)
SLURM_TASK_PID=28786
SLURM_TOPOLOGY_ADDR=nid00013
SLURM_TOPOLOGY_ADDR_PATTERN=node
The sleep was removed for this test and only the first srun
ran correctly. All others failed with the "Requested node
configuration is not available" error.
The sbatch invocation line is:
/opt/slurm/default/bin/sbatch --job-name=test2 --exclusive test2.qsub > /dev/null 2>&1
And my batch script content is
#!/bin/bash
#SBATCH --time=7200
#SBATCH --job-name=test2
#SBATCH --export=ALL
#SBATCH -p workq
#SBATCH --quiet
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1
#SBATCH -t 12
#SBATCH --ntasks-per-node=1
#SBATCH -o test2.slurmlog
It seems to me that everything is consistent and correct
regarding ntasks, cpus-per-task, and ntasks-per-node.
Any other ideas?
Bob
On Mon, 24 Aug 2015, Moe Jette wrote:
The sbatch options get propagated via environment variables to the spawned
shell and picked up by srun (unless an srun command line option overrides
it). I'd guess your sbatch options conflict with the srun options, causing
the problem. I'd suggest that you take a look at your environment in the
spawned shell for variables starting with "SLURM_".
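The propagation Moe describes can be illustrated with a small standalone sketch (the exports below are simulated by hand, not produced by a real sbatch):

```shell
# Hypothetical simulation: sbatch exports its options as SLURM_*
# environment variables in the spawned batch shell, and srun reads
# those variables as defaults unless its own command line overrides them.
export SLURM_NTASKS=5           # from sbatch --ntasks=5
export SLURM_CPUS_PER_TASK=1    # from sbatch --cpus-per-task=1
export SLURM_NTASKS_PER_NODE=1  # from sbatch --ntasks-per-node=1

# Inside the batch script, this shows what a subsequent srun will inherit:
env | grep '^SLURM_' | sort
```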
Quoting Bob Moench <[email protected]>:
Hi,
Has anyone seen these errors and know what they are?
srun: error: Unable to create job step: Requested node configuration is
not available
srun: error: Unable to create job step: Job/step already completing or
completed
I run this script from an sbatch with the same allocation as the srun in
the script:
for j in `seq 1 250` ; do
    delay=`echo $j | awk '{print $1*20}'`
    time srun --ntasks=5 --cpus-per-task=1 --ntasks-per-node=1 \
        --exclusive test2.exe $delay
done
Run as above, every srun fails with the first message. If I
add a "sleep 1" to the loop, I can do about 140 sruns before the
failure (causing the second message for every failed run). Anything
with a larger sleep (e.g., a "sleep 2") gives pretty much the same
results.
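For reference, the sleep variant of the loop looks like this (sketch only: `echo` stands in for the actual `time srun ... test2.exe` call so the snippet runs anywhere, and only 3 iterations are shown):

```shell
for j in $(seq 1 3); do           # 250 iterations in the real script
    delay=$((j * 20))             # same arithmetic as the awk pipeline
    # stands in for: time srun --ntasks=5 --cpus-per-task=1 \
    #                    --ntasks-per-node=1 --exclusive test2.exe $delay
    echo "step $j: delay=$delay"
    sleep 1                       # the added one-second pause
done
```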
The exact number of successful runs varies by 10 or 20. Am I
using up some resource with each run?
For completeness, I am running on a Cray system with SLURM 14.11.8
Thanks,
Bob
--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
===============================================================
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html