I added "env | grep SLURM" to the script just before the
srun and after the for loop. The outputs were all identical:
SLURM_CHECKPOINT_IMAGE_DIR=/cray/css/u17/rwm/atp/tst90/CC_ATP_Cpp_san_save.ar_x86_tigerRun1/tst90.dir
SLURM_CLUSTER_NAME=tiger
SLURM_CPUS_ON_NODE=48
SLURM_CPUS_PER_TASK=1
SLURM_GTIDS=0
SLURM_JOBID=39774
SLURM_JOB_CPUS_PER_NODE=48(x5)
SLURM_JOB_ID=39774
SLURM_JOB_NAME=test2
SLURM_JOB_NODELIST=nid000[13-15,20-21]
SLURM_JOB_NUM_NODES=5
SLURM_JOB_PARTITION=workq
SLURM_JOB_UID=25032
SLURM_JOB_USER=rwm
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=2007
SLURM_NNODES=5
SLURM_NODEID=0
SLURM_NODELIST=nid000[13-15,20-21]
SLURM_NODE_ALIASES=(null)
SLURM_NPROCS=5
SLURM_NTASKS=5
SLURM_NTASKS_PER_NODE=1
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cray/css/u17/rwm/atp/tst90/CC_ATP_Cpp_san_save.ar_x86_tigerRun1/tst90.dir
SLURM_SUBMIT_HOST=tiger
SLURM_TASKS_PER_NODE=1(x5)
SLURM_TASK_PID=28786
SLURM_TOPOLOGY_ADDR=nid00013
SLURM_TOPOLOGY_ADDR_PATTERN=node
The sleep was removed for this test and only the first srun
ran correctly. All others failed with the "Requested node
configuration is not available" error.
The sbatch invocation line is:
/opt/slurm/default/bin/sbatch --job-name=test2 --exclusive test2.qsub > /dev/null 2>&1
And my batch script content is
#!/bin/bash
#SBATCH --time=7200
#SBATCH --job-name=test2
#SBATCH --export=ALL
#SBATCH -p workq
#SBATCH --quiet
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1
#SBATCH -t 12
#SBATCH --ntasks-per-node=1
#SBATCH -o test2.slurmlog
It seems to me that everything is consistent and correct
regarding ntasks, cpus-per-task, and ntasks-per-node.
Any other ideas?
Bob
On Mon, 24 Aug 2015, Moe Jette wrote:
The sbatch options get propagated via environment variables to the spawned
shell and picked up by srun (unless an srun command line option overrides
it). I'd guess your sbatch options conflict with the srun options, causing
the problem. I'd suggest that you take a look at your environment in the
spawned shell for variables starting with "SLURM_".
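The propagation Moe describes can be illustrated with a small standalone sketch (the exports below are simulated by hand, not produced by a real sbatch):

```shell
# Hypothetical simulation: sbatch exports its options as SLURM_*
# environment variables in the spawned batch shell, and srun reads
# those variables as defaults unless its own command line overrides them.
export SLURM_NTASKS=5           # from sbatch --ntasks=5
export SLURM_CPUS_PER_TASK=1    # from sbatch --cpus-per-task=1
export SLURM_NTASKS_PER_NODE=1  # from sbatch --ntasks-per-node=1

# Inside the batch script, this shows what a subsequent srun will inherit:
env | grep '^SLURM_' | sort
```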
Quoting Bob Moench <[email protected]>:
Hi,
Has anyone seen these errors and know what they are?
srun: error: Unable to create job step: Requested node configuration is
not available
srun: error: Unable to create job step: Job/step already completing or
completed
I run this script from an sbatch with the same allocation as the srun in
the script:
for j in `seq 1 250` ; do
    delay=`echo $j | awk '{print $1*20}'`
    time srun --ntasks=5 --cpus-per-task=1 --ntasks-per-node=1 \
        --exclusive test2.exe $delay
done
Run as above, every srun fails with the first message. If I
add a "sleep 1" to the loop, I can do about 140 sruns before the
failure (causing the second message for every failed run). Anything
with a larger sleep (e.g., a "sleep 2") gives pretty much the same
results.
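For reference, the sleep variant of the loop looks like this (sketch only: `echo` stands in for the actual `time srun ... test2.exe` call so the snippet runs anywhere, and only 3 iterations are shown):

```shell
for j in $(seq 1 3); do           # 250 iterations in the real script
    delay=$((j * 20))             # same arithmetic as the awk pipeline
    # stands in for: time srun --ntasks=5 --cpus-per-task=1 \
    #                    --ntasks-per-node=1 --exclusive test2.exe $delay
    echo "step $j: delay=$delay"
    sleep 1                       # the added one-second pause
done
```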
The exact number of successful runs varies by 10 or 20. Am I
using up some resource with each run?
For completeness, I am running on a Cray system with SLURM 14.11.8
Thanks,
Bob
--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
===============================================================
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html