Daniel,

We don't specify types in our GRES configuration, only the resource itself.

What happens if you update your srun syntax to:

srun -n1 --gres=gpu:tesla:1

Does that dispatch the job?

John DeSantis

2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
> Hello,
>
> I'm currently trying to set up SLURM on a GPU cluster with a small number of
> nodes (where smurf0[1-7] are the node names), using the gpu plugin to
> allocate jobs that require GPUs.
>
> Unfortunately, when trying to run a GPU job (any number of GPUs;
> --gres=gpu:N), SLURM doesn't execute it, asserting unavailability of the
> requested configuration.
> I've attached some logs and configuration files to provide the
> information necessary to analyze this issue.
>
> Note: Cross-posted here: http://serverfault.com/questions/685258
>
> Example (using a test.sh which echoes $CUDA_VISIBLE_DEVICES):
>
> srun -n1 --gres=gpu:1 test.sh
> --> srun: error: Unable to allocate resources: Requested node
>     configuration is not available
>
> The slurmctld log for such calls shows:
>
> gres: gpu state for job X
>   gres_cnt:1 node_cnt:1 type:(null)
> _pick_best_nodes: job X never runnable
> _slurm_rpc_allocate_resources: Requested node configuration is not
> available
>
> Jobs with any other type of configured generic resource complete
> successfully:
>
> srun -n1 --gres=gram:500 test.sh
> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>
> The node and GRES configuration in slurm.conf (which is attached as well)
> looks like:
>
> GresTypes=gpu,ram,gram,scratch
> ...
> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>   SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>   SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>
> The respective gres.conf files are:
>
> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
> Name=ram Count=48
> Name=gram Count=6000
> Name=scratch Count=1300
>
> The output of "scontrol show node" lists all the nodes with the correct GRES
> configuration, i.e.:
>
> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
> ...etc.
>
> As far as I can tell, the slurmd daemon on the nodes recognizes the GPUs
> (and other generic resources) correctly.
>
> My slurmd.log on node smurf01 says:
>
> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>
> The log for slurmctld shows:
>
> gres/gpu: state for smurf01
>   gres_cnt found:8 configured:8 avail:8 alloc:0
>   gres_bit_alloc:
>   gres_used:(null)
>
> I can't figure out why the controller node states that jobs using
> --gres=gpu:N are "never runnable" and why "the requested node configuration
> is not available".
> Any help is appreciated.
>
> Kind regards,
> Daniel Weber
>
> PS: If further information is required, don't hesitate to ask.
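For anyone hitting the same mismatch: the `type:(null)` in the slurmctld log next to `gres_cnt:1` suggests the untyped request `--gres=gpu:1` is not being matched against the typed definition `Gres=gpu:tesla:8` / `Type=tesla`. A sketch of the two workarounds implied by this thread — note that whether an untyped request can match a typed GRES depends on the SLURM version in use:

```
# Option A: request the GPU by its configured type, so the request
# matches the typed definition in slurm.conf/gres.conf exactly
srun -n1 --gres=gpu:tesla:1 test.sh

# Option B: drop the Type from gres.conf on each node, so untyped
# requests (--gres=gpu:N) match; keep slurm.conf's Gres= in sync
# (e.g. Gres=gpu:8) and restart slurmd/slurmctld afterwards
Name=gpu Count=8 File=/dev/nvidia[0-7]
```

Either way, the request syntax and the GRES definitions need to agree on whether a type is in play; mixing a typed definition with untyped requests is exactly the configuration the controller rejects here.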
