Daniel,

Instead of defining the GPU type in our Gres configuration (global with hostnames, no count), we simply add a feature so that users can request a GPU (or GPUs) via Gres and the specific model via a constraint. This may help your situation, letting your users request a specific GPU model:
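A minimal sketch of the feature-based approach described above (the node names, feature labels, and GPU counts here are illustrative, not taken from an actual cluster):

```
# slurm.conf -- untyped GRES, plus one feature per GPU model
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:2 Feature="gpu_k20"
NodeName=gpunode02 Gres=gpu:2 Feature="gpu_k40"

# gres.conf on each node -- note: no Type= specification
Name=gpu File=/dev/nvidia[0-1]
```

Users then select the model via a constraint instead of a typed GRES, e.g. srun --gres=gpu:1 -C "gpu_k20".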
srun --gres=gpu:1 -C "gpu_k20"

I didn't think of it at the time, but I remember running --gres=help when initially setting up GPUs to help rule out errors. I don't know whether you ran that command or not, but it's worth a shot to verify that Gres types are being seen correctly on a node by the controller. I also wonder whether using a cluster-wide Gres definition (vs. one only on the nodes in question) would make a difference.

John DeSantis

2015-05-06 15:12 GMT-04:00 Daniel Weber <[email protected]>:
>
> Hi John,
>
> I already tried using "Count=1" for each line, as well as "Count=8" for a
> single configuration line.
>
> I "solved" (or rather circumvented) the problem by removing the "Type=..."
> specifications from the "gres.conf" files and from slurm.conf.
>
> The jobs now run successfully, but without the possibility to request a
> certain GPU type.
>
> The generic resource examples on schedmd.com explicitly show the "Type"
> specification on GPUs, and I really would like to use it.
> I can handle it temporarily with node features instead, but I'd prefer
> using the GPU types.
>
> Thank you for your help (and the hint in the right direction).
>
> Kind regards
> Daniel
>
>
> -----Original Message-----
> From: John Desantis [mailto:[email protected]]
> Sent: Wednesday, May 6, 2015 18:16
> To: slurm-dev
> Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't
> work using gpu plugin (node configuration not available)
>
>
> Daniel,
>
> What about a count? Try adding a Count=1 after each of your GPU lines.
>
> John DeSantis
>
> 2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>>
>> The same "problem" occurs when using the gres type in the srun syntax
>> (i.e. --gres=gpu:tesla:1).
>>
>> Regards,
>> Daniel
>>
>> --
>> From: John Desantis [mailto:[email protected]]
>> Sent: Wednesday, May 6, 2015 17:39
>> To: slurm-dev
>> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work
>> using gpu plugin (node configuration not available)
>>
>>
>> Daniel,
>>
>> We don't specify types in our Gres configuration, simply the resource.
>>
>> What happens if you update your srun syntax to:
>>
>> srun -n1 --gres=gpu:tesla:1
>>
>> Does that dispatch the job?
>>
>> John DeSantis
>>
>> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>>> Hello,
>>>
>>> I'm currently trying to set up SLURM on a GPU cluster with a small
>>> number of nodes (named smurf0[1-7]), using the gpu plugin to
>>> allocate jobs that require GPUs.
>>>
>>> Unfortunately, when trying to run a GPU job (any number of GPUs;
>>> --gres=gpu:N), SLURM doesn't execute it, asserting unavailability of
>>> the requested configuration.
>>> I attached some logs and configuration files in order to provide the
>>> information necessary to analyze this issue.
>>>
>>> Note: Cross-posted here: http://serverfault.com/questions/685258
>>>
>>> Example (using a test.sh which echoes $CUDA_VISIBLE_DEVICES):
>>>
>>> srun -n1 --gres=gpu:1 test.sh
>>> --> srun: error: Unable to allocate resources: Requested node
>>>     configuration is not available
>>>
>>> The slurmctld log for such calls shows:
>>>
>>> gres: gpu state for job X
>>>   gres_cnt:1 node_cnt:1 type:(null)
>>> _pick_best_nodes: job X never runnable
>>> _slurm_rpc_allocate_resources: Requested node configuration
>>> is not available
>>>
>>> Jobs with any other type of configured generic resource complete
>>> successfully:
>>>
>>> srun -n1 --gres=gram:500 test.sh
>>> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>>
>>> The node and gres configuration in slurm.conf (which is attached as
>>> well) looks like:
>>>
>>> GresTypes=gpu,ram,gram,scratch
>>> ...
>>> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>
>>> The respective gres.conf files are:
>>>
>>> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>> Name=ram Count=48
>>> Name=gram Count=6000
>>> Name=scratch Count=1300
>>>
>>> The output of "scontrol show node" lists all the nodes with the
>>> correct gres configuration, e.g.:
>>>
>>> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>>   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>>   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>> ...
>>>
>>> As far as I can tell, the slurmd daemon on the nodes recognizes the
>>> GPUs (and the other generic resources) correctly.
>>>
>>> The slurmd.log on node smurf01 says:
>>>
>>> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>>>
>>> The slurmctld log shows:
>>>
>>> gres/gpu: state for smurf01
>>>   gres_cnt found:8 configured:8 avail:8 alloc:0
>>>   gres_bit_alloc:
>>>   gres_used:(null)
>>>
>>> I can't figure out why the controller node states that jobs using
>>> --gres=gpu:N are "never runnable" and why "the requested node
>>> configuration is not available".
>>> Any help is appreciated.
>>>
>>> Kind regards,
>>> Daniel Weber
>>>
>>> PS: If further information is required, don't hesitate to ask.
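For reference, the workaround Daniel describes (removing the "Type=..." specifications) would change the configs quoted above roughly as follows; this is a sketch showing only the GPU-related lines, with the other GRES entries left as they were:

```
# slurm.conf -- ":tesla" dropped from the node's Gres list
NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
Gres=gpu:8,ram:48,gram:no_consume:6000,scratch:1300

# gres.conf -- "Type=tesla" dropped
Name=gpu Count=8 File=/dev/nvidia[0-7]
```

With this in place a plain "srun -n1 --gres=gpu:1" dispatches, but typed requests such as "--gres=gpu:tesla:1" are no longer possible; GPU models then have to be distinguished via node features and -C constraints instead.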
