Hi John, I added the types to slurm.conf and to the gres.conf files on the nodes again, and included a gres.conf on the controller node, without any success.
Slurm rejects jobs with "--gres=gpu:1" or "--gres=gpu:tesla:1".

slurm.conf:

NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
...

gres.conf on controller:

NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia0 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia1 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia2 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia3 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia4 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia5 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia6 Count=1
NodeName=smurf01 Name=gpu Type=tesla File=/dev/nvidia7 Count=1
NodeName=smurf01 Name=ram Count=48
NodeName=smurf01 Name=gram Count=6000
NodeName=smurf01 Name=scratch Count=1300
...

gres.conf on smurf01:

Name=gpu Type=tesla File=/dev/nvidia0 Count=1
Name=gpu Type=tesla File=/dev/nvidia1 Count=1
Name=gpu Type=tesla File=/dev/nvidia2 Count=1
Name=gpu Type=tesla File=/dev/nvidia3 Count=1
Name=gpu Type=tesla File=/dev/nvidia4 Count=1
Name=gpu Type=tesla File=/dev/nvidia5 Count=1
Name=gpu Type=tesla File=/dev/nvidia6 Count=1
Name=gpu Type=tesla File=/dev/nvidia7 Count=1
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

Regards
Daniel

-----Original Message-----
From: John Desantis [mailto:[email protected]]
Sent: Wednesday, May 6, 2015 21:33
To: slurm-dev
Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

Daniel,

I hit send without completing my message:

# gres.conf
NodeName=blah Name=gpu Type=Tesla-T10 File=/dev/nvidia[0-1]

HTH.
John DeSantis

2015-05-06 15:30 GMT-04:00 John Desantis <[email protected]>:
> Daniel,
>
> You sparked an interest.
>
> I was able to get Gres Types working by:
>
> 1.) Ensuring that the type was defined in slurm.conf for the nodes in
> question;
> 2.) Ensuring that the global gres.conf respected the type.
>
> salloc -n 1 --gres=gpu:Tesla-T10:1
> salloc: Pending job allocation 532507
> salloc: job 532507 queued and waiting for resources
>
> # slurm.conf
> Nodename=blah CPUs=16 CoresPerSocket=4 Sockets=4 RealMemory=129055
> Feature=ib_ddr,ib_ofa,sse,sse2,sse3,tpa,cpu_xeon,xeon_E7330,gpu_T10,titan,mem_128G
> Gres=gpu:Tesla-T10:2 Weight=1000
>
> # gres.conf
>
>
> 2015-05-06 15:25 GMT-04:00 John Desantis <[email protected]>:
>>
>> Daniel,
>>
>> "I can handle that temporarily with node features instead but I'd
>> prefer utilizing the gpu types."
>>
>> Guilty of reading your response too quickly...
>>
>> John DeSantis
>>
>> 2015-05-06 15:22 GMT-04:00 John Desantis <[email protected]>:
>>> Daniel,
>>>
>>> Instead of defining the GPU type in our Gres configuration (global
>>> with hostnames, no count), we simply add a feature so that users can
>>> request a GPU (or GPUs) via Gres and the specific model via a
>>> constraint. This may help your situation, so that your users can
>>> request a specific GPU model:
>>>
>>> srun --gres=gpu:1 -C "gpu_k20"
>>>
>>> I didn't think of it at the time, but I remember running --gres=help
>>> when initially setting up GPUs to help rule out errors. I don't
>>> know if you ran that command or not, but it's worth a shot to verify
>>> that Gres types are being seen correctly on a node by the
>>> controller. I also wonder whether using a cluster-wide Gres definition
>>> (vs. only on the nodes in question) would make a difference.
>>>
>>> John DeSantis
>>>
>>>
>>> 2015-05-06 15:12 GMT-04:00 Daniel Weber <[email protected]>:
>>>>
>>>> Hi John,
>>>>
>>>> I already tried using "Count=1" for each line as well as "Count=8" for a
>>>> single configuration line.
>>>>
>>>> I "solved" (or rather circumvented) the problem by removing the "Type=..."
>>>> specifications from the "gres.conf" files and from the slurm.conf.
>>>>
>>>> The jobs now run successfully, but without the possibility to request a
>>>> certain GPU type.
>>>>
>>>> The generic resource examples on schedmd.com explicitly show the "Type"
>>>> specifications on GPUs, and I would really like to use them.
>>>> I can handle that temporarily with node features instead, but I'd prefer
>>>> utilizing the GPU types.
>>>>
>>>> Thank you for your help (and the hint in the right direction).
>>>>
>>>> Kind regards
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: John Desantis [mailto:[email protected]]
>>>> Sent: Wednesday, May 6, 2015 18:16
>>>> To: slurm-dev
>>>> Subject: [slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs
>>>> doesn't work using gpu plugin (node configuration not available)
>>>>
>>>>
>>>> Daniel,
>>>>
>>>> What about a count? Try adding a Count=1 after each of your GPU lines.
>>>>
>>>> John DeSantis
>>>>
>>>> 2015-05-06 11:54 GMT-04:00 Daniel Weber <[email protected]>:
>>>>>
>>>>> The same "problem" occurs when using the gres type in the srun syntax
>>>>> (i.e. using --gres=gpu:tesla:1).
>>>>>
>>>>> Regards,
>>>>> Daniel
>>>>>
>>>>> -----Original Message-----
>>>>> From: John Desantis [mailto:[email protected]]
>>>>> Sent: Wednesday, May 6, 2015 17:39
>>>>> To: slurm-dev
>>>>> Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work
>>>>> using gpu plugin (node configuration not available)
>>>>>
>>>>>
>>>>> Daniel,
>>>>>
>>>>> We don't specify types in our Gres configuration, simply the resource.
>>>>>
>>>>> What happens if you update your srun syntax to:
>>>>>
>>>>> srun -n1 --gres=gpu:tesla:1
>>>>>
>>>>> Does that dispatch the job?
>>>>>
>>>>> John DeSantis
>>>>>
>>>>> 2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
>>>>>> Hello,
>>>>>>
>>>>>> Currently I'm trying to set up Slurm on a GPU cluster with a
>>>>>> small number of nodes (where smurf0[1-7] are the node names),
>>>>>> using the gpu plugin to allocate jobs (requiring GPUs).
>>>>>>
>>>>>> Unfortunately, when trying to run a GPU job (any number of GPUs;
>>>>>> --gres=gpu:N), Slurm doesn't execute it, asserting unavailability
>>>>>> of the requested configuration.
>>>>>> I attached some logs and configuration files in order to
>>>>>> provide the information necessary to analyze this issue.
>>>>>>
>>>>>> Note: Cross-posted here: http://serverfault.com/questions/685258
>>>>>>
>>>>>> Example (using some test.sh which echoes $CUDA_VISIBLE_DEVICES):
>>>>>>
>>>>>> srun -n1 --gres=gpu:1 test.sh
>>>>>> --> srun: error: Unable to allocate resources: Requested
>>>>>>     node configuration is not available
>>>>>>
>>>>>> The slurmctld log for such calls shows:
>>>>>>
>>>>>> gres: gpu state for job X
>>>>>>   gres_cnt:1 node_cnt:1 type:(null)
>>>>>> _pick_best_nodes: job X never runnable
>>>>>> _slurm_rpc_allocate_resources: Requested node
>>>>>> configuration is not available
>>>>>>
>>>>>> Jobs with any other type of configured generic resource complete
>>>>>> successfully:
>>>>>>
>>>>>> srun -n1 --gres=gram:500 test.sh
>>>>>> --> CUDA_VISIBLE_DEVICES=NoDevFiles
>>>>>>
>>>>>> The nodes and gres configuration in slurm.conf (which is attached
>>>>>> as well) are like:
>>>>>>
>>>>>> GresTypes=gpu,ram,gram,scratch
>>>>>> ...
>>>>>> NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
>>>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
>>>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>>> NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
>>>>>> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
>>>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>>>
>>>>>> The respective gres.conf files are:
>>>>>>
>>>>>> Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>>>>>> Name=ram Count=48
>>>>>> Name=gram Count=6000
>>>>>> Name=scratch Count=1300
>>>>>>
>>>>>> The output of "scontrol show node" lists all the nodes with the
>>>>>> correct gres configuration, i.e.:
>>>>>>
>>>>>> NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>>>>>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>>>>>> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>>>>>> ...etc.
>>>>>>
>>>>>> As far as I can tell, the slurmd daemon on the nodes recognizes
>>>>>> the GPUs (and other generic resources) correctly.
>>>>>>
>>>>>> My slurmd.log on node smurf01 says:
>>>>>>
>>>>>> Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>>>>>>
>>>>>> The log for slurmctld shows:
>>>>>>
>>>>>> gres/gpu: state for smurf01
>>>>>>   gres_cnt found:8 configured:8 avail:8 alloc:0
>>>>>>   gres_bit_alloc:
>>>>>>   gres_used:(null)
>>>>>>
>>>>>> I can't figure out why the controller node states that jobs using
>>>>>> --gres=gpu:N are "never runnable" and why "the requested node
>>>>>> configuration is not available".
>>>>>> Any help is appreciated.
>>>>>>
>>>>>> Kind regards,
>>>>>> Daniel Weber
>>>>>>
>>>>>> PS: If further information is required, don't hesitate to ask.
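
Summing up the thread, a minimal sketch of a typed-GRES setup consistent with John's working Tesla-T10 example would pair the Type string in slurm.conf with the exact same string in gres.conf on every node. The node name, device paths, and counts below are illustrative placeholders, not taken from either cluster:

```conf
# slurm.conf (fragment)
GresTypes=gpu
# The <type> in Gres=gpu:<type>:<count> must reappear in gres.conf.
NodeName=gpunode01 Gres=gpu:tesla:2

# gres.conf on gpunode01 (or a global gres.conf with NodeName= prefixes)
# Type must match the slurm.conf string exactly, including case.
Name=gpu Type=tesla File=/dev/nvidia[0-1]
```

With a matching pair like this, jobs should be able to request either --gres=gpu:1 or --gres=gpu:tesla:1, and running srun --gres=help (suggested earlier in the thread) shows which GRES names the installation accepts.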
