The same "problem" occurs when using the gres type in the srun syntax (i.e. 
using --gres=gpu:tesla:1).

Regards,
Daniel

--
From: John Desantis [mailto:[email protected]] 
Sent: Wednesday, May 6, 2015 5:39 PM
To: slurm-dev
Subject: [slurm-dev] Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)


Daniel,

We don't specify types in our Gres configuration, simply the resource.

What happens if you update your srun syntax to:

srun -n1 --gres=gpu:tesla:1

Does that dispatch the job?
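If dropping the type turns out to matter, a fully typeless setup would omit Type on both sides. A sketch only (the counts and device paths below are copied from your mail, but treat the exact lines as illustrative rather than a tested configuration), in slurm.conf:

    NodeName=smurf01 ... Gres=gpu:8,ram:48,gram:no_consume:6000,scratch:1300

paired with a gres.conf entry that likewise has no Type:

    Name=gpu Count=8 File=/dev/nvidia[0-7]

The point is just that slurm.conf and gres.conf agree on whether the gpu resource carries a type.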

John DeSantis

2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
> Hello,
>
> currently I'm trying to set up SLURM on a gpu cluster with a small 
> number of nodes (where smurf0[1-7] are the node names) using the gpu 
> plugin to allocate jobs (requiring gpus).
>
> Unfortunately, when trying to run a gpu-job (any number of gpus; 
> --gres=gpu:N), SLURM doesn't execute it, asserting unavailability of 
> the requested configuration.
> I attached some logs and configuration text files in order to provide 
> any information necessary to analyze this issue.
>
> Note: Cross posted here: http://serverfault.com/questions/685258
>
> Example (using some test.sh which is echoing $CUDA_VISIBLE_DEVICES):
>
>     srun -n1 --gres=gpu:1 test.sh
>         --> srun: error: Unable to allocate resources: Requested node configuration is not available
>
> The slurmctld log for such calls shows:
>
>     gres: gpu state for job X
>         gres_cnt:1 node_cnt:1 type:(null)
>         _pick_best_nodes: job X never runnable
>         _slurm_rpc_allocate_resources: Requested node configuration is not available
>
> Jobs with any other type of configured generic resource complete
> successfully:
>
>     srun -n1 --gres=gram:500 test.sh
>         --> CUDA_VISIBLE_DEVICES=NoDevFiles
>
> The nodes and gres configuration in slurm.conf (which is attached as 
> well) are like:
>
>     GresTypes=gpu,ram,gram,scratch
>     ...
>     NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>     NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>
> The respective gres.conf files are:
>     Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>     Name=ram Count=48
>     Name=gram Count=6000
>     Name=scratch Count=1300
>
> The output of "scontrol show node" lists all the nodes with the 
> correct gres configuration i.e.:
>
>     NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>        CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>        Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>        ...etc.
>
> As far as I can tell, the slurmd daemon on the nodes recognizes the 
> gpus (and other generic resources) correctly.
>
> My slurmd.log on node smurf01 says
>
>     Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>
> The log for slurmctld shows
>
>     gres/gpu: state for smurf01
>        gres_cnt found:8 configured:8 avail:8 alloc:0
>        gres_bit_alloc:
>        gres_used:(null)
>
> I can't figure out why the controller node states that jobs using 
> --gres=gpu:N are "never runnable" and why "the requested node 
> configuration is not available".
> Any help is appreciated.
>
> Kind regards,
> Daniel Weber
>
> PS: If further information is required, don't hesitate to ask.
