Daniel,

We don't specify types in our Gres configuration, only the resource itself.
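
For reference, a type-less setup would look roughly like this (counts and device paths below are taken from your attachments, not from our configs, so treat them as illustrative):

    # slurm.conf: no "tesla" type on the gpu Gres
    NodeName=smurf01 ... Gres=gpu:8,ram:48,gram:no_consume:6000,scratch:1300

    # gres.conf: drop Type=tesla to match
    Name=gpu Count=8 File=/dev/nvidia[0-7]

With that in place, plain --gres=gpu:N requests should then match the node configuration.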

What happens if you update your srun syntax to:

srun -n1 --gres=gpu:tesla:1

Does that dispatch the job?

John DeSantis

2015-05-06 9:40 GMT-04:00 Daniel Weber <[email protected]>:
> Hello,
>
> Currently I'm trying to set up SLURM on a GPU cluster with a small number
> of nodes (smurf0[1-7] are the node names), using the gpu plugin to
> allocate jobs that require GPUs.
>
> Unfortunately, when I try to run a GPU job (any number of GPUs;
> --gres=gpu:N), SLURM refuses to execute it, reporting that the requested
> configuration is unavailable.
> I have attached some logs and configuration files to provide the
> information needed to analyze this issue.
>
> Note: Cross posted here: http://serverfault.com/questions/685258
>
> Example (using a test.sh that echoes $CUDA_VISIBLE_DEVICES):
>
>     srun -n1 --gres=gpu:1 test.sh
>         --> srun: error: Unable to allocate resources: Requested node
> configuration is not available
>
> The slurmctld log for such calls shows:
>
>     gres: gpu state for job X
>         gres_cnt:1 node_cnt:1 type:(null)
>         _pick_best_nodes: job X never runnable
>         _slurm_rpc_allocate_resources: Requested node configuration is not
> available
>
> Jobs with any other type of configured generic resource complete
> successfully:
>
>     srun -n1 --gres=gram:500 test.sh
>         --> CUDA_VISIBLE_DEVICES=NoDevFiles
>
> The node and Gres configuration in slurm.conf (which is attached as well)
> looks like this:
>
>     GresTypes=gpu,ram,gram,scratch
>     ...
>     NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1
> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2
> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>     NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1
> SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1
> Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>
> The respective gres.conf files contain:
>     Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
>     Name=ram Count=48
>     Name=gram Count=6000
>     Name=scratch Count=1300
>
> The output of "scontrol show node" lists all the nodes with the correct gres
> configuration i.e.:
>
>     NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
>        CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
>        Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
>        ...etc.
>
> As far as I can tell, the slurmd daemon on the nodes recognizes the gpus
> (and other generic resources) correctly.
>
> My slurmd.log on node smurf01 says
>
>     Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
>
> The log for slurmctld shows
>
>     gres/gpu: state for smurf01
>        gres_cnt found:8 configured:8 avail:8 alloc:0
>        gres_bit_alloc:
>        gres_used:(null)
>
> I can't figure out why the controller node reports that jobs using
> --gres=gpu:N are "never runnable" and that "the requested node
> configuration is not available".
> Any help is appreciated.
>
> Kind regards,
> Daniel Weber
>
> PS: If further information is required, don't hesitate to ask.
