Hi Hans,

Your log shows that your slurm.conf is out of sync on some nodes. Is that on purpose?

What happens when you synchronise the slurm.conf on all nodes, restart slurmctld and restart slurmd on all nodes?
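
One rough way to check for drift before restarting anything is to compare checksums of each node's copy of slurm.conf. The sketch below assumes the files have already been collected from the nodes by some other means (scp, for example); the function names are illustrative and not part of Slurm.

```python
import hashlib
from collections import defaultdict


def group_by_checksum(configs):
    """Group node names by the SHA-256 digest of their slurm.conf contents.

    `configs` maps a node name to the raw bytes of that node's slurm.conf
    (how the bytes are collected is left to the caller).
    """
    groups = defaultdict(list)
    for node, data in configs.items():
        groups[hashlib.sha256(data).hexdigest()].append(node)
    return {digest: sorted(nodes) for digest, nodes in groups.items()}


def in_sync(configs):
    """Return True when every node has byte-identical slurm.conf contents."""
    return len(group_by_checksum(configs)) <= 1


if __name__ == "__main__":
    # Made-up contents for illustration only.
    example = {
        "gpu01": b"NodeName=gpu[01-07] Gres=gpu:1\n",
        "gpu08": b"NodeName=gpu[01-07]\n",
    }
    for digest, nodes in group_by_checksum(example).items():
        print(digest[:12], ",".join(nodes))
    print("in sync:", in_sync(example))
```

More than one group in the output means at least one node is running with a different slurm.conf.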

Best,

Robbert

On 06-02-17 15:45, Hans-Nikolai Viessmann wrote:
Hi all,

Over the weekend I tried to set up Gres device type allocation but stumbled
onto a bit of a problem.

The setup I'm working with is a cluster of 10 nodes in total, set up and
managed with Bright Cluster Manager. Eight of these nodes contain GPUs:
gpu[01-07] contain one GPU each, whereas gpu08 contains two GPUs. The GPUs
in gpu08 are not the same; one is a Tesla device and the other a Quadro.
The other two nodes have MICs in them, but these have not been configured
yet.

Some software version information:

  * Bright Cluster Manager: 7.3 running on SL 7.2
  * SLURM: 16.05.2

As per https://slurm.schedmd.com/gres.html, I set up my slurm.conf file to
have the Gres line with the type specification. Excerpt (full slurm.conf
attached):

# Nodes
NodeName=mic[01-02]
NodeName=gpu08  Feature=multiple-gpus Gres=gpu:tesla:1,gpu:quadro:1
NodeName=gpu[01-07]  Gres=gpu:1
# Generic resources types
GresTypes=gpu,mic

On gpu08 I added the following configuration to the gres.conf file:

Name=gpu File=/dev/nvidia0 Type=tesla
Name=gpu File=/dev/nvidia1 Type=quadro

I added nothing to the controller's gres.conf file.
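
As a sanity check on the two files, the per-type counts declared in the slurm.conf Gres= value can be compared with the devices listed in the node's gres.conf. The following is only a sketch: it assumes one device per gres.conf line (no bracketed File= ranges and no Count= option) and plain integer counts, and the helper names are made up for illustration.

```python
from collections import Counter


def parse_node_gres(gres):
    """Parse a slurm.conf Gres= value such as 'gpu:tesla:1,gpu:quadro:1'.

    Returns a Counter keyed by (name, type); type is None when untyped.
    Assumes plain integer counts (no K/M suffixes).
    """
    counts = Counter()
    for item in gres.split(","):
        item = item.strip()
        if not item:
            continue
        fields = item.split(":")
        # A trailing numeric field is the count; otherwise the count is 1.
        count = int(fields.pop()) if fields[-1].isdigit() else 1
        name = fields[0]
        gres_type = fields[1] if len(fields) > 1 else None
        counts[(name, gres_type)] += count
    return counts


def parse_gres_conf(text):
    """Count devices per (name, type) in gres.conf text.

    Assumes each non-comment line describes exactly one device.
    """
    counts = Counter()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        opts = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                opts[key.lower()] = value
        if "name" in opts:
            counts[(opts["name"], opts.get("type"))] += 1
    return counts


if __name__ == "__main__":
    node = parse_node_gres("gpu:tesla:1,gpu:quadro:1")
    conf = parse_gres_conf(
        "Name=gpu File=/dev/nvidia0 Type=tesla\n"
        "Name=gpu File=/dev/nvidia1 Type=quadro\n"
    )
    print("match" if node == conf else "mismatch", dict(node), dict(conf))
```

For the gpu08 lines quoted above, both sides come out as one tesla and one quadro GPU, so the mismatch is unlikely to be in these two fragments themselves.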

I believe that this type information has propagated to slurmctld, as
calling sinfo gives the following output:

$ sinfo -o "%10P %.5a %.15l %.6D %.6t %25G %N"
PARTITION  AVAIL       TIMELIMIT  NODES  STATE GRES                      NODELIST
longq         up      7-00:00:00      1  drain (null)                    mic02
longq         up      7-00:00:00      1  alloc gpu:1                     gpu06
longq         up      7-00:00:00      6   idle gpu:1                     gpu[01-05,07]
longq         up      7-00:00:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
longq         up      7-00:00:00      1   idle (null)                    mic01
testq*        up           15:00      1  drain (null)                    mic02
testq*        up           15:00      1  alloc gpu:1                     gpu06
testq*        up           15:00      6   idle gpu:1                     gpu[01-05,07]
testq*        up           15:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
testq*        up           15:00      1   idle (null)                    mic01

When I try to allocate the resource using salloc, however, I get the
following error message:

$ salloc --gres=gpu:tesla:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 544 has been revoked.

A normal allocation without the `--gres=<>' flag works, but the following
fails as well:

$ salloc -w gpu08 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 545 has been revoked.

Enabling DebugFlags=Gres produces the following output in the
slurmctld.log file:

[2017-02-06T14:22:46.990] gres/gpu: state for gpu08
[2017-02-06T14:22:46.990]   gres_cnt found:2 configured:2 avail:2 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:
[2017-02-06T14:22:46.990]   gres_used:(null)
[2017-02-06T14:22:46.990]   topo_cpus_bitmap[0]:NULL
[2017-02-06T14:22:46.990]   topo_gres_bitmap[0]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[0]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_avail[0]:1
[2017-02-06T14:22:46.990]   type[0]:tesla
[2017-02-06T14:22:46.990]   topo_cpus_bitmap[1]:NULL
[2017-02-06T14:22:46.990]   topo_gres_bitmap[1]:1
[2017-02-06T14:22:46.990]   topo_gres_cnt_alloc[1]:0
[2017-02-06T14:22:46.990]   topo_gres_cnt_avail[1]:1
[2017-02-06T14:22:46.990]   type[1]:quadro
[2017-02-06T14:22:46.990]   type_cnt_alloc[0]:0
[2017-02-06T14:22:46.990]   type_cnt_avail[0]:1
[2017-02-06T14:22:46.990]   type[0]:tesla
[2017-02-06T14:22:46.990]   type_cnt_alloc[1]:0
[2017-02-06T14:22:46.990]   type_cnt_avail[1]:1
[2017-02-06T14:22:46.990]   type[1]:quadro
[2017-02-06T14:22:46.990] gres/mic: state for gpu08
[2017-02-06T14:22:46.990]   gres_cnt found:0 configured:0 avail:0 alloc:0
[2017-02-06T14:22:46.990]   gres_bit_alloc:NULL
[2017-02-06T14:22:46.990]   gres_used:(null)

So here I am, a bit stumped.

Having said that, is there something wrong with my configuration? Am I
missing anything, or have I overlooked something?

Any help with this would be greatly appreciated!

Sincerely,
Hans Viessmann

P.S. Please find attached the slurm.conf file, the slurmctld.log file, and
the gres.conf file.



--
Robbert Eggermont                                  Intelligent Systems
r.eggerm...@tudelft.nl         Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234                         Delft University of Technology
