Hi Hans,
Your log shows that your slurm.conf is out of sync on some nodes. Is that on purpose?
What happens when you synchronise slurm.conf across all nodes, then restart slurmctld and restart slurmd on every node?
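A rough sketch of one way to check this, not something taken from your setup: the helper name conf_in_sync, the node names in the example, and the /etc/slurm/slurm.conf path are all assumptions (Bright may keep the file elsewhere), and the restart lines assume systemd units named slurmctld and slurmd.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if all given files have the same checksum.
conf_in_sync() {
    # Count the distinct SHA-256 digests across all the files passed in.
    distinct=$(sha256sum "$@" | awk '{print $1}' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}

# Example use (commented out; node names and path are assumptions):
#   mkdir -p /tmp/confs
#   for n in gpu01 gpu08; do
#       scp "$n:/etc/slurm/slurm.conf" "/tmp/confs/$n.conf"
#   done
#   conf_in_sync /tmp/confs/*.conf && echo "in sync" || echo "out of sync"
#
# Once the copies match, restart the daemons:
#   systemctl restart slurmctld    # on the controller
#   systemctl restart slurmd       # on every compute node
```

Comparing checksums rather than diffing keeps the check cheap when there are many nodes; a plain diff against the controller's copy would show what actually differs.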
Best,
Robbert

On 06-02-17 15:45, Hans-Nikolai Viessmann wrote:
Hi all,

Over the weekend I tried to set up Gres device type allocation but stumbled onto a bit of a problem.

The setup I'm working with is a cluster with Bright Cluster Manager set up and managing 10 nodes in total. Eight of these nodes contain GPUs: gpu[01-07] contain one GPU each, whereas gpu08 contains two GPUs. The GPUs in gpu08 are not the same; one is a Tesla device and the other a Quadro. The other two nodes have MICs on them, but these have not been configured yet.

Some software version information:

* Bright Cluster Manager: 7.3 running on SL 7.2
* SLURM: 16.05.2

As per https://slurm.schedmd.com/gres.html, I set up my slurm.conf file to have the Gres line with the type specification. Excerpt (full slurm.conf attached):

    # Nodes
    NodeName=mic[01-02]
    NodeName=gpu08 Feature=multiple-gpus Gres=gpu:tesla:1,gpu:quadro:1
    NodeName=gpu[01-07] Gres=gpu:1
    # Generic resources types
    GresTypes=gpu,mic

and on gpu08 I added the following configuration to the gres.conf file:

    Name=gpu File=/dev/nvidia0 Type=tesla
    Name=gpu File=/dev/nvidia1 Type=quadro

/I added nothing into the controller gres.conf file./

I believe that these settings and the type information have propagated to slurmctld, as calling sinfo gives the following output:

    $ sinfo -o "%10P %.5a %.15l %.6D %.6t %25G %N"
    PARTITION  AVAIL       TIMELIMIT  NODES  STATE GRES                      NODELIST
    longq         up      7-00:00:00      1  drain (null)                    mic02
    longq         up      7-00:00:00      1  alloc gpu:1                     gpu06
    longq         up      7-00:00:00      6   idle gpu:1                     gpu[01-05,07]
    longq         up      7-00:00:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
    longq         up      7-00:00:00      1   idle (null)                    mic01
    testq*        up           15:00      1  drain (null)                    mic02
    testq*        up           15:00      1  alloc gpu:1                     gpu06
    testq*        up           15:00      6   idle gpu:1                     gpu[01-05,07]
    testq*        up           15:00      1   idle gpu:tesla:1,gpu:quadro:1  gpu08
    testq*        up           15:00      1   idle (null)                    mic01

When I try to allocate the resource using salloc, though, I get the following error message:

    $ salloc --gres=gpu:tesla:1
    salloc: error: Job submit/allocate failed: Requested node configuration is not available
    salloc: Job allocation 544 has been revoked.

Doing a normal allocation without the `--gres=<>' flag works, but when I try the following it fails as well:

    $ salloc -w gpu08 --gres=gpu:1
    salloc: error: Job submit/allocate failed: Requested node configuration is not available
    salloc: Job allocation 545 has been revoked.

Activating the DebugFlag Gres provides the following output in the slurmctld.log file:

    [2017-02-06T14:22:46.990] gres/gpu: state for gpu08
    [2017-02-06T14:22:46.990] gres_cnt found:2 configured:2 avail:2 alloc:0
    [2017-02-06T14:22:46.990] gres_bit_alloc:
    [2017-02-06T14:22:46.990] gres_used:(null)
    [2017-02-06T14:22:46.990] topo_cpus_bitmap[0]:NULL
    [2017-02-06T14:22:46.990] topo_gres_bitmap[0]:0
    [2017-02-06T14:22:46.990] topo_gres_cnt_alloc[0]:0
    [2017-02-06T14:22:46.990] topo_gres_cnt_avail[0]:1
    [2017-02-06T14:22:46.990] type[0]:tesla
    [2017-02-06T14:22:46.990] topo_cpus_bitmap[1]:NULL
    [2017-02-06T14:22:46.990] topo_gres_bitmap[1]:1
    [2017-02-06T14:22:46.990] topo_gres_cnt_alloc[1]:0
    [2017-02-06T14:22:46.990] topo_gres_cnt_avail[1]:1
    [2017-02-06T14:22:46.990] type[1]:quadro
    [2017-02-06T14:22:46.990] type_cnt_alloc[0]:0
    [2017-02-06T14:22:46.990] type_cnt_avail[0]:1
    [2017-02-06T14:22:46.990] type[0]:tesla
    [2017-02-06T14:22:46.990] type_cnt_alloc[1]:0
    [2017-02-06T14:22:46.990] type_cnt_avail[1]:1
    [2017-02-06T14:22:46.990] type[1]:quadro
    [2017-02-06T14:22:46.990] gres/mic: state for gpu08
    [2017-02-06T14:22:46.990] gres_cnt found:0 configured:0 avail:0 alloc:0
    [2017-02-06T14:22:46.990] gres_bit_alloc:NULL
    [2017-02-06T14:22:46.990] gres_used:(null)

So here I am, a bit stumped. Is there something wrong with my configuration? Am I missing or overlooking something?

Any help with this would be greatly appreciated!

Sincerely,
Hans Viessmann

P.S. Please find attached the slurm.conf file, the slurmctld.log file, and the gres.conf file.
-- 
Robbert Eggermont                         Intelligent Systems
r.eggerm...@tudelft.nl    Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234                Delft University of Technology