We recently updated and rebooted our InfiniBand-attached nodes, and now
when launching MPI jobs with Slurm we see the following:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

   Local host:           node-x
   Local device:         mlx5_0
   Local port:           1
   CPCs attempted:       udcm
--------------------------------------------------------------------------

This worked before the reboots, and the InfiniBand network itself is
fine. However, if we invoke the same job directly with mpirun on the
same nodes, we do not get the message (i.e. the openib BTL works).
Some IB-related packages were updated along the way (e.g. the rdma
metapackage for CentOS 6.7).
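One difference between the two launch paths that I have not ruled out is
resource limits: as I understand it, udcm has to register memory, so it
needs an adequate locked-memory (memlock) limit, and slurmd can hand
tasks a lower limit than an interactive shell gets. A quick comparison
(the node name here is just an example):

```shell
# Compare the memlock limit in a local shell vs. inside a Slurm task.
# A small "slurm" value alongside "unlimited" locally would point at
# limit propagation rather than the IB stack itself.
echo "local memlock limit: $(ulimit -l)"
if command -v srun >/dev/null 2>&1; then
    srun -N1 -w node-x bash -c 'echo "slurm memlock limit: $(ulimit -l)"'
fi
```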

What I'm hoping for is some guidance on what components are involved
here and what could cause Slurm-launched jobs to be unable to use the
openib BTL (a post to the Slurm list went nowhere). I could find very
little documentation on what CPCs are, what udcm is, or how to
troubleshoot them.

We are using Open MPI 1.10.2, built with Slurm and PMI support.
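For reference, this is how I have been trying to get more detail out of
Open MPI about the CPC selection (using the btl_base_verbose MCA
parameter; the program name is a placeholder):

```shell
# Raise BTL verbosity so Open MPI logs why each CPC (connect
# pseudo-component, e.g. udcm) accepts or rejects the port. The
# guard keeps this runnable on hosts without Open MPI installed.
if command -v mpirun >/dev/null 2>&1; then
    mpirun --mca btl openib,self,sm \
           --mca btl_base_verbose 100 \
           -np 2 ./a.out
else
    echo "mpirun not installed on this host"
fi
```

The same parameter can be set for Slurm-launched jobs via the
OMPI_MCA_btl_base_verbose environment variable, which should show
whether the CPC failure differs between the two launch paths.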

-- 
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
