Yeah, we have run into more issues: rdmacm is not available on all of our 
hosts.  So it would be nice to know what we can do to test whether a host 
supports rdmacm (a rough host-level check is sketched below, after the example).

Example:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           nyx5067.engin.umich.edu
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       rdmacm
--------------------------------------------------------------------------

This is one of our QDR hosts that rdmacm generally works on, and this code 
(CRASH) requires rdmacm to avoid a collective hang in MPI_Allreduce().
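
As a rough host-level check outside of Open MPI (this is my own guess at what 
to look for, assuming a standard librdmacm/OFED setup), I would try something 
like:

# is the RDMA CM kernel support loaded, and is the userspace device node there?
lsmod | egrep 'rdma_cm|rdma_ucm'
ls -l /dev/infiniband/rdma_cm

# loopback connection test with the librdmacm-utils tools:
# start a listener in one shell, then point a client at the host's IPoIB address
ucmatose
ucmatose -s 10.x.x.x    # 10.x.x.x = this host's IPoIB address (placeholder)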

I looked on this host and found:
[root@nyx5067 ~]# rpm -qa | grep rdma
librdmacm-1.0.11-1
librdmacm-1.0.11-1
librdmacm-devel-1.0.11-1
librdmacm-devel-1.0.11-1
librdmacm-utils-1.0.11-1

So all the libraries are installed (I think).  Is there a way to verify this?  
Or to have Open MPI be more verbose about what caused rdmacm to fail as a CPC 
option?
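
For reference, this is the sort of thing I was going to try next for more 
detail; I'm assuming the usual MCA verbosity knobs here (btl_base_verbose and 
ompi_info's --param output), and ./a.out is just a stand-in for the real job:

# list the openib BTL parameters this build knows about, including the CPC ones
ompi_info --param btl openib | grep cpc

# rerun with extra BTL verbosity to see why a CPC was rejected on a port
mpirun --mca btl_openib_cpc_include rdmacm \
       --mca btl_base_verbose 50 \
       -np 2 ./a.out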


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 3, 2011, at 9:42 AM, Dave Love wrote:

> Brock Palen <bro...@umich.edu> writes:
> 
>> We managed to have another user hit the bug that causes collectives (this 
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>> 
>> btl_openib_cpc_include rdmacm
> 
> Could someone explain this?  We also have problems with collective hangs
> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
> see any relevant issues filed.  However, rdmacm isn't an available value
> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
> that I understand what these things are...).
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 

