Sorry for the delay on this -- it looks like the problem is indicated by 
messages like this (from your first message):

[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port

RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID 
where you want to use it.
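
As a rough way to check a host for this, here is a minimal Python sketch that lists InfiniBand network interfaces with no IPv4 address. It is not part of Open MPI. It assumes a Linux host, that IPoIB interfaces report link type 32 (ARPHRD_INFINIBAND) under /sys/class/net, and that the iproute2 `ip` command is installed. The function names are made up for illustration, and the sketch does not map an interface back to a specific HCA port/LID, so treat the result as a hint only.

```python
import os
import re
import subprocess

ARPHRD_INFINIBAND = "32"  # Linux link-layer type reported for IPoIB interfaces


def infiniband_interfaces(sysfs_root="/sys/class/net"):
    """Return names of interfaces whose sysfs link type is InfiniBand."""
    names = []
    for name in sorted(os.listdir(sysfs_root)):
        try:
            with open(os.path.join(sysfs_root, name, "type")) as f:
                if f.read().strip() == ARPHRD_INFINIBAND:
                    names.append(name)
        except OSError:
            # Not an interface directory, or it has no readable "type" file.
            continue
    return names


def ipv4_by_interface(ip_output):
    """Parse `ip -o -4 addr show` output into {interface: [address/prefix]}."""
    addrs = {}
    for line in ip_output.splitlines():
        m = re.match(r"\d+:\s+(\S+)\s+inet\s+(\S+)", line)
        if m:
            addrs.setdefault(m.group(1), []).append(m.group(2))
    return addrs


def ipoib_without_address(ib_interfaces, ip_output):
    """Return the InfiniBand interfaces that have no IPv4 address assigned."""
    addrs = ipv4_by_interface(ip_output)
    return [name for name in ib_interfaces if name not in addrs]


if __name__ == "__main__":
    output = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    missing = ipoib_without_address(infiniband_interfaces(), output)
    if missing:
        print("IPoIB interfaces without an IPv4 address:", ", ".join(missing))
    else:
        print("Every InfiniBand interface found has an IPv4 address.")
```

Running this on each node (for example, through your batch system) should point out hosts where an IPoIB interface exists but has no address configured. A host with no IPoIB interface at all will show up as "nothing found", so that case needs a separate look.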


On May 5, 2011, at 1:15 PM, Brock Palen wrote:

> Yeah, we have run into more issues, with rdmacm not being available on all of 
> our hosts.  So it would be nice to know what we can do to test whether a host 
> supports rdmacm.
> 
> Example:
> 
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:           nyx5067.engin.umich.edu
>  Local device:         mlx4_0
>  Local port:           1
>  CPCs attempted:       rdmacm
> --------------------------------------------------------------------------
> 
> This is one of our QDR hosts that rdmacm generally works on, and this code 
> (CRASH) requires rdmacm to avoid a collective hang in MPI_Allreduce().
> 
> I looked on this host and found:
> [root@nyx5067 ~]# rpm -qa | grep rdma
> librdmacm-1.0.11-1
> librdmacm-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-utils-1.0.11-1
> 
> So all the libraries are installed (I think).  Is there a way to verify this?  
> Or to have OpenMPI be more verbose about what caused rdmacm to fail as an oob 
> option?
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 3, 2011, at 9:42 AM, Dave Love wrote:
> 
>> Brock Palen <bro...@umich.edu> writes:
>> 
>>> We managed to have another user hit the bug that causes collectives (this 
>>> time MPI_Bcast()) to hang on IB, which was fixed by setting:
>>> 
>>> btl_openib_cpc_include rdmacm
>> 
>> Could someone explain this?  We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed.  However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

