Yeah, we have run into more issues, with rdmacm not being available on all of our hosts. So it would be nice to know how we can test whether a host supports rdmacm.
Example:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:       nyx5067.engin.umich.edu
  Local device:     mlx4_0
  Local port:       1
  CPCs attempted:   rdmacm
--------------------------------------------------------------------------

This is one of our QDR hosts that rdmacm generally works on, and this code (CRASH) requires rdmacm to avoid a collective hang in MPI_Allreduce().

I look on this host and I find:

[root@nyx5067 ~]# rpm -qa | grep rdma
librdmacm-1.0.11-1
librdmacm-1.0.11-1
librdmacm-devel-1.0.11-1
librdmacm-devel-1.0.11-1
librdmacm-utils-1.0.11-1

So all the libraries are installed (I think). Is there a way to verify this? Or to have Open MPI be more verbose about what caused rdmacm to fail as a CPC option?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On May 3, 2011, at 9:42 AM, Dave Love wrote:

> Brock Palen <bro...@umich.edu> writes:
>
>> We managed to have another user hit the bug that causes collectives (this
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>
>> btl_openib_cpc_include rdmacm
>
> Could someone explain this?  We also have problems with collective hangs
> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
> see any relevant issues filed.  However, rdmacm isn't an available value
> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
> that I understand what these things are...).
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
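P.S. A sketch of checks that might answer the verification question above. This is only a sketch: it assumes librdmacm-utils and a standard Open MPI 1.4/1.5 install are present, and the IPoIB address placeholder is something you would fill in per host.

```shell
#!/bin/sh
# Sketch: checks one might run to see whether a host can actually use rdmacm.
# Assumes librdmacm-utils and Open MPI are installed; adjust names per site.

# Is librdmacm resolvable by the dynamic loader (not just installed as an RPM)?
ldconfig -p 2>/dev/null | grep -q librdmacm \
    && echo "librdmacm: in loader cache" \
    || echo "librdmacm: NOT in loader cache"

# Are the librdmacm-utils test programs on PATH?
for t in rping ucmatose udaddy; do
    command -v "$t" >/dev/null 2>&1 && echo "$t: found" || echo "$t: missing"
done

# An end-to-end loopback test of the RDMA CM (run by hand, not here):
#   rping -s -a <ipoib_addr_of_this_host> -C 1 &   # server side
#   rping -c -a <ipoib_addr_of_this_host> -C 1 -v  # client, same host
#
# And to make Open MPI say at run time why a CPC was rejected:
#   mpirun --mca btl_openib_cpc_include rdmacm \
#          --mca btl_base_verbose 100 ./your_app
```

If the rping loopback completes, the RDMA CM stack works on that host independently of Open MPI, which narrows the failure down to the MPI layer.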