Sorry for the delay on this -- it looks like the problem is caused by messages like this (from your first message):

[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port

RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID where you want to use it.

On May 5, 2011, at 1:15 PM, Brock Palen wrote:

> Yeah, we have run into more issues, with rdmacm not being available on all of
> our hosts. So it would be nice to know what we can do to test that a host
> would support rdmacm.
>
> Example:
>
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
> Local host: nyx5067.engin.umich.edu
> Local device: mlx4_0
> Local port: 1
> CPCs attempted: rdmacm
> --------------------------------------------------------------------------
>
> This is one of our QDR hosts that rdmacm generally works on, which this code
> (CRASH) requires to avoid a collective hang in MPI_Allreduce().
>
> I looked on this host and I find:
> [root@nyx5067 ~]# rpm -qa | grep rdma
> librdmacm-1.0.11-1
> librdmacm-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-utils-1.0.11-1
>
> So all the libraries are installed (I think). Is there a way to verify this?
> Or to have OpenMPI be more verbose about what caused rdmacm to fail as an oob
> option?
>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>
>
> On May 3, 2011, at 9:42 AM, Dave Love wrote:
>
>> Brock Palen <bro...@umich.edu> writes:
>>
>>> We managed to have another user hit the bug that causes collectives (this
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could someone explain this? We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed. However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
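
P.S. One rough way to test a host for the IPoIB requirement described above is to look for InfiniBand network interfaces that have no IPv4 address. The following is only a sketch under stated assumptions, not something Open MPI provides: it assumes a Linux host with the iproute2 `ip` command, it assumes IPoIB interfaces are named `ib0`, `ib1`, and so on, and the function names `ipoib_without_ipv4` and `check_local_host` are made up for illustration. It does not map interfaces to HCA ports or LIDs, so an empty result is a hint, not proof, that rdmacm will work.

```python
import os
import re
import subprocess


def ipoib_without_ipv4(ip_output, interfaces):
    """Return the names in `interfaces` that have no IPv4 address.

    `ip_output` is the text printed by `ip -o -4 addr show`, which emits
    one line per address: "<index>: <ifname>    inet <addr>/<prefix> ...".
    """
    with_addr = set()
    for line in ip_output.splitlines():
        m = re.match(r"\s*\d+:\s+(\S+)\s+inet\s+(\S+)", line)
        if m:
            with_addr.add(m.group(1))
    return [name for name in interfaces if name not in with_addr]


def check_local_host():
    """List local ib* interfaces that lack an IPv4 address.

    Assumption: IPoIB interfaces appear under /sys/class/net with names
    beginning with "ib". Adjust the prefix if your site names them differently.
    """
    names = sorted(n for n in os.listdir("/sys/class/net") if n.startswith("ib"))
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return ipoib_without_ipv4(result.stdout, names)
```

Calling `check_local_host()` on a node such as nyx5067 and getting back a non-empty list would point at interfaces where IPoIB has no address configured, which is the condition the "rdmacm IP address not found on port" message complains about.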