On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:

> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>
>>> We managed to have another user hit the bug that causes collectives (this
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could someone explain this? We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed. However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>
> Sorry for the delay -- perhaps an IB vendor can reply here with more detail...
>
> We had a user-reported issue of some hangs that the IB vendors have been
> unable to replicate in their respective labs. We *suspect* that it may be an
> issue with the oob openib CPC, but that code is pretty old and pretty mature,
> so all of us would be at least somewhat surprised if that were the case. If
> anyone can reliably reproduce this error, please let us know and/or give us
> access to your machines -- we have not closed this issue, but are unable to
> move forward because the customers who reported this issue switched to rdmacm
> and moved on (i.e., we don't have access to their machines to test any more).

An update: we set all our ib0 interfaces to have IPs on a 172. network. This allowed rdmacm to work, and we now get the latencies we would expect. That said, we are still getting hangs. I can very reliably reproduce them using IMB with a specific core count on a specific test case.

Has anyone else had luck fixing the lockups that collectives hit in some cases on the openib BTL?

Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

>
> --
> Jeff Squyres
> jsquy...@cisco.com