On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:

> On May 3, 2011, at 6:42 AM, Dave Love wrote:
> 
>>> We managed to have another user hit the bug that causes collectives (this 
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>> 
>>> btl_openib_cpc_include rdmacm
>> 
>> Could someone explain this?  We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed.  However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
> 
> Sorry for the delay -- perhaps an IB vendor can reply here with more detail...
> 
> We had a user-reported issue of some hangs that the IB vendors have been 
> unable to replicate in their respective labs.  We *suspect* that it may be an 
> issue with the oob openib CPC, but that code is pretty old and pretty mature, 
> so all of us would be at least somewhat surprised if that were the case.  If 
> anyone can reliably reproduce this error, please let us know and/or give us 
> access to your machines -- we have not closed this issue, but are unable to 
> move forward because the customers who reported this issue switched to rdmacm 
> and moved on (i.e., we don't have access to their machines to test any more).

An update: we gave all of our ib0 interfaces IP addresses on a 172. network.
That allowed rdmacm to work, and we now get the latencies we would expect.
That said, we are still getting hangs.  I can very reliably reproduce them
using IMB with a specific core count on a specific test case.

Has anyone else had any luck fixing the lockups that collectives sometimes
hit on the openib BTL?  Thanks!
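
For anyone who wants to try reproducing this, here is a minimal sketch of how
such a run could be assembled.  It only builds the command line; it does not
launch anything.  The helper name build_mpirun_command, the process count, and
the "Bcast" benchmark name are placeholders for illustration, not the exact
core count or test case that hangs here.  The sketch also assumes that mpirun
and IMB-MPI1 are on the PATH and that the rdmacm CPC is available in the
installed Open MPI build.

```python
import shlex


def build_mpirun_command(num_procs, benchmark, cpc="rdmacm",
                         imb_binary="IMB-MPI1"):
    """Return an argv list for one IMB benchmark run with a chosen openib CPC.

    num_procs and benchmark are caller-supplied placeholders; the values that
    actually trigger the hang are not given in this thread.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    return [
        "mpirun",
        "-np", str(num_procs),
        # Select the connection manager (CPC) used by the openib BTL.
        "--mca", "btl_openib_cpc_include", cpc,
        imb_binary, benchmark,
        # Ask IMB to run only at the full process count.
        "-npmin", str(num_procs),
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    argv = build_mpirun_command(num_procs=4, benchmark="Bcast")
    print(shlex.join(argv))
```

Passing cpc="oob" instead builds the same run with the other CPC, which may
help when comparing the two.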


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 

