On May 3, 2011, at 6:42 AM, Dave Love wrote:

>> We managed to have another user hit the bug that causes collectives (this 
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>> 
>> btl_openib_cpc_include rdmacm
> 
> Could someone explain this?  We also have problems with collective hangs
> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
> see any relevant issues filed.  However, rdmacm isn't an available value
> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
> that I understand what these things are...).

Sorry for the delay -- perhaps an IB vendor can reply here with more detail...

We had user reports of hangs that the IB vendors have been unable to
replicate in their respective labs.  We *suspect* that the oob openib CPC may
be at fault, but that code is pretty old and pretty mature, so all of us would
be at least somewhat surprised if that were the case.  If anyone can reliably
reproduce this error, please let us know and/or give us access to your
machines.  We have not closed this issue, but we cannot move forward because
the customers who reported it switched to rdmacm and moved on (i.e., we no
longer have access to their machines for testing).
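
For anyone who wants to try the workaround quoted above, here is a minimal
sketch of setting it persistently.  It assumes your Open MPI build includes
the rdmacm CPC (as noted above, some installations offer only oob) and that
you use the per-user MCA parameter file; adjust for your site:

    # $HOME/.openmpi/mca-params.conf
    # Assumption: rdmacm is compiled into this build's openib BTL.
    btl_openib_cpc_include = rdmacm

The same parameter can be given for a single run with
"mpirun --mca btl_openib_cpc_include rdmacm ...", and
"ompi_info --param btl openib" should show which CPC values your
installation accepts.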

-- 
Jeff Squyres
jsquy...@cisco.com