On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:

> That's pretty much my first proposition : abort when an error arises,
> because if we don't, we'll crash soon afterwards. That's my original
> concern and this should really be fixed.
> 
> Now, if you want to fix the openib BTL so that an error in IB results in
> an elegant fallback on TCP (elegant = notified ;-)), then hooray.

You're specifically referring to the point where the openib btl sets the 
reachable bit, but then later decides "oops, an error occurred, so return 
!=OMPI_SUCCESS" -- and assumes that it will not be called again.

Right?

If so, then yes, that's a bug.  The openib btl should be fixed to unset the 
reachable bit(s) that it just set before returning the error.

Or the BML could assume that a !=OMPI_SUCCESS return code means that the 
reachable bits it got back are invalid and should be ignored.

I'd lean towards the former.
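
To make the former concrete, here is a rough sketch of the rollback idea in 
plain C. It is not the openib btl code: sketch_add_procs, the SKETCH_* 
constants, the bool array standing in for the reachable bitmap, and the 
fail_at parameter that simulates an error are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not Open MPI's real constants or types. */
#define SKETCH_SUCCESS 0
#define SKETCH_ERROR (-1)
#define SKETCH_MAX_PEERS 64

/*
 * Mark peers as reachable one by one. fail_at is the index of the peer whose
 * setup fails (simulated here), or -1 if nothing fails. On failure, every bit
 * that this call set is cleared again before the error is returned.
 */
int sketch_add_procs(size_t nprocs, int fail_at, bool *reachable)
{
    /* Remembers which bits this call flipped, so only those are undone. */
    bool set_here[SKETCH_MAX_PEERS] = { false };

    assert(reachable != NULL);
    if (nprocs > SKETCH_MAX_PEERS) {
        return SKETCH_ERROR;  /* nothing has been modified yet */
    }

    for (size_t i = 0; i < nprocs; ++i) {
        if (fail_at >= 0 && (size_t)fail_at == i) {
            /* Error path: unset the bits we just set, then report failure. */
            for (size_t j = 0; j < i; ++j) {
                if (set_here[j]) {
                    reachable[j] = false;
                }
            }
            return SKETCH_ERROR;
        }
        if (!reachable[i]) {
            reachable[i] = true;
            set_here[i] = true;
        }
    }
    return SKETCH_SUCCESS;
}
```

With this shape, a caller that sees an error return never also sees 
reachable bits left over from the failed call, so it has no reason to route 
traffic through a component that has already given up.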

Can you file a bug and submit a patch?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/