On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:

> That's pretty much my first proposition : abort when an error arises,
> because if we don't, we'll crash soon afterwards. That's my original
> concern and this should really be fixed.
>
> Now, if you want to fix the openib BTL so that an error in IB results in
> an elegant fallback on TCP (elegant = notified ;-)), then hooray.

You're specifically referring to the point where the openib btl sets the reachable bit, but then later decides "oops, an error occurred, so return !=OMPI_SUCCESS" -- and assume that the openib btl is not called again. Right?

If so, then yes, that's a bug. The openib btl should be fixed to unset the reachable bit(s) that it just set before returning the error. Or the BML could assume that a !=OMPI_SUCCESS return code means that the reachable bits it got back are invalid and should be ignored.

I'd lean towards the former. Can you file a bug and submit a patch?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
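
P.S. A rough, self-contained sketch of the first option (clear the reachable bits you just set before returning an error). This is not the actual openib BTL code: the bitmap type, the `sketch_*` names, the return codes, and the `fail_at` parameter used to simulate a failure are all hypothetical and exist only to illustrate the rollback. It also assumes at most 64 peers so that a single word can act as the bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SUCCESS 0
#define SKETCH_ERROR   (-1)
#define SKETCH_MAX_PEERS 64

/* Toy stand-in for a reachability bitmap (hypothetical, not an Open MPI type). */
typedef struct { uint64_t bits; } sketch_bitmap_t;

static void sketch_set_bit(sketch_bitmap_t *bm, size_t i)
{
    bm->bits |= (UINT64_C(1) << i);
}

static void sketch_clear_bit(sketch_bitmap_t *bm, size_t i)
{
    bm->bits &= ~(UINT64_C(1) << i);
}

static int sketch_is_set(const sketch_bitmap_t *bm, size_t i)
{
    return (int)((bm->bits >> i) & UINT64_C(1));
}

/* Hypothetical per-peer setup step; fails for the peer index fail_at. */
static int sketch_setup_peer(size_t peer, size_t fail_at)
{
    return (peer == fail_at) ? SKETCH_ERROR : SKETCH_SUCCESS;
}

/*
 * Mark peers as reachable one by one. If any peer fails, clear every bit
 * that this call set before returning the error, so the caller never sees
 * "reachable" bits paired with a failure code.
 */
int sketch_add_procs(size_t nprocs, size_t fail_at, sketch_bitmap_t *reachable)
{
    sketch_bitmap_t set_here = { 0 };   /* bits set by this call only */

    assert(reachable != NULL);
    if (nprocs > SKETCH_MAX_PEERS) {
        return SKETCH_ERROR;
    }

    for (size_t i = 0; i < nprocs; i++) {
        if (sketch_setup_peer(i, fail_at) != SKETCH_SUCCESS) {
            /* Roll back: undo only what this call changed. */
            for (size_t j = 0; j < i; j++) {
                if (sketch_is_set(&set_here, j)) {
                    sketch_clear_bit(reachable, j);
                }
            }
            return SKETCH_ERROR;
        }
        if (!sketch_is_set(reachable, i)) {
            sketch_set_bit(reachable, i);
            sketch_set_bit(&set_here, i);
        }
    }
    return SKETCH_SUCCESS;
}
```

Tracking the bits set by the current call separately means bits that were already set on entry are left alone, which seems like the safer behavior if the same bitmap is ever shared.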