Re: [OMPI devel] BTL add procs errors

Sylvain Jeaugey Fri, 28 May 2010 06:04:16 -0400

On Thu, 27 May 2010, Jeff Squyres wrote:

On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:
That's pretty much my first proposal: abort when an error arises,
because if we don't, we'll crash soon afterwards. That's my original
concern and this should really be fixed.

Now, if you want to fix the openib BTL so that an error in IB results in
an elegant fallback on TCP (elegant = notified ;-)), then hooray.
You're specifically referring to the point where the openib btl sets the reachable bit, but then later decides "oops, an error occurred, so return != OMPI_SUCCESS" -- and assume that the openib btl is not called again.
Right?

Perfectly right.

If so, then yes, that's a bug. The openib btl should be fixed to unset the reachable bit(s) that it just set before returning the error.
Or the BML could assume that != OMPI_SUCCESS codes mean that the reachable bits it got back were invalid and should be ignored.
I'd lean towards the former.
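
To make the two options concrete, here is a minimal sketch in C. It is not Open MPI code: the bitmap type, the `sketch_*` names, the `SKETCH_*` return codes and the `fail_at` parameter are all hypothetical stand-ins, made up only to illustrate the idea. The first function rolls back the reachable bits it set before returning an error (the BTL-side fix); the second treats the bitmap as invalid whenever the return code is not success (the BML-side alternative).

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SUCCESS   0
#define SKETCH_ERROR    (-1)
#define SKETCH_MAX_PROCS 8

/* Hypothetical stand-in for a reachability bitmap (not the real Open MPI type). */
typedef struct {
    bool bit[SKETCH_MAX_PROCS];
} sketch_bitmap_t;

/*
 * Hypothetical BTL-side add_procs. It marks peers reachable one by one and
 * simulates a failure when it reaches peer index fail_at (pass a value
 * >= nprocs for "no failure"). On error it clears every bit that this call
 * set, so the caller never sees a peer marked reachable by a failed BTL.
 */
static int sketch_btl_add_procs(size_t nprocs, size_t fail_at,
                                sketch_bitmap_t *reachable)
{
    bool set_here[SKETCH_MAX_PROCS] = { false };

    if (nprocs > SKETCH_MAX_PROCS) {
        return SKETCH_ERROR;
    }

    for (size_t i = 0; i < nprocs; ++i) {
        if (i == fail_at) {
            /* Roll back only the bits set during this call. */
            for (size_t j = 0; j < i; ++j) {
                if (set_here[j]) {
                    reachable->bit[j] = false;
                }
            }
            return SKETCH_ERROR;
        }
        if (!reachable->bit[i]) {
            reachable->bit[i] = true;
            set_here[i] = true;
        }
    }
    return SKETCH_SUCCESS;
}

/*
 * Hypothetical caller-side alternative: if the return code is not success,
 * ignore the bitmap entirely and report zero usable peers.
 */
static size_t sketch_count_usable(int rc, size_t nprocs,
                                  const sketch_bitmap_t *reachable)
{
    size_t usable = 0;

    if (rc != SKETCH_SUCCESS || nprocs > SKETCH_MAX_PROCS) {
        return 0;
    }
    for (size_t i = 0; i < nprocs; ++i) {
        if (reachable->bit[i]) {
            ++usable;
        }
    }
    return usable;
}
```

With either approach, a failed call leaves the caller with no peers wrongly considered reachable through that BTL; the difference is only which layer takes responsibility for the cleanup.
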

Can you file a bug and submit a patch?

I'd like to (though I don't have an svn account), but some things
bother me.

Having errors on add_procs stop the application seems a good thing in all cases, so why not do it? That would solve my original problem, which led to this discussion.

Yes, the openib BTL may be suboptimal (stopping the application instead of nicely returning), but I'm fine with that, so I'm not very inclined to spend time on this.


Sylvain
