On Thu, 27 May 2010, Jeff Squyres wrote:
On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:
That's pretty much my first proposition: abort when an error arises,
because if we don't, we'll crash soon afterwards. That's my original
concern and this should really be fixed.
Now, if you want to fix the openib BTL so that an error in IB results in
an elegant fallback on TCP (elegant = notified ;-)), then hooray.
You're specifically referring to the point where the openib btl sets the
reachable bit, but then later decides "oops, an error occurred, so
return !=OMPI_SUCCESS" -- and assume that the openib btl is not called
again.
Right?
Perfectly right.
If so, then yes, that's a bug. The openib btl should be fixed to unset
the reachable bit(s) that it just set before returning the error.
Or the BML could assume that !=OMPI_SUCCESS codes mean that the
reachable bits it got back were invalid and should be ignored.
I'd lean towards the former.
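
For illustration only, the two options above might look roughly like the
sketch below. Every name in it (the sketch_* functions, the
sketch_reachable_t type, the fail_at parameter) is hypothetical and is not
the actual openib BTL or BML code; the real add_procs interface and
bitmap type differ. The first function has the BTL clear only the bits
it set during the failing call; the last one has the caller hand the BTL
a scratch copy and discard it on any non-success return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real Open MPI definitions. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };
#define SKETCH_MAX_PROCS 64

/* One flag per peer, standing in for the reachability bitmap. */
typedef struct {
    bool bit[SKETCH_MAX_PROCS];
} sketch_reachable_t;

/* Hypothetical per-peer setup: fails for peer index fail_at (negative = never). */
int sketch_setup_peer(size_t peer, int fail_at)
{
    return (fail_at >= 0 && peer == (size_t)fail_at) ? SKETCH_ERROR
                                                     : SKETCH_SUCCESS;
}

/* Option 1: the BTL unsets the bits it just set before returning an error. */
int sketch_btl_add_procs(size_t nprocs, sketch_reachable_t *reachable,
                         int fail_at)
{
    bool set_here[SKETCH_MAX_PROCS] = { false };

    assert(reachable != NULL);
    if (nprocs > SKETCH_MAX_PROCS) {
        return SKETCH_ERROR;
    }
    for (size_t i = 0; i < nprocs; i++) {
        if (sketch_setup_peer(i, fail_at) != SKETCH_SUCCESS) {
            /* Roll back only what this call set; leave older bits alone. */
            for (size_t j = 0; j < i; j++) {
                if (set_here[j]) {
                    reachable->bit[j] = false;
                }
            }
            return SKETCH_ERROR;
        }
        if (!reachable->bit[i]) {
            reachable->bit[i] = true;
            set_here[i] = true;
        }
    }
    return SKETCH_SUCCESS;
}

/* A BTL with the bug under discussion: bits stay set on the error path. */
int sketch_leaky_btl_add_procs(size_t nprocs, sketch_reachable_t *reachable,
                               int fail_at)
{
    assert(reachable != NULL);
    if (nprocs > SKETCH_MAX_PROCS) {
        return SKETCH_ERROR;
    }
    for (size_t i = 0; i < nprocs; i++) {
        if (sketch_setup_peer(i, fail_at) != SKETCH_SUCCESS) {
            return SKETCH_ERROR;   /* no cleanup */
        }
        reachable->bit[i] = true;
    }
    return SKETCH_SUCCESS;
}

/* Option 2: the caller ignores any bits returned alongside an error. */
int sketch_bml_add_procs(size_t nprocs, sketch_reachable_t *reachable,
                         int fail_at)
{
    sketch_reachable_t scratch;
    int rc;

    assert(reachable != NULL);
    scratch = *reachable;          /* the BTL only ever sees a copy */
    rc = sketch_leaky_btl_add_procs(nprocs, &scratch, fail_at);
    if (rc == SKETCH_SUCCESS) {
        *reachable = scratch;      /* commit only on success */
    }
    return rc;
}
```

Either way, the caller's view of the bitmap after a failed call is the same
as before the call, so a BTL that reported an error is not selected for
those peers later on.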
Can you file a bug and submit a patch?
I'd like to (though I don't have an svn account), but some things
bother me.
Having errors on add_procs stop the application seems a good thing in all
cases, so why not do it? That would solve my original problem, which led
to this discussion.
Yes, the openib BTL may be suboptimal (stopping the application instead of
nicely returning), but I'm fine with that, so I'm not very inclined to
spend time on this.
Sylvain