On Fri, 28 May 2010, Jeff Squyres wrote:

> Herein lies the quandary: we don't/can't know the user or sysadmin intent. They may not care if the IB is borked -- they might just want the job to fall over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to abort (because it would be too slow over TCP).

There is no intent or choice involved: Open MPI today always crashes on such an error. The problem is that we crash at the wrong place, which is why I'd like it to stop at the real error rather than try to continue and bury the real problem under a ton of error traces.

Frankly, I don't know how to be clearer. The discussion started on a bug and you moved it to a nice-feature-we-would-like-to-have.

So please, fix the bug first; then, if you want that "automatic failover to TCP" feature, develop it. Add a parameter for a possible notification, or abort, or whatever you want. But it doesn't exist today. It just doesn't work with any BTL: errors reported by BTLs are all fatal.

Sylvain
