On May 27, 2010, at 1:47 AM, Sylvain Jeaugey wrote:

> I don't think what the openib BTL is doing is that bad. It is returning an
> error because something really went bad in IB. So yes, it could blank the
> bitmask and return success, but would you really want IB to fail and fall
> back on TCP once in a while without any notice?

As a sys admin - no, I would want to know it happened. As a user - heck yeah! I don't care how the problem gets solved, I just want the answer. It will probably take longer to complete, but that is better than having to start all over just because the cluster hiccups.

I believe this is what the notifier is intended to resolve.

> I wouldn't.
>
> So, as it seems that all "normal" problems can be handled through the
> reachable bitmask, it seems a good idea to me that BTLs returning errors
> make the application stop.
>
> Sylvain
>
> On Wed, 26 May 2010, Barrett, Brian W wrote:
>
>> George -
>>
>> I'm not sure I agree - the return code should indicate a failure beyond
>> "something prohibited me from talking to the remote side" - something
>> occurred that resulted in it being highly unlikely the app can successfully
>> run to completion (such as malloc failing). On the other hand, I also think
>> that the OpenIB BTL is probably doing the wrong thing - I can't imagine that
>> the error returned reaches that state of badness, and it should probably
>> zero out the bitmask and quietly return rather than try to cause the app to
>> abort.
>>
>> Just my $0.02.
>>
>> Brian
>>
>>
>> On May 25, 2010, at 12:27 PM, George Bosilca wrote:
>>
>>> The BTLs are allowed to fail adding procs without major consequences in the
>>> short term. As you noticed, each BTL returns a bit mask array containing all
>>> procs reachable through this particular instance of the BTL. Later (in the
>>> same file, line 395) we check for complete coverage of all procs, and
>>> only complain if one of the peers is unreachable.
>>>
>>> If you replace the continue statement by a return, we will never give a
>>> chance to the other BTLs and we will complain about lack of connectivity as
>>> soon as one BTL fails (for whatever reason). Not to mention that the
>>> eager, send and rdma endpoint arrays will never be built.
>>>
>>> george.
>>>
>>> On May 25, 2010, at 05:10, Sylvain Jeaugey wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm currently trying to have Open MPI exit more gracefully when a BTL
>>>> returns an error during the "add procs" phase.
>>>>
>>>> The current bml/r2 code silently ignores btl->add_procs() error codes with
>>>> the following comment:
>>>> ---- ompi/mca/bml/r2/bml_r2.c:208 ----
>>>> /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>>>>  * can take care of this task. */
>>>> continue;
>>>> --------------------------------------
>>>>
>>>> This seems wrong to me: either a proc is reached (and the "reachable" bit
>>>> field is updated accordingly), or it is not (and nothing is done). Any
>>>> error code should denote a fatal error needing a clean abort.
>>>>
>>>> In the current openib btl code, the "reachable" bit is set but an error is
>>>> returned - then ignored by r2. The next call to the openib BTL results in
>>>> a segmentation fault.
>>>>
>>>> So, maybe this simple fix would do the trick:
>>>> ========================================================================
>>>> diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
>>>> --- a/ompi/mca/bml/r2/bml_r2.c Wed May 19 14:35:27 2010 +0200
>>>> +++ b/ompi/mca/bml/r2/bml_r2.c Tue May 25 10:54:19 2010 +0200
>>>> @@ -210,7 +210,7 @@
>>>>  /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>>>>   * can take care of this task.
>>>>   */
>>>> - continue;
>>>> + return rc;
>>>>  }
>>>>
>>>>  /* for each proc that is reachable */
>>>> ========================================================================
>>>>
>>>> Does anyone see a case (with a specific btl) where add_procs returns an
>>>> error but we still want to continue?
>>>>
>>>> Sylvain
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
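
To make the two policies discussed in this thread concrete, here is a minimal, self-contained sketch in C. It is not the actual bml/r2 code: add_procs_fn, failing_add_procs, working_add_procs, add_all_procs, MAX_PROCS and the return values are all hypothetical names chosen for illustration. It assumes each BTL reports reachability as a simple array of booleans, and it discards the reachable bits of any BTL that returned an error, so a failed BTL is never used later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PROCS 64  /* hypothetical upper bound, for this sketch only */

/* Hypothetical stand-in for btl->add_procs(): marks the procs it can reach
 * in `reachable` and returns 0 on success, nonzero on error. */
typedef int (*add_procs_fn)(size_t nprocs, bool *reachable);

/* Simulates the failure mode described in the thread: the "reachable" bits
 * are set, but an error is returned anyway. */
static int failing_add_procs(size_t nprocs, bool *reachable)
{
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return -1;
}

/* Simulates a healthy fallback transport that reaches every proc. */
static int working_add_procs(size_t nprocs, bool *reachable)
{
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return 0;
}

/* Returns 0 if every proc is covered by at least one healthy BTL,
 * -2 if some proc is unreachable after all BTLs were tried, or the failing
 * BTL's own return code when abort_on_error is set. */
static int add_all_procs(const add_procs_fn *btls, size_t nbtls,
                         size_t nprocs, bool abort_on_error, bool *covered)
{
    assert(nprocs <= MAX_PROCS);
    for (size_t i = 0; i < nprocs; i++) {
        covered[i] = false;
    }

    for (size_t b = 0; b < nbtls; b++) {
        bool reachable[MAX_PROCS] = { false };
        int rc = btls[b](nprocs, reachable);
        if (rc != 0) {
            if (abort_on_error) {
                return rc;  /* the "return rc" policy from the proposed patch */
            }
            /* The "continue" policy: report the failure, ignore this BTL's
             * reachable bits, and give the remaining BTLs a chance. */
            fprintf(stderr, "BTL %zu failed in add_procs (rc=%d)\n", b, rc);
            continue;
        }
        for (size_t i = 0; i < nprocs; i++) {
            if (reachable[i]) {
                covered[i] = true;
            }
        }
    }

    /* Coverage check: complain only if some peer is still unreachable. */
    for (size_t i = 0; i < nprocs; i++) {
        if (!covered[i]) {
            return -2;
        }
    }
    return 0;
}
```

With the array { failing_add_procs, working_add_procs } and abort_on_error set to false, the second transport still covers every proc and the call succeeds. With abort_on_error set to true, the first failure ends the loop before the fallback is tried, which is the behaviour George warns about. Printing to stderr here only stands in for whatever notification mechanism an administrator would want.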