On May 27, 2010, at 1:47 AM, Sylvain Jeaugey wrote:

> I don't think what the openib BTL is doing is that bad. It is returning an 
> error because something really went bad in IB. So yes, it could blank the 
> bitmask and return success, but would you really want IB to fail and fall 
> back on TCP once in a while without any notice?

As a sys admin - no, I would want to know it happened.

As a user - heck yeah! I don't care how the problem gets done, I just want the 
answer. It will probably take longer to complete, but that is better than 
having to start all over just because the cluster hiccups.

I believe this is what the notifier is intended to resolve.
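
To make the trade-off concrete, here is a minimal sketch in C of the policy 
discussed in the quoted messages below: a BTL that fails add_procs is reported 
and skipped, its reachable bits are ignored, and the job only fails if the 
remaining BTLs do not cover every proc. This is not the bml/r2 code. The names 
sketch_btl_t and sketch_add_procs are hypothetical, the 32-bit mask stands in 
for the real reachable bitmap, and the fprintf stands in for whatever the 
notifier would actually do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a BTL module; not the real Open MPI type. */
typedef struct {
    const char *name;    /* label used only for the notice */
    int rc;              /* 0 on success, nonzero if add_procs failed */
    uint32_t reachable;  /* bit i set => proc i reachable through this BTL */
} sketch_btl_t;

/* Returns 0 if every proc is covered by at least one BTL that succeeded,
 * -1 otherwise. *covered receives the union of the masks of those BTLs. */
int sketch_add_procs(const sketch_btl_t *btls, size_t nbtls,
                     unsigned nprocs, uint32_t *covered)
{
    assert(nprocs >= 1 && nprocs <= 32);
    uint32_t all = (nprocs == 32) ? UINT32_MAX
                                  : ((UINT32_C(1) << nprocs) - 1);

    *covered = 0;
    for (size_t i = 0; i < nbtls; i++) {
        if (btls[i].rc != 0) {
            /* Tell someone (assumed notifier role), drop this BTL's mask
             * even if it set bits, and give the other BTLs a chance. */
            fprintf(stderr, "notice: BTL %s failed add_procs (rc=%d)\n",
                    btls[i].name, btls[i].rc);
            continue;
        }
        *covered |= btls[i].reachable & all;
    }

    /* Complain only if some peer is left unreachable by every BTL. */
    return (*covered == all) ? 0 : -1;
}
```

With a failing "openib" entry and a working "tcp" entry that both claim all 
four procs, the call returns 0 and the run proceeds over TCP with a notice on 
stderr. With only the failing entry, it returns -1 and the caller can abort 
cleanly.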

> I wouldn't.
> 
> So, as it seems that all "normal" problems can be handled through the 
> reachable bitmask, it seems a good idea to me that BTLs returning errors
> make the application stop.
> 
> Sylvain
> 
> On Wed, 26 May 2010, Barrett, Brian W wrote:
> 
>> George -
>> 
>> I'm not sure I agree - the return code should indicate a failure beyond 
>> "something prohibited me from talking to the remote side" - something 
>> occurred that resulted in it being highly unlikely the app can successfully 
>> run to completion (such as malloc failing).  On the other hand, I also think 
>> that the OpenIB BTL is probably doing the wrong thing - I can't imagine that 
>> the error returned reaches that state of badness, and it should probably 
>> zero out the bitmask and quietly return rather than try to cause the app to 
>> abort.
>> 
>> Just my $0.02.
>> 
>> Brian
>> 
>> 
>> On May 25, 2010, at 12:27 PM, George Bosilca wrote:
>> 
>>> The BTLs are allowed to fail adding procs without major consequences in the 
>>> short term. As you noticed, each BTL returns a bitmask array containing all 
>>> procs reachable through this particular instance of the BTL. Later (in the 
>>> same file, line 395) we check for complete coverage of all procs, and only 
>>> complain if one of the peers is unreachable.
>>> 
>>> If you replace the continue statement with a return, we will never give the 
>>> other BTLs a chance, and we will complain about lack of connectivity as soon 
>>> as one BTL fails (for whatever reason). Not to mention that the eager, send 
>>> and rdma endpoint arrays will never be built.
>>> 
>>> george.
>>> 
>>> On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I'm currently trying to have Open MPI exit more gracefully when a BTL 
>>>> returns an error during the "add procs" phase.
>>>> 
>>>> The current bml/r2 code silently ignores btl->add_procs() error codes with 
>>>> the following comment:
>>>> ---- ompi/mca/bml/r2/bml_r2.c:208 ----
>>>> /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>>>>  * can take care of this task. */
>>>> continue;
>>>> --------------------------------------
>>>> 
>>>> This seems wrong to me: either a proc is reachable (and the "reachable" bit 
>>>> field is updated accordingly), or it is not (and nothing is done). Any 
>>>> error code should denote a fatal error needing a clean abort.
>>>> 
>>>> In the current openib btl code, the "reachable" bit is set but an error is 
>>>> returned - then ignored by r2. The next call to the openib BTL results in 
>>>> a segmentation fault.
>>>> 
>>>> So, maybe this simple fix would do the trick:
>>>> ========================================================================
>>>> diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
>>>> --- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
>>>> +++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
>>>> @@ -210,7 +210,7 @@
>>>>           /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>>>>            * can take care of this task.
>>>>            */
>>>> -            continue;
>>>> +            return rc;
>>>>       }
>>>> 
>>>>       /* for each proc that is reachable */
>>>> ========================================================================
>>>> 
>>>> Does anyone see a case (with a specific BTL) where add_procs returns an 
>>>> error but we still want to continue?
>>>> 
>>>> Sylvain
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> 
>> 
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
>> 

