Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
I don't think what the openib BTL is doing is that bad. It is returning an 
error because something really went bad in IB. So yes, it could blank the 
bitmask and return success, but would you really want IB to fail and 
fallback on TCP once in a while without any notice ? I wouldn't.


So, as it seems that all "normal" problems can be handled through the 
reachable bitmask, it seems a good idea to me that BTLs returning errors

make the application stop.

Sylvain

On Wed, 26 May 2010, Barrett, Brian W wrote:


George -

I'm not sure I agree - the return code should indicate a failure beyond 
"something prohibited me from talking to the remote side" - something 
occurred that resulted in it being highly unlikely the app can 
successfully run to completion (such as malloc failing).  On the other 
hand, I also think that the OpenIB BTL is probably doing the wrong thing 
- I can't imagine that the error returned reaches that state of badness, 
and it should probably zero out the bitmask and quietly return rather 
than try to cause the app to abort.


Just my $0.02.

Brian


On May 25, 2010, at 12:27 PM, George Bosilca wrote:

The BTLs are allowed to fail adding procs without major consequences in 
the short term. As you noticed each BTL returns a bit mask array 
containing all procs reachable through this particular instance of the 
BTL. Later (in the same file line 395) we check for the complete 
coverage for all procs, and only complain if one of the peers is 
unreachable.


If you replace the continue statement by a return, we will never give a 
chance to the other BTLs and we will complain about lack of 
connectivity as soon as one BTL fails (for some reasons). Without 
talking about the fact that all the eager, send and rmda endpoint 
arrays will not be built.


 george.

On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:


Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error 
during the "add procs" phase.

The current bml/r2 code silently ignores btl->add_procs() error codes with the 
following comment :
 ompi/mca/bml/r2/bml_r2.c:208 
/* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
 * can take care of this task. */
continue;
--

This seems wrong to me : either a proc is reached (the "reachable" bit field is 
therefore updated), either it is not (and nothing is done). Any error code should denote 
a fatal error needing a clean abort.

In the current openib btl code, the "reachable" bit is set but an error is 
returned - then ignored by r2. The next call to the openib BTL results in a segmentation 
fault.

So, maybe this simple fix would do the trick :

diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
   /* This BTL has troubles adding the nodes. Let's continue maybe some 
other BTL
* can take care of this task.
*/
-continue;
+return rc;
   }

   /* for each proc that is reachable */


Does anyone see a case (with a specific btl) where add_procs returns an error 
but we still want to continue ?

Sylvain
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Ralph Castain

On May 27, 2010, at 1:47 AM, Sylvain Jeaugey wrote:

> I don't think what the openib BTL is doing is that bad. It is returning an 
> error because something really went bad in IB. So yes, it could blank the 
> bitmask and return success, but would you really want IB to fail and fallback 
> on TCP once in a while without any notice ?

As a sys admin - no, I would want to know it happened.

As a user - heck yeah! I don't care how the problem gets done, I just want the 
answer. It will probably take longer to complete, but that is better than 
having to start all over just because the cluster hiccups.

I believe this is what the notifier is intended to resolve.

> I wouldn't.
> 
> So, as it seems that all "normal" problems can be handled through the 
> reachable bitmask, it seems a good idea to me that BTLs returning errors
> make the application stop.
> 
> Sylvain
> 
> On Wed, 26 May 2010, Barrett, Brian W wrote:
> 
>> George -
>> 
>> I'm not sure I agree - the return code should indicate a failure beyond 
>> "something prohibited me from talking to the remote side" - something 
>> occurred that resulted in it being highly unlikely the app can successfully 
>> run to completion (such as malloc failing).  On the other hand, I also think 
>> that the OpenIB BTL is probably doing the wrong thing - I can't imagine that 
>> the error returned reaches that state of badness, and it should probably 
>> zero out the bitmask and quietly return rather than try to cause the app to 
>> abort.
>> 
>> Just my $0.02.
>> 
>> Brian
>> 
>> 
>> On May 25, 2010, at 12:27 PM, George Bosilca wrote:
>> 
>>> The BTLs are allowed to fail adding procs without major consequences in the 
>>> short term. As you noticed each BTL returns a bit mask array containing all 
>>> procs reachable through this particular instance of the BTL. Later (in the 
>>> same file line 395) we check for the complete coverage for all procs, and 
>>> only complain if one of the peers is unreachable.
>>> 
>>> If you replace the continue statement by a return, we will never give a 
>>> chance to the other BTLs and we will complain about lack of connectivity as 
>>> soon as one BTL fails (for some reasons). Without talking about the fact 
>>> that all the eager, send and rmda endpoint arrays will not be built.
>>> 
>>> george.
>>> 
>>> On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:
>>> 
 Hi,
 
 I'm currently trying to have Open MPI exit more gracefully when a BTL 
 returns an error during the "add procs" phase.
 
 The current bml/r2 code silently ignores btl->add_procs() error codes with 
 the following comment :
  ompi/mca/bml/r2/bml_r2.c:208 
 /* This BTL has troubles adding the nodes. Let's continue maybe some other 
 BTL
 * can take care of this task. */
 continue;
 --
 
 This seems wrong to me : either a proc is reached (the "reachable" bit 
 field is therefore updated), either it is not (and nothing is done). Any 
 error code should denote a fatal error needing a clean abort.
 
 In the current openib btl code, the "reachable" bit is set but an error is 
 returned - then ignored by r2. The next call to the openib BTL results in 
 a segmentation fault.
 
 So, maybe this simple fix would do the trick :
 
 diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
 --- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
 +++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
 @@ -210,7 +210,7 @@
   /* This BTL has troubles adding the nodes. Let's continue maybe 
 some other BTL
* can take care of this task.
*/
 -continue;
 +return rc;
   }
 
   /* for each proc that is reachable */
 
 
 Does anyone see a case (with a specific btl) where add_procs returns an 
 error but we still want to continue ?
 
 Sylvain
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>> 
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Barrett, Brian W
Sylvain -

I have to agree with Ralph.  The situation you describe (IB failing) may or may 
not be what the user wants.  And, in fact, we will print a warning message to 
the user that such a situation (falling back to TCP) has occurred.  However, it 
also does not fall under the category of "fail the job" bad according to the 
design goals of Open MPI -- we explicitly wanted to allow easy fallback to 
another interconnect when something bad happens.  The bml and pml have the 
opprotunity to determine if they like the BTL choices made, and this is the 
right level to make that decision.  Lower layer transports should not be 
implementing high level policy.  So I go back to: if the job can run to 
completion (even if slower), add_procs should not return an error.  If 
add_procs does return an error, the job should abort.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories

From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of 
Sylvain Jeaugey [sylvain.jeau...@bull.net]
Sent: Thursday, May 27, 2010 1:47 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] BTL add procs errors

I don't think what the openib BTL is doing is that bad. It is returning an
error because something really went bad in IB. So yes, it could blank the
bitmask and return success, but would you really want IB to fail and
fallback on TCP once in a while without any notice ? I wouldn't.

So, as it seems that all "normal" problems can be handled through the
reachable bitmask, it seems a good idea to me that BTLs returning errors
make the application stop.

Sylvain

On Wed, 26 May 2010, Barrett, Brian W wrote:

> George -
>
> I'm not sure I agree - the return code should indicate a failure beyond
> "something prohibited me from talking to the remote side" - something
> occurred that resulted in it being highly unlikely the app can
> successfully run to completion (such as malloc failing).  On the other
> hand, I also think that the OpenIB BTL is probably doing the wrong thing
> - I can't imagine that the error returned reaches that state of badness,
> and it should probably zero out the bitmask and quietly return rather
> than try to cause the app to abort.
>
> Just my $0.02.
>
> Brian
>
>
> On May 25, 2010, at 12:27 PM, George Bosilca wrote:
>
>> The BTLs are allowed to fail adding procs without major consequences in
>> the short term. As you noticed each BTL returns a bit mask array
>> containing all procs reachable through this particular instance of the
>> BTL. Later (in the same file line 395) we check for the complete
>> coverage for all procs, and only complain if one of the peers is
>> unreachable.
>>
>> If you replace the continue statement by a return, we will never give a
>> chance to the other BTLs and we will complain about lack of
>> connectivity as soon as one BTL fails (for some reasons). Without
>> talking about the fact that all the eager, send and rmda endpoint
>> arrays will not be built.
>>
>>  george.
>>
>> On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:
>>
>>> Hi,
>>>
>>> I'm currently trying to have Open MPI exit more gracefully when a BTL 
>>> returns an error during the "add procs" phase.
>>>
>>> The current bml/r2 code silently ignores btl->add_procs() error codes with 
>>> the following comment :
>>>  ompi/mca/bml/r2/bml_r2.c:208 
>>> /* This BTL has troubles adding the nodes. Let's continue maybe some other 
>>> BTL
>>>  * can take care of this task. */
>>> continue;
>>> --
>>>
>>> This seems wrong to me : either a proc is reached (the "reachable" bit 
>>> field is therefore updated), either it is not (and nothing is done). Any 
>>> error code should denote a fatal error needing a clean abort.
>>>
>>> In the current openib btl code, the "reachable" bit is set but an error is 
>>> returned - then ignored by r2. The next call to the openib BTL results in a 
>>> segmentation fault.
>>>
>>> So, maybe this simple fix would do the trick :
>>> 
>>> diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
>>> --- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
>>> +++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
>>> @@ -210,7 +210,7 @@
>>>/* This BTL has troubles adding the nodes. Let's continue maybe 
>>> some other BTL
>>> * can take care of this task.
>>> */
>>> -continue;
>>> +return rc;
>>>}
>>>
>>>/* for each proc that is reachable */
>>> 
>>>
>>> Does anyone see a case (with a specific btl) where add_procs returns an 
>>> error but we still want to continue ?
>>>
>>> Sylvain
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> 

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
That's pretty much my first proposition : abort when an error arises, 
because if we don't, we'll crash soon afterwards. That's my original 
concern and this should really be fixed.


Now, if you want to fix the openib BTL so that an error in IB results in 
an elegant fallback on TCP (elegant = notified ;-)), then hooray.


Sylvain

On Thu, 27 May 2010, Barrett, Brian W wrote:


Sylvain -

I have to agree with Ralph.  The situation you describe (IB failing) may 
or may not be what the user wants.  And, in fact, we will print a 
warning message to the user that such a situation (falling back to TCP) 
has occurred.  However, it also does not fall under the category of 
"fail the job" bad according to the design goals of Open MPI -- we 
explicitly wanted to allow easy fallback to another interconnect when 
something bad happens.  The bml and pml have the opprotunity to 
determine if they like the BTL choices made, and this is the right level 
to make that decision.  Lower layer transports should not be 
implementing high level policy.  So I go back to: if the job can run to 
completion (even if slower), add_procs should not return an error.  If 
add_procs does return an error, the job should abort.


Brian

--
 Brian W. Barrett
 Scalable System Software Group
 Sandia National Laboratories

From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of 
Sylvain Jeaugey [sylvain.jeau...@bull.net]
Sent: Thursday, May 27, 2010 1:47 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] BTL add procs errors

I don't think what the openib BTL is doing is that bad. It is returning an
error because something really went bad in IB. So yes, it could blank the
bitmask and return success, but would you really want IB to fail and
fallback on TCP once in a while without any notice ? I wouldn't.

So, as it seems that all "normal" problems can be handled through the
reachable bitmask, it seems a good idea to me that BTLs returning errors
make the application stop.

Sylvain

On Wed, 26 May 2010, Barrett, Brian W wrote:


George -

I'm not sure I agree - the return code should indicate a failure beyond
"something prohibited me from talking to the remote side" - something
occurred that resulted in it being highly unlikely the app can
successfully run to completion (such as malloc failing).  On the other
hand, I also think that the OpenIB BTL is probably doing the wrong thing
- I can't imagine that the error returned reaches that state of badness,
and it should probably zero out the bitmask and quietly return rather
than try to cause the app to abort.

Just my $0.02.

Brian


On May 25, 2010, at 12:27 PM, George Bosilca wrote:


The BTLs are allowed to fail adding procs without major consequences in
the short term. As you noticed each BTL returns a bit mask array
containing all procs reachable through this particular instance of the
BTL. Later (in the same file line 395) we check for the complete
coverage for all procs, and only complain if one of the peers is
unreachable.

If you replace the continue statement by a return, we will never give a
chance to the other BTLs and we will complain about lack of
connectivity as soon as one BTL fails (for some reasons). Without
talking about the fact that all the eager, send and rmda endpoint
arrays will not be built.

 george.

On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:


Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error 
during the "add procs" phase.

The current bml/r2 code silently ignores btl->add_procs() error codes with the 
following comment :
 ompi/mca/bml/r2/bml_r2.c:208 
/* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
 * can take care of this task. */
continue;
--

This seems wrong to me : either a proc is reached (the "reachable" bit field is 
therefore updated), either it is not (and nothing is done). Any error code should denote 
a fatal error needing a clean abort.

In the current openib btl code, the "reachable" bit is set but an error is 
returned - then ignored by r2. The next call to the openib BTL results in a segmentation 
fault.

So, maybe this simple fix would do the trick :

diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
   /* This BTL has troubles adding the nodes. Let's continue maybe some 
other BTL
* can take care of this task.
*/
-continue;
+return rc;
   }

   /* for each proc that is reachable */


Does anyone see a case (with a specific btl) where add_procs returns an error 
but we still want to continue ?

Sylvain

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Jeff Squyres
On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:

> That's pretty much my first proposition : abort when an error arises,
> because if we don't, we'll crash soon afterwards. That's my original
> concern and this should really be fixed.
> 
> Now, if you want to fix the openib BTL so that an error in IB results in
> an elegant fallback on TCP (elegant = notified ;-)), then hooray.

You're specifically referring to the point where the openib btl sets the 
reachable bit, but then later decides "oops, an error occurred, so return 
!=OMPI_SUCCESS" -- and assume that the openib btl is not called again.

Right?

If so, then yes, that's a bug.  The openib btl should be fixed to unset the 
reachable bit(s) that it just set before returning the error.

Or the BML could assume that !=OMPI_SUCCESS codes means that the reachable bits 
it got back were invalid and should be ignored.

I'd lead towards the former.  

Can you file and bug and submit a patch?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/