Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Ralph Castain
Just curious - your proposed fix sounds exactly like what was done in the OPAL 
SOS work. Are you therefore proposing to use SOS to provide a more 
informational status return?

If so, then it would seem the only real dispute here is: is there -any- 
condition whereby a given BTL should have the authority to tell OMPI to 
terminate an application, even if other BTLs could still function?

I guess that is what is unclear to me. How would an error in the IB network 
translate into "you must terminate the application, even if you could run just 
fine over the TCP transport"?

I understand that the current code may not yet support that operation, but I do 
believe that was the intent of the design. So only the case where -all- BTLs 
say "I can't do it" would result in termination. Rather than change that 
design, I believe the intent is to work towards completing that implementation. 
In the interim, it would seem most sensible to me that we add an MCA param that 
specifies the termination behavior (i.e., attempt to continue or terminate on 
first fatal BTL error).

Could you perhaps provide an example where the IB network should dictate that 
the application must terminate, even if other BTLs could function?


On May 31, 2010, at 3:10 AM, Sylvain Jeaugey wrote:

> In my case, the error happens in :
>  mca_btl_openib_add_procs()
>mca_btl_openib_size_queues()
>  adjust_cq()
>ibv_create_cq_compat()
>  ibv_create_cq()
> 
> ibv_create_cq() returns an error which goes up until 
> mca_btl_openib_add_procs(). As george mentionned, the openib btl should be 
> completely ignored, since the bitmask is not taken into account when a error 
> is returned. However -I don't know why- openib get called again and crashes.
> 
> So, yes, there must be a bug in openib.
> 
> And I know this is how you guys designed the bml layer. But I was hoping we 
> could improve the design to improve error handling.
> 
> So, this is my last try to explain my opinion. If you disagree, then we'll 
> fix this on the openib side.
> 
> Ignoring BTL errors bugs me because the current errors are all serious. Our 
> try to continue will therefore always leads to a crash (George, you 
> introduced an error return code, not a real error, hence you managed to 
> continue). This confuses the user as of why we have a problem, because the 
> first serious error will be flooded by further errors or crashes. This is 
> true for openib, but also for sm (I would like to stop on the first 
> "malloc()" that fails).
> 
> We have a two-level system (bitmask + return code) we could use to handle non 
> severe errors (bitmask) and severe errors (return code). Currently, we just 
> use the return code as a way to ignore the bitmask, but we could use the 
> return code as a more serious message and thus improve our error management.
> 
> To sum up, my proposition is to change the meaning of an error return code in 
> add_procs() from "I got a problem, continue without me" which can be 
> perfectly handled with the bitmask alone, to "I got a fatal error, please 
> stop the application".
> 
> I know this can be seen as an attempt to prevent fixing a bug in openib by 
> changing the design of the BML, but in this case, I think changing the BML 
> design would improve the overall behavior.
> 
> Sylvain
> 
> On Fri, 28 May 2010, Jeff Squyres wrote:
> 
>> To that point, where exactly in the openib BTL init / query sequence is it 
>> returning an error for you, Sylvain?  Is it just a matter of tidying 
>> something up properly before returning the error?
>> 
>> 
>> On May 28, 2010, at 2:21 PM, George Bosilca wrote:
>> 
>>> On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:
>>> 
 On Fri, 28 May 2010, Jeff Squyres wrote:
 
> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
> 
>> Understood, and I agreed that the bug should be fixed.  Patches would be 
>> welcome.  :-)
 I sent a patch on the bml layer in my first e-mail. We will apply it on 
 our tree, but as always we're trying to send patches back to open-source 
 (that was not my intent to start such a debate).
>>> 
>>> The only problem with your patch is that it solve something that is not 
>>> supposed to happen. As a proof of concept I did return errors from the tcp 
>>> and sm BTLs, and Open MPI gracefully deal with them. So, it is not a matter 
>>> of aborting we're looking at is a matter of the opebib BTL doing something 
>>> it is not supposed to do.
>>> 
>>> Going through the code it looks like the bitmask doesn't matter, if an 
>>> error is returned by a BTL we zero the bitmask and continue to another BTL.
>>> 
>>> Example: the SM BTL returns OMPI_ERROR after creating all the internal 
>>> structures.
>>> 
> mpirun -np 4 --host node01 --mca btl sm,self ./ring
>>> 
>>> --
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications.  This means 

Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Sylvain Jeaugey

In my case, the error happens in :
  mca_btl_openib_add_procs()
mca_btl_openib_size_queues()
  adjust_cq()
ibv_create_cq_compat()
  ibv_create_cq()

ibv_create_cq() returns an error which goes up until 
mca_btl_openib_add_procs(). As george mentionned, the openib btl should be 
completely ignored, since the bitmask is not taken into account when a 
error is returned. However -I don't know why- openib get called again and 
crashes.


So, yes, there must be a bug in openib.

And I know this is how you guys designed the bml layer. But I was hoping 
we could improve the design to improve error handling.


So, this is my last try to explain my opinion. If you disagree, then we'll 
fix this on the openib side.


Ignoring BTL errors bugs me because the current errors are all serious. 
Our try to continue will therefore always leads to a crash (George, you 
introduced an error return code, not a real error, hence you managed to 
continue). This confuses the user as of why we have a problem, because the 
first serious error will be flooded by further errors or crashes. This is 
true for openib, but also for sm (I would like to stop on the first 
"malloc()" that fails).


We have a two-level system (bitmask + return code) we could use to handle 
non severe errors (bitmask) and severe errors (return code). Currently, we 
just use the return code as a way to ignore the bitmask, but we could use 
the return code as a more serious message and thus improve our error 
management.


To sum up, my proposition is to change the meaning of an error return code 
in add_procs() from "I got a problem, continue without me" which can be 
perfectly handled with the bitmask alone, to "I got a fatal error, please 
stop the application".


I know this can be seen as an attempt to prevent fixing a bug in openib by 
changing the design of the BML, but in this case, I think changing the BML 
design would improve the overall behavior.


Sylvain

On Fri, 28 May 2010, Jeff Squyres wrote:

To that point, where exactly in the openib BTL init / query sequence is 
it returning an error for you, Sylvain?  Is it just a matter of tidying 
something up properly before returning the error?



On May 28, 2010, at 2:21 PM, George Bosilca wrote:


On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:


On Fri, 28 May 2010, Jeff Squyres wrote:


On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:


Understood, and I agreed that the bug should be fixed.  Patches would be 
welcome.  :-)

I sent a patch on the bml layer in my first e-mail. We will apply it on our 
tree, but as always we're trying to send patches back to open-source (that was 
not my intent to start such a debate).


The only problem with your patch is that it solve something that is not 
supposed to happen. As a proof of concept I did return errors from the tcp and 
sm BTLs, and Open MPI gracefully deal with them. So, it is not a matter of 
aborting we're looking at is a matter of the opebib BTL doing something it is 
not supposed to do.

Going through the code it looks like the bitmask doesn't matter, if an error is 
returned by a BTL we zero the bitmask and continue to another BTL.

Example: the SM BTL returns OMPI_ERROR after creating all the internal 
structures.


mpirun -np 4 --host node01 --mca btl sm,self ./ring


--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[22047,1],3]) is on host: node01
  Process 2 ([[22047,1],0]) is on host: node01
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

Now if I allow TCP on the node:

mpirun -np 4 --host node01 --mca btl sm,self,tcp ./ring


Process 0 sending 10 to 1, tag 201 (4 procs in ring)
Process 0 sent to 1
Process 3 exiting
Process 0 decremented num: 9
Process 0 decremented num: 8

Thus, Open MPI does the right thing when the BTLs are playing the game.

  george.




I should clarify rather than being flip:

1. I agree: the bug should be fixed.  Clearly, we should never crash.

2. After the bug is fixed, there is clearly a choice: some people may want to 
use a different transport if a given BTL is unavailable. Others may want to 
abort.  Once the bug is fixed, this seems like a pretty straightforward thing 
to add.

If you use my patch, you still have no choice. Errors on BTLs lead to an 
immediate stop instead of trying to continue (and crash).
If someone wants to go further on this, then that's great. If nobody does, I 
think you should take my patch. Maybe it's not the best solution, but it's 
still better than the