Just curious - your proposed fix sounds exactly like what was done in the OPAL 
SOS work. Are you therefore proposing to use SOS to provide a more 
informational status return?

If so, then it would seem the only real dispute here is: is there -any- 
condition whereby a given BTL should have the authority to tell OMPI to 
terminate an application, even if other BTLs could still function?

I guess that is what is unclear to me. How would an error in the IB network 
translate into "you must terminate the application, even if you could run just 
fine over the TCP transport"?

I understand that the current code may not yet support that operation, but I do 
believe that was the intent of the design. So only the case where -all- BTLs 
say "I can't do it" would result in termination. Rather than change that 
design, I believe the intent is to work towards completing that implementation. 
In the interim, it would seem most sensible to me that we add an MCA param that 
specifies the termination behavior (i.e., attempt to continue or terminate on 
first fatal BTL error).

Could you perhaps provide an example where the IB network should dictate that 
the application must terminate, even if other BTLs could function?


On May 31, 2010, at 3:10 AM, Sylvain Jeaugey wrote:

> In my case, the error happens in :
>  mca_btl_openib_add_procs()
>    mca_btl_openib_size_queues()
>      adjust_cq()
>        ibv_create_cq_compat()
>          ibv_create_cq()
> 
> ibv_create_cq() returns an error which goes up until 
> mca_btl_openib_add_procs(). As george mentionned, the openib btl should be 
> completely ignored, since the bitmask is not taken into account when a error 
> is returned. However -I don't know why- openib get called again and crashes.
> 
> So, yes, there must be a bug in openib.
> 
> And I know this is how you guys designed the bml layer. But I was hoping we 
> could improve the design to improve error handling.
> 
> So, this is my last try to explain my opinion. If you disagree, then we'll 
> fix this on the openib side.
> 
> Ignoring BTL errors bugs me because the current errors are all serious. Our 
> try to continue will therefore always leads to a crash (George, you 
> introduced an error return code, not a real error, hence you managed to 
> continue). This confuses the user as of why we have a problem, because the 
> first serious error will be flooded by further errors or crashes. This is 
> true for openib, but also for sm (I would like to stop on the first 
> "malloc()" that fails).
> 
> We have a two-level system (bitmask + return code) we could use to handle non 
> severe errors (bitmask) and severe errors (return code). Currently, we just 
> use the return code as a way to ignore the bitmask, but we could use the 
> return code as a more serious message and thus improve our error management.
> 
> To sum up, my proposition is to change the meaning of an error return code in 
> add_procs() from "I got a problem, continue without me" which can be 
> perfectly handled with the bitmask alone, to "I got a fatal error, please 
> stop the application".
> 
> I know this can be seen as an attempt to prevent fixing a bug in openib by 
> changing the design of the BML, but in this case, I think changing the BML 
> design would improve the overall behavior.
> 
> Sylvain
> 
> On Fri, 28 May 2010, Jeff Squyres wrote:
> 
>> To that point, where exactly in the openib BTL init / query sequence is it 
>> returning an error for you, Sylvain?  Is it just a matter of tidying 
>> something up properly before returning the error?
>> 
>> 
>> On May 28, 2010, at 2:21 PM, George Bosilca wrote:
>> 
>>> On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:
>>> 
>>>> On Fri, 28 May 2010, Jeff Squyres wrote:
>>>> 
>>>>> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
>>>>> 
>>>>>> Understood, and I agreed that the bug should be fixed.  Patches would be 
>>>>>> welcome.  :-)
>>>> I sent a patch on the bml layer in my first e-mail. We will apply it on 
>>>> our tree, but as always we're trying to send patches back to open-source 
>>>> (that was not my intent to start such a debate).
>>> 
>>> The only problem with your patch is that it solve something that is not 
>>> supposed to happen. As a proof of concept I did return errors from the tcp 
>>> and sm BTLs, and Open MPI gracefully deal with them. So, it is not a matter 
>>> of aborting we're looking at is a matter of the opebib BTL doing something 
>>> it is not supposed to do.
>>> 
>>> Going through the code it looks like the bitmask doesn't matter, if an 
>>> error is returned by a BTL we zero the bitmask and continue to another BTL.
>>> 
>>> Example: the SM BTL returns OMPI_ERROR after creating all the internal 
>>> structures.
>>> 
>>>>> mpirun -np 4 --host node01 --mca btl sm,self ./ring
>>> 
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications.  This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes.  This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other.  This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>> 
>>>  Process 1 ([[22047,1],3]) is on host: node01
>>>  Process 2 ([[22047,1],0]) is on host: node01
>>>  BTLs attempted: self sm
>>> 
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>> 
>>> Now if I allow TCP on the node:
>>>>> mpirun -np 4 --host node01 --mca btl sm,self,tcp ./ring
>>> 
>>> Process 0 sending 10 to 1, tag 201 (4 procs in ring)
>>> Process 0 sent to 1
>>> Process 3 exiting
>>> Process 0 decremented num: 9
>>> Process 0 decremented num: 8
>>> 
>>> Thus, Open MPI does the right thing when the BTLs are playing the game.
>>> 
>>>  george.
>>> 
>>>> 
>>>>> I should clarify rather than being flip:
>>>>> 
>>>>> 1. I agree: the bug should be fixed.  Clearly, we should never crash.
>>>>> 
>>>>> 2. After the bug is fixed, there is clearly a choice: some people may 
>>>>> want to use a different transport if a given BTL is unavailable. Others 
>>>>> may want to abort.  Once the bug is fixed, this seems like a pretty 
>>>>> straightforward thing to add.
>>>> If you use my patch, you still have no choice. Errors on BTLs lead to an 
>>>> immediate stop instead of trying to continue (and crash).
>>>> If someone wants to go further on this, then that's great. If nobody does, 
>>>> I think you should take my patch. Maybe it's not the best solution, but 
>>>> it's still better than the current state.
>>>> 
>>>> Sylvain
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to