I will reiterate my concern.  The code that is there now is mostly nine
years old (with some mods made when it was brought over to Open MPI).  It
took about two months of testing on systems with 5- to 13-way network
parallelism to track down all KNOWN race conditions.  This code is at the
center of MPI correctness, so I am VERY concerned about changing it without
some very strong reasons.  Not opposed, just very cautious.

Rich


On 12/11/07 11:47 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:

> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>> Possibly, though I have results from a benchmark I've written indicating
>> the reordering happens at the sender.  I believe I found it was due to
>> the QP striping trick I use to get more bandwidth -- if you back down to
>> one QP (there's a define in the code you can change), the reordering
>> rate drops.
> Ah, OK. My assumption was just from looking at the code, so I may be
> wrong.
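
[Illustrative aside: a toy C sketch of the sender-side effect Andrew
describes.  This is not the actual ofud BTL code -- NUM_QP and the draining
order are made up -- but fragments posted to different QPs have no ordering
guarantee relative to one another, so striping round-robin across several
QPs can hand them to the receiver interleaved.]

/* Toy illustration only (not the ofud BTL): fragments striped round-robin
 * across NUM_QP queue pairs keep their order within each QP, but there is
 * no ordering guarantee between QPs, so the receiver can see them
 * interleaved. */
#include <stdio.h>

#define NUM_QP    3    /* hypothetical number of striped QPs */
#define NUM_FRAGS 9

int main(void)
{
    int qp[NUM_QP][NUM_FRAGS];
    int depth[NUM_QP] = { 0 };

    /* Sender side: post fragment sequence numbers round-robin across QPs. */
    for (int seq = 0; seq < NUM_FRAGS; seq++) {
        int q = seq % NUM_QP;
        qp[q][depth[q]++] = seq;
    }

    /* One possible arrival order: each QP preserves its own order, but the
     * QPs drain independently of one another. */
    printf("arrival order:");
    for (int q = 0; q < NUM_QP; q++)
        for (int i = 0; i < depth[q]; i++)
            printf(" %d", qp[q][i]);
    printf("\n");    /* prints: 0 3 6 1 4 7 2 5 8 */
    return 0;
}

[Backing down to a single QP collapses this to in-order delivery, which
matches the drop in reordering rate Andrew reports.]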
> 
>> 
>> Also I do not make any recursive calls to progress -- at least not
>> directly in the BTL; I can't speak for the upper layers.  The reason I
>> process many completions at once is that it is a big help in turning
>> around receive buffers, making it harder to run out of buffers and drop
>> frags.  I want to say there was some performance benefit as well, but I
>> can't say for sure.
> Currently the upper layers of Open MPI may call a BTL progress function
> recursively. I hope this will change some day.
> 
>> 
>> Andrew
>> 
>> Gleb Natapov wrote:
>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>>> Try UD, frags are reordered at a very high rate, so it should be a good test.
>>> Good idea, I'll try this. BTW I think the reason for such a high rate
>>> of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>>> (500) and processes them one by one; if the progress function is called
>>> recursively, the next 500 completions will be reordered relative to the
>>> previous ones (the reordering happens on the receiver, not the sender).
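
[Illustrative aside: a toy C model of the receiver-side effect Gleb
describes.  It is not the real btl_ud progress loop -- BATCH stands in for
MCA_BTL_UD_NUM_WC and the recursion trigger is invented -- but it shows how
re-entering progress() in the middle of a batch delivers the next batch
before the rest of the current one.]

/* Toy model (not the real btl_ud code): progress() polls up to BATCH
 * completions and hands them up one at a time.  If handling a completion
 * re-enters progress(), the next batch is consumed before the remainder of
 * the current one, so the upper layer sees fragments out of order even
 * though the wire delivered them in order. */
#include <stdio.h>

#define BATCH 4          /* stands in for MCA_BTL_UD_NUM_WC (500) */
#define TOTAL 12

static int next_seq = 0; /* completions become available in order 0,1,2,... */

static void progress(int depth)
{
    int wc[BATCH], n = 0;

    /* "Poll" a batch of completions. */
    while (n < BATCH && next_seq < TOTAL)
        wc[n++] = next_seq++;

    for (int i = 0; i < n; i++) {
        printf("depth %d delivers frag %d\n", depth, wc[i]);
        /* Pretend that handling the first completion of a batch triggers a
         * recursive progress call (e.g. the upper layer sends something). */
        if (i == 0 && depth < 2)
            progress(depth + 1);
    }
}

int main(void)
{
    progress(0);   /* delivery order: 0 4 8 9 10 11 5 6 7 1 2 3 */
    return 0;
}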
>>> 
>>>> Andrew
>>>> 
>>>> Richard Graham wrote:
>>>>> Gleb,
>>>>>   I would suggest that before this is checked in, it be tested on a
>>>>> system that has N-way network parallelism, where N is as large as you
>>>>> can find.  This is a key bit of code for MPI correctness, and
>>>>> out-of-order operations will break it, so you want to maximize the
>>>>> chance of such operations occurring.
>>>>> 
>>>>> Rich
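
[For context, a minimal C sketch of the kind of per-peer sequence matching
that out-of-order fragments exercise.  This is not the OB1 code; the
WINDOW-sized buffer and the field names are made up.  The point is just
that an in-order fragment is matched immediately, while anything else has
to be parked until the gap is filled -- which is the path out-of-order
traffic stresses.]

/* Minimal illustration (not the OB1 matching code): each peer has an
 * expected sequence number; in-order fragments match immediately, while
 * out-of-order fragments are parked until the missing ones arrive.  The
 * toy assumes fragments never run more than WINDOW ahead of next_seq. */
#include <stdio.h>
#include <stdbool.h>

#define WINDOW 16

struct peer {
    unsigned next_seq;          /* next fragment we expect */
    bool     pending[WINDOW];   /* out-of-order fragments parked here */
};

static void deliver(unsigned seq) { printf("matched frag %u\n", seq); }

static void frag_arrived(struct peer *p, unsigned seq)
{
    if (seq != p->next_seq) {                     /* out of order: park it */
        p->pending[seq % WINDOW] = true;
        return;
    }
    deliver(p->next_seq++);                       /* in order: match it... */
    while (p->pending[p->next_seq % WINDOW]) {    /* ...and drain any      */
        p->pending[p->next_seq % WINDOW] = false; /* frags it unblocks     */
        deliver(p->next_seq++);
    }
}

int main(void)
{
    struct peer p = { .next_seq = 0 };
    unsigned arrival[] = { 0, 2, 3, 1, 4 };       /* 2 and 3 arrive early */
    for (unsigned i = 0; i < 5; i++)
        frag_arrived(&p, arrival[i]);             /* matches 0 1 2 3 4 */
    return 0;
}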
>>>>> 
>>>>> 
>>>>> On 12/11/07 10:54 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>>    I did a rewrite of the matching code in OB1. I made it much simpler
>>>>>> and half the size (which is good: less code, fewer bugs). I also got
>>>>>> rid of the huge macros, which is very helpful if you need to debug
>>>>>> something. There is no performance degradation; in fact I even see a
>>>>>> very small performance improvement. I ran MTT with this patch and the
>>>>>> results are the same as on the trunk. I would like to commit this to
>>>>>> the trunk. The patch is attached for everybody to try.
>>>>>> 
>>>>>> --
>>>>>> Gleb.
>>> 
>>> --
>>> Gleb.
> 
> --
> Gleb.
