I will reiterate my concern. The code that is there now is mostly nine years old (with some mods made when it was brought over to Open MPI). It took about two months of testing on systems with 5-13 way network parallelism to track down all KNOWN race conditions. This code is at the center of MPI correctness, so I am VERY concerned about changing it without some very strong reasons. Not opposed, just very cautious.
Rich

On 12/11/07 11:47 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:

> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>> Possibly, though I have results from a benchmark I've written indicating
>> the reordering happens at the sender. I believe I found it was due to
>> the QP striping trick I use to get more bandwidth -- if you back down to
>> one QP (there's a define in the code you can change), the reordering
>> rate drops.
> Ah, OK. My assumption was just from looking at the code, so I may be
> wrong.
>
>> Also, I do not make any recursive calls to progress -- at least not
>> directly in the BTL; I can't speak for the upper layers. The reason I
>> process many completions at once is that it is a big help in turning
>> around receive buffers, making it harder to run out of buffers and drop
>> frags. I want to say there was some performance benefit as well, but I
>> can't say for sure.
> Currently the upper layers of Open MPI may call the BTL progress
> function recursively. I hope this will change some day.
>
>> Andrew
>>
>> Gleb Natapov wrote:
>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>>> Try UD; frags are reordered at a very high rate, so it should be a
>>>> good test.
>>> Good idea, I'll try this. BTW, I think the reason for such a high rate
>>> of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>>> (500) and processes them one by one, so if the progress function is
>>> called recursively, the next 500 completions will be reordered relative
>>> to the previous ones (reordering happens on the receiver, not the
>>> sender).
>>>
>>>> Andrew
>>>>
>>>> Richard Graham wrote:
>>>>> Gleb,
>>>>>   I would suggest that before this is checked in, it be tested on a
>>>>> system that has N-way network parallelism, where N is as large as you
>>>>> can find. This is a key bit of code for MPI correctness, and
>>>>> out-of-order operations will break it, so you want to maximize the
>>>>> chance for such operations.
>>>>>
>>>>> Rich
>>>>>
>>>>> On 12/11/07 10:54 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I did a rewrite of the matching code in OB1. I made it much simpler
>>>>>> and two times smaller (which is good: less code, fewer bugs). I also
>>>>>> got rid of the huge macros -- very helpful if you need to debug
>>>>>> something. There is no performance degradation; actually, I even see
>>>>>> a very small performance improvement. I ran MTT with this patch and
>>>>>> the result is the same as on trunk. I would like to commit this to
>>>>>> the trunk. The patch is attached for everybody to try.
>>>>>>
>>>>>> --
>>>>>> Gleb.
> --
> Gleb.
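To make the receiver-side mechanism Gleb describes concrete: below is a minimal C sketch of the batch-polling pattern, not the actual UD BTL source. The helpers poll_completions() and deliver_frag() are invented stand-ins for the real CQ poll and PML callback.

    #define NUM_WC 500                  /* stands in for MCA_BTL_UD_NUM_WC */

    struct completion { void *frag; };  /* stand-in for a work completion */

    /* Hypothetical stubs; the real code polls a CQ and calls up into
     * the PML. */
    static int poll_completions(struct completion *wc, int max)
    {
        (void)wc; (void)max;
        return 0;                       /* number of completions drained */
    }

    static void deliver_frag(struct completion *wc)
    {
        (void)wc;                       /* hand the frag to the upper layer;
                                         * today this may recurse into
                                         * btl_progress() */
    }

    static int btl_progress(void)
    {
        struct completion wc[NUM_WC];
        int n = poll_completions(wc, NUM_WC);  /* drain up to one batch */

        for (int i = 0; i < n; i++) {
            /* If the callback under deliver_frag() re-enters
             * btl_progress(), the nested call drains and delivers a
             * NEWER batch before wc[i+1..n-1] of this one -- the
             * receiver sees completions out of order even though the
             * wire delivered them in order. */
            deliver_frag(&wc[i]);
        }
        return n;
    }

Each nesting level finishes its own, newer batch first, which is exactly the inversion Gleb points at; polling one completion per progress call (or deferring delivery until the batch loop ends) would sidestep it, at some cost in the receive-buffer turnaround Andrew mentions.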
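Since the thread is ultimately about how the OB1 matching code copes with such reordering, here is a simplified sketch of the ordering discipline involved. This is not Gleb's patch or the actual OB1 source; all types and helper names are invented for illustration. Frags carry a per-peer sequence number, only the next expected frag is matched, and early arrivals are parked until the gap fills:

    #include <stdint.h>

    struct frag {
        uint16_t     seq;      /* per-peer, monotonically increasing */
        struct frag *next;
    };

    struct peer {
        uint16_t     next_seq; /* next sequence we may match */
        struct frag *ooo_list; /* frags that arrived early, unsorted */
    };

    static void match_frag(struct peer *p, struct frag *f)
    {
        (void)p; (void)f;      /* stub: hand off to the MPI matching engine */
    }

    static void frag_arrived(struct peer *p, struct frag *f)
    {
        if (f->seq != p->next_seq) {   /* arrived early: park it */
            f->next = p->ooo_list;
            p->ooo_list = f;
            return;
        }
        match_frag(p, f);
        p->next_seq++;                 /* uint16_t wraps naturally */

        /* Filling the gap may release parked frags; re-scan until the
         * sequence stalls again. */
        for (int found = 1; found; ) {
            found = 0;
            for (struct frag **fp = &p->ooo_list; *fp; fp = &(*fp)->next) {
                if ((*fp)->seq == p->next_seq) {
                    struct frag *g = *fp;
                    *fp = g->next;     /* unlink from the parked list */
                    match_frag(p, g);
                    p->next_seq++;
                    found = 1;
                    break;             /* restart the scan */
                }
            }
        }
    }

The real code additionally has to match against both the posted-receive and unexpected queues; the sketch shows only the in-order delivery invariant that out-of-order networks force the PML to maintain, and which Rich's concern about N-way network parallelism is meant to stress-test.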