Re: [OMPI devel] matching code rewrite in OB1
On Mon, Dec 17, 2007 at 08:08:02PM -0500, Richard Graham wrote:
> Needless to say (for the nth time :-) ) that changing this bit of code
> makes me nervous.
I've noticed it already :)
> However, it occurred to me that there is a much better way to test
> this code than setting up an environment that generates some out of order
> events without us being able to specify the order.
> Since this routine is executed serially, it should be sufficient to set up
> a test code that would simulate any out-of-order scenario we want. If one
> specifies a number of "messages" to be "sent", and "randomly" changes the
> order they arrive (e.g. scramble some input vector), one can check and see
> if the messages are "received" in the correct order. One could even "drop"
> messages and see if matching stops. Using a test code like this, and a
> code coverage tool, one should be able to get much better testing than we
> have to date.
While I sometimes do unit testing for code that I write, in this case it is
easy to generate all reasonable corner cases without isolating the code in a
separate unit. I ran this code through different specially constructed MPI
applications and checked code coverage with gcov. Here is the result:

File '/home/glebn/OpenMPI/ompi.stg/ompi/mca/pml/ob1/pml_ob1_recvfrag.c'
Lines executed: 97.58% of 124

Only two lines of the code were never executed, both for error cases that
should cause an abort anyway. The pml_ob1_recvfrag.c.gcov with coverage
results is attached for whoever is curious enough to look at it. BTW I doubt
that the previous code passed this level of testing. At least with gcov it
is not possible to generate meaningful results when most of the code is
inside macros.
> What would you think about doing something like this ? Seems like a few
> hours of this sort of simulation would be much better than even years of
> testing and relying on random fluctuations in the run to thoroughly test
> out-of-order scenarios.
>
> What do you think ?
I think that the coverage testing I did is enough for this code.
> Rich
>
> On 12/17/07 8:32 AM, "Gleb Natapov" wrote:
> > On Thu, Dec 13, 2007 at 08:04:21PM -0500, Richard Graham wrote:
> >> Yes, should be a bit more clear. Need an independent way to verify that
> >> data is matched in the correct order; sending this information as
> >> payload is one way to do this. So, sending unique data in every
> >> message, and making sure that it arrives in the user buffers in the
> >> expected order, is a way to do this.
> > Did that. Encoded a sequence number in the payload and sent many eager
> > packets from one rank to another. Many packets were reordered, but the
> > application received everything in the correct order.
> >
> > --
> > Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Gleb.

[Attachment: pml_ob1_recvfrag.c.gcov — gcov-annotated source of
pml_ob1_recvfrag.c (8 runs, 1 program); truncated in the archive shortly
after the file's copyright header.]
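Rich's proposed simulation is easy to sketch. Below is a toy Python model of the serial matching routine (illustrative only; the real code is the C in pml_ob1_recvfrag.c): it "sends" N messages, scrambles their arrival order, and checks that they are "received" in order, and "dropping" a message stalls matching, as expected.

```python
import random

class Matcher:
    """Toy model of OB1-style sequence matching (not the actual OB1 code)."""
    def __init__(self):
        self.next_expected = 0   # next sequence number we can match
        self.out_of_order = {}   # frags that arrived early, keyed by seq
        self.delivered = []      # order in which frags were matched

    def arrive(self, seq):
        if seq == self.next_expected:
            self.delivered.append(seq)
            self.next_expected += 1
            # Drain any buffered frags that are now matchable.
            while self.next_expected in self.out_of_order:
                self.out_of_order.pop(self.next_expected)
                self.delivered.append(self.next_expected)
                self.next_expected += 1
        else:
            self.out_of_order[seq] = True

def run_trial(n=1000, drop=None, seed=0):
    order = list(range(n))
    random.Random(seed).shuffle(order)   # scramble the arrival order
    m = Matcher()
    for seq in order:
        if seq != drop:                  # optionally "drop" one message
            m.arrive(seq)
    return m

m = run_trial()
assert m.delivered == list(range(1000))  # everything matched, in order

m = run_trial(drop=500)
assert m.delivered == list(range(500))   # matching stalls at the drop
```

Feeding such a harness every permutation of a small input vector, under gcov, is the kind of exhaustive out-of-order testing being discussed.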
Re: [OMPI devel] matching code rewrite in OB1
On Thu, Dec 13, 2007 at 08:04:21PM -0500, Richard Graham wrote:
> Yes, should be a bit more clear. Need an independent way to verify that
> data is matched in the correct order; sending this information as payload
> is one way to do this. So, sending unique data in every message, and
> making sure that it arrives in the user buffers in the expected order, is
> a way to do this.
Did that. Encoded a sequence number in the payload and sent many eager
packets from one rank to another. Many packets were reordered, but the
application received everything in the correct order.
--
Gleb.
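The payload check described here might look like the following sketch (hypothetical Python helpers, not the actual test; a real version would MPI_Send eager-sized messages between two ranks): each payload carries its sequence number up front, and the receiver verifies the numbers are ascending in the order the buffers were delivered.

```python
import random
import struct

def make_payload(seq, size=64):
    # Encode the sequence number in the first 4 bytes of the payload.
    return struct.pack("<I", seq) + bytes(size - 4)

def check_payloads(buffers):
    # 'buffers' is the order the application "received" the messages;
    # matching is correct iff the embedded sequence numbers are ascending.
    seqs = [struct.unpack_from("<I", b)[0] for b in buffers]
    return seqs == sorted(seqs)

sent = [make_payload(i) for i in range(100)]

# Simulate wire-level reordering...
wire = sent[:]
random.Random(42).shuffle(wire)
assert not check_payloads(wire)          # the wire really did reorder

# ...followed by a correct matching layer, which hands buffers to the
# application in send order regardless of arrival order.
received = sorted(wire, key=lambda b: struct.unpack_from("<I", b)[0])
assert check_payloads(received)
```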
Re: [OMPI devel] matching code rewrite in OB1
On Fri, Dec 14, 2007 at 06:53:55AM -0500, Richard Graham wrote:
> If you have positive confirmation that such things have happened, this
> will go a long way.
I instrumented the code to log all kinds of info about fragment reordering
while I chased a bug in openib that caused the matching logic to
malfunction. Any non-trivial application that uses the OpenIB BTL will have
reordered fragments. (I wish this were not the case, but I don't have a
solution yet.)
> I will not trust the code until this has also been done with
> multiple independent network paths.
I ran IMB over IP and IB simultaneously on more than 80 ranks.
> I very rarely express such strong
> opinions, even if I don't agree with what is being done, but this is the
> core of correct MPI functionality, and first hand experience has shown
> that
I agree that this is indeed a very important piece of code, but it
certainly is not more important than the datatype engine, for instance (and
it is much easier to test all corner cases in the matching logic than in
the datatype engine, IMHO). And even if the matching code works perfectly,
if other parts of OB1 are buggy then Open MPI will not work properly, so
why is this code chosen to be a sacred cow?
> just thinking through the logic, I can miss some of the race conditions.
That is of course correct, but the more people look at the code the better,
isn't it?
> The code here has been running for 8+ years in two production MPI's
> running on very large clusters, so I am very reluctant to make changes
> for what
Are you sure about this? I see a number of changes to this code during Open
MPI development, and current SVN unfortunately does not hold all the
history of this code. Here is the list of commits that I found; some of
them change the code logic quite a bit:
r6770, r7342, r8339, r8352, r8353, r8356, r8946, r11874, r12323, r12582
> seems to amount to people's taste - maintenance is not an issue in this
> case.
Why do you think that maintenance is not an issue? It is for me. Otherwise
I wouldn't even look at this part of the code. All those macros prohibit
the use of a debugger, for instance. (And I see a small latency improvement
too :))
> Had this not been such a key bit of code, I would not even bat an
> eye. I suppose if you can go through some formal verification, this would
> also be good - actually better than hoping that one will hit out-of-order
> situations.
>
> Rich
>
> On 12/14/07 2:20 AM, "Gleb Natapov" wrote:
> > On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
> >> The situation that needs to be triggered, just as George has
> >> mentioned, is where we have a lot of unexpected messages, to make sure
> >> that when one that we can match against comes in, all the unexpected
> >> messages that can be matched with pre-posted receives are matched.
> >> Since we attempt to match only when a new fragment comes in, we need
> >> to make sure that we don't leave other unexpected messages that can be
> >> matched in the unexpected queue, as these (if the out of order
> >> scenario is just right) would block any new matches from occurring.
> >>
> >> For example: Say the next expected message is 25
> >>
> >> Unexpected message queue has: 26 28 29 ...
> >>
> >> If 25 comes in, and is handled, if 26 is not pulled off the unexpected
> >> message queue, when 27 comes in it won't be able to be matched, as 26
> >> is sitting in the unexpected queue, and will never be looked at
> >> again ...
> > This situation is triggered constantly with the openib BTL. The OpenIB
> > BTL has two ways to receive a packet: over a send queue or over an
> > eager RDMA path. The receiver polls both of them and may reorder
> > packets locally. Actually there is currently a bug in the openib BTL
> > where one channel may starve the other at the receiver, so if a match
> > fragment with the next sequence number is in the starved path, tens of
> > thousands of fragments can be reordered. The test case attached to
> > ticket #1158 triggers this case and my patch handles all reordered
> > packets.
> >
> > And, by the way, the code is much simpler now and can be reviewed
> > easily ;)
> >
> > --
> > Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Gleb.
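The two-channel reordering described above can be illustrated with a toy model (hypothetical Python, not the openib BTL code, which polls completion queues in C): each channel delivers in order, but a receiver that drains one channel before the other observes reordered sequence numbers even though no packet was reordered on the wire.

```python
from collections import deque

# The sender emits fragments 0..9, alternating between its two receive
# channels (the "send queue" and the "eager RDMA" path in the discussion).
send_queue = deque(seq for seq in range(10) if seq % 2 == 0)
eager_rdma = deque(seq for seq in range(10) if seq % 2 == 1)

# A receiver that polls one channel to exhaustion before the other (the
# starvation bug described above) sees heavy local reordering, even though
# each channel individually preserved order.
arrival_order = list(eager_rdma) + list(send_queue)

assert arrival_order == [1, 3, 5, 7, 9, 0, 2, 4, 6, 8]
assert arrival_order != sorted(arrival_order)  # matching layer must fix this
```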
Re: [OMPI devel] matching code rewrite in OB1
If you have positive confirmation that such things have happened, this will
go a long way. I will not trust the code until this has also been done with
multiple independent network paths. I very rarely express such strong
opinions, even if I don't agree with what is being done, but this is the
core of correct MPI functionality, and first-hand experience has shown that
just thinking through the logic, I can miss some of the race conditions.
The code here has been running for 8+ years in two production MPI's running
on very large clusters, so I am very reluctant to make changes for what
seems to amount to people's taste - maintenance is not an issue in this
case. Had this not been such a key bit of code, I would not even bat an
eye. I suppose if you can go through some formal verification, this would
also be good - actually better than hoping that one will hit out-of-order
situations.

Rich

On 12/14/07 2:20 AM, "Gleb Natapov" wrote:
> On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
>> The situation that needs to be triggered, just as George has mentioned,
>> is where we have a lot of unexpected messages, to make sure that when
>> one that we can match against comes in, all the unexpected messages that
>> can be matched with pre-posted receives are matched. Since we attempt to
>> match only when a new fragment comes in, we need to make sure that we
>> don't leave other unexpected messages that can be matched in the
>> unexpected queue, as these (if the out of order scenario is just right)
>> would block any new matches from occurring.
>>
>> For example: Say the next expected message is 25
>>
>> Unexpected message queue has: 26 28 29 ...
>>
>> If 25 comes in, and is handled, if 26 is not pulled off the unexpected
>> message queue, when 27 comes in it won't be able to be matched, as 26 is
>> sitting in the unexpected queue, and will never be looked at again ...
> This situation is triggered constantly with the openib BTL. The OpenIB
> BTL has two ways to receive a packet: over a send queue or over an eager
> RDMA path. The receiver polls both of them and may reorder packets
> locally. Actually there is currently a bug in the openib BTL where one
> channel may starve the other at the receiver, so if a match fragment with
> the next sequence number is in the starved path, tens of thousands of
> fragments can be reordered. The test case attached to ticket #1158
> triggers this case and my patch handles all reordered packets.
>
> And, by the way, the code is much simpler now and can be reviewed
> easily ;)
>
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] matching code rewrite in OB1
Yes, should be a bit more clear. Need an independent way to verify that
data is matched in the correct order; sending this information as payload
is one way to do this. So, sending unique data in every message, and
making sure that it arrives in the user buffers in the expected order, is
a way to do this.

Rich

On 12/12/07 5:04 PM, "Jeff Squyres" wrote:
> Was Rich referring to ensuring that the test codes checked that their
> payloads were correct (and not re-assembled in the wrong order)?
>
> On Dec 12, 2007, at 4:10 PM, Brian W. Barrett wrote:
>> On Wed, 12 Dec 2007, Gleb Natapov wrote:
>>> On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
>>>> This is better than nothing, but really not very helpful for looking
>>>> at the specific issues that can arise with this, unless these systems
>>>> have several parallel networks, with tests that will generate a lot of
>>>> parallel network traffic, and be able to self-check for out-of-order
>>>> receives - i.e. this needs to be encoded into the payload for
>>>> verification purposes. There are some out-of-order scenarios that need
>>>> to be generated and checked. I think that George may have a system
>>>> that will be good for this sort of testing.
>>> I am running various tests with multiple networks right now. I use
>>> several IB BTLs and the TCP BTL simultaneously. I see many reordered
>>> messages and all tests were OK till now, but they don't encode the
>>> message sequence in the payload as far as I know. I'll change one of
>>> them to do so.
>> Other than Rich's comment that we need sequence numbers, why add them?
>> We haven't had them for non-matching packets for the last 3 years in
>> Open MPI (ie, forever), and I can't see why we would need them. Yes, we
>> need sequence numbers for match headers to make sure MPI ordering is
>> correct. But for the rest of the payload, there's no need with OMPI's
>> datatype engine. It's just more payload for no gain.
>>
>> Brian
>
> --
> Jeff Squyres
> Cisco Systems
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] matching code rewrite in OB1
The situation that needs to be triggered, just as George has mentioned, is
where we have a lot of unexpected messages, to make sure that when one that
we can match against comes in, all the unexpected messages that can be
matched with pre-posted receives are matched. Since we attempt to match
only when a new fragment comes in, we need to make sure that we don't leave
other unexpected messages that can be matched in the unexpected queue, as
these (if the out of order scenario is just right) would block any new
matches from occurring.

For example: Say the next expected message is 25

Unexpected message queue has: 26 28 29 ...

If 25 comes in, and is handled, if 26 is not pulled off the unexpected
message queue, when 27 comes in it won't be able to be matched, as 26 is
sitting in the unexpected queue, and will never be looked at again ...

Rich

On 12/13/07 2:09 PM, "George Bosilca" wrote:
> Rich was referring to the fact that the reordering of fragments other
> than the matching ones is irrelevant to Gleb's change. In order to
> trigger the changes we need to force a lot of small unexpected messages
> over multiple networks. The testing environment should have multiple
> similar networks (to make sure the matching fragments are distributed
> evenly across them), and the test should generate a lot of unexpected
> messages. I think the flood test is a good base for this.
>
> Thanks,
> george.
>
> On Dec 12, 2007, at 5:04 PM, Jeff Squyres wrote:
>> Was Rich referring to ensuring that the test codes checked that their
>> payloads were correct (and not re-assembled in the wrong order)?
>>
>> On Dec 12, 2007, at 4:10 PM, Brian W. Barrett wrote:
>>> On Wed, 12 Dec 2007, Gleb Natapov wrote:
>>>> On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
>>>>> This is better than nothing, but really not very helpful for looking
>>>>> at the specific issues that can arise with this, unless these systems
>>>>> have several parallel networks, with tests that will generate a lot
>>>>> of parallel network traffic, and be able to self-check for
>>>>> out-of-order receives - i.e. this needs to be encoded into the
>>>>> payload for verification purposes. There are some out-of-order
>>>>> scenarios that need to be generated and checked. I think that George
>>>>> may have a system that will be good for this sort of testing.
>>>> I am running various tests with multiple networks right now. I use
>>>> several IB BTLs and the TCP BTL simultaneously. I see many reordered
>>>> messages and all tests were OK till now, but they don't encode the
>>>> message sequence in the payload as far as I know. I'll change one of
>>>> them to do so.
>>> Other than Rich's comment that we need sequence numbers, why add them?
>>> We haven't had them for non-matching packets for the last 3 years in
>>> Open MPI (ie, forever), and I can't see why we would need them. Yes, we
>>> need sequence numbers for match headers to make sure MPI ordering is
>>> correct. But for the rest of the payload, there's no need with OMPI's
>>> datatype engine. It's just more payload for no gain.
>>>
>>> Brian
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
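Rich's 25/26/27 scenario can be made concrete with a small executable sketch (a hypothetical Python model, not the OB1 C code). The while loop is the drain step whose omission would leave 26 in the unexpected queue and block 27 forever:

```python
# Next expected is 25; 26, 28 and 29 already sit in the unexpected queue.
next_expected = 25
unexpected = {26, 28, 29}
matched = []

def arrive(seq):
    global next_expected
    if seq != next_expected:
        unexpected.add(seq)   # out of order: park it
        return
    matched.append(seq)
    next_expected += 1
    # Crucial step: pull newly matchable messages off the unexpected
    # queue; otherwise 26 would sit there and block 27 from ever matching.
    while next_expected in unexpected:
        unexpected.remove(next_expected)
        matched.append(next_expected)
        next_expected += 1

arrive(25)   # matches 25, then drains 26 from the unexpected queue
assert matched == [25, 26] and next_expected == 27

arrive(27)   # 27 now matches immediately; 28 and 29 drain as well
assert matched == [25, 26, 27, 28, 29]
```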
Re: [OMPI devel] matching code rewrite in OB1
Rich was referring to the fact that the reordering of fragments other than
the matching ones is irrelevant to Gleb's change. In order to trigger the
changes we need to force a lot of small unexpected messages over multiple
networks. The testing environment should have multiple similar networks (to
make sure the matching fragments are distributed evenly across them), and
the test should generate a lot of unexpected messages. I think the flood
test is a good base for this.

Thanks,
george.

On Dec 12, 2007, at 5:04 PM, Jeff Squyres wrote:
> Was Rich referring to ensuring that the test codes checked that their
> payloads were correct (and not re-assembled in the wrong order)?
>
> On Dec 12, 2007, at 4:10 PM, Brian W. Barrett wrote:
>> On Wed, 12 Dec 2007, Gleb Natapov wrote:
>>> On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
>>>> This is better than nothing, but really not very helpful for looking
>>>> at the specific issues that can arise with this, unless these systems
>>>> have several parallel networks, with tests that will generate a lot of
>>>> parallel network traffic, and be able to self-check for out-of-order
>>>> receives - i.e. this needs to be encoded into the payload for
>>>> verification purposes. There are some out-of-order scenarios that need
>>>> to be generated and checked. I think that George may have a system
>>>> that will be good for this sort of testing.
>>> I am running various tests with multiple networks right now. I use
>>> several IB BTLs and the TCP BTL simultaneously. I see many reordered
>>> messages and all tests were OK till now, but they don't encode the
>>> message sequence in the payload as far as I know. I'll change one of
>>> them to do so.
>> Other than Rich's comment that we need sequence numbers, why add them?
>> We haven't had them for non-matching packets for the last 3 years in
>> Open MPI (ie, forever), and I can't see why we would need them. Yes, we
>> need sequence numbers for match headers to make sure MPI ordering is
>> correct. But for the rest of the payload, there's no need with OMPI's
>> datatype engine. It's just more payload for no gain.
>>
>> Brian
>
> --
> Jeff Squyres
> Cisco Systems
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] matching code rewrite in OB1
On Wed, Dec 12, 2007 at 03:10:10PM -0600, Brian W. Barrett wrote:
> On Wed, 12 Dec 2007, Gleb Natapov wrote:
>> On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
>>> This is better than nothing, but really not very helpful for looking at
>>> the specific issues that can arise with this, unless these systems have
>>> several parallel networks, with tests that will generate a lot of
>>> parallel network traffic, and be able to self-check for out-of-order
>>> receives - i.e. this needs to be encoded into the payload for
>>> verification purposes. There are some out-of-order scenarios that need
>>> to be generated and checked. I think that George may have a system that
>>> will be good for this sort of testing.
>> I am running various tests with multiple networks right now. I use
>> several IB BTLs and the TCP BTL simultaneously. I see many reordered
>> messages and all tests were OK till now, but they don't encode the
>> message sequence in the payload as far as I know. I'll change one of
>> them to do so.
> Other than Rich's comment that we need sequence numbers, why add them? We
> haven't had them for non-matching packets for the last 3 years in Open
> MPI (ie, forever), and I can't see why we would need them. Yes, we need
> sequence numbers for match headers to make sure MPI ordering is correct.
> But for the rest of the payload, there's no need with OMPI's datatype
> engine. It's just more payload for no gain.
As I understand it, what Rich proposes is that we construct a special test
that checks, at the application layer, that the matching engine did its job
right. In other words, the test should check that the payload received is
the correct one. He is not talking about adding additional fields to the
OB1 header.
--
Gleb.
Re: [OMPI devel] matching code rewrite in OB1
Tarballs available at:

http://www.open-mpi.org/~jsquyres/unofficial/

On Dec 12, 2007, at 4:08 PM, Jeff Squyres (jsquyres) wrote:
> Heh, ok. I'll make a tarball against your patch later. It's against the
> trunk?
>
> -jms
> Sent from my PDA
>
> -Original Message-
> From: Gleb Natapov [mailto:gl...@voltaire.com]
> Sent: Wednesday, December 12, 2007 03:54 PM Eastern Standard Time
> To: Open MPI Developers
> Subject: Re: [OMPI devel] matching code rewrite in OB1
>
> On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
>> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
>>>> How about making a tarball with this patch in it that can be thrown at
>>>> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
>>> I don't have access to www.open-mpi.org, but I can send you the patch.
>>> I can send you a tarball too, but I prefer to not abuse email.
>> Do you have access to staging.openfabrics.org? I could download it from
>> there and put it on www.open-mpi.org.
> No. I don't :(
>
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] matching code rewrite in OB1
Was Rich referring to ensuring that the test codes checked that their
payloads were correct (and not re-assembled in the wrong order)?

On Dec 12, 2007, at 4:10 PM, Brian W. Barrett wrote:
> On Wed, 12 Dec 2007, Gleb Natapov wrote:
>> On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
>>> This is better than nothing, but really not very helpful for looking at
>>> the specific issues that can arise with this, unless these systems have
>>> several parallel networks, with tests that will generate a lot of
>>> parallel network traffic, and be able to self-check for out-of-order
>>> receives - i.e. this needs to be encoded into the payload for
>>> verification purposes. There are some out-of-order scenarios that need
>>> to be generated and checked. I think that George may have a system that
>>> will be good for this sort of testing.
>> I am running various tests with multiple networks right now. I use
>> several IB BTLs and the TCP BTL simultaneously. I see many reordered
>> messages and all tests were OK till now, but they don't encode the
>> message sequence in the payload as far as I know. I'll change one of
>> them to do so.
> Other than Rich's comment that we need sequence numbers, why add them? We
> haven't had them for non-matching packets for the last 3 years in Open
> MPI (ie, forever), and I can't see why we would need them. Yes, we need
> sequence numbers for match headers to make sure MPI ordering is correct.
> But for the rest of the payload, there's no need with OMPI's datatype
> engine. It's just more payload for no gain.
>
> Brian

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] matching code rewrite in OB1
On Wed, 12 Dec 2007, Gleb Natapov wrote:
> On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:
>> This is better than nothing, but really not very helpful for looking at
>> the specific issues that can arise with this, unless these systems have
>> several parallel networks, with tests that will generate a lot of
>> parallel network traffic, and be able to self-check for out-of-order
>> receives - i.e. this needs to be encoded into the payload for
>> verification purposes. There are some out-of-order scenarios that need
>> to be generated and checked. I think that George may have a system that
>> will be good for this sort of testing.
> I am running various tests with multiple networks right now. I use
> several IB BTLs and the TCP BTL simultaneously. I see many reordered
> messages and all tests were OK till now, but they don't encode the
> message sequence in the payload as far as I know. I'll change one of them
> to do so.
Other than Rich's comment that we need sequence numbers, why add them? We
haven't had them for non-matching packets for the last 3 years in Open MPI
(ie, forever), and I can't see why we would need them. Yes, we need
sequence numbers for match headers to make sure MPI ordering is correct.
But for the rest of the payload, there's no need with OMPI's datatype
engine. It's just more payload for no gain.

Brian
Re: [OMPI devel] matching code rewrite in OB1
Heh, ok. I'll make a tarball against your patch later. It's against the
trunk?

-jms
Sent from my PDA

-Original Message-
From: Gleb Natapov [mailto:gl...@voltaire.com]
Sent: Wednesday, December 12, 2007 03:54 PM Eastern Standard Time
To: Open MPI Developers
Subject: Re: [OMPI devel] matching code rewrite in OB1

On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
>>> How about making a tarball with this patch in it that can be thrown at
>>> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
>> I don't have access to www.open-mpi.org, but I can send you the patch.
>> I can send you a tarball too, but I prefer to not abuse email.
> Do you have access to staging.openfabrics.org? I could download it from
> there and put it on www.open-mpi.org.
No. I don't :(
--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] matching code rewrite in OB1
On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote:
> On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
>>> How about making a tarball with this patch in it that can be thrown at
>>> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
>> I don't have access to www.open-mpi.org, but I can send you the patch.
>> I can send you a tarball too, but I prefer to not abuse email.
> Do you have access to staging.openfabrics.org? I could download it from
> there and put it on www.open-mpi.org.
No. I don't :(
--
Gleb.
Re: [OMPI devel] matching code rewrite in OB1
On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote:
>> How about making a tarball with this patch in it that can be thrown at
>> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> I don't have access to www.open-mpi.org, but I can send you the patch.
> I can send you a tarball too, but I prefer to not abuse email.

Do you have access to staging.openfabrics.org? I could download it from
there and put it on www.open-mpi.org.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] matching code rewrite in OB1
On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
> Gleb --
>
> How about making a tarball with this patch in it that can be thrown at
> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
I don't have access to www.open-mpi.org, but I can send you the patch. I
can send you a tarball too, but I prefer to not abuse email.
>
> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
>> I will re-iterate my concern. The code that is there now is mostly nine
>> years old (with some mods made when it was brought over to Open MPI). It
>> took about 2 months of testing on systems with 5-13 way network
>> parallelism to track down all KNOWN race conditions. This code is at the
>> center of MPI correctness, so I am VERY concerned about changing it w/o
>> some very strong reasons. Not opposed, just very cautious.
>>
>> Rich
>>
>> On 12/11/07 11:47 AM, "Gleb Natapov" wrote:
>>> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>>>> Possibly, though I have results from a benchmark I've written
>>>> indicating the reordering happens at the sender. I believe I found it
>>>> was due to the QP striping trick I use to get more bandwidth -- if you
>>>> back down to one QP (there's a define in the code you can change), the
>>>> reordering rate drops.
>>> Ah, OK. My assumption was just from looking into the code, so I may be
>>> wrong.
>>>
>>>> Also I do not make any recursive calls to progress -- at least not
>>>> directly in the BTL; I can't speak for the upper layers. The reason I
>>>> do many completions at once is that it is a big help in turning around
>>>> receive buffers, making it harder to run out of buffers and drop
>>>> frags. I want to say there was some performance benefit as well but I
>>>> can't say for sure.
>>> Currently the upper layers of Open MPI may call the BTL progress
>>> function recursively. I hope this will change some day.
>>>
>>>> Andrew
>>>>
>>>> Gleb Natapov wrote:
>>>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>>>>> Try UD, frags are reordered at a very high rate so it should be a
>>>>>> good test.
>>>>> Good idea, I'll try this. BTW I think the reason for such a high rate
>>>>> of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC
>>>>> completions (500) and processes them one by one, and if the progress
>>>>> function is called recursively the next 500 completions will be
>>>>> reordered versus the previous completions (reordering happens on the
>>>>> receiver, not the sender).
>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> Richard Graham wrote:
>>>>>>> Gleb,
>>>>>>> I would suggest that before this is checked in this be tested on a
>>>>>>> system that has N-way network parallelism, where N is as large as
>>>>>>> you can find. This is a key bit of code for MPI correctness, and
>>>>>>> out-of-order operations will break it, so you want to maximize the
>>>>>>> chance for such operations.
>>>>>>>
>>>>>>> Rich
>>>>>>>
>>>>>>> On 12/11/07 10:54 AM, "Gleb Natapov" wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I did a rewrite of the matching code in OB1. I made it much
>>>>>>>> simpler and 2 times smaller (which is good, less code - less
>>>>>>>> bugs). I also got rid of huge macros - very helpful if you need to
>>>>>>>> debug something. There is no performance degradation, actually I
>>>>>>>> even see a very small performance improvement. I ran MTT with this
>>>>>>>> patch and the result is the same as on trunk. I would like to
>>>>>>>> commit this to the trunk. The patch is attached for everybody to
>>>>>>>> try.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] matching code rewrite in OB1
Gleb --

How about making a tarball with this patch in it that can be thrown at
everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)

On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
> I will reiterate my concern. The code that is there now is mostly nine
> years old (with some mods made when it was brought over to Open MPI). It
> took about 2 months of testing on systems with 5-13 way network parallelism
> to track down all KNOWN race conditions. This code is at the center of MPI
> correctness, so I am VERY concerned about changing it w/o some very strong
> reasons. Not opposed, just very cautious.
>
> Rich

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] matching code rewrite in OB1
I will reiterate my concern. The code that is there now is mostly nine
years old (with some mods made when it was brought over to Open MPI). It
took about 2 months of testing on systems with 5-13 way network parallelism
to track down all KNOWN race conditions. This code is at the center of MPI
correctness, so I am VERY concerned about changing it w/o some very strong
reasons. Not opposed, just very cautious.

Rich

On 12/11/07 11:47 AM, "Gleb Natapov" wrote:
> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>> Possibly, though I have results from a benchmark I've written indicating
>> the reordering happens at the sender. I believe I found it was due to
>> the QP striping trick I use to get more bandwidth -- if you back down to
>> one QP (there's a define in the code you can change), the reordering
>> rate drops.
> Ah, OK. My assumption was just from looking into code, so I may be
> wrong.
>
>> Also I do not make any recursive calls to progress -- at least not
>> directly in the BTL; I can't speak for the upper layers.
> Currently upper layers of Open MPI may call BTL progress function
> recursively. I hope this will change some day.
Re: [OMPI devel] matching code rewrite in OB1
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
mpi-ping works fine with UD BTL and the patch.

--
Gleb.
Re: [OMPI devel] matching code rewrite in OB1
Possibly, though I have results from a benchmark I've written indicating
the reordering happens at the sender. I believe I found it was due to
the QP striping trick I use to get more bandwidth -- if you back down to
one QP (there's a define in the code you can change), the reordering
rate drops.

Also I do not make any recursive calls to progress -- at least not
directly in the BTL; I can't speak for the upper layers. The reason I
do many completions at once is that it is a big help in turning around
receive buffers, making it harder to run out of buffers and drop frags.
I want to say there was some performance benefit as well but I can't
say for sure.

Andrew

Gleb Natapov wrote:
> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>> Try UD, frags are reordered at a very high rate so should be a good test.
> Good idea, I'll try this. BTW I think the reason for such a high rate of
> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
> (500) and processes them one by one, and if the progress function is called
> recursively the next 500 completions will be reordered versus previous
> completions (reordering happens on the receiver, not the sender).
Re: [OMPI devel] matching code rewrite in OB1
Try UD, frags are reordered at a very high rate so should be a good test.

Andrew

Richard Graham wrote:
> Gleb,
> I would suggest that before this is checked in this be tested on a system
> that has N-way network parallelism, where N is as large as you can find.
> This is a key bit of code for MPI correctness, and out-of-order operations
> will break it, so you want to maximize the chance for such operations.
>
> Rich
Re: [OMPI devel] matching code rewrite in OB1
Gleb,
I would suggest that before this is checked in this be tested on a system
that has N-way network parallelism, where N is as large as you can find.
This is a key bit of code for MPI correctness, and out-of-order operations
will break it, so you want to maximize the chance for such operations.

Rich

On 12/11/07 10:54 AM, "Gleb Natapov" wrote:
> Hi,
>
> I did a rewrite of matching code in OB1. I made it much simpler and 2
> times smaller (which is good, less code - less bugs). I also got rid
> of huge macros - very helpful if you need to debug something. There
> is no performance degradation, actually I even see very small performance
> improvement. I ran MTT with this patch and the result is the same as on
> trunk. I would like to commit this to the trunk. The patch is attached
> for everybody to try.
>
> --
> Gleb.
Re: [OMPI devel] matching code rewrite in OB1
On Tue, 11 Dec 2007, Gleb Natapov wrote:
> I did a rewrite of matching code in OB1. I made it much simpler and 2
> times smaller (which is good, less code - less bugs). I also got rid
> of huge macros - very helpful if you need to debug something. There
> is no performance degradation, actually I even see very small performance
> improvement. I ran MTT with this patch and the result is the same as on
> trunk. I would like to commit this to the trunk. The patch is attached
> for everybody to try.

I don't think we can live without those macros :). Out of curiosity, is
there any functionality that was removed as a result of this change?

I'll test on a couple systems over the next couple of days...

Brian
[OMPI devel] matching code rewrite in OB1
Hi,

I did a rewrite of matching code in OB1. I made it much simpler and 2
times smaller (which is good, less code - less bugs). I also got rid
of huge macros - very helpful if you need to debug something. There
is no performance degradation, actually I even see very small performance
improvement. I ran MTT with this patch and the result is the same as on
trunk. I would like to commit this to the trunk. The patch is attached
for everybody to try.

--
Gleb.

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index d3f7c37..299ae9e 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -184,244 +184,159 @@ void mca_pml_ob1_recv_frag_callback( mca_btl_base_module_t* btl,
         }
     }
 
-/**
- * Try and match the incoming message fragment to a generic
- * list of receives
- *
- * @param hdr Matching data from received fragment (IN)
- *
- * @param generic_receives Pointer to the receive list used for
- *        matching purposes. (IN)
- *
- * @return Matched receive
- *
- * This routine assumes that the appropriate matching locks are
- * set by the upper level routine.
- */
-#define MCA_PML_OB1_MATCH_GENERIC_RECEIVES(hdr,generic_receives,proc,return_match) \
-do {                                                                      \
-    /* local variables */                                                 \
-    mca_pml_ob1_recv_request_t *generic_recv = (mca_pml_ob1_recv_request_t *) \
-        opal_list_get_first(generic_receives);                            \
-    mca_pml_ob1_recv_request_t *last_recv = (mca_pml_ob1_recv_request_t *) \
-        opal_list_get_end(generic_receives);                              \
-    register int recv_tag, frag_tag = hdr->hdr_tag;                       \
-                                                                          \
-    /* Loop over the receives. If the received tag is less than zero */   \
-    /* enter in a special mode, where we match only our internal tags */  \
-    /* (such as those used by the collectives). */                        \
-    if( 0 <= frag_tag ) {                                                 \
-        for( ; generic_recv != last_recv;                                 \
-             generic_recv = (mca_pml_ob1_recv_request_t *)                \
-                 ((opal_list_item_t *)generic_recv)->opal_list_next) {    \
-            /* Check for a match */                                       \
-            recv_tag = generic_recv->req_recv.req_base.req_tag;           \
-            if ( (frag_tag == recv_tag) || (recv_tag == OMPI_ANY_TAG) ) { \
-                break;                                                    \
-            }                                                             \
-        }                                                                 \
-    } else {                                                              \
-        for( ; generic_recv != last_recv;                                 \
-             generic_recv = (mca_pml_ob1_recv_request_t *)                \
-                 ((opal_list_item_t *)generic_recv)->opal_list_next) {    \
-            /* Check for a match */                                       \
-            recv_tag = generic_recv->req_recv.req_base.req_tag;           \
-            if( OPAL_UNLIKELY(frag_tag == recv_tag) ) {                   \
-                break;                                                    \
-            }                                                             \
-        }                                                                 \
-    }                                                                     \
-    if( generic_recv != (mca_pml_ob1_recv_request_t *)                    \
-            opal_list_get_end(generic_receives) ) {                       \
-                                                                          \
-        /* Match made */                                                  \
-        return_match = generic_recv;                                      \
-                                                                          \
-        /* remove descriptor from posted specific ireceive list */        \
-        opal_list_remove_item(generic_receives,                           \
-                              (opal_list_item_t *)generic_recv);          \
-        PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q,    \
-                                 &(generic_recv->req_recv.req_base),      \
-