Re: [OMPI devel] openmpi-1.2.4 compilation error in orte_abort.c on Fedora 8 - patch included
0600, you mean? I don't really see why you want to share the file with the whole group.

  Thanks,
    george.

On Dec 10, 2007, at 5:15 PM, Ralph Castain wrote:

Nah, go ahead! Just change the permission to 0660 - that's a private file that others shouldn't really perturb.

Ralph

On 12/10/07 2:59 PM, "Jeff Squyres" wrote:

Yo Ralph --

I see you committed this to the ORTE-future branch. Any objections to me committing to trunk/v1.2?

(Thanks Sebastian -- stupid Fedora! ;-) )

On Dec 10, 2007, at 11:02 AM, Sebastian Schmitzdorff wrote:

Hi,

on Fedora 8 x86_64 openmpi-1.2.4 doesn't compile. A quick glance at the nightly openmpi snapshot leads me to the conclusion that this is still the case.

In function 'open',
    inlined from 'orte_abort' at runtime/orte_abort.c:91:
/usr/include/bits/fcntl2.h:51: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT in second argument needs 3 arguments
make[1]: *** [runtime/orte_abort.lo] Error 1
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/openmpi-1.2.4/orte'
make: *** [all-recursive] Error 1

There is a missing file mode in the "open" call at orte_abort.c:91; fcntl2.h doesn't allow this anymore. Please find the simple diff below.

--- runtime/orte_abort.c	2007-12-10 00:01:50.0 +0100
+++ test	2007-12-10 00:01:00.0 +0100
@@ -88,7 +88,7 @@
         ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
         goto CLEANUP;
     }
-    fd = open(abort_file, O_CREAT);
+    fd = open(abort_file, O_CREAT, 0666);
     if (0 < fd) close(fd);
 }

Hope this is the right place for the diff.

regards
sebastian

--
Sebastian Schmitzdorff - Managing Director
Hamburgnet              http://www.hamburgnet.de
Kottwitzstrasse 49      D-20253 Hamburg
fon: +49 40 736 72-322  fax: +49 40 736 72-321
Re: [OMPI devel] openmpi-1.2.4 compilation error in orte_abort.c on Fedora 8 - patch included
Er, ya -- duh. Oops. I'll fix...

On Dec 11, 2007, at 5:07 AM, George Bosilca wrote:

0600, you mean? I don't really see why you want to share the file with the whole group.

  Thanks,
    george.

--
Jeff Squyres
Cisco Systems
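The glibc _FORTIFY_SOURCE check that Sebastian hit enforces a general rule: whenever O_CREAT appears in the flags, open() must be given a third (mode) argument. A minimal standalone sketch of a corrected call (illustrative code, not the actual ORTE source; the function and path are made up):

    #include <fcntl.h>
    #include <unistd.h>

    static void touch_marker_file(const char *path)
    {
        /* O_CREAT requires the third (mode) argument; 0600 keeps the file
         * private to the owning user, as suggested in this thread. */
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd >= 0) {
            close(fd);
        }
    }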
Re: [OMPI devel] opal_condition_wait
Ok, I think I am understanding this a bit now. By not decrementing the signaled count, we are allowing a single broadcast to wake up the same thread multiple times, and are allowing a single cond_signal to wake up multiple threads. My understanding was that this behavior was not right, but upon further inspection of the pthreads documentation this behavior seems to be allowable.

Thanks for the clarifications,

Tim

Gleb Natapov wrote:

On Thu, Dec 06, 2007 at 09:46:45AM -0500, Tim Prins wrote:

Also, when we are using threads, there is a case where we do not decrement the signaled count, in condition.h:84. Gleb put this in in r9451, however the change does not make sense to me. I think that the signal count should always be decremented. Can anyone shine any light on these issues?

I made this change a long time ago (I wonder why I even tested the threaded build back then), but from what I recall after looking into the code and the log message, there was a deadlock when a signal broadcast didn't wake up all threads that were waiting on a condition variable. Suppose two threads wait on a condition C, and a third thread does a broadcast. This makes C->c_signaled equal to 2. Now one thread wakes up and decrements C->c_signaled by one, and before the other thread starts to run it calls condition_wait on C one more time. Because c_signaled is 1 it doesn't sleep and decrements c_signaled one more time. Now c_signaled is zero, and when the second thread wakes up it sees this and goes to sleep again. The solution was to check in condition_wait whether the condition is already signaled before going to sleep and, if so, to exit immediately.

--
	Gleb.
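To make the scenario concrete, here is a heavily simplified sketch of the early-exit behavior being discussed (illustrative pthreads code; this is not the real opal condition.h implementation):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             c_signaled;   /* outstanding wakeup "credits" */
    } cond_sketch_t;

    static void cond_sketch_broadcast(cond_sketch_t *c, int nwaiters)
    {
        pthread_mutex_lock(&c->lock);
        c->c_signaled += nwaiters;     /* one wakeup credit per waiter */
        pthread_cond_broadcast(&c->cond);
        pthread_mutex_unlock(&c->lock);
    }

    static void cond_sketch_wait(cond_sketch_t *c)
    {
        pthread_mutex_lock(&c->lock);
        if (c->c_signaled != 0) {
            /* Already signaled: return immediately without decrementing,
             * mirroring the early-exit described above.  A broadcast may
             * therefore wake the same thread more than once, which POSIX
             * treats as an allowable spurious wakeup. */
            pthread_mutex_unlock(&c->lock);
            return;
        }
        while (c->c_signaled == 0) {
            pthread_cond_wait(&c->cond, &c->lock);
        }
        c->c_signaled--;
        pthread_mutex_unlock(&c->lock);
    }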
Re: [OMPI devel] opal_condition_wait
Well, this makes some sense, although it still seems like this violates the spirit of condition variables.

Thanks,

Tim

Brian W. Barrett wrote:

On Thu, 6 Dec 2007, Tim Prins wrote:

Tim Prins wrote:

First, in opal_condition_wait (condition.h:97) we do not release the passed mutex if opal_using_threads() is not set. Is there a reason for this? I ask since this violates the way condition variables are supposed to work, and it seems like there are situations where this could cause deadlock.

So in (partial) answer to my own email, this is because throughout the code we do:

    OPAL_THREAD_LOCK(m)
    opal_condition_wait(cond, m);
    OPAL_THREAD_UNLOCK(m)

So this relies on opal_condition_wait not touching the lock. This explains it, but it still seems very wrong.

Yes, this is correct. The assumption is that you are using the conditional macro lock/unlock with the condition variables. I personally don't like this (I think we should have had macro conditional condition variables), but that obviously isn't how it works today.

The problem with always holding the lock when you enter the condition variable is that even when threading is disabled, calling a lock is at least as expensive as an add, possibly including a cache miss. So from a performance standpoint, this would be a no-go.

Also, when we are using threads, there is a case where we do not decrement the signaled count, in condition.h:84. Gleb put this in in r9451, however the change does not make sense to me. I think that the signal count should always be decremented. Can anyone shine any light on these issues?

Unfortunately, I can't add much on this front.

Brian
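To illustrate the pattern Brian describes, here is a rough sketch of a conditional lock macro that collapses to a cheap branch when threading is disabled (illustrative names only; in Open MPI the runtime check is opal_using_threads(), and the real macros are not written exactly like this):

    #include <pthread.h>
    #include <stdbool.h>

    /* Set once at init time, e.g. when MPI_THREAD_MULTIPLE is requested. */
    static bool my_using_threads = false;

    /* When threads are off, the "lock" is a predictable branch rather than
     * an atomic operation, which is the performance point made above. */
    #define MY_THREAD_LOCK(m) \
        do { if (my_using_threads) pthread_mutex_lock(m); } while (0)
    #define MY_THREAD_UNLOCK(m) \
        do { if (my_using_threads) pthread_mutex_unlock(m); } while (0)

Because the lock may never actually be taken, a condition-wait routine used with these macros cannot assume the mutex it is handed is really held, which is the behavior Tim found surprising.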
Re: [OMPI devel] opal_condition_wait
On Tue, Dec 11, 2007 at 10:27:55AM -0500, Tim Prins wrote:
> My understanding was that this behavior was not right, but upon further
> inspection of the pthreads documentation this behavior seems to be
> allowable.
>
I think that Open MPI does not implement condition variables in the strict sense. An Open MPI condition variable has to progress devices and wait for a condition simultaneously, not just wait until the condition is satisfied.

--
	Gleb.
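A rough sketch of what "progress devices while waiting" means in practice (illustrative code; progress_devices() is a made-up stand-in for Open MPI's progress engine, not a real function):

    #include <stdbool.h>

    /* Stand-in for the progress engine, which polls the BTLs and
     * completes outstanding requests while we wait. */
    extern void progress_devices(void);

    /* Single-threaded flavor: rather than blocking in the kernel, keep
     * driving the network "devices" until the predicate becomes true. */
    static void condition_wait_sketch(volatile bool *signaled)
    {
        while (!*signaled) {
            progress_devices();
        }
        *signaled = false;
    }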
[OMPI devel] matching code rewrite in OB1
Hi,

   I did a rewrite of matching code in OB1. I made it much simpler and 2 times smaller (which is good, less code - less bugs). I also got rid of huge macros - very helpful if you need to debug something. There is no performance degradation, actually I even see very small performance improvement. I ran MTT with this patch and the result is the same as on trunk. I would like to commit this to the trunk. The patch is attached for everybody to try.

--
	Gleb.

diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index d3f7c37..299ae9e 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -184,244 +184,159 @@ void mca_pml_ob1_recv_frag_callback( mca_btl_base_module_t* btl,
     }
 }
-/**
- * Try and match the incoming message fragment to a generic
- * list of receives
- *
- * @param hdr Matching data from received fragment (IN)
- *
- * @param generic_receives Pointer to the receive list used for
- * matching purposes. (IN)
- *
- * @return Matched receive
- *
- * This routine assumes that the appropriate matching locks are
- * set by the upper level routine.
- */
-#define MCA_PML_OB1_MATCH_GENERIC_RECEIVES(hdr,generic_receives,proc,return_match) \
-do { \
-    /* local variables */ \
-    mca_pml_ob1_recv_request_t *generic_recv = (mca_pml_ob1_recv_request_t *) \
-        opal_list_get_first(generic_receives); \
-    mca_pml_ob1_recv_request_t *last_recv = (mca_pml_ob1_recv_request_t *) \
-        opal_list_get_end(generic_receives); \
-    register int recv_tag, frag_tag = hdr->hdr_tag; \
- \
-    /* Loop over the receives. If the received tag is less than zero */ \
-    /* enter in a special mode, where we match only our internal tags */ \
-    /* (such as those used by the collectives.*/ \
-    if( 0 <= frag_tag ) { \
-        for( ; generic_recv != last_recv; \
-             generic_recv = (mca_pml_ob1_recv_request_t *) \
-                 ((opal_list_item_t *)generic_recv)->opal_list_next) { \
-            /* Check for a match */ \
-            recv_tag = generic_recv->req_recv.req_base.req_tag; \
-            if ( (frag_tag == recv_tag) || (recv_tag == OMPI_ANY_TAG) ) { \
-                break; \
-            } \
-        } \
-    } else { \
-        for( ; generic_recv != last_recv; \
-             generic_recv = (mca_pml_ob1_recv_request_t *) \
-                 ((opal_list_item_t *)generic_recv)->opal_list_next) { \
-            /* Check for a match */ \
-            recv_tag = generic_recv->req_recv.req_base.req_tag; \
-            if( OPAL_UNLIKELY(frag_tag == recv_tag) ) { \
-                break; \
-            } \
-        } \
-    } \
-    if( generic_recv != (mca_pml_ob1_recv_request_t *) \
-            opal_list_get_end(generic_receives) ) { \
- \
-        /* Match made */ \
-        return_match = generic_recv; \
- \
-        /* remove descriptor from posted specific ireceive list */ \
-        opal_list_remove_item(generic_receives, \
-                              (opal_list_item_t *)generic_recv); \
-        PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, \
-                                 &(generic_recv->req_recv.req_base), \
-                                 P
Re: [OMPI devel] matching code rewrite in OB1
On Tue, 11 Dec 2007, Gleb Natapov wrote:

I did a rewrite of matching code in OB1. I made it much simpler and 2 times smaller (which is good, less code - less bugs). I also got rid of huge macros - very helpful if you need to debug something. There is no performance degradation, actually I even see very small performance improvement. I ran MTT with this patch and the result is the same as on trunk. I would like to commit this to the trunk. The patch is attached for everybody to try.

I don't think we can live without those macros :). Out of curiosity, is there any functionality that was removed as a result of this change?

I'll test on a couple systems over the next couple of days...

Brian
Re: [OMPI devel] matching code rewrite in OB1
Gleb,
  I would suggest that before this is checked in this be tested on a system that has N-way network parallelism, where N is as large as you can find. This is a key bit of code for MPI correctness, and out-of-order operations will break it, so you want to maximize the chance for such operations.

Rich


On 12/11/07 10:54 AM, "Gleb Natapov" wrote:

> Hi,
>
>    I did a rewrite of matching code in OB1. I made it much simpler and 2
> times smaller (which is good, less code - less bugs). I also got rid
> of huge macros - very helpful if you need to debug something. There
> is no performance degradation, actually I even see very small performance
> improvement. I ran MTT with this patch and the result is the same as on
> trunk. I would like to commit this to the trunk. The patch is attached
> for everybody to try.
>
> --
> 	Gleb.
Re: [OMPI devel] matching code rewrite in OB1
Try UD, frags are reordered at a very high rate so should be a good test.

Andrew

Richard Graham wrote:

Gleb,
  I would suggest that before this is checked in this be tested on a system that has N-way network parallelism, where N is as large as you can find. This is a key bit of code for MPI correctness, and out-of-order operations will break it, so you want to maximize the chance for such operations.

Rich
Re: [OMPI devel] matching code rewrite in OB1
On Tue, Dec 11, 2007 at 11:00:51AM -0500, Richard Graham wrote:
> Gleb,
>   I would suggest that before this is checked in this be tested on a system
> that has N-way network parallelism, where N is as large as you can find.
> This is a key bit of code for MPI correctness, and out-of-order operations
> will break it, so you want to maximize the chance for such operations.
>
I started this rewrite while chasing this bug: https://svn.open-mpi.org/trac/ompi/ticket/1158. As you can see, OpenIB reorders fragments quite a bit, unfortunately :( No amount of testing is enough for such an important piece of code, of course.

--
	Gleb.
Re: [OMPI devel] matching code rewrite in OB1
On Tue, Dec 11, 2007 at 10:00:08AM -0600, Brian W. Barrett wrote:
> On Tue, 11 Dec 2007, Gleb Natapov wrote:
>
> > I did a rewrite of matching code in OB1. I made it much simpler and 2
> > times smaller (which is good, less code - less bugs). I also got rid
> > of huge macros - very helpful if you need to debug something. There
> > is no performance degradation, actually I even see very small performance
> > improvement. I ran MTT with this patch and the result is the same as on
> > trunk. I would like to commit this to the trunk. The patch is attached
> > for everybody to try.
>
> I don't think we can live without those macros :). Out of curiosity, is
> there any functionality that was removed as a result of this change?
>
No. The way out-of-order packets are handled changed a little bit, but they are still handled in the correct order.

> I'll test on a couple systems over the next couple of days...
>
Thanks!

--
	Gleb.
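For readers who have not been inside the matching path, a rough sketch of the usual way out-of-order fragments are kept in order, based on per-peer sequence numbers (illustrative types and helper names; this is not the OB1 code):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct frag {
        struct frag *next;
        uint16_t     seq;        /* per-peer sequence number from the header */
        /* ... match header, payload, etc. ... */
    } frag_t;

    typedef struct {
        uint16_t expected_seq;   /* next sequence number allowed to match */
        frag_t  *cant_match;     /* fragments that arrived early */
    } peer_state_t;

    /* Remove and return the fragment with the given sequence number, if any. */
    static frag_t *take_seq(frag_t **list, uint16_t seq)
    {
        for (frag_t **p = list; *p != NULL; p = &(*p)->next) {
            if ((*p)->seq == seq) {
                frag_t *f = *p;
                *p = f->next;
                return f;
            }
        }
        return NULL;
    }

    /* Fragments only go through the tag-matching logic in sequence order;
     * anything that arrives early is parked until its turn comes. */
    static void handle_frag(peer_state_t *peer, frag_t *frag)
    {
        if (frag->seq != peer->expected_seq) {
            frag->next = peer->cant_match;      /* park out-of-order arrival */
            peer->cant_match = frag;
            return;
        }
        do {
            /* match_one(frag) would do the tag/ANY_TAG matching here. */
            peer->expected_seq++;
            frag = take_seq(&peer->cant_match, peer->expected_seq);
        } while (frag != NULL);
    }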
Re: [OMPI devel] matching code rewrite in OB1
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
>
Good idea, I'll try this. BTW, I think the reason for such a high rate of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions (500) and processes them one by one, and if the progress function is called recursively the next 500 completions will be reordered versus the previous completions (the reordering happens on the receiver, not the sender).

> Andrew

--
	Gleb.
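For reference, the completion-polling pattern under discussion looks roughly like this (a sketch only; MCA_BTL_UD_NUM_WC is the real parameter referred to above, while the function and handler names are made up):

    #include <infiniband/verbs.h>

    enum { BATCH = 500 };   /* stands in for MCA_BTL_UD_NUM_WC */

    /* Drain up to BATCH completions in one progress pass.  If handling a
     * completion re-enters the progress path (as upper layers may do),
     * completions seen by the nested pass are processed before the rest
     * of this batch -- one way receive-side reordering can show up. */
    static int poll_batch(struct ibv_cq *cq)
    {
        struct ibv_wc wc[BATCH];
        int n = ibv_poll_cq(cq, BATCH, wc);
        for (int i = 0; i < n; ++i) {
            /* handle_completion(&wc[i]);  -- hypothetical per-fragment handler */
        }
        return n;
    }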
Re: [OMPI devel] matching code rewrite in OB1
Possibly, though I have results from a benchmark I've written indicating the reordering happens at the sender. I believe I found it was due to the QP striping trick I use to get more bandwidth -- if you back down to one QP (there's a define in the code you can change), the reordering rate drops.

Also I do not make any recursive calls to progress -- at least not directly in the BTL; I can't speak for the upper layers. The reason I do many completions at once is that it is a big help in turning around receive buffers, making it harder to run out of buffers and drop frags. I want to say there was some performance benefit as well but I can't say for sure.

Andrew

Gleb Natapov wrote:

On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:

Try UD, frags are reordered at a very high rate so should be a good test.

Good idea, I'll try this. BTW, I think the reason for such a high rate of reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions (500) and processes them one by one, and if the progress function is called recursively the next 500 completions will be reordered versus the previous completions (the reordering happens on the receiver, not the sender).
Re: [OMPI devel] matching code rewrite in OB1
On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> Possibly, though I have results from a benchmark I've written indicating
> the reordering happens at the sender. I believe I found it was due to
> the QP striping trick I use to get more bandwidth -- if you back down to
> one QP (there's a define in the code you can change), the reordering
> rate drops.
>
Ah, OK. My assumption was just from looking at the code, so I may be wrong.

> Also I do not make any recursive calls to progress -- at least not
> directly in the BTL; I can't speak for the upper layers. The reason I
> do many completions at once is that it is a big help in turning around
> receive buffers, making it harder to run out of buffers and drop frags.
> I want to say there was some performance benefit as well but I can't
> say for sure.
>
Currently the upper layers of Open MPI may call the BTL progress function recursively. I hope this will change some day.

--
	Gleb.
Re: [OMPI devel] matching code rewrite in OB1
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
>
mpi-ping works fine with the UD BTL and the patch.

> Andrew

--
	Gleb.
Re: [OMPI devel] matching code rewrite in OB1
I will re-iterate my concern. The code that is there now is mostly nine years old (with some mods made when it was brought over to Open MPI). It took about 2 months of testing on systems with 5-13 way network parallelism to track down all KNOWN race conditions. This code is at the center of MPI correctness, so I am VERY concerned about changing it w/o some very strong reasons. Not opposed, just very cautious.

Rich


On 12/11/07 11:47 AM, "Gleb Natapov" wrote:

> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>> Possibly, though I have results from a benchmark I've written indicating
>> the reordering happens at the sender. I believe I found it was due to
>> the QP striping trick I use to get more bandwidth -- if you back down to
>> one QP (there's a define in the code you can change), the reordering
>> rate drops.
> Ah, OK. My assumption was just from looking at the code, so I may be wrong.
>
>> Also I do not make any recursive calls to progress -- at least not
>> directly in the BTL; I can't speak for the upper layers. The reason I
>> do many completions at once is that it is a big help in turning around
>> receive buffers, making it harder to run out of buffers and drop frags.
>> I want to say there was some performance benefit as well but I can't
>> say for sure.
> Currently the upper layers of Open MPI may call the BTL progress function
> recursively. I hope this will change some day.
>
> --
> 	Gleb.
[OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's
Currently, alternate CMs cannot be called because ompi_btl_openib_connect_base_open forces a choice of either oob or xoob (and goes into an erroneous error path if you pick something else). This patch reorganizes ompi_btl_openib_connect_base_open so that new functions can easily be added. New Open functions were added to oob and xoob for the error handling.

I tested calling oob, xoob, and rdma_cm. oob happily allows connections to be established and throws no errors. xoob fails because ompi does not have it compiled in (and I have no connectx cards). rdma_cm calls the empty hooks and exits without connecting (thus throwing non-connection errors). All expected behavior.

Since this patch fixes the existing behavior, and is not necessarily tied to my implementing of rdma_cm, I think it is acceptable to go in now.

Thanks,
Jon

Index: ompi/mca/btl/openib/connect/btl_openib_connect_base.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(working copy)
@@ -50,8 +50,8 @@
  */
 int ompi_btl_openib_connect_base_open(void)
 {
-    int i;
-    char **temp, *a, *b;
+    char **temp, *a, *b, *defval;
+    int i, ret = OMPI_ERROR;
 
     /* Make an MCA parameter to select which connect module to use */
     temp = NULL;
@@ -66,40 +66,23 @@
 
     /* For XRC qps we must to use XOOB connection manager */
     if (mca_btl_openib_component.num_xrc_qps > 0) {
-        mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
-                                  "connect",
-                                  b, false, false,
-                                  "xoob", &param);
-        if (0 != strcmp("xoob", param)) {
-            opal_show_help("help-mpi-btl-openib.txt",
-                           "XRC with wrong OOB", true,
-                           orte_system_info.nodename,
-                           mca_btl_openib_component.num_xrc_qps);
-            return OMPI_ERROR;
-        }
+        defval = "xoob";
     } else {
         /* For all others we should use OOB */
-        mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
-                                  "connect",
-                                  b, false, false,
-                                  "oob", &param);
-        if (0 != strcmp("oob", param)) {
-            opal_show_help("help-mpi-btl-openib.txt",
-                           "SRQ or PP with wrong OOB", true,
-                           orte_system_info.nodename,
-                           mca_btl_openib_component.num_srq_qps,
-                           mca_btl_openib_component.num_pp_qps);
-            return OMPI_ERROR;
-        }
+        defval = "oob";
     }
 
+    mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
+                              "connect", b, false, false, defval, &param);
+
     /* Call the open function on all the connect modules */
     for (i = 0; NULL != all[i]; ++i) {
-        if (NULL != all[i]->bcf_open) {
-            all[i]->bcf_open();
+        if (0 == strcmp(all[i]->bcf_name, param)) {
+            ret = all[i]->bcf_open();
+            break;
         }
     }
-    return OMPI_SUCCESS;
 }
Index: ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c	(revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c	(working copy)
@@ -28,11 +28,7 @@
 static int ibcm_open(void)
 {
-    mca_base_param_reg_int(&mca_btl_openib_component.super.btl_version,
-                           "btl_openib_connect_ibcm_foo",
-                           "A dummy help message", false, false,
-                           17, NULL);
-
+    printf("ibcm open\n");
     return OMPI_SUCCESS;
 }
Index: ompi/mca/btl/openib/connect/btl_openib_connect_oob.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_oob.c	(revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_oob.c	(working copy)
@@ -22,6 +22,8 @@
 #include "ompi_config.h"
 
+#include "opal/util/show_help.h"
+
 #include "orte/mca/ns/base/base.h"
 #include "orte/mca/oob/base/base.h"
 #include "orte/mca/rml/rml.h"
@@ -39,6 +41,7 @@
     ENDPOINT_CONNECT_ACK
 } connect_message_type_t;
 
+static int oob_open(void);
 static int oob_init(void);
 static int oob_start_connect(mca_btl_base_endpoint_t *e);
 static int oob_finalize(void);
@@ -67,8 +70,8 @@
  */
 ompi_btl_openib_connect_base_funcs_t ompi_btl_openib_connect_oob = {
     "oob",
-    /* No need for "open */
-    NULL,
+    /* Open */
+    oob_open,
     /* Init */
     oob_init,
     /* Connect */
@@ -78,6 +81,23 @@
 };
 
 /*
+ * Open function.
+ */
+static int oob_open(void)
+{
+    if (mca_btl_openib_component.num_xrc_qps > 0) {
+        opal_show_help("help-mpi-btl-openib.txt",
+                       "SRQ or PP with wrong OOB", true,
+                       orte_system_info.nodename,
+
[OMPI devel] Fwd: Subversion and trac outage
Begin forwarded message:

From: DongInn Kim
Date: December 11, 2007 6:20:03 PM EST
To: Jeff Squyres
Subject: Subversion and trac outage

Hi,

I am sorry for the unexpected outage of the Subversion and Trac services for Open MPI. There was a mistake in handling the ACL information about blocking some specific ports this afternoon. Hence, the following websites are not accessible now:

http(s)://svn.open-mpi.org/svn/ompi
http://svn.open-mpi.org/trac/ompi

I believe that this will be fixed first thing tomorrow morning. I will let you know as soon as the services are available. Again, I am really sorry about this incident.

Best Regards,

--
- DongInn

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's
Hmm. I don't think that we want to put knowledge of XRC in the OOB CPC (and vice versa). That seems like an abstraction violation. I didn't like that XRC knowledge was put in the connect base either, but I was too busy to argue with it. :-)

Isn't there a better way somehow? Perhaps we should have "select" call *all* the functions and accept back a priority. The one with the highest priority then wins. This is quite similar to much of the other selection logic in OMPI.

Sidenote: Keep in mind that there are some changes coming to select CPCs on a per-endpoint basis (I can't look up the trac ticket right now...). This makes things a little complicated -- do we need btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to include/exclude CPCs (because you might need more than one CPC in a single job)? That wouldn't be hard to do.

But then what do we do if someone sets to use some XRC QPs and selects to use OOB or RDMA CM? How do we catch this and print an error? It doesn't seem right to put the "if num_xrc_qps>0" check in every CPC. What happens if you try to make an XRC QP when not using xoob? Where is the error detected and what kind of error message do we print?

Also, I'm not sure why the #if/#else is there for xoob (i.e., having empty/printf functions there when XRC support is compiled out) -- if xoob was disabled during compilation, then it simply should not be compiled and therefore not be there at all at run-time. If a user selects the xoob CPC, then we should print a message from the base that that CPC doesn't exist in the installation. Correspondingly, we can make an info MCA param in the btl openib that shows which CPCs are available (we already have this information -- it's easy enough to put this in an info MCA param).

On Dec 11, 2007, at 6:59 PM, Jon Mason wrote:

Currently, alternate CMs cannot be called because ompi_btl_openib_connect_base_open forces a choice of either oob or xoob (and goes into an erroneous error path if you pick something else). This patch reorganizes ompi_btl_openib_connect_base_open so that new functions can easily be added. New Open functions were added to oob and xoob for the error handling.

I tested calling oob, xoob, and rdma_cm. oob happily allows connections to be established and throws no errors. xoob fails because ompi does not have it compiled in (and I have no connectx cards). rdma_cm calls the empty hooks and exits without connecting (thus throwing non-connection errors). All expected behavior.

Since this patch fixes the existing behavior, and is not necessarily tied to my implementing of rdma_cm, I think it is acceptable to go in now.

Thanks,
Jon
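To make the priority-based selection Jeff suggests concrete, here is a rough sketch (illustrative only: bcf_name and bcf_open mirror fields visible in the patch above, while bcf_query_priority and the surrounding names are hypothetical):

    /* Hypothetical CPC descriptor; only bcf_name/bcf_open mirror the real
     * ompi_btl_openib_connect_base_funcs_t, the priority hook is invented. */
    typedef struct {
        const char *bcf_name;
        int (*bcf_query_priority)(void);   /* returns < 0 if unusable */
        int (*bcf_open)(void);
    } cpc_funcs_t;

    /* Ask every CPC for a priority and open the highest-priority one,
     * instead of hard-coding the oob/xoob choice in the base. */
    static int select_cpc(cpc_funcs_t **all, cpc_funcs_t **selected)
    {
        int best = -1;
        *selected = NULL;
        for (int i = 0; all[i] != NULL; ++i) {
            int prio = all[i]->bcf_query_priority();
            if (prio > best) {
                best = prio;
                *selected = all[i];
            }
        }
        if (*selected == NULL) {
            return -1;                     /* no usable CPC found */
        }
        return (*selected)->bcf_open();
    }

A CPC that cannot run (e.g., xoob without XRC support compiled in) would simply return a negative priority, which sidesteps the "wrong OOB" error paths that the patch currently moves into each open function.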