Re: [OMPI devel] openmpi-1.2.4 compilation error in orte_abort.c on Fedora 8 - patch included

2007-12-11 Thread George Bosilca
0600 you mean? I don't really see why you would want to share the file
with the whole group.


  Thanks,
george.

On Dec 10, 2007, at 5:15 PM, Ralph Castain wrote:

Nah, go ahead! Just change the permission to 0660 - that's a private file
that others shouldn't really perturb.

Ralph



On 12/10/07 2:59 PM, "Jeff Squyres"  wrote:


Yo Ralph --

I see you committed this to the ORTE-future branch.  Any objections to
me committing to trunk/v1.2?

(Thanks Sebastian -- stupid Fedora! ;-) )


On Dec 10, 2007, at 11:02 AM, Sebastian Schmitzdorff wrote:


Hi,

on Fedora 8 x86_64 openmpi-1.2.4 doesn't compile.
A quick glance at the nightly openmpi snapshot leads me to the
conclusion that this is still the case.


In function 'open',
 inlined from 'orte_abort' at runtime/orte_abort.c:91:
/usr/include/bits/fcntl2.h:51: error: call to '__open_missing_mode'
declared with attribute error: open with O_CREAT in second argument
needs 3 arguments
make[1]: *** [runtime/orte_abort.lo] Error 1
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/openmpi-1.2.4/orte'
make: *** [all-recursive] Error 1


There is a missing file mode in the "open" call at orte_abort.c:91.
fcntl2.h doesn't allow this anymore.

Please find the simple diff below.


--- runtime/orte_abort.c	2007-12-10 00:01:50.0 +0100
+++ test	2007-12-10 00:01:00.0 +0100
@@ -88,7 +88,7 @@
  ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
  goto CLEANUP;
  }
-fd = open(abort_file, O_CREAT);
+fd = open(abort_file, O_CREAT, 0666);
  if (0 < fd) close(fd);
  }


Hope this is the right place for the diff.

regards
sebastian

--

Sebastian Schmitzdorff - Managing Director
Hamburgnet
http://www.hamburgnet.de
Kottwitzstrasse 49 D-20253 Hamburg
fon: +49 40 736 72-322 fax: +49 40 736 72-321

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] openmpi-1.2.4 compilation error in orte_abort.c on Fedora 8 - patch included

2007-12-11 Thread Jeff Squyres

Er, ya -- duh.  Oops.  I'll fix...


On Dec 11, 2007, at 5:07 AM, George Bosilca wrote:

0600 you mean? I don't really see why you would want to share the file
with the whole group.


 Thanks,
   george.

On Dec 10, 2007, at 5:15 PM, Ralph Castain wrote:

Nah, go ahead! Just change the permission to 0660 - that's a private file
that others shouldn't really perturb.

Ralph



On 12/10/07 2:59 PM, "Jeff Squyres"  wrote:


Yo Ralph --

I see you committed this to the ORTE-future branch.  Any objections to
me committing to trunk/v1.2?

(Thanks Sebastian -- stupid Fedora! ;-) )


On Dec 10, 2007, at 11:02 AM, Sebastian Schmitzdorff wrote:


Hi,

on Fedora 8 x86_64 openmpi-1.2.4 doesn't compile.
A quick glance at the nightly openmpi snapshot leads me to the
conclusion that this is still the case.


In function 'open',
inlined from 'orte_abort' at runtime/orte_abort.c:91:
/usr/include/bits/fcntl2.h:51: error: call to '__open_missing_mode'
declared with attribute error: open with O_CREAT in second argument
needs 3 arguments
make[1]: *** [runtime/orte_abort.lo] Error 1
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/openmpi-1.2.4/orte'
make: *** [all-recursive] Error 1


There is a missing file mode in the "open" call at orte_abort.c:91.
fcntl2.h doesn't allow this anymore.

Please find the simple diff below.


--- runtime/orte_abort.c	2007-12-10 00:01:50.0 +0100
+++ test	2007-12-10 00:01:00.0 +0100
@@ -88,7 +88,7 @@
 ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
 goto CLEANUP;
 }
-fd = open(abort_file, O_CREAT);
+fd = open(abort_file, O_CREAT, 0666);
 if (0 < fd) close(fd);
 }


Hope this is the right place for the diff.

regards
sebastian

--

Sebastian Schmitzdorff - Managing Director
Hamburgnet
http://www.hamburgnet.de
Kottwitzstrasse 49 D-20253 Hamburg
fon: +49 40 736 72-322 fax: +49 40 736 72-321

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems
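
For reference, glibc's _FORTIFY_SOURCE wrapper in fcntl2.h rejects any
open() call that passes O_CREAT without a mode argument, which is exactly
the compile error above. A minimal sketch of the agreed fix, using the
owner-only 0600 mode suggested in this thread (the helper function is
illustrative, not the actual orte_abort() code):

#include <fcntl.h>
#include <unistd.h>

/* Sketch only: abort_file is assumed to hold the path computed
 * earlier in orte_abort(). With O_CREAT a third (mode) argument is
 * mandatory under fcntl2.h; 0600 keeps the marker file private to
 * its owner. */
static void touch_abort_file(const char *abort_file)
{
    int fd = open(abort_file, O_CREAT, 0600);
    if (fd >= 0) {
        close(fd);
    }
}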


Re: [OMPI devel] opal_condition_wait

2007-12-11 Thread Tim Prins
Ok, I think I am understanding this a bit now. By not decrementing the 
signaled count, we are allowing a single broadcast to wake up the same 
thread multiple times, and are allowing a single cond_signal to wake up 
multiple threads.


My understanding was that this behavior was not right, but upon further 
inspection of the pthreads documentation this behavior seems to be 
allowable.


Thanks for the clarifications,

Tim

Gleb Natapov wrote:

On Thu, Dec 06, 2007 at 09:46:45AM -0500, Tim Prins wrote:
Also, when we are using threads, there is a case where we do not 
decrement the signaled count, in condition.h:84. Gleb put this in in 
r9451, however the change does not make sense to me. I think that the 
signal count should always be decremented.


Can anyone shine any light on these issues?


I made this change a long time ago (I wonder why I even tested the threaded
build back then), but from what I recall from the code and the log message,
there was a deadlock when a signal broadcast didn't wake up all the threads
that were waiting on a condition variable. Suppose two threads wait on
a condition C, and a third thread does a broadcast. This sets C->c_signaled
to 2. Now one thread wakes up and decrements C->c_signaled by one,
and, before the other thread starts to run, it calls condition_wait on C
one more time. Because c_signaled is 1 it doesn't sleep, and it decrements
c_signaled one more time. Now c_signaled is zero, so when the second thread
wakes up it sees this and goes back to sleep. The solution was to check in
condition_wait whether the condition is already signaled before going to
sleep and, if so, to exit immediately.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
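
A toy model of the counting behavior described above (this is not the
actual opal_condition_t implementation, just a minimal sketch of a
counted-signal condition variable in which a waiter that finds the counter
already positive consumes a signal and returns instead of sleeping):

#include <pthread.h>

/* Simplified model: c_signaled counts outstanding wake-ups, and a
 * broadcast adds one wake-up per waiter. A thread that calls wait
 * while c_signaled > 0 consumes a signal and returns at once -- the
 * check that fixed the deadlock described above. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  c;
    int c_signaled;
    int c_waiting;
} toy_condition_t;

static void toy_wait(toy_condition_t *cond)
{
    pthread_mutex_lock(&cond->m);
    if (cond->c_signaled > 0) {            /* already signaled: don't sleep */
        cond->c_signaled--;
        pthread_mutex_unlock(&cond->m);
        return;
    }
    cond->c_waiting++;
    while (0 == cond->c_signaled) {
        pthread_cond_wait(&cond->c, &cond->m);
    }
    cond->c_signaled--;
    cond->c_waiting--;
    pthread_mutex_unlock(&cond->m);
}

static void toy_broadcast(toy_condition_t *cond)
{
    pthread_mutex_lock(&cond->m);
    cond->c_signaled += cond->c_waiting;   /* one wake-up per waiter */
    pthread_cond_broadcast(&cond->c);
    pthread_mutex_unlock(&cond->m);
}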




Re: [OMPI devel] opal_condition_wait

2007-12-11 Thread Tim Prins
Well, this makes some sense, although it still seems like this violates 
the spirit of condition variables.


Thanks,

Tim

Brian W. Barrett wrote:

On Thu, 6 Dec 2007, Tim Prins wrote:


Tim Prins wrote:

First, in opal_condition_wait (condition.h:97) we do not release the
passed mutex if opal_using_threads() is not set. Is there a reason for
this? I ask since this violates the way condition variables are supposed
to work, and it seems like there are situations where this could cause
deadlock.

So in (partial) answer to my own email, this is because throughout the
code we do:
OPAL_THREAD_LOCK(m)
opal_condition_wait(cond, m);
OPAL_THREAD_UNLOCK(m)

So this relies on opal_condition_wait not touching the lock. This
explains it, but it still seems very wrong.


Yes, this is correct.  The assumption is that you are using the 
conditional macro lock/unlock with the condition variables.  I personally 
don't like this (I think we should have had macro conditional condition 
variables), but that obviously isn't how it works today.


The problem with always holding the lock when you enter the condition 
variable is that even when threading is disabled, calling a lock is at 
least as expensive as an add, possibly including a cache miss.  So from a 
performance standpoint, this would be a no-go.



Also, when we are using threads, there is a case where we do not
decrement the signaled count, in condition.h:84. Gleb put this in in
r9451, however the change does not make sense to me. I think that the
signal count should always be decremented.

Can anyone shine any light on these issues?


Unfortunately, I can't add much on this front.

Brian
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
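
To make the calling convention concrete, here is a hedged sketch of the
conditional-lock pattern Brian describes (the macro bodies below are
illustrative, not the actual OPAL_THREAD_LOCK/UNLOCK definitions from the
opal/threads headers):

/* Illustrative only -- not the actual OPAL definitions. In a
 * single-threaded build the "lock" costs (nearly) nothing, which is
 * why opal_condition_wait cannot assume it may unlock and relock
 * the mutex internally. */
#define MY_THREAD_LOCK(m)            \
    do {                             \
        if (opal_using_threads()) {  \
            opal_mutex_lock(m);      \
        }                            \
    } while (0)

#define MY_THREAD_UNLOCK(m)          \
    do {                             \
        if (opal_using_threads()) {  \
            opal_mutex_unlock(m);    \
        }                            \
    } while (0)

/* Usage, matching the pattern quoted above:
 *     MY_THREAD_LOCK(&lock);
 *     opal_condition_wait(&cond, &lock);
 *     MY_THREAD_UNLOCK(&lock);
 */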




Re: [OMPI devel] opal_condition_wait

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 10:27:55AM -0500, Tim Prins wrote:
> My understanding was that this behavior was not right, but upon further 
> inspection of the pthreads documentation this behavior seems to be 
> allowable.
> 
I think that Open MPI does not implement condition variables in the strict
sense. An Open MPI condition variable has to progress devices and wait for the
condition simultaneously, not just wait until the condition is satisfied.

--
Gleb.
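
The shape of such a progressing wait, as a minimal sketch for a
single-threaded build (opal_progress() is the real OPAL progress entry
point; the completion flag and its setter are hypothetical):

/* Sketch: wait on a flag while driving communication progress. In a
 * single-threaded build nothing else will run, so the waiter itself
 * must poll opal_progress() or the condition can never become true. */
static void wait_for_completion(volatile int *complete)
{
    while (!*complete) {
        opal_progress();   /* may retire the event that sets *complete */
    }
}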


[OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
Hi,

   I did a rewrite of the matching code in OB1. I made it much simpler and two
times smaller (which is good: less code, fewer bugs). I also got rid
of the huge macros, which is very helpful if you need to debug something. There
is no performance degradation; actually, I even see a very small performance
improvement. I ran MTT with this patch and the result is the same as on
trunk. I would like to commit this to the trunk. The patch is attached
for everybody to try.

--
Gleb.
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index d3f7c37..299ae9e 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -184,244 +184,159 @@ void mca_pml_ob1_recv_frag_callback( mca_btl_base_module_t* btl,
 }
 }

-/**
- * Try and match the incoming message fragment to a generic
- * list of receives
- *
- * @param hdr Matching data from received fragment (IN)
- *
- * @param generic_receives Pointer to the receive list used for
- * matching purposes. (IN)
- *
- * @return Matched receive
- *
- * This routine assumes that the appropriate matching locks are
- * set by the upper level routine.
- */
-#define MCA_PML_OB1_MATCH_GENERIC_RECEIVES(hdr,generic_receives,proc,return_match) \
-do {   \
-/* local variables */  \
-mca_pml_ob1_recv_request_t *generic_recv = (mca_pml_ob1_recv_request_t *)  \
- opal_list_get_first(generic_receives);\
-mca_pml_ob1_recv_request_t *last_recv = (mca_pml_ob1_recv_request_t *) \
-opal_list_get_end(generic_receives);   \
-register int recv_tag, frag_tag = hdr->hdr_tag;\
-   \
-/* Loop over the receives. If the received tag is less than zero  */   \
-/* enter in a special mode, where we match only our internal tags */   \
-/* (such as those used by the collectives.*/   \
-if( 0 <= frag_tag ) {  \
-for( ; generic_recv != last_recv;  \
- generic_recv = (mca_pml_ob1_recv_request_t *) \
- ((opal_list_item_t *)generic_recv)->opal_list_next) { \
-/* Check for a match */\
-recv_tag = generic_recv->req_recv.req_base.req_tag;\
-if ( (frag_tag == recv_tag) || (recv_tag == OMPI_ANY_TAG) ) {  \
-break; \
-}  \
-}  \
-} else {   \
-for( ; generic_recv != last_recv;  \
- generic_recv = (mca_pml_ob1_recv_request_t *) \
- ((opal_list_item_t *)generic_recv)->opal_list_next) { \
-/* Check for a match */\
-recv_tag = generic_recv->req_recv.req_base.req_tag;\
-if( OPAL_UNLIKELY(frag_tag == recv_tag) ) {\
-break; \
-}  \
-}  \
-}  \
-if( generic_recv != (mca_pml_ob1_recv_request_t *) \
-opal_list_get_end(generic_receives) ) {\
-   \
-/* Match made */   \
-return_match = generic_recv;   \
-   \
-/* remove descriptor from posted specific ireceive list */ \
-opal_list_remove_item(generic_receives,\
-  (opal_list_item_t *)generic_recv);   \
-PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q, \
- &(generic_recv->req_recv.req_base),   \
- P
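
The deleted hunk above gives the flavor of the change: the giant match
macro becomes a plain function. A hedged sketch of the generic-receive
matching written that way (field names follow the macro above, but this is
not the committed code):

/* Sketch of MCA_PML_OB1_MATCH_GENERIC_RECEIVES as a static function.
 * Negative tags are internal (e.g. collectives) and must match
 * exactly, never OMPI_ANY_TAG. Returns the matched receive, removed
 * from the list, or NULL if the caller must queue the fragment. */
static mca_pml_ob1_recv_request_t *
match_generic_receives(mca_pml_ob1_match_hdr_t *hdr, opal_list_t *receives)
{
    int frag_tag = hdr->hdr_tag;
    opal_list_item_t *i;

    for (i = opal_list_get_first(receives);
         i != opal_list_get_end(receives);
         i = opal_list_get_next(i)) {
        mca_pml_ob1_recv_request_t *recv = (mca_pml_ob1_recv_request_t *)i;
        int recv_tag = recv->req_recv.req_base.req_tag;

        if (frag_tag == recv_tag ||
            (frag_tag >= 0 && OMPI_ANY_TAG == recv_tag)) {
            opal_list_remove_item(receives, i);   /* match made */
            return recv;
        }
    }
    return NULL;
}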

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Brian W. Barrett

On Tue, 11 Dec 2007, Gleb Natapov wrote:


  I did a rewrite of matching code in OB1. I made it much simpler and 2
times smaller (which is good, less code - less bugs). I also got rid
of huge macros - very helpful if you need to debug something. There
is no performance degradation, actually I even see very small performance
improvement. I ran MTT with this patch and the result is the same as on
trunk. I would like to commit this to the trunk. The patch is attached
for everybody to try.


I don't think we can live without those macros :).  Out of curiosity, is
there any functionality that was removed as a result of this change?


I'll test on a couple systems over the next couple of days...

Brian


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Richard Graham
Gleb,
  I would suggest that before this is checked in, it be tested on a system
that has N-way network parallelism, where N is as large as you can find.
This is a key bit of code for MPI correctness, and out-of-order operations
will break it, so you want to maximize the chance of such operations.

Rich


On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:

> Hi,
> 
>I did a rewrite of matching code in OB1. I made it much simpler and 2
> times smaller (which is good, less code - less bugs). I also got rid
> of huge macros - very helpful if you need to debug something. There
> is no performance degradation, actually I even see very small performance
> improvement. I ran MTT with this patch and the result is the same as on
> trunk. I would like to commit this to the trunk. The patch is attached
> for everybody to try.
> 
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Andrew Friedley

Try UD; frags are reordered at a very high rate, so it should be a good test.

Andrew

Richard Graham wrote:

Gleb,
  I would suggest that before this is checked in this be tested on a system
that has N-way network parallelism, where N is as large as you can find.
This is a key bit of code for MPI correctness, and out-of-order operations
will break it, so you want to maximize the chance for such operations.

Rich


On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:


Hi,

   I did a rewrite of matching code in OB1. I made it much simpler and 2
times smaller (which is good, less code - less bugs). I also got rid
of huge macros - very helpful if you need to debug something. There
is no performance degradation, actually I even see very small performance
improvement. I ran MTT with this patch and the result is the same as on
trunk. I would like to commit this to the trunk. The patch is attached
for everybody to try.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 11:00:51AM -0500, Richard Graham wrote:
> Gleb,
>   I would suggest that before this is checked in this be tested on a system
> that has N-way network parallelism, where N is as large as you can find.
> This is a key bit of code for MPI correctness, and out-of-order operations
> will break it, so you want to maximize the chance for such operations.
> 
I started this rewrite while chasing this bug:
https://svn.open-mpi.org/trac/ompi/ticket/1158.
As you can see, OpenIB reorders fragments quite a bit, unfortunately :(
No amount of testing is enough for such an important piece of code, of course.

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 10:00:08AM -0600, Brian W. Barrett wrote:
> On Tue, 11 Dec 2007, Gleb Natapov wrote:
> 
> >   I did a rewrite of matching code in OB1. I made it much simpler and 2
> > times smaller (which is good, less code - less bugs). I also got rid
> > of huge macros - very helpful if you need to debug something. There
> > is no performance degradation, actually I even see very small performance
> > improvement. I ran MTT with this patch and the result is the same as on
> > trunk. I would like to commit this to the trunk. The patch is attached
> > for everybody to try.
> 
> I don't think we can live without those macros :).  Out of curiousity, is 
> there any functionality that was removed as a result of this change?
No. The way out-of-order packets are handled changed a little bit, but
they are still handled in the correct order.

> 
> I'll test on a couple systems over the next couple of days...
> 
Thanks!

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
Good idea, I'll try this. BTW, I think the reason for such a high rate of
reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
(500) and processes them one by one; if the progress function is called
recursively, the next 500 completions will be reordered versus the previous
completions (the reordering happens on the receiver, not the sender).

> 
> Andrew
> 
> Richard Graham wrote:
> > Gleb,
> >   I would suggest that before this is checked in this be tested on a system
> > that has N-way network parallelism, where N is as large as you can find.
> > This is a key bit of code for MPI correctness, and out-of-order operations
> > will break it, so you want to maximize the chance for such operations.
> > 
> > Rich
> > 
> > 
> > On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> > 
> >> Hi,
> >>
> >>I did a rewrite of matching code in OB1. I made it much simpler and 2
> >> times smaller (which is good, less code - less bugs). I also got rid
> >> of huge macros - very helpful if you need to debug something. There
> >> is no performance degradation, actually I even see very small performance
> >> improvement. I ran MTT with this patch and the result is the same as on
> >> trunk. I would like to commit this to the trunk. The patch is attached
> >> for everybody to try.
> >>
> >> --
> >> Gleb.
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.
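
Either way, the matching layer has to tolerate the reordering: OB1 tags
each fragment with a per-peer sequence number and defers anything that
arrives early. A minimal receiver-side sketch of that pattern (all names
here are illustrative, not OB1's):

#include <stdint.h>

/* Process a fragment only if it carries the next expected sequence
 * number; otherwise park it and drain the parked list whenever the
 * expected counter advances. */
typedef struct frag {
    struct frag *next;
    uint16_t seq;
    /* ... payload ... */
} frag_t;

static uint16_t expected_seq = 0;
static frag_t *ooo_list = NULL;        /* fragments that arrived early */

static void deliver(frag_t *f);        /* assumed: the actual matching */

static void on_fragment(frag_t *f)
{
    int found;

    if (f->seq != expected_seq) {      /* early: stash it */
        f->next = ooo_list;
        ooo_list = f;
        return;
    }
    deliver(f);
    expected_seq++;
    do {                               /* drain now-in-order fragments */
        frag_t **p;
        found = 0;
        for (p = &ooo_list; NULL != *p; p = &(*p)->next) {
            if ((*p)->seq == expected_seq) {
                frag_t *g = *p;
                *p = g->next;
                deliver(g);
                expected_seq++;
                found = 1;
                break;
            }
        }
    } while (found);
}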


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Andrew Friedley
Possibly, though I have results from a benchmark I've written indicating 
the reordering happens at the sender.  I believe I found it was due to 
the QP striping trick I use to get more bandwidth -- if you back down to 
one QP (there's a define in the code you can change), the reordering 
rate drops.


Also I do not make any recursive calls to progress -- at least not 
directly in the BTL; I can't speak for the upper layers.  The reason I 
do many completions at once is that it is a big help in turning around 
receive buffers, making it harder to run out of buffers and drop frags. 
 I want to say there was some performance benefit as well but I can't 
say for sure.


Andrew

Gleb Natapov wrote:

On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:

Try UD, frags are reordered at a very high rate so should be a good test.

Good Idea I'll try this. BTW I thing the reason for such a high rate of
reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
(500) and process them one by one and if progress function is called
recursively next 500 completion will be reordered versus previous
completions (reordering happens on a receiver, not sender).


Andrew

Richard Graham wrote:

Gleb,
  I would suggest that before this is checked in this be tested on a system
that has N-way network parallelism, where N is as large as you can find.
This is a key bit of code for MPI correctness, and out-of-order operations
will break it, so you want to maximize the chance for such operations.

Rich


On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:


Hi,

   I did a rewrite of matching code in OB1. I made it much simpler and 2
times smaller (which is good, less code - less bugs). I also got rid
of huge macros - very helpful if you need to debug something. There
is no performance degradation, actually I even see very small performance
improvement. I ran MTT with this patch and the result is the same as on
trunk. I would like to commit this to the trunk. The patch is attached
for everybody to try.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
> Possibly, though I have results from a benchmark I've written indicating 
> the reordering happens at the sender.  I believe I found it was due to 
> the QP striping trick I use to get more bandwidth -- if you back down to 
> one QP (there's a define in the code you can change), the reordering 
> rate drops.
Ah, OK. My assumption was just from looking into code, so I may be
wrong.

> 
> Also I do not make any recursive calls to progress -- at least not 
> directly in the BTL; I can't speak for the upper layers.  The reason I 
> do many completions at once is that it is a big help in turning around 
> receive buffers, making it harder to run out of buffers and drop frags. 
>   I want to say there was some performance benefit as well but I can't 
> say for sure.
Currently the upper layers of Open MPI may call a BTL progress function
recursively. I hope this will change some day.

> 
> Andrew
> 
> Gleb Natapov wrote:
> > On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> >> Try UD, frags are reordered at a very high rate so should be a good test.
> > Good Idea I'll try this. BTW I thing the reason for such a high rate of
> > reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
> > (500) and process them one by one and if progress function is called
> > recursively next 500 completion will be reordered versus previous
> > completions (reordering happens on a receiver, not sender).
> > 
> >> Andrew
> >>
> >> Richard Graham wrote:
> >>> Gleb,
> >>>   I would suggest that before this is checked in this be tested on a 
> >>> system
> >>> that has N-way network parallelism, where N is as large as you can find.
> >>> This is a key bit of code for MPI correctness, and out-of-order operations
> >>> will break it, so you want to maximize the chance for such operations.
> >>>
> >>> Rich
> >>>
> >>>
> >>> On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> >>>
>  Hi,
> 
> I did a rewrite of matching code in OB1. I made it much simpler and 2
>  times smaller (which is good, less code - less bugs). I also got rid
>  of huge macros - very helpful if you need to debug something. There
>  is no performance degradation, actually I even see very small performance
>  improvement. I ran MTT with this patch and the result is the same as on
>  trunk. I would like to commit this to the trunk. The patch is attached
>  for everybody to try.
> 
>  --
>  Gleb.
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > --
> > Gleb.
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
> Try UD, frags are reordered at a very high rate so should be a good test.
mpi-ping works fine with the UD BTL and the patch.

> 
> Andrew
> 
> Richard Graham wrote:
> > Gleb,
> >   I would suggest that before this is checked in this be tested on a system
> > that has N-way network parallelism, where N is as large as you can find.
> > This is a key bit of code for MPI correctness, and out-of-order operations
> > will break it, so you want to maximize the chance for such operations.
> > 
> > Rich
> > 
> > 
> > On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> > 
> >> Hi,
> >>
> >>I did a rewrite of matching code in OB1. I made it much simpler and 2
> >> times smaller (which is good, less code - less bugs). I also got rid
> >> of huge macros - very helpful if you need to debug something. There
> >> is no performance degradation, actually I even see very small performance
> >> improvement. I ran MTT with this patch and the result is the same as on
> >> trunk. I would like to commit this to the trunk. The patch is attached
> >> for everybody to try.

--
Gleb.


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Richard Graham
I will re-iterate my concern.  The code that is there now is mostly nine
years old (with some mods made when it was brought over to Open MPI).  It
took about 2 months of testing on systems with 5-13 way network parallelism
to track down all KNOWN race conditions.  This code is at the center of MPI
correctness, so I am VERY concerned about changing it w/o some very strong
reasons.  Not opposed, just very cautious.

Rich


On 12/11/07 11:47 AM, "Gleb Natapov"  wrote:

> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>> Possibly, though I have results from a benchmark I've written indicating
>> the reordering happens at the sender.  I believe I found it was due to
>> the QP striping trick I use to get more bandwidth -- if you back down to
>> one QP (there's a define in the code you can change), the reordering
>> rate drops.
> Ah, OK. My assumption was just from looking into code, so I may be
> wrong.
> 
>> 
>> Also I do not make any recursive calls to progress -- at least not
>> directly in the BTL; I can't speak for the upper layers.  The reason I
>> do many completions at once is that it is a big help in turning around
>> receive buffers, making it harder to run out of buffers and drop frags.
>>   I want to say there was some performance benefit as well but I can't
>> say for sure.
> Currently upper layers of Open MPI may call BTL progress function
> recursively. I hope this will change some day.
> 
>> 
>> Andrew
>> 
>> Gleb Natapov wrote:
>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
 Try UD, frags are reordered at a very high rate so should be a good test.
>>> Good Idea I'll try this. BTW I thing the reason for such a high rate of
>>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>>> (500) and process them one by one and if progress function is called
>>> recursively next 500 completion will be reordered versus previous
>>> completions (reordering happens on a receiver, not sender).
>>> 
 Andrew
 
 Richard Graham wrote:
> Gleb,
>   I would suggest that before this is checked in this be tested on a
> system
> that has N-way network parallelism, where N is as large as you can find.
> This is a key bit of code for MPI correctness, and out-of-order operations
> will break it, so you want to maximize the chance for such operations.
> 
> Rich
> 
> 
> On 12/11/07 10:54 AM, "Gleb Natapov"  wrote:
> 
>> Hi,
>> 
>>I did a rewrite of matching code in OB1. I made it much simpler and 2
>> times smaller (which is good, less code - less bugs). I also got rid
>> of huge macros - very helpful if you need to debug something. There
>> is no performance degradation, actually I even see very small performance
>> improvement. I ran MTT with this patch and the result is the same as on
>> trunk. I would like to commit this to the trunk. The patch is attached
>> for everybody to try.
>> 
>> --
>> Gleb.
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> --
>>> Gleb.
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-11 Thread Jon Mason
Currently, alternate CMs cannot be called because
ompi_btl_openib_connect_base_open forces a choice of either oob or xoob
(and goes into an erroneous error path if you pick something else).
This patch reorganizes ompi_btl_openib_connect_base_open so that new
functions can easily be added.  New Open functions were added to oob
and xoob for the error handling.

I tested calling oob, xoob, and rdma_cm.  oob happily allows connections
to be established and throws no errors.  xoob fails because ompi does
not have it compiled in (and I have no connectx cards).  rdma_cm calls
the empty hooks and exits without connecting (thus throwing
non-connection errors).  All expected behavior.

Since this patch fixes the existing behavior, and is not necessarily
tied to my implementation of rdma_cm, I think it is acceptable to go in
now.

Thanks,
Jon

Index: ompi/mca/btl/openib/connect/btl_openib_connect_base.c
===================================================================
--- ompi/mca/btl/openib/connect/btl_openib_connect_base.c   (revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_base.c   (working copy)
@@ -50,8 +50,8 @@
  */
 int ompi_btl_openib_connect_base_open(void)
 {
-int i;
-char **temp, *a, *b;
+char **temp, *a, *b, *defval;
+int i, ret = OMPI_ERROR;

 /* Make an MCA parameter to select which connect module to use */
 temp = NULL;
@@ -66,40 +66,23 @@

 /* For XRC qps we must to use XOOB connection manager */
 if (mca_btl_openib_component.num_xrc_qps > 0) {
-mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
-"connect",
-b, false, false,
-"xoob", ¶m);
-if (0 != strcmp("xoob", param)) {
-opal_show_help("help-mpi-btl-openib.txt",
-"XRC with wrong OOB", true,
-orte_system_info.nodename,
-mca_btl_openib_component.num_xrc_qps);
-return OMPI_ERROR;
-}
+   defval = "xoob";
 } else { /* For all others we should use OOB */
-mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
-"connect",
-b, false, false,
-"oob", ¶m);
-if (0 != strcmp("oob", param)) {
-opal_show_help("help-mpi-btl-openib.txt",
-"SRQ or PP with wrong OOB", true,
-orte_system_info.nodename,
-mca_btl_openib_component.num_srq_qps,
-mca_btl_openib_component.num_pp_qps);
-return OMPI_ERROR;
-}
+   defval = "oob";
 }

+mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,
+ "connect", b, false, false, defval, ¶m);
+
 /* Call the open function on all the connect modules */
 for (i = 0; NULL != all[i]; ++i) {
-if (NULL != all[i]->bcf_open) {
-all[i]->bcf_open();
+if (0 == strcmp(all[i]->bcf_name, param)) {
+ret = all[i]->bcf_open();
+   break;
 }
 }

-return OMPI_SUCCESS;
+return ret;
 }


Index: ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c
===================================================================
--- ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c   (revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c   (working copy)
@@ -28,11 +28,7 @@

 static int ibcm_open(void)
 {
-mca_base_param_reg_int(&mca_btl_openib_component.super.btl_version,
-   "btl_openib_connect_ibcm_foo",
-   "A dummy help message", false, false,
-   17, NULL);
-
+printf("ibcm open\n");
 return OMPI_SUCCESS;
 }

Index: ompi/mca/btl/openib/connect/btl_openib_connect_oob.c
===================================================================
--- ompi/mca/btl/openib/connect/btl_openib_connect_oob.c(revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_oob.c(working copy)
@@ -22,6 +22,8 @@

 #include "ompi_config.h"

+#include "opal/util/show_help.h"
+
 #include "orte/mca/ns/base/base.h"
 #include "orte/mca/oob/base/base.h"
 #include "orte/mca/rml/rml.h"
@@ -39,6 +41,7 @@
 ENDPOINT_CONNECT_ACK
 } connect_message_type_t;

+static int oob_open(void);
 static int oob_init(void);
 static int oob_start_connect(mca_btl_base_endpoint_t *e);
 static int oob_finalize(void);
@@ -67,8 +70,8 @@
  */
 ompi_btl_openib_connect_base_funcs_t ompi_btl_openib_connect_oob = {
 "oob",
-/* No need for "open */
-NULL,
+/* Open */
+oob_open,
 /* Init */
 oob_init,
 /* Connect */
@@ -78,6 +81,23 @@
 };

 /*
+ * Open function.
+ */
+static int oob_open(void)
+{
+if (mca_btl_openib_component.num_xrc_qps > 0) {
+opal_show_help("help-mpi-btl-openib.txt",
+"SRQ or PP with wrong OOB", true,
+orte_system_info.nodename,
+ 

[OMPI devel] Fwd: Subversion and trac outage

2007-12-11 Thread Jeff Squyres

Begin forwarded message:


From: DongInn Kim 
Date: December 11, 2007 6:20:03 PM EST
To: Jeff Squyres 
Subject: Subversion and trac outage

Hi,

I am sorry for the unexpected outage of subversion and trac of Open  
MPI.


There was a mistake in handling the ACL information about blocking
some specific ports this afternoon.

Hence, the following websites are not accessible now.

http(s)://svn.open-mpi.org/svn/ompi
http://svn.open-mpi.org/trac/ompi

I believe that this will be fixed first thing tomorrow morning.
I will let you know as soon as the services are available.
Again, I am really sorry about this incident.

Best Regards,


--
- DongInn



--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] [PATCH] openib: clean-up connect to allow for new cm's

2007-12-11 Thread Jeff Squyres
Hmm.  I don't think that we want to put knowledge of XRC in the OOB  
CPC (and vice versa).  That seems like an abstraction violation.


I didn't like that XRC knowledge was put in the connect base either,  
but I was too busy to argue with it.  :-)


Isn't there a better way somehow?  Perhaps we should have "select"  
call *all* the functions and accept back a priority.  The one with the  
highest priority then wins.  This is quite similar to much of the  
other selection logic in OMPI.
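
A hedged sketch of what that could look like for the CPCs (the bcf_* names
follow the funcs struct used in the patch; the bcf_query hook itself is a
hypothetical addition):

/* Sketch only: each connect module reports a priority (negative
 * meaning "cannot run here"); the highest priority wins. bcf_query
 * is a hypothetical addition to
 * ompi_btl_openib_connect_base_funcs_t. */
int ompi_btl_openib_connect_base_select(void)
{
    int i, pri, best_pri = -1, best = -1;

    for (i = 0; NULL != all[i]; ++i) {
        if (NULL == all[i]->bcf_query) {
            continue;
        }
        pri = all[i]->bcf_query();     /* module decides whether it can run */
        if (pri > best_pri) {
            best_pri = pri;
            best = i;
        }
    }
    if (best < 0) {
        return OMPI_ERROR;             /* no usable connect module */
    }
    return all[best]->bcf_open();
}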


Sidenote: Keep in mind that there are some changes coming to select  
CPCs on a per-endpoint basis (I can't look up the trac ticket right  
now...).  This makes things a little complicated -- do we need  
btl_openib_cpc_include and btl_openib_cpc_exclude MCA params to  
include/exclude CPCs (because you might need more than one CPC in a  
single job)?  That wouldn't be hard to do.


But then what do we do if someone sets up some XRC QPs and
selects OOB or RDMA CM?  How do we catch this and print an
error?  It doesn't seem right to put the "if num_xrc_qps>0" check in  
every CPC.  What happens if you try to make an XRC QP when not using  
xoob?  Where is the error detected and what kind of error message do  
we print?


Also, I'm not sure why the #if/#else is there for xoob (i.e., having  
empty/printf functions there when XRC support is compiled out) -- if  
xoob was disabled during compilation, then it simply should not be  
compiled and therefore not be there at all at run-time.  If a user  
selects the xoob CPC, then we should print a message from the base  
that that CPC doesn't exist in the installation.  Correspondingly, we  
can make an info MCA param in the btl openib that shows which CPCs are  
available (we already have this information -- it's easy enough to put  
this in an info MCA param).



On Dec 11, 2007, at 6:59 PM, Jon Mason wrote:


Currently, alternate CMs cannot be called because
ompi_btl_openib_connect_base_open forces a choice of either oob or xoob
(and goes into an erroneous error path if you pick something else).
This patch reorganizes ompi_btl_openib_connect_base_open so that new
functions can easily be added.  New Open functions were added to oob
and xoob for the error handling.

I tested calling oob, xoob, and rdma_cm.  oob happily allows connections
to be established and throws no errors.  xoob fails because ompi does
not have it compiled in (and I have no connectx cards).  rdma_cm calls
the empty hooks and exits without connecting (thus throwing
non-connection errors).  All expected behavior.

Since this patch fixes the existing behavior, and is not necessarily
tied to my implementation of rdma_cm, I think it is acceptable to go in
now.

Thanks,
Jon

Index: ompi/mca/btl/openib/connect/btl_openib_connect_base.c
===================================================================
--- ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(revision 16937)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_base.c	(working copy)

@@ -50,8 +50,8 @@
 */
int ompi_btl_openib_connect_base_open(void)
{
-int i;
-char **temp, *a, *b;
+char **temp, *a, *b, *defval;
+int i, ret = OMPI_ERROR;

/* Make an MCA parameter to select which connect module to use */
temp = NULL;
@@ -66,40 +66,23 @@

/* For XRC qps we must to use XOOB connection manager */
if (mca_btl_openib_component.num_xrc_qps > 0) {
-mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,

-"connect",
-b, false, false,
-"xoob", ¶m);
-if (0 != strcmp("xoob", param)) {
-opal_show_help("help-mpi-btl-openib.txt",
-"XRC with wrong OOB", true,
-orte_system_info.nodename,
-mca_btl_openib_component.num_xrc_qps);
-return OMPI_ERROR;
-}
+   defval = "xoob";
} else { /* For all others we should use OOB */
-mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,

-"connect",
-b, false, false,
-"oob", ¶m);
-if (0 != strcmp("oob", param)) {
-opal_show_help("help-mpi-btl-openib.txt",
-"SRQ or PP with wrong OOB", true,
-orte_system_info.nodename,
-mca_btl_openib_component.num_srq_qps,
-mca_btl_openib_component.num_pp_qps);
-return OMPI_ERROR;
-}
+   defval = "oob";
}

+mca_base_param_reg_string(&mca_btl_openib_component.super.btl_version,

+ "connect", b, false, false, defval, ¶m);
+
/* Call the open function on all the connect modules */
for (i = 0; NULL != all[i]; ++i) {
-if (NULL != all[i]->bcf_open) {
-all[i]->bcf_open();
+if (0 == strcmp(all[i]->bcf_name, param)) {
+ret = all[i]->bcf_open();
+   break;
}
}

-return OMPI_S