[OMPI devel] RFC: Change PML error handler signature

2010-04-21 Thread Rolf vandeVaart

WHAT:
Add two arguments to mca_pml_ob1_error_handler to make it more 
useful for BTLs that may take advantage of them: an ompi_proc_t 
pointer and a char pointer.  This is what the new signature 
looks like.


void mca_pml_ob1_error_handler(
   struct mca_btl_base_module_t* btl,
   int32_t flags, ompi_proc_t *errproc, char *btlname) {

WHY:
There are times when the BTL wants to notify the PML not only that it 
had an error, but also the endpoint the error occurred on.  In addition, 
we add a string so the BTL can put descriptive information like which 
interface had the error.
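
As a rough illustration of how a BTL might use the two new arguments, 
here is a minimal sketch.  It is not code from the Open MPI tree: the 
struct fields, the example_* names, the flag value, and the handler 
body are all assumptions made up for this example (the real handler 
currently just aborts).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in types for illustration; the real ones are defined in the Open MPI tree. */
struct mca_btl_base_module_t { const char *name; };   /* hypothetical field */
typedef struct ompi_proc_t { int rank; } ompi_proc_t; /* hypothetical field */

/* Hypothetical record of the last reported error, kept so a caller can inspect it. */
struct example_error_record {
    int32_t flags;
    int     peer_rank;   /* -1 when no peer was supplied */
    char    text[128];   /* human-readable description */
};
static struct example_error_record example_last_error;

/* Handler using the proposed four-argument signature (illustrative body). */
void example_pml_error_handler(struct mca_btl_base_module_t *btl,
                               int32_t flags, ompi_proc_t *errproc, char *btlname)
{
    assert(btl != NULL);
    example_last_error.flags = flags;
    example_last_error.peer_rank = errproc ? errproc->rank : -1;
    snprintf(example_last_error.text, sizeof(example_last_error.text),
             "%s BTL error on %s", btl->name, btlname ? btlname : "(unknown)");
}

/* BTL side: hypothetical helper that reports an error seen on one endpoint. */
void example_btl_report(struct mca_btl_base_module_t *btl,
                        ompi_proc_t *peer, char *ifname)
{
    example_pml_error_handler(btl, 1 /* hypothetical flag value */, peer, ifname);
}
```

The point of the string argument is only that the PML can pass the 
interface name through to whatever diagnostic it prints.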


WHERE: ompi/mca/pml/pml_ob1.c
   ompi/mca/btl/openib/btl_openib_component.c

MORE DETAILS:
I just want to expand the function signature by two arguments.  Note that 
currently the only place the callback is used is in the openib BTL, and 
when the callback is called, it just aborts the program.  So this has no 
effect whatsoever on the current library.  I will also fix the signature 
in the other PMLs to keep things consistent. 


TIMEOUT: Monday, April 26, 2010 (as this is a minor change)


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r23014

2010-04-21 Thread George Bosilca
The comment doesn't match the commit itself.

  george.

On Apr 20, 2010, at 20:00 , cy...@osl.iu.edu wrote:

> Author: cyeoh
> Date: 2010-04-20 20:00:14 EDT (Tue, 20 Apr 2010)
> New Revision: 23014
> URL: https://svn.open-mpi.org/trac/ompi/changeset/23014
> 
> Log:
> fixes #2355 - race in interaction between opal_atomic_lifo_push
> and opal_atomic_lifo_pop. Adds memory barriers to remove the race
> condition
> 
> 
> Text files modified: 
>   trunk/opal/include/opal/sys/powerpc/atomic.h | 3 +--
>  
>   1 files changed, 1 insertions(+), 2 deletions(-)
> 
> Modified: trunk/opal/include/opal/sys/powerpc/atomic.h
> ==
> --- trunk/opal/include/opal/sys/powerpc/atomic.h  (original)
> +++ trunk/opal/include/opal/sys/powerpc/atomic.h  2010-04-20 20:00:14 EDT (Tue, 20 Apr 2010)
> @@ -9,6 +9,7 @@
>  * University of Stuttgart.  All rights reserved.
>  * Copyright (c) 2004-2005 The Regents of the University of California.
>  * All rights reserved.
> + * Copyright (c) 2010  IBM Corporation.  All rights reserved.
>  * $COPYRIGHT$
>  * 
>  * Additional copyrights may follow
> @@ -296,7 +297,6 @@
> " add %0, %2, %0   \n\t"
> " stwcx.  %0, 0, %3\n\t"
> " bne-1b   \n\t"
> -" mr  %3, %0   \n\t"
> : "=&r" (t), "=m" (*v)
> : "r" (inc), "r" (v), "m" (*v)
> : "cc");
> @@ -314,7 +314,6 @@
> " subf%0,%2,%0 \n\t"
> " stwcx.  %0,0,%3  \n\t"
> " bne-1b   \n\t"
> -" mr  %3, %0   \n\t"
> : "=&r" (t), "=m" (*v)
> : "r" (dec), "r" (v), "m" (*v)
> : "cc");
> ___
> svn-full mailing list
> svn-f...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full




Re: [OMPI devel] RFC: Change PML error handler signature

2010-04-21 Thread George Bosilca
The current error system follows a different design. There are basically two 
ways to report errors: per peer or global. A per-peer error can only be 
triggered by a specific send or receive, and is based on the value of the last 
argument of the callbacks. Such errors clearly indicate which peer and which 
message were involved when the error was detected. The second way is global, 
not peer related, and was supposed to be used more for local errors (such as 
"this specific BTL is now down"). As a result, this kind of error is supposed 
to unlink all peers connected through the BTL, and this is why the ompi_proc_t 
is not part of the argument list.

If you change the signature of this function, this will change the design. And 
I'm not sure it makes it more consistent. How do we report that a BTL is now 
completely down and all peers connected through it have to be relinked through 
another BTL?
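
To make the two reporting paths concrete, here is a rough sketch under 
stated assumptions.  The struct layouts, the EXAMPLE_* status codes, and 
the example_* function names are invented for illustration and are not 
the actual Open MPI definitions; only the general shape (a status as the 
last callback argument for per-peer errors, and a global callback with 
no peer argument) follows the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_SUCCESS 0     /* hypothetical status codes */
#define EXAMPLE_ERROR   (-1)

/* Stand-in types; fields are hypothetical. */
struct mca_btl_base_endpoint_t   { int linked; };
struct mca_btl_base_descriptor_t { int unused; };
struct mca_btl_base_module_t {
    struct mca_btl_base_endpoint_t **peers;  /* peers reached through this BTL */
    size_t npeers;
    int up;
};

/* Per-peer path: the error arrives as the last argument of a send/receive
 * completion callback, so the peer and the message are both known. */
void example_completion_cb(struct mca_btl_base_module_t *btl,
                           struct mca_btl_base_endpoint_t *endpoint,
                           struct mca_btl_base_descriptor_t *des,
                           int status)
{
    (void)btl;
    (void)des;
    assert(endpoint != NULL);
    if (status != EXAMPLE_SUCCESS) {
        endpoint->linked = 0;   /* only this peer is affected */
    }
}

/* Global path: no peer argument; every peer on the BTL is unlinked. */
void example_global_error_cb(struct mca_btl_base_module_t *btl, int32_t flags)
{
    (void)flags;
    assert(btl != NULL);
    btl->up = 0;
    for (size_t i = 0; i < btl->npeers; i++) {
        btl->peers[i]->linked = 0;
    }
}
```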

  george.

On Apr 21, 2010, at 11:07 , Rolf vandeVaart wrote:

> WHAT:
> Add two arguments to the mca_pml_ob1_error_handler to make it more useful for 
> BTLs that may take advantage of that feature.  Adding an ompi_proc_t pointer 
> and a char pointer.  This is what the new signature looks like.
> 
> void mca_pml_ob1_error_handler(
>   struct mca_btl_base_module_t* btl,
>   int32_t flags, ompi_proc_t *errproc, char *btlname) {
> 
> WHY:
> There are times when the BTL wants to notify the PML not only that it had an 
> error, but also the endpoint the error occurred on.  In addition, we add a 
> string so the BTL can put descriptive information like which interface had 
> the error.
> 
> WHERE: ompi/mca/pml/pml_ob1.c
>   ompi/mca/btl/openib/btl_openib_component.c
> 
> MORE DETAILS:
> I just want to expand the function signature by two variables.  Note that 
> currently the only place the callback is used is in the openib BTL.  And when 
> the callback is called, it just aborts the program.  So this has no effect 
> whatsoever on the current library.  I will also fix the signature in the 
> other PMLs to keep things consistent. 
> TIMEOUT: Monday, April 26, 2010 (as this is a minor change)
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] New OMPI MPI extension

2010-04-21 Thread Jeff Squyres
Per the telecon Tuesday, I committed a new OMPI MPI extension to the trunk:

https://svn.open-mpi.org/trac/ompi/changeset/23018

Please read the commit message and let me know what you think.  Suggestions are 
welcome.

If everyone is ok with it, I'd like to see this functionality hit the 1.5 
series at some point.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] Cisco MTT testing

2010-04-21 Thread Jeff Squyres
I tried and failed to get my cluster up and going yesterday (my MTT runs last 
night didn't go well -- they're all flagged as "trial" for the moment for 
exactly this reason).  I may have just figured out what the major cause of my 
problems was; hopefully I'll be able to submit another big MTT run for 1.4.2rc1 
by tonight.

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] RFC: Change PML error handler signature

2010-04-21 Thread Rolf vandeVaart

Hi George:
To report that an entire BTL is down, one just sets the ompi_proc_t 
argument to NULL.  That is how I was using it.  That means 
mca_pml_ob1_error_handler could see that it is NULL and map out the 
entire BTL.  BTLs can set the ompi_proc_t if they want, and the PML is 
free to use or ignore it.  This allows us to handle errors that occur 
on a receive where we would not want to error out the entire BTL, but 
just a single connection.
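
A minimal sketch of that NULL convention, assuming made-up stand-in 
types (the mapped_out field, the procs array, and the name 
example_error_handler are hypothetical and not taken from the tree):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; fields are hypothetical. */
typedef struct ompi_proc_t { int mapped_out; } ompi_proc_t;
struct mca_btl_base_module_t {
    ompi_proc_t **procs;   /* procs currently reached through this BTL */
    size_t nprocs;
};

/* Handler with the proposed signature: a NULL errproc means the whole
 * BTL is down; a non-NULL errproc means only that connection failed. */
void example_error_handler(struct mca_btl_base_module_t *btl, int32_t flags,
                           ompi_proc_t *errproc, char *btlname)
{
    (void)flags;
    (void)btlname;
    assert(btl != NULL);
    if (errproc == NULL) {
        /* Map out every proc that uses this BTL. */
        for (size_t i = 0; i < btl->nprocs; i++) {
            btl->procs[i]->mapped_out = 1;
        }
    } else {
        /* Map out just the one connection. */
        errproc->mapped_out = 1;
    }
}
```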


Does that make this change better?  Or am I still violating the general 
design?


Rolf

On 04/21/10 11:34, George Bosilca wrote:

> The current error system follows a different design. There are basically two 
> ways to report errors: per peer or global. A per-peer error can only be 
> triggered by a specific send or receive, and is based on the value of the 
> last argument of the callbacks. Such errors clearly indicate which peer and 
> which message were involved when the error was detected. The second way is 
> global, not peer related, and was supposed to be used more for local errors 
> (such as "this specific BTL is now down"). As a result, this kind of error is 
> supposed to unlink all peers connected through the BTL, and this is why the 
> ompi_proc_t is not part of the argument list.
> 
> If you change the signature of this function, this will change the design. 
> And I'm not sure it makes it more consistent. How do we report that a BTL is 
> now completely down and all peers connected through it have to be relinked 
> through another BTL?
> 
>   george.

