Re: [OMPI devel] 1.3 PML default choice

2009-01-13 Thread Brian W. Barrett
The selection logic for the PML is very confusing and doesn't follow the 
standard priority selection.  The reasons for this are convoluted and not 
worth discussing here.  The bottom line, however, is that the OB1 PML will 
be the default *UNLESS* the PSM (PathScale/Qlogic) MTL can be chosen, in 
which case the CM PML is used by default.


Brian

On Tue, 13 Jan 2009, Bogdan Costescu wrote:


On Tue, 13 Jan 2009, Tim Mattox wrote:


The cm PML does not use BTLs..., only MTLs, so
... the BTL selection is ignored.


OK, thanks for clarifying this bit, but...


The README for 1.3b2 specifies that CM is now chosen if possible; in my
trials, when I specify CM+BTL, it doesn't complain and works well.
However either the default (no options) or OB1+BTL leads to the jumps
mentioned above, which makes me believe that OB1+BTL is still chosen as
default, contrary to what the README specifies.


... this bit is still unclear to me. Should OB1+BTL or CM+MTL be the default? 
I have just tried using "mpi_show_mca_params" for both v1.3b2 and v1.3rc3 
and this tells me that:


pml= (default value)
pml_cm_priority=30 (default value)
pml_ob1_priority=20 (default value)

which, from what I know, should lead to CM being chosen as the default. Still, 
for v1.3b2 OB1 seemed to be chosen; for v1.3rc3 I can't distinguish anymore 
from the timings, as they behave very similarly.
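
For reference, standard MCA selection would indeed pick the component with the 
highest priority, so the parameter values above would point at CM; the 1.3 
behavior described at the top of this thread is a special case layered on top 
of that.  A minimal, purely illustrative C sketch contrasting the two rules 
(hypothetical structures, not the actual OMPI selection code):

/* Hypothetical sketch contrasting standard priority selection with the
 * special-case rule described in this thread; not the real OMPI code. */
#include <stdio.h>

struct pml_candidate {
    const char *name;
    int priority;
};

/* Standard rule: highest priority wins (cm=30 beats ob1=20). */
static const char *select_by_priority(const struct pml_candidate *c, int n)
{
    const char *best = NULL;
    int best_prio = -1;
    for (int i = 0; i < n; ++i) {
        if (c[i].priority > best_prio) {
            best_prio = c[i].priority;
            best = c[i].name;
        }
    }
    return best;
}

/* Rule described in this thread: ob1 unless the PSM MTL is usable,
 * in which case cm is chosen. */
static const char *select_v13_default(int psm_mtl_usable)
{
    return psm_mtl_usable ? "cm" : "ob1";
}

int main(void)
{
    struct pml_candidate c[] = { { "cm", 30 }, { "ob1", 20 } };
    printf("priority rule picks: %s\n", select_by_priority(c, 2));
    printf("1.3 default picks:   %s\n", select_v13_default(0));
    return 0;
}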





Re: [OMPI devel] RFC: [slightly] Optimize Fortran MPI_SEND / MPI_RECV

2009-02-07 Thread Brian W. Barrett

On Sat, 7 Feb 2009, Jeff Squyres wrote:

End result: I guess I'm a little surprised that the difference is that clear 
-- does a function call really take 10ns?  I'm also surprised that the 
layered C version has significantly more jitter than the non-layered version; 
I can't really explain that.  I'd welcome anyone else replicating the 
experiment and/or eyeballing my code to make sure I didn't bork something up.


That is significantly higher than I would have expected for a single 
function call.  When I did all the component tests a couple of years ago, a 
function call into a shared library was about 5ns on an Intel Xeon 
(pre-Core 2 design) and about 2.5ns on an AMD Opteron.
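
For anyone wanting to replicate that kind of measurement, a rough sketch of a 
per-call timing loop (assumes POSIX clock_gettime; the volatile function 
pointer is just a stand-in for a call the compiler cannot inline, so absolute 
numbers will differ from a real cross-library call):

/* Rough sketch of a per-call-overhead measurement; assumes POSIX
 * clock_gettime.  Calling through a volatile function pointer keeps the
 * compiler from inlining the call. */
#include <stdio.h>
#include <time.h>

static int noop(int x) { return x + 1; }

int main(void)
{
    int (*volatile fn)(int) = noop;
    const long iters = 100 * 1000 * 1000;
    struct timespec t0, t1;
    int acc = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; ++i) {
        acc = fn(acc);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.2f ns per call (acc=%d)\n", ns / iters, acc);
    return 0;
}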


Brian


Re: [OMPI devel] RFC: Rename several OMPI_* names to OPAL_*

2009-02-10 Thread Brian W. Barrett

I have no objections to this change.

Brian


On Tue, 10 Feb 2009, Greg Koenig wrote:


RFC: Rename several OMPI_* names to OPAL_*

WHAT: Rename several #define values that encode the prefix "OMPI_" to
instead encode the prefix "OPAL_" throughout the entire Open MPI source code
tree.  Also, eliminate unnecessary #include lines from source code files
under the ".../ompi/mca/btl" subtree.

WHY: (1) These are general source code improvements that update #define
values to more accurately describe which layer the values belong to and remove
unnecessary dependencies within the source code; (2) These changes will help
with the effort to move the BTL code into an independent layer.

WHERE: 1.4 trunk

WHEN: Negotiable -- see below, but probably near split for 1.4
 (No earlier than February 19, 2009)

Timeout: February 19, 2009



The proposed change involves renaming several #define values that encode the
prefix "OMPI_" to instead encode the prefix "OPAL_" throughout the entire
Open MPI source code tree.  These names are holdovers from when the three
existing layers of Open MPI were developed together prior to being split
apart.  Additionally, the proposed change eliminates a few unnecessary
#include lines in BTL source code files under the .../ompi/mca/btl subtree.

Specific modifications are detailed following this message text.  A script
to carry out these modifications is also attached to this message (gzipped
to pass unmolested through the ORNL e-mail server).

We believe these modifications improve the Open MPI source code by renaming
values such that they correspond to the Open MPI layer to which they most
closely belong, and that this improvement is itself of benefit to Open MPI.
These modifications will also aid our ongoing efforts to extract the BTL
code into a new layer ("ONET") that can be built with just direct dependence
on the OPAL layer.

Although these changes are simple string substitutions, they touch a fair
amount of code in the Open MPI tree.  Three people have tested these changes
at our site on various platforms and have not discovered any problems.
However, we recognize that some members of the community may have
input/feedback regarding testing and we remain open to suggestions related
to testing.

One challenge that has been brought up regarding this RFC is that applying
patches and/or CMRs to the source code tree after the proposed changes are
performed will be more difficult.  To that end, the best opportunity to
apply the modifications proposed in this RFC seems to be in conjunction with
1.4.  (My understanding from the developer conference call this morning is
that there are a few other changes waiting for this switch as well.)  We are
open to suggestions about the best time to apply this RFC to avoid major
disruptions.


Specific changes follow:

* From .../configure.ac.
* OMPI_NEED_C_BOOL
* OMPI_HAVE_WEAK_SYMBOLS
* OMPI_C_HAVE_WEAK_SYMBOLS
* OMPI_USE_STDBOOL_H
* OMPI_HAVE_SA_RESTART
* OMPI_HAVE_VA_COPY
* OMPI_HAVE_UNDERSCORE_VA_COPY
* OMPI_PTRDIFF_TYPE
* (also, ompi_ptrdiff_t)
* OMPI_ALIGN_WORD_SIZE_INTEGERS
* OMPI_WANT_LIBLTDL
* (also, OMPI_ENABLE_DLOPEN_SUPPORT)
* OMPI_STDC_HEADERS
* OMPI_HAVE_SYS_TIME_H
* OMPI_HAVE_LONG_LONG
* OMPI_HAVE_SYS_SYNCH_H
* OMPI_SIZEOF_BOOL
* OMPI_SIZEOF_INT

* From .../config/ompi_check_attributes.m4.
* OMPI_HAVE_ATTRIBUTE
* (also, ompi_cv___attribute__)
* OMPI_HAVE_ATTRIBUTE_ALIGNED
* (also, ompi_cv___attribute__aligned)
* OMPI_HAVE_ATTRIBUTE_ALWAYS_INLINE
* (also, ompi_cv___attribute__always_inline)
* OMPI_HAVE_ATTRIBUTE_COLD
* (also, ompi_cv___attribute__cold)
* OMPI_HAVE_ATTRIBUTE_CONST
* (also, ompi_cv___attribute__const)
* OMPI_HAVE_ATTRIBUTE_DEPRECATED
* (also, ompi_cv___attribute__deprecated)
* OMPI_HAVE_ATTRIBUTE_FORMAT
* (also, ompi_cv___attribute__format)
* OMPI_HAVE_ATTRIBUTE_HOT
* (also, ompi_cv___attribute__hot)
* OMPI_HAVE_ATTRIBUTE_MALLOC
* (also, ompi_cv___attribute__malloc)
* OMPI_HAVE_ATTRIBUTE_MAY_ALIAS
* (also, ompi_cv___attribute__may_alias)
* OMPI_HAVE_ATTRIBUTE_NO_INSTRUMENT_FUNCTION
* (also, ompi_cv___attribute__no_instrument_function)
* OMPI_HAVE_ATTRIBUTE_NONNULL
* (also, ompi_cv___attribute__nonnull)
* OMPI_HAVE_ATTRIBUTE_NORETURN
* (also, ompi_cv___attribute__noreturn)
* OMPI_HAVE_ATTRIBUTE_PACKED
* (also, ompi_cv___attribute__packed)
* OMPI_HAVE_ATTRIBUTE_PURE
* (also, ompi_cv___attribute__pure)
* OMPI_HAVE_ATTRIBUTE_SENTINEL
* (also, ompi_cv___attribute__sentinel)
* OMPI_HAVE_ATTRIBUTE_UNUSED
* (also, ompi_cv___attribute__unused)
* OMPI_HAVE_ATTRIBUTE_VISIBILITY
* (also, ompi_cv___attribute__visibility)
* OMPI_HAVE_ATTRIBUTE_WARN_UNUSED_RESULT
* (also, ompi_cv___attribute__warn_unused_result)
* OMPI_HAVE_ATTRIBUTE_WEAK_ALIAS
* (also, ompi_cv___attribute__weak

Re: [OMPI devel] RFC: eliminating "descriptor" argument from sendi function

2009-02-23 Thread Brian W. Barrett
At a high level, it seems reasonable to me.  I am not familiar enough with 
the sendi code, however, to have a strong opinion either way.


Brian

On Mon, 23 Feb 2009, Jeff Squyres wrote:


Sounds reasonable to me.  George / Brian?


On Feb 21, 2009, at 2:11 AM, Eugene Loh wrote:


What:  Eliminate the "descriptor" argument from sendi functions.

Why:  The only thing this argument is used for is so that the sendi 
function can allocate a descriptor in the event that the "send" cannot 
complete.  But, in that case, the sendi reverts to the PML, where there is 
already code to allocate a descriptor.  So, each sendi function (in each 
BTL that has a sendi function) must have code that is already in the PML 
anyhow.  This is unnecessary extra coding and not clean design.


Where:  In each BTL that has a sendi function (only three, and they are 
not all used), in the function prototype, and at the PML calling site.


When:  I'd like to incorporate this in the shared-memory latency work I'm 
doing that we're targeting for 1.3.x.


Timeout:  Feb 27.
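
To illustrate the "Why" above, here is a hedged sketch of the proposed calling 
pattern with hypothetical simplified types (not the actual OMPI BTL/PML 
prototypes): the PML tries sendi first and, on failure, allocates the 
descriptor itself, so the BTL no longer needs its own allocation code.

/* Hedged sketch of the call pattern discussed in this RFC, using
 * hypothetical simplified types -- not the actual OMPI BTL/PML prototypes. */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

typedef struct { size_t size; } descriptor_t;

/* Stand-in for the descriptor-allocation code that already exists in the
 * PML send path. */
static descriptor_t *pml_alloc_descriptor(size_t len)
{
    descriptor_t *d = malloc(sizeof(*d));
    d->size = len;
    return d;
}

/* New-style sendi: returns 0 on immediate completion, nonzero if the send
 * cannot complete (no descriptor output argument any more). */
static int btl_sendi(const char *payload, size_t len)
{
    (void)payload;
    return len <= 64 ? 0 : 1;   /* pretend only tiny messages go immediately */
}

/* Proposed PML flow: if sendi fails, the PML allocates the descriptor and
 * reverts to the normal path. */
static int pml_start_send(const char *payload, size_t len)
{
    if (btl_sendi(payload, len) == 0) {
        printf("sent immediately (%zu bytes)\n", len);
        return 0;
    }
    descriptor_t *d = pml_alloc_descriptor(len);
    printf("fell back to descriptor path (%zu bytes)\n", d->size);
    free(d);
    return 0;
}

int main(void)
{
    pml_start_send("hi", 2);
    pml_start_send("a much longer message ...", 1024);
    return 0;
}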






Re: [OMPI devel] RFC: eliminating "descriptor" argument from sendi function

2009-02-23 Thread Brian W. Barrett

On Mon, 23 Feb 2009, Jeff Squyres wrote:


On Feb 23, 2009, at 10:37 AM, Eugene Loh wrote:


I sense an opening here and rush in for the kill...


:-)

And, why does the PML pass a BTL argument into the sendi function?  First, 
the BTL argument is not typically used.  Second, if the BTL sendi function 
wants to know what BTL it is,... uh, doesn't it already know???  Doesn't a 
BTL know who it is?  Why, then, should the PML have to tell it?


I suspect that it's passing in the BTL *module* argument, which may have 
specific information about the connection that is to be used.


Example: if I have a dual-port IB HCA, Open MPI will make 2 different openib 
BTL modules.  In this case, the openib BTL will need to know exactly which 
module the PML is trying to sendi on.


Exactly.  In multi-nic situations, the BTL argument is critical.  Since 
the SM btl never really does "multi-nic", it doesn't have to worry about 
the btl argument.
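
A small illustration of that point, with hypothetical structs rather than the 
real mca_btl_base_module_t: one component can own several module instances 
(e.g. one per port, as in Jeff's dual-port openib example), and only the 
module pointer passed by the PML tells the component which one a fragment is 
scheduled on.

/* Hedged illustration of why the PML passes the BTL *module* down.
 * Hypothetical structs, not the real BTL interface. */
#include <stdio.h>

typedef struct btl_module {
    const char *component;   /* e.g. "openib" */
    int port;                /* which HCA port this module drives */
} btl_module_t;

static int btl_send(btl_module_t *module, const char *payload)
{
    /* Without the module argument, the component could not tell which of
     * its ports the PML scheduled this fragment onto. */
    printf("%s port %d: sending \"%s\"\n",
           module->component, module->port, payload);
    return 0;
}

int main(void)
{
    btl_module_t port1 = { "openib", 1 };
    btl_module_t port2 = { "openib", 2 };
    btl_send(&port1, "fragment A");   /* PML round-robins across modules */
    btl_send(&port2, "fragment B");
    return 0;
}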


Brian


Re: [OMPI devel] compiler_args in wrapper-data.txt files with Portland Group Compilers

2009-02-24 Thread Brian W. Barrett

Hi Wayne -

Sorry for the delay.  I'm the author of that code, and am currently trying 
to finish my dissertation, so I've been a bit behind.


Anyway, at present, the compiler_args field only works on a single token. 
So you can't have something looking for -tp p7.  I thought about how to do 
this, but never got a chance to add it to the code base.  I'm not sure 
when/if that feature will be added.  If you have some time, the code lives 
in opal/tools/wrappers/opal_wrapper.c, if you want to have a look.
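
A minimal sketch of what single-token matching means in practice (illustrative 
only, not the actual opal_wrapper.c logic): each command-line token is 
compared against the compiler_args string on its own, so a value containing a 
space can never match, while a bare "p7" matches -- including, as Wayne notes 
below, in places where it shouldn't.

/* Illustrative single-token matching; not the actual opal_wrapper.c code. */
#include <stdio.h>
#include <string.h>

static int stanza_matches(const char *compiler_args, int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], compiler_args) == 0) {
            return 1;   /* one token matched the whole compiler_args string */
        }
    }
    return 0;
}

int main(void)
{
    char *argv[] = { "mpicc", "-tp", "p7", "hello.c" };
    int argc = 4;

    printf("compiler_args=p7      -> %d\n", stanza_matches("p7", argc, argv));
    printf("compiler_args=-tp p7  -> %d\n", stanza_matches("-tp p7", argc, argv));
    return 0;
}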


Good luck,

Brian


On Mon, 23 Feb 2009, Wayne Gilmore wrote:

I sent this to the users mailing list but maybe this is a better place for 
it. Can anyone help with this?


I'm trying to use the compiler_args field in the wrappers script to deal
with 32 bit compiles on our cluster.

I'm using Portland Group compilers and use the following for 32 bit
builds: -tp p7

I've created a separate stanza in the wrapper but I am not able to use
the whole option "-tp p7" for the compiler_args. It only works if I do
compiler_args=p7

Is there a way to provide compiler_args with arguments that contain a
space?

This would eliminate cases where 'p7' would appear elsewhere in the
compile line and be falsely recognized as a 32 bit build.

Here is some additional information from my build:

For a regular 64 bit build:
(no problems here, works fine)

katana:~ % mpicc --showme
pgcc -D_REENTRANT
-I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include
-Wl,-rpath
-Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib
-L/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib
-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil
-lpthread -ldl

For a 32 bit build when compiler_args is set to "-tp p7" in the wrapper:
(note that in this case it does not pick up the lib32 and include32 dirs)

katana:share/openmpi % mpicc -tp p7 --showme
pgcc -D_REENTRANT
-I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include
-tp p7 -Wl,-rpath
-Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib
-L/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib
-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil
-lpthread -ldl

For a 32 bit build when compiler_args is set to "p7" in the wrapper
(note that in this case it does pick up the lib32 and include32 dirs)

katana:share/openmpi % mpicc -tp p7 --showme
pgcc -D_REENTRANT
-I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32

-I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32

-tp p7 -Wl,-rpath
-Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32
-L/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32
-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil
-lpthread -ldl

Here's the mpicc-wrapper-data.txt file that I am using: (with
compiler_args set to "p7" only. This works, but if I set it to "-tp p7"
it fails to pick up the info in the stanza)

compiler_args=
project=Open MPI
project_short=OMPI
version=1.3
language=C
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=pgcc
extra_includes=
preprocessor_flags=-D_REENTRANT
compiler_flags=
linker_flags=-Wl,-rpath
-Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib
libs=-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl
-lutil -lpthread -ldl
required_file=
includedir=${includedir}
libdir=${libdir}

compiler_args=p7
project=Open MPI
project_short=OMPI
version=1.3
language=C
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=pgcc
extra_includes=
preprocessor_flags=-D_REENTRANT
-I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32
compiler_flags=
linker_flags=-Wl,-rpath
-Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32
libs=-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl
-lutil -lpthread -ldl
required_file=
includedir=/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32
libdir=/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32 




Re: [OMPI devel] 1.3.1rc3 was borked; 1.3.1rc4 is out

2009-03-03 Thread Brian W. Barrett

On Tue, 3 Mar 2009, Jeff Squyres wrote:

1.3.1rc3 had a race condition in the ORTE shutdown sequence.  The only 
difference between rc3 and rc4 was a fix for that race condition.  Please 
test ASAP:


  http://www.open-mpi.org/software/ompi/v1.3/


I'm sorry, I've failed to test rc1 & rc2 on Catamount.  I'm getting a 
compile failure in the ORTE code.  I'll do a bit more testing and send 
Ralph an e-mail this afternoon.


Brian


Re: [OMPI devel] calling sendi earlier in the PML

2009-03-03 Thread Brian W. Barrett

On Tue, 3 Mar 2009, Eugene Loh wrote:

First, this behavior is basically what I was proposing and what George didn't 
feel comfortable with.  It is arguably no compromise at all.  (Uggh, why must 
I be so honest?)  For eager messages, it favors BTLs with sendi functions, 
which could lead to those BTLs becoming overloaded.  I think favoring BTLs 
with sendi for short messages is good.  George thinks that load balancing 
BTLs is good.


I have two thoughts on the issue:

1) How often are a BTL with a sendi and a BTL without a sendi going to be 
used together?  Keep in mind, this is two BTLs with the same priority and 
connectivity to the same peer.  My thought is that, given how few machines 
have heterogeneous networks (yes, I know UTK has one, but we're talking 
percentages), optimizing for that case at the cost of the much more common 
case is a poor choice.


2) It seems like a much better idea would be to add sendi calls to all 
btls that are likely to be used at the same priority.  This seems like 
good long-term form anyway, so why not optimize the PML for the long term 
rather than the short term and assume all BTLs will have a sendi function?


Brian


Re: [OMPI devel] calling sendi earlier in the PML

2009-03-03 Thread Brian W. Barrett

On Tue, 3 Mar 2009, Jeff Squyres wrote:


On Mar 3, 2009, at 3:31 PM, Eugene Loh wrote:

First, this behavior is basically what I was proposing and what George 
didn't feel comfortable with.  It is arguably no compromise at all.  (Uggh, 
why must I be so honest?)  For eager messages, it favors BTLs with sendi 
functions, which could lead to those BTLs becoming overloaded.  I think 
favoring BTLs with sendi for short messages is good.  George thinks that 
load balancing BTLs is good.


Second, the implementation can be simpler than you suggest:

*) You don't need a separate list since testing for a sendi-enabled BTL is 
relatively cheap (I think... could verify).
*) You don't need to shuffle the list.  The mechanism used by ob1 just 
resumes the BTL search from the last BTL used.  E.g., check 
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/pml/ob1/pml_ob1_sendreq.h#mca_pml_ob1_send_request_start 
.  You use mca_bml_base_btl_array_get_next(&btl_eager) to roundrobin over 
BTLs in a totally fair manner (remembering where the last loop left off), 
and using mca_bml_base_btl_array_get_size(&btl_eager) to make sure you 
don't loop endlessly.
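
A minimal sketch of that round-robin bookkeeping (hypothetical names, not the 
real mca_bml_base_btl_array_t API): the array remembers where the last search 
stopped, and the size bounds how many BTLs one send attempt will consider.

/* Hedged sketch of the round-robin idea described above. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *btls[4];
    size_t size;
    size_t index;          /* where the last search left off */
} btl_array_t;

static const char *btl_array_get_next(btl_array_t *a)
{
    const char *btl = a->btls[a->index];
    a->index = (a->index + 1) % a->size;
    return btl;
}

int main(void)
{
    btl_array_t eager = { { "sm", "openib", "tcp" }, 3, 0 };

    /* Try each eager BTL at most once, starting after the last one used. */
    for (size_t tries = 0; tries < eager.size; ++tries) {
        const char *btl = btl_array_get_next(&eager);
        printf("considering %s\n", btl);
        /* ... attempt sendi / start on this BTL, break on success ... */
    }
    return 0;
}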


Cool / fair enough.

How about an MCA parameter to switch between this mechanism (early sendi) and 
the original behavior (late sendi)?


This is the usual way that we resolve "I want to do X / I want to do Y" 
disputes.  :-)


Of all the options presented, this is the one I dislike most :).

This is *THE* critical path of the OB1 PML.  It's already horribly complex 
and hard to follow (as Eugene is finding out the hard way).  Making it 
more complex as a way to settle this argument is pain and suffering just 
to avoid conflict.


However, one possible option just occurred to me.  If (AND ONLY IF) ob1/r2 
detects that there are at least two BTLs to the same peer at the same 
priority, and at least one has a sendi and at least one does not, what about 
an MCA parameter to disable all sendi functions to that peer?


There's only a 1% gain in the FAIR protocol Eugene proposed, so we'd lose 
that 1% in the heterogeneous multi-nic case (the least common case). 
There would be a much bigger gain for the sendi homogeneous multi-nic / 
all single-nic cases (much more common), because the FAST protocol would 
be used.


That way, we get the FAST protocol in all cases for sm, which is what I 
really want ;).
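
A hedged sketch of the detection step proposed above, with hypothetical types 
and illustrative data only: scan the peer's eager BTL list once, and only fall 
back to the sendi-less (fair) path when the list mixes sendi-capable and 
sendi-less BTLs.

/* Hedged sketch of the proposed per-peer check; hypothetical types. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    bool has_sendi;
} btl_t;

static bool disable_sendi_for_peer(const btl_t *btls, size_t n)
{
    bool any_with = false, any_without = false;
    for (size_t i = 0; i < n; ++i) {
        if (btls[i].has_sendi) any_with = true;
        else                   any_without = true;
    }
    return any_with && any_without;   /* heterogeneous: use the fair path */
}

int main(void)
{
    /* Illustrative data only. */
    btl_t homogeneous[] = { { "sm", true }, { "openib", true } };
    btl_t mixed[]       = { { "sm", true }, { "tcp", false } };

    printf("homogeneous peer: disable sendi? %d\n",
           (int)disable_sendi_for_peer(homogeneous, 2));
    printf("mixed peer:       disable sendi? %d\n",
           (int)disable_sendi_for_peer(mixed, 2));
    return 0;
}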


Brian


Re: [OMPI devel] 1.3.1rc3 was borked; 1.3.1rc4 is out

2009-03-03 Thread Brian W. Barrett

On Tue, 3 Mar 2009, Brian W. Barrett wrote:


On Tue, 3 Mar 2009, Jeff Squyres wrote:

1.3.1rc3 had a race condition in the ORTE shutdown sequence.  The only 
difference between rc3 and rc4 was a fix for that race condition.  Please 
test ASAP:


  http://www.open-mpi.org/software/ompi/v1.3/


I'm sorry, I've failed to test rc1 & rc2 on Catamount.  I'm getting a compile 
failure in the ORTE code.  I'll do a bit more testing and send Ralph an 
e-mail this afternoon.



Attached is a patch against the v1.3 branch that makes it work on Red Storm. 
I'm not sure it's right, so I'm just e-mailing it rather than committing it. 
Sorry Ralph, but can you take a look? :(


Brian

Index: orte/mca/odls/base/base.h
===
--- orte/mca/odls/base/base.h	(revision 20705)
+++ orte/mca/odls/base/base.h	(working copy)
@@ -29,9 +29,10 @@
 #include "opal/mca/mca.h"
 #include "opal/class/opal_list.h"
 
+#if !ORTE_DISABLE_FULL_SUPPORT
 #include "orte/mca/odls/odls.h"
+#endif
 
-
 BEGIN_C_DECLS
 
 /**
Index: orte/mca/grpcomm/grpcomm.h
===
--- orte/mca/grpcomm/grpcomm.h	(revision 20705)
+++ orte/mca/grpcomm/grpcomm.h	(working copy)
@@ -44,7 +44,6 @@
 
 #include "orte/mca/rmaps/rmaps_types.h"
 #include "orte/mca/rml/rml_types.h"
-#include "orte/mca/odls/odls_types.h"
 
 #include "orte/mca/grpcomm/grpcomm_types.h"
 
Index: orte/runtime/orte_globals.c
===
--- orte/runtime/orte_globals.c	(revision 20705)
+++ orte/runtime/orte_globals.c	(working copy)
@@ -40,11 +40,11 @@
 #include "orte/runtime/runtime_internals.h"
 #include "orte/runtime/orte_globals.h"
 
+#if !ORTE_DISABLE_FULL_SUPPORT
+
 /* need the data type support functions here */
 #include "orte/runtime/data_type_support/orte_dt_support.h"
 
-#if !ORTE_DISABLE_FULL_SUPPORT
-
 /* globals used by RTE */
 bool orte_timing;
 bool orte_debug_daemons_file_flag = false;
@@ -135,7 +135,8 @@
 opal_output_set_verbosity(orte_debug_output, 1);
 }
 }
-
+
+#if !ORTE_DISABLE_FULL_SUPPORT
 /** register the base system types with the DSS */
 tmp = ORTE_STD_CNTR;
 if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_std_cntr,
@@ -192,7 +193,6 @@
 return rc;
 }
 
-#if !ORTE_DISABLE_FULL_SUPPORT
 /* get a clean output channel too */
 {
 opal_output_stream_t lds;
Index: orte/runtime/data_type_support/orte_dt_support.h
===
--- orte/runtime/data_type_support/orte_dt_support.h	(revision 20705)
+++ orte/runtime/data_type_support/orte_dt_support.h	(working copy)
@@ -30,7 +30,9 @@
 
 #include "opal/dss/dss_types.h"
 #include "orte/mca/grpcomm/grpcomm_types.h"
+#if !ORTE_DISABLE_FULL_SUPPORT
 #include "orte/mca/odls/odls_types.h"
+#endif
 #include "orte/mca/plm/plm_types.h"
 #include "orte/mca/rmaps/rmaps_types.h"
 #include "orte/mca/rml/rml_types.h"


Re: [OMPI devel] calling sendi earlier in the PML

2009-03-04 Thread Brian W. Barrett

On Wed, 4 Mar 2009, George Bosilca wrote:

I'm churning a lot and not making much progress, but I'll try chewing on 
that idea (unless someone points out it's utterly ridiculous).  I'll look 
into having PML ignore sendi functions altogether and just make the 
"send-immediate" path work fast with normal send functions.  If that works, 
then we can get rid of sendi functions and hopefully have a solution that 
makes sense for everyone.


This is utterly ridiculous (I hope you really expect someone to say it). As I 
said before, SM is only one of the networks supported by Open MPI. 
Independent on how much I would like to have better shared memory 
performance, I will not agree with any PML modifications that are SM 
oriented. We did that in the past with other BTLs and it turned out to be a 
bad idea, so I'm clearly not in favor of doing the same mistake twice.


Regarding the sendi there are at least 3 networks that can take advantage of 
it: Portals, MX and Sicortex. Some of them do this right now, some others in 
the near future. Moreover, for these particular networks there is no way to 
avoid extra overhead without this feature (for very obscure reasons such as 
non contiguous pieces of memory only known by the BTL that can decrease the 
number of network operations).


How about removing the MCA parameter from my earlier proposal and just 
having r2 filter out the sendi calls if there are multiple BTLs with 
heterogeneous BTLs (ie, some with sendi and some without) to the same 
peer.  That way, the early sendi will be bypassed in that case.  But for 
the cases of BTLs that support sendi in common usage scenarios 
(homogeneous nics), we'll get the optimization?  Does that offend you 
George? :)


Brian


Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer

2009-03-09 Thread Brian W. Barrett
I, not surprisingly, have serious concerns about this RFC.  It assumes that 
the ompi_proc issues and bootstrapping issues (the entire point of the 
move, as I understand it) can both be solved, but offers no proof to 
support that claim.  Without those two issues solved, we would be left 
with an onet layer that is dependent on ORTE and OMPI, and which OMPI 
depends upon.  This is not a good place to be.  These issues should be 
resolved before an onet layer is created in the trunk.


This is not an unusual requirement.  The fault tolerance work took a very 
long time because of similar requirements.  Not only was a full 
implementation required to prove performance would not be negatively 
impacted (when FT wasn't active), but we had discussions about its impact 
on code maintainability.  We had a full implementation of all the pieces 
that impacted the code *before* any of it was allowed into the trunk.


We should live by the rules the community has set up.  They have served us 
well in the past.  Further, these are not new objections on my part. 
Since the initial RFCs related to this move started, I have continually 
brought up the exact same questions and never gotten a satisfactory 
answer.  This RFC even acknowledges the issues, but presents no 
solution and still asks to do the most disruptive work.  I simply can't 
see how that fits with Open MPI's long-standing development procedures.


If all the issues I've asked about previously (which are essentially the 
ones you've identified in the RFC) can be solved, the impact to code base 
maintainability is reasonable, and the impact to performance is 
negligible, I'll gladly remove my objection to this RFC.


Further, before any work on this branch is brought into the trunk, the 
admin-level discussion regarding this issue should be resolved.  At this 
time, that discussion is blocking on ORNL and they've given April as the 
earliest such a discussion can occur.  So at the very least, the RFC 
timeout should be pushed into April or ORNL should revise their 
availability for the admin discussion.



Brian


On Mon, 9 Mar 2009, Rainer Keller wrote:



What: Move BTLs into separate layer

Why:  Several projects have expressed interest in using the BTLs. Use-cases
such as the RTE using the BTLs for modex, or tools collecting/distributing data
in the fastest possible way, may be possible.

Where:    This would affect several components that the BTLs depend on
(namely allocator, mpool, rcache and the common part of the BTLs).
Additionally some changes to classes were/are necessary.

When: Preferably 1.5 (in case we use the Feature/Stable Release cycle ;-)

Timeout:  23.03.2009


There has been much speculation about this project.
This RFC should shed some light, if there is some more information required,
please feel free to ask/comment. Of course, suggestions are welcome!

The BTLs offer access to a fast communication framework. Several projects have
expressed interest in using them separately from other layers of Open MPI.
Additionally (with further changes) the BTLs may be used within ORTE itself.

COURSE OF WORK:
The extraction is not easy (as was the extraction of ORTE and OMPI in the
early stages of Open MPI).
In order to get as much input and be as visible as possible (e.g. in TRACS),
the tmp-branch for this work has been set up on:
  https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl

We propose to have a separate ONET library living in onet, based on orte (see
attached fig).

In order to keep the diff between the trunk and the branch to a minimum
several cleanup patches have already been applied to the trunk (e.g.
unnecessary #include of ompi and orte header files, integration of
ompi_bitmap_t into opal_bitmap_t, #include "*_config.h").


Additionally, a script (contrib/move-btl-into-onet, attached below) has been
kept up-to-date that will perform this separation on a fresh checkout of the
trunk:
 svn list https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl/contrib/move-btl-
into-onet

This script requires several patches (see attached TAR-ball).
Please update the variable PATCH_DIR to match the location of patches.

 ./move-btl-into-onet ompi-clean/
 # Lots of output deleted.
 cd ompi-clean/
 rm -fr ompi/mca/common/  # No two mcas called common, too bad...
 ./autogen.sh


OTHER RTEs:
A preliminary header file is provided in onet/include/rte.h to accommodate the
requirements of other RTEs (such as STCI); it replaces selected
functionality, as proposed by Jeff and Ralph in the Louisville meeting.
Additionally, this header file is included before orte header files (within
onet).
By default, this does not change anything in the standard case (ORTE);
otherwise, with -DHAVE_STCI, redefinitions of the ORTE functionality
required within onet are done.


TESTS:
First tests have been done locally on Linux/x86_64.
The branch compiles without warnings.
The wrappers have been updated

Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer

2009-03-09 Thread Brian W. Barrett
I guess then I missed the point of this RFC if not to move code.  It talks 
about bringing this code into the trunk for the 1.5 time frame.  If it's 
just getting general comments and there will be an RFC for all the changes 
(including the onet split proposed below) when the issues have been 
solved, that's great.  I'll comment on the proposal as a whole once my 4 
month old questions are answered.  Until then, I don't think we should be 
using the RFC process to get permission to move portions of a project with 
critical questions unanswered (which is exactly what this RFC reads as 
doing).


Brian


On Mon, 9 Mar 2009, Rainer Keller wrote:


Hi Jeff,
thanks for the mail!
I completely agree with your points.

To stress the fact: the timeout date does not mean that we intend to just
commit to the trunk by that date.
It was rather to get comments by this particular date from all the interested
parties. (This is what I remembered from previous RFCs, but I could be
wrong...)
All the work that has been committed should clean up the code. Anything that
went beyond a cleanup deserved an RFC and input from many people (such as
the bitmap_t change...).

We still intend, as discussed in the Louisville meeting, to get as much input
from the community as possible (that's why this is a TRACS-visible svn tmp
branch).

Thanks,
Rainer



On Monday 09 March 2009 04:52:28 pm Jeff Squyres wrote:

Random points in no particular order (Rainer please correct me if I'm
making bad assumptions):

- I believe that ORNL is proposing to do this work on a separate
branch (this is what we have discussed for some time now, and we
discussed this deeply in Louisville).  The RFC text doesn't
specifically say, but I would be very surprised if this stuff is
planned to come back to the trunk in the near future -- as we have all
agreed, it's not done yet.

- I believe that the timeout field in RFC's is a limit for non-
responsiveness -- it is mainly intended to prevent people from
ignoring / not responding to RFCs.  I do not believe that Rainer was
using that date as a "that's when I'm bringing it all back to the
trunk."  Indeed, he specifically called out the 1.5 series as a target
for this work.

- I also believe that Rainer is using this RFC as a means to get
preliminary review of the work that has been done on the branch so
far.  He has provided a script that shows what they plan to do, how
the code will be laid out, etc.  There are still some important core
issues to be solved -- and, like Brian, I want to see how they'll get
solved before being happy (we have strong precedent for this
requirement) -- but I think all that Rainer was saying in his RFC was
"here's where we are so far; can people review and see if they hate it?"

- It was made abundantly clear in the Louisville meeting that ORTE has
no short-term plans for using the ONET layer (probably no long-term
plans, either, but hey -- never say "never" :-) ).  The design of ONET
is such that other RTE's *could* use ONET if they want (e.g., STCI
will), but it is not a requirement for the underlying RTE to use
ONET.  We agreed in Louisville that ORTE will provide sufficient stubs
and hooks (all probably effectively no-ops) so that ONET can compile
against it in the default OMPI configuration; other RTEs that want to
do more meaningful stuff will need to provide more meaningful
implementations of the stubs and hooks.

- Hopefully the teleconference time tomorrow works out for Rich (his
communications were unclear on this point).  Otherwise, postponing the
admin discussion until April seems problematic.

On Mar 9, 2009, at 4:01 PM, Brian W. Barrett wrote:

I, not surprisingly, have serious concerns about this RFC.  It
assumes that
the ompi_proc issues and bootstrapping issues (the entire point of the
move, as I understand it) can both be solved, but offers no proof to
support that claim.  Without those two issues solved, we would be left
with an onet layer that is dependent on ORTE and OMPI, and which OMPI
depends upon.  This is not a good place to be.  These issues should be
resolved before an onet layer is created in the trunk.

This is not an unusual requirement.  The fault tolerance work took a
very
long time because of similar requirements.  Not only was a full
implementation required to prove performance would not be negatively
impacted (when FT wasn't active), but we had discussions about its
impact
on code maintainability.  We had a full implementation of all the
pieces
that impacted the code *before* any of it was allowed into the trunk.

We should live by the rules the community has setup.  They have
served us
well in the past.  Further, these are not new objections on my part.
Since the initial RFCs related to this move started, I have
continually
brought up the exact same questions and never gotten a satisfactory
answer.  This RFC even acknowledges the issues, but without
presenting any
solution and s

Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer

2009-03-11 Thread Brian W. Barrett

On Wed, 11 Mar 2009, Richard Graham wrote:


Brian,
Going back over the e-mail trail it seems like you have raised two
concerns:
- BTL performance after the change, which I would take to be
  - btl latency
  - btl bandwidth
- Code maintainability
- repeated code changes that impact a large number of files
- A demonstration that the changes actually achieve their goal. As we
discussed after you got off the call, there are two separate goals here
  - being able to use the BTLs outside the context of MPI, but
within the ompi code base
  - ability to use the BTLs in the context of a run-time other than
orte
Another concern I have heard raised by others is
  - mpi startup time

Has anything else been missed here ?  I would like to make sure that we
address all the issues raised in the next version of the RFC.


I think the umbrella concerns for the final success of the change are btl 
performance (in particular, latency and message rates for cache-unfriendly 
applications/benchmarks) and code maintainability.  In addition, there are 
some intermediate change issues I have, in that this project is working 
different than other large changes.  In particular, there is/was the 
appearance of being asked to accept changes which only make sense if the 
btl move is going to move forward, without any way to judge the 
performance or code impact because critical technical issues still remain.


The latency/message rate issues are fairly straightforward from an 
end-measurement point of view.  My concerns on latency/message rate come not from 
the movement of the BTL to another library (for most operating systems / 
shared library systems that should be negligible), but from the code 
changes which surround moving the BTLs.  The BTLs are tightly intertwined 
with a number of pieces of the OMPI layer, in particular the BML and MPool 
frameworks and the ompi proc structure.  I had a productive conversation 
with Rainer this morning explaining why I'm so concerned about the bml and 
ompi proc structures.  The ompi proc structure currently acts not only as 
the identifier for a remote endpoint, but stores endpoint specific data 
for both the PML and BML.  The BML structure actually contains each BTL's 
per process endpoint information, in the form of the base_endpoint_t* 
structures returned from add_procs().  Moving these structures around must 
be done with care, as some of the proposals Jeff, Rainer, and I came up 
with this morning either induced spaghetti code or greatly increased the 
spread of information needed for the critical send path through the memory 
space (thereby likely increasing cache misses on send for real 
applications).
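
To make the data-layout concern concrete, a much-simplified, purely 
illustrative sketch of the relationships described above (these structs are 
stand-ins, not the real ompi_proc_t/BML/BTL types):

/* Hedged, simplified sketch: the proc identifies the peer, the BML endpoint
 * hangs off it, and the BML endpoint holds the per-BTL endpoints returned by
 * add_procs().  Illustrative only. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct btl_endpoint;                       /* opaque, owned by each BTL */

typedef struct {
    struct btl_endpoint *btl_endpoint;     /* from that BTL's add_procs() */
    double weight;                         /* scheduling weight, etc. */
} bml_btl_t;

typedef struct {
    bml_btl_t eager[2];                    /* BTLs usable for eager sends */
    size_t    num_eager;
} bml_endpoint_t;

typedef struct {
    uint64_t        peer_name;             /* identifies the remote process */
    bml_endpoint_t *bml_endpoint;          /* BML data keyed off the proc */
    void           *pml_data;              /* PML's per-peer state */
} proc_t;

int main(void)
{
    /* The send critical path walks proc -> bml_endpoint -> bml_btl ->
     * btl_endpoint; spreading these across layers/libraries is where the
     * extra-cache-miss worry comes from. */
    printf("sizeof(proc_t) = %zu bytes\n", sizeof(proc_t));
    return 0;
}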


The code maintainability issue comes from three separate and independent 
issues.  First, there is the issue of how the pieces of the OMPI layer 
will interact after the move.  The BML/BTL/MPool/Rcache dance is already 
complicated, and care should be taken to minimize that change.  Start-up 
is also already quite complex, and moving the BTLs to make them 
independent of starting other pieces of Open MPI can be done well or can 
be done poorly.  We need to ensure it's done well, obviously.  Second, 
there is the issue of wire-up.  My impression from conversations with 
everyone at ORNL was that this move of BTLs would include changes to allow 
BTLs to wire-up without the RML.  I understand that Rich said this was not 
the case during the part of the admin meeting I missed yesterday, so 
that may no longer be a concern.  Finally, there has been some discussion, 
mainly second hand in my case, about the mechanisms in which the trunk 
would be modified to allow for using OMPI without ORTE.  I have concerns 
that we'd add complexity to the BTLs to achieve that, and again that can 
be done poorly if we're not careful.  Talking with Jeff and Rainer this 
morning helped reduce my concern in this area, but I think it also added 
to the technical issues which must be solved to consider this project ready 
for movement to the trunk.


There are a couple of technical issues which I believe prevent a 
reasonable discussion of the performance and maintainability issues based 
on the current branch.  I talked about some of them in the previous two 
paragraphs, but so that we have a short bullet list, they are:


  - How will the ompi_proc_t be handled?  In particular,
where will PML/BML data be stored, and how will we
avoid adding new cache misses.
  - How will the BML and MPool be handled?  The BML holds
the BTL endpoint data, so changes have to be made if
it continues to live in OMPI.
  - How will the modex and the intricate dance with adding
new procs from dynamic processes be handled?
  - How will we handle the progress mechanisms in cases where
the MTLs are used and the BTLs aren't needed by the RTE?
  - If there are users outside of OMPI, but who want to also use
OMPI, how will the library versioning / conflict problem be
solved?


As was mentioned before, our t

Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert topping or is it a floor wax?

2009-03-11 Thread Brian W. Barrett

On Wed, 11 Mar 2009, Andrew Lumsdaine wrote:

Hi all -- There is a meta question that I think is underlying some of the 
discussion about what to do with BTLs etc.  Namely, is Open MPI an MPI 
implementation with a portable run time system -- or is it a distributed OS 
with an MPI interface?  It seems like some of the changes being asked for 
(e.g., with the BTLs) reflect the latter -- but perhaps not everyone shares 
that view and hence the impedance mismatch.


I doubt this is the last time that tensions will come up because of differing 
views on this question.


I suggest that we come to some kind of common understanding of the question 
(and answer) and structure development and administration accordingly.


My personal (and I believe, Sandia's) view is that Open MPI should seek to 
be the best MPI implementation it can be and to leave the distributed OS 
part to a distributed OS project.  This can be seen by my work with Ralph 
over the past few years to reduce the amount of run-time that exists when 
running on Red Storm.  My vision of the (ideal, possibly impractical) Open 
MPI would be one with a DPM framework (the interface between OMPI and the 
run-time) and nothing else in the run-time category.


That being said, I understand the fact that we need a run-time for 
platforms which are not as robust as Red Storm.  I also understand the 
desire to build a variety of programming paradigms on top of Open MPI's 
strong infrastructure.  Given the number of broken interfaces out there, 
only having to fix them once with more software is attractive.


In the end, I don't want to give up the high quality MPI implementation 
part of the project to achieve the goal of wider applicability.  Five 
years ago, we set out to build the best MPI implementation we could, and 
we're not done yet.  We should not give up that goal to support other 
programming paradigms or projects.  However, changes to better support 
other projects and which do not detract from the primary goal of a high 
quality MPI implementation should be pursued.



Brian


Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert toppingor is it a floor wax?

2009-03-12 Thread Brian W. Barrett
I'm going to stay out of the debate about whether Andy correctly 
characterized the two points you brought up as a distributed OS or not.


Sandia's position on these two points remains the same as I previously 
stated when the question was distributed OS or not.  The primary goal of 
the Open MPI project was and should remain to be the best MPI project 
available.  Low-cost items to support different run-times or different 
non-MPI communication contexts are worth the work.  But high-cost items 
should be avoided, as they degrade our ability to provide the best MPI 
project available (of course, others, including OMPI developers, can take 
the source and do what they wish outside the primary development tree).


High performance is a concern, but so is code maintainability.  If it 
takes twice as long to implement feature A because I have to worry about 
its impact not only on MPI, but also on projects X, Y, and Z, then as an MPI 
developer I've lost something important.


Brian

On Thu, 12 Mar 2009, Richard Graham wrote:


I am assuming that by distributed OS you are referring to the changes that
we (not just ORNL) are trying to do.  If this is the case, this is a
mischaracterization of our intentions.  We have two goals

 - To be able to use a different run-time than ORTE to drive Open MPI
 - To use the communication primitives outside the context of MPI (with or
without ORTE)

High performance is critical, and at NO time have we ever said anything
about sacrificing performance - these have been concerns that others
(rightfully) have expressed.

Rich


On 3/12/09 8:24 AM, "Jeff Squyres"  wrote:


I think I have to agree with Terry.

I love to inspire and see new, original, and unintended uses for Open
MPI.  But our primary focus must remain to create, maintain, and
continue to deliver a high performance MPI implementation.

We have a long history of adding "small" things to Open MPI that are
useful to 3rd parties because it helps them, helps further Open MPI
adoption/usefulness, and wasn't difficult for us to do ("small" can
have varying definitions).  I'm in favor of such things, as long as we
maintain a policy of "in cases of conflict, OMPI/high performance MPI
wins".


On Mar 12, 2009, at 9:01 AM, Terry Dontje wrote:


Sun's participation in this community was to obtain a stable and
performant MPI implementation that had some research work done on the
side to improve those goals and the introduction of new features.   We
don't have problems with others using and improving on the OMPI code
base but we need to make sure such usage doesn't detract from our
primary goal of performant MPI implementation.

However, changes to the OMPI code base to allow it to morph into or even
support a distributed OS do cause some concern.  That is, are we
opening the door to having more interfaces to support?  If so, is this
wise, given that it seems to me we have a hard enough time trying
to focus on the MPI items?  Not to mention this definitely starts
detracting from the original goals.

--td

Andrew Lumsdaine wrote:

Hi all -- There is a meta question that I think is underlying some

of

the discussion about what to do with BTLs etc.  Namely, is Open

MPI an

MPI implementation with a portable run time system -- or is it a
distributed OS with an MPI interface?  It seems like some of the
changes being asked for (e.g., with the BTLs) reflect the latter --
but perhaps not everyone shares that view and hence the impedance
mismatch.

I doubt this is the last time that tensions will come up because of
differing views on this question.

I suggest that we come to some kind of common understanding of the
question (and answer) and structure development and administration
accordingly.

Best Regards,
Andrew Lumsdaine





Re: [OMPI devel] Inherent limit on #communicators?

2009-04-30 Thread Brian W. Barrett

On Thu, 30 Apr 2009, Ralph Castain wrote:


We seem to have hit a problem here - it looks like we are seeing a
built-in limit on the number of communicators one can create in a
program. The program basically does a loop, calling MPI_Comm_split each
time through the loop to create a sub-communicator, does a reduce
operation on the members of the sub-communicator, and then calls
MPI_Comm_free to release it (this is a minimized reproducer for the real
code). After 64k times through the loop, the program fails.

This looks remarkably like a 16-bit index that hits a max value and then
blocks.

I have looked at the communicator code, but I don't immediately see such
a field. Is anyone aware of some other place where we would have a limit
that would cause this problem?


There's a maximum of 32768 communicator ids when using OB1 (each PML can 
set the max contextid, although the communicator code is the part that 
actually assigns a cid).  Assuming that comm_free is actually properly 
called, there should be plenty of cids available for that pattern. 
However, I'm not sure I understand the block algorithm someone added to 
cid allocation - I'd have to guess that there's something funny with that 
routine and cids aren't being recycled properly.


Brian


Re: [OMPI devel] Inherent limit on #communicators?

2009-04-30 Thread Brian W. Barrett
When we added the CM PML, we added a pml_max_contextid field to the PML 
structure, which is the max size cid the PML can handle (because the 
matching interfaces don't allow 32 bits to be used for the cid).  At the 
same time, the max cid for OB1 was shrunk significantly, so that the 
header on a short message would be packed tightly with no alignment 
padding.
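
The packing argument can be illustrated with two made-up example headers (not 
the actual OB1 match header layout): widening the context id field from 16 to 
32 bits grows, and pads, a structure that travels with every short message.

/* Illustrative headers only, not the real OB1 match header. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint8_t  type;
    uint8_t  flags;
    uint16_t context_id;     /* 16-bit field: the ~32k cid limit discussed here */
    uint16_t sequence;
    uint16_t src_rank;
} small_hdr_t;

typedef struct {
    uint8_t  type;
    uint8_t  flags;
    uint32_t context_id;     /* full 32-bit cid forces alignment padding */
    uint16_t sequence;
    uint16_t src_rank;
} wide_hdr_t;

int main(void)
{
    printf("16-bit cid header: %zu bytes\n", sizeof(small_hdr_t));
    printf("32-bit cid header: %zu bytes\n", sizeof(wide_hdr_t));
    return 0;
}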


At the time, we believed 32k simultaneous communicators was plenty, and 
that CIDs were reused (we checked, I'm pretty sure).  It sounds like 
someone removed the CID reuse code, which seems rather bad to me.  There 
have to be unused CIDs in Ralph's example - is there a way to fall back out 
of the block algorithm when it can't find a new CID, and find one it can 
reuse?  Other than setting the multi-threaded case back on, that is?


Brian

On Thu, 30 Apr 2009, Edgar Gabriel wrote:

cid's are in fact not recycled in the block algorithm. The problem is that 
comm_free is not collective, so you can not make any assumptions whether 
other procs have also released that communicator.



But nevertheless, a cid in the communicator structure is a uint32_t, so it 
should not hit the 16k limit there yet. this is not new, so if there is a 
discrepancy between what the comm structure assumes that a cid is and what 
the pml assumes, than this was in the code since the very first days of Open 
MPI...


Thanks
Edgar

Brian W. Barrett wrote:

On Thu, 30 Apr 2009, Ralph Castain wrote:


We seem to have hit a problem here - it looks like we are seeing a
built-in limit on the number of communicators one can create in a
program. The program basically does a loop, calling MPI_Comm_split each
time through the loop to create a sub-communicator, does a reduce
operation on the members of the sub-communicator, and then calls
MPI_Comm_free to release it (this is a minimized reproducer for the real
code). After 64k times through the loop, the program fails.

This looks remarkably like a 16-bit index that hits a max value and then
blocks.

I have looked at the communicator code, but I don't immediately see such
a field. Is anyone aware of some other place where we would have a limit
that would cause this problem?


There's a maximum of 32768 communicator ids when using OB1 (each PML can 
set the max contextid, although the communicator code is the part that 
actually assigns a cid).  Assuming that comm_free is actually properly 
called, there should be plenty of cids available for that pattern. However, 
I'm not sure I understand the block algorithm someone added to cid 
allocation - I'd have to guess that there's something funny with that 
routine and cids aren't being recycled properly.


Brian





Re: [OMPI devel] Inherent limit on #communicators?

2009-04-30 Thread Brian W. Barrett

On Thu, 30 Apr 2009, Edgar Gabriel wrote:


Brian W. Barrett wrote:
When we added the CM PML, we added a pml_max_contextid field to the PML 
structure, which is the max size cid the PML can handle (because the 
matching interfaces don't allow 32 bits to be used for the cid).  At the 
same time, the max cid for OB1 was shrunk significantly, so that the header 
on a short message would be packed tightly with no alignment padding.


At the time, we believed 32k simultaneous communicators was plenty, and 
that CIDs were reused (we checked, I'm pretty sure).  It sounds like 
someone removed the CID reuse code, which seems rather bad to me. 


yes, we added the block algorithm. Not reusing a CID actually doesn't strike me 
as that dramatic, and I am still not sure and convinced about that :-) We do 
not have an empty array or something like that; it's just a number.


The reason for the block algorithm was that the performance of our 
communicator creation code sucked, and the cid allocation was one portion of 
that. We used to require *at least* 4 collective operations per communicator 
creation at that time. We are now potentially down to 0, among others thanks 
to the block algorithm.


However, let me think about reusing entire blocks, its probably doable just 
requires a little more bookkeeping...


There have to be unused CIDs in Ralph's example - is there a way to 
fallback out of the block algorithm when it can't find a new CID and find 
one it can reuse?  Other than setting the multi-threaded case back on, that 
is?


remember that it's not the communicator id allocation that is failing at this 
point, so the question is: do we have to 'validate' a cid with the pml before 
we declare it to be ok?


well, that's only because the code's doing something it shouldn't.  Have a 
look at comm_cid.c:185 - there's the check we added to the multi-threaded 
case (which was the only case when we added it).  The cid generation 
should never generate a number larger than mca_pml.pml_max_contextid. 
I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear 
to check it got a valid cid in add_comm, which it should probably do.


Looking at the differences between v1.2 and v1.3, the max_contextid code 
was already in v1.2 and OB1 was setting it to 32k.  So the cid blocking 
code removed a rather critical feature and probably should be fixed or 
removed for v1.3.  On Portals, I only get 8k cids, so not having reuse is 
a rather large problem.
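
A hedged sketch of the kind of guard being discussed (hypothetical names, not 
the actual comm_cid.c code): validate a freshly allocated cid against the 
PML's advertised maximum and fail loudly rather than handing the PML a value 
it cannot represent.

/* Hypothetical guard; not the real Open MPI cid allocation code. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define PML_MAX_CONTEXTID 32768u   /* e.g. OB1; Portals is ~8k per this thread */

static uint32_t next_cid = 0;

static uint32_t allocate_cid(void)
{
    uint32_t cid = next_cid++;
    if (cid >= PML_MAX_CONTEXTID) {
        /* Without reuse of freed cids, long split/free loops end up here. */
        fprintf(stderr, "cid %u exceeds PML limit %u -- aborting, not hanging\n",
                cid, PML_MAX_CONTEXTID);
        abort();
    }
    return cid;
}

int main(void)
{
    printf("first cid: %u\n", allocate_cid());
    return 0;
}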


Brian


Re: [OMPI devel] Inherent limit on #communicators?

2009-04-30 Thread Brian W. Barrett

On Thu, 30 Apr 2009, Ralph Castain wrote:

well, that's only because the code's doing something it shouldn't.
 Have a look at comm_cid.c:185 - there's the check we added to the
multi-threaded case (which was the only case when we added it).
 The cid generation should never generate a number larger than
mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails
gracefully, as OB1 doesn't appear to check it got a valid cid in
add_comm, which it should probably do.

Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.


Ah.  Patch to change the hang into an abort coming RSN.

Brian

Re: [OMPI devel] Inherent limit on #communicators?

2009-05-01 Thread Brian W. Barrett
On Apr 30, 2009, at 12:28 PM, Edgar Gabriel wrote:

cid's are in fact not recycled in the block algorithm. The problem is that 
comm_free is not collective, so you can not make any assumptions whether 
other procs have also released that communicator.

But nevertheless, a cid in the communicator structure is a uint32_t, so it 
should not hit the 16k limit there yet. this is not new, so if there is a 
discrepancy between what the comm structure assumes that a cid is and what 
the pml assumes, than this was in the code since the very first days of 
Open MPI...

Thanks
Edgar

Brian W. Barrett wrote:

On Thu, 30 Apr 2009, Ralph Castain wrote:

We seem to have hit a problem here - it looks like we are seeing a built-in 
limit on the number of communicators one can create in a program. The 
program basically does a loop, calling MPI_Comm_split each time through the 
loop to create a sub-communicator, does a reduce operation on the members 
of the sub-communicator, and then calls MPI_Comm_free to release it (this 
is a minimized reproducer for the real code). After 64k times through the 
loop,

Re: [OMPI devel] Revise paffinity method?

2009-05-06 Thread Brian W. Barrett

On Wed, 6 May 2009, Ralph Castain wrote:


Any thoughts on this? Should we change it?


Yes, we should change this (IMHO) :).

If so, who wants to be involved in the re-design? I'm pretty sure it would 
require some modification of the paffinity framework, plus some minor mods to 
the odls framework and (since you cannot bind a process other than yourself) 
addition of a new small "proxy" script that would bind-then-exec each process 
started by the orted (Eugene posted a candidate on the user list, though we 
will have to deal with some system-specific issues in it).


I can't contribute a whole lot of time, but I'd be happy to lurk, offer 
advice, and write some small bits of code.  But I definitely can't lead.


First offering of opinion from me.  I think we can avoid the "proxy" script 
by doing the binding after the fork but before the exec.  This will 
definitely require minor changes to the odls and probably a bunch of 
changes to the paffinity framework.  This will make things slightly less 
fragile than a script would, and yet get us what we want.
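
A minimal, Linux-specific sketch of the bind-after-fork-before-exec idea 
(illustrative only, not the odls/paffinity code; the target core would really 
come from the mapper):

/* Hedged sketch: the child pins itself with sched_setaffinity() and then
 * exec()s the application, so no separate proxy script is needed. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int target_core = 0;                 /* would come from the mapper */
    pid_t pid = fork();

    if (pid == 0) {                      /* child: bind, then exec */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(target_core, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            _exit(1);
        }
        execlp("hostname", "hostname", (char *)NULL);  /* stand-in for the app */
        perror("execlp");
        _exit(1);
    }
    waitpid(pid, NULL, 0);
    return 0;
}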


Brian


Re: [OMPI devel] Build failures on trunk? r21235

2009-05-14 Thread Brian W. Barrett

On Thu, 14 May 2009, Jeff Squyres wrote:


On May 14, 2009, at 1:46 PM, Ralf Wildenhues wrote:


A more permanent workaround could be in OpenMPI to list each library
that is used *directly* by some other library as a dependency.  Sigh.


We actually took pains to *not* do that; we *used* to do that and explicitly 
took it out.  :-\  IIRC, it had something to do with dlopen'ing libmpi.so...?


Actually, I think that was something else.  Today, libopen-rte.la lists 
libopen-pal.la as a dependency and libmpi.la lists libopen-rte.la.  I had 
removed the dependency of libmpi.la on libopen-pal.la because it was 
causing libopen-pal.so to be listed twice by libtool, which was causing 
problems.


It would be a trivial fix to change the Makefiles to make libmpi.la to 
depend on libopen-pal.la as well as libopen-rte.la.


Brian


Re: [OMPI devel] Build failures on trunk? r21235

2009-05-14 Thread Brian W. Barrett

On Thu, 14 May 2009, Ralf Wildenhues wrote:


Hi Brian,

* Brian W. Barrett wrote on Thu, May 14, 2009 at 08:22:58PM CEST:


Actually, I think that was something else.  Today, libopen-rte.la lists
libopen-pal.la as a dependency and libmpi.la lists libopen-rte.la.  I had
removed the dependency of libmpi.la on libopen-pal.la because it was
causing libopen-pal.so to be listed twice by libtool, which was causing
problems.


That's weird, and shouldn't happen (the problems, that is).  Do you have
a pointer for them?


I don't - it was many moons ago.  And it very likely was when we were in 
that (evil) period where we were using LT2 before it was released as 
stable.  So it's completely possible we were seeing a transient bug which 
is long since gone.


Brian


Re: [OMPI devel] Build failures on trunk? r21235

2009-05-14 Thread Brian W. Barrett

On Thu, 14 May 2009, Jeff Squyres wrote:


On May 14, 2009, at 2:22 PM, Brian W. Barrett wrote:

We actually took pains to *not* do that; we *used* to do that and explicitly 
took it out.  :-\  IIRC, it had something to do with dlopen'ing libmpi.so...?


Actually, I think that was something else.  Today, libopen-rte.la lists
libopen-pal.la as a dependency and libmpi.la lists libopen-rte.la.  I had
removed the dependency of libmpi.la on libopen-pal.la because it was
causing libopen-pal.so to be listed twice by libtool, which was causing
problems.

It would be a trivial fix to change the Makefiles to make libmpi.la to
depend on libopen-pal.la as well as libopen-rte.la.


Ah -- am I thinking of us removing libmpi (etc.) from the DSOs?


I think so.  And that's a change we definitely don't want to undo.

Brian


Re: [OMPI devel] opal / fortran / Flogical

2009-06-01 Thread Brian W. Barrett

I have to agree with Jeff's concerns.

Brian

On Mon, 1 Jun 2009, Jeff Squyres wrote:


Hmm.  I'm not sure that I like this commit.

George, Brian, and I specifically kept Fortran out of (the non-generated code 
in) opal because the MPI layer is the *only* layer that uses Fortran.  There 
were one or two minor abstraction breaks (you cited opal/util/arch.c), but now 
we have Fortran all throughout Opal.  Hmmm...  :-\


Is MPI_Flogical a real type?  I don't see it defined in the MPI-2.2 latex 
sources, but I could be missing it.  I *thought* we used 
ompi_fortran_logical_t internally because there was no officially sanctioned 
MPI_ type for it...?




On May 30, 2009, at 11:54 AM,   
wrote:



Author: rusraink
Date: 2009-05-30 11:54:29 EDT (Sat, 30 May 2009)
New Revision: 21330
URL: https://svn.open-mpi.org/trac/ompi/changeset/21330

Log:
- Move alignment and size output generated by configure-tests
  into the OPAL namespace, eliminating cases like opal/util/arch.c
  testing for ompi_fortran_logical_t.
  As this is processor- and compiler-related information
  (e.g. does the compiler/architecture support REAL*16)
  this should have been on the OPAL layer.
- Unifies f77 code using MPI_Flogical instead of opal_fortran_logical_t

- Tested locally (Linux/x86-64) with mpich and intel testsuite
  but would like to get this week-ends MTT output


- PLEASE NOTE: configure-internal macro-names and
  ompi_cv_ variables have not been changed, so that
  external platform (not in contrib/) files still work.


Text files modified:
  trunk/config/f77_check.m4 | 20
  trunk/config/f77_check_logical_array.m4 | 6
  trunk/config/f77_check_real16_c_equiv.m4 | 14
  trunk/config/f77_get_fortran_handle_max.m4 | 4
  trunk/config/f77_get_value_true.m4 | 14
  trunk/config/f77_purge_unsupported_kind.m4 | 8
  trunk/config/f90_check.m4 | 10
  trunk/configure.ac | 20
  trunk/contrib/platform/win32/CMakeModules/f77_check.cmake | 24
  trunk/contrib/platform/win32/CMakeModules/f77_check_real16_c_equiv.cmake | 12
  trunk/contrib/platform/win32/CMakeModules/ompi_configure.cmake | 154
  trunk/contrib/platform/win32/ConfigFiles/mpi.h.cmake | 96 ++--
  trunk/contrib/platform/win32/ConfigFiles/opal_config.h.cmake | 222 ++--
  trunk/ompi/attribute/attribute.c | 6
  trunk/ompi/attribute/attribute.h | 4
  trunk/ompi/communicator/comm_init.c | 2
  trunk/ompi/datatype/copy_functions.c | 10
  trunk/ompi/datatype/copy_functions_heterogeneous.c | 14
  trunk/ompi/datatype/dt_module.c | 224 ++--
  trunk/ompi/errhandler/errcode-internal.c | 2
  trunk/ompi/errhandler/errcode.c | 2
  trunk/ompi/errhandler/errhandler.c | 2
  trunk/ompi/file/file.c | 2
  trunk/ompi/group/group_init.c | 2
  trunk/ompi/include/mpi.h.in | 96 ++--
  trunk/ompi/include/ompi_config.h.in | 48 +-
  trunk/ompi/info/info.c | 2
  trunk/ompi/mca/op/base/functions.h | 56 +-
  trunk/ompi/mca/op/base/op_base_functions.c | 722
  trunk/ompi/mca/osc/base/osc_base_obj_convert.c | 8
  trunk/ompi/mpi/c/type_create_f90_integer.c | 4
  trunk/ompi/mpi/f77/base/attr_fn_f.c | 48 +-
  trunk/ompi/mpi/f77/file_read_all_end_f.c | 6
  trunk/ompi/mpi/f77/file_read_all_f.c | 6
  trunk/ompi/mpi/f77/file_read_at_all_end_f.c | 6
  trunk/ompi/mpi/f77/file_read_at_all_f.c | 6
  trunk/ompi/mpi/f77/file_read_at_f.c | 6
  trunk/ompi/mpi/f77/file_read_f.c | 6
  trunk/ompi/mpi/f77/file_read_ordered_end_f.c | 6
  trunk/ompi/mpi/f77/file_read_ordered_f.c | 6
  trunk/ompi/mpi/f77/file_read_shared_f.c | 6
  trunk/ompi/mpi/f77/file_write_all_end_f.c | 6
  trunk/ompi/mpi/f77/file_write_all_f.c | 6
  trunk/ompi/mpi/f77/file_write_at_all_end_f.c | 6
  trunk/ompi/mpi/f77/file_write_at_all_f.c | 6
  trunk/ompi/mpi/f77/file_write_at_f.c | 6
  trunk/ompi/mpi/f77/file_write_f.c | 6
  trunk/ompi/mpi/f77/file_write_ordered_end_f.c | 6
  trunk/ompi/mpi/f77/file_write_ordered_f.c | 6
  trunk/ompi/mpi/f77/file_write_shared_f.c | 6
  trunk/ompi/mpi/f77/fint_2_int.h | 16
  trunk/ompi/mpi/f77/iprobe_f.c | 6
  trunk/ompi/mpi/f77/probe_f.c | 6
  trunk/ompi/mpi/f77/recv_f.c | 6
  trunk/ompi/mpi/f77/testsome_f.c | 4
  trunk/ompi/mpi/f90/fortran_sizes.h.in | 64 +-
  trunk/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh | 16
  trunk/ompi/request/request.c | 2
  trunk/ompi/tools/ompi_info/param.cc | 96 ++--
  trunk/ompi/win/win.c | 2
  trunk/opal/class/opal_bitmap.c | 2
  trunk/opal/class/opal_bitmap.h | 2
  trunk/opal/class/opal_pointer_array.c | 4
  trunk/opal/include/opal_config_bottom.h | 10
  trunk/opal/util/arch.c | 6

  65 files changed, 1104 insertions(+), 1104 deletions(-)

Modified: trunk/config/f77_check.m4
==
---

Re: [OMPI devel] opal / fortran / Flogical

2009-06-01 Thread Brian W. Barrett
Well, this may just be another sign that the push of the DDT to OPAL is a 
bad idea.  That's been my opinion from the start, so I'm biased.  But OPAL 
was intended to be single process systems portability, not MPI crud.


Brian

On Mon, 1 Jun 2009, Rainer Keller wrote:


Hmm, OK, I see.
However, I do see a potential problem with getting the ddt work onto the OPAL
layer when we have a fortran compiler with different alignment requirements
for the same-sized basic types...

As far as I understand it, the OPAL layer is meant to abstract away underlying
system portability, libc quirks, and compiler information.

But I am perfectly fine with reverting this!
Let's discuss, maybe phone?

Thanks,
Rainer


On Monday 01 June 2009 10:38:51 am Jeff Squyres wrote:

Hmm.  I'm not sure that I like this commit.

George, Brian, and I specifically kept Fortran out of (the non-
generated code in) opal because the MPI layer is the *only* layer that
uses Fortran.  There was one or two minor abstraction breaks (you
cited opal/util/arch.c), but now we have Fortran all throughout Opal.
Hmmm...  :-\

Is MPI_Flogical a real type?  I don't see it defined in the MPI-2.2
latex sources, but I could be missing it.  I *thought* we used
ompi_fortran_logical_t internally because there was no officially
sanctioned MPI_ type for it...?



On May 30, 2009, at 11:54 AM, 

 wrote:

Author: rusraink
Date: 2009-05-30 11:54:29 EDT (Sat, 30 May 2009)
New Revision: 21330
URL: https://svn.open-mpi.org/trac/ompi/changeset/21330

Log:
 - Move alignment and size output generated by configure-tests
   into the OPAL namespace, eliminating cases like opal/util/arch.c
   testing for ompi_fortran_logical_t.
   As this is processor- and compiler-related information
   (e.g. does the compiler/architecture support REAL*16)
   this should have been on the OPAL layer.
 - Unifies f77 code using MPI_Flogical instead of
opal_fortran_logical_t

 - Tested locally (Linux/x86-64) with mpich and intel testsuite
   but would like to get this week-ends MTT output


 - PLEASE NOTE: configure-internal macro-names and
   ompi_cv_ variables have not been changed, so that
   external platform (not in contrib/) files still work.


Text files modified:
   trunk/config/f77_check.m4 | 20
   trunk/config/f77_check_logical_array.m4 | 6
   trunk/config/f77_check_real16_c_equiv.m4 | 14
   trunk/config/f77_get_fortran_handle_max.m4 | 4
   trunk/config/f77_get_value_true.m4 | 14
   trunk/config/f77_purge_unsupported_kind.m4 | 8
   trunk/config/f90_check.m4 | 10
   trunk/configure.ac | 20
   trunk/contrib/platform/win32/CMakeModules/f77_check.cmake | 24
   trunk/contrib/platform/win32/CMakeModules/f77_check_real16_c_equiv.cmake | 12
   trunk/contrib/platform/win32/CMakeModules/ompi_configure.cmake | 154
   trunk/contrib/platform/win32/ConfigFiles/mpi.h.cmake | 96 ++--
   trunk/contrib/platform/win32/ConfigFiles/opal_config.h.cmake | 222 ++--
   trunk/ompi/attribute/attribute.c | 6
   trunk/ompi/attribute/attribute.h | 4
   trunk/ompi/communicator/comm_init.c | 2
   trunk/ompi/datatype/copy_functions.c | 10
   trunk/ompi/datatype/copy_functions_heterogeneous.c | 14
   trunk/ompi/datatype/dt_module.c | 224 ++--
   trunk/ompi/errhandler/errcode-internal.c | 2
   trunk/ompi/errhandler/errcode.c | 2
   trunk/ompi/errhandler/errhandler.c | 2
   trunk/ompi/file/file.c | 2
   trunk/ompi/group/group_init.c | 2
   trunk/ompi/include/mpi.h.in | 96 ++--
   trunk/ompi/include/ompi_config.h.in | 48 +-
   trunk/ompi/info/info.c | 2
   trunk/ompi/mca/op/base/functions.h | 56 +-
   trunk/ompi/mca/op/base/op_base_functions.c | 722 ++++
   trunk/ompi/mca/osc/base/osc_base_obj_convert.c | 8
   trunk/ompi/mpi/c/type_create_f90_integer.c | 4
   trunk/ompi/mpi/f77/base/attr_fn_f.c | 48 +-
   trunk/ompi/mpi/f77/fi

Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-23 Thread Brian W. Barrett
I think that sounds like a rational path forward.  Another, more long 
term, option would be to move from the FIFOs to a linked list (which can 
even be atomic), which is what MPICH does with nemesis.  In that case, 
there's never a queue to get backed up (although the receive queue for 
collectives is still a problem).  It would also solve the returning a 
fragment without space problem, as there's always space in a linked list.


Brian

On Tue, 23 Jun 2009, Eugene Loh wrote:

The sm BTL used to have two mechanisms for dealing with congested FIFOs.  One 
was to grow the FIFOs.  Another was to queue pending sends locally (on the 
sender's side).  I think the grow-FIFO mechanism was typically invoked and 
the pending-send mechanism used only under extreme circumstances (no more 
memory).


With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs.  The 
code added complexity and there seemed to be no need to have two mechanisms 
to deal with congested FIFOs.  In ticket 1944, however, we see that repeated 
collectives can produce hangs, and this seems to be due to the pending-send 
code not adequately dealing with congested FIFOs.


Today, when a process tries to write to a remote FIFO and fails, it queues 
the write as a pending send.  The only condition under which it retries 
pending sends is when it gets a fragment back from a remote process.


I think the logic must have been that the FIFO got congested because we 
issued too many sends.  Getting a fragment back indicates that the remote 
process has made progress digesting those sends.  In ticket 1944, we see that 
a FIFO can also get congested from too many returning fragments.  Further, 
with shared FIFOs, a FIFO could become congested due to the activity of a 
third-party process.


In sum, getting a fragment back from a remote process is a poor indicator 
that it's time to retry pending sends.


Maybe the real way to know when to retry pending sends is just to check if 
there's room on the FIFO.


So, I'll try modifying MCA_BTL_SM_FIFO_WRITE.  It'll start by checking if 
there are pending sends.  If so, it'll retry them before performing the 
requested write.  This should also help preserve ordering a little better. 
I'm guessing this will not hurt our message latency in any meaningful way, 
but I'll check this out.
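
As a toy sketch of that change (the fixed-size array stands in for the 
shared-memory FIFO and every name here is made up, not the real 
MCA_BTL_SM_FIFO_WRITE code): retry the locally queued pending sends first, 
then attempt the new write, so ordering is preserved and the backlog drains 
as soon as the FIFO has room.

#include <stdio.h>

#define FIFO_SLOTS  4
#define PENDING_MAX 16

static int fifo[FIFO_SLOTS];
static int fifo_count;
static int pending[PENDING_MAX];
static int pending_count;

static int fifo_write(int frag)                /* returns 0 on success */
{
    if (FIFO_SLOTS == fifo_count) return -1;   /* remote FIFO is full */
    fifo[fifo_count++] = frag;
    return 0;
}

static void sm_send(int frag)
{
    /* retry older pending sends first, preserving order */
    while (pending_count > 0 && 0 == fifo_write(pending[0])) {
        for (int i = 1; i < pending_count; i++) pending[i - 1] = pending[i];
        pending_count--;
    }
    /* if anything is still pending, or the write fails, queue the new frag */
    if (pending_count > 0 || 0 != fifo_write(frag)) {
        pending[pending_count++] = frag;
    }
}

int main(void)
{
    for (int frag = 0; frag < 8; frag++) sm_send(frag);
    printf("in fifo: %d, locally pending: %d\n", fifo_count, pending_count);
    return 0;
}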


Meanwhile, I wanted to check in with y'all for any guidance you might have.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] trac ticket 1944 and pending sends

2009-06-24 Thread Brian W. Barrett

On Wed, 24 Jun 2009, Eugene Loh wrote:


Brian Barrett wrote:

Or go to what I proposed and USE A LINKED LIST!  (as I said before,  not an 
original idea, but one I think has merit)  Then you don't have  to size the 
fifo, because there isn't a fifo.  Limit the number of  send fragments any 
one proc can allocate and the only place memory can  grow without bound is 
the OB1 unexpected list.  Then use SEND_COMPLETE  instead of SEND_NORMAL in 
the collectives without barrier semantics  (bcast, reduce, gather, scatter) 
and you effectively limit how far  ahead any one proc can get to something 
that we can handle, with no  performance hit.


I'm still digesting George's mail and trac comments and responses thereto. 
Meanwhile, a couple of questions here.


First, I think it'd be helpful if you said a few words about how you think a 
linked list should be used here.  I can think of a couple of different ways, 
and I have questions about each way.  Instead of my enumerating these ways 
and those questions, how about you just be more specific?  (We used to grow 
the FIFOs, so sizing them didn't used to be an issue.)


My thought is to essentially implement a good chunk of the Nemesis design 
from MPICH, so reading that paper might give background on where I'm 
coming from.  But if it were me


1) Always limit the number of send fragments that can be allocated to 
something small.  This gives us a concrete upper bound on the size of the 
shared memory region we need to allocate.


2) Rather than a FIFO in which we put offset pointers, which requires a 
large amount of memory (p * num_frags), a linked list moves that bookkeeping 
into the fragment itself - it's two fields in there, plus some 
constant overhead for the LL structure.


3) On insert, either acquire the lock for the LL and insert at the tail of 
the list or use atomic ops to update the tail of the list (the nemesis 
paper talks about the atomic sequence; a rough sketch follows after this 
list).  Because there's no FIFO to fill up, there are no deadlock issues.


4) If, on send, you don't have any send fragments available, as they're a 
constrained resource, drain your incoming queue to collect acks - if you 
don't get any fragments, return failure to the upper layer and let it try 
again.


5) I can see how #4 might cause issues, as the draining of the queue might 
actually result in more send requests.  In this case, I'd be tempted to 
have two linked lists (they're small, after all), one for incoming 
fragments and one for acks.  This wasn't an option with the fifos, due to 
their large size.
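
To make the shape of this concrete, here is a rough sketch of such a 
fragment queue; a real shared-memory version would store offsets rather than 
raw pointers and would use the nemesis-style atomic tail swap instead of a 
lock, and all names here are illustrative only:

#include <pthread.h>
#include <stdio.h>

struct frag {
    struct frag *next;        /* the link lives inside the fragment itself */
    int payload;
};

struct frag_queue {
    pthread_mutex_t lock;
    struct frag *head, *tail;
};

static void queue_push(struct frag_queue *q, struct frag *f)
{
    f->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (NULL != q->tail) q->tail->next = f; else q->head = f;
    q->tail = f;                             /* no fixed capacity to overflow */
    pthread_mutex_unlock(&q->lock);
}

static struct frag *queue_pop(struct frag_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct frag *f = q->head;
    if (NULL != f) {
        q->head = f->next;
        if (NULL == q->head) q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return f;
}

int main(void)
{
    struct frag_queue q = { PTHREAD_MUTEX_INITIALIZER, NULL, NULL };
    struct frag a = { NULL, 1 }, b = { NULL, 2 };
    queue_push(&q, &a);
    queue_push(&q, &b);
    for (struct frag *f; NULL != (f = queue_pop(&q)); ) printf("%d\n", f->payload);
    return 0;
}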


Second, I'm curious how elaborate of a fix I should be trying for here.  Are 
we looking for something to fix the problems at hand, or are we opening the 
door to rearchitecting a big part of the sm BTL?


Well, like Ralph said, I worry about whether we can strap another bandaid 
on and keep it working.  If we can, great.  But George's proposal seems 
like it undoes all the memory savings work you did, and that worries me.


Brian


Re: [OMPI devel] sm BTL flow management

2009-06-25 Thread Brian W. Barrett

All -

Jeff, Eugene, and I had a long discussion this morning on the sm BTL flow 
management issues and came to a couple of conclusions.


* Jeff, Eugene, and I are all convinced that Eugene's addition of polling 
the receive queue to drain acks when sends start backing up is required 
for deadlock avoidance.


* We're also convinced that George's proposal, while a good idea in 
general, is not sufficient.  The send path doesn't appear to sufficiently 
progress the btl to avoid the deadlocks we're seeing with the SM btl 
today.  Therefore, while I still recommend sizing the fifo appropriately 
and limiting the freelist size, I think it's not sufficient to solve all 
problems.


* Finally, it took an hour, but we did determine one of the major 
differences between 1.2.8 and 1.3.0 in terms of sm is how messages were 
pulled off the FIFO.  In 1.2.8 (and all earlier versions), we return from 
btl_progress after a single message is received (ack or message) or the 
fifo was empty.  In 1.3.0 (pre-srq work Eugene did), we changed to 
completely draining all queues before returning from btl_progress.  This 
has led to a situation where a single call to btl_progress can make a 
large number of callbacks into the PML (900,000 times in one of Eugene's 
test cases).  The change was made to resolve an issue Terry was having with 
performance of a benchmark.  We've decided that it would be advantageous 
to try something between the two points and drain X number of messages 
from the queue, then return, where X is 100 or so at most.  This should 
cover the performance issues Terry saw, but still not cause the huge 
number of messages added to the unexpected queue with a single call to 
MPI_Recv.  Since a recv that is matched on the unexpected queue doesn't 
result in a call to opal_progress, this should help balance the load a 
little bit better.  Eugene's going to take a stab at implementing this 
short term.


I think the combination of Euegene's deadlock avoidance fix and the 
careful queue draining should make me comfortable enough to start another 
round of testing, but at least explains the bottom line issues.


Brian


Re: [OMPI devel] sm BTL flow management

2009-06-25 Thread Brian W. Barrett

On Thu, 25 Jun 2009, Eugene Loh wrote:

I spoke with Brian and Jeff about this earlier today.  Presumably, up through 
1.2, mca_btl_component_progress would poll and if it received a message 
fragment would return.  Then, presumably in 1.3.0, behavior was changed to 
keep polling until the FIFO was empty.  Brian said this was based on Terry's 
desire to keep latency as low as possible in benchmarks.  Namely, reaching 
down into a progress call was a long code path.  It would be better to pick 
up multiple messages, if available on the FIFO, and queue extras up in the 
unexpected queue.  Then, a subsequent call could more efficiently find the 
anticipated message fragment.


I don't see how the behavior would impact short-message pingpongs (the 
typical way to measure latency) one way or the other.


I asked Terry, who struggled to remember the issue and pointed me at this 
thread:  http://www.open-mpi.org/community/lists/devel/2008/06/4158.php . 
But that is related to an issue that's solved if one keeps polling as long as 
one gets ACKs (but returns as soon as a real message fragment is found).


Can anyone shed some light on the history here?  Why keep polling even when a 
message fragment has been found?  The downside of polling too aggressively is 
that the unexpected queue can grow (without bounds).


Brian's proposal is to set some variable that determines how many message 
fragments a single mca_btl_sm_component_progress call can drain from the FIFO 
before returning.


I checked, and 1.3.2 definitely drains all messages until the fifo is 
empty.  If we were to switch to drain until we receive a data message and 
that fixes Terry's issue, that seems like a rational change and would not 
require the fix I suggested.  My assumption had been that we needed to 
drain more than one data message per call to component_progress in order 
to work around Terry's issue.  If not, then let's go with the simple fix 
and only drain one data message per entrance to component_progress (but 
drain multiple acks if we have a bunch of acks and then a data message in 
the queue).
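
A toy model of that drain policy (purely illustrative, not the actual sm 
BTL code): keep draining acks, but hand control back after the first data 
fragment, so a single progress call can't flood the unexpected queue.

#include <stdio.h>

enum frag_type { FRAG_EMPTY, FRAG_ACK, FRAG_DATA };

static enum frag_type fifo_pop(void)           /* scripted stand-in for the FIFO */
{
    static enum frag_type script[] = { FRAG_ACK, FRAG_ACK, FRAG_DATA,
                                       FRAG_DATA, FRAG_EMPTY };
    static int i = 0;
    return script[i++];
}

static int sm_progress(void)
{
    int handled = 0;
    for (;;) {
        enum frag_type f = fifo_pop();
        if (FRAG_EMPTY == f) break;            /* queue drained */
        handled++;
        if (FRAG_ACK == f) continue;           /* acks: keep draining */
        /* a data fragment would be handed to the PML callback here ... */
        break;                                 /* ... then return: one per call */
    }
    return handled;
}

int main(void)
{
    printf("fragments handled this pass: %d\n", sm_progress());   /* prints 3 */
    return 0;
}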


Unfortunately I have no more history than what Terry proposed, but it 
looks like the changes were made around that time.


Brian


Re: [OMPI devel] MPI_Accumulate() with MPI_PROC_NULL target rank

2009-07-15 Thread Brian W. Barrett

On Wed, 15 Jul 2009, Lisandro Dalcin wrote:


The MPI 2-1 standard says:

"MPI_PROC_NULL is a valid target rank in the MPI RMA calls
MPI_ACCUMULATE, MPI_GET, and MPI_PUT. The effect is the same as for
MPI_PROC_NULL in MPI point-to-point communication. After any RMA
operation with rank MPI_PROC_NULL, it is still necessary to finish the
RMA epoch with the synchronization method that started the epoch."

Unfortunately, MPI_Accumulate() is not quite the same as
point-to-point, as a reduction is involved. Suppose you make this call
(let me abuse and use keyword arguments):

MPI_Accumulate(..., target_rank=MPI_PROC_NULL,
target_datatype=MPI_BYTE, op=MPI_SUM, ...)

IIUC, the call fails (with MPI_ERR_OP) in Open MPI because MPI_BYTE is
an invalid datatype for MPI_SUM.

But provided that the target rank is MPI_PROC_NULL, would it make
sense for the call to success?


I believe no.  We do full argument error checking (that you provided a 
valid communicator and datatype) on send, receive, put, and get when the 
source/dest is MPI_PROC_NULL.  Therefore, I think it's logical that we 
extend that to include valid operations for accumulate.
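
For reference, a minimal program exercising the case in question; the 
window's error handler is switched to MPI_ERRORS_RETURN so the error code 
can be printed instead of aborting, and per the discussion above Open MPI 
reports MPI_ERR_OP here even though the target is MPI_PROC_NULL:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[4] = {0};
    MPI_Win win;
    int rc;

    MPI_Init(&argc, &argv);
    MPI_Win_create(buf, sizeof(buf), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

    MPI_Win_fence(0, win);
    rc = MPI_Accumulate(buf, 4, MPI_BYTE, MPI_PROC_NULL,
                        0, 4, MPI_BYTE, MPI_SUM, win);
    MPI_Win_fence(0, win);            /* the epoch still has to be closed */

    printf("MPI_Accumulate returned %d (MPI_SUCCESS is %d)\n", rc, MPI_SUCCESS);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}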


Brian


Re: [OMPI devel] autodetect broken

2009-07-22 Thread Brian W. Barrett
The current autodetect implementation seems like the wrong approach to me. 
I'm rather unhappy the base functionality was hacked up like it was 
without any advanced notice or questions about original design intent. 
We seem to have a set of base functions which are now more unreadable than 
before, overly complex, and which leak memory.


The intent of the installdirs framework was to allow this type of 
behavior, but without rehacking all this infer crap into base.  The 
autodetect component should just set $prefix in the set of functions it 
returns (and possibly libdir and bindir if you really want, which might 
actually make sense if you guess wrong), and let the expansion code take 
over from there.  The general thought on how this would work went 
something like:


 - Run after config
  - If you determine you have a special $prefix, set the
    opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and
   set your special prefix.
 - Same with bindir and libdir if needed
 - Let expansion (which runs after all components have had the
   chance to fill in their fields) expand out with your special
   data

And the base stays simple, the components do all the heavy lifting, and 
life is happy.  I would not be opposed to putting in a "find expanded 
part" type function that takes two strings like "${prefix}/lib" and 
"/usr/local/lib" and returns "/usr/local" being added to the base so that 
other autodetect-style components don't need to handle such a case, but 
that's about the extent of the base changes I think are appropriate.
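
For concreteness, a minimal sketch of such a helper, assuming the template 
holds a single leading "${...}" variable; the function name is made up and 
not an existing opal API:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* given "${prefix}/lib" and "/usr/local/lib", return "/usr/local" */
static char *find_expanded_part(const char *templ, const char *expanded)
{
    const char *suffix = strchr(templ, '}');       /* end of "${prefix}" */
    size_t slen, elen, plen;
    char *prefix;

    if (NULL == suffix) return NULL;
    suffix++;                                      /* fixed part, e.g. "/lib" */
    slen = strlen(suffix);
    elen = strlen(expanded);
    if (elen < slen || 0 != strcmp(expanded + elen - slen, suffix)) {
        return NULL;                               /* expanded path doesn't match */
    }
    plen = elen - slen;
    prefix = malloc(plen + 1);
    if (NULL == prefix) return NULL;
    memcpy(prefix, expanded, plen);
    prefix[plen] = '\0';
    return prefix;                                 /* what ${prefix} must have been */
}

int main(void)
{
    char *prefix = find_expanded_part("${prefix}/lib", "/usr/local/lib");
    printf("%s\n", NULL != prefix ? prefix : "(no match)");    /* /usr/local */
    free(prefix);
    return 0;
}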


Finally, a first quick code review reveals a couple of problems:

 - We don't AC_SUBST variables adding .lo files to build sources in
   OMPI.  Instead, we use AM_CONDITIONALS to add sources as needed.
 - Obviously, there's a problem with the _happy variable name
   consistency in configure.m4
 - There's a naming convention issue - files should all start with
   opal_installdirs_autodetect_, and a number of the added files
   do not.
 - From a finding code standpoint, I'd rather walkcontext.c and
   backtrace.c be one file with #ifs - for such short functions,
   it makes it more obvious that it's just portability implementations
   of the same function.

I'd be happy to discuss the issues further or review any code before it 
gets committed.  But I think the changes as they exist today (even with 
bugs fixed) aren't consistent with what the installdirs framework was 
trying to accomplish and should be reworked.


Brian


Re: [OMPI devel] RFC: meaning of "btl_XXX_eager_limit"

2009-07-23 Thread Brian W. Barrett

On Thu, 23 Jul 2009, Jeff Squyres wrote:


  There are two solutions I can think of.  Which should we do?

  a. Pass the (max?) PML header size down into the BTL during
 initialization such that the the btl_XXX_eager_limit can
 represent the max MPI data payload size (i.e., the BTL can size
 its buffers to accommodate its desired max eager payload size,
 its header size, and the PML header size).  Thus, the
 eager_limit can truly be the MPI data payload size -- and easy
 to explain to users.


This will not work.  Remember, the PML IS NOT THE ONLY USER OF THE BTLS. 
I'm really getting sick of saying this, but it's true.  There can be no 
PML knowledge in the BTL, even if it's something simple like a header 
size.  And since PML headers change depending on the size and type of 
message, this seems like a really stupid parameter to publish to the user.



  b. Stay with the current btl_XXX_eager_limit implementation (which
 OMPI has had for a long, long time) and add the code to check
 for btl_eager_limit less than the pml header size (per this past
 Tuesday's discussion).  This is the minimal distance change.


Since there's already code in Terry's hands to do this, I vote for b.


2. OMPI currently does not publish enough information for a user to
  set eager_limit to be able to do BTL traffic shaping.  That is, one
  really needs to know the (max) BTL header length and the (max) PML
  header length values to be able to calculate the correct
  eager_limit force a specific (max) BTL wire fragment size.  Our
  proposed solution is to have ompi_info print out the (max) PML and
  BTL header sizes.  Regardless of whether 1a) or 1b) is chosen, with
  these two pieces of information, a determined network administrator
  could calculate the max wire fragment size used by OMPI, and
  therefore be able to do at least some of traffic shaping.


Actually, there's no need to know the PML header size to shape traffic. 
There's only need to know the BTL header, and I wouldn't be opposed to 
changing the behavior so that the BTL eager limit parameter included the 
btl header size (because the PML header is not a factor in determining 
size of individual eager packets).  It seems idiotic, but whatever - you 
should care more about the data size the user is sending than about the MTU 
size.  Sending multiple MTUs should have little performance impact on a 
network that doesn't suck, and we shouldn't be doing all kinds of hacks to 
support networks whose designers can't figure out which way is up.


Again, since there are multiple consumers of the BTLs, allowing network 
designers to screw around with defaults to try and get what they want 
(even when it isn't what they actually want) seems stupid.  But as long as 
you don't do 1a, I won't object to uselessness contained in ompi_info.


Brian


Re: [OMPI devel] libtool issue with crs/self

2009-07-29 Thread Brian W. Barrett
What are you trying to do with lt_dlopen?  It seems like you should always 
go through the MCA base utilities.  If one's missing, adding it there 
seems like the right mechanism.


Brian

On Wed, 29 Jul 2009, Josh Hursey wrote:

George suggested that to me as well yesterday after the meeting. So we would 
create opal interfaces to libtool (similar to what we do with the event 
engine). That might be the best way to approach this.


I'll start to take a look at implementing this. Since opal/libltdl is not 
part of the repository, is there a 'right' place to put this header? maybe in 
opal/util/?


Thanks,
Josh


On Jul 28, 2009, at 6:57 PM, Jeff Squyres (jsquyres) wrote:

Josh - this is almost certainly what happened. Yoibks. Unfortunately, 
there's good reasons for it. :(


What about if we proxy calls to lt_dlopen through an opal function call?

-jms
Sent from my PDA.  No type good.

- Original Message -
From: devel-boun...@open-mpi.org 
To: Open MPI Developers 
Sent: Tue Jul 28 16:39:42 2009
Subject: Re: [OMPI devel] libtool issue with crs/self

It was mentioned to me that r21731 might have caused this problem by
restricting the visibility of the libltdl library.
  https://svn.open-mpi.org/trac/ompi/changeset/21731

Brian,
Do you have any thoughts on how we might extend the visibility so that
MCA components could also use the libtool in opal?
I can try to initialize libtool in the Self CRS component and use it
directly, but since it is already opened by OPAL, I think it might be
better to use the instantiation in OPAL.

Cheers,
Josh

On Jul 28, 2009, at 3:06 PM, Josh Hursey wrote:


Once upon a time, the Self CRS module worked correctly, but I admit
that I have not tested it in a long time.

The Self CRS component uses dl_open and friends to inspect the
running process for a particular set of functions. When I try to run
an MPI program that contains these signatures I get the following
error when it tries to resolve lt_dlopen() in
opal_crs_self_component_query():
--
my-app: symbol lookup error: /path/to/install/lib/openmpi/
mca_crs_self.so: undefined symbol: lt_dlopen
--

I am configuring with the following:
--
./configure --prefix=/path/to/install \
 --enable-binaries \
 --with-devel-headers \
 --enable-debug \
 --enable-mpi-threads \
 --with-ft=cr \
 --without-memory-manager \
 --enable-ft-thread \
 CC=gcc CXX=g++ \
 F77=gfortran FC=gfortran
--

The source code is at the link below:
 https://svn.open-mpi.org/trac/ompi/browser/trunk/opal/mca/crs/self


Does anyone have any thoughts on what might be going wrong here?

Thanks,
Josh

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] Shared library versioning

2009-07-29 Thread Brian W. Barrett

On Wed, 29 Jul 2009, Jeff Squyres wrote:


On Jul 28, 2009, at 1:56 PM, Ralf Wildenhues wrote:


- support files are not versioned (e.g., show_help text files)
- include files are not versioned (e.g., mpi.h)
- OMPI's DSOs actually are versioned, but more work would be needed
in this area to make that versioning scheme work in real world
scenarios
- ...and probably some other things that I'm not thinking of...


You can probably solve most of these issues by just versioning the
directory names where you put the files; and with some luck, some
downstream distribution can achieve this by merely passing a bunch of
--foodir=... options to configure.


This is probably true -- we do obey all the Autoconf-specified directories, 
so overriding --foodir= should work.  It may break things like mpirun 
--prefix behavior, though.  But I think that the executables would be 
problematic -- you'd only have 1 mpirun, orted, etc.  OMPI does *not* 
currently handle the Autoconf --program-* configure options properly.  I 
confess to not recalling the specific issues, but I recall we had long 
discussions about them -- the issues are quite tangled and complicated.  And 
I remember coming to the conclusion "not worth supporting those."


FWIW, Chris is probably right that it's far easier to simply install 
different OMPI versions into different $prefix trees (IMHO).


Agreed.  I was looking at the versioning of shared libraries not as a way 
to allow multiple installs in the same prefix, but to allow users to know 
when it was time to recompile their application.


Brian


Re: [OMPI devel] libtool issue with crs/self

2009-07-29 Thread Brian W. Barrett
Never mind, I'm an idiot.  I still don't like the wrappers around 
lt_dlopen in util, but it might be your best option.  Are you looking for 
symbols in components or the executable?  I assumed the executable, in 
which case you might be better off just using dlsym() directly.  If you're 
looking for a symbol first place it's found, then you can just do:


  dlsym(RTLD_DEFAULT, symbol);

The lt_dlsym only really helps if you're running on really obscure 
platforms which don't support dlsym and loading "preloaded" components.


Brian

On Wed, 29 Jul 2009, Brian W. Barrett wrote:

What are you trying to do with lt_dlopen?  It seems like you should always go 
through the MCA base utilities.  If one's missing, adding it there seems like 
the right mechanism.


Brian

On Wed, 29 Jul 2009, Josh Hursey wrote:

George suggested that to me as well yesterday after the meeting. So we 
would create opal interfaces to libtool (similar to what we do with the 
event engine). That might be the best way to approach this.


I'll start to take a look at implementing this. Since opal/libltdl is not 
part of the repository, is there a 'right' place to put this header? maybe 
in opal/util/?


Thanks,
Josh


On Jul 28, 2009, at 6:57 PM, Jeff Squyres (jsquyres) wrote:

Josh - this is almost certainly what happened. Yoibks. Unfortunately, 
there's good reasons for it. :(


What about if we proxy calls to lt_dlopen through an opal function call?

-jms
Sent from my PDA.  No type good.

- Original Message -
From: devel-boun...@open-mpi.org 
To: Open MPI Developers 
Sent: Tue Jul 28 16:39:42 2009
Subject: Re: [OMPI devel] libtool issue with crs/self

It was mentioned to me that r21731 might have caused this problem by
restricting the visibility of the libltdl library.
  https://svn.open-mpi.org/trac/ompi/changeset/21731

Brian,
Do you have any thoughts on how we might extend the visibility so that
MCA components could also use the libtool in opal?
I can try to initialize libtool in the Self CRS component and use it
directly, but since it is already opened by OPAL, I think it might be
better to use the instantiation in OPAL.

Cheers,
Josh

On Jul 28, 2009, at 3:06 PM, Josh Hursey wrote:


Once upon a time, the Self CRS module worked correctly, but I admit
that I have not tested it in a long time.

The Self CRS component uses dl_open and friends to inspect the
running process for a particular set of functions. When I try to run
an MPI program that contains these signatures I get the following
error when it tries to resolve lt_dlopen() in
opal_crs_self_component_query():
--
my-app: symbol lookup error: /path/to/install/lib/openmpi/
mca_crs_self.so: undefined symbol: lt_dlopen
--

I am configuring with the following:
--
./configure --prefix=/path/to/install \
 --enable-binaries \
 --with-devel-headers \
 --enable-debug \
 --enable-mpi-threads \
 --with-ft=cr \
 --without-memory-manager \
 --enable-ft-thread \
 CC=gcc CXX=g++ \
 F77=gfortran FC=gfortran
--

The source code is at the link below:
 https://svn.open-mpi.org/trac/ompi/browser/trunk/opal/mca/crs/self


Does anyone have any thoughts on what might be going wrong here?

Thanks,
Josh

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Brian W. Barrett

On Sun, 2 Aug 2009, Ralph Castain wrote:

Perhaps a bigger question needs to be addressed - namely, does the ob1 code 
need to be refactored?


Having been involved a little in the early discussion with bull when we 
debated over where to put this, I know the primary concern was that the code 
not suffer the same fate as the dr module. We have since run into a similar 
issue with the checksum module, so I know where they are coming from.


The problem is that the code base is adjusted to support changes in ob1, 
which is still being debugged. On the order of 95% of the code in ob1 is 
required to be common across all the pml modules, so the rest of us have to 
(a) watch carefully all the commits to see if someone touches ob1, and then 
(b) manually mirror the change in our modules.


This is not a supportable model over the long-term, which is why dr has died, 
and checksum is considering integrating into ob1 using configure #if's to 
avoid impacting non-checksum users. Likewise, device failover has been 
treated similarly here - i.e., configure out the added code unless someone 
wants it.


This -does- lead to messier source code with these #if's in it. If we can 
refactor the ob1 code so the common functionality resides in the base, then 
perhaps we can avoid this problem.


Is it possible?


I think Ralph raises a good point - we need to think about how to allow 
better use of OB1's code base between consumers like checksum and 
failover.  The current situation is problematic to me, for the reasons 
Ralph cited.  However, since the ob1 structures and code have little use 
for PMLs such as CM, I'd rather not push the code into the base - in the 
end, it's very specific to a particular PML implementation and the code 
pushed into the base already made things much more interesting in 
implementing CM than I would have liked.  DR is different in this 
conversation, as it was almost entirely a separate implementation from ob1 
by the end, due to the removal of many features and the addition of many 
others.


However, I think there's middle ground here which could greatly improve 
the current situation.  With the proper refactoring, there's no technical 
reason why we couldn't move the checksum functionality into ob1 and add 
the failover to ob1, with no impact on performance when the functionality 
isn't used and little impact on code readability.


So, in summary, refactor OB1 to support checksum / failover good, pushing 
ob1 code into base bad.


Brian


Re: [OMPI devel] libtool issue with crs/self

2009-08-05 Thread Brian W. Barrett

Josh -

Just in case it wasn't clear -- if you're only looking for a symbol in the 
executable (which you know is there), you do *NOT* have to dlopen() the 
executable first (you do with libtool to support the "i don't have dynamic 
library support" mode of operatoin).  You only have to dlsym() with 
RTLD_DEFAULT, as the symbol is already in the process space.


It does probably mean we can't support self on platforms without dlsym(), 
but that set is extremely small and since we don't use libtool to link the 
final executable, the lt_dlsym wrappers wouldn't have worked anyway.


Brian

On Wed, 5 Aug 2009, George Bosilca wrote:


Josh,

These look like two different issues to me. One is how some modules from Open 
MPI can use the libltdl, and for this you highlighted the issue. The second 
is that the users who want to use the self CRS have to make sure the symbols 
required by self CRS are visible in their application. This is clearly an 
item for the FAQ.


george.

On Aug 5, 2009, at 10:51 , Josh Hursey wrote:

As an update on this thread. I had a bit of time this morning to look into 
this.


I noticed that the "-fvisibility=hidden" option when passed to libltdl will 
cause it to fail in its configure test for:

"checking whether a program can dlopen itself"
This is because the symbol they are trying to look for with dlsym() is not 
postfixed with:

__attribute__ ((visibility("default")))
If I do that, then the test passes correctly.
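
To illustrate what that annotation buys you (the symbol names below are 
made up, this is just a compile-only sketch): when a translation unit is 
built with -fvisibility=hidden, only the symbols explicitly marked with 
default visibility stay visible to dlsym().

/* build with something like: gcc -fvisibility=hidden -fPIC -shared ... */
#if defined(__GNUC__)
#define CRS_VISIBLE __attribute__ ((visibility("default")))
#else
#define CRS_VISIBLE
#endif

/* still findable via dlsym() even under -fvisibility=hidden */
CRS_VISIBLE int my_user_checkpoint_callback(void) { return 0; }

/* hidden by the flag; a dlsym() lookup on this name would return NULL */
int helper_only_used_internally(void) { return 0; }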

I am not sure if this is a configure bug in Libtool or not. But what it 
means is that even with the wrapper around the OPAL libltdl routines, it is 
not useful to me since I need to open the executable to examine it for the 
necessary symbols.


So I might try to go down the track of using dlopen/dlsym/dlclose directly 
instead of through the libtool interfaces. However I just wanted to mention 
that this is happening in case there are other places in the codebase that 
ever want to look into the executable for symbols, and find that 
lt_dlopen() fails in non-obvious ways.


-- Josh

On Jul 29, 2009, at 11:01 AM, Brian W. Barrett wrote:

Never mind, I'm an idiot.  I still don't like the wrappers around 
lt_dlopen in util, but it might be your best option.  Are you looking for 
symbols in components or the executable?  I assumed the executable, in 
which case you might be better off just using dlsym() directly.  If you're 
looking for a symbol first place it's found, then you can just do:


dlsym(RTLD_DEFAULT, symbol);

The lt_dlsym only really helps if you're running on really obscure 
platforms which don't support dlsym and loading "preloaded" components.


Brian

On Wed, 29 Jul 2009, Brian W. Barrett wrote:

What are you trying to do with lt_dlopen?  It seems like you should 
always go through the MCA base utilities.  If one's missing, adding it 
there seems like the right mechanism.


Brian

On Wed, 29 Jul 2009, Josh Hursey wrote:

George suggested that to me as well yesterday after the meeting. So we 
would create opal interfaces to libtool (similar to what we do with the 
event engine). That might be the best way to approach this.
I'll start to take a look at implementing this. Since opal/libltdl is 
not part of the repository, is there a 'right' place to put this header? 
maybe in opal/util/?

Thanks,
Josh
On Jul 28, 2009, at 6:57 PM, Jeff Squyres (jsquyres) wrote:
Josh - this is almost certainly what happened. Yoibks. Unfortunately, 
there's good reasons for it. :(
What about if we proxy calls to lt_dlopen through an opal function 
call?

-jms
Sent from my PDA.  No type good.
- Original Message -
From: devel-boun...@open-mpi.org 
To: Open MPI Developers 
Sent: Tue Jul 28 16:39:42 2009
Subject: Re: [OMPI devel] libtool issue with crs/self
It was mentioned to me that r21731 might have caused this problem by
restricting the visibility of the libltdl library.
https://svn.open-mpi.org/trac/ompi/changeset/21731
Brian,
Do you have any thoughts on how we might extend the visibility so that
MCA components could also use the libtool in opal?
I can try to initialize libtool in the Self CRS component and use it
directly, but since it is already opened by OPAL, I think it might be
better to use the instantiation in OPAL.
Cheers,
Josh
On Jul 28, 2009, at 3:06 PM, Josh Hursey wrote:

Once upon a time, the Self CRS module worked correctly, but I admit
that I have not tested it in a long time.
The Self CRS component uses dl_open and friends to inspect the
running process for a particular set of functions. When I try to run
an MPI program that contains these signatures I get the following
error when it tries to resolve lt_dlopen() in
opal_crs_self_component_query():
--
my-app: symbol lookup error: /path/to/install/lib/openmpi/
mca_crs_self.so: undefined symbol: lt_dlopen
--
I am configuring with the followi

Re: [OMPI devel] libtool issue with crs/self

2009-08-05 Thread Brian W. Barrett

On Wed, 5 Aug 2009, Josh Hursey wrote:


On Aug 5, 2009, at 11:35 AM, Brian W. Barrett wrote:


Josh -

Just in case it wasn't clear -- if you're only looking for a symbol in the 
executable (which you know is there), you do *NOT* have to dlopen() the 
executable first (you do with libtool to support the "i don't have dynamic 
library support" mode of operatoin).  You only have to dlsym() with 
RTLD_DEFAULT, as the symbol is already in the process space.


So is it wrong to dlopen() before dlsym()? The patch I just committed in 
r21766 does this, since I was following the man page for dlopen() to make 
sure I was using it correctly.


I don't know that it's "wrong", it's just not necessary.  I believe that:

  handle = dlopen(NULL, RTLD_LOCAL|RTLD_LAZY);
  sym = dlsym(handle, "foo");
  dlclose(handle);

and

  sym = dlsym(RTLD_DEFAULT, "foo");

are functionally equivalent, but the second one means no handle to pass 
around :).
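
A complete little example of the second form, looking up a symbol that is 
guaranteed to already be in the process image (the crs self component would 
look up its user-supplied callback names the same way; on glibc you need 
_GNU_SOURCE for RTLD_DEFAULT, and older systems may need -ldl):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* no dlopen()/dlclose() pair needed: search the whole process image */
    void *sym = dlsym(RTLD_DEFAULT, "printf");
    printf("printf %s found via RTLD_DEFAULT\n", NULL != sym ? "was" : "was not");
    return 0;
}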


Brian


Re: [OMPI devel] RFC: PML/CM priority

2009-08-11 Thread Brian W. Barrett

On Tue, 11 Aug 2009, Rainer Keller wrote:


When compiling on systems with MX or Portals, we offer MTLs and BTLs.
If MTLs are used, the PML/CM is loaded as well as the PML/OB1.


Question 1: Is favoring OB1 over CM required for any MTL (MX, Portals, PSM)?


George has in the past had strong feelings on this issue, believing that 
for MX, OB1 is prefered over CM.  For Portals, it's probably in the noise, 
but the BTL had been better tested than the MTL, so it was left as the 
default.  Obviously, PSM is a much better choice on InfiniPath than 
straight OFED, hence the odd priority bump.


At this point, I would have no objection to making CM's priority higher 
for Portals.



Question 2: If it is, I would like to reflect this in the default priorities,
aka have CM have a priority lower than OB1 and in the case of PSM raising it.


I don't have strong feelings on this one.

Brian


Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-05-26 Thread Brian W. Barrett

On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote:


You can see this by slightly modifying your test command -- run "env"
instead of "hostname".  You'll see that the environment variable
OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on
the mpirun command line, regardless of a) whether you're oversubscribing
or not, and b) whatever is passed in through the orted.


While Jeff is correct that the parameter informing the MPI process that it 
should idle when it's not busy is correctly set, it turns out that we are 
ignoring this parameter inside the MPI process.  I'm looking into this and 
hope to have a fix this afternoon.


Brian



Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-05-26 Thread Brian W. Barrett

On Fri, 26 May 2006, Brian W. Barrett wrote:


On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote:


You can see this by slightly modifying your test command -- run "env"
instead of "hostname".  You'll see that the environment variable
OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on
the mpirun command line, regardless of a) whether you're oversubscribing
or not, and b) whatever is passed in through the orted.


While Jeff is correct that the parameter informing the MPI process that it
should idle when it's not busy is correctly set, it turns out that we are
ignoring this parameter inside the MPI process.  I'm looking into this and
hope to have a fix this afternoon.


Mea culpa.  Jeff's right that in a normal application, we are setting up 
to call sched_yield() when idle if the user sets mpi_yield_when_idle to 1, 
regardless of what is in the hostfile.  The problem with my test case was 
that for various reasons, my test code was never actually "idling" - there 
were always things moving along, so our progress engine was deciding that 
the process should not be idled.
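
The mechanism itself is tiny - roughly this shape (an assumed sketch, not 
the actual opal_progress() code): if the user asked for yielding and a pass 
through the progress engine completed no events, give the CPU up so an 
oversubscribed peer can run.

#include <sched.h>

static int yield_when_idle = 1;        /* would come from mpi_yield_when_idle */

static void progress_pass_done(int events_completed)
{
    if (yield_when_idle && 0 == events_completed) {
        sched_yield();                 /* idle pass: let other processes run */
    }
}

int main(void)
{
    progress_pass_done(0);             /* simulate an idle pass */
    return 0;
}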


Can you share your test code at all?  I'm wondering if something similar 
is happening with your code.  It doesn't sound like it should be "always 
working", but I'm wondering if you're triggering some corner case we 
haven't thought of.


Brian

--
  Brian Barrett
  Graduate Student, Open Systems Lab, Indiana University
  http://www.osl.iu.edu/~brbarret/


Re: [OMPI devel] memory_malloc_hooks.c and dlclose()

2006-05-30 Thread Brian W. Barrett

On Mon, 22 May 2006, Neil Ludban wrote:


I'm getting a core dump when using openmpi-1.0.2 with the MPI extensions
we're developing for the MATLAB interpreter.  This same build of openmpi
is working great with C programs and our extensions for gnu octave.  The
machine is AMD64 running Linux:

Linux kodos 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 
x86_64 GNU/Linux

I believe there's a bug in that opal_memory_malloc_hooks_init() links
itself into the __free_hook chain during initialization, but then it
never unlinks itself at shutdown.  In the interpreter environment,
libopal.so is dlclose()d and unmapped from memory long before the
interpreter is done with dynamic memory.  A quick check of the nightly
trunk snapshot reveals some function name changes, but no new shutdown
code.


Can you try the attached patch and see if it solves your problem?  I think 
it will, but I don't have a great way of testing your exact situation.


Thanks,

Brian

--
  Brian Barrett
  Graduate Student, Open Systems Lab, Indiana University
  http://www.osl.iu.edu/~brbarret/

Index: opal/mca/memory/malloc_hooks/memory_malloc_hooks.c
===
--- opal/mca/memory/malloc_hooks/memory_malloc_hooks.c  (revision 10123)
+++ opal/mca/memory/malloc_hooks/memory_malloc_hooks.c  (working copy)
@@ -27,6 +27,7 @@
  
 /* Prototypes for our hooks.  */
 void opal_memory_malloc_hooks_init(void);
+void opal_memory_malloc_hooks_finalize(void);
 static void opal_mem_free_free_hook (void*, const void *);
 static void* opal_mem_free_realloc_hook (void*, size_t, const void *);
  
@@ -60,6 +61,18 @@
 }
 
 
+void
+opal_memory_malloc_hooks_finalize(void)
+{
+if (initialized == 0) {
+return;
+}
+
+__free_hook = old_free_hook;
+__realloc_hook = old_realloc_hook;
+initialized = 0;
+}
+
 static void
 opal_mem_free_free_hook (void *ptr, const void *caller)
 {
Index: opal/mca/memory/malloc_hooks/memory_malloc_hooks_component.c
===
--- opal/mca/memory/malloc_hooks/memory_malloc_hooks_component.c
(revision 10123)
+++ opal/mca/memory/malloc_hooks/memory_malloc_hooks_component.c
(working copy)
@@ -22,8 +22,10 @@
 #include "opal/include/constants.h"
 
 extern void opal_memory_malloc_hooks_init(void);
+extern void opal_memory_malloc_hooks_finalize(void);
 
 static int opal_memory_malloc_open(void);
+static int opal_memory_malloc_close(void);
 
 const opal_memory_base_component_1_0_0_t mca_memory_malloc_hooks_component = {
 /* First, the mca_component_t struct containing meta information
@@ -41,7 +43,7 @@
 
 /* Component open and close functions */
 opal_memory_malloc_open,
-NULL
+opal_memory_malloc_close
 },
 
 /* Next the MCA v1.0.0 component meta data */
@@ -58,3 +60,10 @@
 opal_memory_malloc_hooks_init();
 return OPAL_SUCCESS;
 }
+
+static int
+opal_memory_malloc_close(void)
+{
+opal_memory_malloc_hooks_finalize();
+return OPAL_SUCCESS;
+}


Re: [OMPI devel] configure & Fortran problem

2006-10-06 Thread Brian W. Barrett
Before you go off and file a bug, this is not an Open MPI issue, but a 
windows / autoconf issue.  Please don't file a bug on this, or I'm just 
going to have to close it as notabug...


Brian

On Fri, 6 Oct 2006, Jeff Squyres wrote:


Oops.  That's a bug.  I'll file a ticket.


On 10/5/06 12:51 PM, "George Bosilca"  wrote:


I have a problem with configure if no fortran compilers are detected. It
stops with the following error:

configure: error: Cannot support Fortran MPI_ADDRESS_KIND!

As there are no F77 or F90 compilers installed on this machine, it makes
sense to not be able to support MPI_ADDRESS_KIND ... but as there are no
fortran compilers we should not care about it. I tried to manually disable all
kinds of fortran support but the error is always the same.

   Any clues ?

   Thanks,
 george.


"We must accept finite disappointment, but we must never lose infinite
hope."
   Martin Luther King
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






--
  Brian Barrett
  Graduate Student, Open Systems Lab, Indiana University
  http://www.osl.iu.edu/~brbarret/


[OMPI devel] Shared memory file changes

2006-10-11 Thread Brian W. Barrett

Hi all -

A couple of weeks ago, I committed some changes to the trunk that greatly 
reduced the size of the shared memory file for small numbers of processes. 
I haven't heard any complaints (the non-blocking send/receive issue is at 
proc counts greater than the size this patch affected).  Anyone object to 
moving this to the v1.2 branch (with reviews, of course)?


Brian

--
  Brian Barrett
  Graduate Student, Open Systems Lab, Indiana University
  http://www.osl.iu.edu/~brbarret/


[OMPI devel] configure changes tonight

2006-10-12 Thread Brian W. Barrett

Hi all -

There will be three configure changes committed to the trunk tonight:

  - Some cleanups resulting from the update to the wrapper
compilers for 32/64 bit support
  - A new configure option to deal with some fixes for the
MPI::SEEK_SET (and friends) issue
  - Some cleanups in the pthreads configure tests

The only real effect for everyone should be that you'll have to 
re-autogen.sh, and that the 32/64 include and libdir flags will no longer 
be available.  I will be updating the wiki shortly w.r.t. how to build a 
multilib wrapper compiler.


Brian

--
  Brian Barrett
  Graduate Student, Open Systems Lab, Indiana University
  http://www.osl.iu.edu/~brbarret/


Re: [OMPI devel] help config.status to not mess up substitutions

2006-10-23 Thread Brian W. Barrett

Thanks, I'll apply ASAP.

Brian

On Mon, 23 Oct 2006, Ralf Wildenhues wrote:


Please apply this robustness patch, which helps to avoid accidental
unwanted substitutions done by config.status.  From all I can tell,
they do not happen now, but first the Autoconf manual warns against
them, second they make some config.status optimizations so much more
difficult than necessary.  :-)

In unrelated news, I tested Automake 1.10 with OpenMPI, and it saves
about 15s of config.status time, and about half a minute of `make dist'
time on my system.  Some pending Fortran changes have only made it into
Automake after 1.10 was released.

Cheers,
Ralf

2006-10-23  Ralf Wildenhues  

* opal/tools/wrappers/Makefile.am: Protect manual substitutions
   from config.status.
* ompi/tools/wrappers/Makefile.am: Likewise.
* orte/tools/wrappers/Makefile.am: Likewise.

Index: opal/tools/wrappers/Makefile.am
===
--- opal/tools/wrappers/Makefile.am (revision 12254)
+++ opal/tools/wrappers/Makefile.am (working copy)
@@ -76,8 +76,8 @@

opalcc.1: opal_wrapper.1
rm -f opalcc.1
-   sed -e 's/@COMMAND@/opalcc/g' -e 's/@PROJECT@/Open PAL/g' -e 
's/@PROJECT_SHORT@/OPAL/g' -e 's/@LANGUAGE@/C/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalcc.1
+   sed -e 's/[@]COMMAND[@]/opalcc/g' -e 's/[@]PROJECT[@]/Open PAL/g' -e 
's/[@]PROJECT_SHORT[@]/OPAL/g' -e 's/[@]LANGUAGE[@]/C/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalcc.1

opalc++.1: opal_wrapper.1
rm -f opalc++.1
-   sed -e 's/@COMMAND@/opalc++/g' -e 's/@PROJECT@/Open PAL/g' -e 
's/@PROJECT_SHORT@/OPAL/g' -e 's/@LANGUAGE@/C++/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalc++.1
+   sed -e 's/[@]COMMAND[@]/opalc++/g' -e 's/[@]PROJECT[@]/Open PAL/g' -e 
's/[@]PROJECT_SHORT[@]/OPAL/g' -e 's/[@]LANGUAGE[@]/C++/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalc++.1
Index: ompi/tools/wrappers/Makefile.am
===
--- ompi/tools/wrappers/Makefile.am (revision 12254)
+++ ompi/tools/wrappers/Makefile.am (working copy)
@@ -84,20 +84,20 @@

mpicc.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1
rm -f mpicc.1
-   sed -e 's/@COMMAND@/mpicc/g' -e 's/@PROJECT@/Open MPI/g' -e 
's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/C/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicc.1
+   sed -e 's/[@]COMMAND[@]/mpicc/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 
's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/C/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicc.1

mpic++.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1
rm -f mpic++.1
-   sed -e 's/@COMMAND@/mpic++/g' -e 's/@PROJECT@/Open MPI/g' -e 
's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/C++/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpic++.1
+   sed -e 's/[@]COMMAND[@]/mpic++/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 
's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/C++/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpic++.1

mpicxx.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1
rm -f mpicxx.1
-   sed -e 's/@COMMAND@/mpicxx/g' -e 's/@PROJECT@/Open MPI/g' -e 
's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/C++/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicxx.1
+   sed -e 's/[@]COMMAND[@]/mpicxx/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 
's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/C++/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicxx.1

mpif77.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1
rm -f mpif77.1
-   sed -e 's/@COMMAND@/mpif77/g' -e 's/@PROJECT@/Open MPI/g' -e 
's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/Fortran 77/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif77.1
+   sed -e 's/[@]COMMAND[@]/mpif77/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 
's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/Fortran 77/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif77.1

mpif90.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1
rm -f mpif90.1
-   sed -e 's/@COMMAND@/mpif90/g' -e 's/@PROJECT@/Open MPI/g' -e 
's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/Fortran 90/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif90.1
+   sed -e 's/[@]COMMAND[@]/mpif90/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 
's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/Fortran 90/g' < 
$(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif90.1
Index: orte/tools/wrappers/Makefile.am
===
--- orte/tools/wrappers/Makefile.am (revision 12254)
+++ orte/tools/wrappers/Makefile.am (working copy)
@@ -51,8 +51,8 @@

ortecc.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1
rm -f ortecc.1
-   sed -e 's/@COMMAND@/ortecc/g' -e 's/@PROJECT@/OpenRTE/g' -e 
's/@PROJECT_SHORT@/ORTE/g' -e

Re: [OMPI devel] New oob/tcp?

2006-10-25 Thread Brian W. Barrett
The create_listen_thread code should be on both the trunk and v1.2 branch 
right now.  You are correct that the heterogeneous fixes haven't moved 
just yet, because they aren't quite right.  Hope to have that fixed in the 
near future...


brian

On Wed, 25 Oct 2006, Ralph H Castain wrote:


There are a number of things in the trunk that haven't been moved over to
1.2 branch yet. They are coming shortly, though...once the merge is done,
you might get a few more conflicts, but it shouldn't be too bad.


On 10/25/06 7:06 AM, "Adrian Knoth"  wrote:


On Wed, Oct 25, 2006 at 02:48:33PM +0200, Adrian Knoth wrote:


I don't see any new component, Adrian. There have been a few updates to the
existing component, some of which might cause conflicts with the merge, but
those shouldn't be too hard to resolve.

Ok, I just saw something with "create_listen_thread" and so on, but
didn't look closer.


The "new" (current) oob/tcp (in the v1.2 branch) does not have Brian's
fix for #493. (the following constant is missing, the code, too)

   MCA_OOB_TCP_ADDR_TYPE_AFINET

There are probably more differences...

If you want, I can do the merge and we'll use my IPv6 oob with
all the patches up to r12050.




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
  Brian Barrett
  Graduate Student, Open Systems Lab, Indiana University
  http://www.osl.iu.edu/~brbarret/


Re: [OMPI devel] Building OpenMPI on windows

2006-11-21 Thread Brian W Barrett
At one point, a long time ago (before anyone started working on the  
native windows port), I had unpatched OMPI tarballs building on  
Cygwin, using Cygwin's gcc.  Which I believe is all Greg and Beth  
want to do for now.  But I believe that the recent code to support  
Windows natively has caused some issues with our configure script  
when trying to run in that mode.


Brian

On Nov 18, 2006, at 12:39 PM, George Bosilca wrote:


I'm impressed that it works with cygwin out of the box. Last time I
tried, I had to patch the libtool, do some manual modifications of
the configure script (of course after altering some of the .m4
files). It worked, I was able to run a simple ping-pong program, but
it took me something like 4 hours to compile.

I'm out of office for the next week. I can give a try to the whole
cygwin/SFU once I get back.

   Thanks,
 george.

On Nov 18, 2006, at 9:22 AM, Jeff Squyres wrote:


I don't know if we're tried cygwin for a long, long time...  My gut
reaction is that it "should work" (the wrappers are pretty simple C),
but I don't have any cygwin resources to test / fix this.  :-(

George -- got any insight?


On Nov 16, 2006, at 4:44 PM, Ralph Castain wrote:


I'm not sure about running under cygwin at this stage - I have
compiled the
code base there before as you did, but never tried to run anything
in that
environment.

However, I believe 1.2 will operate under Windows itself. Of
course, that
means using the Windows compilers...but if you have those, you
should be
able to run.

I'll have to defer to my colleagues who wrote those wrapper
compilers as to
why cygwin might be taking offense. They are all at the
Supercomputing Expo
this week, so response may be a little delayed.

Ralph


On 11/16/06 1:54 PM, "Beth Tibbitts"  wrote:



I'm trying to build OpenMPI on windows with cygwin, to at least be
able to
demo the Eclipse PTP(Parallel Tools Platform)
on my laptop.

I configured OpenMPI version 1.2 (openmpi-1.2b1) with the following
command:
./configure --with-devel-headers  --enable-mca-no-build=timer-
windows

then did make all and make install, which all seemed to finish ok
When i try to compile a small test mpi program I get a segfault

$ mpicc mpitest.c
Signal:11 info.si_errno:0(No error) si_code:23()
Failing at addr:0x401a06
*** End of error message ***
  15 [main] mpicc 7036 _cygtls::handle_exceptions: Error while
dumping
state
 (probably corrupted stack)
Segmentation fault (core dumped)


...Beth

Beth Tibbitts  (859) 243-4981  (TL 545-4981)
High Productivity Tools / Parallel Tools  http://eclipse.org/ptp
IBM T.J.Watson Research Center
Mailing Address:  IBM Corp., 455 Park Place, Lexington, KY 40511

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






[OMPI devel] Build system changes

2006-11-29 Thread Brian W Barrett

Hi all -

Just wanted to give everyone a heads up that there will be two  
changes to the build system that should have minimal impact on  
everyone, but are worth noting:


  1) If you are using Autoconf 2.60 or later, you *MUST* be using
 Automake 1.10 or later.  Most people are still using AC 2.59,
 so this should have zero impact on the group.

  2) We will now be checking to make sure that the C++, F77, F90,
 and ObjC compilers can link against the C compiler.  This
 should clean up some of the amorphous errors people have been
 getting when they do something like: 'CFLAGS=-m32 CXXFLAGS=-m64',
 usually by not specifying one of the two...

Brian


Re: [OMPI devel] incorrect definition of MPI_ERRCODES_IGNORE?

2006-12-30 Thread Brian W. Barrett
Thanks for the bug report.  You are absolutely correct - the #define is
incorrect in Open MPI.  I've committed a fix to our development trunk and
it should be included in future releases.  In the meantime, it is
safe to change the line in the installed mpi.h for Open MPI from:

  #define MPI_ERRCODES_IGNORE  ((void *) 0)/* don't return error
codes */

to

  #define MPI_ERRCODES_IGNORE  ((int *) 0)/* don't return error
codes */

Since it's a simple cast, there is no need to recompile Open MPI's libmpi
-- modifying the installed mpi.h is safe.
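
For reference, a minimal C sketch of the call from the report (the command
name "worker" and the process count here are made up for illustration); with
the corrected macro it compiles cleanly as both C and C++:

  MPI_Comm children;
  int ierr;

  /* MPI_ERRCODES_IGNORE must be usable where an int* is expected */
  ierr = MPI_Comm_spawn("worker", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                        0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);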

Thanks again,

Brian


>   OPEN MPI folks,
>
>Please see the possible error in your code. If this is indeed an error
> on your part, we would appreciate a fix as soon as possible so that we do
> not have to direct our users to other MPI implementations.
>
>Thanks
>
> Barry
>
>
> On Fri, 29 Dec 2006, Satish Balay wrote:
>
>> Looks like there is some issues with MPI_Spawn() and OpenMPI.
>>
>> >
>> libfast in:
>> /Volumes/MaxtorUFS1/geoframesvn/tools/petsc-dev/src/sys/objects
>> mpinit.c: In function 'PetscErrorCode PetscOpenMPSpawn(PetscMPIInt)':
>> mpinit.c:73: error: invalid conversion from 'void*' to 'int*'
>> mpinit.c:73: error:   initializing argument 8 of 'int
>> MPI_Comm_spawn(char*, char**, int, ompi_info_t*, int, omp
>> i_communicator_t*, ompi_communicator_t**, int*)'
>> ar: mpinit.o: No such file or directory
>> <<
>>
>> ierr =
>> MPI_Comm_spawn(programname,argv,nodesize-1,MPI_INFO_NULL,0,PETSC_COMM_SELF,&children,MPI_ERRCODES_IGNORE);CHKERRQ(ierr);
>>
>>
>> Looks like using MPI_ERRCODES_IGNORE in that function call is
>> correct. However OpenMPI declares it to '((void *) 0)' giving compile
>> error with c++. [MPICH declares it to '(int *)0' - which doesn't give
>> any compile erorrs].
>>
>> I guess the following change should work - but I suspect this is an
>> openmpi bug.. I don't think its appropriate to make this change in
>> PETSc code..
>>
>> ierr =
>> MPI_Comm_spawn(programname,argv,nodesize-1,MPI_INFO_NULL,0,PETSC_COMM_SELF,&children,(int*)
>> MPI_ERRCODES_IGNORE);CHKERRQ(ierr);
>>
>> Satish
>>
>> On Fri, 29 Dec 2006, Charles Williams wrote:
>>
>> > Hi,
>> >
>> > I'm not sure if this is a problem with PETSc or OpenMPI.  Things were
>> building
>> > OK on December 19, and this problem has crept in since then.  Thanks
>> for any
>> > ideas.
>> >
>> > Thanks,
>> > Charles
>> >
>> >
>> >
>> > Charles A. Williams
>> > Dept. of Earth & Environmental Sciences
>> > Science Center, 2C01B
>> > Rensselaer Polytechnic Institute
>> > Troy, NY  12180
>> > Phone:(518) 276-3369
>> > FAX:(518) 276-2012
>> > e-mail:will...@rpi.edu
>> >
>> >
>>
>>
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r12945

2007-01-02 Thread Brian W. Barrett
Because that's what we had been using and I was going for minimal  
change (since this is for v1.2).  Also note that *none* of this code  
is in performance critical areas.  Last I checked, we don't really  
care how fast attribute updates and error handlers are fired...  I  
think there are much better ways of dealing with all the problems  
addressed below, but to do it right means a fairly sizable change and  
that seemed like a bad idea at this time.


Brian

On Jan 2, 2007, at 9:06 AM, George Bosilca wrote:


Isn't using an STL map to relate the C pointer to the C++ object way more
expensive than it needs to be? The STL map is just a hash table; it can be
as optimized as you want, but it's still a hash table. How about using
exactly the same mechanism as for the Fortran handles? It's cheap, it's
based on an array, it's thread safe, and we just reuse the code already
there.

   george.

On Dec 30, 2006, at 6:41 PM, brbar...@osl.iu.edu wrote:


Author: brbarret
Date: 2006-12-30 18:41:42 EST (Sat, 30 Dec 2006)
New Revision: 12945

Added:
   trunk/ompi/mpi/cxx/datatype.cc
   trunk/ompi/mpi/cxx/file.cc
   trunk/ompi/mpi/cxx/win.cc
Modified:
   trunk/ompi/errhandler/errhandler.c
   trunk/ompi/errhandler/errhandler.h
   trunk/ompi/mpi/cxx/Makefile.am
   trunk/ompi/mpi/cxx/comm.cc
   trunk/ompi/mpi/cxx/comm.h
   trunk/ompi/mpi/cxx/comm_inln.h
   trunk/ompi/mpi/cxx/datatype.h
   trunk/ompi/mpi/cxx/datatype_inln.h
   trunk/ompi/mpi/cxx/errhandler.h
   trunk/ompi/mpi/cxx/file.h
   trunk/ompi/mpi/cxx/file_inln.h
   trunk/ompi/mpi/cxx/functions.h
   trunk/ompi/mpi/cxx/functions_inln.h
   trunk/ompi/mpi/cxx/intercepts.cc
   trunk/ompi/mpi/cxx/mpicxx.cc
   trunk/ompi/mpi/cxx/mpicxx.h
   trunk/ompi/mpi/cxx/win.h
   trunk/ompi/mpi/cxx/win_inln.h

Log:
A number of MPI-2 compliance fixes for the C++ bindings:

  * Added Create_errhandler for MPI::File
  * Make errors_throw_exceptions a first-class predefined exception
handler, and make it work for Comm, File, and Win
  * Deal with error handlers and attributes for Files, Types, and  
Wins

like we do with Comms - can't just cast the callbacks from C++
signatures to C signatures.  Callbacks will then fire with the
C object, not the C++ object.  That's bad.


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI devel] 1.2b3 fails on bluesteel

2007-01-22 Thread Brian W. Barrett

On Jan 22, 2007, at 10:39 AM, Greg Watson wrote:


On Jan 22, 2007, at 9:48 AM, Ralph H Castain wrote:


On 1/22/07 9:39 AM, "Greg Watson"  wrote:

I tried adding '-mca btl ^sm -mca mpi_preconnect_all 1' to the  
mpirun

command line but it still fails with identical error messages.

I don't understand the issue with allocating nodes under bproc.  
Older

versions of OMPI have always just queried bproc for the nodes that
have permissions set so I can execute on them. I've never had to
allocate any nodes using a hostfile or any other mechanism. Are you
saying that this no longer works?


Turned out that mode of operation was a "bug" that caused all  
kinds of

problems in production environments - that's been fixed for quite
some time.
So, yes - you do have to get an official "allocation" of some kind.
Even the
changes I mentioned wouldn't remove that requirement in the way you
describe.


BTW, there's no requirement for a bproc system to employ a job
scheduler. So in my view OMPI is "broken" for bproc systems if it
imposes such a requirement.



I agree that the present assumption that BProc requires LSF to be in use  
is broken, and we will have a fix for that shortly.  However, we will still  
require a resource allocator of some sort (even a hostfile should  
work) to tell us which nodes to run on.  It should be possible to  
write a resource allocator that just grabs nodes out of the available  
pool returned by the bproc status functions, but I  
don't believe that's on the to-do list in the near future...


Brian

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




[OMPI devel] Libtool update for v1.2

2007-01-23 Thread Brian W. Barrett

Hi all -

In December I had brought up the idea of updating the snapshot of  
Libtool 2 that we use for building the v1.2 branch to a more recent  
snapshot.  The group seemed to think this was a good idea and I was  
going to do it, then got sidetracked working around a bug in their  
support for dylib (OS X's shared libraries).  I committed a  
workaround to the trunk today for the bug (as well as sending one of  
the Libtool developers a patch to libtool that resolves the issue).


Once I hear back from Ralf (the LT developer), I'd like to finally do  
the LT update for our v1.2 tarballs.  The advantage to us is slightly  
faster builds, fixed convenience library dependencies (no more having  
to set LIBS=/usr/lib64), and more bug fixes.


Does this still sound agreeable to everyone?

Brian

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




[OMPI devel] v1.2 / trunk tarball libtool change

2007-01-25 Thread Brian W. Barrett

Hi all -

As of tonight, the version of Libtool used to build "official"  
tarballs for the v1.2 branch and the trunk (this includes nightly  
snapshots, beta releases, and official releases) has been updated  
from a snapshot of Libtool 2 from June/July 2006 to one from Jan 23,  
2007.  This update will solve a number of problems, including the  
multilib .la problem that has bitten a few people over the past  
years.  I also made a copy of the Libtool 2 snapshot we're using to  
build our tarballs available on the SVN building page, so that people  
who wish to use the exact same Libtool version as the nightly  
snapshots for their development can do so.


http://www.open-mpi.org/svn/building.php

Note that no change is required on your part.  You do not have to  
update the copy of Libtool you use for regular testing or development.



Brian

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI devel] [OMPI svn] svn:open-mpi r13644

2007-02-13 Thread Brian W. Barrett


On Feb 13, 2007, at 5:16 PM, Jeff Squyres wrote:


On Feb 13, 2007, at 7:10 PM, George Bosilca wrote:


It's already in the 1.2!!! I don't know how much you care about
performance, but I do. This patch increases the latency by 10%. It
might be correct for the pathscale compiler, but it didn't look like a
huge requirement for all other compilers. A memory barrier for an
initialization and an unlock definitely looks like killing an ant
with a nuclear strike.


Can we roll this back and find some other way?


Yes, we can.

It's not actually the memory barrier we need; what we need is to tell the  
compiler not to do anything stupid, because we expect memory to be  
invalidated.  I'll commit a new, different fix tonight.



Brian

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI devel] [OMPI svn] svn:open-mpi r13644

2007-02-13 Thread Brian W. Barrett

On Feb 13, 2007, at 7:37 PM, Brian W. Barrett wrote:


On Feb 13, 2007, at 5:16 PM, Jeff Squyres wrote:


On Feb 13, 2007, at 7:10 PM, George Bosilca wrote:


It's already in the 1.2!!! I don't know how much you care about
performance, but I do. This patch increases the latency by 10%. It
might be correct for the pathscale compiler, but it didn't look like a
huge requirement for all other compilers. A memory barrier for an
initialization and an unlock definitely looks like killing an ant
with a nuclear strike.


Can we roll this back and find some other way?


Yes, we can.

It's not actually the memory barrier we need; what we need is to tell the
compiler not to do anything stupid, because we expect memory to be
invalidated.  I'll commit a new, different fix tonight.


Upon further review, I'm wrong again.  The original patch was wrong  
(not sure what I was thinking this afternoon) and my statement above  
is wrong.  So the problem starts with the code:


a = 1
mylock->lock = 0
b = 2

Which is essentially what you have after inlining the atomic unlock  
as it exists today.  It's not totally unreasonable for a compiler  
(we have seen this in practice with GCC on LA-MPI, and it is likely  
happening now without us realizing it) to reorder  
that to:


a = 1
b = 2
mylock->lock = 0

or

mylock->lock = 0
a = 1
b = 2

After all, there are no memory dependencies in the three lines of  
code.  When we had the compare and swap for unlock, there was a  
memory dependency: either the compare and swap inline assembly  
hinted to the compiler that memory was changed by the op, so it  
wouldn't reorder memory accesses across that boundary, or the compare  
and swap wasn't inlined at all.  Compilers are pretty much not going to  
reorder memory accesses across a function call unless it's 100% clear that  
there is no side effect that might be important, which is  
basically never the case in C.


Ok, so we can tell the compiler not to reorder memory access with a  
little care (either compiler hints using inline assembly statements  
that include the "memory" invalidation hint) or by making  
atomic_unlock a function.


But now we start running on hardware, and the memory controller is  
free to start reordering memory operations.  We don't have any instructions  
telling the CPU / memory controller not to reorder our original  
instructions, so it can still do either one of the two bad cases.   
Still not good for us and definitely could lead to incorrect  
programs.  So we need a memory barrier or we have potentially invalid  
code.


The full memory barrier is totally overkill for this situation, but  
some memory barrier is needed.  While not quite correct, I believe  
that something like:


static inline void
opal_atomic_unlock(opal_atomic_lock_t *lock)
{
  opal_atomic_wmb();
  lock->u.lock=OPAL_ATOMIC_UNLOCKED;
}

would be more correct than having the barrier after the write, and give  
slightly better performance than the full atomic barrier.  On x86 and  
x86_64, memory barriers are "free", in that all they do is limit the  
compiler's reordering of memory access.  But on PPC, Sparc, and  
Alpha, it would have a performance cost.  Don't know what that cost  
is, but I know that we need to pay it for correctness.


Long term, we should probably try to implement spinlocks as inline  
assembly.  This wouldn't provide a whole lot of performance  
difference, but at least I could make sure the memory barrier is in  
the right place and help the compiler not be stupid.


By the way, this is what the Linux kernel does, adding credence to my  
argument, I hope ;).
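
As a rough sketch of the two pieces discussed above -- a compiler-only
barrier via the "memory" clobber, and a write barrier ahead of the store --
something like the following (illustration only, not the actual Open MPI
implementation):

#define EXAMPLE_COMPILER_BARRIER() __asm__ __volatile__ ("" : : : "memory")

static inline void
example_atomic_unlock(volatile int *lock)
{
    /* keep the compiler from moving loads/stores across this point; on
       x86/x86_64 this is essentially all the barrier costs */
    EXAMPLE_COMPILER_BARRIER();
    /* on PPC, SPARC, or Alpha a real write memory barrier instruction
       (e.g. opal_atomic_wmb()) would be needed here instead */
    *lock = 0;
}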



Brian

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory




Re: [OMPI devel] [PATCH] ompi_get_libtool_linker_flags.m4: fix $extra_ldflags detection

2007-02-24 Thread Brian W. Barrett
Thanks for the bug report and the patch.  Unfortunately, the remove  
smallest prefix pattern syntax doesn't work with Solaris /bin/sh  
(standards would be better if everyone followed them...), but I  
committed something to our development trunk that handles the issue.   
It should be releases as part of v1.2.1 (we're too far in testing to  
make it part of v1.2).


Thanks,

Brian


On Feb 15, 2007, at 9:12 AM, Bert Wesarg wrote:


Hello,

when using a multi token CC variable (like "gcc -m32"), the logic to
extract $extra_ldflags from libtool doesn't work. So here is a little  
hack

to remove the $CC prefix from the libtool-link cmd.

Bert Wesarg
diff -ur openmpi-1.1.4/config/ompi_get_libtool_linker_flags.m4  
openmpi-1.1.4-extra_ldflags-fix/config/ 
ompi_get_libtool_linker_flags.m4
--- openmpi-1.1.4/config/ompi_get_libtool_linker_flags.m4	 
2006-04-12 18:12:28.0 +0200
+++ openmpi-1.1.4-extra_ldflags-fix/config/ 
ompi_get_libtool_linker_flags.m4	2007-02-15 15:11:28.285844893 +0100

@@ -76,11 +76,15 @@
 cmd="$libtool --dry-run --mode=link --tag=CC $CC bar.lo libfoo.la - 
o bar $extra_flags"

 ompi_check_linker_flags_work yes

+# use array initializer to remove multiple spaces in $CC
+tempCC=($CC)
+tempCC="${tempCC[@]}"
+output="${output#$tempCC}"
+unset tempCC
 eval "set $output"
 extra_ldflags=
 while test -n "[$]1"; do
 case "[$]1" in
-$CC) ;;
 *.libs/bar*) ;;
 bar*) ;;
 -I*) ;;
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] replace 'atoi' with 'strtol'

2007-04-18 Thread Brian W. Barrett
The patch is so that you can pass in hex in addition to decimal, right?  I
think that makes sense.  But since we're switching to strtol, it might
also make sense to add some error detection while we're at it.  Not a huge
deal, but it would be nice :).
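
A sketch of the kind of error detection being suggested (the helper name is
hypothetical, not part of the proposed patch):

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* returns 0 on success, -1 if the string is not a complete in-range number */
static int param_str_to_int(const char *str, int *out)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(str, &end, 0);   /* base 0: decimal, hex (0x...), or octal */
    if (0 != errno || end == str || '\0' != *end ||
        val > INT_MAX || val < INT_MIN) {
        return -1;
    }
    *out = (int) val;
    return 0;
}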

Brian


> Hi,
>
> I want to add a patch to opal mca.
>
> This patch replaces an 'atoi' call with a 'strtol' call.
>
> If it's O.K with everyone I'll submit this patch by the end of the week.
>
>
>
> Index: opal/mca/base/mca_base_param.c
>
> ===
>
> --- opal/mca/base/mca_base_param.c (revision 14391)
>
> +++ opal/mca/base/mca_base_param.c  (working copy)
>
> @@ -1673,7 +1673,7 @@
>
>if (NULL != param->mbp_env_var_name &&
>
>NULL != (env = getenv(param->mbp_env_var_name))) {
>
>  if (MCA_BASE_PARAM_TYPE_INT == param->mbp_type) {
>
> -  storage->intval = atoi(env);
>
> +  storage->intval = (int)strtol(env,(char**)NULL,0);
>
>  } else if (MCA_BASE_PARAM_TYPE_STRING == param->mbp_type) {
>
>storage->stringval = strdup(env);
>
>  }
>
> @@ -1714,7 +1714,7 @@
>
>  if (0 == strcmp(fv->mbpfv_param, param->mbp_full_name)) {
>
>  if (MCA_BASE_PARAM_TYPE_INT == param->mbp_type) {
>
>  if (NULL != fv->mbpfv_value) {
>
> -param->mbp_file_value.intval =
> atoi(fv->mbpfv_value);
>
> +param->mbp_file_value.intval =
> (int)strtol(fv->mbpfv_value,(char**)NULL,0);
>
>  } else {
>
>  param->mbp_file_value.intval = 0;
>
>  }
>
>
>
> Thanks.
>
>
>
> Sharon.
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] replace 'atoi' with 'strtol'

2007-04-18 Thread Brian W. Barrett
>  > Because the target variable is an (int).
>
> If I were writing the code, I would leave the cast out.  By assigning
> the value to an int variable, you get the same effect anyway, so the
> cast is redundant.  And if you ever change the variable to a long, now
> you have to remember to delete the cast too.  So I don't see any
> upside to having the cast.
>
> But it's just a minor style issue...

I agree 100% with Roland on this one.  There's a reason that compilers
don't complain about this particular cast.  Casting from integer type to
integer type just isn't a big deal in my book.

Of course, I generally try to avoid casts at all costs, since they tend to
cover real issues (see all the evil casts of long* to int* that have
screwed us continually with 64-bit big-endian machines).

But I don't care enough to argue the point :).

Brian


Re: [OMPI devel] [OMPI svn] svn:open-mpi r14782

2007-05-27 Thread Brian W. Barrett
> On Sun, May 27, 2007 at 10:34:33AM -0600, Galen Shipman wrote:
>> Actually, we still need  MCA_BTL_FLAGS_FAKE_RDMA , it can be used as
>> a hint for components such as one-sided.
> What is the purpose of the hint if it should be set for each interconnect?
> Just assume that it is set and behave accordingly. That's what we decided
> to do in OB1. And the name is not very good either :) All RDMA networks
> behave like this.

Yeah, I agree -- the current semantics aren't very useful anymore.  I'd
actually like to just redefine the FAKE_RDMA flag's meaning.  Some of the
BTLs assume that there will be one set of prepare_src / prepare_dst calls
for each put/get call.  This won't work for one-sided RDMA, where we'll
call prepare_dst at window creation time and reuse it.  I'd like to have
FAKE_RDMA set for those BTLs.

Brian


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r15474

2007-07-17 Thread Brian W. Barrett
So first, there's an error in the patch (e-mail with details coming
shortly, as there are many errors in the patch).  There's no need for both
isends (the new one and the one in there already).

Second, this is in code that's a crutch around the real issue, which is
that for a very small class of applications, the way wireup occurs with
InfiniBand makes it time consuming if the application is very asynchronous
(one process does a single send, the other process doesn't enter the MPI
library for many minutes).  It's not on by default, and it is not recommended
for most uses.

The goal is not to have a barrier, but to have every process have at least
one channel for MPI communication fully established to every other
process.  The barrier is a side effect.  The MPI barrier isn't used
precisely because it doesn't cause every process to talk to every other
process.  The rotating ring algorithm was used because we're also trying
as hard as possible to reduce single-point contention, which, when everyone
is trying to connect at once, caused failures in either the OOB fabric
(which I think I fixed a couple months ago) or in the IB layer (which
seemed to be the nature of IB).

This is not new code, and given the tiny number of users (now that the OOB
is fixed, one app that I know of at LANL), I'm not really concerned about
scalability.
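
For context, a sketch of what such a rotation can look like at the MPI level
(illustration only; the actual ompi_mpi_preconnect.c loop uses the
isend/irecv pair shown in the patch below): on iteration i each rank
exchanges one message with the peers i ranks away, so every pair eventually
establishes a connection without everyone hammering the same node at once.

char in = 0, out = 0;
int i, size, rank;

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

for (i = 1; i < size; ++i) {
    int next = (rank + i) % size;
    int prev = (rank - i + size) % size;
    MPI_Sendrecv(&out, 1, MPI_CHAR, next, 1,
                 &in,  1, MPI_CHAR, prev, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}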

Brian


> If you really want to have a fully featured barrier why not using the
> collective barrier ? This double ring barrier have really bad
> performance, and it will became a real scalability issue.
>
> Or do we really need to force this particular connection shape (left
> & right) ?
>
>george.
>
> Modified: trunk/ompi/runtime/ompi_mpi_preconnect.c
> 
> ==
> --- trunk/ompi/runtime/ompi_mpi_preconnect.c  (original)
> +++ trunk/ompi/runtime/ompi_mpi_preconnect.c  2007-07-17 21:15:59 EDT
> (Tue, 17 Jul 2007)
> @@ -78,6 +78,22 @@
>
>   ret = ompi_request_wait_all(2, requests, MPI_STATUSES_IGNORE);
>   if (OMPI_SUCCESS != ret) return ret;
> +
> +ret = MCA_PML_CALL(isend(outbuf, 1, MPI_CHAR,
> + next, 1,
> + MCA_PML_BASE_SEND_COMPLETE,
> + MPI_COMM_WORLD,
> + &requests[1]));
> +if (OMPI_SUCCESS != ret) return ret;
> +
> +ret = MCA_PML_CALL(irecv(inbuf, 1, MPI_CHAR,
> + prev, 1,
> + MPI_COMM_WORLD,
> + &requests[0]));
> +if(OMPI_SUCCESS != ret) return ret;
> +
> +ret = ompi_request_wait_all(2, requests, MPI_STATUSES_IGNORE);
> +if (OMPI_SUCCESS != ret) return ret;
>   }
>
>   return ret;
>
>
> On Jul 17, 2007, at 9:16 PM, jsquy...@osl.iu.edu wrote:
>
>> Author: jsquyres
>> Date: 2007-07-17 21:15:59 EDT (Tue, 17 Jul 2007)
>> New Revision: 15474
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/15474
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



Re: [OMPI devel] PML cm and heterogeneous support

2007-10-25 Thread Brian W. Barrett
I'm surprised that ompi_mtl_datatype_{pack, unpack} are properly handling 
the heterogeneous issues - I certainly didn't take that into account when 
I wrote them.  The CM code has never been audited for heterogeneous 
safety, which is why there was protection at that level for not running in 
heterogeneous environments.  The various MTLs likewise have not been 
audited for heterogeneous safety, nor has the mtl base datatype 
manipulation functions.


If someone wanted, they could do such an audit, push the heterogeneous 
disabling code down to the MTLs, and figure out what to do with the 
datatype usage.  The CM code likely doesn't do anything 
heterogeneous-evil, but I can't say for sure.


Brian

On Thu, 25 Oct 2007, Sajjad Tabib wrote:


Hi Brian,

I have actually created a new MTL, in which I have added heterogeneous
support. To experiment whether CM worked in this environment, I took out
the safeguards that prevented one from using CM in a heterogeneous
environment. Miraculously, things have been working so far. I haven't
examined data integrity to an extent that I could say everything works
perfectly, but with MPI_INTS, I do not have any endian problems. Now,
based on my initial tests, I have come to the understanding that the PML
CM safeguard against heterogeneous environments was a mechanism to prevent
users from using existing MTLs. But, if an MTL supports heterogeneous
communication, then it is possible to use the CM component. What is your
take on this?
Anyways, going back to the datatype usage. When you say that: "it's known
the datatype usage in the CM PML won't support heterogeneous operation"
could you please briefly explain this in more detail? I have been using
ompi_mtl_datatype_pack and ompi_mtl_datatype_unpack, which use
ompi_convertor_pack and ompi_convertor_unpack, for data packing. Do you
mean that these functions will not work correctly?

Thank You,

Sajjad Tabib




Brian Barrett 
Sent by: devel-boun...@open-mpi.org
10/24/07 10:04 PM
Please respond to
Open MPI Developers 


To
Open MPI Developers 
cc

Subject
Re: [OMPI devel] PML cm and heterogeneous support






No, it's because the CM PML was never designed to be used in a
heterogeneous environment :).  While the MX BTL does support
heterogeneous operations (at one point, I believe I even had it
working), none of the MTLs have ever been tested in heterogeneous
environments and it's known the datatype usage in the CM PML won't
support heterogeneous operation.

Brian

On Oct 24, 2007, at 6:21 PM, Jeff Squyres wrote:


George / Patrick / Rich / Christian --

Any idea why that's there?  Is that because portals, MX, and PSM all
require homogeneous environments?


On Oct 18, 2007, at 3:59 PM, Sajjad Tabib wrote:



Hi,

I tried to run an MPI program in a heterogeneous environment
using the pml cm component. However, open mpi returned with an
error message indicating that PML add procs returned "Not
supported". I dived into the cm code to see what was wrong and I
came upon the code below, which basically shows that if the
processes are running on different architectures, then return "not
supported". Now, I'm wondering whether my interpretation is correct
or not. Is it true that the cm component does not support a
heterogeneous environment? If so, will the developers support this
in the future? How could I get around this while still using the cm
component? What will happen if I rebuilt openmpi without these
statements?

I would appreciate your help.

 Code:

mca_pml_cm_add_procs(){

#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
107 for (i = 0 ; i < nprocs ; ++i) {
108 if (procs[i]->proc_arch != ompi_proc_local()-

proc_arch) {

109 return OMPI_ERR_NOT_SUPPORTED;
110 }
111 }
112 #endif
.
.
.
}

Sajjad Tabib
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] Question regarding MCA_PML_CM_SEND_REQUEST_INIT_COMMON

2007-10-31 Thread Brian W. Barrett
This is correct -- the MPI_ERROR field should be filled in by the MTL upon 
completion of the request (or when it knows what to stick in there).  The 
CM PML should generally not fill in that field.
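
In other words, a sketch of the idea (variable names assumed, not taken from
any particular MTL): the MTL's completion path does something like

  /* when the MTL completes the request: */
  request->req_base.req_ompi.req_status.MPI_ERROR = MPI_SUCCESS;
  /* ... or the appropriate MPI error code if the operation failed */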


Brian

On Wed, 31 Oct 2007, Jeff Squyres wrote:


Again, I'm not a CM guy :-), but in general, I would think: yes, you
set MPI_ERROR when it is appropriate.  I.e., when you know that the
request has been successful or it has failed.


On Oct 31, 2007, at 9:18 AM, Sajjad Tabib wrote:



Hi Jeff,

Now that you mention it, I believe you are right. In fact, I did
not know that I needed to set the req_status.MPI_ERROR in my MTL. I
looked at the mx mtl and realized that req_status.MPI_ERROR is
getting set in their progress function. So, in general, when do you
set the req_status.MPI_ERROR? After a send/recv has completed
successfully?

Thank You,

Sajjad Tabib



Jeff Squyres 
Sent by: devel-boun...@open-mpi.org
10/31/07 07:29 AM
Please respond to
Open MPI Developers 


To
Open MPI Developers 
cc
Subject
Re: [OMPI devel] Question regarding
MCA_PML_CM_SEND_REQUEST_INIT_COMMON





I haven't done any work in the cm pml so I can't definitively answer
your question, but wouldn't you set req_status.MPI_ERROR in your MTL
depending on the result of the request?


On Oct 29, 2007, at 9:20 AM, Sajjad Tabib wrote:



Hi,

I was issuing an MPI_Bcast in a sample program and was hitting an
unknown error; at least that was what MPI was telling me. I traced
through the code to find my error and came upon
MCA_PML_CM_REQUEST_INIT_COMMON macro function in pml_cm_sendreq.h.
I looked at the function and noticed that in this function the
elements of req_status were getting initialized; however,
req_status.MPI_ERROR was not. I thought that maybe MPI_ERROR must
also require initialization because if the value of MPI_ERROR was
some arbitrary value not equal to MPI_SUCCESS then my program would
definitely die. Unless MPI_ERROR is propagating from upper layers
to signify an error, but I wasn't sure. Anyway, I assumed that
MPI_ERROR was not propagating from upper layers, so then I set
req_status.MPI_ERROR to MPI_SUCCESS and reran my test program. My
program worked. Now, having gotten my program to work, I thought I
should run this by you to make sure that MPI_ERROR was not
propagating from upper layers. Is it ok that I did a:
"(req_send)->req_base.req_ompi.req_status.MPI_ERROR = MPI_SUCCESS;"
in MCA_PML_CM_REQUEST_INIT_COMMON?

Thank You,

Sajjad Tabib
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] Environment forwarding

2007-11-05 Thread Brian W. Barrett
This is extremely tricky to do.  How do you know which environment 
variables to forward (foo in this case) and which not to (hostname)? 
SLURM has a better chance, since it's Linux-only and generally only run on 
tightly controlled clusters.  But there's a whole variety of things that 
shouldn't be forwarded and that list differs from OS to OS.


I believe we toyed around with the "right thing" in LAM and early on with 
Open MPI and decided that it was too hard to meet expected behavior.


Brian

On Mon, 5 Nov 2007, Tim Prins wrote:


Hi,

After talking with Torsten today I found something weird. When using the SLURM
pls we seem to forward a user's environment, but when using the rsh pls we do
not.

I.e.:
[tprins@odin ~]$ mpirun -np 1 printenv |grep foo
[tprins@odin ~]$ export foo=bar
[tprins@odin ~]$ mpirun -np 1 printenv |grep foo
foo=bar
[tprins@odin ~]$ mpirun -np 1 -mca pls rsh printenv |grep foo

So my question is which is the expected behavior?

I don't think we can do anything about SLURM automatically forwarding the
environment, but I think there should be a way to make rsh forward the
environment. Perhaps add a flag to mpirun to do this?

Thanks,

Tim
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Environment forwarding

2007-11-05 Thread Brian W. Barrett

On Mon, 5 Nov 2007, Torsten Hoefler wrote:


On Mon, Nov 05, 2007 at 04:57:19PM -0500, Brian W. Barrett wrote:

This is extremely tricky to do.  How do you know which environment
variables to forward (foo in this case) and which not to (hostname).
SLURM has a better chance, since it's linux only and generally only run on
tightly controlled clusters.  But there's a whole variety of things that
shouldn't be forwarded and that list differs from OS to OS.

I believe we toyed around with the "right thing" in LAM and early on with
OPen MPI and decided that it was too hard to meet expected behavior.

Some applications rely on this (I know at least two right away, Gamess
and Abinit) and they work without problems with Lam/Mpich{1,2} but not
with Open MPI. I am *not* arguing that those applications are correct (I
agree that this way of passing arguments is ugly, but it's done).

I know it's not defined in the standard but I think it's a nice
convenient functionality. E.g., setting the LD_LIBRARY_PATH to find
libmpi.so in the .bashrc is also a pain if you have multiple (Open) MPIs
installed.


LAM does not automatically propagate environment variables -- its 
behavior is almost *exactly* like Open MPI's.  There might be a situation 
where the environment is not quite so scrubbed if a process is started on 
the same node mpirun is executed on, but that's only an appearance -- in 
reality, that's the environment that was alive when lamboot was executed.


With both LAM and Open MPI, there is the -x option to propagate a list of 
environment variables, but that's about it.  Neither will push 
LD_LIBRARY_PATH by default (and there are many good reasons for that, 
particularly in heterogeneous situations).


Brian


[OMPI devel] Incorrect one-sided test

2007-11-07 Thread Brian W. Barrett

Hi all -

Lisa Glendenning, who's working on a Portals one-sided component, 
discovered that the test onesided/test_start1.c in our repository is 
incorrect.  It assumes that MPI_Win_start is non-blocking, but the 
standard says that "MPI_WIN_START is allowed to block until the 
corresponding MPI_WIN_POST calls are executed".  The pt2pt and rdma 
components did not block, so the test error did not show up with those 
components.


I've fixed the test in r1223, but thought I'd let everyone know I changed 
one of our conformance tests.
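
For reference, a minimal sketch of the constraint (the buffers, groups, and
ranks here are hypothetical and assumed to be set up already): a correct test
cannot rely on MPI_Win_start returning before the matching post has been
issued.

if (i_am_origin) {
    MPI_Win_start(target_group, 0, win);   /* may legally block here ...  */
    MPI_Put(buf, 1, MPI_INT, target_rank, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);
} else {
    MPI_Win_post(origin_group, 0, win);    /* ... until this call is made */
    MPI_Win_wait(win);
}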


Brian


Re: [OMPI devel] THREAD_MULTIPLE

2007-11-28 Thread Brian W. Barrett

On Wed, 28 Nov 2007, Jeff Squyres wrote:


We've had a few users complain about trying to use THREAD_MULTIPLE
lately and having it not work.

Here's a proposal: why don't we disable it (at least in the 1.2
series)?  Or, at the very least, put in a big stderr warning that is
displayed when THREAD_MULTIPLE is selected?

Comments?


While you're disabiling it, might also want to remove the bullet from the 
front page of www.open-mpi.org that suggests we support it...


Brian



Re: [OMPI devel] RTE Issue II: Interaction between the ROUTED and GRPCOMM frameworks

2007-12-05 Thread Brian W. Barrett

To me, (a) is dumb and (c) isn't a non-starter.

The whole point of the component system is to separate concerns.  Routing 
topology and collective operations are two different concerns.  While 
there's some overlap (a topology-aware collective doesn't make sense when 
using the unity routing structure), it's not overlap in the sense that one 
implies you need the other.  I can think of a couple of different ways of 
implementing the group communication framework, all of which are totally 
independent of the particulars of how routing is tracked.


(b) has a very reasonable track record of working well on the OMPI side 
(the mpool / btl thing figures itself out fairly well).  Bringing such a 
setup over to ORTE wouldn't be bad, but a bit hackish.


Of course, there are at most two routed components built at any time, and 
the defaults are all most non-debugging people will ever need, so I guess 
I'm not convinced (c) is a non-starter.


Brian

On Wed, 5 Dec 2007, Tim Prins wrote:


To me, (c) is a non-starter. I think whenever possible we should be
automatically doing the right thing. The user should not need to have
any idea how things work inside the library.

Between options (a) and (b), I don't really care.

(b) would be great if we had a mca component dependency system which has
been much talked about. But without such a system it gets messy.

(a) has the advantage of making sure there is no problems and allowing
the 2 systems to interact very nicely together, but it also might add a
large burden to a component writer.

On a related, but slightly different topic, one thing that has always
bothered me about the grpcomm/routed implementation is that it is not
self contained. There is logic for routing algorithms outside of the
components (for example, in orte/orted/orted_comm.c). So, if there are
any overhauls planned I definitely think this needs to be cleaned up.

Thanks,

Tim

Ralph H Castain wrote:

II. Interaction between the ROUTED and GRPCOMM frameworks
When we initially developed these two frameworks within the RTE, we
envisioned them to operate totally independently of each other. Thus, the
grpcomm collectives provide algorithms such as a binomial "xcast" that uses
the daemons to scalably send messages across the system.

However, we recently realized that the efficacy of the current grpcomm
algorithms directly hinge on the daemons being fully connected - which we
were recently told may not be the case as other people introduce different
ROUTED components. For example, using the binomial algorithm in grpcomm's
xcast while having a ring topology selected in ROUTED would likely result in
terrible performance.

This raises the following questions:

(a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure that
the group collectives algorithms properly "match" the communication
topology?

(b) should we automatically select the grpcomm/routed pairings based on some
internal logic?

(c) should we leave this "as-is" and the user is responsible for making
intelligent choices (and for detecting when the performance is bad due to
this mismatch)?

(d) other suggestions?

Ralph


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] vt-integration

2007-12-05 Thread Brian W. Barrett
OS X enforces a no duplicate symbol rule when flat namespaces are in use 
(the default on OS X).  If all the libraries are two-level namespace 
libraries (libSystem.dylib, aka libm.dylib is two-level), then duplicate 
symbols are mostly ok.


Libtool by default forces a flat namespace in shared libraries to work 
around an oddity on early OS X systems with undefined references.  There's 
also a way to make static two-level namespaces (I think), but I haven't 
tried that before.  You can cause Libtool (and the linker) to be a bit 
more sane if you set the environment variable MACOSX_DEPLOYMENT_TARGET to 
either 10.3 or 10.4.  The shared library rules followed by Libtool and the 
compiler chain will then be for that OS X release, rather than for the 
original 10.0.  We don't support anything older than 10.3, so this isn't 
really a problem.


Of course, since the default for users is to emit 10.0 target code, that 
can be a bit hard to make work.  So you might want to have a configure 
test to figure all that out and not build the IO intercept library in some 
cases.


Brian

On Wed, 5 Dec 2007, Jeff Squyres wrote:


I know that OS X's linker is quite different than the Linux linker --
you might want to dig into the ld(1) man page on OS X as a starting
point, and/or consult developer.apple.com for more details.


On Dec 5, 2007, at 10:04 AM, Matthias Jurenz wrote:


Hi Jeff,

I have added checks for the functions open64, creat64, etc. to the
VT's configure script,
so building of VT works fine on MacOS AND Solaris (Terry had the
same problem).
Thanks for your hint ;-)

Unfortunately, there is a new problem on MacOS. I get the following
linker errors, if I try
to link an application with the VT libraries:

gcc -finstrument-functions pi_seq.o -lm -o pi_seq
-L/Users/jurenz/lib/vtrace-5.4.1/lib  -lvt  -lotf -lz -L/usr/local/
lib/ -lbfd -lintl -L/usr/local/lib/ -liberty
/usr/bin/ld: multiple definitions of symbol _close
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(close.So)
definition of _close
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _close in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fclose
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fclose.So)
definition of _fclose
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fclose in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fdopen
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fdopen.So)
definition of _fdopen
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fdopen in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fgets
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fgets.So)
definition of _fgets
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fgets in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fopen
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fopen.So)
definition of _fopen
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fopen in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fprintf
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../
libm.dylib(fprintf.So) definition of _fprintf
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fprintf in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fputc
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fputc.So)
definition of _fputc
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fputc in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fread
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fread.So)
definition of _fread
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fread in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _fwrite
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fwrite.So)
definition of _fwrite
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _fwrite in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _open
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(open.So)
definition of _open
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _open in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _read
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(read.So)
definition of _read
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _read in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _rewind
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(rewind.So)
definition of _rewind
/Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition
of _rewind in section (__TEXT,__text)
/usr/bin/ld: multiple definitions of symbol _write
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(write.So)
definition of _write
/Users/jurenz/l

Re: [OMPI devel] opal_condition_wait

2007-12-06 Thread Brian W. Barrett

On Thu, 6 Dec 2007, Tim Prins wrote:


Tim Prins wrote:

First, in opal_condition_wait (condition.h:97) we do not release the
passed mutex if opal_using_threads() is not set. Is there a reason for
this? I ask since this violates the way condition variables are supposed
to work, and it seems like there are situations where this could cause
deadlock.

So in (partial) answer to my own email, this is because throughout the
code we do:
OPAL_THREAD_LOCK(m)
opal_condition_wait(cond, m);
OPAL_THREAD_UNLOCK(m)

So this relies on opal_condition_wait not touching the lock. This
explains it, but it still seems very wrong.


Yes, this is correct.  The assumption is that you are using the 
conditional macro lock/unlock with the condition variables.  I personally 
don't like this (I think we should have had macro conditional condition 
variables), but that obviously isn't how it works today.


The problem with always holding the lock when you enter the condition 
variable is that even when threading is disabled, calling a lock is at 
least as expensive as an add, possibly including a cache miss.  So from a 
performance standpoint, this would be a no-go.
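
A rough sketch of the pattern being relied on (assumed form; the real OPAL
macros may differ in detail) -- the lock/unlock macros themselves check
opal_using_threads(), so the single-threaded case costs only a branch:

#define EXAMPLE_THREAD_LOCK(m)          \
    do {                                \
        if (opal_using_threads()) {     \
            opal_mutex_lock(m);         \
        }                               \
    } while (0)

#define EXAMPLE_THREAD_UNLOCK(m)        \
    do {                                \
        if (opal_using_threads()) {     \
            opal_mutex_unlock(m);       \
        }                               \
    } while (0)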



Also, when we are using threads, there is a case where we do not
decrement the signaled count, in condition.h:84. Gleb put this in in
r9451, however the change does not make sense to me. I think that the
signal count should always be decremented.

Can anyone shine any light on these issues?


Unfortunately, I can't add much on this front.

Brian


Re: [OMPI devel] Dynamically Turning On and Off Memory Manager of Open MPI at Runtime??

2007-12-10 Thread Brian W. Barrett

On Mon, 10 Dec 2007, Peter Wong wrote:


Open MPI defines its own malloc (by default), so malloc of glibc
is not called.

But without calling glibc's malloc, the libhugetlbfs allocator that backs
text and dynamic data with large pages (e.g., 16MB pages on POWER systems)
is not used.

Indeed, we can build Open MPI with --with-memory-manager=none.

I am wondering about the feasibility of turning the memory manager on
and off dynamically at runtime as a new feature.


Hi Peter -

The problem is that we actually intercept the malloc() call, so once we've 
done that (which is a link-time thing), it's too late to use the 
underlying malloc to actually do its thing.


I was going to add some code to Open MPI to make it an application link 
time choice (rather than an OMPI-build time choice), but unfortunately 
my current day to day work is not on Open MPI, so unless someone else 
picks it up, it's unlikely this will get implemented in the near future. 
Of course, if someone has the time and desire, I can describe to them what 
I was thinking.


The only way I've found to do memory tracking at run-time is to use 
LD_PRELOAD tricks, which I believe there were some other (easy to 
overcome) problems with.


What would be really nice (although unlikely to occur) is if there was a 
thread-safe way to hook into the memory manager directly (rather than 
playing linking tricks).  GLIBC's malloc provides hooks, but they aren't 
thread safe (as in two user threads calling malloc at the same time would 
result in badness).  Darwin/Mac OS X provides thread-safe hooks that work 
very well (don't require linker tricks and can be turned off at run-time), 
but are slightly higher level than what we want -- there we can intercept 
malloc/free, but what we'd really like to know is when memory is being 
given back to the operating system.
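
For illustration, the glibc hook mechanism referred to above looks roughly
like this (sketch only; these hooks are deprecated in newer glibc).  The
save/restore dance around the real malloc() is exactly what is not safe when
two threads allocate at once:

#include <malloc.h>

static void *(*old_malloc_hook)(size_t, const void *);

static void *my_malloc_hook(size_t size, const void *caller)
{
    void *p;

    __malloc_hook = old_malloc_hook;   /* uninstall ourselves ...           */
    p = malloc(size);                  /* ... call the real malloc ...      */
    old_malloc_hook = __malloc_hook;
    __malloc_hook = my_malloc_hook;    /* ... and reinstall; a second thread
                                          in here sees a torn hook state    */
    /* record the allocation for tracking purposes here */
    return p;
}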


Hope this helps,

Brian


Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Brian W. Barrett

On Tue, 11 Dec 2007, Gleb Natapov wrote:


  I did a rewrite of matching code in OB1. I made it much simpler and 2
times smaller (which is good, less code - less bugs). I also got rid
of huge macros - very helpful if you need to debug something. There
is no performance degradation; actually I even see a very small performance
improvement. I ran MTT with this patch and the result is the same as on
trunk. I would like to commit this to the trunk. The patch is attached
for everybody to try.


I don't think we can live without those macros :).  Out of curiosity, is 
there any functionality that was removed as a result of this change?


I'll test on a couple systems over the next couple of days...

Brian


Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Brian W. Barrett

On Wed, 12 Dec 2007, Gleb Natapov wrote:


On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote:

This is better than nothing, but really not very helpful for looking at the
specific issues that can arise with this, unless these systems have several
parallel networks, with tests that will generate a lot of parallel network
traffic, and be able to self-check for out-of-order receives - i.e. this
needs to be encoded into the payload for verification purposes.  There are
some out-of-order scenarios that need to be generated and checked.  I think
that George may have a system that will be good for this sort of testing.


I am running various test with multiple networks right now. I use
several IB BTLs and TCP BTL simultaneously. I see many reordered
messages and all tests were OK till now, but they don't encode
message sequence in a payload as far as I know. I'll change one of
them to do so.


Other than Rich's comment that we need sequence numbers, why add them?  We 
haven't had them for non-matching packets for the last 3 years in Open MPI 
(ie, forever), and I can't see why we would need them.  Yes, we need 
sequence numbers for match headers to make sure MPI ordering is correct. 
But for the rest of the payload, there's no need with OMPI's datatype 
engine.  It's just more payload for no gain.


Brian


Re: [OMPI devel] IPv4 mapped IPv6 addresses

2007-12-14 Thread Brian W. Barrett

On Fri, 14 Dec 2007, Adrian Knoth wrote:


Should we consider moving towards these mapped addresses? The
implications:

  - less code, only one socket to handle
  - better FD consumption
  - breaks WinXP support, but not Vista/Longhorn or later
  - requires non-default kernel runtime setting on OpenBSD for IPv4
connections

FWIW, FD consumption is the only real issue to consider.


My thought is no.  The resource consumption isn't really an issue to 
consider.  It would also simplify the code (although work that Adrian and 
I did later to clean up the TCP OOB component has limited that).  If you 
look at the FD count issue, you're going to reduce the number of FDs (for 
the OOB anyway) by 2.  Not (2 * NumNodes), but 2 (one for BTL, one for 
OOB).  Today we have a listen socket for IPv4 and another for IPv6.  With 
IPv4 mapped addresses, we'd have one that did both.  In terms of per-peer 
connections, the OOB tries one connection at a time, so there will be at 
most 1 OOB connection between any two peers.


In return for 2 FDs, we'd have to play with code that we know works and, 
with cleanups over the last year, has actually become quite simple.  We'd 
have to break WinXP support (when it sounds like no one is really moving 
to Vista), and we'd break out-of-the-box OpenBSD.
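
For reference, the mechanism under discussion boils down to a single
AF_INET6 listener with IPV6_V6ONLY turned off, so IPv4 peers show up as
::ffff:a.b.c.d mapped addresses.  A sketch (error handling omitted; as noted
above, OpenBSD needs a non-default kernel setting for the IPv4 side to work):

#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int make_dual_stack_listener(unsigned short port)
{
    int sd = socket(AF_INET6, SOCK_STREAM, 0);
    int off = 0;
    struct sockaddr_in6 addr;

    /* 0 = also accept IPv4 connections, delivered as mapped addresses */
    setsockopt(sd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off));

    memset(&addr, 0, sizeof(addr));
    addr.sin6_family = AF_INET6;
    addr.sin6_addr   = in6addr_any;
    addr.sin6_port   = htons(port);
    bind(sd, (struct sockaddr *) &addr, sizeof(addr));
    listen(sd, 64);
    return sd;
}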


Brian


Re: [OMPI devel] ptmalloc and pin down cache problems again

2008-01-07 Thread Brian W. Barrett
Nope, I think that's a valid approach.  For some reason, I believe it 
was problematic for the OpenIB guys to do that at the time we were 
hacking up that code.  But if it works, it sounds like a much better 
approach.


When you make the change to the openib mpool, I'd also set 
MORECORE_CANNOT_TRIM back to 0.  mvapi / openib were the only libraries 
that needed the free in the deregistration callback -- GM appeared to not 
have that particular behavior.  And I don't believe that anyone else 
actually uses the deregistration callbacks.
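
A sketch of the deferral approach (all names here are hypothetical, not the
actual openib mpool code): the release callback only records regions, and
the real unregistration happens at the top of the next register() call,
safely outside malloc/free.

#include <stddef.h>

struct pending_region {
    void *base;
    size_t len;
    struct pending_region *next;
};

/* hypothetical helpers, assumed to exist elsewhere */
static struct pending_region *grab_preallocated_entry(void);
static void return_preallocated_entry(struct pending_region *p);
static void real_unregister(void *base, size_t len);

/* assumed to be protected by the mpool's existing lock */
static struct pending_region *deferred_list = NULL;

static void release_callback(void *base, size_t len)
{
    /* no free()/munmap()/unregister here -- just remember the region,
       using an entry obtained without calling into malloc */
    struct pending_region *p = grab_preallocated_entry();
    p->base = base;
    p->len  = len;
    p->next = deferred_list;
    deferred_list = p;
}

static void flush_deferred(void)   /* called at the start of the next register() */
{
    while (NULL != deferred_list) {
        struct pending_region *p = deferred_list;
        deferred_list = p->next;
        real_unregister(p->base, p->len);
        return_preallocated_entry(p);
    }
}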



Brian


On Mon, 7 Jan 2008, Gleb Natapov wrote:


Hi Brian,

I encountered a problem with ptmalloc and the registration cache. I see that
you (I think it was you) disabled shrinking of heap memory allocated
by sbrk by setting MORECORE_CANNOT_TRIM to 1. The comment explains that
this should be done because freeing of small objects is not reentrant, so
if the ompi memory subsystem callback calls free() the code will deadlock.
The trick indeed works in single-threaded programs, but in multithreaded
programs ptmalloc may allocate a heap not only by sbrk, but by mmap too. This
is called an "arena". Each thread may have arenas of its own. The problem is
that ptmalloc may free an arena by calling munmap(), and then the free() that
is called from our callback deadlocks. I tried to compile with USE_ARENAS set
to 0, but the code doesn't compile. I can fix the compilation problem of
course, but it seems that it is not such a good idea to disable this feature.
The ptmalloc scalability depends on it (and even if we disable it,
ptmalloc may still create an arena by mmap if sbrk fails). I see only one
way to solve this problem: do not call free() inside mpool callbacks.
If freeing of memory is needed (and it is needed, since IB unregister
calls free()), the work should be deferred. For the IB mpool we can check what
needs to be unregistered inside a callback, but actually call unregister()
from the next mpool->register() call. Do you see any problems with this
approach?

--
Gleb.




Re: [OMPI devel] Fwd: === CREATE FAILURE ===

2008-01-24 Thread Brian W. Barrett

Automake forces v7 mode so that Solaris tar can untar the tarball, IIRC.

Brian

On Thu, 24 Jan 2008, Aurélien Bouteiller wrote:


According to POSIX, tar should not limit the file name length. Only
the v7 implementation of tar is limited to 99 characters. GNU tar has
never been limited in the number of characters file names can have.
You should check with tar --help that tar on your machine defaults to
format=gnu or format=posix. If it defaults to format=v7 I am curious
why. Are you using solaris ?

Aurelien

Le 24 janv. 08 à 15:18, Jeff Squyres a écrit :


I'm trying to replicate and getting a lot of these:

tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/
pessimist/vprotocol_pessimist_sender_based.c: file name is too long
(max 99); not dumped
tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/
pessimist/vprotocol_pessimist_component.c: file name is too long (max
99); not dumped

I'll bet that this is the real problem.  GNU tar on linux defaults to
99 characters max, and the _component.c filename is 102, for example.

Can you shorten your names?


On Jan 24, 2008, at 3:02 PM, George Bosilca wrote:


We cannot reproduce this one. A simple "make checkdist" exits long
before doing anything in the ompi directory. It is difficult to see
where exactly it fails, but it is somewhere in the opal directory. I
suspect the new carto framework ...

Thanks,
  george.

On Jan 24, 2008, at 7:12 AM, Jeff Squyres wrote:


Aurelien --

Can you fix please?  Last night's tests didn't run because of this
failure.


Begin forwarded message:


From: MPI Team 
Date: January 23, 2008 9:13:30 PM EST
To: test...@open-mpi.org
Subject: === CREATE FAILURE ===
Reply-To: de...@open-mpi.org


ERROR: Command returned a non-zero exist status
   make -j 4 distcheck

Start time: Wed Jan 23 21:00:08 EST 2008
End time:   Wed Jan 23 21:13:30 EST 2008

===========================================================================
[... previous lines snipped ...]
config.status: creating orte/mca/snapc/Makefile
config.status: creating orte/mca/snapc/full/Makefile
config.status: creating ompi/mca/allocator/Makefile
config.status: creating ompi/mca/allocator/basic/Makefile
config.status: creating ompi/mca/allocator/bucket/Makefile
config.status: creating ompi/mca/bml/Makefile
config.status: creating ompi/mca/bml/r2/Makefile
config.status: creating ompi/mca/btl/Makefile
config.status: creating ompi/mca/btl/gm/Makefile
config.status: creating ompi/mca/btl/mx/Makefile
config.status: creating ompi/mca/btl/ofud/Makefile
config.status: creating ompi/mca/btl/openib/Makefile
config.status: creating ompi/mca/btl/portals/Makefile
config.status: creating ompi/mca/btl/sctp/Makefile
config.status: creating ompi/mca/btl/self/Makefile
config.status: creating ompi/mca/btl/sm/Makefile
config.status: creating ompi/mca/btl/tcp/Makefile
config.status: creating ompi/mca/btl/udapl/Makefile
config.status: creating ompi/mca/coll/Makefile
config.status: creating ompi/mca/coll/basic/Makefile
config.status: creating ompi/mca/coll/inter/Makefile
config.status: creating ompi/mca/coll/self/Makefile
config.status: creating ompi/mca/coll/sm/Makefile
config.status: creating ompi/mca/coll/tuned/Makefile
config.status: creating ompi/mca/common/Makefile
config.status: creating ompi/mca/common/mx/Makefile
config.status: creating ompi/mca/common/portals/Makefile
config.status: creating ompi/mca/common/sm/Makefile
config.status: creating ompi/mca/crcp/Makefile
config.status: creating ompi/mca/crcp/coord/Makefile
config.status: creating ompi/mca/io/Makefile
config.status: creating ompi/mca/io/romio/Makefile
config.status: creating ompi/mca/mpool/Makefile
config.status: creating ompi/mca/mpool/rdma/Makefile
config.status: creating ompi/mca/mpool/sm/Makefile
config.status: creating ompi/mca/mtl/Makefile
config.status: creating ompi/mca/mtl/mx/Makefile
config.status: creating ompi/mca/mtl/portals/Makefile
config.status: creating ompi/mca/mtl/psm/Makefile
config.status: creating ompi/mca/osc/Makefile
config.status: creating ompi/mca/osc/pt2pt/Makefile
config.status: creating ompi/mca/osc/rdma/Makefile
config.status: creating ompi/mca/pml/Makefile
config.status: creating ompi/mca/pml/cm/Makefile
config.status: creating ompi/mca/pml/crcpw/Makefile
config.status: creating ompi/mca/pml/dr/Makefile
config.status: creating ompi/mca/pml/ob1/Makefile
config.status: creating ompi/mca/pml/v/vprotocol/Makefile
config.status: error: cannot find input file: ompi/mca/pml/v/
vprotocol/pessimist/Makefile.in
make: *** [distcheck] Error 1
===========================================================================

Your friendly daemon,
Cyrador



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] xensocket - callbacks through OPAL/libevent

2008-02-05 Thread Brian W. Barrett

On Mon, 4 Feb 2008, Muhammad Atif wrote:

I am trying to port xensockets to Open MPI. In principle, I have the 
framework and everything, but there seems to be a small issue: I cannot 
get libevent (or OPAL) to give callbacks for receive (or send) for 
xensockets. I have tried to implement native code for xensockets with the 
libevent library -- again the same issue: no callbacks! With normal 
sockets, callbacks come easily.


So the question is: do the socket/file descriptors have to have some special 
mechanism attached to them to support callbacks from libevent/OPAL, i.e. 
some structure/magic? Maybe the developers of xensockets did not 
add that callback/interrupt support at creation time. Xensockets is 
open source, but my knowledge about these issues is limited, so I thought 
some pointer in the right direction might be useful.


Yes and no :).  As you discovered, the OPAL interface just repackages a 
library called libevent to handle its socket multiplexing.  Libevent can 
use a number of different mechanisms to look for activity on sockets, 
including select() and poll() calls.  On Linux, it will generally use 
poll().  poll() requires some kernel support to do its thing, so if 
Xensockets doesn't implement the right magic to trigger poll() events, 
then libevent won't work for Xensockets.  There's really nothing you can 
do from the Open MPI front to work around this issue -- it would have to 
be fixed as part of Xensockets.


Second question is, what if we cannot have the callbacks. What is the 
recommended way to implement the btl component for such a device? Do we 
need to do this with event timers?


Have a look at any of the BTLs that isn't TCP -- none of them use libevent 
callbacks for progress.  Instead, they provide a progress function as part 
of the BTL interface, which is called on a regular basis whenever progress 
needs to be made.
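
As a rough illustration of that shape -- this is not the actual BTL
interface (that lives in ompi/mca/btl/btl.h), just a sketch with made-up
names -- a component-level progress function polls its device, handles
completions, and reports how much work it did:

/* Sketch only: all names below are hypothetical. */
#include <stdio.h>

/* Pretend device poll: returns the number of completed sends/receives. */
static int fake_device_poll(void)
{
    return 0;   /* nothing completed in this toy example */
}

/* A BTL-style progress function: poll the device, hand completed fragments
   back to the upper layer, and return how much work was done. */
static int my_btl_component_progress(void)
{
    int completions = fake_device_poll();
    /* ... invoke completion callbacks for each finished fragment ... */
    return completions;
}

int main(void)
{
    /* In Open MPI this would be registered with the progress engine and
       called repeatedly from the library's main loop; here, call it once. */
    printf("progress reported %d completions\n", my_btl_component_progress());
    return 0;
}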


Brian


Re: [OMPI devel] 3rd party code contributions

2008-02-08 Thread Brian W. Barrett

On Fri, 8 Feb 2008, Ralph Castain wrote:


1. event library
2. ROMIO
3. VT
4. backtrace
5. PLPA - this one is a little less obvious, but still being released as a
separate package
6. libNBC


Sorry to Ralph, but I clipped everything from his e-mail, then am going to 
make references to it.  oh well :).


One minor correction -- the entire backtrace framework is not a third 
party deal.  The *DARWIN/Mac OS X* component relies heavily on third party 
code, but the others (Linux and Solaris) are just wrappers around code in 
their respective C libraries.


I believe I was responsible for the event library, ROMIO, and backtrace 
before leaving LANL.  I'll go through the motivations and issues with all 
three in terms of integration.


Event Library: The event library is the core "rendezvous" point for all of 
Open MPI, so any issues with it cause lots of issues with Open MPI in 
general.  We've also hacked it considerably since taking the original 
libevent source -- we've renamed all the functions, we've made it thread 
safe in a way the author was unwilling to do, we've fixed some performance 
issues unique to our usage model.  In short, this is no longer really the 
same libevent that might already be installed on the system.  Using such 
an unmodified libevent would be disastrous.


ROMIO is actually one that there was significant discussion about prior to 
me leaveing Los Alamos.  There are a number of problems / issues with 
ROMIO.  First and foremost, without ROMIO, we are not a fully compliant 
MPI implementation.  So we have to ship ROMIO -- it's the only way to have 
that important check mark.  But its current integration has some issues -- 
it's hard to test patches independently.  There is actually a mode in the 
current Open MPI tree where the MPI interface to MPI-I/O is not provided 
by Open MPI and no io components are built.  This is to allow users to 
build ROMIO independently of Open MPI, for testing updates or whatever. 
There are some disadvantages to this.  First, the independent ROMIO will 
use generalized requests instead of being hooked into our progress engine, 
so there may be some progress issues (I never verified either way). 
Second, it does mean dealing with another package to build on the user's 
site.  Jeff is correct -- there was discussion about how to make the 
integration "better" -- many of the changes were on our side, and we were 
going to have to ask for a couple of changes from Argonne.  If someone is 
going to put in the considerable amount of time to make this happen, I'm 
happy to write up whatever notes I can remember / find on the issue.


The Darwin backtrace component is mostly maintenance free.  It doesn't 
support 64-bit Intel chips, but that's fine.  Once every 18 months or so, 
I need to get a new copy for the latest operating system, although the 
truth is I don't think anything bad happens if we just stop doing the 
updates at OS release (by the way, I did the one for Leopard, so we're 
probably all going to be sick of MPI and on to other things before the 
next time it has to be done).  While it's useful, if the community is 
really worried, it could probably be deleted.  But having a stack trace 
when you segfault sure is nice :).


Brian




Re: [OMPI devel] 1.3 Release schedule and contents

2008-02-11 Thread Brian W. Barrett
Out of curiosity, why is the one-sided rdma component struck from 1.3?  As 
far as I'm aware, the code is in the trunk and ready for release.


Brian

On Mon, 11 Feb 2008, Brad Benton wrote:


All:

The latest scrub of the 1.3 release schedule and contents is ready for
review and comment.  Please use the following links:
 1.3 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
 1.3.1 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1

In order to try and keep the dates for 1.3 in, I've pushed a bunch of stuff
(particularly ORTE things) to 1.3.1.  Even though there will be new
functionality slated for 1.3.1, the goal is to not have any interface
changes between the phases.

Please look over the list and schedules and let me or my fellow 1.3
co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any
issues, errors, suggestions, omissions,
heartburn, etc.

Thanks,
--Brad

Brad Benton
IBM



Re: [OMPI devel] New address selection for btl-tcp (was Re: [OMPI svn] svn:open-mpi r17307)

2008-02-22 Thread Brian W. Barrett

On Fri, 22 Feb 2008, Adrian Knoth wrote:



I see three approaches:

  a) remove lo globally (in if.c). I expect objections. ;)


I object!  :).  But for a good reason -- it'll break things.  Someone 
tried this before, and the issue is when a node (like a laptop) only has 
lo -- then there are no reported interfaces, and either there needs to be 
lots of extra code in the oob / btl or things break.  So let's not go down 
this path again.



  b) print a warning from BTL/TCP if the interfaces in use contain lo.
 Like "Warning: You've included the loopback for communication.
   This may cause hanging processes due to unreachable peers."


I like this one.


  c) Throw away 127.0.0.1 on the remote side. But when doing so, what's
 the use for including it at all?


This seems hard.

Brian


Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Brian W. Barrett

Jeff / George -

Did you add a way to specify which event modules are used?  Because epoll 
pushes the socket list into the kernel, I can see how it would screw up 
BLCR.  I bet everything would work if we forced the use of poll / select.


Brian

On Tue, 18 Mar 2008, Jeff Squyres wrote:


Crud, ok.  Keep us posted.

On Mar 18, 2008, at 4:16 PM, Josh Hursey wrote:


I'm testing with checkpoint/restart and the new libevent seems to be
messing up the checkpoints generated by BLCR. I'll be taking a look
at it over the next couple of days, but just thought I'd let people
know. Unfortunately I don't have any more details at the moment.

-- Josh

On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:


WHAT: Bring new version of libevent to the trunk.

WHY: Newer version, slightly better performance (lower overheads /
lighter weight), properly integrate the use of epoll and other
scalable fd monitoring mechanisms.

WHERE: 98% of the changes are in opal/event; there's a few changes to
configury and one change to the orted.

TIMEOUT: COB, Friday, 21 March 2008

DESCRIPTION:

George/UTK has done the bulk of the work to integrate a new version
of
libevent on the following tmp branch:

https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge

** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS
BRANCH!
**

Cisco ran MTT on this branch on Friday and everything checked out
(i.e., no more failures than on the trunk).  We just made a few more
minor changes today and I'm running MTT again now, but I'm not
expecting any new failures (MTT will take several hours).  We would
like to bring the new libevent in over this upcoming weekend, but
would very much appreciate if others could test on their platforms
(Cisco tests mainly 64 bit RHEL4U4).  This new libevent *should* be a
fairly side-effect free change, but it is possible that since we're
now using epoll and other scalable fd monitoring tools, we'll run
into
some unanticipated issues on some platforms.

Here's a consolidated diff if you want to see the changes:

https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842

Thanks.

--
Jeff Squyres
Cisco Systems









[OMPI devel] Libtool for 1.3 / trunk builds

2008-03-19 Thread Brian W. Barrett

Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it 
probably makes sense to update the version of Libtool used to build the 
nightly tarball and releases for the trunk (and eventually v1.3) from the 
nightly snapshot we have been using to the stable LT 2.2 release.


I've done some testing (ie, I installed LT 2.2 for another project, and 
nothing in OMPI broke over the last couple of weeks), so I have some 
confidence this should be a smooth transition.  If the group decides this 
is a good idea, someone at IU would just have to install the new LT 
version and change some symlinks and it should all just work...


Brian


Re: [OMPI devel] Libtool for 1.3 / trunk builds

2008-03-19 Thread Brian W. Barrett
True - I have no objection to waiting for 2.2.1 or 1.3 to be branched, 
whichever comes first.  The main point is that under no circumstance 
should 1.3 be shipped with the same 2.1a pre-release as 1.2 uses -- it's 
time to migrate to something stable.


Brian

On Wed, 19 Mar 2008, Jeff Squyres wrote:


Should we wait for the next LT point release?  I see a fair amount of
activity on the bugs-libtool list; I think they're planning a new
release within the next few weeks.

(I think we will want to go to the LT point release when it comes out;
I don't really have strong feelings about going to 2.2 now or not)



On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:


Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it
probably makes sense to update the version of Libtool used to build
the
nightly tarball and releases for the trunk (and eventually v1.3)
from the
nightly snapshot we have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project,
and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.  If the group decides
this
is a good idea, someone at IU would just have to install the new LT
version and change some symlinks and it should all just work...

Brian






[OMPI devel] Proc modex change

2008-03-20 Thread Brian W. Barrett

Hi all -

Does anyone know why we go through the modex receive even for the local 
process in ompi_proc_get_info()?  It doesn't seem like it's necessary, and 
it causes some problems on platforms that don't implement the modex (since 
it zeros out useful information determined during the init step).  If no 
one has any objections, I'd like to commit the attached patch that fixes 
that problem.



Thanks,

Brian

Index: ompi/proc/proc.c
===
--- ompi/proc/proc.c	(revision 17898)
+++ ompi/proc/proc.c	(working copy)
@@ -192,6 +192,11 @@
 size_t datalen;
 orte_vpid_t nodeid;
 
+/* Don't reset the information determined about the current
+   process during the init step.  Saves time and problems if
+   modex is unimplemented */
+if (ompi_proc_local() == proc) continue;
+
 if (OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_JOBID,
  &ompi_proc_local_proc->proc_name,
  &proc->proc_name)) {


Re: [OMPI devel] IRIX autoconf failure.

2008-03-21 Thread Brian W. Barrett

On Fri, 21 Mar 2008, Regan Russell wrote:


I am having problems with the assembler section of the GNU autoconf stuff in 
Open MPI.
Is anyone willing to work with me to get this up and running?


As a warning, MIPS / IRIX is not currently on the list of Open MPI 
supported platforms, so there may be some issues that we can't overcome. 
But this is usually a pretty simple thing -- can you send the config.log 
file generated by configure?


Thanks,

Brian


Re: [OMPI devel] FreeBSD timer_base_open error?

2008-03-26 Thread Brian W. Barrett

George -

Good catch -- that's going to cause a problem :).  But I think we should 
add yet another check to also make sure that we're on Linux.  So the three 
tests would be:


  1) Am I on a platform that we have timer assembly support for?
 (That's the long list of architectures that we recently,
 and incorrectly, added).
  2) Am I on Linux (since we really only know how to parse
 /proc/cpuinfo on Linux)
  3) Is /proc/cpuinfo readable (Because we have a couple architectures
 that are reported by config.guess as Linux, but don't have
 /proc/cpuinfo).

Make sense?
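
For illustration only, here is a standalone version of test 3 (no Open MPI
code is assumed; tests 1 and 2 are configure-time checks on the host triple,
as in the patch quoted below):

/* Standalone illustration of test 3: the "linux" timer component can only
   parse the CPU frequency if /proc/cpuinfo exists and is readable. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (access("/proc/cpuinfo", R_OK) == 0) {
        printf("/proc/cpuinfo is readable -- linux timer usable here\n");
    } else {
        printf("/proc/cpuinfo not readable -- fall back to another timer\n");
    }
    return 0;
}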

Brian

On Wed, 26 Mar 2008, George Bosilca wrote:

I was working off-list with Brad on this. Brian is right, the logic in 
configure.m4 is wrong. It overwrites timer_linux_happy to yes if the host 
matches "i?86-*|x86_64*|ia64-*|powerpc-*|powerpc64-*|sparc*-*". On FreeBSD the 
host is i386-unknown-freebsd6.2.


Here is a quick and dirty patch. I just moved the selection logic around a 
little bit, without any major modifications.


george.

Index: configure.m4
===
--- configure.m4(revision 17970)
+++ configure.m4(working copy)
@@ -40,14 +40,12 @@
[timer_linux_happy="yes"],
[timer_linux_happy="no"])])

-AS_IF([test "$timer_linux_happy" = "yes"],
-  [AS_IF([test -r "/proc/cpuinfo"],
- [timer_linux_happy="yes"],
- [timer_linux_happy="no"])])
-
  case "${host}" in
  i?86-*|x86_64*|ia64-*|powerpc-*|powerpc64-*|sparc*-*)
-timer_linux_happy="yes"
+AS_IF([test "$timer_linux_happy" = "yes"],
+  [AS_IF([test -r "/proc/cpuinfo"],
+ [timer_linux_happy="yes"],
+ [timer_linux_happy="no"])])
   ;;
  *)
   timer_linux_happy="no"



On Mar 25, 2008, at 10:31 PM, Brian Barrett wrote:

On Mar 25, 2008, at 6:16 PM, Jeff Squyres wrote:

"linux" is the name of the component.  It looks like opal/mca/timer/
linux/timer_linux_component.c is doing some checks during component
open() and returning an error if it can't be used (e.g,. if it's not
on linux).

The timer components are a little different than normal MCA
frameworks; they *must* be compiled in libopen-pal statically, and
there will only be one of them built.

In this case, I'm guessing that linux was built simply because nothing
else was selected to be built, but then its component_open() function
failed because it didn't find /proc/cpuinfo.



This is actually incorrect.  The linux component looks for /proc/
cpuinfo and builds if it finds that file.  There's a base component
that's built if nothing else is found.  The configure logic for the
linux component is probably not the right thing to do -- it should
probably be modified to check both that the file is readable (there are
systems that call themselves "linux" but don't have a /proc/cpuinfo)
and that we're actually on Linux.

Brian

--
 Brian Barrett

 There is an art . . . to flying. The knack lies in learning how to
 throw yourself at the ground and miss.
 Douglas Adams, 'The Hitchhikers Guide to the Galaxy'







Re: [OMPI devel] Memchecker: breaks trunk again

2008-04-21 Thread Brian W. Barrett

On Mon, 21 Apr 2008, Ralph H Castain wrote:


So it appears to be a combination of memchecker=yes automatically requiring
valgrind, and the override on the configure line of a param set by a
platform file not working.


So I can't speak to the valgrind/memchecker issue, but can to the 
platform/configure issue.  The platform file was intended to provide a 
mechanism to allow repeatability in builds.  By design, options in the 
platform file have higher priority than options given on the configure 
command line.


Brian



Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown

2008-05-06 Thread Brian W. Barrett

On Tue, 6 May 2008, Jeff Squyres wrote:


On May 5, 2008, at 6:27 PM, Steve Wise wrote:


There is a larger question regarding why the remote node is still
polling the hca and not shutting down, but my immediate question is
if it is an acceptable fix to simply disregard this "error" if it
is an iWARP adapter.


If proc B is still polling the hca, it is likely because it simply has
not yet stopped doing it.  I.e., a big problem in MPI implementations
is that not all actions are exactly synchronous.  MPI disconnects are
*effectively* synchronous, but we probably didn't *guarantee*
synchronicity in this case because we didn't need it (perhaps until
now).


Not to mention...  The BTL has to be able to handle a shutdown from one 
proc while still running its progression engine, as that's a normal 
sequence of events when dynamic processes are involved.  Because of that, 
there wasn't too much care taken to ensure that everyone stopped polling, 
then everyone did del_procs.


Brian


Re: [OMPI devel] btl_openib_iwarp.c : making platform specific calls

2008-05-13 Thread Brian W. Barrett

On Tue, 13 May 2008, Don Kerr wrote:


I believe there are similar operations being used in other areas of Open
MPI; a place to start looking would be opal/util/if.c.


Yes, opal/util/if.h and opal/util/net.h provide a portable interface to 
almost everything that comes from getifaddrs().


Brian


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Brian W. Barrett
I think having a parameter to turn off the warning is a great idea.  So 
great in fact, that it already exists in the trunk and v1.2 :)!  Setting 
the default value for the btl_base_warn_component_unused flag from 0 to 1 
will have the desired effect.


I'm not sure I agree with setting the default to 0, however.  The warning 
has proven extremely useful for diagnosing that IB (or less often GM or 
MX) isn't properly configured on a compute node due to some random error. 
It's trivially easy for any packaging group to have the line


  btl_base_warn_component_unused = 0

added to $prefix/etc/openmpi-mca-params.conf during the install phase of 
the package build (indeed, our simple build scripts at LANL used to do 
this on a regular basis due to our need to tweak the OOB to keep IPoIB 
happier at scale).


I think keeping the Debian guys happy is a good thing.  Giving them an 
easy way to turn off silly warnings is a good thing.  Removing a known 
useful warning to help them doesn't seem like a good thing.



Brian


On Wed, 21 May 2008, Jeff Squyres wrote:


What: Change default in openib BTL to not complain if no OpenFabrics
devices are found

Why: Many linuxes are shipping libibverbs these days, but most users
still don't have OpenFabrics hardware

Where: btl_openib_component.c

When: For v1.3

Timeout: Teleconf, 27 May 2008

Short version
=

Many major linuxes are shipping libibverbs by default these days.
OMPI will therefore build the openib BTL by default, but then
complains at run time when there's no OpenFabrics hardware.

We should change the default in v1.3 to not complain if there is no
OpenFabrics devices found (perhaps have an MCA param to enable the
warning if desired).

Longer version
==

I just got a request from the Debian Open MPI package maintainers to
include the following in the default openmpi-mca-params.conf for the
OMPI v1.2 package:

# Disable the use of InfiniBand
#   btl = ^openib

Having this in the openmpi-mca-params.conf gives Debian an easy
documentation path for users to shut up these warnings when they build
on machines with libibverbs present but no OpenFabrics hardware.

I think that this is fine for the v1.2 series (and will file a CMR for
it).  But for v1.3, I think we should change the default.

The vast majority of users will not have OpenFabrics devices, and we
should therefore not complain if we can't find any at run-time.  We
can/should still complain if we find OpenFabrics devices but no active
ports (i.e., don't change this behavior).

But for optimizing the common case: I think we should (by default) not
print a warning if no OpenFabrics devices are found.  We can also
[easily] have an MCA parameter that *will* display a warning if no
OpenFabrics devices are found.




Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Brian W. Barrett
And there's a typo in my first paragraph.  The flag currently defaults to 
1 (print the warning).  It should be switched to 0 to turn off the 
warning.  Sorry for any confusion I might have caused -- I blame the lack 
of caffeine in the morning.


Brian

On Wed, 21 May 2008, Pavel Shamis (Pasha) wrote:


I agree with Brian. We could add a detailed description of how to
disable it to the warning message.

Pasha

Brian W. Barrett wrote:

I think having a parameter to turn off the warning is a great idea.  So
great in fact, that it already exists in the trunk and v1.2 :)!  Setting
the default value for the btl_base_warn_component_unused flag from 0 to 1
will have the desired effect.

I'm not sure I agree with setting the default to 0, however.  The warning
has proven extremely useful for diagnosing that IB (or less often GM or
MX) isn't properly configured on a compute node due to some random error.
It's trivially easy for any packaging group to have the line

   btl_base_warn_component_unused = 0

added to $prefix/etc/openmpi-mca-params.conf during the install phase of
the package build (indeed, our simple build scripts at LANL used to do
this on a regular basis due to our need to tweak the OOB to keep IPoIB
happier at scale).

I think keeping the Debian guys happy is a good thing.  Giving them an
easy way to turn off silly warnings is a good thing.  Removing a known
useful warning to help them doesn't seem like a good thing.


Brian


On Wed, 21 May 2008, Jeff Squyres wrote:



What: Change default in openib BTL to not complain if no OpenFabrics
devices are found

Why: Many linuxes are shipping libibverbs these days, but most users
still don't have OpenFabrics hardware

Where: btl_openib_component.c

When: For v1.3

Timeout: Teleconf, 27 May 2008

Short version
=

Many major linuxes are shipping libibverbs by default these days.
OMPI will therefore build the openib BTL by default, but then
complains at run time when there's no OpenFabrics hardware.

We should change the default in v1.3 to not complain if there is no
OpenFabrics devices found (perhaps have an MCA param to enable the
warning if desired).

Longer version
==

I just got a request from the Debian Open MPI package maintainers to
include the following in the default openmpi-mca-params.conf for the
OMPI v1.2 package:

# Disable the use of InfiniBand
#   btl = ^openib

Having this in the openmpi-mca-params.conf gives Debian an easy
documentation path for users to shut up these warnings when they build
on machines with libibverbs present but no OpenFabrics hardware.

I think that this is fine for the v1.2 series (and will file a CMR for
it).  But for v1.3, I think we should change the default.

The vast majority of users will not have OpenFabrics devices, and we
should therefore not complain if we can't find any at run-time.  We
can/should still complain if we find OpenFabrics devices but no active
ports (i.e., don't change this behavior).

But for optimizing the common case: I think we should (by default) not
print a warning if no OpenFabrics devices are found.  We can also
[easily] have an MCA parameter that *will* display a warning if no
OpenFabrics devices are found.












Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Brian W. Barrett

On Wed, 21 May 2008, Jeff Squyres wrote:


2. An out-of-the-box "mpirun a.out" will print warning messages in
perfectly valid/good configurations (no verbs-capable hardware, but
just happen to have libibverbs installed).  This is a Big Deal.


Which is easily solved with a better error message, as Pasha suggested.


3. Problems with HCA hardware and/or verbs stack are uncommon
(nowadays).  I'd be ok asking someone to enable a debug flag to get
more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with
libibverbs installed must also have verbs-capable hardware.


But here's the real problem -- with our current selection logic, a user 
with libibverbs but no IB cards gets an error message saying "hey, we need 
you to set this flag to make this error go away" (or would, per Pasha's 
suggestion).  A user with a busted IB stack on a node (which we still saw 
pretty often at LANL) starts using TCP and their application runs like a 
dog.


I guess it's a matter of how often you see errors in the IB stack that 
cause nic initialization to fail.  The machines I tend to use still 
exhibit this problem pretty often, but it's possible I just work on bad 
hardware more often than is usual in the wild.


It would be great if libibverbs could return two different error messages 
- one for "there's no IB card in this machine" and one for "there's an IB 
card here, but we can't initialize it".  I think that would make this 
argument go away.  Open MPI could probably mimic that behavior by parsing 
the PCI tables, but that sounds ... painful.


I guess the root of my concern is that unexpected behavior with no 
explanation is (in my mind) the most dangerous case and the one we should 
address by default.  And turning this error message off is going to cause 
unexpected behavior without explanation.


Just my $0.02.


Brian


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Brian W. Barrett
Then we disagree on a core point.  I believe that users should never have 
something silently unexpected happen (like falling back to TCP from a high 
speed interconnect because of a NIC reset / software issue).  You clearly 
don't feel this way.  I don't really work on the project, but do have lots 
of experience being yelled at by users when something unexpected happens.


I guarantee you we'll see a report of poor IB / application performance 
because of the silent fallback to TCP.  There's a reason that error 
message was put in.  I don't get a vote anymore, so do whatever you think 
is best.


Brian


On Wed, 21 May 2008, Jeff Squyres wrote:


One thing I should clarify -- the ibverbs error message from my
previous mail is a red herring.  libibverbs prints that message on
systems where the kernel portions of the OFED stack are not installed
(such as the quick-n-dirty test that I did before -- all I did was
install libibverbs without the corresponding kernel stuff).  I
installed the whole OFED stack on a machine with no verbs-capable
hardware and verified that the libibverbs message does *not* appear
when the kernel bits are properly installed and running.

So we're only talking about the Open MPI warning message here.  More
below.



On May 21, 2008, at 12:17 PM, Brian W. Barrett wrote:


2. An out-of-the-box "mpirun a.out" will print warning messages in
perfectly valid/good configurations (no verbs-capable hardware, but
just happen to have libibverbs installed).  This is a Big Deal.


Which is easily solved with a better error message, as Pasha
suggested.


I guess this is where we disagree: I don't believe that the issue is
solved by making a "better" message.  Specifically: this is the first
case where we're saying "if you run with a valid configuration, you're
going to get a warning message and you have to do something extra to
turn it off."

That just seems darn weird to me, especially when other MPI's don't do
the same thing.  Come to think of it, I can't think of many other
software packages that do that.


In short: I think it's no longer safe to assume that machines with
libibverbs installed must also have verbs-capable hardware.


But here's the real problem -- with our current selection logic, a
user
with libibverbs but no IB cards gets an error message saying "hey,
we need
you to set this flag to make this error go away" (or would, per
Pasha's
suggestion).  A user with a busted IB stack on a node (which we
still saw
pretty often at LANL) starts using TCP and their application runs
like a
dog.

I guess it's a matter of how often you see errors in the IB stack that
cause nic initialization to fail.  The machines I tend to use still
exhibit this problem pretty often, but it's possible I just work on
bad
hardware more often than is usual in the wild.


I guess this is the central issue: what *is* the common case?  Which
set of users should be forced to do something different?

I'm claiming that now that the Linux distros are shipping libibverbs,
the number of users who have the openib BTL installed but do not have
verbs-capable hardware will be *much* larger than those with verbs-
capable hardware.  Hence, I think the pain point should be for the
smaller group (those with verbs-capable hardware): set an MCA param if
you want to see the warning message.

(we can debate the default value for the BTL-wide base param later --
let's first just debate the *concept* as specific to the openib BTL)


It would be great if libibverbs could return two different error
messages
- one for "there's no IB card in this machine" and one for "there's
an IB
card here, but we can't initialize it".  I think that would make this
argument go away.  Open MPI could probably mimic that behavior by
parsing
the PCI tables, but that sounds ... painful.


Yes, this capability in libiverbs would be good.  Parsing the PCI
tables doesn't sound like our role.

I'll ask the libibverbs authors about it...


I guess the root of my concern is that unexpected behavior with no
explanation is (in my mind) the most dangerous case and the one we
should
address by default.  And turning this error message off is going to
cause
unexpected behavior without explanation.



But more information is available, and subject to normal
troubleshooting techniques.  And if you're in an environment where you
*do* want to use verbs-capable hardware, then setting the MCA param
seems perfectly acceptable to me.  IIRC, LANL sets a whole pile of MCA
params in the top-level openmpi-mca-params.conf file that are specific
to their environment (right?).  If that's true, what's one more param?

Heck, the OMPI installed by OFED can set an MCA param in openmpi-mca-
params.conf by default (which is what most verbs-capable-hardware-users
utilize).  That would solve the issue

Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Brian W. Barrett

On Wed, 21 May 2008, Jeff Squyres wrote:


On May 21, 2008, at 3:38 PM, Jeff Squyres wrote:


It would be great if libibverbs could return two different error
messages
- one for "there's no IB card in this machine" and one for "there's
an IB
card here, but we can't initialize it".  I think that would make this
argument go away.  Open MPI could probably mimic that behavior by
parsing
the PCI tables, but that sounds ... painful.



Thinking about this a bit more -- I think it depends on what kind of
errors you are worried about seeing.  IBV does separate the discovery
of devices (ibv_get_device_list) from trying to open a device
(ibv_open_device).  So hypothetically, we *can* distinguish between
these kinds of errors already.

Do you see devices that are so broken that they don't show up in the
list returned from ibv_get_device_list?

FWIW: the *only* case I'm talking about changing the default for is
when ibv_get_device_list returns an empty list (meaning that according
to the verbs stack, there are no devices in the host).  I think that
we should *always* warn for any kinds of errors that occur after that
(e.g., we find a device but can't open it, we find one or more devices
but no active ports, etc.).


Previously, there has not been such a distinction, so I really have no 
idea which case caused the openib BTL to throw its error (and never really cared, 
as it was always somebody else's problem at that point).


I'm only concerned about the case where there's an IB card, the user 
expects the IB card to be used, and the IB card isn't used.  If the 
changes don't silence a warning in that situation, I'm fine with whatever 
you do.  But does ibv_get_device_list return an HCA when the port is down 
(because the SM failed and the machine rebooted since that time)?  If not, 
we still have a (fairly common, unfortunately) error case that we need to 
report (in my opinion).
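
For reference, the two situations can be told apart with the plain verbs
API.  The following is only a standalone probe illustrating the distinction
(checking just port 1 for brevity), not the openib BTL's actual selection
code; the last branch is the port-down case asked about above.

/* Standalone probe using the stock libibverbs API: distinguishes
   "no devices at all" from "device present but unusable". */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);

    if (NULL == devs || 0 == num_devices) {
        /* the "no verbs-capable hardware" case */
        printf("no verbs-capable devices found\n");
    } else {
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (NULL == ctx) {
            printf("device %s found but could not be opened\n",
                   ibv_get_device_name(devs[0]));
        } else {
            struct ibv_port_attr port;
            if (0 == ibv_query_port(ctx, 1, &port) &&
                IBV_PORT_ACTIVE != port.state) {
                printf("device %s opened, but port 1 is not ACTIVE\n",
                       ibv_get_device_name(devs[0]));
            }
            ibv_close_device(ctx);
        }
    }
    if (devs) {
        ibv_free_device_list(devs);
    }
    return 0;
}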



Brian


Re: [OMPI devel] openib btl build question

2008-05-21 Thread Brian W. Barrett

On Wed, 21 May 2008, Jeff Squyres wrote:


On May 21, 2008, at 4:17 PM, Don Kerr wrote:


Just want to make sure what I think I see is true:

Linux build.  openib btl requires ptmalloc2 and ptmalloc2 requires
posix
threads, is that correct?


ptmalloc2 is not *required* by the openib btl.  But it is required on
Linux if you want to use the mpi_leave_pinned functionality.  I see
one function call to __pthread_initialize in the ptmalloc2 code -- it
*looks* like it's a function of glibc, but I don't know for sure.


There's actually more than that, it's just buried a bit.  There's a whole 
bunch of thread-specific data stuff, which is wrapped so that different 
thread packages can be used (although OMPI only supports pthreads).  The 
wrappers are in ptmalloc2/sysdeps/pthreads.


Brian


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Brian W. Barrett

On Wed, 21 May 2008, Jeff Squyres wrote:


I'm only concerned about the case where there's an IB card, the user
expects the IB card to be used, and the IB card isn't used.


Can you put in a site wide

btl = ^tcp

to avoid the problem?  If the IB card fails, then you'll get
unreachable MPI errors.


And how many users are going to figure that one out before complaining 
loudly?  That's what LANL did (probably still does) and it worked great 
there, but that doesn't mean that others will figure that out (after all, 
not everyone has an OMPI developer on staff...).



If the
changes don't silence a warning in that situation, I'm fine with
whatever
you do.  But does ibv_get_device_list return an HCA when the port is
down
(because the SM failed and the machine rebooted since that time)?


Yes.


If this is true (for some reason I thought it wasn't), then I think we'd 
actually be ok with your proposal, but you're right, you'd need something 
new in the IB btl.  I'm not concerned about the dual rail issue -- if 
you're smart enough to configure dual rail IB, you're smart enough to 
figure out OMPI mca params.  I'm not sure the same is true for a simple 
IB setup delivered from a white-box vendor that barely works on a good 
day (and unfortunately, there seems to be evidence that these exist).



Brian


Re: [OMPI devel] openib btl build question

2008-05-22 Thread Brian W. Barrett
Ah.  On Linux, --without-threads really doesn't gain you that much.  The 
default glibc is still thread safe, and there are only a couple small 
parts of the code that use locks (like the OOB TCP).  It's generally just 
easier to leave threads enabled on Linux.


Brian

On Thu, 22 May 2008, Don Kerr wrote:


Thanks Jeff. Thanks Brian.

I ran into this because I was specifically trying to configure with
"--disable-progress-threads --disable-mpi-threads" at which point I
figured, might as well turn off all threads so I added
"--without-threads" as well. But can't live without mpi_leave_pinned so
threads are back.


Jeff Squyres wrote:

On May 21, 2008, at 4:37 PM, Brian W. Barrett wrote:



ptmalloc2 is not *required* by the openib btl.  But it is required on
Linux if you want to use the mpi_leave_pinned functionality.  I see
one function call to __pthread_initialize in the ptmalloc2 code -- it
*looks* like it's a function of glibc, but I don't know for sure.


There's actually more than that, it's just buried a bit.  There's a
whole
bunch of thread-specific data stuff, which is wrapped so that
different
thread packages can be used (although OMPI only supports pthreads).
The
wrappers are in ptmalloc2/sysdeps/pthreads.




Doh!  I didn't "grep -r"; my bad...







Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-22 Thread Brian W. Barrett

On Thu, 22 May 2008, Terry Dontje wrote:


The major difference here is that libmyriexpress is not being included
in mainline Linux distributions.  Specifically: if you can find/use
libmyriexpress, it's likely because you have that hardware.  The same
*used* to be true for libibverbs, but is no longer true because Linux
distros are now shipping (e.g., the Debian distribution pulls in
libibverbs when you install Open MPI).

Ok, but there are distributions that do include the myrinet BTL/MTL (i.e., 
CT).  Though I agree that, for the most part, in the case of myrinet if you 
have libmyriexpress you will probably have an operable 
interface.  I guess I am curious how many other BTLs a distribution 
might end up delivering that could run into this reporting issue.  I 
guess my point is: could this be worth something more general instead of 
a one-off for IB?


From my point of view the btl_warn_unused_components coupled with "-mca 
btl ^mlfbtl" works for me.  However the fact that the IB 
vendors/community (ie CISCO) is solving this for their favorite 
interface makes me pause for a moment.


There's actually a second (in my mind more important) reason why this is 
IB only, as I shared similar concerns (hence yesterday's e-mail barrage). 
InfiniBand has a two stage initialization -- you get the list of HCAs, 
then you initialize the HCA you want.  So it's possible to determine that 
there's no HCAs in the system vs. the system couldn't initialize the HCA 
properly (as that would happen in step 2, according to Jeff).


With MX, it's one initialization call (mx_init), and it's not clear from 
the errors it can return that you can differentiate between the two cases. 
I haven't tried it, but it's possible that mx_init would succeed in the no 
nic case, but then have a NIC count of 0.


Anyway, the short answer is that (in my opinion) we should have a btl base 
param similar to warn_unused for whether to warn when no NICs/HCAs are 
found, hopefully with a nice error function similar to today's no_nics 
(which probably needs to be renamed in that case).  That way, if BTL 
authors other than OpenIB want to do some extra work and return better 
error messages, they can.



FWIW, our distribution actually turns off btl_base_warn_component_unused
because it seemed that in the majority of our cases users would get
false-positive sightings of the message.


Is the UDAPL library shipped in Solaris by default?  If so, then
you're likely in exactly the same kind of situation that I'm
describing.  The same will be true if Solaris ends up shipping
libibverbs by default.


Yes, the uDAPL library is shipped in Solaris by default, which is why we
turn off btl_warn_unused_components.  And I suspect that once Solaris starts
delivering libibverbs, we (Sun) will need to figure out how to handle having
both the udapl and openib BTLs available.


There is some evil configure hackery that could be done to make this work 
in a more general way (don't you love it when I say that). 
Autogen/configure makes no guarantees about the order in which the 
configure.m4 macros for components in the same framework are run, other 
than all components of priority X are run before those of priority Y, iff 
X > Y.  So you could set the priority of all the components except udapl 
to (say) 10 and udapl's to 0.  Then have the udapl configure only build if 
1) it was specifically requested or 2) ompi_check_openib_happy = no.  No 
more Linux-specific stuff, works when Solaris gets OFED, and works on old 
Solaris that has uDAPL but not OFED.


As a matter of fact, it's so trivial to do that I'd recommend doing it for 
1.3.  Really, you could do it minimally by only changing OpenIB's 
configure.params to set its priority to 10, uDAPL's configure.params to 
set its priority to 0, and uDAPL's configure.m4 to remove the Linux stuff 
and look for ompi_check_openib_happy.



Brian

