Re: [OMPI devel] 1.3 PML default choice
The selection logic for the PML is very confusing and doesn't follow the standard priority selection. The reasons for this are convoluted and not worth discussing here. The bottom line, however, is that the OB1 PML will be the default *UNLESS* the PSM (PathScale/QLogic) MTL can be chosen, in which case the CM PML is used by default. Brian

On Tue, 13 Jan 2009, Bogdan Costescu wrote:

On Tue, 13 Jan 2009, Tim Mattox wrote: The cm PML does not use BTLs..., only MTLs, so ... the BTL selection is ignored.

OK, thanks for clarifying this bit, but... The README for 1.3b2 specifies that CM is now chosen if possible; in my trials, when I specify CM+BTL, it doesn't complain and works well. However, either the default (no options) or OB1+BTL leads to the jumps mentioned above, which makes me believe that OB1+BTL is still chosen as the default, contrary to what the README specifies. ... this bit is still unclear to me. Should OB1+BTL or CM+MTL be the default?

I have just tried using "mpi_show_mca_params" for both v1.3b2 and v1.3rc3 and this tells me that:

pml= (default value)
pml_cm_priority=30 (default value)
pml_ob1_priority=20 (default value)

which, from what I know, should lead to CM being chosen as the default. Still, for v1.3b2 OB1 seemed to be chosen; for v1.3rc3 I can't distinguish anymore from timings as they behave very similarly.
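For reference, a minimal sketch of the rule Brian describes; this is an illustration only, not Open MPI's actual PML selection code, and the function name is made up.

#include <stdbool.h>

/* 1.3-era default: OB1 wins unless the PSM MTL can be initialized, in which
 * case CM wins regardless of the pml_cm_priority / pml_ob1_priority values
 * reported by mpi_show_mca_params. */
static const char *default_pml(bool psm_mtl_usable)
{
    return psm_mtl_usable ? "cm" : "ob1";
}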
Re: [OMPI devel] RFC: [slightly] Optimize Fortran MPI_SEND / MPI_RECV
On Sat, 7 Feb 2009, Jeff Squyres wrote: End result: I guess I'm a little surprised that the difference is that clear -- does a function call really take 10 ns? I'm also surprised that the layered C version has significantly more jitter than the non-layered version; I can't really explain that. I'd welcome anyone else replicating the experiment and/or eyeballing my code to make sure I didn't bork something up.

That is significantly higher than I would have expected for a single function call. When I did all the component tests a couple of years ago, a function call into a shared library was about 5 ns on an Intel Xeon (pre-Core 2 design) and about 2.5 ns on an AMD Opteron. Brian
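For anyone who wants to replicate the measurement, here is a minimal, self-contained sketch of the kind of microbenchmark that produces such numbers: it times calls through a volatile function pointer so the compiler cannot inline them. This is not the benchmark Jeff used; all names are illustrative.

#include <stdio.h>
#include <time.h>

static int noop(int x) { return x + 1; }

int main(void)
{
    volatile int sink = 0;
    int (*volatile fn)(int) = noop;        /* defeat inlining */
    const long iters = 100 * 1000 * 1000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; ++i) {
        sink = fn(sink);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per call\n", ns / iters);
    return 0;
}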
Re: [OMPI devel] RFC: Rename several OMPI_* names to OPAL_*
I have no objections to this change. Brian

On Tue, 10 Feb 2009, Greg Koenig wrote:

RFC: Rename several OMPI_* names to OPAL_*

WHAT: Rename several #define values that encode the prefix "OMPI_" to instead encode the prefix "OPAL_" throughout the entire Open MPI source code tree. Also, eliminate unnecessary #include lines from source code files under the ".../ompi/mca/btl" subtree.

WHY: (1) These are general source code improvements that update #define values to more accurately describe which layer the values belong to and remove unnecessary dependencies within the source code; (2) These changes will help with the effort to move the BTL code into an independent layer.

WHERE: 1.4 trunk

WHEN: Negotiable -- see below, but probably near the split for 1.4 (no earlier than February 19, 2009)

Timeout: February 19, 2009

The proposed change involves renaming several #define values that encode the prefix "OMPI_" to instead encode the prefix "OPAL_" throughout the entire Open MPI source code tree. These names are holdovers from when the three existing layers of Open MPI were developed together prior to being split apart. Additionally, the proposed change eliminates a few unnecessary #include lines in BTL source code files under the .../ompi/mca/btl subtree. Specific modifications are detailed following this message text. A script to carry out these modifications is also attached to this message (gzipped to pass unmolested through the ORNL e-mail server).

We believe these modifications improve the Open MPI source code by renaming values such that they correspond to the Open MPI layer to which they most closely belong, and that this improvement is itself of benefit to Open MPI. These modifications will also aid our ongoing efforts to extract the BTL code into a new layer ("ONET") that can be built with just a direct dependence on the OPAL layer.

Although these changes are simple string substitutions, they touch a fair amount of code in the Open MPI tree. Three people have tested these changes at our site on various platforms and have not discovered any problems. However, we recognize that some members of the community may have input/feedback regarding testing and we remain open to suggestions related to testing.

One challenge that has been brought up regarding this RFC is that applying patches and/or CMRs to the source code tree after the proposed changes are performed will be more difficult. To that end, the best opportunity to apply the modifications proposed in this RFC seems to be in conjunction with 1.4. (My understanding from the developer conference call this morning is that there are a few other changes waiting for this switch as well.) We are open to suggestions about the best time to apply this RFC to avoid major disruptions.

Specific changes follow:

* From .../configure.ac:
  * OMPI_NEED_C_BOOL
  * OMPI_HAVE_WEAK_SYMBOLS
  * OMPI_C_HAVE_WEAK_SYMBOLS
  * OMPI_USE_STDBOOL_H
  * OMPI_HAVE_SA_RESTART
  * OMPI_HAVE_VA_COPY
  * OMPI_HAVE_UNDERSCORE_VA_COPY
  * OMPI_PTRDIFF_TYPE (also, ompi_ptrdiff_t)
  * OMPI_ALIGN_WORD_SIZE_INTEGERS
  * OMPI_WANT_LIBLTDL (also, OMPI_ENABLE_DLOPEN_SUPPORT)
  * OMPI_STDC_HEADERS
  * OMPI_HAVE_SYS_TIME_H
  * OMPI_HAVE_LONG_LONG
  * OMPI_HAVE_SYS_SYNCH_H
  * OMPI_SIZEOF_BOOL
  * OMPI_SIZEOF_INT
* From .../config/ompi_check_attributes.m4:
  * OMPI_HAVE_ATTRIBUTE (also, ompi_cv___attribute__)
  * OMPI_HAVE_ATTRIBUTE_ALIGNED (also, ompi_cv___attribute__aligned)
  * OMPI_HAVE_ATTRIBUTE_ALWAYS_INLINE (also, ompi_cv___attribute__always_inline)
  * OMPI_HAVE_ATTRIBUTE_COLD (also, ompi_cv___attribute__cold)
  * OMPI_HAVE_ATTRIBUTE_CONST (also, ompi_cv___attribute__const)
  * OMPI_HAVE_ATTRIBUTE_DEPRECATED (also, ompi_cv___attribute__deprecated)
  * OMPI_HAVE_ATTRIBUTE_FORMAT (also, ompi_cv___attribute__format)
  * OMPI_HAVE_ATTRIBUTE_HOT (also, ompi_cv___attribute__hot)
  * OMPI_HAVE_ATTRIBUTE_MALLOC (also, ompi_cv___attribute__malloc)
  * OMPI_HAVE_ATTRIBUTE_MAY_ALIAS (also, ompi_cv___attribute__may_alias)
  * OMPI_HAVE_ATTRIBUTE_NO_INSTRUMENT_FUNCTION (also, ompi_cv___attribute__no_instrument_function)
  * OMPI_HAVE_ATTRIBUTE_NONNULL (also, ompi_cv___attribute__nonnull)
  * OMPI_HAVE_ATTRIBUTE_NORETURN (also, ompi_cv___attribute__noreturn)
  * OMPI_HAVE_ATTRIBUTE_PACKED (also, ompi_cv___attribute__packed)
  * OMPI_HAVE_ATTRIBUTE_PURE (also, ompi_cv___attribute__pure)
  * OMPI_HAVE_ATTRIBUTE_SENTINEL (also, ompi_cv___attribute__sentinel)
  * OMPI_HAVE_ATTRIBUTE_UNUSED (also, ompi_cv___attribute__unused)
  * OMPI_HAVE_ATTRIBUTE_VISIBILITY (also, ompi_cv___attribute__visibility)
  * OMPI_HAVE_ATTRIBUTE_WARN_UNUSED_RESULT (also, ompi_cv___attribute__warn_unused_result)
  * OMPI_HAVE_ATTRIBUTE_WEAK_ALIAS (also, ompi_cv___attribute__weak
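To make the nature of the change concrete, here is a hypothetical before/after for one of the listed symbols, plus the kind of backward-compatibility shim an out-of-tree platform file might add; none of this is taken from the actual RFC script.

/* The rename replaces configure-generated symbols like
 *   #define OMPI_HAVE_ATTRIBUTE_ALIGNED 1
 * with the OPAL-prefixed equivalent: */
#define OPAL_HAVE_ATTRIBUTE_ALIGNED 1

/* Hypothetical compatibility shim for out-of-tree code that still tests the
 * old name -- not part of the RFC itself: */
#ifndef OMPI_HAVE_ATTRIBUTE_ALIGNED
#define OMPI_HAVE_ATTRIBUTE_ALIGNED OPAL_HAVE_ATTRIBUTE_ALIGNED
#endif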
Re: [OMPI devel] RFC: eliminating "descriptor" argument from sendi function
At a high level, it seems reasonable to me. I am not familiar enough with the sendi code, however, to have a strong opinion either way. Brian

On Mon, 23 Feb 2009, Jeff Squyres wrote: Sounds reasonable to me. George / Brian?

On Feb 21, 2009, at 2:11 AM, Eugene Loh wrote:

What: Eliminate the "descriptor" argument from sendi functions.

Why: The only thing this argument is used for is so that the sendi function can allocate a descriptor in the event that the "send" cannot complete. But, in that case, the sendi reverts to the PML, where there is already code to allocate a descriptor. So, each sendi function (in each BTL that has a sendi function) must have code that is already in the PML anyhow. This is unnecessary extra coding and not clean design.

Where: In each BTL that has a sendi function (only three, and they are not all used) and in the function prototype and at the PML calling site.

When: I'd like to incorporate this in the shared-memory latency work I'm doing that we're targeting for 1.3.x.

Timeout: Feb 27.
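As a rough illustration of what the RFC changes at the interface level, here are simplified, hypothetical prototypes; they are not the actual BTL sendi signature, just a sketch of the "descriptor out-parameter goes away" idea.

#include <stddef.h>

struct btl_module;        /* stand-ins for the real BTL types */
struct btl_endpoint;
struct btl_descriptor;

/* Today: when the immediate send cannot be done, the BTL must allocate a
 * descriptor and hand it back so the PML can fall over to the normal path. */
int btl_sendi_with_descriptor(struct btl_module *btl,
                              struct btl_endpoint *endpoint,
                              const void *payload, size_t size,
                              struct btl_descriptor **descriptor);

/* After the RFC: the BTL simply reports failure; the PML reuses the
 * descriptor-allocation code it already has for ordinary sends. */
int btl_sendi(struct btl_module *btl,
              struct btl_endpoint *endpoint,
              const void *payload, size_t size);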
Re: [OMPI devel] RFC: eliminating "descriptor" argument from sendi function
On Mon, 23 Feb 2009, Jeff Squyres wrote: On Feb 23, 2009, at 10:37 AM, Eugene Loh wrote: I sense an opening here and rush in for the kill... :-) And, why does the PML pass a BTL argument into the sendi function? First, the BTL argument is not typically used. Second, if the BTL sendi function wants to know what BTL it is,... uh, doesn't it already know??? Doesn't a BTL know who it is? Why, then, should the PML have to tell it? I suspect that it's passing in the BTL *module* argument, which may have specific information about the connection that is to be used. Example: if I have a dual-port IB HCA, Open MPI will make 2 different openib BTL modules. In this case, the openib BTL will need to know exactly which module the PML is trying to sendi on. Exactly. In multi-nic situations, the BTL argument is critical. Since the SM btl never really does "multi-nic", it doesn't have to worry about the btl argument. Brian
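A small self-contained sketch of the component/module distinction behind Jeff's dual-port example; the types, array, and function below are made-up stand-ins, not the openib BTL's actual structures.

#include <stddef.h>

/* One BTL *component* (e.g. openib) can instantiate several *modules*,
 * typically one per HCA port, each with its own queues and credits. */
struct btl_module {
    int port;
};

/* What a dual-port HCA looks like after component init in this sketch:
 * one component, two modules. */
static struct btl_module openib_modules[2] = { { 1 }, { 2 } };

/* The sendi entry point receives the module the PML selected; without that
 * argument the BTL could not tell which port's resources this send uses. */
static int btl_sendi(struct btl_module *module, const void *payload, size_t len)
{
    (void)payload;
    (void)len;
    return module->port;      /* illustrative only */
}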
Re: [OMPI devel] compiler_args in wrapper-data.txt files with Portland Group Compilers
Hi Wayne -

Sorry for the delay. I'm the author of that code, and am currently trying to finish my dissertation, so I've been a bit behind. Anyway, at present, the compiler_args field only works on a single token. So you can't have something looking for -tp p7. I thought about how to do this, but never got a chance to add it to the code base. I'm not sure when/if that feature will be added. If you have some time, the code lives in opal/tools/wrappers/opal_wrapper.c, if you want to have a look. Good luck, Brian

On Mon, 23 Feb 2009, Wayne Gilmore wrote:

I sent this to the users mailing list but maybe this is a better place for it. Can anyone help with this?

I'm trying to use the compiler_args field in the wrapper script to deal with 32-bit compiles on our cluster. I'm using Portland Group compilers and use the following for 32-bit builds: -tp p7. I've created a separate stanza in the wrapper, but I am not able to use the whole option "-tp p7" for the compiler_args. It only works if I do compiler_args=p7. Is there a way to provide compiler_args with arguments that contain a space? This would eliminate cases where 'p7' would appear elsewhere in the compile line and be falsely recognized as a 32-bit build.

Here is some additional information from my build:

For a regular 64-bit build (no problems here, works fine):

katana:~ % mpicc --showme
pgcc -D_REENTRANT -I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include -Wl,-rpath -Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib -L/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lpthread -ldl

For a 32-bit build when compiler_args is set to "-tp p7" in the wrapper (note that in this case it does not pick up the lib32 and include32 dirs):

katana:share/openmpi % mpicc -tp p7 --showme
pgcc -D_REENTRANT -I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include -tp p7 -Wl,-rpath -Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib -L/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lpthread -ldl

For a 32-bit build when compiler_args is set to "p7" in the wrapper (note that in this case it does pick up the lib32 and include32 dirs):

katana:share/openmpi % mpicc -tp p7 --showme
pgcc -D_REENTRANT -I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32 -I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32 -tp p7 -Wl,-rpath -Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32 -L/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lpthread -ldl

Here's the mpicc-wrapper-data.txt file that I am using (with compiler_args set to "p7" only.
This works, but if I set it to "-tp p7" it fails to pick up the info in the stanza)

compiler_args=
project=Open MPI
project_short=OMPI
version=1.3
language=C
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=pgcc
extra_includes=
preprocessor_flags=-D_REENTRANT
compiler_flags=
linker_flags=-Wl,-rpath -Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib
libs=-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lpthread -ldl
required_file=
includedir=${includedir}
libdir=${libdir}

compiler_args=p7
project=Open MPI
project_short=OMPI
version=1.3
language=C
compiler_env=CC
compiler_flags_env=CFLAGS
compiler=pgcc
extra_includes=
preprocessor_flags=-D_REENTRANT -I/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32
compiler_flags=
linker_flags=-Wl,-rpath -Wl,/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32
libs=-lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lpthread -ldl
required_file=
includedir=/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/include32
libdir=/project/scv/waygil/local/IT/ofedmpi-1.2.5.5/mpi/pgi/openmpi-1.3/lib32
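To illustrate why the space matters: the wrapper compares each stanza's compiler_args value against one command-line token at a time, so a value containing a space can never equal any single token. The sketch below is a self-contained illustration of that single-token behavior, not the actual code in opal/tools/wrappers/opal_wrapper.c.

#include <string.h>

static int stanza_matches(const char *compiler_args, int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], compiler_args) == 0) {
            return 1;             /* "p7" matches the token "p7" */
        }
    }
    return 0;  /* "-tp p7" never equals any single token, so no match */
}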
Re: [OMPI devel] 1.3.1rc3 was borked; 1.3.1rc4 is out
On Tue, 3 Mar 2009, Jeff Squyres wrote: 1.3.1rc3 had a race condition in the ORTE shutdown sequence. The only difference between rc3 and rc4 was a fix for that race condition. Please test ASAP: http://www.open-mpi.org/software/ompi/v1.3/ I'm sorry, I've failed to test rc1 & rc2 on Catamount. I'm getting a compile failure in the ORTE code. I'll do a bit more testing and send Ralph an e-mail this afternoon. Brian
Re: [OMPI devel] calling sendi earlier in the PML
On Tue, 3 Mar 2009, Eugene Loh wrote: First, this behavior is basically what I was proposing and what George didn't feel comfortable with. It is arguably no compromise at all. (Uggh, why must I be so honest?) For eager messages, it favors BTLs with sendi functions, which could lead to those BTLs becoming overloaded. I think favoring BTLs with sendi for short messages is good. George thinks that load balancing BTLs is good. I have two thoughts on the issue: 1) How often are a btl with a sendi and a btl without a sendi going to be used together? Keep in mind, this is two BTLs with the same priority and connectivity to the same peer. My thought is that given the very few heterogeneous networked machines (yes, I know UTK has one, but we're talking percentages), optimizing for that case at the cost of the much more common case is a poor choice. 2) It seems like a much better idea would be to add sendi calls to all btls that are likely to be used at the same priority. This seems like good long-term form anyway, so why not optimize the PML for the long term rather than the short term and assume all BTLs will have a sendi function? Brian
Re: [OMPI devel] calling sendi earlier in the PML
On Tue, 3 Mar 2009, Jeff Squyres wrote: On Mar 3, 2009, at 3:31 PM, Eugene Loh wrote: First, this behavior is basically what I was proposing and what George didn't feel comfortable with. It is arguably no compromise at all. (Uggh, why must I be so honest?) For eager messages, it favors BTLs with sendi functions, which could lead to those BTLs becoming overloaded. I think favoring BTLs with sendi for short messages is good. George thinks that load balancing BTLs is good. Second, the implementation can be simpler than you suggest: *) You don't need a separate list since testing for a sendi-enabled BTL is relatively cheap (I think... could verify). *) You don't need to shuffle the list. The mechanism used by ob1 just resumes the BTL search from the last BTL used. E.g., check https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/pml/ob1/pml_ob1_sendreq.h#mca_pml_ob1_send_request_start . You use mca_bml_base_btl_array_get_next(&btl_eager) to round-robin over BTLs in a totally fair manner (remembering where the last loop left off), and use mca_bml_base_btl_array_get_size(&btl_eager) to make sure you don't loop endlessly.

Cool / fair enough. How about an MCA parameter to switch between this mechanism (early sendi) and the original behavior (late sendi)? This is the usual way that we resolve "I want to do X / I want to do Y" disputes. :-)

Of all the options presented, this is the one I dislike most :). This is *THE* critical path of the OB1 PML. It's already horribly complex and hard to follow (as Eugene is finding out the hard way). Making it more complex as a way to settle this argument is pain and suffering just to avoid conflict.

However, one possible option just occurred to me, so I'll propose yet another approach. If (AND ONLY IF) ob1/r2 detects that there are at least two BTLs to the same peer at the same priority and at least one has a sendi and at least one does not have a sendi, what about an MCA parameter to disable all sendi functions to that peer? There's only a 1% gain in the FAIR protocol Eugene proposed, so we'd lose that 1% in the heterogeneous multi-nic case (the least common case). There would be a much bigger gain for the sendi homogeneous multi-nic / all single-nic cases (much more common), because the FAST protocol would be used. That way, we get the FAST protocol in all cases for sm, which is what I really want ;). Brian
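For readers not familiar with ob1's internals, here is a self-contained sketch of the early-sendi round robin being discussed. Only the two accessor names mentioned above (mca_bml_base_btl_array_get_next / _get_size) correspond to real functions; the types and everything else below are simplified stand-ins, not the actual mca_pml_ob1_send_request_start() path.

#include <stddef.h>

typedef int (*sendi_fn)(const void *payload, size_t len);

struct bml_btl { sendi_fn sendi; /* NULL if this BTL has no sendi */ };
struct btl_array {
    struct bml_btl *btls;
    size_t size;
    size_t next;                  /* where the previous round-robin stopped */
};

static size_t array_get_size(struct btl_array *a) { return a->size; }

static struct bml_btl *array_get_next(struct btl_array *a)
{
    struct bml_btl *b = &a->btls[a->next];
    a->next = (a->next + 1) % a->size;    /* fair round-robin */
    return b;
}

/* Try an immediate send on each eager BTL at most once, resuming from where
 * the previous send left off; fall back to the normal path if none works. */
int try_early_sendi(struct btl_array *eager, const void *payload, size_t len)
{
    size_t n = array_get_size(eager);
    for (size_t i = 0; i < n; ++i) {
        struct bml_btl *b = array_get_next(eager);
        if (b->sendi != NULL && b->sendi(payload, len) == 0) {
            return 0;                     /* message is on the wire */
        }
    }
    return -1;                            /* caller uses the descriptor path */
}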
Re: [OMPI devel] 1.3.1rc3 was borked; 1.3.1rc4 is out
On Tue, 3 Mar 2009, Brian W. Barrett wrote: On Tue, 3 Mar 2009, Jeff Squyres wrote: 1.3.1rc3 had a race condition in the ORTE shutdown sequence. The only difference between rc3 and rc4 was a fix for that race condition. Please test ASAP: http://www.open-mpi.org/software/ompi/v1.3/ I'm sorry, I've failed to test rc1 & rc2 on Catamount. I'm getting a compile failure in the ORTE code. I'll do a bit more testing and send Ralph an e-mail this afternoon.

Attached is a patch against the v1.3 branch that makes it work on Red Storm. I'm not sure it's right, so I'm just e-mailing it rather than committing. Sorry Ralph, but can you take a look? :( Brian

Index: orte/mca/odls/base/base.h
===================================================================
--- orte/mca/odls/base/base.h (revision 20705)
+++ orte/mca/odls/base/base.h (working copy)
@@ -29,9 +29,10 @@
 #include "opal/mca/mca.h"
 #include "opal/class/opal_list.h"
+#if !ORTE_DISABLE_FULL_SUPPORT
 #include "orte/mca/odls/odls.h"
+#endif
-
 BEGIN_C_DECLS
 /**
Index: orte/mca/grpcomm/grpcomm.h
===================================================================
--- orte/mca/grpcomm/grpcomm.h (revision 20705)
+++ orte/mca/grpcomm/grpcomm.h (working copy)
@@ -44,7 +44,6 @@
 #include "orte/mca/rmaps/rmaps_types.h"
 #include "orte/mca/rml/rml_types.h"
-#include "orte/mca/odls/odls_types.h"
 #include "orte/mca/grpcomm/grpcomm_types.h"
Index: orte/runtime/orte_globals.c
===================================================================
--- orte/runtime/orte_globals.c (revision 20705)
+++ orte/runtime/orte_globals.c (working copy)
@@ -40,11 +40,11 @@
 #include "orte/runtime/runtime_internals.h"
 #include "orte/runtime/orte_globals.h"
+#if !ORTE_DISABLE_FULL_SUPPORT
+
 /* need the data type support functions here */
 #include "orte/runtime/data_type_support/orte_dt_support.h"
-#if !ORTE_DISABLE_FULL_SUPPORT
-
 /* globals used by RTE */
 bool orte_timing;
 bool orte_debug_daemons_file_flag = false;
@@ -135,7 +135,8 @@
 opal_output_set_verbosity(orte_debug_output, 1);
 }
 }
-
+
+#if !ORTE_DISABLE_FULL_SUPPORT
 /** register the base system types with the DSS */
 tmp = ORTE_STD_CNTR;
 if (ORTE_SUCCESS != (rc = opal_dss.register_type(orte_dt_pack_std_cntr,
@@ -192,7 +193,6 @@
 return rc;
 }
-#if !ORTE_DISABLE_FULL_SUPPORT
 /* get a clean output channel too */
 {
 opal_output_stream_t lds;
Index: orte/runtime/data_type_support/orte_dt_support.h
===================================================================
--- orte/runtime/data_type_support/orte_dt_support.h (revision 20705)
+++ orte/runtime/data_type_support/orte_dt_support.h (working copy)
@@ -30,7 +30,9 @@
 #include "opal/dss/dss_types.h"
 #include "orte/mca/grpcomm/grpcomm_types.h"
+#if !ORTE_DISABLE_FULL_SUPPORT
 #include "orte/mca/odls/odls_types.h"
+#endif
 #include "orte/mca/plm/plm_types.h"
 #include "orte/mca/rmaps/rmaps_types.h"
 #include "orte/mca/rml/rml_types.h"
Re: [OMPI devel] calling sendi earlier in the PML
On Wed, 4 Mar 2009, George Bosilca wrote:

I'm churning a lot and not making much progress, but I'll try chewing on that idea (unless someone points out it's utterly ridiculous). I'll look into having the PML ignore sendi functions altogether and just make the "send-immediate" path work fast with normal send functions. If that works, then we can get rid of sendi functions and hopefully have a solution that makes sense for everyone.

This is utterly ridiculous (I hope you really expect someone to say it). As I said before, SM is only one of the networks supported by Open MPI. Independent of how much I would like to have better shared memory performance, I will not agree with any PML modifications that are SM oriented. We did that in the past with other BTLs and it turned out to be a bad idea, so I'm clearly not in favor of making the same mistake twice. Regarding the sendi, there are at least 3 networks that can take advantage of it: Portals, MX, and SiCortex. Some of them do this right now, some others in the near future. Moreover, for these particular networks there is no way to avoid extra overhead without this feature (for very obscure reasons, such as non-contiguous pieces of memory only known by the BTL, that can decrease the number of network operations).

How about removing the MCA parameter from my earlier proposal and just having r2 filter out the sendi calls if there is a heterogeneous mix of BTLs (i.e., some with sendi and some without) to the same peer? That way, the early sendi will be bypassed in that case, but for the cases of BTLs that support sendi in common usage scenarios (homogeneous NICs), we'll get the optimization. Does that offend you, George? :) Brian
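A self-contained sketch of the filtering Brian proposes here: after the eager BTL list for a peer is built (e.g., at r2 add_procs time), detect a mix of sendi and non-sendi modules and clear the sendi pointers for that peer only. Types and field names are made-up stand-ins, not the real BML structures.

#include <stdbool.h>
#include <stddef.h>

struct bml_btl { int (*sendi)(const void *payload, size_t len); };

static void maybe_disable_sendi(struct bml_btl *eager, size_t n)
{
    bool have_sendi = false, missing_sendi = false;
    for (size_t i = 0; i < n; ++i) {
        if (eager[i].sendi != NULL) have_sendi = true;
        else                        missing_sendi = true;
    }
    if (have_sendi && missing_sendi) {    /* heterogeneous mix for this peer */
        for (size_t i = 0; i < n; ++i) {
            eager[i].sendi = NULL;        /* fall back to the late-sendi path */
        }
    }
}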
Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer
I, not surprisingly, have serious concerns about this RFC. It assumes that the ompi_proc issues and bootstrapping issues (the entire point of the move, as I understand it) can both be solved, but offers no proof to support that claim. Without those two issues solved, we would be left with an onet layer that is dependent on ORTE and OMPI, and which OMPI depends upon. This is not a good place to be. These issues should be resolved before an onet layer is created in the trunk.

This is not an unusual requirement. The fault tolerance work took a very long time because of similar requirements. Not only was a full implementation required to prove performance would not be negatively impacted (when FT wasn't active), but we had discussions about its impact on code maintainability. We had a full implementation of all the pieces that impacted the code *before* any of it was allowed into the trunk. We should live by the rules the community has set up. They have served us well in the past.

Further, these are not new objections on my part. Since the initial RFCs related to this move started, I have continually brought up the exact same questions and never gotten a satisfactory answer. This RFC even acknowledges the issues, but without presenting any solution, and still asks to do the most disruptive work. I simply can't see how that fits with Open MPI's long-standing development procedures. If all the issues I've asked about previously (which are essentially the ones you've identified in the RFC) can be solved, the impact to code base maintainability is reasonable, and the impact to performance is negligible, I'll gladly remove my objection to this RFC.

Further, before any work on this branch is brought into the trunk, the admin-level discussion regarding this issue should be resolved. At this time, that discussion is blocking on ORNL and they've given April as the earliest such a discussion can occur. So at the very least, the RFC timeout should be pushed into April or ORNL should revise their availability for the admin discussion. Brian

On Mon, 9 Mar 2009, Rainer Keller wrote:

What: Move BTLs into separate layer

Why: Several projects have expressed interest to use the BTLs. Use-cases such as the RTE using the BTLs for modex or tools collecting/distributing data in the fastest possible way may be possible.

Where: This would affect several components that the BTLs depend on (namely allocator, mpool, rcache and the common part of the BTLs). Additionally some changes to classes were/are necessary.

When: Preferably 1.5 (in case we use the Feature/Stable Release cycle ;-)

Timeout: 23.03.2009

There has been much speculation about this project. This RFC should shed some light; if there is some more information required, please feel free to ask/comment. Of course, suggestions are welcome! The BTLs offer access to a fast communication framework. Several projects have expressed interest to use them separately from other layers of Open MPI. Additionally (with further changes) BTLs may be used within ORTE itself.

COURSE OF WORK: The extraction is not easy (as was the extraction of ORTE and OMPI in the early stages of Open MPI?). In order to get as much input and be as visible as possible (e.g. in TRACS), the tmp-branch for this work has been set up on: https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl We propose to have a separate ONET library living in onet, based on orte (see attached fig).
In order to keep the diff between the trunk and the branch to a minimum, several cleanup patches have already been applied to the trunk (e.g. unnecessary #include of ompi and orte header files, integration of ompi_bitmap_t into opal_bitmap_t, #include "*_config.h"). Additionally, a script (attached below) has been kept up-to-date (contrib/move-btl-into-onet), which will perform this separation on a fresh checkout of trunk:

svn list https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl/contrib/move-btl-into-onet

This script requires several patches (see attached TAR-ball). Please update the variable PATCH_DIR to match the location of the patches.

./move-btl-into-onet ompi-clean/ # Lots of output deleted.
cd ompi-clean/
rm -fr ompi/mca/common/ # No two mcas called common, too bad...
./autogen.sh

OTHER RTEs: A preliminary header file is provided in onet/include/rte.h to accommodate the requirements of other RTEs (such as stci), that replaces selected functionality, as proposed by Jeff and Ralph in the Louisville meeting. Additionally, this header file is included before orte header files (within onet)... By default, this does not change anything in the standard case (ORTE); otherwise, with -DHAVE_STCI, redefinitions of the orte functionality required within onet are done.

TESTS: First tests have been done locally on Linux/x86_64. The branch compiles without warnings. The wrappers have been updated
Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer
I guess then I missed the point of this RFC, if not to move code. It talks about bringing this code into the trunk for the 1.5 time frame. If it's just getting general comments and there will be an RFC for all the changes (including the onet split proposed below) when the issues have been solved, that's great. I'll comment on the proposal as a whole once my 4-month-old questions are answered. Until then, I don't think we should be using the RFC process to get permission to move portions of a project with critical questions unanswered (which is exactly what this RFC reads as doing). Brian

On Mon, 9 Mar 2009, Rainer Keller wrote:

Hi Jeff, thanks for the mail! I completely agree with your points. To stress the fact: the timeout date does not mean that we intend to just commit to trunk by that date. It was rather to get comments by that particular date from all the parties interested. (This is what I remembered from previous RFCs, but I could be wrong...) All the work that has been committed should clean up the code. Anything that was beyond a cleanup deserved an RFC and input from many people (such as the bitmap_t change...). We still intend, as in the Louisville meeting, to have as much input from the community as possible (that's why this is a TRACS-visible svn tmp-branch). Thanks, Rainer

On Monday 09 March 2009 04:52:28 pm Jeff Squyres wrote:

Random points in no particular order (Rainer please correct me if I'm making bad assumptions):

- I believe that ORNL is proposing to do this work on a separate branch (this is what we have discussed for some time now, and we discussed this deeply in Louisville). The RFC text doesn't specifically say, but I would be very surprised if this stuff is planned to come back to the trunk in the near future -- as we have all agreed, it's not done yet.

- I believe that the timeout field in RFCs is a limit for non-responsiveness -- it is mainly intended to prevent people from ignoring / not responding to RFCs. I do not believe that Rainer was using that date as a "that's when I'm bringing it all back to the trunk." Indeed, he specifically called out the 1.5 series as a target for this work.

- I also believe that Rainer is using this RFC as a means to get preliminary review of the work that has been done on the branch so far. He has provided a script that shows what they plan to do, how the code will be laid out, etc. There are still some important core issues to be solved -- and, like Brian, I want to see how they'll get solved before being happy (we have strong precedent for this requirement) -- but I think all that Rainer was saying in his RFC was "here's where we are so far; can people review and see if they hate it?"

- It was made abundantly clear in the Louisville meeting that ORTE has no short-term plans for using the ONET layer (probably no long-term plans, either, but hey -- never say "never" :-) ). The design of ONET is such that other RTEs *could* use ONET if they want (e.g., STCI will), but it is not a requirement for the underlying RTE to use ONET. We agreed in Louisville that ORTE will provide sufficient stubs and hooks (all probably effectively no-ops) so that ONET can compile against it in the default OMPI configuration; other RTEs that want to do more meaningful stuff will need to provide more meaningful implementations of the stubs and hooks.

- Hopefully the teleconference time tomorrow works out for Rich (his communications were unclear on this point). Otherwise, postponing the admin discussion until April seems problematic.

On Mar 9, 2009, at 4:01 PM, Brian W.
Barrett wrote: I, not suprisingly, have serious concerns about this RFC. It assumes that the ompi_proc issues and bootstrapping issues (the entire point of the move, as I understand it) can both be solved, but offer no proof to support that claim. Without those two issues solved, we would be left with an onet layer that is dependent on ORTE and OMPI, and which OMPI depends upon. This is not a good place to be. These issues should be resolved before an onet layer is created in the trunk. This is not an unusual requirement. The fault tolerance work took a very long time because of similar requirements. Not only was a full implementation required to prove performance would not be negatively impacted (when FT wasn't active), but we had discussions about its impact on code maintainability. We had a full implementation of all the pieces that impacted the code *before* any of it was allowed into the trunk. We should live by the rules the community has setup. They have served us well in the past. Further, these are not new objections on my part. Since the initial RFCs related to this move started, I have continually brought up the exact same questions and never gotten a satisfactory answer. This RFC even acknowledges the issues, but without presenting any solution and s
Re: [OMPI devel] RFC: move BTLs out of ompi into separate layer
On Wed, 11 Mar 2009, Richard Graham wrote:

Brian, Going back over the e-mail trail it seems like you have raised two concerns:
- BTL performance after the change, which I would take to be
  - btl latency
  - btl bandwidth
- Code maintainability
  - repeated code changes that impact a large number of files
- A demonstration that the changes actually achieve their goal.

As we discussed after you got off the call, there are two separate goals here:
- being able to use the BTLs outside the context of MPI, but within the ompi code base
- ability to use the BTLs in the context of a run-time other than orte

Another concern I have heard raised by others is:
- MPI startup time

Has anything else been missed here? I would like to make sure that we address all the issues raised in the next version of the RFC.

I think the umbrella concerns for the final success of the change are btl performance (in particular, latency and message rates for cache-unfriendly applications/benchmarks) and code maintainability. In addition, there are some intermediate change issues I have, in that this project is working differently than other large changes. In particular, there is/was the appearance of being asked to accept changes which only make sense if the btl move is going to move forward, without any way to judge the performance or code impact because critical technical issues still remain.

The latency/message rate issues are fairly straightforward from an end-measure point of view. My concerns on latency/message rate come not from the movement of the BTL to another library (for most operating systems / shared library systems that should be negligible), but from the code changes which surround moving the BTLs. The BTLs are tightly intertwined with a number of pieces of the OMPI layer, in particular the BML and MPool frameworks and the ompi proc structure. I had a productive conversation with Rainer this morning explaining why I'm so concerned about the bml and ompi proc structures. The ompi proc structure currently acts not only as the identifier for a remote endpoint, but stores endpoint specific data for both the PML and BML. The BML structure actually contains each BTL's per-process endpoint information, in the form of the base_endpoint_t* structures returned from add_procs(). Moving these structures around must be done with care, as some of the proposals Jeff, Rainer, and I came up with this morning either induced spaghetti code or greatly increased the spread of information needed for the critical send path through the memory space (thereby likely increasing cache misses on send for real applications).

The code maintainability issue comes from three separate and independent issues. First, there is the issue of how the pieces of the OMPI layer will interact after the move. The BML/BTL/MPool/Rcache dance is already complicated, and care should be taken to minimize that change. Start-up is also already quite complex, and moving the BTLs to make them independent of starting other pieces of Open MPI can be done well or can be done poorly. We need to ensure it's done well, obviously. Second, there is the issue of wire-up. My impression from conversations with everyone at ORNL was that this move of BTLs would include changes to allow BTLs to wire up without the RML. I understand that Rich said this was not the case during the part of the admin meeting I missed yesterday, so that may no longer be a concern.
Finally, there has been some discussion, mainly secondhand in my case, about the mechanisms by which the trunk would be modified to allow for using OMPI without ORTE. I have concerns that we'd add complexity to the BTLs to achieve that, and again that can be done poorly if we're not careful. Talking with Jeff and Rainer this morning helped reduce my concern in this area, but I think it also added to the technical issues which must be solved to consider this project ready for movement to the trunk.

There are a couple of technical issues which I believe prevent a reasonable discussion of the performance and maintainability issues based on the current branch. I talked about some of them in the previous two paragraphs, but so that we have a short bullet list, they are:

- How will the ompi_proc_t be handled? In particular, where will PML/BML data be stored, and how will we avoid adding new cache misses?
- How will the BML and MPool be handled? The BML holds the BTL endpoint data, so changes have to be made if it continues to live in OMPI.
- How will the modex and the intricate dance with adding new procs from dynamic processes be handled?
- How will we handle the progress mechanisms in cases where the MTLs are used and the BTLs aren't needed by the RTE?
- If there are users outside of OMPI, but who want to also use OMPI, how will the library versioning / conflict problem be solved?

As was mentioned before, our t
Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert topping or is it a floor wax?
On Wed, 11 Mar 2009, Andrew Lumsdaine wrote: Hi all -- There is a meta question that I think is underlying some of the discussion about what to do with BTLs etc. Namely, is Open MPI an MPI implementation with a portable run time system -- or is it a distributed OS with an MPI interface? It seems like some of the changes being asked for (e.g., with the BTLs) reflect the latter -- but perhaps not everyone shares that view and hence the impedance mismatch. I doubt this is the last time that tensions will come up because of differing views on this question. I suggest that we come to some kind of common understanding of the question (and answer) and structure development and administration accordingly. My personal (and I believe, Sandia's) view is that Open MPI should seek to be the best MPI implementation it can be and to leave the distributed OS part to a distributed OS project. This can be seen by my work with Ralph over the past few years to reduce the amount of run-time that exists when running on Red Storm. My vision of the (ideal, possibly impractical) Open MPI would be one with a DPM framework (the interface between OMPI and the run-time) and nothing else in the run-time category. That being said, I understand the fact that we need a run-time for platforms which are not as robust as Red Storm. I also understand the desire to build a variety of programming paradigms on top of Open MPI's strong infrastructure. Given the number of broken interfaces out there, only having to fix them once with more software is attractive. In the end, I don't want to give up the high quality MPI implementation part of the project to achieve the goal of wider applicability. Five years ago, we set out to build the best MPI implementation we could, and we're not done yet. We should not give up that goal to support other programming paradigms or projects. However, changes to better support other projects and which do not detract from the primary goal of a high quality MPI implementation should be pursued. Brian
Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert toppingor is it a floor wax?
I'm going to stay out of the debate about whether Andy correctly characterized the two points you brought up as a distributed OS or not. Sandia's position on these two points remains the same as I previously stated when the question was distributed OS or not. The primary goal of the Open MPI project was, and should remain, to be the best MPI project available. Low-cost items to support different run-times or different non-MPI communication contexts are worth the work. But high-cost items should be avoided, as they degrade our ability to provide the best MPI project available (of course, others, including OMPI developers, can take the source and do what they wish outside the primary development tree). High performance is a concern, but so is code maintainability. If it takes twice as long to implement feature A because I have to worry about its impact not only on MPI, but also on projects X, Y, and Z, as an MPI developer I've lost something important. Brian

On Thu, 12 Mar 2009, Richard Graham wrote:

I am assuming that by distributed OS you are referring to the changes that we (not just ORNL) are trying to do. If this is the case, this is a mischaracterization of our intentions. We have two goals:
- To be able to use a different run-time than ORTE to drive Open MPI
- To use the communication primitives outside the context of MPI (with or without ORTE)

High performance is critical, and at NO time have we ever said anything about sacrificing performance - these have been concerns that others (rightfully) have expressed. Rich

On 3/12/09 8:24 AM, "Jeff Squyres" wrote:

I think I have to agree with Terry. I love to inspire and see new, original, and unintended uses for Open MPI. But our primary focus must remain to create, maintain, and continue to deliver a high performance MPI implementation. We have a long history of adding "small" things to Open MPI that are useful to 3rd parties because it helps them, helps further Open MPI adoption/usefulness, and wasn't difficult for us to do ("small" can have varying definitions). I'm in favor of such things, as long as we maintain a policy of "in cases of conflict, OMPI/high performance MPI wins".

On Mar 12, 2009, at 9:01 AM, Terry Dontje wrote:

Sun's participation in this community was to obtain a stable and performant MPI implementation that had some research work done on the side to improve those goals and the introduction of new features. We don't have problems with others using and improving on the OMPI code base, but we need to make sure such usage doesn't detract from our primary goal of a performant MPI implementation. However, changes to the OMPI code base to allow it to morph into or even support a distributed OS do cause some concern. That is, are we opening the door to having more interfaces to support? If so, is this wise, given that it seems to me we have a hard enough time trying to focus on the MPI items? Not to mention this definitely starts detracting from the original goals. --td

Andrew Lumsdaine wrote:

Hi all -- There is a meta question that I think is underlying some of the discussion about what to do with BTLs etc. Namely, is Open MPI an MPI implementation with a portable run time system -- or is it a distributed OS with an MPI interface? It seems like some of the changes being asked for (e.g., with the BTLs) reflect the latter -- but perhaps not everyone shares that view and hence the impedance mismatch. I doubt this is the last time that tensions will come up because of differing views on this question.
I suggest that we come to some kind of common understanding of the question (and answer) and structure development and administration accordingly. Best Regards, Andrew Lumsdaine
Re: [OMPI devel] Inherent limit on #communicators?
On Thu, 30 Apr 2009, Ralph Castain wrote: We seem to have hit a problem here - it looks like we are seeing a built-in limit on the number of communicators one can create in a program. The program basically does a loop, calling MPI_Comm_split each time through the loop to create a sub-communicator, does a reduce operation on the members of the sub-communicator, and then calls MPI_Comm_free to release it (this is a minimized reproducer for the real code). After 64k times through the loop, the program fails. This looks remarkably like a 16-bit index that hits a max value and then blocks. I have looked at the communicator code, but I don't immediately see such a field. Is anyone aware of some other place where we would have a limit that would cause this problem? There's a maximum of 32768 communicator ids when using OB1 (each PML can set the max contextid, although the communicator code is the part that actually assigns a cid). Assuming that comm_free is actually properly called, there should be plenty of cids available for that pattern. However, I'm not sure I understand the block algorithm someone added to cid allocation - I'd have to guess that there's something funny with that routine and cids aren't being recycled properly. Brian
Re: [OMPI devel] Inherent limit on #communicators?
When we added the CM PML, we added a pml_max_contextid field to the PML structure, which is the max size cid the PML can handle (because the matching interfaces don't allow 32 bits to be used for the cid). At the same time, the max cid for OB1 was shrunk significantly, so that the header on a short message would be packed tightly with no alignment padding. At the time, we believed 32k simultaneous communicators was plenty, and that CIDs were reused (we checked, I'm pretty sure). It sounds like someone removed the CID reuse code, which seems rather bad to me. There have to be unused CIDs in Ralph's example - is there a way to fall back out of the block algorithm when it can't find a new CID and find one it can reuse? Other than setting the multi-threaded case back on, that is? Brian

On Thu, 30 Apr 2009, Edgar Gabriel wrote:

cids are in fact not recycled in the block algorithm. The problem is that comm_free is not collective, so you can not make any assumptions whether other procs have also released that communicator. But nevertheless, a cid in the communicator structure is a uint32_t, so it should not hit the 16k limit there yet. This is not new, so if there is a discrepancy between what the comm structure assumes a cid is and what the pml assumes, then this was in the code since the very first days of Open MPI... Thanks Edgar

Brian W. Barrett wrote:

On Thu, 30 Apr 2009, Ralph Castain wrote: We seem to have hit a problem here - it looks like we are seeing a built-in limit on the number of communicators one can create in a program. The program basically does a loop, calling MPI_Comm_split each time through the loop to create a sub-communicator, does a reduce operation on the members of the sub-communicator, and then calls MPI_Comm_free to release it (this is a minimized reproducer for the real code). After 64k times through the loop, the program fails. This looks remarkably like a 16-bit index that hits a max value and then blocks. I have looked at the communicator code, but I don't immediately see such a field. Is anyone aware of some other place where we would have a limit that would cause this problem?

There's a maximum of 32768 communicator ids when using OB1 (each PML can set the max contextid, although the communicator code is the part that actually assigns a cid). Assuming that comm_free is actually properly called, there should be plenty of cids available for that pattern. However, I'm not sure I understand the block algorithm someone added to cid allocation - I'd have to guess that there's something funny with that routine and cids aren't being recycled properly. Brian
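To illustrate why shrinking the OB1 cid helps, here is a hypothetical short-message match header with a 16-bit context id; the field names and layout are invented for illustration and are not the real OB1 header, but they show how a small cid field lets the header pack with no alignment padding.

#include <stdint.h>

/* 12 bytes, naturally packed: a 32-bit cid here would either grow the
 * header or introduce padding on most ABIs. */
struct match_hdr {
    uint16_t ctxid;   /* communicator (context) id -- hence the ~32k limit */
    uint16_t seq;     /* per-peer sequence number                          */
    int32_t  src;     /* sender rank                                       */
    int32_t  tag;     /* MPI tag                                           */
};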
Re: [OMPI devel] Inherent limit on #communicators?
On Thu, 30 Apr 2009, Edgar Gabriel wrote:

Brian W. Barrett wrote: When we added the CM PML, we added a pml_max_contextid field to the PML structure, which is the max size cid the PML can handle (because the matching interfaces don't allow 32 bits to be used for the cid). At the same time, the max cid for OB1 was shrunk significantly, so that the header on a short message would be packed tightly with no alignment padding. At the time, we believed 32k simultaneous communicators was plenty, and that CIDs were reused (we checked, I'm pretty sure). It sounds like someone removed the CID reuse code, which seems rather bad to me.

Yes, we added the block algorithm. Not reusing a CID actually doesn't strike me as that dramatic, and I am still not sure and convinced about that :-) We do not have an empty array or something like that, it's just a number. The reason for the block algorithm was that the performance of our communicator creation code sucked, and the cid allocation was one portion of that. We used to require *at least* 4 collective operations per communicator creation at that time. We are now potentially down to 0, among others thanks to the block algorithm. However, let me think about reusing entire blocks; it's probably doable, it just requires a little more bookkeeping...

There have to be unused CIDs in Ralph's example - is there a way to fall back out of the block algorithm when it can't find a new CID and find one it can reuse? Other than setting the multi-threaded case back on, that is?

Remember that it's not the communicator id allocation that is failing at this point, so the question is do we have to 'validate' a cid with the pml before we declare it to be ok?

Well, that's only because the code's doing something it shouldn't. Have a look at comm_cid.c:185 - there's the check we added to the multi-threaded case (which was the only case when we added it). The cid generation should never generate a number larger than mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got a valid cid in add_comm, which it should probably do.

Looking at the differences between v1.2 and v1.3, the max_contextid code was already in v1.2 and OB1 was setting it to 32k. So the cid blocking code removed a rather critical feature and probably should be fixed or removed for v1.3. On Portals, I only get 8k cids, so not having reuse is a rather large problem. Brian
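For concreteness, a hypothetical sketch of the "reuse entire blocks" bookkeeping Edgar says is doable: freed cid blocks go on a free list and are handed out again before a fresh block is carved off the global counter. This ignores the collectivity problem Edgar raises (agreeing that all processes have freed the communicator) and is not the actual Open MPI cid allocation code; the block size and the number of reserved cids are made up.

#include <stdint.h>
#include <stdlib.h>

#define CID_BLOCK_SIZE 8

struct cid_block { uint32_t base; struct cid_block *next; };

static struct cid_block *free_blocks = NULL;
static uint32_t next_unused_base = 4;   /* assume a few cids are reserved */

static uint32_t alloc_cid_block(void)
{
    if (free_blocks != NULL) {           /* reuse a released block first */
        struct cid_block *b = free_blocks;
        free_blocks = b->next;
        uint32_t base = b->base;
        free(b);
        return base;
    }
    uint32_t base = next_unused_base;    /* otherwise carve off a new block */
    next_unused_base += CID_BLOCK_SIZE;
    return base;
}

static void release_cid_block(uint32_t base)
{
    struct cid_block *b = malloc(sizeof(*b));
    if (b != NULL) {
        b->base = base;
        b->next = free_blocks;
        free_blocks = b;
    }
}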
Re: [OMPI devel] Inherent limit on #communicators?
On Thu, 30 Apr 2009, Ralph Castain wrote: well, that's only because the code's doing something it shouldn't. Have a look at comm_cid.c:185 - there's the check we added to the multi-threaded case (which was the only case when we added it). The cid generation should never generate a number larger than mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got a valid cid in add_comm, which it should probably do. Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick. Ah. Patch to change the hang into an abort coming RSN. Brian
Re: [OMPI devel] Revise paffinity method?
On Wed, 6 May 2009, Ralph Castain wrote: Any thoughts on this? Should we change it?

Yes, we should change this (IMHO) :).

If so, who wants to be involved in the re-design? I'm pretty sure it would require some modification of the paffinity framework, plus some minor mods to the odls framework and (since you cannot bind a process other than yourself) addition of a new small "proxy" script that would bind-then-exec each process started by the orted (Eugene posted a candidate on the user list, though we will have to deal with some system-specific issues in it).

I can't contribute a whole lot of time, but I'd be happy to lurk, offer advice, and write some small bits of code. But I definitely can't lead. First offering of opinion from me: I think we can avoid the "proxy" script by doing the binding after the fork but before the exec. This will definitely require minor changes to the odls and probably a bunch of changes to the paffinity framework. This will make things slightly less fragile than a script would, and yet get us what we want. Brian
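A minimal sketch of the bind-after-fork-before-exec idea, assuming Linux and sched_setaffinity() as the binding mechanism; in Open MPI this would go through the paffinity framework rather than a direct syscall wrapper, and everything below is illustrative.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Fork a child, bind it to 'core' while still in the child, then exec the
 * application.  The affinity mask survives the exec, so no proxy script is
 * needed. */
static pid_t launch_bound(char **argv, int core)
{
    pid_t pid = fork();
    if (pid == 0) {                       /* child: bind self, then exec */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            _exit(1);
        }
        execvp(argv[0], argv);
        perror("execvp");                 /* only reached if exec fails */
        _exit(1);
    }
    return pid;                           /* parent (the orted, roughly) */
}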
Re: [OMPI devel] Build failures on trunk? r21235
On Thu, 14 May 2009, Jeff Squyres wrote: On May 14, 2009, at 1:46 PM, Ralf Wildenhues wrote: A more permanent workaround could be for Open MPI to list each library that is used *directly* by some other library as a dependency. Sigh. We actually took pains to *not* do that; we *used* to do that and explicitly took it out. :-\ IIRC, it had something to do with dlopen'ing libmpi.so...?

Actually, I think that was something else. Today, libopen-rte.la lists libopen-pal.la as a dependency and libmpi.la lists libopen-rte.la. I had removed the dependency of libmpi.la on libopen-pal.la because it was causing libopen-pal.so to be listed twice by libtool, which was causing problems. It would be a trivial fix to change the Makefiles to make libmpi.la depend on libopen-pal.la as well as libopen-rte.la. Brian
Re: [OMPI devel] Build failures on trunk? r21235
On Thu, 14 May 2009, Ralf Wildenhues wrote: Hi Brian, * Brian W. Barrett wrote on Thu, May 14, 2009 at 08:22:58PM CEST: Actually, I think that was something else. Today, libopen-rte.la lists libopen-pal.la as a dependency and libmpi.la lists libopen-rte.la. I had removed the dependency of libmpi.la on libopen-pal.la because it was causing libopen-pal.so to be listed twice by libtool, which was causing problems. That's weird, and shouldn't happen (the problems, that is). Do you have a pointer for them? I don't - it was many moons ago. And it very likely was when we were in that (evil) period where we were using LT2 before it was released as stable. So it's completely possible we were seeing a transient bug which is long since gone. Brian
Re: [OMPI devel] Build failures on trunk? r21235
On Thu, 14 May 2009, Jeff Squyres wrote: On May 14, 2009, at 2:22 PM, Brian W. Barrett wrote: We actually took pains to *not* do that; we *used* to do that and explicitly took it out. :-\ IIRC, it had something to do with dlopen'ing libmpi.so...? Actually, I think that was something else. Today, libopen-rte.la lists libopen-pal.la as a dependency and libmpi.la lists libopen-rte.la. I had removed the dependency of libmpi.la on libopen-pal.la because it was causing libopen-pal.so to be listed twice by libtool, which was causing problems. It would be a trivial fix to change the Makefiles to make libmpi.la to depend on libopen-pal.la as well as libopen-rte.la. Ah -- am I thinking of us removing libmpi (etc.) from the DSOs? I think so. And that's a change we definitely don't want to undo. Brian
Re: [OMPI devel] opal / fortran / Flogical
I have to agree with Jeff's concerns. Brian On Mon, 1 Jun 2009, Jeff Squyres wrote: Hmm. I'm not sure that I like this commit. George, Brian, and I specifically kept Fortran out of (the non-generated code in) opal because the MPI layer is the *only* layer that uses Fortran. There was one or two minor abstraction breaks (you cited opal/util/arch.c), but now we have Fortran all throughout Opal. Hmmm... :-\ Is MPI_Flogical a real type? I don't see it defined in the MPI-2.2 latex sources, but I could be missing it. I *thought* we used ompi_fortran_logical_t internally because there was no officially sanctioned MPI_ type for it...? On May 30, 2009, at 11:54 AM, wrote: Author: rusraink Date: 2009-05-30 11:54:29 EDT (Sat, 30 May 2009) New Revision: 21330 URL: https://svn.open-mpi.org/trac/ompi/changeset/21330 Log: - Move alignment and size output generated by configure-tests into the OPAL namespace, eliminating cases like opal/util/arch.c testing for ompi_fortran_logical_t. As this is processor- and compiler-related information (e.g. does the compiler/architecture support REAL*16) this should have been on the OPAL layer. - Unifies f77 code using MPI_Flogical instead of opal_fortran_logical_t - Tested locally (Linux/x86-64) with mpich and intel testsuite but would like to get this week-ends MTT output - PLEASE NOTE: configure-internal macro-names and ompi_cv_ variables have not been changed, so that external platform (not in contrib/) files still work. Text files modified: trunk/config/f77_check.m4 |20 trunk/config/f77_check_logical_array.m4 | 6 trunk/config/f77_check_real16_c_equiv.m4 |14 trunk/config/f77_get_fortran_handle_max.m4 | 4 trunk/config/f77_get_value_true.m4 |14 trunk/config/f77_purge_unsupported_kind.m4 | 8 trunk/config/f90_check.m4 |10 trunk/configure.ac |20 trunk/contrib/platform/win32/CMakeModules/f77_check.cmake |24 trunk/contrib/platform/win32/CMakeModules/f77_check_real16_c_equiv.cmake |12 trunk/contrib/platform/win32/CMakeModules/ompi_configure.cmake | 154 trunk/contrib/platform/win32/ConfigFiles/mpi.h.cmake |96 ++-- trunk/contrib/platform/win32/ConfigFiles/opal_config.h.cmake | 222 ++-- trunk/ompi/attribute/attribute.c | 6 trunk/ompi/attribute/attribute.h | 4 trunk/ompi/communicator/comm_init.c | 2 trunk/ompi/datatype/copy_functions.c |10 trunk/ompi/datatype/copy_functions_heterogeneous.c |14 trunk/ompi/datatype/dt_module.c | 224 ++-- trunk/ompi/errhandler/errcode-internal.c | 2 trunk/ompi/errhandler/errcode.c | 2 trunk/ompi/errhandler/errhandler.c | 2 trunk/ompi/file/file.c | 2 trunk/ompi/group/group_init.c | 2 trunk/ompi/include/mpi.h.in |96 ++-- trunk/ompi/include/ompi_config.h.in |48 +- trunk/ompi/info/info.c | 2 trunk/ompi/mca/op/base/functions.h |56 +- trunk/ompi/mca/op/base/op_base_functions.c | 722 trunk/ompi/mca/osc/base/osc_base_obj_convert.c | 8 trunk/ompi/mpi/c/type_create_f90_integer.c | 4 trunk/ompi/mpi/f77/base/attr_fn_f.c |48 +- trunk/ompi/mpi/f77/file_read_all_end_f.c | 6 trunk/ompi/mpi/f77/file_read_all_f.c | 6 trunk/ompi/mpi/f77/file_read_at_all_end_f.c | 6 trunk/ompi/mpi/f77/file_read_at_all_f.c | 6 trunk/ompi/mpi/f77/file_read_at_f.c | 6 trunk/ompi/mpi/f77/file_read_f.c | 6 trunk/ompi/mpi/f77/file_read_ordered_end_f.c | 6 trunk/ompi/mpi/f77/file_read_ordered_f.c | 6 trunk/ompi/mpi/f77/file_read_shared_f.c | 6 trunk/ompi/mpi/f77/file_write_all_end_f.c | 6 trunk/ompi/mpi/f77/file_write_all_f.c | 6 trunk/ompi/mpi/f77/file_write_at_all_end_f.c | 6 trunk/ompi/mpi/f77/file_write_at_all_f.c | 6 trunk/ompi/mpi/f77/file_write_at_f.c | 6 
trunk/ompi/mpi/f77/file_write_f.c | 6 trunk/ompi/mpi/f77/file_write_ordered_end_f.c | 6 trunk/ompi/mpi/f77/file_write_ordered_f.c | 6 trunk/ompi/mpi/f77/file_write_shared_f.c | 6 trunk/ompi/mpi/f77/fint_2_int.h |16 trunk/ompi/mpi/f77/iprobe_f.c | 6 trunk/ompi/mpi/f77/probe_f.c | 6 trunk/ompi/mpi/f77/recv_f.c | 6 trunk/ompi/mpi/f77/testsome_f.c | 4 trunk/ompi/mpi/f90/fortran_sizes.h.in |64 +- trunk/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh |16 trunk/ompi/request/request.c | 2 trunk/ompi/tools/ompi_info/param.cc |96 ++-- trunk/ompi/win/win.c | 2 trunk/opal/class/opal_bitmap.c | 2 trunk/opal/class/opal_bitmap.h | 2 trunk/opal/class/opal_pointer_array.c | 4 trunk/opal/include/opal_config_bottom.h |10 trunk/opal/util/arch.c | 6 65 files changed, 1104 insertions(+), 1104 deletions(-) Modified: trunk/config/f77_check.m4 == ---
Re: [OMPI devel] opal / fortran / Flogical
Well, this may just be another sign that the push of the DDT to OPAL is a bad idea. That's been my opinion from the start, so I'm biased. But OPAL was intended to be single-process systems portability, not MPI crud. Brian On Mon, 1 Jun 2009, Rainer Keller wrote: Hmm, OK, I see. However, I do see a potential problem with getting the ddt work onto the OPAL layer when we have a Fortran compiler with different alignment requirements for the same-sized basic types... As far as I understand, the OPAL layer is meant to abstract away underlying system portability, libc quirks, and compiler information. But I am perfectly fine with reverting this! Let's discuss, maybe phone? Thanks, Rainer
Re: [OMPI devel] trac ticket 1944 and pending sends
I think that sounds like a rational path forward. Another, more long term, option would be to move from the FIFOs to a linked list (which can even be atomic), which is what MPICH does with nemesis. In that case, there's never a queue to get backed up (although the receive queue for collectives is still a problem). It would also solve the returning a fragment without space problem, as there's always space in a linked list. Brian On Tue, 23 Jun 2009, Eugene Loh wrote: The sm BTL used to have two mechanisms for dealing with congested FIFOs. One was to grow the FIFOs. Another was to queue pending sends locally (on the sender's side). I think the grow-FIFO mechanism was typically invoked and the pending-send mechanism used only under extreme circumstances (no more memory). With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The code added complexity and there seemed to be no need to have two mechanisms to deal with congested FIFOs. In ticket 1944, however, we see that repeated collectives can produce hangs, and this seems to be due to the pending-send code not adequately dealing with congested FIFOs. Today, when a process tries to write to a remote FIFO and fails, it queues the write as a pending send. The only condition under which it retries pending sends is when it gets a fragment back from a remote process. I think the logic must have been that the FIFO got congested because we issued too many sends. Getting a fragment back indicates that the remote process has made progress digesting those sends. In ticket 1944, we see that a FIFO can also get congested from too many returning fragments. Further, with shared FIFOs, a FIFO could become congested due to the activity of a third-party process. In sum, getting a fragment back from a remote process is a poor indicator that it's time to retry pending sends. Maybe the real way to know when to retry pending sends is just to check if there's room on the FIFO. So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if there are pending sends. If so, it'll retry them before performing the requested write. This should also help preserve ordering a little better. I'm guessing this will not hurt our message latency in any meaningful way, but I'll check this out. Meanwhile, I wanted to check in with y'all for any guidance you might have. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
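The retry idea Eugene describes can be sketched in a few lines. This is a toy model, not the real MCA_BTL_SM_FIFO_WRITE macro: the FIFO is a fixed-size array, fragments are plain ints, and the helper names (fifo_try_write, btl_sm_write) are invented. It only illustrates the proposed ordering - flush the locally queued pending sends whenever there is room, then attempt the new write, and queue it locally if the FIFO is still congested.

    #include <stdio.h>
    #include <string.h>

    #define FIFO_SLOTS 4

    /* toy stand-ins for the shared-memory FIFO and the local pending-send queue */
    static int fifo[FIFO_SLOTS];
    static int fifo_count = 0;
    static int pending[64];
    static int pending_count = 0;

    /* returns 0 on success, -1 if the FIFO is full */
    static int fifo_try_write(int frag)
    {
        if (fifo_count >= FIFO_SLOTS) return -1;
        fifo[fifo_count++] = frag;
        return 0;
    }

    /* proposed ordering: retry pending sends whenever there is room,
       not only when a fragment comes back from the peer */
    static int btl_sm_write(int frag)
    {
        while (pending_count > 0 && fifo_try_write(pending[0]) == 0) {
            memmove(pending, pending + 1, (--pending_count) * sizeof(int));
        }
        if (pending_count > 0 || fifo_try_write(frag) != 0) {
            pending[pending_count++] = frag;   /* queue locally, preserving order */
            return 1;                          /* deferred */
        }
        return 0;                              /* written to the FIFO */
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++)
            printf("frag %d -> %s\n", i, btl_sm_write(i) ? "pending" : "fifo");
        fifo_count = 0;    /* pretend the receiver drained the FIFO */
        printf("frag 8 -> %s (pending sends were flushed first)\n",
               btl_sm_write(8) ? "pending" : "fifo");
        return 0;
    }

Because the pending list is retried before any new write, ordering is preserved without waiting for a returned fragment to trigger the retry.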
Re: [OMPI devel] trac ticket 1944 and pending sends
On Wed, 24 Jun 2009, Eugene Loh wrote: Brian Barrett wrote: Or go to what I proposed and USE A LINKED LIST! (as I said before, not an original idea, but one I think has merit) Then you don't have to size the fifo, because there isn't a fifo. Limit the number of send fragments any one proc can allocate and the only place memory can grow without bound is the OB1 unexpected list. Then use SEND_COMPLETE instead of SEND_NORMAL in the collectives without barrier semantics (bcast, reduce, gather, scatter) and you effectively limit how far ahead any one proc can get to something that we can handle, with no performance hit. I'm still digesting George's mail and trac comments and responses thereto. Meanwhile, a couple of questions here. First, I think it'd be helpful if you said a few words about how you think a linked list should be used here. I can think of a couple of different ways, and I have questions about each way. Instead of my enumerating these ways and those questions, how about you just be more specific? (We used to grow the FIFOs, so sizing them didn't use to be an issue.) My thought is to essentially implement a good chunk of the Nemesis design from MPICH, so reading that paper might give background on where I'm coming from. But if it were me: 1) Always limit the number of send fragments that can be allocated to something small. This gives us a concrete upper bound on the size of the shared memory region we need to allocate. 2) Rather than a FIFO in which we put offset pointers, which requires a large amount of memory (p * num_frags), a linked list folds that storage into the fragment itself - it's two fields in there, plus some constant overhead for the LL structure. 3) On insert, either acquire the lock for the LL and insert at the tail of the list or use atomic ops to update the tail of the list (the nemesis paper talks about the atomic sequence). Because there's no FIFO to fill up, there are no deadlock issues. 4) If, on send, you don't have any send fragments available, as they're a constrained resource, drain your incoming queue to collect acks - if you don't get any fragments, return failure to the upper layer and let it try again. 5) I can see how #4 might cause issues, as the draining of the queue might actually result in more send requests. In this case, I'd be tempted to have two linked lists (they're small, after all), one for incoming fragments and one for acks. This wasn't an option with the fifos, due to their large size. Second, I'm curious how elaborate of a fix I should be trying for here. Are we looking for something to fix the problems at hand, or are we opening the door to rearchitecting a big part of the sm BTL? Well, like Ralph said, I worry about whether we can strap another bandaid on and keep it working. If we can, great. But George's proposal seems like it undoes all the memory savings work you did, and that worries me. Brian
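A rough sketch of the linked-list alternative from point 3, taking the simpler locked option; the struct and function names are illustrative, not the sm BTL's actual types, and the lock-free variant would instead follow the atomic-exchange sequence in the Nemesis paper. The point is that the queue costs two pointer-sized fields (here one, plus the head/tail block), so it can never "fill up" the way a fixed-size FIFO can.

    #include <pthread.h>
    #include <stdio.h>

    /* illustrative fragment: the list costs one 'next' field per fragment,
       plus a small head/tail structure, instead of a p * num_frags FIFO */
    struct frag {
        struct frag *next;
        int payload;
    };

    struct frag_queue {
        pthread_mutex_t lock;
        struct frag *head;
        struct frag *tail;
    };

    static void queue_init(struct frag_queue *q)
    {
        pthread_mutex_init(&q->lock, NULL);
        q->head = q->tail = NULL;
    }

    /* insert at the tail under a lock */
    static void queue_push(struct frag_queue *q, struct frag *f)
    {
        f->next = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->tail) q->tail->next = f;
        else         q->head = f;
        q->tail = f;
        pthread_mutex_unlock(&q->lock);
    }

    static struct frag *queue_pop(struct frag_queue *q)
    {
        pthread_mutex_lock(&q->lock);
        struct frag *f = q->head;
        if (f) {
            q->head = f->next;
            if (q->head == NULL) q->tail = NULL;
        }
        pthread_mutex_unlock(&q->lock);
        return f;
    }

    int main(void)
    {
        struct frag_queue q;
        struct frag frags[3] = { { NULL, 1 }, { NULL, 2 }, { NULL, 3 } };
        struct frag *f;

        queue_init(&q);
        for (int i = 0; i < 3; i++) queue_push(&q, &frags[i]);
        while ((f = queue_pop(&q)) != NULL) printf("popped %d\n", f->payload);
        return 0;
    }

Build with -pthread. In a real shared-memory BTL the nodes would live in the shared segment and the lock or atomics would have to be process-shared, which this sketch glosses over.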
Re: [OMPI devel] sm BTL flow management
All - Jeff, Eugene, and I had a long discussion this morning on the sm BTL flow management issues and came to a couple of conclusions. * Jeff, Eugene, and I are all convinced that Eugene's addition of polling the receive queue to drain acks when sends start backing up is required for deadlock avoidance. * We're also convinced that George's proposal, while a good idea in general, is not sufficient. The send path doesn't appear to sufficiently progress the btl to avoid the deadlocks we're seeing with the SM btl today. Therefore, while I still recommend sizing the fifo appropriately and limiting the freelist size, I think it's not sufficient to solve all problems. * Finally, it took an hour, but we did determine that one of the major differences between 1.2.8 and 1.3.0 in terms of sm is how messages were pulled off the FIFO. In 1.2.8 (and all earlier versions), we return from btl_progress after a single message is received (ack or message) or the fifo was empty. In 1.3.0 (pre-srq work Eugene did), we changed to completely draining all queues before returning from btl_progress. This has led to a situation where a single call to btl_progress can make a large number of callbacks into the PML (900,000 times in one of Eugene's test cases). The change was made to resolve an issue Terry was having with the performance of a benchmark. We've decided that it would be advantageous to try something between the two points and drain X number of messages from the queue, then return, where X is 100 or so at most. This should cover the performance issues Terry saw, but still not cause a huge number of messages to be added to the unexpected queue with a single call to MPI_Recv. Since a recv that is matched on the unexpected queue doesn't result in a call to opal_progress, this should help balance the load a little bit better. Eugene's going to take a stab at implementing this short term. I think the combination of Eugene's deadlock avoidance fix and the careful queue draining should make me comfortable enough to start another round of testing; at the very least this explains the bottom-line issues. Brian
Re: [OMPI devel] sm BTL flow management
On Thu, 25 Jun 2009, Eugene Loh wrote: I spoke with Brian and Jeff about this earlier today. Presumably, up through 1.2, mca_btl_component_progress would poll and if it received a message fragment would return. Then, presumably in 1.3.0, behavior was changed to keep polling until the FIFO was empty. Brian said this was based on Terry's desire to keep latency as low as possible in benchmarks. Namely, reaching down into a progress call was a long code path. It would be better to pick up multiple messages, if available on the FIFO, and queue extras up in the unexpected queue. Then, a subsequent call could more efficiently find the anticipated message fragment. I don't see how the behavior would impact short-message pingpongs (the typical way to measure latency) one way or the other. I asked Terry, who struggled to remember the issue and pointed me at this thread: http://www.open-mpi.org/community/lists/devel/2008/06/4158.php . But that is related to an issue that's solved if one keeps polling as long as one gets ACKs (but returns as soon as a real message fragment is found). Can anyone shed some light on the history here? Why keep polling even when a message fragment has been found? The downside of polling too aggressively is that the unexpected queue can grow (without bounds). Brian's proposal is to set some variable that determines how many message fragments a single mca_btl_sm_component_progress call can drain from the FIFO before returning. I checked, and 1.3.2 definitely drains all messages until the fifo is empty. If we were to switch to drain until we receive a data message and that fixes Terry's issue, that seems like a rational change and would not require the fix I suggested. My assumption had been that we needed to drain more than one data message per call to component_progress in order to work around Terry's issue. If not, then let's go with the simple fix and only drain one data message per enterance to component_progress (but drain multiple acks if we have a bunch of acks and then a data message in the queue). Unfortunately I have no more history than what Terry proposed, but it looks like the changes were made around that time. Brian
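A toy version of the drain policy being discussed, with made-up helpers (fifo_pop, deliver_to_pml, return_ack) standing in for the real sm BTL code: acks are always consumed, but the loop returns after a bounded number of data fragments. A limit of 1 gives the 1.2-style "return after the first real message" behavior; a limit of ~100 gives the compromise described in the preceding messages.

    #include <stdio.h>

    enum frag_kind { FRAG_NONE, FRAG_ACK, FRAG_DATA };

    /* a canned stream of incoming fragments standing in for the FIFO */
    static enum frag_kind stream[] = {
        FRAG_ACK, FRAG_ACK, FRAG_DATA, FRAG_ACK, FRAG_DATA, FRAG_DATA, FRAG_NONE
    };
    static int pos = 0;

    static enum frag_kind fifo_pop(void) { return stream[pos++]; }
    static void return_ack(void)         { printf("  consumed ack\n"); }
    static void deliver_to_pml(void)     { printf("  delivered data fragment\n"); }

    /* drain at most 'limit' data fragments per progress call;
       acks never count against the limit */
    static int sm_progress(int limit)
    {
        int ndata = 0;
        while (ndata < limit) {
            enum frag_kind k = fifo_pop();
            if (k == FRAG_NONE) { pos--; break; }   /* FIFO empty */
            if (k == FRAG_ACK)  { return_ack(); continue; }
            deliver_to_pml();
            ndata++;
        }
        return ndata;
    }

    int main(void)
    {
        int n;
        do {
            printf("progress call:\n");
            n = sm_progress(1);   /* return after the first data fragment */
        } while (n > 0);
        return 0;
    }

Setting the limit to 1 keeps the unexpected queue from growing inside a single MPI_Recv, while still letting any number of queued acks be reclaimed on the way to that first data fragment.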
Re: [OMPI devel] MPI_Accumulate() with MPI_PROC_NULL target rank
On Wed, 15 Jul 2009, Lisandro Dalcin wrote: The MPI 2.1 standard says: "MPI_PROC_NULL is a valid target rank in the MPI RMA calls MPI_ACCUMULATE, MPI_GET, and MPI_PUT. The effect is the same as for MPI_PROC_NULL in MPI point-to-point communication. After any RMA operation with rank MPI_PROC_NULL, it is still necessary to finish the RMA epoch with the synchronization method that started the epoch." Unfortunately, MPI_Accumulate() is not quite the same as point-to-point, as a reduction is involved. Suppose you make this call (let me abuse and use keyword arguments): MPI_Accumulate(..., target_rank=MPI_PROC_NULL, target_datatype=MPI_BYTE, op=MPI_SUM, ...) IIUC, the call fails (with MPI_ERR_OP) in Open MPI because MPI_BYTE is an invalid datatype for MPI_SUM. But provided that the target rank is MPI_PROC_NULL, would it make sense for the call to succeed? I believe no. We do full argument error checking (that you provided a valid communicator and datatype) on send, receive, put, and get when the source/dest is MPI_PROC_NULL. Therefore, I think it's logical that we extend that to include valid operations for accumulate. Brian
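A schematic of the argument-checking order being argued for, using simplified stand-in checks and constants rather than Open MPI's real parameter-checking macros: the op/datatype combination is validated even when the target is MPI_PROC_NULL, and only then does the call short-circuit into a no-op.

    #include <stdio.h>

    #define MY_PROC_NULL  (-2)   /* stand-in for MPI_PROC_NULL */
    #define MY_SUCCESS      0
    #define MY_ERR_OP      10    /* stand-in for MPI_ERR_OP */

    /* pretend SUM (1) over a byte type (2) is the one invalid combination */
    static int op_valid_for_type(int op, int type) { return !(op == 1 && type == 2); }

    static int my_accumulate(int target_rank, int target_type, int op)
    {
        /* argument checking happens first, exactly as for send/recv/put/get */
        if (!op_valid_for_type(op, target_type)) return MY_ERR_OP;

        /* only now does PROC_NULL short-circuit the communication */
        if (target_rank == MY_PROC_NULL) return MY_SUCCESS;

        /* ... real RMA work would go here ... */
        return MY_SUCCESS;
    }

    int main(void)
    {
        /* a SUM on a byte type fails even though the target is PROC_NULL */
        printf("bad op/type to PROC_NULL:  %d\n", my_accumulate(MY_PROC_NULL, 2, 1));
        printf("good op/type to PROC_NULL: %d\n", my_accumulate(MY_PROC_NULL, 1, 1));
        return 0;
    }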
Re: [OMPI devel] autodetect broken
The current autodetect implementation seems like the wrong approach to me. I'm rather unhappy the base functionality was hacked up like it was without any advance notice or questions about the original design intent. We seem to have a set of base functions which are now more unreadable than before, overly complex, and which leak memory. The intent of the installdirs framework was to allow this type of behavior, but without rehacking all this infer crap into base. The autodetect component should just set $prefix in the set of functions it returns (and possibly libdir and bindir if you really want, which might actually make sense if you guess wrong), and let the expansion code take over from there. The general thought on how this would work went something like: - Run after config - If you determine you have a special $prefix, set opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and set your special prefix. - Same with bindir and libdir if needed - Let expansion (which runs after all components have had the chance to fill in their fields) expand out with your special data And the base stays simple, the components do all the heavy lifting, and life is happy. I would not be opposed to adding to the base a "find expanded part" type function that takes two strings like "${prefix}/lib" and "/usr/local/lib" and returns "/usr/local", so that other autodetect-style components don't need to handle such a case, but that's about the extent of the base changes I think are appropriate. Finally, a first quick code review reveals a couple of problems: - We don't AC_SUBST variables adding .lo files to build sources in OMPI. Instead, we use AM_CONDITIONALs to add sources as needed. - Obviously, there's a problem with the _happy variable name consistency in configure.m4 - There's a naming convention issue - files should all start with opal_installdirs_autodetect_, and a number of the added files do not. - From a finding-code standpoint, I'd rather walkcontext.c and backtrace.c be one file with #ifs - for such short functions, it makes it more obvious that it's just portability implementations of the same function. I'd be happy to discuss the issues further or review any code before it gets committed. But I think the changes as they exist today (even with bugs fixed) aren't consistent with what the installdirs framework was trying to accomplish and should be reworked. Brian
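A sketch of the "find expanded part" helper suggested above. The function name and error handling are made up; the idea is just string arithmetic: strip the literal suffix that follows "${prefix}" in the unexpanded form from the expanded path to recover the installed prefix.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Given an unexpanded form like "${prefix}/lib" and the path it expanded
       to, e.g. "/usr/local/lib", return the value of ${prefix}: "/usr/local".
       Returns a malloc()ed string, or NULL if the shapes don't match. */
    static char *find_prefix_from_expansion(const char *unexpanded,
                                            const char *expanded)
    {
        const char *marker = "${prefix}";
        size_t mlen = strlen(marker);
        size_t ulen = strlen(unexpanded), elen = strlen(expanded);

        if (strncmp(unexpanded, marker, mlen) != 0) return NULL;

        size_t suffix_len = ulen - mlen;              /* e.g. strlen("/lib") */
        if (elen < suffix_len) return NULL;
        if (strcmp(expanded + elen - suffix_len, unexpanded + mlen) != 0)
            return NULL;                              /* suffixes disagree */

        char *prefix = malloc(elen - suffix_len + 1);
        if (prefix == NULL) return NULL;
        memcpy(prefix, expanded, elen - suffix_len);
        prefix[elen - suffix_len] = '\0';
        return prefix;
    }

    int main(void)
    {
        char *p = find_prefix_from_expansion("${prefix}/lib", "/usr/local/lib");
        printf("recovered prefix: %s\n", p ? p : "(no match)");
        free(p);
        return 0;
    }

With the prefix recovered this way, the normal expansion pass can rebuild every other directory, which is why the rest of the work belongs in the component rather than in the base.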
Re: [OMPI devel] RFC: meaning of "btl_XXX_eager_limit"
On Thu, 23 Jul 2009, Jeff Squyres wrote: There are two solutions I can think of. Which should we do? a. Pass the (max?) PML header size down into the BTL during initialization such that the btl_XXX_eager_limit can represent the max MPI data payload size (i.e., the BTL can size its buffers to accommodate its desired max eager payload size, its header size, and the PML header size). Thus, the eager_limit can truly be the MPI data payload size -- and easy to explain to users. This will not work. Remember, the PML IS NOT THE ONLY USER OF THE BTLS. I'm really getting sick of saying this, but it's true. There can be no PML knowledge in the BTL, even if it's something simple like a header size. And since PML headers change depending on the size and type of message, this seems like a really stupid parameter to publish to the user. b. Stay with the current btl_XXX_eager_limit implementation (which OMPI has had for a long, long time) and add the code to check for btl_eager_limit less than the pml header size (per this past Tuesday's discussion). This is the minimal distance change. Since there's already code in Terry's hands to do this, I vote for b. 2. OMPI currently does not publish enough information for a user to set eager_limit to be able to do BTL traffic shaping. That is, one really needs to know the (max) BTL header length and the (max) PML header length values to be able to calculate the correct eager_limit to force a specific (max) BTL wire fragment size. Our proposed solution is to have ompi_info print out the (max) PML and BTL header sizes. Regardless of whether 1a) or 1b) is chosen, with these two pieces of information, a determined network administrator could calculate the max wire fragment size used by OMPI, and therefore be able to do at least some traffic shaping. Actually, there's no need to know the PML header size to shape traffic. You only need to know the BTL header, and I wouldn't be opposed to changing the behavior so that the BTL eager limit parameter included the btl header size (because the PML header is not a factor in determining the size of individual eager packets). It seems idiotic, but whatever - you should care more about the size of the data the user is sending than about the MTU size. Sending multiple MTUs should have little performance impact on a network that doesn't suck, and we shouldn't be doing all kinds of hacks to support networks whose designers can't figure out which way is up. Again, since there are multiple consumers of the BTLs, allowing network designers to screw around with defaults to try and get what they want (even when it isn't what they actually want) seems stupid. But as long as you don't do 1a, I won't object to the uselessness contained in ompi_info. Brian
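For readers following the arithmetic, a worked version of the two interpretations with invented example numbers (the real header sizes vary by BTL and PML, and the PML header itself varies by protocol, which this deliberately ignores):

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        /* example numbers only -- not Open MPI's actual header sizes */
        size_t btl_header  = 20;
        size_t pml_header  = 64;
        size_t eager_limit = 4096;

        /* interpretation (a): eager_limit is pure MPI payload, so the wire
           fragment is payload plus both headers */
        printf("wire fragment if limit is payload only:   %zu bytes\n",
               eager_limit + pml_header + btl_header);

        /* interpretation (b): eager_limit includes the BTL header, so the wire
           size is predictable and the PML header eats into the payload */
        printf("max payload if limit includes BTL header: %zu bytes\n",
               eager_limit - btl_header - pml_header);
        return 0;
    }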
Re: [OMPI devel] libtool issue with crs/self
What are you trying to do with lt_dlopen? It seems like you should always go through the MCA base utilities. If one's missing, adding it there seems like the right mechanism. Brian On Wed, 29 Jul 2009, Josh Hursey wrote: George suggested that to me as well yesterday after the meeting. So we would create opal interfaces to libtool (similar to what we do with the event engine). That might be the best way to approach this. I'll start to take a look at implementing this. Since opal/libltdl is not part of the repository, is there a 'right' place to put this header? maybe in opal/util/? Thanks, Josh On Jul 28, 2009, at 6:57 PM, Jeff Squyres (jsquyres) wrote: Josh - this is almost certainly what happened. Yoibks. Unfortunately, there's good reasons for it. :( What about if we proxy calls to lt_dlopen through an opal function call? -jms Sent from my PDA. No type good. - Original Message - From: devel-boun...@open-mpi.org To: Open MPI Developers Sent: Tue Jul 28 16:39:42 2009 Subject: Re: [OMPI devel] libtool issue with crs/self It was mentioned to me that r21731 might have caused this problem by restricting the visibility of the libltdl library. https://svn.open-mpi.org/trac/ompi/changeset/21731 Brian, Do you have any thoughts on how we might extend the visibility so that MCA components could also use the libtool in opal? I can try to initialize libtool in the Self CRS component and use it directly, but since it is already opened by OPAL, I think it might be better to use the instantiation in OPAL. Cheers, Josh On Jul 28, 2009, at 3:06 PM, Josh Hursey wrote: Once upon a time, the Self CRS module worked correctly, but I admit that I have not tested it in a long time. The Self CRS component uses dl_open and friends to inspect the running process for a particular set of functions. When I try to run an MPI program that contains these signatures I get the following error when it tries to resolve lt_dlopen() in opal_crs_self_component_query(): -- my-app: symbol lookup error: /path/to/install/lib/openmpi/ mca_crs_self.so: undefined symbol: lt_dlopen -- I am configuring with the following: -- ./configure --prefix=/path/to/install \ --enable-binaries \ --with-devel-headers \ --enable-debug \ --enable-mpi-threads \ --with-ft=cr \ --without-memory-manager \ --enable-ft-thread \ CC=gcc CXX=g++ \ F77=gfortran FC=gfortran -- The source code is at the link below: https://svn.open-mpi.org/trac/ompi/browser/trunk/opal/mca/crs/self Does anyone have any thoughts on what might be going wrong here? Thanks, Josh ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Shared library versioning
On Wed, 29 Jul 2009, Jeff Squyres wrote: On Jul 28, 2009, at 1:56 PM, Ralf Wildenhues wrote: - support files are not versioned (e.g., show_help text files) - include files are not versioned (e.g., mpi.h) - OMPI's DSOs actually are versioned, but more work would be needed in this area to make that versioning scheme work in real world scenarios - ...and probably some other things that I'm not thinking of... You can probably solve most of these issues by just versioning the directory names where you put the files; and with some luck, some downstream distribution can achieve this by merely passing a bunch of --foodir=... options to configure. This is probably true -- we do obey all the Autoconf-specified directories, so overriding --foodir= should work. It may break things like mpirun --prefix behavior, though. But I think that the executables would be problematic -- you'd only have 1 mpirun, orted, etc. OMPI does *not* currently handle the Autoconf --program-* configure options properly. I confess to not recalling the specific issues, but I recall we had long discussions about them -- the issues are quite tangled and complicated. And I remember coming to the conclusion "not worth supporting those." FWIW, Chris is probably right that it's far easier to simply install different OMPI versions into different $prefix trees (IMHO). Agreeed. I was looking at the versioning of shared libraries not as a way to allow multiple installs in the same prefix, but to allow users to know when it was time to recompile their application. Brian
Re: [OMPI devel] libtool issue with crs/self
Never mind, I'm an idiot. I still don't like the wrappers around lt_dlopen in util, but it might be your best option. Are you looking for symbols in components or the executable? I assumed the executable, in which case you might be better off just using dlsym() directly. If you're looking for a symbol in the first place it's found, then you can just do: dlsym(RTLD_DEFAULT, symbol); The lt_dlsym only really helps if you're running on really obscure platforms which don't support dlsym and loading "preloaded" components. Brian On Wed, 29 Jul 2009, Brian W. Barrett wrote: What are you trying to do with lt_dlopen? It seems like you should always go through the MCA base utilities. If one's missing, adding it there seems like the right mechanism. Brian
Re: [OMPI devel] Device failover on ob1
On Sun, 2 Aug 2009, Ralph Castain wrote: Perhaps a bigger question needs to be addressed - namely, does the ob1 code need to be refactored? Having been involved a little in the early discussion with bull when we debated over where to put this, I know the primary concern was that the code not suffer the same fate as the dr module. We have since run into a similar issue with the checksum module, so I know where they are coming from. The problem is that the code base is adjusted to support changes in ob1, which is still being debugged. On the order of 95% of the code in ob1 is required to be common across all the pml modules, so the rest of us have to (a) watch carefully all the commits to see if someone touches ob1, and then (b) manually mirror the change in our modules. This is not a supportable model over the long-term, which is why dr has died, and checksum is considering integrating into ob1 using configure #if's to avoid impacting non-checksum users. Likewise, device failover has been treated similarly here - i.e., configure out the added code unless someone wants it. This -does- lead to messier source code with these #if's in it. If we can refactor the ob1 code so the common functionality resides in the base, then perhaps we can avoid this problem. Is it possible? I think Ralph raises a good point - we need to think about how to allow better use of OB1's code base between consumers like checksum and failover. The current situation is problematic to me, for the reasons Ralph cited. However, since the ob1 structures and code have little use for PMLs such as CM, I'd rather not push the code into the base - in the end, it's very specific to a particular PML implementation and the code pushed into the base already made things much more interesting in implementing CM than I would have liked. DR is different in this conversation, as it was almost entirely a seperate implementation from ob1 by the end, due to the removal of many features and the addition of many others. However, I think there's middle ground here which could greatly improve the current situation. With the proper refactoring, there's no technical reason why we couldn't move the checksum functionality into ob1 and add the failover to ob1, with no impact on performance when the functionality isn't used and little impact on code readability. So, in summary, refactor OB1 to support checksum / failover good, pushing ob1 code into base bad. Brian
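One way to read "no impact on performance when the functionality isn't used" is compile-time specialization of the kind already mentioned for checksum. The sketch below is a generic illustration with an invented OB1_WANT_CSUM switch and a toy additive checksum, not the actual ob1 or csum code: when the switch is off, the checksum call compiles away entirely.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #ifndef OB1_WANT_CSUM
    #define OB1_WANT_CSUM 0   /* flip to 1 (e.g. from configure) to enable */
    #endif

    #if OB1_WANT_CSUM
    static uint32_t toy_csum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++) sum += buf[i];
        return sum;
    }
    #define OB1_CSUM(hdr, buf, len) ((hdr)->csum = toy_csum((buf), (len)))
    #else
    #define OB1_CSUM(hdr, buf, len) \
        do { (void)(hdr); (void)(buf); (void)(len); } while (0)
    #endif

    struct toy_hdr { uint32_t csum; };

    static void send_fragment(const unsigned char *buf, size_t len)
    {
        struct toy_hdr hdr = { 0 };
        OB1_CSUM(&hdr, buf, len);   /* no-op when checksums are disabled */
        printf("sending %zu bytes, csum field = %u\n", len, (unsigned) hdr.csum);
    }

    int main(void)
    {
        unsigned char data[4] = { 1, 2, 3, 4 };
        send_fragment(data, sizeof(data));
        return 0;
    }

The same pattern extends to failover hooks; the readability cost Ralph worries about comes from how many such hooks end up scattered through the send/receive paths.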
Re: [OMPI devel] libtool issue with crs/self
Josh - Just in case it wasn't clear -- if you're only looking for a symbol in the executable (which you know is there), you do *NOT* have to dlopen() the executable first (you do with libtool to support the "i don't have dynamic library support" mode of operation). You only have to dlsym() with RTLD_DEFAULT, as the symbol is already in the process space. It does probably mean we can't support self on platforms without dlsym(), but that set is extremely small and since we don't use libtool to link the final executable, the lt_dlsym wrappers wouldn't have worked anyway. Brian On Wed, 5 Aug 2009, George Bosilca wrote: Josh, These look like two different issues to me. One is how some modules from Open MPI can use libltdl, and for this you highlighted the issue. The second is that the users who want to use the self CRS have to make sure the symbols required by self CRS are visible in their application. This is clearly an item for the FAQ. george. On Aug 5, 2009, at 10:51 , Josh Hursey wrote: As an update on this thread. I had a bit of time this morning to look into this. I noticed that the "-fvisibility=hidden" option when passed to libltdl will cause it to fail in its configure test for: "checking whether a program can dlopen itself" This is because the symbol they are trying to look for with dlsym() is not postfixed with: __attribute__ ((visibility("default"))) If I do that, then the test passes correctly. I am not sure if this is a configure bug in Libtool or not. But what it means is that even with the wrapper around the OPAL libltdl routines, it is not useful to me since I need to open the executable to examine it for the necessary symbols. So I might try to go down the track of using dlopen/dlsym/dlclose directly instead of through the libtool interfaces. However I just wanted to mention that this is happening in case there are other places in the codebase that ever want to look into the executable for symbols, and find that lt_dlopen() fails in non-obvious ways. -- Josh
Re: [OMPI devel] libtool issue with crs/self
On Wed, 5 Aug 2009, Josh Hursey wrote: On Aug 5, 2009, at 11:35 AM, Brian W. Barrett wrote: Josh - Just in case it wasn't clear -- if you're only looking for a symbol in the executable (which you know is there), you do *NOT* have to dlopen() the executable first (you do with libtool to support the "i don't have dynamic library support" mode of operation). You only have to dlsym() with RTLD_DEFAULT, as the symbol is already in the process space. So is it wrong to dlopen() before dlsym()? The patch I just committed in r21766 does this, since I was following the man page for dlopen() to make sure I was using it correctly. I don't know that it's "wrong", it's just not necessary. I believe that:

    handle = dlopen(NULL, RTLD_LOCAL|RTLD_LAZY);
    sym = dlsym(handle, "foo");
    dlclose(handle);

and

    sym = dlsym(RTLD_DEFAULT, "foo");

are functionally equivalent, but the second one means no handle to pass around :). Brian
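A self-contained illustration of the dlsym(RTLD_DEFAULT, ...) pattern from this thread: it looks up a function defined in the running executable without any dlopen(). The callback name is invented, and the build assumptions are noted in the comment -- on glibc, RTLD_DEFAULT needs _GNU_SOURCE, older glibc needs -ldl, and the executable's own symbols are only visible to the dynamic linker if it was linked with -rdynamic (or --export-dynamic), which is exactly the visibility issue Josh ran into.

    /* build (typical Linux, an assumption): cc -rdynamic probe.c -ldl -o probe */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>

    /* the kind of application-provided hook a crs-self-style component might
       probe for; the name is illustrative */
    void my_checkpoint_callback(void)
    {
        printf("checkpoint callback ran\n");
    }

    int main(void)
    {
        void (*fn)(void);

        /* no dlopen(NULL, ...) needed: the executable's exported symbols are
           already in the default search scope */
        *(void **)(&fn) = dlsym(RTLD_DEFAULT, "my_checkpoint_callback");
        if (fn == NULL) {
            printf("symbol not found: %s\n", dlerror());
            return 1;
        }
        fn();
        return 0;
    }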
Re: [OMPI devel] RFC: PML/CM priority
On Tue, 11 Aug 2009, Rainer Keller wrote: When compiling on systems with MX or Portals, we offer MTLs and BTLs. If MTLs are used, the PML/CM is loaded as well as the PML/OB1. Question 1: Is favoring OB1 over CM required for any MTL (MX, Portals, PSM)? George has in the past had strong feelings on this issue, believing that for MX, OB1 is preferred over CM. For Portals, it's probably in the noise, but the BTL had been better tested than the MTL, so it was left as the default. Obviously, PSM is a much better choice on InfiniPath than straight OFED, hence the odd priority bump. At this point, I would have no objection to making CM's priority higher for Portals. Question 2: If it is, I would like to reflect this in the default priorities, i.e., have CM have a lower priority than OB1, and raise it in the case of PSM. I don't have strong feelings on this one. Brian
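For readers unfamiliar with what these priority numbers do, the selection logic boils down to asking every usable component for a priority and keeping the highest bidder. The sketch below is a generic illustration with made-up names and numbers, not the actual mca_pml_base_select() code.

    #include <stdio.h>

    struct component {
        const char *name;
        int available;   /* e.g. did the MTL/BTLs underneath initialize? */
        int priority;    /* e.g. pml_ob1_priority, pml_cm_priority */
    };

    static const struct component *select_highest(const struct component *c, int n)
    {
        const struct component *best = NULL;
        for (int i = 0; i < n; i++) {
            if (!c[i].available) continue;
            if (best == NULL || c[i].priority > best->priority) best = &c[i];
        }
        return best;
    }

    int main(void)
    {
        /* illustrative numbers only */
        struct component pmls[] = {
            { "ob1", 1, 20 },
            { "cm",  1, 30 },   /* e.g. bumped because an MTL such as PSM is usable */
        };
        const struct component *chosen = select_highest(pmls, 2);
        printf("selected pml: %s\n", chosen ? chosen->name : "none");
        return 0;
    }

Lowering CM's default below OB1 and bumping it only when PSM is present is just a question of which numbers the components report here.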
Re: [OMPI devel] Oversubscription/Scheduling Bug
On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote: You can see this by slightly modifying your test command -- run "env" instead of "hostname". You'll see that the environment variable OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on the mpirun command line, regardless of a) whether you're oversubscribing or not, and b) whatever is passed in through the orted. While Jeff is correct that the parameter informing the MPI process that it should idle when it's not busy is correctly set, it turns out that we are ignoring this parameter inside the MPI process. I'm looking into this and hope to have a fix this afternoon. Brian
Re: [OMPI devel] Oversubscription/Scheduling Bug
On Fri, 26 May 2006, Brian W. Barrett wrote: On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote: You can see this by slightly modifying your test command -- run "env" instead of "hostname". You'll see that the environment variable OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on the mpirun command line, regardless of a) whether you're oversubscribing or not, and b) whatever is passed in through the orted. While Jeff is correct that the parameter informing the MPI process that it should idle when it's not busy is correctly set, it turns out that we are ignoring this parameter inside the MPI process. I'm looking into this and hope to have a fix this afternoon. Mea culpa. Jeff's right that in a normal application, we are setting up to call sched_yield() when idle if the user sets mpi_yield_when_idle to 1, regardless of what is in the hostfile . The problem with my test case was that for various reasons, my test code was never actually "idling" - there were always things moving along, so our progress engine was deciding that the process should not be idled. Can you share your test code at all? I'm wondering if something similar is happening with your code. It doesn't sound like it should be "always working", but I'm wondering if you're triggering some corner case we haven't thought of. Brian -- Brian Barrett Graduate Student, Open Systems Lab, Indiana University http://www.osl.iu.edu/~brbarret/
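A toy illustration of what honoring the parameter means inside the progress loop: if an iteration makes no progress and yield-when-idle is set (as it is when the node is oversubscribed), the process gives up its timeslice instead of spinning. The poll_everything helper and the direct environment-variable parsing are simplifications, not Open MPI's actual opal_progress() code; the variable name itself is the one quoted above.

    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* pretend event source: reports progress on the first few iterations only */
    static int poll_everything(int iter) { return iter < 3; }

    int main(void)
    {
        /* stand-in for the mpi_yield_when_idle MCA parameter, which mpirun
           exports as OMPI_MCA_mpi_yield_when_idle */
        const char *env = getenv("OMPI_MCA_mpi_yield_when_idle");
        int yield_when_idle = (env != NULL && atoi(env) != 0);

        for (int iter = 0; iter < 10; iter++) {
            int made_progress = poll_everything(iter);
            if (!made_progress && yield_when_idle) {
                sched_yield();   /* be polite when oversubscribed */
                printf("iteration %d: idle, yielded\n", iter);
            } else {
                printf("iteration %d: %s\n", iter,
                       made_progress ? "progressed" : "idle, kept spinning");
            }
        }
        return 0;
    }

The corner case described above corresponds to poll_everything() always returning 1: if something is always moving, the yield branch is simply never taken.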
Re: [OMPI devel] memory_malloc_hooks.c and dlclose()
On Mon, 22 May 2006, Neil Ludban wrote: I'm getting a core dump when using openmpi-1.0.2 with the MPI extensions we're developing for the MATLAB interpreter. This same build of openmpi is working great with C programs and our extensions for gnu octave. The machine is AMD64 running Linux: Linux kodos 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux I believe there's a bug in that opal_memory_malloc_hooks_init() links itself into the __free_hook chain during initialization, but then it never unlinks itself at shutdown. In the interpreter environment, libopal.so is dlclose()d and unmapped from memory long before the interpreter is done with dynamic memory. A quick check of the nightly trunk snapshot reveals some function name changes, but no new shutdown code. Can you try the attached patch and see if it solves your problem? I think it will, but I don't have a great way of testing your exact situation. Thanks, Brian -- Brian Barrett Graduate Student, Open Systems Lab, Indiana University http://www.osl.iu.edu/~brbarret/

Index: opal/mca/memory/malloc_hooks/memory_malloc_hooks.c
===
--- opal/mca/memory/malloc_hooks/memory_malloc_hooks.c (revision 10123)
+++ opal/mca/memory/malloc_hooks/memory_malloc_hooks.c (working copy)
@@ -27,6 +27,7 @@
 /* Prototypes for our hooks. */
 void opal_memory_malloc_hooks_init(void);
+void opal_memory_malloc_hooks_finalize(void);
 static void opal_mem_free_free_hook (void*, const void *);
 static void* opal_mem_free_realloc_hook (void*, size_t, const void *);
@@ -60,6 +61,18 @@
 }
+void
+opal_memory_malloc_hooks_finalize(void)
+{
+    if (initialized == 0) {
+        return;
+    }
+
+    __free_hook = old_free_hook;
+    __realloc_hook = old_realloc_hook;
+    initialized = 0;
+}
+
 static void
 opal_mem_free_free_hook (void *ptr, const void *caller)
 {

Index: opal/mca/memory/malloc_hooks/memory_malloc_hooks_component.c
===
--- opal/mca/memory/malloc_hooks/memory_malloc_hooks_component.c (revision 10123)
+++ opal/mca/memory/malloc_hooks/memory_malloc_hooks_component.c (working copy)
@@ -22,8 +22,10 @@
 #include "opal/include/constants.h"
 extern void opal_memory_malloc_hooks_init(void);
+extern void opal_memory_malloc_hooks_finalize(void);
 static int opal_memory_malloc_open(void);
+static int opal_memory_malloc_close(void);
 const opal_memory_base_component_1_0_0_t mca_memory_malloc_hooks_component = {
     /* First, the mca_component_t struct containing meta information
@@ -41,7 +43,7 @@
     /* Component open and close functions */
     opal_memory_malloc_open,
-    NULL
+    opal_memory_malloc_close
 },
 /* Next the MCA v1.0.0 component meta data */
@@ -58,3 +60,10 @@
     opal_memory_malloc_hooks_init();
     return OPAL_SUCCESS;
 }
+
+static int
+opal_memory_malloc_close(void)
+{
+    opal_memory_malloc_hooks_finalize();
+    return OPAL_SUCCESS;
+}
Re: [OMPI devel] configure & Fortran problem
Before you go off and file a bug, this is not an Open MPI issue, but a Windows/Autoconf issue. Please don't file a bug on this, or I'm just going to have to close it as notabug... Brian On Fri, 6 Oct 2006, Jeff Squyres wrote: Oops. That's a bug. I'll file a ticket. On 10/5/06 12:51 PM, "George Bosilca" wrote: I have a problem with configure if no Fortran compilers are detected. It stops with the following error: configure: error: Cannot support Fortran MPI_ADDRESS_KIND! As there are no F77 or F90 compilers installed on this machine, it makes sense to not be able to support MPI_ADDRESS_KIND ... but as there are no Fortran compilers we should not care about it. I tried to manually disable all kinds of Fortran support but the error is always the same. Any clues? Thanks, george. "We must accept finite disappointment, but we must never lose infinite hope." Martin Luther King -- Brian Barrett Graduate Student, Open Systems Lab, Indiana University http://www.osl.iu.edu/~brbarret/
[OMPI devel] Shared memory file changes
Hi all - A couple of weeks ago, I committed some changes to the trunk that greatly reduced the size of the shared memory file for small numbers of processes. I haven't heard any complaints (the non-blocking send/receive issue is at proc counts greater than the size this patch affected). Anyone object to moving this to the v1.2 branch (with reviews, of course). Brian -- Brian Barrett Graduate Student, Open Systems Lab, Indiana University http://www.osl.iu.edu/~brbarret/
[OMPI devel] configure changes tonight
Hi all - There will be three configure changes committed to the trunk tonight: - Some cleanups resulting from the update to the wrapper compilers for 32/64 bit support - A new configure option to deal with some fixes for the MPI::SEEK_SET (and friends) issue - Some cleanups in the pthreads configure tests The only real affect for everyone should be that you'll have to re-autogen.sh. And that the 32/64 include and libdir flags will no longer be available. I will be updating the wiki shortly w.r.t. how to build a multilib wrapper compiler shortly. Brian -- Brian Barrett Graduate Student, Open Systems Lab, Indiana University http://www.osl.iu.edu/~brbarret/
Re: [OMPI devel] help config.status to not mess up substitutions
Thanks, I'll apply ASAP. Brian On Mon, 23 Oct 2006, Ralf Wildenhues wrote: Please apply this robustness patch, which helps to avoid accidental unwanted substitutions done by config.status. From all I can tell, they do not happen now, but first the Autoconf manual warns against them, second they make some config.status optimizations so much more difficult than necessary. :-) In unrelated news, I tested Automake 1.10 with OpenMPI, and it saves about 15s of config.status time, and about half a minute of `make dist' time on my system. Some pending Fortran changes have only made it into Automake after 1.10 was released. Cheers, Ralf 2006-10-23 Ralf Wildenhues * opal/tools/wrappers/Makefile.am: Protect manual substitutions from config.status. * ompi/tools/wrappers/Makefile.am: Likewise. * orte/tools/wrappers/Makefile.am: Likewise. Index: opal/tools/wrappers/Makefile.am === --- opal/tools/wrappers/Makefile.am (revision 12254) +++ opal/tools/wrappers/Makefile.am (working copy) @@ -76,8 +76,8 @@ opalcc.1: opal_wrapper.1 rm -f opalcc.1 - sed -e 's/@COMMAND@/opalcc/g' -e 's/@PROJECT@/Open PAL/g' -e 's/@PROJECT_SHORT@/OPAL/g' -e 's/@LANGUAGE@/C/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalcc.1 + sed -e 's/[@]COMMAND[@]/opalcc/g' -e 's/[@]PROJECT[@]/Open PAL/g' -e 's/[@]PROJECT_SHORT[@]/OPAL/g' -e 's/[@]LANGUAGE[@]/C/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalcc.1 opalc++.1: opal_wrapper.1 rm -f opalc++.1 - sed -e 's/@COMMAND@/opalc++/g' -e 's/@PROJECT@/Open PAL/g' -e 's/@PROJECT_SHORT@/OPAL/g' -e 's/@LANGUAGE@/C++/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalc++.1 + sed -e 's/[@]COMMAND[@]/opalc++/g' -e 's/[@]PROJECT[@]/Open PAL/g' -e 's/[@]PROJECT_SHORT[@]/OPAL/g' -e 's/[@]LANGUAGE[@]/C++/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > opalc++.1 Index: ompi/tools/wrappers/Makefile.am === --- ompi/tools/wrappers/Makefile.am (revision 12254) +++ ompi/tools/wrappers/Makefile.am (working copy) @@ -84,20 +84,20 @@ mpicc.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 rm -f mpicc.1 - sed -e 's/@COMMAND@/mpicc/g' -e 's/@PROJECT@/Open MPI/g' -e 's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/C/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicc.1 + sed -e 's/[@]COMMAND[@]/mpicc/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/C/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicc.1 mpic++.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 rm -f mpic++.1 - sed -e 's/@COMMAND@/mpic++/g' -e 's/@PROJECT@/Open MPI/g' -e 's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/C++/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpic++.1 + sed -e 's/[@]COMMAND[@]/mpic++/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/C++/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpic++.1 mpicxx.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 rm -f mpicxx.1 - sed -e 's/@COMMAND@/mpicxx/g' -e 's/@PROJECT@/Open MPI/g' -e 's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/C++/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicxx.1 + sed -e 's/[@]COMMAND[@]/mpicxx/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/C++/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpicxx.1 mpif77.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 rm -f mpif77.1 - sed -e 's/@COMMAND@/mpif77/g' -e 's/@PROJECT@/Open MPI/g' -e 's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/Fortran 77/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > 
mpif77.1 + sed -e 's/[@]COMMAND[@]/mpif77/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/Fortran 77/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif77.1 mpif90.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 rm -f mpif90.1 - sed -e 's/@COMMAND@/mpif90/g' -e 's/@PROJECT@/Open MPI/g' -e 's/@PROJECT_SHORT@/OMPI/g' -e 's/@LANGUAGE@/Fortran 90/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif90.1 + sed -e 's/[@]COMMAND[@]/mpif90/g' -e 's/[@]PROJECT[@]/Open MPI/g' -e 's/[@]PROJECT_SHORT[@]/OMPI/g' -e 's/[@]LANGUAGE[@]/Fortran 90/g' < $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 > mpif90.1 Index: orte/tools/wrappers/Makefile.am === --- orte/tools/wrappers/Makefile.am (revision 12254) +++ orte/tools/wrappers/Makefile.am (working copy) @@ -51,8 +51,8 @@ ortecc.1: $(top_srcdir)/opal/tools/wrappers/opal_wrapper.1 rm -f ortecc.1 - sed -e 's/@COMMAND@/ortecc/g' -e 's/@PROJECT@/OpenRTE/g' -e 's/@PROJECT_SHORT@/ORTE/g' -e
Re: [OMPI devel] New oob/tcp?
The create_listen_thread code should be on both the trunk and v1.2 branch right now. You are correct that the heterogeneous fixes haven't moved just yet, because they aren't quite right. Hope to have that fixed in the near future... brian On Wed, 25 Oct 2006, Ralph H Castain wrote: There are a number of things in the trunk that haven't been moved over to 1.2 branch yet. They are coming shortly, though...once the merge is done, you might get a few more conflicts, but it shouldn't be too bad. On 10/25/06 7:06 AM, "Adrian Knoth" wrote: On Wed, Oct 25, 2006 at 02:48:33PM +0200, Adrian Knoth wrote: I don't see any new component, Adrian. There have been a few updates to the existing component, some of which might cause conflicts with the merge, but those shouldn't be too hard to resolve. Ok, I just saw something with "create_listen_thread" and so on, but didn't look closer. The "new" (current) oob/tcp (in the v1.2 branch) does not have Brian's fix for #493. (the following constant is missing, the code, too) MCA_OOB_TCP_ADDR_TYPE_AFINET There are probably more differences... If you want, I can do the merge and we'll use my IPv6 oob with all the patches up to r12050. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Brian Barrett Graduate Student, Open Systems Lab, Indiana University http://www.osl.iu.edu/~brbarret/
Re: [OMPI devel] Building OpenMPI on windows
At one point, a long time ago (before anyone started working on the native windows port), I had unpatched OMPI tarballs building on Cygwin, using Cygwin's gcc. Which I believe is all Greg and Beth want to do for now. But I believe that the recent code to support Windows natively has caused some issues with our configure script when trying to run in that mode. Brian On Nov 18, 2006, at 12:39 PM, George Bosilca wrote: I'm impressed that it work with cygwin out of the box. Last time I tried, I had to patch the libtool, do some manual modifications of the configure script (of course after altering some of the .m4 files). It worked, I was able to run a simple ping-pong program, but it took me something like 4 hours to compile. I'm out of office for the next week. I can give a try to the whole cygwin/SFU once I get back. Thanks, george. On Nov 18, 2006, at 9:22 AM, Jeff Squyres wrote: I don't know if we're tried cygwin for a long, long time... My gut reaction is that it "should work" (the wrappers are pretty simple C), but I don't have any cygwin resources to test / fix this. :-( George -- got any insight? On Nov 16, 2006, at 4:44 PM, Ralph Castain wrote: I'm not sure about running under cygwin at this stage - I have compiled the code base there before as you did, but never tried to run anything in that environment. However, I believe 1.2 will operate under Windows itself. Of course, that means using the Windows compilers...but if you have those, you should be able to run. I'll have to defer to my colleagues who wrote those wrapper compilers as to why cygwin might be taking offense. They are all at the Supercomputing Expo this week, so response may be a little delayed. Ralph On 11/16/06 1:54 PM, "Beth Tibbitts" wrote: I'm trying to build OpenMPI on windows with cygwin, to at least be able to demo the Eclipse PTP(Parallel Tools Platform) on my laptop. I configured OpenMPI version 1.2 (openmpi-1.2b1) with the following command: ./configure --with-devel-headers --enable-mca-no-build=timer- windows then did make all and make install, which all seemed to finish ok When i try to compile a small test mpi program I get a segfault $ mpicc mpitest.c Signal:11 info.si_errno:0(No error) si_code:23() Failing at addr:0x401a06 *** End of error message *** 15 [main] mpicc 7036 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack) Segmentation fault (core dumped) ...Beth Beth Tibbitts (859) 243-4981 (TL 545-4981) High Productivity Tools / Parallel Tools http://eclipse.org/ptp IBM T.J.Watson Research Center Mailing Address: IBM Corp., 455 Park Place, Lexington, KY 40511 ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Server Virtualization Business Unit Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] Build system changes
Hi all - Just wanted to give everyone a heads up that there will be two changes to the build system that should have minimal impact on everyone, but are worth noting: 1) If you are using Autoconf 2.60 or later, you *MUST* be using Automake 1.10 or later. Most people are still using AC 2.59, so this should have zero impact on the group. 2) We will now be checking to make sure that the C++, F77, F90, and ObjC compilers can link against the C compiler. This should clean up some of the amorphous errors people have been getting when they do something like: 'CFLAGS=-m32 CXXFLAGS=-m64', usually by not specifying one of the two... Brian
Re: [OMPI devel] incorrect definition of MPI_ERRCODES_IGNORE?
Thanks for the bug report. You are absolutely correct - the #define is incorrect in Open MPI. I've committed a fix to our development trunk and it should be included in future releases. In the meantime, it is safe to change the line in the installed mpi.h for Open MPI from:

    #define MPI_ERRCODES_IGNORE ((void *) 0)   /* don't return error codes */

to

    #define MPI_ERRCODES_IGNORE ((int *) 0)    /* don't return error codes */

Since it's a simple cast, there is no need to recompile Open MPI's libmpi -- modifying the installed mpi.h is safe. Thanks again, Brian > OPEN MPI folks, > > Please see the possible error in your code, if this is indeed an error > on your part we would appreciate a fix as soon as possible so that we do > not have to direct our users to other MPI implementations. > > Thanks > > Barry > > On Fri, 29 Dec 2006, Satish Balay wrote: > >> Looks like there are some issues with MPI_Spawn() and OpenMPI. >> >> libfast in: >> /Volumes/MaxtorUFS1/geoframesvn/tools/petsc-dev/src/sys/objects >> mpinit.c: In function 'PetscErrorCode PetscOpenMPSpawn(PetscMPIInt)': >> mpinit.c:73: error: invalid conversion from 'void*' to 'int*' >> mpinit.c:73: error: initializing argument 8 of 'int MPI_Comm_spawn(char*, char**, int, ompi_info_t*, int, ompi_communicator_t*, ompi_communicator_t**, int*)' >> ar: mpinit.o: No such file or directory >> >> ierr = MPI_Comm_spawn(programname,argv,nodesize-1,MPI_INFO_NULL,0,PETSC_COMM_SELF,&children,MPI_ERRCODES_IGNORE);CHKERRQ(ierr); >> >> Looks like using MPI_ERRCODES_IGNORE in that function call is correct. However OpenMPI declares it as '((void *) 0)', giving a compile error with C++. [MPICH declares it as '(int *)0' - which doesn't give any compile errors]. >> >> I guess the following change should work - but I suspect this is an openmpi bug.. I don't think its appropriate to make this change in PETSc code.. >> >> ierr = MPI_Comm_spawn(programname,argv,nodesize-1,MPI_INFO_NULL,0,PETSC_COMM_SELF,&children,(int*) MPI_ERRCODES_IGNORE);CHKERRQ(ierr); >> >> Satish >> >> On Fri, 29 Dec 2006, Charles Williams wrote: >> > Hi, >> > I'm not sure if this is a problem with PETSc or OpenMPI. Things were building OK on December 19, and this problem has crept in since then. Thanks for any ideas. >> > Thanks, >> > Charles >> > Charles A. Williams >> > Dept. of Earth & Environmental Sciences >> > Science Center, 2C01B >> > Rensselaer Polytechnic Institute >> > Troy, NY 12180 >> > Phone: (518) 276-3369 >> > FAX: (518) 276-2012 >> > e-mail: will...@rpi.edu
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r12945
Because that's what we had been using and I was going for minimal change (since this is for v1.2). Also note that *none* of this code is in performance critical areas. Last I checked, we don't really care how fast attribute updates and error handlers are fired... I think there are much better ways of dealing with all the problems addressed below, but to do it right means a fairly sizable change and that seemed like a bad idea at this time. Brian On Jan 2, 2007, at 9:06 AM, George Bosilca wrote: Using a STL map to keep in relation the C pointer with the C++ object isn't that way more expensive that it is supposed to be ? The STL map is just a hash table, it can be as optimized as you want, it's still a hash table. How about using exactly the same mechanism as for the Fortran handler ? It's cheap, it's based on an array, it's thread save and we just reuse the code already there. george. On Dec 30, 2006, at 6:41 PM, brbar...@osl.iu.edu wrote: Author: brbarret Date: 2006-12-30 18:41:42 EST (Sat, 30 Dec 2006) New Revision: 12945 Added: trunk/ompi/mpi/cxx/datatype.cc trunk/ompi/mpi/cxx/file.cc trunk/ompi/mpi/cxx/win.cc Modified: trunk/ompi/errhandler/errhandler.c trunk/ompi/errhandler/errhandler.h trunk/ompi/mpi/cxx/Makefile.am trunk/ompi/mpi/cxx/comm.cc trunk/ompi/mpi/cxx/comm.h trunk/ompi/mpi/cxx/comm_inln.h trunk/ompi/mpi/cxx/datatype.h trunk/ompi/mpi/cxx/datatype_inln.h trunk/ompi/mpi/cxx/errhandler.h trunk/ompi/mpi/cxx/file.h trunk/ompi/mpi/cxx/file_inln.h trunk/ompi/mpi/cxx/functions.h trunk/ompi/mpi/cxx/functions_inln.h trunk/ompi/mpi/cxx/intercepts.cc trunk/ompi/mpi/cxx/mpicxx.cc trunk/ompi/mpi/cxx/mpicxx.h trunk/ompi/mpi/cxx/win.h trunk/ompi/mpi/cxx/win_inln.h Log: A number of MPI-2 compliance fixes for the C++ bindings: * Added Create_errhandler for MPI::File * Make errors_throw_exceptions a first-class predefined exception handler, and make it work for Comm, File, and Win * Deal with error handlers and attributes for Files, Types, and Wins like we do with Comms - can't just cast the callbacks from C++ signatures to C signatures. Callbacks will then fire with the C object, not the C++ object. That's bad. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI devel] 1.2b3 fails on bluesteel
On Jan 22, 2007, at 10:39 AM, Greg Watson wrote: On Jan 22, 2007, at 9:48 AM, Ralph H Castain wrote: On 1/22/07 9:39 AM, "Greg Watson" wrote: I tried adding '-mca btl ^sm -mca mpi_preconnect_all 1' to the mpirun command line but it still fails with identical error messages. I don't understand the issue with allocating nodes under bproc. Older versions of OMPI have always just queried bproc for the nodes that have permissions set so I can execute on them. I've never had to allocate any nodes using a hostfile or any other mechanism. Are you saying that this no longer works? Turned out that mode of operation was a "bug" that caused all kinds of problems in production environments - that's been fixed for quite some time. So, yes - you do have to get an official "allocation" of some kind. Even the changes I mentioned wouldn't remove that requirement in the way you describe. BTW, there's no requirement for a bproc system to employ a job scheduler. So in my view OMPI is "broken" for bproc systems if it imposes such a requirement. I agree that the present assumption that BProc requires LSF to be in use is broken, and we will have a fix for that shortly. However, we will still require a resource allocator of some sort (even a hostfile should work) to tell us which nodes to run on. It should be possible to write a resource allocator that just grabs nodes out of the available pool returned by the bproc status functions, but I don't believe that's on the to-do list in the near future... Brian -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
[OMPI devel] Libtool update for v1.2
Hi all - In December I had brought up the idea of updating the snapshot of Libtool 2 that we use for building the v1.2 branch to a more recent snapshot. The group seemed to think this was a good idea and I was going to do it, then got sidetracked working around a bug in their support for dylib (OS X's shared libraries). I committed a workaround to the trunk today for the bug (as well as sending one of the LIbtool developers a patch to libtool that resolves the issue). Once I hear back from Ralf (the LT developer), I'd like to finally do the LT update for our v1.2 tarballs. The advantage to us is slightly faster builds, fixed convenience library dependencies (no more having to set LIBS=/usr/lib64), and more bug fixes. Does this still sound agreeable to everyone? Brian -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
[OMPI devel] v1.2 / trunk tarball libtool change
Hi all - As of tonight, the version of Libtool used to build "official" tarballs for the v1.2 branch and the trunk (this includes nightly snapshots, beta releases, and official releases) has been updated from a snapshot of Libtool 2 from June/July 2006 to one from Jan 23, 2007. This update will solve a number of problems, including the multilib .la problem that has bitten a few people over the past years. I also made a copy of the Libtool 2 snapshot we're using to build our tarballs available on the SVN building page, so that people who wish to use the exact same Libtool version as the nightly snapshots for their development can do so. http://www.open-mpi.org/svn/building.php Note that no change is required on your part. You do not have to update the copy of Libtool you use for regular testing or development. Brian -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI devel] [OMPI svn] svn:open-mpi r13644
On Feb 13, 2007, at 5:16 PM, Jeff Squyres wrote: On Feb 13, 2007, at 7:10 PM, George Bosilca wrote: It's already in the 1.2 !!! I don't know much you care about performance, but I do. This patch increase by 10% the latency. It might be correct for the pathscale compiler, but it didn't look as a huge requirement for all others compilers. A memory barrier for an initialization and an unlock definitively looks like killing an ant with a nuclear strike. Can we roll this back and find some other way? Yes, we can. It's not actually the memory barrier we need, it's the tell the compiler to not do anything stupid because we expect memory to be invalidated that we need. I'll commit a new, different fix tonight. Brian -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI devel] [OMPI svn] svn:open-mpi r13644
On Feb 13, 2007, at 7:37 PM, Brian W. Barrett wrote: On Feb 13, 2007, at 5:16 PM, Jeff Squyres wrote: On Feb 13, 2007, at 7:10 PM, George Bosilca wrote: It's already in the 1.2 !!! I don't know much you care about performance, but I do. This patch increase by 10% the latency. It might be correct for the pathscale compiler, but it didn't look as a huge requirement for all others compilers. A memory barrier for an initialization and an unlock definitively looks like killing an ant with a nuclear strike. Can we roll this back and find some other way? Yes, we can. It's not actually the memory barrier we need, it's the tell the compiler to not do anything stupid because we expect memory to be invalidated that we need. I'll commit a new, different fix tonight. Upon further review, I'm wrong again. The original patch was wrong (not sure what I was thinking this afternoon) and my statement above is wrong. So the problem starts with the code: a = 1 mylock->lock = 0 b = 2 Which is essentially what you have after inlining the atomic unlock as it occurred today. It's not totally unreasonable for a compiler (and we have seen this in practice, including with GCC on LA-MPI and likely are having it happen now, just don't realize it) to reorder that to: a = 1 b = 2 mylock->lock = 0 or mylock->lock = 0 a = 1 b = 2 After all, there's no memory dependencies in the three lines of code. When we had the compare and swap for unlock, there was a memory dependency, because the compare and swap inline assembly hinted to the compiler that memory was changed by the op and it shouldn't reorder memory accesses across that boundary or the compare and swap wasn't inlined. Compilers are pretty much not going to reorder memory accesses across a function unless it's 100% clear that there is not a side effect that might be important, which is basically never in C. Ok, so we can tell the compiler not to reorder memory access with a little care (either compiler hints using inline assembly statements that include the "memory" invalidation hint) or by making atomic_unlock a function. But now we start running on hardware, and the memory controller is free to start reordering code. We don't have any instructions telling the CPU / memory controller not to reorder our original instructions, so it can still do either one of the two bad cases. Still not good for us and definitely could lead to incorrect programs. So we need a memory barrier or we have potentially invalid code. The full memory barrier is totally overkill for this situation, but some memory barrier is needed. While not quite correct, I believe that something like; static inline void opal_atomic_unlock(opal_atomic_lock_t *lock) { opal_atomic_wmb(); lock->u.lock=OPAL_ATOMIC_UNLOCKED; } would be more correct than having the barrier after the write and slightly better performance than the full atomic barrier. On x86 and x86_64, memory barriers are "free", in that all they do is limit the compiler's reordering of memory access. But on PPC, Sparc, and Alpha, it would have a performance cost. Don't know what that cost is, but I know that we need to pay it for correctness. Long term, we should probably try to implement spinlocks as inline assembly. This wouldn't provide a whole lot of performance difference, but at least I could make sure the memory barrier is in the right place and help the compiler not be stupid. By the way, this is what the Linux kernel does, adding credence to my argument, I hope ;). Brian -- Brian Barrett Open MPI Team, CCS-1 Los Alamos National Laboratory
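A minimal sketch of the shape described above -- illustrative only, not the actual OPAL atomics code -- showing a write barrier placed before the release store; on x86 a compiler barrier alone is enough, while other architectures need a real barrier instruction:

/* Sketch under those assumptions; __sync_synchronize() is a heavier,
 * portable stand-in for a proper wmb on non-x86 targets. */
static inline void example_atomic_unlock(volatile int *lock_word)
{
#if defined(__x86_64__) || defined(__i386__)
    /* x86 keeps ordinary stores in order; just stop the compiler from
     * moving earlier stores below the release store. */
    __asm__ __volatile__ ("" : : : "memory");
#else
    __sync_synchronize();
#endif
    *lock_word = 0;   /* the release store, issued after the barrier */
}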
Re: [OMPI devel] [PATCH] ompi_get_libtool_linker_flags.m4: fix $extra_ldflags detection
Thanks for the bug report and the patch. Unfortunately, the remove smallest prefix pattern syntax doesn't work with Solaris /bin/sh (standards would be better if everyone followed them...), but I committed something to our development trunk that handles the issue. It should be releases as part of v1.2.1 (we're too far in testing to make it part of v1.2). Thanks, Brian On Feb 15, 2007, at 9:12 AM, Bert Wesarg wrote: Hello, when using a multi token CC variable (like "gcc -m32"), the logic to extract $extra_ldflags from libtool don't work. So here is a little hack to remove the $CC prefix from the libtool-link cmd. Bert Wesarg diff -ur openmpi-1.1.4/config/ompi_get_libtool_linker_flags.m4 openmpi-1.1.4-extra_ldflags-fix/config/ ompi_get_libtool_linker_flags.m4 --- openmpi-1.1.4/config/ompi_get_libtool_linker_flags.m4 2006-04-12 18:12:28.0 +0200 +++ openmpi-1.1.4-extra_ldflags-fix/config/ ompi_get_libtool_linker_flags.m4 2007-02-15 15:11:28.285844893 +0100 @@ -76,11 +76,15 @@ cmd="$libtool --dry-run --mode=link --tag=CC $CC bar.lo libfoo.la - o bar $extra_flags" ompi_check_linker_flags_work yes +# use array initializer to remove multiple spaces in $CC +tempCC=($CC) +tempCC="${tempCC[@]}" +output="${output#$tempCC}" +unset tempCC eval "set $output" extra_ldflags= while test -n "[$]1"; do case "[$]1" in -$CC) ;; *.libs/bar*) ;; bar*) ;; -I*) ;; ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] replace 'atoi' with 'strtol'
The patch is so that you can pass in hex in addition to decimal, right? I think that makes sense. But since we're switching to strtol, it might also make sense to add some error detection while we're at it. Not a huge deal, but it would be nice :). Brian > Hi, > > I want to add a patch to opal mca. > > This patch replaces an 'atoi' call with a 'strtol' call. > > If it's O.K with everyone I'll submit this patch by the end of the week. > > > > Index: opal/mca/base/mca_base_param.c > > === > > --- opal/mca/base/mca_base_param.c (revision 14391) > > +++ opal/mca/base/mca_base_param.c (working copy) > > @@ -1673,7 +1673,7 @@ > >if (NULL != param->mbp_env_var_name && > >NULL != (env = getenv(param->mbp_env_var_name))) { > > if (MCA_BASE_PARAM_TYPE_INT == param->mbp_type) { > > - storage->intval = atoi(env); > > + storage->intval = (int)strtol(env,(char**)NULL,0); > > } else if (MCA_BASE_PARAM_TYPE_STRING == param->mbp_type) { > >storage->stringval = strdup(env); > > } > > @@ -1714,7 +1714,7 @@ > > if (0 == strcmp(fv->mbpfv_param, param->mbp_full_name)) { > > if (MCA_BASE_PARAM_TYPE_INT == param->mbp_type) { > > if (NULL != fv->mbpfv_value) { > > -param->mbp_file_value.intval = > atoi(fv->mbpfv_value); > > +param->mbp_file_value.intval = > (int)strtol(fv->mbpfv_value,(char**)NULL,0); > > } else { > > param->mbp_file_value.intval = 0; > > } > > > > Thanks. > > > > Sharon. > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
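A sketch of the kind of error detection being suggested -- the function name and fallback behavior are illustrative, not the actual mca_base_param code:

#include <errno.h>
#include <stdlib.h>

/* Parse an integer parameter value; base 0 accepts decimal, hex (0x...),
 * and octal.  On garbage input, fall back to a default instead of
 * silently using whatever strtol happened to produce. */
static int parse_int_param(const char *str, int fallback)
{
    char *end = NULL;
    long val;

    errno = 0;
    val = strtol(str, &end, 0);
    if (errno != 0 || end == str || *end != '\0') {
        return fallback;
    }
    return (int) val;
}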
Re: [OMPI devel] replace 'atoi' with 'strtol'
> > Because the target variable is an (int). > > If I were writing the code, I would leave the cast out. By assigning > the value to an int variable, you get the same effect anyway, so the > cast is redundant. And if you ever change the variable to a long, now > you have to remember to delete the cast too. So I don't see any > upside to having the cast. > > But it's just a minor style issue... I agree 100% with Roland on this one. There's a reason that compilers don't complain about this particular cast. Casting from integer type to integer type just isn't a big deal in my book. Of course, I generally try to avoid casts at all costs, since they tend to cover real issues (see all the evil casts of long* to int* that have screwed us continually with 64-bit big endian machines). But I don't care enough to argue the point :). Brian
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14782
> On Sun, May 27, 2007 at 10:34:33AM -0600, Galen Shipman wrote: >> Actually, we still need MCA_BTL_FLAGS_FAKE_RDMA , it can be used as >> a hint for components such as one-sided. > What is the purpose of the hint if it should be set for each interconnect. > Just assume that it is set and behave accordingly. That what we decided > to do in OB1. And the name is not very good too :) All RDMA networks > behave like this. Yeah, I agree -- the current semantics aren't very useful anymore. I'd actually like to just redefine the FAKE_RDMA flag's meaning. Some of the BTLs assume that there will be one set of prepare_src / prepare_dst calls for each put/get call. This won't work for one-sided RDMA, were we'll call prepare_dst at window creation time and reuse it. I'd like to have FAKE_RDMA set for those BTLs. Brian
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r15474
So first, there's an error in the patch (e-mail with details coming shortly, as there are many errors in the patch). There's no need for both isends (the new one and the one in there already). Second, this is in code that's a crutch around the real issue, which is that for a very small class of applications, the way wireup occurs with InfiniBand makes it time consuming if the application is very asynchronous (one process does a single send, the other process doesn't enter the MPI library for many minutes). It's not on by default and not recommended for almost all uses. The goal is not to have a barrier, but to have every process have at least one channel for MPI communication fully established to every other process. The barrier is a side effect. The MPI barrier isn't used precisely because it doesn't cause every process to talk to every other process. The rotating ring algorithm was used because we're also trying as hard as possible to reduce single-point contention, which when everyone is trying to connect at once, caused failures in either the OOB fabric (which I think I fixed a couple months ago) or in the IB layer (which seemed to be the nature of IB). This is not new code, and given the tiny number of users (now that the OOB is fixed, one app that I know of at LANL), I'm not really concerned about scalability. Brian > If you really want to have a fully featured barrier why not using the > collective barrier ? This double ring barrier have really bad > performance, and it will became a real scalability issue. > > Or do we really need to force this particular connection shape (left > & right) ? > >george. > > Modified: trunk/ompi/runtime/ompi_mpi_preconnect.c > > == > --- trunk/ompi/runtime/ompi_mpi_preconnect.c (original) > +++ trunk/ompi/runtime/ompi_mpi_preconnect.c 2007-07-17 21:15:59 EDT > (Tue, 17 Jul 2007) > @@ -78,6 +78,22 @@ > > ret = ompi_request_wait_all(2, requests, MPI_STATUSES_IGNORE); > if (OMPI_SUCCESS != ret) return ret; > + > +ret = MCA_PML_CALL(isend(outbuf, 1, MPI_CHAR, > + next, 1, > + MCA_PML_BASE_SEND_COMPLETE, > + MPI_COMM_WORLD, > + &requests[1])); > +if (OMPI_SUCCESS != ret) return ret; > + > +ret = MCA_PML_CALL(irecv(inbuf, 1, MPI_CHAR, > + prev, 1, > + MPI_COMM_WORLD, > + &requests[0])); > +if(OMPI_SUCCESS != ret) return ret; > + > +ret = ompi_request_wait_all(2, requests, MPI_STATUSES_IGNORE); > +if (OMPI_SUCCESS != ret) return ret; > } > > return ret; > > > On Jul 17, 2007, at 9:16 PM, jsquy...@osl.iu.edu wrote: > >> Author: jsquyres >> Date: 2007-07-17 21:15:59 EDT (Tue, 17 Jul 2007) >> New Revision: 15474 >> URL: https://svn.open-mpi.org/trac/ompi/changeset/15474 > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
Re: [OMPI devel] PML cm and heterogeneous support
I'm surprised that ompi_mtl_datatype_{pack, unpack} are properly handling the heterogeneous issues - I certainly didn't take that into account when I wrote them. The CM code has never been audited for heterogeneous safety, which is why there was protection at that level for not running in heterogeneous environments. The various MTLs likewise have not been audited for heterogeneous safety, nor has the mtl base datatype manipulation functions. If someone wanted, they could do such an audit, push the heterogeneous disabling code down to the MTLs, and figure out what to do with the datatype usage. The CM code likely doesn't do anything heterogeneous-evil, but I can't say for sure. Brian On Thu, 25 Oct 2007, Sajjad Tabib wrote: Hi Brian, I have actually created a new MTL, in which I have added heterogeneous support. To experiment whether CM worked in this environment, I took out the safeguards that prevented one to use CM in a heterogeneous environment. Miraculously, things have been working so far. I haven't examined data integrity to an extent that I could say everything works perfectly, but with MPI_INTS, I do not have any endian problems. Now, based on my initial tests, I have came to the understanding that the PML CM safeguard against heterogeneous environments was a mechanism to prevent users from using existing MTLs. But, if an MTL supports heterogeneous communication, then it is possible to use the CM component. What is your take on this? Anyways, going back to the datatype usage. When you say that: "it's known the datatype usage in the CM PML won't support heterogeneous operation" could you please breifly explain this in more detail? I have been using ompi_mtl_datatype_pack and ompi_mtl_datatype_unpack, which use ompi_convertor_pack and ompi_convertor_unpack, for data packing. Do you mean that these functions will not work correctly? Thank You, Sajjad Tabib Brian Barrett Sent by: devel-boun...@open-mpi.org 10/24/07 10:04 PM Please respond to Open MPI Developers To Open MPI Developers cc Subject Re: [OMPI devel] PML cm and heterogeneous support No, it's because the CM PML was never designed to be used in a heterogeneous environment :). While the MX BTL does support heterogeneous operations (at one point, I believe I even had it working), none of the MTLs have ever been tested in heterogeneous environments and it's known the datatype usage in the CM PML won't support heterogeneous operation. Brian On Oct 24, 2007, at 6:21 PM, Jeff Squyres wrote: George / Patrick / Rich / Christian -- Any idea why that's there? Is that because portals, MX, and PSM all require homogeneous environments? On Oct 18, 2007, at 3:59 PM, Sajjad Tabib wrote: Hi, I am tried to run an MPI program in a heterogeneous environment using the pml cm component. However, open mpi returned with an error message indicating that PML add procs returned "Not supported". I dived into the cm code to see what was wrong and I came upon the code below, which basically shows that if the processes are running on different architectures, then return "not supported". Now, I'm wondering whether my interpretation is correct or not. Is it true that the cm component does not support a heterogeneous environment? If so, will the developers support this in the future? How could I get around this while still using the cm component? What will happen if I rebuilt openmpi without these statements? I would appreciate your help. 
Code: mca_pml_cm_add_procs() { #if OMPI_ENABLE_HETEROGENEOUS_SUPPORT for (i = 0 ; i < nprocs ; ++i) { if (procs[i]->proc_arch != ompi_proc_local()->proc_arch) { return OMPI_ERR_NOT_SUPPORTED; } } #endif . . . } Sajjad Tabib
Re: [OMPI devel] Question regarding MCA_PML_CM_SEND_REQUEST_INIT_COMMON
This is correct -- the MPI_ERROR field should be filled in by the MTL upon completion of the request (or when it knows what to stick in there). The CM PML should generally not fill in that field. Brian On Wed, 31 Oct 2007, Jeff Squyres wrote: Again, I'm not a CM guy :-), but in general, I would think: yes, you set MPI_ERROR when it is appropriate. I.e., when you know that the request has been successful or it has failed. On Oct 31, 2007, at 9:18 AM, Sajjad Tabib wrote: Hi Jeff, Now that you mention it, I believe you are right. In fact, I did not know that I needed to set the req_status.MPI_ERROR in my MTL. I looked at the mx mtl and realized that req_status.MPI_ERROR is getting set in their progress function. So, in general, when do you set the req_status.MPI_ERROR? After a send/recv has completed successfully? Thank You, Sajjad Tabib Jeff Squyres Sent by: devel-boun...@open-mpi.org 10/31/07 07:29 AM Please respond to Open MPI Developers To Open MPI Developers cc Subject Re: [OMPI devel] Question regarding MCA_PML_CM_SEND_REQUEST_INIT_COMMON I haven't done any work in the cm pml so I can't definitively answer your question, but wouldn't you set req_status.MPI_ERROR in your MTL depending on the result of the request? On Oct 29, 2007, at 9:20 AM, Sajjad Tabib wrote: Hi, I was issuing an MPI_Bcast in a sample program and was hitting an unknown error; at least that was what MPI was telling me. I traced through the code to find my error and came upon MCA_PML_CM_REQUEST_INIT_COMMON macro function in pml_cm_sendreq.h. I looked at the function and noticed that in this function the elements of req_status were getting initialized; however, req_status.MPI_ERROR was not. I thought that maybe MPI_ERROR must also require initialization because if the value of MPI_ERROR was some arbitrary value not equal to MPI_SUCCESS then my program will definitely die. Unless, MPI_ERROR is propragating from upper layers to signify an error, but I wasn't sure. Anyway, I assumed that MPI_ERROR was not propagating from upper layers, so then I set req_status.MPI_ERROR to MPI_SUCCUSS and reran my test program. My program worked. Now, having gotten my program to work, I thought I should run this by you to make sure that MPI_ERROR was not propagating from upper layers. Is it ok that I did a: "(req_send)->req_base.req_ompi.req_status.MPI_ERROR = MPI_SUCCESS;" in MCA_PML_CM_REQUEST_INIT_COMMON? Thank You, Sajjad Tabib ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Environment forwarding
This is extremely tricky to do. How do you know which environment variables to forward (foo in this case) and which not to (hostname). SLURM has a better chance, since it's linux only and generally only run on tightly controlled clusters. But there's a whole variety of things that shouldn't be forwarded and that list differs from OS to OS. I believe we toyed around with the "right thing" in LAM and early on with OPen MPI and decided that it was too hard to meet expected behavior. Brian On Mon, 5 Nov 2007, Tim Prins wrote: Hi, After talking with Torsten today I found something weird. When using the SLURM pls we seem to forward a user's environment, but when using the rsh pls we do not. I.e.: [tprins@odin ~]$ mpirun -np 1 printenv |grep foo [tprins@odin ~]$ export foo=bar [tprins@odin ~]$ mpirun -np 1 printenv |grep foo foo=bar [tprins@odin ~]$ mpirun -np 1 -mca pls rsh printenv |grep foo So my question is which is the expected behavior? I don't think we can do anything about SLURM automatically forwarding the environment, but I think there should be a way to make rsh forward the environment. Perhaps add a flag to mpirun to do this? Thanks, Tim ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Environment forwarding
On Mon, 5 Nov 2007, Torsten Hoefler wrote: On Mon, Nov 05, 2007 at 04:57:19PM -0500, Brian W. Barrett wrote: This is extremely tricky to do. How do you know which environment variables to forward (foo in this case) and which not to (hostname). SLURM has a better chance, since it's linux only and generally only run on tightly controlled clusters. But there's a whole variety of things that shouldn't be forwarded and that list differs from OS to OS. I believe we toyed around with the "right thing" in LAM and early on with OPen MPI and decided that it was too hard to meet expected behavior. Some applications rely on this (I know at least two right away, Gamess and Abinit) and they work without problems with Lam/Mpich{1,2} but not with Open MPI. I am *not* arguing that those applications are correct (I agree that this way of passing arguments is ugly, but it's done). I know it's not defined in the standard but I think it's a nice convenient functionality. E.g., setting the LD_LIBRARY_PATH to find libmpi.so in the .bashrc is also a pain if you have multiple (Open) MPIs installed. LAM does not automatically propagate environment variables -- its behavior is almost *exactly* like Open MPI's. There might be a situation where the environment is not quite so scrubbed if a process is started on the same node mpirun is executed on, but it's only appearances -- in reality, that's the environment that was alive when lamboot was executed. With both LAM and Open MPI, there is the -x option to propagate a list of environment variables, but that's about it. Neither will push LD_LIBRARY_PATH by default (and there are many good reasons for that, particularly in heterogeneous situations). Brian
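For reference, the -x flag mentioned above is the supported way to push a specific variable to the launched processes; extending Tim's earlier example (expected output shown, not a captured run):

$ export foo=bar
$ mpirun -x foo -np 1 printenv | grep foo
foo=bar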
[OMPI devel] Incorrect one-sided test
Hi all - Lisa Glendenning, who's working on a Portals one-sided component, discovered that the test onesided/test_start1.c in our repository is incorrect. It assumes that MPI_Win_start is non-blocking, but the standard says that "MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed". The pt2pt and rdma components did not block, so the test error did not show up with those components. I've fixed the test in r1223, but thought I'd let everyone know I changed one of our conformance tests. Brian
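To make the ordering constraint concrete, here is a minimal active-target sketch (not the actual onesided/test_start1.c) that remains correct even when MPI_Win_start blocks until the matching MPI_Win_post has been issued:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0, one = 1, peer;
    MPI_Win win;
    MPI_Group world_group, peer_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0 && size > 1) {
        peer = 1;
        MPI_Group_incl(world_group, 1, &peer, &peer_group);
        MPI_Win_start(peer_group, 0, win);   /* allowed to block here */
        MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);
        MPI_Group_free(&peer_group);
    } else if (rank == 1) {
        peer = 0;
        MPI_Group_incl(world_group, 1, &peer, &peer_group);
        MPI_Win_post(peer_group, 0, win);    /* opens the exposure epoch */
        MPI_Win_wait(win);                   /* origin's epoch is done   */
        printf("buf = %d\n", buf);
        MPI_Group_free(&peer_group);
    }

    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}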
Re: [OMPI devel] THREAD_MULTIPLE
On Wed, 28 Nov 2007, Jeff Squyres wrote: We've had a few users complain about trying to use THREAD_MULTIPLE lately and having it not work. Here's a proposal: why don't we disable it (at least in the 1.2 series)? Or, at the very least, put in a big stderr warning that is displayed when THREAD_MULTIPLE is selected? Comments? While you're disabling it, you might also want to remove the bullet from the front page of www.open-mpi.org that suggests we support it... Brian
Re: [OMPI devel] RTE Issue II: Interaction between the ROUTED and GRPCOMM frameworks
To me, (a) is dumb and (c) isn't a non-starter. The whole point of the component system is to separate concerns. Routing topology and collective operations are two different concerns. While there's some overlap (a topology-aware collective doesn't make sense when using the unity routing structure), it's not the kind of overlap where one implies the other. I can think of a couple of different ways of implementing the group communication framework, all of which are totally independent of the particulars of how routing is tracked. (b) has a very reasonable track record of working well on the OMPI side (the mpool / btl thing figures itself out fairly well). Bringing such a setup over to ORTE wouldn't be bad, but a bit hackish. Of course, there's at most two routed components built at any time, and the defaults are all that most non-debugging people will ever need, so I guess I'm still not convinced that (c) is a non-starter. Brian On Wed, 5 Dec 2007, Tim Prins wrote: To me, (c) is a non-starter. I think whenever possible we should be automatically doing the right thing. The user should not need to have any idea how things work inside the library. Between options (a) and (b), I don't really care. (b) would be great if we had a mca component dependency system which has been much talked about. But without such a system it gets messy. (a) has the advantage of making sure there is no problems and allowing the 2 systems to interact very nicely together, but it also might add a large burden to a component writer. On a related, but slightly different topic, one thing that has always bothered me about the grpcomm/routed implementation is that it is not self contained. There is logic for routing algorithms outside of the components (for example, in orte/orted/orted_comm.c). So, if there are any overhauls planned I definitely think this needs to be cleaned up. Thanks, Tim Ralph H Castain wrote: II. Interaction between the ROUTED and GRPCOMM frameworks When we initially developed these two frameworks within the RTE, we envisioned them to operate totally independently of each other. Thus, the grpcomm collectives provide algorithms such as a binomial "xcast" that uses the daemons to scalably send messages across the system. However, we recently realized that the efficacy of the current grpcomm algorithms directly hinges on the daemons being fully connected - which we were recently told may not be the case as other people introduce different ROUTED components. For example, using the binomial algorithm in grpcomm's xcast while having a ring topology selected in ROUTED would likely result in terrible performance. This raises the following questions: (a) should the GRPCOMM and ROUTED frameworks be consolidated to ensure that the group collectives algorithms properly "match" the communication topology? (b) should we automatically select the grpcomm/routed pairings based on some internal logic? (c) should we leave this "as-is" and the user is responsible for making intelligent choices (and for detecting when the performance is bad due to this mismatch)? (d) other suggestions? Ralph
Re: [OMPI devel] vt-integration
OS X enforces a no duplicate symbol rule when flat namespaces are in use (the default on OS X). If all the libraries are two-level namespace libraries (libSystem.dylib, aka libm.dylib is two-level), then duplicate symbols are mostly ok. Libtool by default forces a flat namespace in sharedlibraries to work around an oddity on early OS X systems with undefined references. There's also a way to make static two-level namespaces (I think), but I haven't tried that before). You can cause Libtool (and the linker) to be a bit more sane if you set the environment variable MACOSX_DEPLOYMENT_TARGET to either 10.3 or 10.4. The shared library rules followed by Libtool and the compiler chain will then be for that OS X release, rather than for the original 10.0. We don't support anything older than 10.3, so this isn't really a problem. Of course, since the default for users is to emit 10.0 target code, that can be a bit hard to make work out. So you might want to have a configure test to figure all that out and not build the IO intercept library in some cases. Brian On Wed, 5 Dec 2007, Jeff Squyres wrote: I know that OS X's linker is quite different than the Linux linker -- you might want to dig into the ld(1) man page on OS X as a starting point, and/or consult developer.apple.com for more details. On Dec 5, 2007, at 10:04 AM, Matthias Jurenz wrote: Hi Jeff, I have added checks for the functions open64, creat64, etc. to the VT's configure script, so building of VT works fine on MacOS AND Solaris (Terry had the same problem). Thanks for your hint ;-) Unfortunately, there is a new problem on MacOS. I get the following linker errors, if I try to link an application with the VT libraries: gcc -finstrument-functions pi_seq.o -lm -o pi_seq -L/Users/jurenz/lib/vtrace-5.4.1/lib -lvt -lotf -lz -L/usr/local/ lib/ -lbfd -lintl -L/usr/local/lib/ -liberty /usr/bin/ld: multiple definitions of symbol _close /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(close.So) definition of _close /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _close in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fclose /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fclose.So) definition of _fclose /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fclose in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fdopen /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fdopen.So) definition of _fdopen /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fdopen in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fgets /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fgets.So) definition of _fgets /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fgets in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fopen /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fopen.So) definition of _fopen /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fopen in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fprintf /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../ libm.dylib(fprintf.So) definition of _fprintf /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fprintf in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fputc /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fputc.So) definition of _fputc /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of 
_fputc in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fread /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fread.So) definition of _fread /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fread in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _fwrite /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(fwrite.So) definition of _fwrite /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _fwrite in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _open /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(open.So) definition of _open /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _open in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _read /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(read.So) definition of _read /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _read in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _rewind /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(rewind.So) definition of _rewind /Users/jurenz/lib/vtrace-5.4.1/lib/libvt.a(vt_iowrap.o) definition of _rewind in section (__TEXT,__text) /usr/bin/ld: multiple definitions of symbol _write /usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libm.dylib(write.So) definition of _write /Users/jurenz/l
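As a concrete form of the workaround Brian mentions above (the configure arguments are placeholders, not a specific recipe), the deployment target is just an environment variable set before configuring and building:

$ export MACOSX_DEPLOYMENT_TARGET=10.4
$ ./configure ...
$ make all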
Re: [OMPI devel] opal_condition_wait
On Thu, 6 Dec 2007, Tim Prins wrote: Tim Prins wrote: First, in opal_condition_wait (condition.h:97) we do not release the passed mutex if opal_using_threads() is not set. Is there a reason for this? I ask since this violates the way condition variables are supposed to work, and it seems like there are situations where this could cause deadlock. So in (partial) answer to my own email, this is because throughout the code we do: OPAL_THREAD_LOCK(m) opal_condition_wait(cond, m); OPAL_THREAD_UNLOCK(m) So this relies on opal_condition_wait not touching the lock. This explains it, but it still seems very wrong. Yes, this is correct. The assumption is that you are using the conditional macro lock/unlock with the condition variables. I personally don't like this (I think we should have had macro conditional condition variables), but that obviously isn't how it works today. The problem with always holding the lock when you enter the condition variable is that even when threading is disabled, calling a lock is at least as expensive as an add, possibly including a cache miss. So from a performance standpoint, this would be a no-go. Also, when we are using threads, there is a case where we do not decrement the signaled count, in condition.h:84. Gleb put this in in r9451, however the change does not make sense to me. I think that the signal count should always be decremented. Can anyone shine any light on these issues? Unfortunately, I can't add much on this front. Brian
Re: [OMPI devel] Dynamically Turning On and Off Memory Manager of Open MPI at Runtime??
On Mon, 10 Dec 2007, Peter Wong wrote: Open MPI defines its own malloc (by default), so malloc of glibc is not called. But, without calling malloc of glibc, the allocator of libhugetlbfs to back text and dynamic data by large pages, e.g., 16MB pages on POWER systems, is not used. Indeed, we can build Open MPI with --with-memory-manager=none. I am wondering the feasibility of turning the memory manger on and off dynamically at runtime as a new feature? Hi Peter - The problem is that we actually intercept the malloc() call, so once we've done that (which is a link-time thing), it's too late to use the underlying malloc to actually do its thing. I was going to add some code to Open MPI to make it an application link time choice (rather than an OMPI-build time choice), but unfortunately my current day to day work is not on Open MPI, so unless someone else picks it up, it's unlikely this will get implemented in the near future. Of course, if someone has the time and desire, I can describe to them what I was thinking. The only way I've found to do memory tracking at run-time is to use LD_PRELOAD tricks, which I believe there were some other (easy to overcome) problems with. What would be really nice (although unlikely to occur) is if there was a thread-safe way to hook into the memory manager directly (rather than playing linking tricks). GLIBC's malloc provides hooks, but they aren't thread safe (as in two user threads calling malloc at the same time would result in badness). Darwin/Mac OS X provides thread-safe hooks that work very well (don't require linker tricks and can be turned off at run-time), but are slightly higher level than what we want -- there we can intercept malloc/free, but what we'd really like to know is when memory is being given back to the operating system. Hope this helps, Brian
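For concreteness, this is the glibc-manual style hook pattern being referred to (names are illustrative; __malloc_hook was still available in glibc of that era): the hook is a single process-wide global that is swapped out and back in around every allocation, which is exactly where two threads calling malloc() at the same time can race.

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*old_malloc_hook)(size_t, const void *);

static void *my_malloc_hook(size_t size, const void *caller)
{
    void *result;
    (void) caller;
    __malloc_hook = old_malloc_hook;    /* window where our hook is gone   */
    result = malloc(size);              /* another thread can race in here */
    old_malloc_hook = __malloc_hook;
    fprintf(stderr, "malloc(%zu) -> %p\n", size, result);
    __malloc_hook = my_malloc_hook;     /* ... before it is reinstalled    */
    return result;
}

int main(void)
{
    old_malloc_hook = __malloc_hook;
    __malloc_hook = my_malloc_hook;
    free(malloc(16));
    return 0;
}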
Re: [OMPI devel] matching code rewrite in OB1
On Tue, 11 Dec 2007, Gleb Natapov wrote: I did a rewrite of matching code in OB1. I made it much simpler and 2 times smaller (which is good, less code - less bugs). I also got rid of huge macros - very helpful if you need to debug something. There is no performance degradation, actually I even see very small performance improvement. I ran MTT with this patch and the result is the same as on trunk. I would like to commit this to the trunk. The patch is attached for everybody to try. I don't think we can live without those macros :). Out of curiosity, is there any functionality that was removed as a result of this change? I'll test on a couple of systems over the next couple of days... Brian
Re: [OMPI devel] matching code rewrite in OB1
On Wed, 12 Dec 2007, Gleb Natapov wrote: On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote: This is better than nothing, but really not very helpful for looking at the specific issues that can arise with this, unless these systems have several parallel networks, with tests that will generate a lot of parallel network traffic, and be able to self check for out-of-order received - i.e. this needs to be encoded into the payload for verification purposes. There are some out-of-order scenarios that need to be generated and checked. I think that George may have a system that will be good for this sort of testing. I am running various test with multiple networks right now. I use several IB BTLs and TCP BTL simultaneously. I see many reordered messages and all tests were OK till now, but they don't encode message sequence in a payload as far as I know. I'll change one of them to do so. Other than Rich's comment that we need sequence numbers, why add them? We haven't had them for non-matching packets for the last 3 years in Open MPI (ie, forever), and I can't see why we would need them. Yes, we need sequence numbers for match headers to make sure MPI ordering is correct. But for the rest of the payload, there's no need with OMPI's datatype engine. It's just more payload for no gain. Brian
Re: [OMPI devel] IPv4 mapped IPv6 addresses
On Fri, 14 Dec 2007, Adrian Knoth wrote: Should we consider moving towards these mapped addresses? The implications: - less code, only one socket to handle - better FD consumption - breaks WinXP support, but not Vista/Longhorn or later - requires non-default kernel runtime setting on OpenBSD for IPv4 connections FWIW, FD consumption is the only real issue to consider. My thought is no. The resource consumption isn't really an issue to consider. It would also simplify the code (although work that Adrian and I did later to clean up the TCP OOB component has limited that). If you look at the FD count issue, you're going to reduce the number of FDs (for the OOB anyway) by 2. Not (2 * NumNodes), but 2 (one for BTL, one for OOB). Today we have a listen socket for IPv4 and another for IPv6. With IPv4 mapped addresses, we'd have one that did both. In terms of per-peer connections, the OOB tries one connection at a time, so there will be at most 1 OOB connection between any two peers. In return for 2 FDs, we'd have to play with code that we know works and that, with the cleanups over the last year, has actually become quite simple. We'd have to break WinXP support (when it sounds like no one is really moving to Vista), and we'd break out-of-the-box OpenBSD. Brian
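For reference, the mechanism under discussion is just a single AF_INET6 listener with IPV6_V6ONLY cleared, so IPv4 peers show up as ::ffff:a.b.c.d mapped addresses; a bare-bones sketch with no error handling, illustrative only:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

static int make_dual_stack_listener(unsigned short port)
{
    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    int off = 0;
    struct sockaddr_in6 sin6;

    /* The default for V6ONLY varies by OS (and OpenBSD additionally
     * requires a non-default sysctl, as noted above), so clear it. */
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off));

    memset(&sin6, 0, sizeof(sin6));
    sin6.sin6_family = AF_INET6;
    sin6.sin6_addr = in6addr_any;
    sin6.sin6_port = htons(port);
    bind(fd, (struct sockaddr *) &sin6, sizeof(sin6));
    listen(fd, 16);
    return fd;
}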
Re: [OMPI devel] ptmalloc and pin down cache problems again
Nope, I think that's a valid approach. For some reason, I believe it was problematic for the OpenIB guys to do that at the time we were hacking up that code. But if it works, it sounds like a much better approach. When you make the change to the openib mpool, I'd also set MORECORE_CANNOT_TRIM back to 0. mvapi / openib were the only libraries that needed the free in the deregistration callback -- GM appeared to not have that particular behavior. And I don't believe that anyone else actually uses the deregistration callbacks. Brian On Mon, 7 Jan 2008, Gleb Natapov wrote: Hi Brian, I encountered a problem with ptmalloc and the registration cache. I see that you (I think it was you) disabled shrinking of a heap memory allocated by sbrk by setting MORECORE_CANNOT_TRIM to 1. The comment explains that it should be done because freeing of small objects is not reentrant, so if the ompi memory subsystem callback calls free() the code will deadlock. And the trick indeed works in single threaded programs, but in multithreaded programs ptmalloc may allocate a heap not only by sbrk, but by mmap too. This is called "arena". Each thread may have arenas of its own. The problem is that ptmalloc may free an arena by calling munmap(), and then free() that is called from our callback deadlocks. I tried to compile with USE_ARENAS set to 0, but the code doesn't compile. I can fix the compilation problem of course, but it seems that it is not so good an idea to disable this feature. The ptmalloc scalability depends on it (and even if we disable it, ptmalloc may still create an arena by mmap if sbrk fails). I see only one way to solve this problem: to not call free() inside mpool callbacks. If freeing of memory is needed (and it is needed since IB unregister calls free()) the work should be deferred. For the IB mpool we can check what needs to be unregistered inside a callback, but actually call unregister() from the next mpool->register() call. Do you see any problems with this approach? -- Gleb.
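A hedged sketch of the deferral being proposed -- every name below is made up for illustration: the memory-release callback only records regions, and the real deregistration (which may itself call free()) happens at the top of the next registration call, safely outside the allocator's own locks.

#include <stddef.h>

#define DEFER_MAX 256

static struct { void *base; size_t len; } deferred[DEFER_MAX];
static int n_deferred = 0;

/* Called from the memory-release callback: must not call free()/munmap(). */
static void defer_deregister(void *base, size_t len)
{
    if (n_deferred < DEFER_MAX) {
        deferred[n_deferred].base = base;
        deferred[n_deferred].len  = len;
        ++n_deferred;
    }
}

/* Called at the start of the next registration: safe to release here. */
static void flush_deferred(void (*real_deregister)(void *, size_t))
{
    while (n_deferred > 0) {
        --n_deferred;
        real_deregister(deferred[n_deferred].base, deferred[n_deferred].len);
    }
}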
Re: [OMPI devel] Fwd: === CREATE FAILURE ===
Automake forces v7 mode so that Solaris tar can untar the tarball, IIRC. Brian On Thu, 24 Jan 2008, Aurélien Bouteiller wrote: According to posix, tar should not limit the file name length. Only the v7 implementation of tar is limited to 99 characters. GNU tar has never been limited in the number of characters file names can have. You should check with tar --help that tar on your machine defaults to format=gnu or format=posix. If it defaults to format=v7 I am curious why. Are you using solaris ? Aurelien Le 24 janv. 08 à 15:18, Jeff Squyres a écrit : I'm trying to replicate and getting a lot of these: tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/ pessimist/vprotocol_pessimist_sender_based.c: file name is too long (max 99); not dumped tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/ pessimist/vprotocol_pessimist_component.c: file name is too long (max 99); not dumped I'll bet that this is the real problem. GNU tar on linux defaults to 99 characters max, and the _component.c filename is 102, for example. Can you shorten your names? On Jan 24, 2008, at 3:02 PM, George Bosilca wrote: We cannot reproduce this one. A simple "make checkdist" exit long before doing anything in the ompi directory. It is difficult to see where exactly it fails, but it is somewhere in the opal directory. I suspect the new carto framework ... Thanks, george. On Jan 24, 2008, at 7:12 AM, Jeff Squyres wrote: Aurelien -- Can you fix please? Last night's tests didn't run because of this failure. Begin forwarded message: From: MPI Team Date: January 23, 2008 9:13:30 PM EST To: test...@open-mpi.org Subject: === CREATE FAILURE === Reply-To: de...@open-mpi.org ERROR: Command returned a non-zero exist status make -j 4 distcheck Start time: Wed Jan 23 21:00:08 EST 2008 End time: Wed Jan 23 21:13:30 EST 2008 = = = = === [... previous lines snipped ...] 
config.status: creating orte/mca/snapc/Makefile config.status: creating orte/mca/snapc/full/Makefile config.status: creating ompi/mca/allocator/Makefile config.status: creating ompi/mca/allocator/basic/Makefile config.status: creating ompi/mca/allocator/bucket/Makefile config.status: creating ompi/mca/bml/Makefile config.status: creating ompi/mca/bml/r2/Makefile config.status: creating ompi/mca/btl/Makefile config.status: creating ompi/mca/btl/gm/Makefile config.status: creating ompi/mca/btl/mx/Makefile config.status: creating ompi/mca/btl/ofud/Makefile config.status: creating ompi/mca/btl/openib/Makefile config.status: creating ompi/mca/btl/portals/Makefile config.status: creating ompi/mca/btl/sctp/Makefile config.status: creating ompi/mca/btl/self/Makefile config.status: creating ompi/mca/btl/sm/Makefile config.status: creating ompi/mca/btl/tcp/Makefile config.status: creating ompi/mca/btl/udapl/Makefile config.status: creating ompi/mca/coll/Makefile config.status: creating ompi/mca/coll/basic/Makefile config.status: creating ompi/mca/coll/inter/Makefile config.status: creating ompi/mca/coll/self/Makefile config.status: creating ompi/mca/coll/sm/Makefile config.status: creating ompi/mca/coll/tuned/Makefile config.status: creating ompi/mca/common/Makefile config.status: creating ompi/mca/common/mx/Makefile config.status: creating ompi/mca/common/portals/Makefile config.status: creating ompi/mca/common/sm/Makefile config.status: creating ompi/mca/crcp/Makefile config.status: creating ompi/mca/crcp/coord/Makefile config.status: creating ompi/mca/io/Makefile config.status: creating ompi/mca/io/romio/Makefile config.status: creating ompi/mca/mpool/Makefile config.status: creating ompi/mca/mpool/rdma/Makefile config.status: creating ompi/mca/mpool/sm/Makefile config.status: creating ompi/mca/mtl/Makefile config.status: creating ompi/mca/mtl/mx/Makefile config.status: creating ompi/mca/mtl/portals/Makefile config.status: creating ompi/mca/mtl/psm/Makefile config.status: creating ompi/mca/osc/Makefile config.status: creating ompi/mca/osc/pt2pt/Makefile config.status: creating ompi/mca/osc/rdma/Makefile config.status: creating ompi/mca/pml/Makefile config.status: creating ompi/mca/pml/cm/Makefile config.status: creating ompi/mca/pml/crcpw/Makefile config.status: creating ompi/mca/pml/dr/Makefile config.status: creating ompi/mca/pml/ob1/Makefile config.status: creating ompi/mca/pml/v/vprotocol/Makefile config.status: error: cannot find input file: ompi/mca/pml/v/ vprotocol/pessimist/Makefile.in make: *** [distcheck] Error 1 = = = = === Your friendly daemon, Cyrador ___ testing mailing list test...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/testing -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel _
Re: [OMPI devel] xensocket - callbacks through OPAL/libevent
On Mon, 4 Feb 2008, Muhammad Atif wrote: I am trying to port xensockets to openmpi. In principle, I have the framework and everything, but there seems to be a small issue, I cannot get libevent (or OPAL) to give callbacks for receive (or send) for xensockets. I have tried to implement native code for xensockets with libevent library, again the same issue. No call backs! . With normal sockets, callbacks do come easily. So question is, do the socket/file descriptors have to have some special mechanism attached to them to support callbacks for libevent/opal? i.e some structure/magic?. i.e. maybe the developers of xensockets did not add that callback/interrupt thing at the time of creation. Xensockets is open source, but my knowledge about these issues is limited. So I though some pointer in right direction might be useful. Yes and no :). As you discovered, the OPAL interface just repackages a library called libevent to handle its socket multiplexing. Libevent can use a number of different mechanisms to look for activity on sockets, including select() and poll() calls. On Linux, it will generally use poll(). poll() requires some kernel support to do its thing, so if Xensockets doesn't implement the right magic to trigger poll() events, then libevent won't work for Xensockets. There's really nothing you can do from the Open MPI front to work around this issue -- it would have to be fixed as part of Xensockets. Second question is, what if we cannot have the callbacks. What is the recommended way to implement the btl component for such a device? Do we need to do this with event timers? Have a look at any of the BTLs that isn't TCP -- none of them use libevent callbacks for progress. Instead, they provide a progress function as part of the BTL interface, which is called on a regular basis whenever progress needs to be made. Brian
Re: [OMPI devel] 3rd party code contributions
On Fri, 8 Feb 2008, Ralph Castain wrote: 1. event library 2. ROMIO 3. VT 4. backtrace 5. PLPA - this one is a little less obvious, but still being released as a separate package 6. libNBC Sorry to Ralph, but I clipped everything from his e-mail, then am going to make references to it. Oh well :). One minor correction -- the entire backtrace framework is not a third party deal. The *DARWIN/Mac OS X* component relies heavily on third party code, but the others (Linux and Solaris) are just wrappers around code in their respective C libraries. I believe I was responsible for the event library, ROMIO, and backtrace before leaving LANL. I'll go through the motivations and issues with all three in terms of integration. Event Library: The event library is the core "rendezvous" point for all of Open MPI, so any issues with it cause lots of issues with Open MPI in general. We've also hacked it considerably since taking the original libevent source -- we've renamed all the functions, we've made it thread safe in a way the author was unwilling to do, we've fixed some performance issues unique to our usage model. In short, this is no longer really the same libevent that might already be installed on the system. Using such an unmodified libevent would be disastrous. ROMIO is actually one that there was significant discussion about prior to me leaving Los Alamos. There are a number of problems / issues with ROMIO. First and foremost, without ROMIO, we are not a fully compliant MPI implementation. So we have to ship ROMIO -- it's the only way to have that important check mark. But its current integration has some issues -- it's hard to test patches independently. There is actually a mode in the current Open MPI tree where the MPI interface to MPI-I/O is not provided by Open MPI and no io components are built. This is to allow users to build ROMIO independently of Open MPI, for testing updates or whatever. There are some disadvantages to this. First, the independent ROMIO will use generalized requests instead of being hooked into our progress engine, so there may be some progress issues (I never verified either way). Second, it does mean dealing with another package to build on the user's site. Jeff is correct -- there was discussion about how to make the integration "better" -- many of the changes were on our side, and we were going to have to ask for a couple of changes from Argonne. If someone is going to put in the considerable amount of time to make this happen, I'm happy to write up whatever notes I can remember / find on the issue. The Darwin backtrace component is mostly maintenance free. It doesn't support 64-bit Intel chips, but that's fine. Once every 18 months or so, I need to get a new copy for the latest operating system, although the truth is I don't think anything bad happens if we just stop doing the updates at some OS release (by the way, I did the one for Leopard, so we're probably all going to be sick of MPI and on to other things before the next time it has to be done). While it's useful, if the community is really worried, it could probably be deleted. But having a stack trace when you segfault sure is nice :). Brian
Re: [OMPI devel] 1.3 Release schedule and contents
Out of curiosity, why is the one-sided rdma component struck from 1.3? As far as I'm aware, the code is in the trunk and ready for release. Brian On Mon, 11 Feb 2008, Brad Benton wrote: All: The latest scrub of the 1.3 release schedule and contents is ready for review and comment. Please use the following links: 1.3 milestones: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 1.3.1 milestones: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1 In order to try and keep the dates for 1.3 in, I've pushed a bunch of stuff (particularly ORTE things) to 1.3.1. Even though there will be new functionality slated for 1.3.1, the goal is to not have any interface changes between the phases. Please look over the list and schedules and let me or my fellow 1.3 co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any issues, errors, suggestions, omissions, heartburn, etc. Thanks, --Brad Brad Benton IBM
Re: [OMPI devel] New address selection for btl-tcp (was Re: [OMPI svn] svn:open-mpi r17307)
On Fri, 22 Feb 2008, Adrian Knoth wrote: I see three approaches: a) remove lo globally (in if.c). I expect objections. ;) I object! :). But for a good reason -- it'll break things. Someone tried this before, and the issue is when a node (like a laptop) only has lo -- then there are no reported interfaces, and either there needs to be lots of extra code in the oob / btl or things break. So let's not go down this path again. b) print a warning from BTL/TCP if the interfaces in use contain lo. Like "Warning: You've included the loopback for communication. This may cause hanging processes due to unreachable peers." I like this one. c) Throw away 127.0.0.1 on the remote side. But when doing so, what's the use for including it at all? This seems hard. Brian
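For option (b), the check itself is cheap; a minimal standalone sketch of spotting a loopback interface with getifaddrs() (an illustration only, not the actual opal/util/if.c code) would be:

#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct ifaddrs *ifaddr, *ifa;

    if (getifaddrs(&ifaddr) != 0) {
        perror("getifaddrs");
        return EXIT_FAILURE;
    }

    for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
        /* IFF_LOOPBACK marks lo / lo0; the proposed BTL warning would
           fire if such an interface survived the include/exclude
           filtering.  (Interfaces show up once per address family.) */
        if (ifa->ifa_flags & IFF_LOOPBACK) {
            printf("loopback interface in use: %s\n", ifa->ifa_name);
        }
    }

    freeifaddrs(ifaddr);
    return EXIT_SUCCESS;
}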
Re: [OMPI devel] RFC: libevent update
Jeff / George - Did you add a way to specify which event modules are used? Because epoll pushes the socket list into the kernel, I can see how it would screw up BLCR. I bet everything would work if we forced the use of poll / select. Brian On Tue, 18 Mar 2008, Jeff Squyres wrote: Crud, ok. Keep us posted. On Mar 18, 2008, at 4:16 PM, Josh Hursey wrote: I'm testing with checkpoint/restart and the new libevent seems to be messing up the checkpoints generated by BLCR. I'll be taking a look at it over the next couple of days, but just thought I'd let people know. Unfortunately I don't have any more details at the moment. -- Josh On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote: WHAT: Bring new version of libevent to the trunk. WHY: Newer version, slightly better performance (lower overheads / lighter weight), properly integrate the use of epoll and other scalable fd monitoring mechanisms. WHERE: 98% of the changes are in opal/event; there's a few changes to configury and one change to the orted. TIMEOUT: COB, Friday, 21 March 2008 DESCRIPTION: George/UTK has done the bulk of the work to integrate a new version of libevent on the following tmp branch: https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge ** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! ** Cisco ran MTT on this branch on Friday and everything checked out (i.e., no more failures than on the trunk). We just made a few more minor changes today and I'm running MTT again now, but I'm not expecting any new failures (MTT will take several hours). We would like to bring the new libevent in over this upcoming weekend, but would very much appreciate it if others could test on their platforms (Cisco tests mainly 64-bit RHEL4U4). This new libevent *should* be a fairly side-effect-free change, but it is possible that since we're now using epoll and other scalable fd monitoring tools, we'll run into some unanticipated issues on some platforms. Here's a consolidated diff if you want to see the changes: https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842 Thanks. -- Jeff Squyres Cisco Systems
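I don't know whether the merge added an MCA parameter for this, but stock libevent honors EVENT_NO* environment variables when choosing a backend, so -- assuming the renamed copy in opal/event kept that upstream behavior, which would need to be verified -- a quick test along these lines would force poll/select for the BLCR experiments:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Disable the backends that push state into the kernel so the
       event library falls back to poll/select.  This relies on
       upstream libevent's EVENT_NOEPOLL / EVENT_NOKQUEUE /
       EVENT_NODEVPOLL checks; whether Open MPI's embedded, renamed
       copy still reads them is an assumption. */
    setenv("EVENT_NOEPOLL", "1", 1);
    setenv("EVENT_NOKQUEUE", "1", 1);
    setenv("EVENT_NODEVPOLL", "1", 1);

    printf("event backends restricted to poll/select for this process\n");
    /* ... the event library / MPI would be initialized after this ... */
    return 0;
}

Exporting the same variables in the shell before mpirun would have the same effect; if they turn out not to be honored, an MCA parameter in opal/event is the cleaner fix anyway.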
[OMPI devel] Libtool for 1.3 / trunk builds
Hi all - Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it probably makes sense to update the version of Libtool used to build the nightly tarball and releases for the trunk (and eventually v1.3) from the nightly snapshot we have been using to the stable LT 2.2 release. I've done some testing (i.e., I installed LT 2.2 for another project, and nothing in OMPI broke over the last couple of weeks), so I have some confidence this should be a smooth transition. If the group decides this is a good idea, someone at IU would just have to install the new LT version and change some symlinks and it should all just work... Brian
Re: [OMPI devel] Libtool for 1.3 / trunk builds
True - I have no objection to waiting for 2.2.1 or 1.3 to be branched, whichever comes first. The main point is that under no circumstance should 1.3 be shipped with the same 2.1a pre-release as 1.2 uses -- it's time to migrate to something stable. Brian On Wed, 19 Mar 2008, Jeff Squyres wrote: Should we wait for the next LT point release? I see a fair amount of activity on the bugs-libtool list; I think they're planning a new release within the next few weeks. (I think we will want to go to the LT point release when it comes out; I don't really have strong feelings about going to 2.2 now or not) On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote: Hi all - Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it probably makes sense to update the version of Libtool used to build the nightly tarball and releases for the trunk (and eventually v1.3) from the nightly snapshot we have been using to the stable LT 2.2 release. I've done some testing (i.e., I installed LT 2.2 for another project, and nothing in OMPI broke over the last couple of weeks), so I have some confidence this should be a smooth transition. If the group decides this is a good idea, someone at IU would just have to install the new LT version and change some symlinks and it should all just work... Brian
[OMPI devel] Proc modex change
Hi all - Does anyone know why we go through the modex receive even for the local process in ompi_proc_get_info()? It doesn't seem like it's necessary, and it causes some problems on platforms that don't implement the modex (since it zeros out useful information determined during the init step). If no one has any objections, I'd like to commit the attached patch that fixes that problem. Thanks, Brian

Index: ompi/proc/proc.c
===================================================================
--- ompi/proc/proc.c	(revision 17898)
+++ ompi/proc/proc.c	(working copy)
@@ -192,6 +192,11 @@
         size_t datalen;
         orte_vpid_t nodeid;
 
+        /* Don't reset the information determined about the current
+           process during the init step.  Saves time and problems if
+           modex is unimplemented */
+        if (ompi_proc_local() == proc) continue;
+
         if (OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_JOBID,
                                                         &ompi_proc_local_proc->proc_name,
                                                         &proc->proc_name)) {
Re: [OMPI devel] IRIX autoconf failure.
On Fri, 21 Mar 2008, Regan Russell wrote: I am having problems with the Assembler section of the GNU autoconf stuff on OpenMPI. Is anyone willing to work with me to get this up and running...? As a warning, MIPS / IRIX is not currently on the list of Open MPI supported platforms, so there may be some issues that we can't overcome. But this is usually a pretty simple thing -- can you send the config.log file generated by configure? Thanks, Brian
Re: [OMPI devel] FreeBSD timer_base_open error?
George - Good catch -- that's going to cause a problem :). But I think we should add yet another check to also make sure that we're on Linux. So the three tests would be: 1) Am I on a platform that we have timer assembly support for? (That's the long list of architectures that we recently, and incorrectly, added.) 2) Am I on Linux (since we really only know how to parse /proc/cpuinfo on Linux)? 3) Is /proc/cpuinfo readable (because we have a couple architectures that are reported by config.guess as Linux, but don't have /proc/cpuinfo)? Make sense? Brian On Wed, 26 Mar 2008, George Bosilca wrote: I was working off-list with Brad on this. Brian is right, the logic in configure.m4 is wrong. It overwrites timer_linux_happy to yes if the host matches "i?86-*|x86_64*|ia64-*|powerpc-*|powerpc64-*|sparc*-*". On FreeBSD the host is i386-unknown-freebsd6.2. Here is a quick and dirty patch. I just move the selection logic a little bit around, without any major modifications. george.

Index: configure.m4
===================================================================
--- configure.m4	(revision 17970)
+++ configure.m4	(working copy)
@@ -40,14 +40,12 @@
                    [timer_linux_happy="yes"],
                    [timer_linux_happy="no"])])
 
-AS_IF([test "$timer_linux_happy" = "yes"],
-      [AS_IF([test -r "/proc/cpuinfo"],
-             [timer_linux_happy="yes"],
-             [timer_linux_happy="no"])])
-
 case "${host}" in
 i?86-*|x86_64*|ia64-*|powerpc-*|powerpc64-*|sparc*-*)
-    timer_linux_happy="yes"
+    AS_IF([test "$timer_linux_happy" = "yes"],
+          [AS_IF([test -r "/proc/cpuinfo"],
+                 [timer_linux_happy="yes"],
+                 [timer_linux_happy="no"])])
     ;;
 *)
     timer_linux_happy="no"

On Mar 25, 2008, at 10:31 PM, Brian Barrett wrote: On Mar 25, 2008, at 6:16 PM, Jeff Squyres wrote: "linux" is the name of the component. It looks like opal/mca/timer/linux/timer_linux_component.c is doing some checks during component open() and returning an error if it can't be used (e.g., if it's not on linux). The timer components are a little different than normal MCA frameworks; they *must* be compiled into libopen-pal statically, and there will only be one of them built. In this case, I'm guessing that linux was built simply because nothing else was selected to be built, but then its component_open() function failed because it didn't find /proc/cpuinfo. This is actually incorrect. The linux component looks for /proc/cpuinfo and builds if it finds that file. There's a base component that's built if nothing else is found. The configure logic for the linux component is probably not the right thing to do -- it should probably be modified to check both that that file is readable (there are systems that call themselves "linux" but don't have a /proc/cpuinfo) and that we're actually on Linux. Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
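The component-level version of checks 2 and 3 lives in configure, but the runtime idea is nothing more than this standalone sketch (an illustration only, not the actual timer component code):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname info;

    /* Check 2: are we actually on Linux?  config.guess can report a
       "linux" host for systems that don't behave like one. */
    if (uname(&info) != 0 || strcmp(info.sysname, "Linux") != 0) {
        printf("not Linux; the linux timer component should not be used\n");
        return 1;
    }

    /* Check 3: is /proc/cpuinfo actually readable?  A couple of
       Linux-labeled platforms don't provide it. */
    if (access("/proc/cpuinfo", R_OK) != 0) {
        printf("/proc/cpuinfo not readable; fall back to the base timer\n");
        return 1;
    }

    printf("the linux timer component is usable here\n");
    return 0;
}

Check 1 (the architecture list for the timer assembly) stays a configure-time decision either way.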
Re: [OMPI devel] Memchecker: breaks trunk again
On Mon, 21 Apr 2008, Ralph H Castain wrote: So it appears to be a combination of memchecker=yes automatically requiring valgrind, and the override on the configure line of a param set by a platform file not working. So I can't speak to the valgrind/memchecker issue, but can to the platform/configure issue. The platform file was intended to provide a mechanism to allow repeatability in builds. By design, options in the platform file have higher priority than options given on the configure command line. Brian
Re: [OMPI devel] Flush CQ error on iWARP/Out-of-sync shutdown
On Tue, 6 May 2008, Jeff Squyres wrote: On May 5, 2008, at 6:27 PM, Steve Wise wrote: There is a larger question regarding why the remote node is still polling the hca and not shutting down, but my immediate question is if it is an acceptable fix to simply disregard this "error" if it is an iWARP adapter. If proc B is still polling the hca, it is likely because it simply has not yet stopped doing it. I.e., a big problem in MPI implementations is that not all actions are exactly synchronous. MPI disconnects are *effectively* synchronous, but we probably didn't *guarantee* synchronicity in this case because we didn't need it (perhaps until now). Not to mention... The BTL has to be able to handle a shutdown from one proc while still running its progression engine, as that's a normal sequence of events when dynamic processes are involved. Because of that, there wasn't too much care taken to ensure that everyone stopped polling, then everyone did del_procs. Brian
Re: [OMPI devel] btl_openib_iwarp.c : making platform specific calls
On Tue, 13 May 2008, Don Kerr wrote: I believe there are similar operations being used by other areas of open mpi, place to start looking would be, opal/util/if.c. Yes, opal/util/if.h and opal/util/net.h provide a portable interface to almost everything that comes from getifaddrs(). Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
I think having a parameter to turn off the warning is a great idea. So great, in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect. I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line btl_base_warn_component_unused = 0 added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale). I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing. Brian On Wed, 21 May 2008, Jeff Squyres wrote: What: Change default in openib BTL to not complain if no OpenFabrics devices are found Why: Many linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware Where: btl_openib_component.c When: For v1.3 Timeout: Teleconf, 27 May 2008 Short version = Many major linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware. We should change the default in v1.3 to not complain if there are no OpenFabrics devices found (perhaps have an MCA param to enable the warning if desired). Longer version == I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package: # Disable the use of InfiniBand # btl = ^openib Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default. The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior). But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
And there's a typo in my first paragraph. The flag currently defaults to 1 (print the warning). It should be switched to 0 to turn off the warning. Sorry for any confusion I might have caused -- I blame the lack of caffeine in the morning. Brian On Wed, 21 May 2008, Pavel Shamis (Pasha) wrote: I'm agree with Brian. We may add to the warning message detailed description how to disable it. Pasha Brian W. Barrett wrote: I think having a parameter to turn off the warning is a great idea. So great, in fact, that it already exists in the trunk and v1.2 :)! Setting the default value for the btl_base_warn_component_unused flag from 0 to 1 will have the desired effect. I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error. It's trivially easy for any packaging group to have the line btl_base_warn_component_unused = 0 added to $prefix/etc/openmpi-mca-params.conf during the install phase of the package build (indeed, our simple build scripts at LANL used to do this on a regular basis due to our need to tweak the OOB to keep IPoIB happier at scale). I think keeping the Debian guys happy is a good thing. Giving them an easy way to turn off silly warnings is a good thing. Removing a known useful warning to help them doesn't seem like a good thing. Brian On Wed, 21 May 2008, Jeff Squyres wrote: What: Change default in openib BTL to not complain if no OpenFabrics devices are found Why: Many linuxes are shipping libibverbs these days, but most users still don't have OpenFabrics hardware Where: btl_openib_component.c When: For v1.3 Timeout: Teleconf, 27 May 2008 Short version = Many major linuxes are shipping libibverbs by default these days. OMPI will therefore build the openib BTL by default, but then complains at run time when there's no OpenFabrics hardware. We should change the default in v1.3 to not complain if there are no OpenFabrics devices found (perhaps have an MCA param to enable the warning if desired). Longer version == I just got a request from the Debian Open MPI package maintainers to include the following in the default openmpi-mca-params.conf for the OMPI v1.2 package: # Disable the use of InfiniBand # btl = ^openib Having this in the openmpi-mca-params.conf gives Debian an easy documentation path for users to shut up these warnings when they build on machines with libibverbs present but no OpenFabrics hardware. I think that this is fine for the v1.2 series (and will file a CMR for it). But for v1.3, I think we should change the default. The vast majority of users will not have OpenFabrics devices, and we should therefore not complain if we can't find any at run-time. We can/should still complain if we find OpenFabrics devices but no active ports (i.e., don't change this behavior). But for optimizing the common case: I think we should (by default) not print a warning if no OpenFabrics devices are found. We can also [easily] have an MCA parameter that *will* display a warning if no OpenFabrics devices are found.
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote: 2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal. Which is easily solved with a better error message, as Pasha suggested. 3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults. Shouldn't we be optimizing for the common case? In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware. But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog. I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild. It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation. Just my $0.02. Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Then we disagree on a core point. I believe that users should never have something silently unexpected happen (like falling back to TCP from a high-speed interconnect because of a NIC reset / software issue). You clearly don't feel this way. I don't really work on the project, but do have lots of experience being yelled at by users when something unexpected happens. I guarantee you we'll see a report of poor IB / application performance because of the silent fallback to TCP. There's a reason that error message was put in. I don't get a vote anymore, so do whatever you think is best. Brian On Wed, 21 May 2008, Jeff Squyres wrote: One thing I should clarify -- the ibverbs error message from my previous mail is a red herring. libibverbs prints that message on systems where the kernel portions of the OFED stack are not installed (such as the quick-n-dirty test that I did before -- all I did was install libibverbs without the corresponding kernel stuff). I installed the whole OFED stack on a machine with no verbs-capable hardware and verified that the libibverbs message does *not* appear when the kernel bits are properly installed and running. So we're only talking about the Open MPI warning message here. More below. On May 21, 2008, at 12:17 PM, Brian W. Barrett wrote: 2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal. Which is easily solved with a better error message, as Pasha suggested. I guess this is where we disagree: I don't believe that the issue is solved by making a "better" message. Specifically: this is the first case where we're saying "if you run with a valid configuration, you're going to get a warning message and you have to do something extra to turn it off." That just seems darn weird to me, especially when other MPIs don't do the same thing. Come to think of it, I can't think of many other software packages that do that. In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware. But here's the real problem -- with our current selection logic, a user with libibverbs but no IB cards gets an error message saying "hey, we need you to set this flag to make this error go away" (or would, per Pasha's suggestion). A user with a busted IB stack on a node (which we still saw pretty often at LANL) starts using TCP and their application runs like a dog. I guess it's a matter of how often you see errors in the IB stack that cause nic initialization to fail. The machines I tend to use still exhibit this problem pretty often, but it's possible I just work on bad hardware more often than is usual in the wild. I guess this is the central issue: what *is* the common case? Which set of users should be forced to do something different? I'm claiming that now that the Linux distros are shipping libibverbs, the number of users who have the openib BTL installed but do not have verbs-capable hardware will be *much* larger than those with verbs-capable hardware. Hence, I think the pain point should be for the smaller group (those with verbs-capable hardware): set an MCA param if you want to see the warning message.
(we can debate the default value for the BTL-wide base param later -- let's first just debate the *concept* as specific to the openib BTL) It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. Yes, this capability in libibverbs would be good. Parsing the PCI tables doesn't sound like our role. I'll ask the libibverbs authors about it... I guess the root of my concern is that unexpected behavior with no explanation is (in my mind) the most dangerous case and the one we should address by default. And turning this error message off is going to cause unexpected behavior without explanation. But more information is available, and subject to normal troubleshooting techniques. And if you're in an environment where you *do* want to use verbs-capable hardware, then setting the MCA param seems perfectly acceptable to me. IIRC, LANL sets a whole pile of MCA params in the top-level openmpi-mca-params.conf file that are specific to their environment (right?). If that's true, what's one more param? Heck, the OMPI installed by OFED can set an MCA param in openmpi-mca-params.conf by default (which is what most verbs-capable-hardware users utilize). That would solve the issue
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote: On May 21, 2008, at 3:38 PM, Jeff Squyres wrote: It would be great if libibverbs could return two different error messages - one for "there's no IB card in this machine" and one for "there's an IB card here, but we can't initialize it". I think that would make this argument go away. Open MPI could probably mimic that behavior by parsing the PCI tables, but that sounds ... painful. Thinking about this a bit more -- I think it depends on what kind of errors you are worried about seeing. IBV does separate the discovery of devices (ibv_get_device_list) from trying to open a device (ibv_open_device). So hypothetically, we *can* distinguish between these kinds of errors already. Do you see devices that are so broken that they don't show up in the list returned from ibv_get_device_list? FWIW: the *only* case I'm talking about changing the default for is when ibv_get_device_list returns an empty list (meaning that according to the verbs stack, there are no devices in the host). I think that we should *always* warn for any kinds of errors that occur after that (e.g., we find a device but can't open it, we find one or more devices but no active ports, etc.). Previously, there has not been such a distinction, so I really have no idea which caused the openib BTL to throw its error (and never really cared, as it was always somebody else's problem at that point). I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used. If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)? If not, we still have a (fairly common, unfortunately) error case that we need to report (in my opinion). Brian
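For reference, the two-stage distinction maps onto the verbs API as in the following minimal standalone sketch (device discovery via ibv_get_device_list() vs. device bring-up via ibv_open_device()); the warning policy layered on top is the part being debated, and this is not the actual openib BTL code:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int i, num_devs = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devs);

    if (NULL == devs || 0 == num_devs) {
        /* Case 1: verbs stack present, but no HCAs in the host.
           This is the only case Jeff proposes to keep quiet by default. */
        printf("no verbs-capable devices found\n");
        if (devs) ibv_free_device_list(devs);
        return 0;
    }

    for (i = 0; i < num_devs; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (NULL == ctx) {
            /* Case 2: an HCA exists but cannot be initialized --
               this one should always warn. */
            printf("found device %s but could not open it\n",
                   ibv_get_device_name(devs[i]));
            continue;
        }
        /* Port state (e.g. down after an SM failure and reboot) would
           be checked here with ibv_query_port() before declaring the
           device usable. */
        printf("device %s opened successfully\n",
               ibv_get_device_name(devs[i]));
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}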
Re: [OMPI devel] openib btl build question
On Wed, 21 May 2008, Jeff Squyres wrote: On May 21, 2008, at 4:17 PM, Don Kerr wrote: Just want to make sure what I think I see is true: Linux build. openib btl requires ptmalloc2 and ptmalloc2 requires posix threads, is that correct? ptmalloc2 is not *required* by the openib btl. But it is required on Linux if you want to use the mpi_leave_pinned functionality. I see one function call to __pthread_initialize in the ptmalloc2 code -- it *looks* like it's a function of glibc, but I don't know for sure. There's actually more than that, it's just buried a bit. There's a whole bunch of thread-specific data stuff, which is wrapped so that different thread packages can be used (although OMPI only supports pthreads). The wrappers are in ptmalloc2/sysdeps/pthreads. Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Wed, 21 May 2008, Jeff Squyres wrote: I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used. Can you put in a site wide btl = ^tcp to avoid the problem? If the IB card fails, then you'll get unreachable MPI errors. And how many users are going to figure that one out before complaining loudly? That's what LANL did (probably still does) and it worked great there, but that doesn't mean that others will figure that out (after all, not everyone has an OMPI developer on staff...). If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine rebooted since that time)? Yes. If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple delivered from the white box vendor IB setup that barely works on a good day (and unfortunately, there seems to be evidence that these exist). Brian
Re: [OMPI devel] openib btl build question
Ah. On Linux, --without-threads really doesn't gain you that much. The default glibc is still thread safe, and there are only a couple small parts of the code that use locks (like the OOB TCP). It's generally just easier to leave threads enabled on Linux. Brian On Thu, 22 May 2008, Don Kerr wrote: Thanks Jeff. Thanks Brian. I ran into this because I was specifically trying to configure with "--disable-progress-threads --disable-mpi-threads" at which point I figured, might as well turn off all threads so I added "--without-threads" as well. But can't live without mpi_leave_pinned so threads are back. Jeff Squyres wrote: On May 21, 2008, at 4:37 PM, Brian W. Barrett wrote: ptmalloc2 is not *required* by the openib btl. But it is required on Linux if you want to use the mpi_leave_pinned functionality. I see one function call to __pthread_initialize in the ptmalloc2 code -- it *looks* like it's a function of glibc, but I don't know for sure. There's actually more than that, it's just buried a bit. There's a whole bunch of thread-specific data stuff, which is wrapped so that different thread packages can be used (although OMPI only supports pthreads). The wrappers are in ptmalloc2/sysdeps/pthreads. Doh! I didn't "grep -r"; my bad...
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On Thu, 22 May 2008, Terry Dontje wrote: The major difference here is that libmyriexpress is not being included in mainline Linux distributions. Specifically: if you can find/use libmyriexpress, it's likely because you have that hardware. The same *used* to be true for libibverbs, but is no longer true because Linux distros are now shipping it (e.g., the Debian distribution pulls in libibverbs when you install Open MPI). Ok, but there are distributions that do include the myrinet BTL/MTL (i.e., CT). Though I agree for the most part that in the case of myrinet, if you have libmyriexpress you will probably have an operable interface. I guess I am curious how many other BTLs a distribution might end up delivering that could run into this reporting issue. I guess my point is: could this be worth something more general instead of a one-off for IB? From my point of view btl_base_warn_component_unused coupled with "-mca btl ^mlfbtl" works for me. However, the fact that the IB vendors/community (i.e., Cisco) is solving this for their favorite interface makes me pause for a moment. There's actually a second (in my mind more important) reason why this is IB only, as I shared similar concerns (hence yesterday's e-mail barrage). InfiniBand has a two-stage initialization -- you get the list of HCAs, then you initialize the HCA you want. So it's possible to determine that there are no HCAs in the system vs. the system couldn't initialize the HCA properly (as that would happen in step 2, according to Jeff). With MX, it's one initialization call (mx_init), and it's not clear from the errors it can return whether you can differentiate between the two cases. I haven't tried it, but it's possible that mx_init would succeed in the no-NIC case, but then have a NIC count of 0. Anyway, the short answer is that (in my opinion) we should have a btl base param similar to warn_unused for whether to warn when no NICs/HCAs are found, hopefully with a nice error function similar to today's no_nics (which probably needs to be renamed in that case). That way, if BTL authors other than OpenIB want to do some extra work and return better error messages, they can. FWIW, our distribution actually turns off btl_base_warn_component_unused because it seemed that in the majority of our cases users would get false positives from the message. Is the UDAPL library shipped in Solaris by default? If so, then you're likely in exactly the same kind of situation that I'm describing. The same will be true if Solaris ends up shipping libibverbs by default. Yes, the UDAPL library is shipped in Solaris by default. Which is why we turn off btl_base_warn_component_unused. Yes, and I suspect once Solaris starts delivering libibverbs we (Sun) will need to figure out how to handle having both the udapl and openib btls being available. There is some evil configure hackery that could be done to make this work in a more general way (don't you love it when I say that). Autogen/configure makes no guarantees about the order in which the configure.m4 macros for components in the same framework are run, other than all components of priority X are run before those of priority Y, iff X > Y. So you could set the priority of all the components except udapl to (say) 10 and udapl's to 0. Then have the udapl configure only build if 1) it was specifically requested or 2) ompi_check_openib_happy = no. No more Linux-specific stuff, works when Solaris gets OFED, and works on old Solaris that has uDAPL but not OFED.
As a matter of fact, it's so trivial to do that I'd recommend doing it for 1.3. Really, you could do it minimally by only changing OpenIB's configure.params to set its priority to 10, uDAPL's configure.params to set its priority to 0, and uDAPL's configure.m4 to remove the Linux stuff and look for ompi_check_openib_happy. Brian