Re: [OMPI devel] [OMPI svn] svn:open-mpi r28456 - trunk
Hi,

Briefly, I'm Thomas. I work at ORNL. I changed autogen.pl on my first commit to OMPI trunk. (Insert rookie joke here. :-D)

The changes in r28456 for autogen.pl were pretty basic/minor. I apologize for not sending a follow-up email to the devel mailing list outlining the changes -- poor netiquette on my part. :-/ There were four changes included in the patch. They relate mainly to the recent changes for MCA frameworks. I'll give a little more description below. Ralph, I also included your feedback and a response for #2. Let me know if this makes sense, as I think it provides the "right" behavior, but I want to double check. Thanks.

1) Add an ifdef guard to each project's autogenerated "frameworks.h" header file, e.g., "opal/include/opal/frameworks.h" would have "OPAL_FRAMEWORKS_H". This one simply adds an ifdef to the top of the auto-generated file, so code that includes the "frameworks.h" file avoids multiple inclusions of the same file. This is generic to the given project, so the "opal/" project would generate something like:

   $ cat opal/include/opal/frameworks.h
   /*
    * This file is autogenerated by autogen.pl. Do not edit this file by hand.
    */
   #ifndef OPAL_FRAMEWORKS_H
   #define OPAL_FRAMEWORKS_H

   #include

   extern mca_base_framework_t opal_backtrace_base_framework;
   ..

   #endif /* OPAL_FRAMEWORKS_H */

This would also be done for the "ompi/" and "orte/" project directories.

2) Avoid adding "ignored" frameworks to the autogenerated "frameworks.h" header file. This simply applies the same ignored() function that is used elsewhere in the autogen.pl script for omitting "ignored" MCA directories from processing; it just picks up the ".ompi_ignore" (and/or ".ompi_unignore") files. The intent is that if you ignore a component (subdir), it will be removed from the list, but you could also remove an entire framework by putting the ignore file in the top level of the framework.
The intent being that if, for whatever reason, you ignore a framework in the "${project}/mca/" space, you will not have it automatically show up in the project's "frameworks.h" file.

On Tue, 7 May 2013, Ralph Castain wrote:

We use the frameworks.h file to "discover" the frameworks in ompi_info. Even if no components are built for that framework, there still are MCA params relating to the base of that framework. Sounds silly, I know - but there may be reasons to access those params - e.g., to set verbosity to verify that no components are being selected. I think we need those frameworks to be listed...

Ok, I didn't realize the 'ompi_info' aspect. Good to know. However, I think honouring the ignore behavior is good in this case because if you drop an ignore file in a framework, you won't get any other autogen touches (i.e., no Makefiles are autogenerated). So it seems that having no Makefiles but including the framework in "frameworks.h" would break regardless. Again, this is more of a safety guard.

3) Avoid adding non-MCA projects to the autogenerated 'mca_project_list', which maintains existing support for "projects" with the new MCA framework enhancements. This moves the code down to mca_run_global(). It was just a bit of code shifting, and it didn't sound like there was any discussion on this point. This is a "do no harm" factor to support pre-existing functionality. The gist is that if you have a "project" in the build directory that doesn't have an MCA directory structure, we just avoid adding it to the list of MCA projects.

4) Add a small loop at the end to add projects with a "config/" subdir to the list of includes for 'autoreconf'. This again is a "do no harm" factor to support pre-existing functionality. If you have a "${project}/config/" directory, this appends "-I ${project}/config/" to the autoreconf list. If you do not have a "${project}/config/" dir, there is no change.

Again, I hope that gives more context/description to the changes included in the autogen.pl patch.
In the future, I'll try to do a better job of sending a heads up to the devel list.

Thanks,
--tjn

_
Thomas Naughton   naught...@ornl.gov
Research Associate   (865) 576-4184

On Tue, 7 May 2013, Ralph Castain wrote:

Crud - it just struck me that you don't want to do one thing in that patch.

+ Avoid adding "ignored" frameworks to the autogenerated "frameworks.h" header file.

We use the frameworks.h file to "discover" the frameworks in ompi_info. Even if no components are built for that fr
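For readers following along, the combined effect of changes #1 and #2 above can be sketched in Python. (autogen.pl itself is Perl; the ignore precedence and the exact header text here are assumptions based on the description in this thread, not the script's literal output.)

```python
import os

def ignored(path):
    """Assumed rendering of autogen.pl's ignored() rule: a directory is
    skipped when it contains .ompi_ignore and no .ompi_unignore."""
    return (os.path.exists(os.path.join(path, ".ompi_ignore"))
            and not os.path.exists(os.path.join(path, ".ompi_unignore")))

def emit_frameworks_h(project, frameworks, srcdir="."):
    """Emit a guarded frameworks.h body, omitting ignored frameworks."""
    guard = "%s_FRAMEWORKS_H" % project.upper()
    lines = [
        "/*",
        " * This file is autogenerated by autogen.pl. "
        "Do not edit this file by hand.",
        " */",
        "#ifndef %s" % guard,          # change #1: include guard
        "#define %s" % guard,
    ]
    for fw in frameworks:
        fw_dir = os.path.join(srcdir, project, "mca", fw)
        # change #2: skip frameworks carrying an ignore file
        if os.path.isdir(fw_dir) and ignored(fw_dir):
            continue
        lines.append("extern mca_base_framework_t "
                     "%s_%s_base_framework;" % (project, fw))
    lines.append("#endif /* %s */" % guard)
    return "\n".join(lines)

print(emit_frameworks_h("opal", ["backtrace"]))
```

The same generation would run once per project ("opal", "ompi", "orte"), each with its own guard symbol.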
Re: [OMPI devel] Q: project based MCA param files
Hi,

Ok, looks like this may just do the trick. We briefly discussed this today and can probably change our use case to make use of this mechanism instead and avoid any further enhancements.

Question: If you do a setenv for this MCA param, does that extend the default search path? Or does it replace/override the default?

Thanks to Jeff for forwarding the info to the devel list to get broader feedback, and to Ralph for providing the suggestion.

--tjn

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Tue, 7 May 2013, Ralph Castain wrote:

I believe we already have a way of defining where to get the default mca params:

   ret = mca_base_var_register("opal", "mca", "base", "param_files",
                               "Path for MCA configuration files "
                               "containing variable values",
                               MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0,
                               OPAL_INFO_LVL_2,
                               MCA_BASE_VAR_SCOPE_READONLY,
                               &mca_base_var_files);

So wouldn't it be as easy as defining an envar? It's what we did when using the OMPI code with ORCM a couple of years ago, and we used it again for a recent project at Greenplum where the default mca param was specified in a different location than usual.

On May 7, 2013, at 6:28 AM, Jeff Squyres (jsquyres) wrote:

Given Ralph's questions about r28456 (https://svn.open-mpi.org/trac/ompi/changeset/28456), ORNL's second question to me/Nathan about MCA params is probably worth forwarding to the list -- see below. Thoughts on this proposal?

Begin forwarded message:

From: "Boehm, Swen"
Subject: Re: Q: project based MCA param files
Date: May 3, 2013 5:03:43 PM EDT
To: "Jeff Squyres (jsquyres)"
Cc: Nathan Hjelm, "Vallee, Geoffroy R.", "Naughton III, Thomas J."

Hi Jeff,

Here is a short description of the enhancement we would like to contribute. Let us know what you think.

The purpose of the suggested improvements is to enable "projects" to read MCA parameters from project specific locations.
This enables the usage of OPAL and the MCA frameworks outside the Open MPI project without interfering with Open MPI specific parameters, and removes the need to patch OPAL (e.g., to pick up params from different locations). The possible scenarios would be the following:

a) adding the option to pick up a project specific mca-param.conf file
   Example: $HOME/.mca/${project}-mca-param.conf and /etc/mca/${project}-mca-param.conf

b) adding the option to pick up the mca-param.conf file from a project specific directory
   Example: $HOME/.${project}/mca-param.conf and /etc/${project}/mca-param.conf and/or /etc/${project}/${project}-mca-param.conf

c) prefixing the mca param with the project name in the existing mca-param.conf file, thereby following the new MCA variable system naming scheme
   Example: mca_${project}_${framework}_${component}_${var_name}

The implementation has to be compatible with the current system; that is, it should work as it does today without any added burden to the user. The suggested approach is to provide an addition to the MCA API (something like mca_base_add_config_file_path()) to add lookup paths to the MCA system. This way additional files can be picked up for the MCA param parsing if needed.

To wrap it up:
1) Is the motivation clear?
2) Is it possible to implement the desired capability within a reasonable time and without changing the current behavior?
3) Does it line up with the planning / future capabilities?
4) Which of the above options (a, b, c) would you prefer?

--
Swen Boehm | Email: bo...@ornl.gov
Oak Ridge National Laboratory | Phone: +1 865-576-6125

On Apr 26, 2013, at 7:50 PM, Thomas Naughton wrote:

Hi,

Ok, sounds good. We'll check on this next week and get back to you.
Thanks,
--tjn

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Fri, 26 Apr 2013, Jeff Squyres (jsquyres) wrote:

Email would probably be easiest -- I will need to page in/refresh this area of the code anyway, so if you guys do the initial homework and submit some ideas, that would probably be easiest (for me). :-D

On Apr 26, 2013, at 6:33 PM, Thomas Naughton wrote:

Hi Jeff,

We don't have one yet, but we can code something up and submit a patch. If useful, we could quickly sync up beforehand to ensure we are on the same page. Phone or email, whatever would be easiest. What do you think?

--tjn
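A toy model of the proposed hook, in Python rather than the C the MCA library actually uses. mca_base_add_config_file_path() is the hypothetical API name from the proposal above, and the append-vs-replace semantics shown are assumptions -- the envar case is exactly the open question Thomas asks earlier in the thread:

```python
class McaParamPaths:
    """Toy model of the MCA param-file search path (assumed semantics)."""

    def __init__(self, default_paths):
        # e.g. the usual per-user and system-wide mca-params.conf spots
        self._paths = list(default_paths)

    def add_config_file_path(self, path):
        """Proposed API: *extend* the search path with a project path."""
        if path not in self._paths:
            self._paths.append(path)

    def override_from_env(self, value):
        """Setting the param via an envar *replaces* the default list
        (assumed behavior; whether it extends or replaces is the
        question posed on the list)."""
        self._paths = value.split(":")

    def search_path(self):
        return list(self._paths)
```

Under these assumptions, a project would call add_config_file_path() once at startup and leave the user-facing envar behavior untouched.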
Re: [OMPI devel] [OMPI svn] svn:open-mpi r28456 - trunk
Hi Ralph,

On Tue, 7 May 2013, Ralph Castain wrote:

2) Avoid adding "ignored" frameworks to the autogenerated "frameworks.h" header file. This simply applies the same ignored() function that is used elsewhere in the autogen.pl script for omitting "ignored" MCA directories from processing; it just picks up the ".ompi_ignore" (and/or ".ompi_unignore") files. The intent being that if you ignore a component (subdir), it will be removed from the list, but you could also remove an entire framework by putting the ignore file in the top level of the framework.

That is new - I would suggest not doing that, as it behaves differently than you might expect. The .ompi_ignore in a component prevents that component from building at all, so it won't ever be opened, etc. However, the framework *must* build the base code no matter what - and that means the framework will be opened, selected, and closed at the minimum. I would prefer we keep ompi_ignore cleanly defined. You can ignore all components by simply putting --enable-mca-no-build= on your configure line.

The intent being that if, for whatever reason, you ignore a framework in the "${project}/mca/" space, you will not have it automatically show up in the project's "frameworks.h" file.

On Tue, 7 May 2013, Ralph Castain wrote:

We use the frameworks.h file to "discover" the frameworks in ompi_info. Even if no components are built for that framework, there still are MCA params relating to the base of that framework. Sounds silly, I know - but there may be reasons to access those params - e.g., to set verbosity to verify that no components are being selected. I think we need those frameworks to be listed...

Ok, I didn't realize the 'ompi_info' aspect. Good to know. However, I think honouring the ignore behavior is good in this case because if you drop an ignore file in a framework, you won't get any other autogen touches (i.e., no Makefiles are autogenerated).
So it seems that having no Makefiles but including the framework in "frameworks.h" would break regardless. Again, this is more of a safety guard.

Actually, I disagree. As stated above, the framework will *always* build the base code and be opened, selected, and closed - so you at least need access to the verbosity parameter so you can verify those operations. Keeping it in ompi_info is of value.

I guess I misunderstood the scope of use for the ".ompi_ignore" file. I thought that it could be placed at the top of the framework and it would ignore the entire directory. I just did a quick test with the earlier version of autogen.pl (r28241) and it does indeed generate the Makefiles for that directory. So it does seem reasonable that if autogen.pl processes the directory for Makefile stuff, it should process it for the "frameworks.h" entry. I'll revert that part of the changeset to the previous functionality.

Sorry, my bad,
--tjn

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184
[OMPI devel] 'install-sh' in SVN
Hi,

It looks like an auto-generated 'install-sh' was accidentally added to SVN under libevent in OPAL:

   ompi-trunk/opal/mca/event/libevent2021/libevent/install-sh

OK to remove it?

--tjn

_____
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184
Re: [OMPI devel] ROMIO update breaks trunk
Hi Ralph,

Does the version in AM_INIT_AUTOMAKE in configure.ac also need to be increased? It currently shows 1.11.

Thanks,
--tjn

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Thu, 14 Nov 2013, Ralph Castain wrote:

Ha! Jeff points out that our web site says we are at AM 1.12.2 - yet our HACKING file says 1.11.1. Sadness. I'll leave the romio update alone and update the HACKING file to avoid future confusion.

On Nov 14, 2013, at 12:41 PM, Ralph Castain wrote:

Just in case others are encountering this: the recent ROMIO update contains a line in its configure.ac that breaks the trunk for automake versions less than 1.12:

"I've looked a bit around online for this, and the consensus generally seems to be that AM_PROG_AR should be added in libtool, not in every configure.ac script out there. It's especially problematic as AM_PROG_AR doesn't exist in automake before 1.12, which means it breaks, among others, with the automake we use to build our distribution tarballs :-) See e.g. http://debbugs.gnu.org/cgi/bugreport.cgi?bug=11401 for a discussion."

I'm going to comment that line out in ompi/mca/io/romio/romio/configure.ac so the trunk can build until someone figures out (a) if it is really needed, and (b) how to correctly add it.

Ralph
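For reference, a commonly suggested alternative to commenting the line out (a sketch, not what was committed here) is to make the call conditional on the macro actually existing, so older automakes simply skip it:

```m4
dnl In configure.ac: invoke AM_PROG_AR only when the installed
dnl automake defines it (per the report quoted above, the macro is
dnl not available before automake 1.12), so older tool chains still
dnl run configure successfully.
m4_ifdef([AM_PROG_AR], [AM_PROG_AR])
```

This keeps the macro for new automakes that warn when it is missing, while avoiding the hard failure on older ones.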
[OMPI devel] Q: MPI-RTE / ompi_proc_t vs. ompi_process_info_t ?
Hi Ralph,

Question about the MPI-RTE interface change in r29931. The change was not reflected in the "ompi/mca/rte/rte.h" file. I'm curious how the newly added "struct ompi_proc_t" relates to the "struct ompi_process_info_t" that is described in the "rte.h" file? I understand the general motivation for the API change, but it is less clear to me how the information previously defined in the header changes (or does not change)?

Thanks,
--tjn

_________
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Mon, 16 Dec 2013, svn-commit-mai...@open-mpi.org wrote:

Author: rhc (Ralph Castain)
Date: 2013-12-16 22:26:00 EST (Mon, 16 Dec 2013)
New Revision: 29931
URL: https://svn.open-mpi.org/trac/ompi/changeset/29931

Log:
Revert r29917 and replace it with a fix that resolves the thread deadlock while retaining the desired debug info.

In an earlier commit, we had changed the modex accordingly:

* automatically retrieve the hostname (and all RTE info) for all procs during MPI_Init if nprocs < cutoff
* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon the first call to modex_recv for that proc. This would provide the hostname for debugging purposes as we only report errors on messages, and so we must have called modex_recv to get the endpoint info
* BTLs are not to call modex_recv until they need the endpoint info for the first message - i.e., not during add_procs so we don't call it for every process in the job, but only those with whom we communicate

My understanding is that only some BTLs have been modified to meet that third requirement, but those include the Cray ones where jobs are big enough that launch times were becoming an issue. Other BTLs would hopefully be modified as time went on and interest in using them at scale arose. Meantime, those BTLs would call modex_recv on every proc, and we would therefore be no worse than the prior behavior.
This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of the ompi_process_name_t for the proc so that the hostname can be easily inserted. I have advised the ORNL folks of the change.

cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock

Text files modified:
   trunk/ompi/mca/rte/orte/rte_orte.h        |  7 ---
   trunk/ompi/mca/rte/orte/rte_orte_module.c | 27 ++-
   trunk/ompi/proc/proc.c                    | 26 ++
   trunk/ompi/runtime/ompi_module_exchange.c | 10 +-
   4 files changed, 49 insertions(+), 21 deletions(-)
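The eager-vs-lazy rule in the commit message above can be sketched as follows. This is Python pseudocode of the policy only; names like cutoff and modex_recv mirror the description in the log, not actual OMPI symbols:

```python
def init_modex(procs, cutoff, modex_recv):
    """Policy from the commit message: below the cutoff, fetch RTE
    info (hostname etc.) for every proc during MPI_Init; at or above
    it, defer to the first modex_recv for each proc."""
    cache = {}
    if len(procs) < cutoff:
        for p in procs:                  # eager path: small jobs
            cache[p] = modex_recv(p)
    return cache                         # lazy path: start empty

def get_endpoint(cache, proc, modex_recv):
    """First-message path in a BTL: fetch on demand, then reuse, so
    large jobs only pay for the peers they actually talk to."""
    if proc not in cache:
        cache[proc] = modex_recv(proc)
    return cache[proc]
```

The launch-time win for big (e.g. Cray-scale) jobs comes from the lazy path: modex_recv is never called for peers a process does not communicate with.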
Re: [OMPI devel] Q: MPI-RTE / ompi_proc_t vs. ompi_process_info_t ?
Hi Ralph,

OK, thanks for the clarification and code pointers. I'll update "rte.h" to reflect the updates.

Thanks,
--tjn

_____
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Wed, 18 Dec 2013, Ralph Castain wrote:

There is no relation at all between ompi_proc_t and ompi_process_info_t. The ompi_proc_t is defined in the MPI layer and is used in that layer in various places, very much like orte_proc_t is used in the ORTE layer.

If you look in ompi/mca/rte/orte/rte_orte.c, you'll see how we handle the revised function calls. Basically, we use the process name to retrieve the modex data via the opal_db, and then load a pointer to the hostname into the ompi_proc_t proc_hostname field. Thus, the definition of ompi_proc_t remains in the MPI layer. So there was no need to change the ompi/mca/rte/rte.h file, nor to #define anything in the component .h file - just have to modify the wrapper code inside the RTE component itself.

HTH
Ralph

On Dec 18, 2013, at 1:50 PM, Thomas Naughton wrote:

Hi Ralph,

Question about the MPI-RTE interface change in r29931. The change was not reflected in the "ompi/mca/rte/rte.h" file. I'm curious how the newly added "struct ompi_proc_t" relates to the "struct ompi_process_info_t" that is described in the "rte.h" file? I understand the general motivation for the API change, but it is less clear to me how the information previously defined in the header changes (or does not change)?

Thanks,
--tjn

_____
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Mon, 16 Dec 2013, svn-commit-mai...@open-mpi.org wrote:

Author: rhc (Ralph Castain)
Date: 2013-12-16 22:26:00 EST (Mon, 16 Dec 2013)
New Revision: 29931
URL: https://svn.open-mpi.org/trac/ompi/changeset/29931

Log:
Revert r29917 and replace it with a fix that resolves the thread deadlock while retaining the desired debug info.
In an earlier commit, we had changed the modex accordingly:

* automatically retrieve the hostname (and all RTE info) for all procs during MPI_Init if nprocs < cutoff
* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon the first call to modex_recv for that proc. This would provide the hostname for debugging purposes as we only report errors on messages, and so we must have called modex_recv to get the endpoint info
* BTLs are not to call modex_recv until they need the endpoint info for the first message - i.e., not during add_procs so we don't call it for every process in the job, but only those with whom we communicate

My understanding is that only some BTLs have been modified to meet that third requirement, but those include the Cray ones where jobs are big enough that launch times were becoming an issue. Other BTLs would hopefully be modified as time went on and interest in using them at scale arose. Meantime, those BTLs would call modex_recv on every proc, and we would therefore be no worse than the prior behavior.

This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of the ompi_process_name_t for the proc so that the hostname can be easily inserted. I have advised the ORNL folks of the change.

cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock

Text files modified:
   trunk/ompi/mca/rte/orte/rte_orte.h        |  7 ---
   trunk/ompi/mca/rte/orte/rte_orte_module.c | 27 ++-
   trunk/ompi/proc/proc.c                    | 26 ++
   trunk/ompi/runtime/ompi_module_exchange.c | 10 +-
   4 files changed, 49 insertions(+), 21 deletions(-)

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] yesterday commits caused a crash in helloworld with --mca btl tcp, self
Hi,

I'm also seeing some sporadic failures with recent commits to trunk. My tests use a slightly different build/configuration, and use a different rte, but the errors are coming from the OMPI ob1 layer.

works: r31777 (I did not test r31778..r31783)
fails: r31784M (plus manually applied patch from r31786)

My test was something simple:

   cd examples/
   mpicc -g hello_c.c -o hello_c
   mpirun -np 10 hello_c

Again, it is sporadic; I was able to reproduce the failure with different values of '-np' > 1; sometimes np=3, other times np=11. Here's some backtrace / debug info...

Program terminated with signal 11, Segmentation fault.
[New process 7242]
[New process 7255]
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec, btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
139         if( array->bml_btls[i].btl == btl ) {
(gdb) bt
#0  0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec, btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
#1  0xb7a7539f in mca_bml_r2_del_proc_btl (proc=0x80debe8, btl=0xb7a721c0) at bml_r2.c:551
#2  0xb7a757d8 in mca_bml_r2_finalize () at bml_r2.c:648
#3  0xb70c50b8 in mca_pml_ob1_component_fini () at pml_ob1_component.c:290
#4  0xb7f5a755 in mca_pml_v_component_parasite_finalize () at pml_v_component.c:161
#5  0xb7f58c63 in mca_pml_base_finalize () at base/pml_base_frame.c:120
#6  0xb7ec81e1 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:291
#7  0xb7ef1042 in PMPI_Finalize () at pfinalize.c:46
#8  0x0804874d in main (argc=2, argv=0xbfc8d394) at hello_c.c:24
(gdb) p array->bml_btls
$1 = (mca_bml_base_btl_t *) 0x0
(gdb) p btl
$2 = (struct mca_btl_base_module_t *) 0xb7a721c0
(gdb) p *btl
$3 = {btl_component = 0xb7a72240, btl_eager_limit = 131072, btl_rndv_eager_limit = 131072, btl_max_send_size = 262144, btl_rdma_pipeline_send_length = 2147483647, btl_rdma_pipeline_frag_size = 2147483647, btl_min_rdma_pipeline_size = 2147614719, btl_exclusivity = 65536, btl_latency = 0, btl_bandwidth = 100, btl_flags = 10, btl_seg_size = 16, btl_add_procs = 0xb7a6fd9c, btl_del_procs = 0xb7a6fdf9, btl_register = 0, btl_finalize = 0xb7a6fe03, btl_alloc = 0xb7a6fe0d, btl_free = 0xb7a70074, btl_prepare_src = 0xb7a70329, btl_prepare_dst = 0xb7a70702, btl_send = 0xb7a70831, btl_sendi = 0, btl_put = 0xb7a70910, btl_get = 0xb7a70910, btl_dump = 0xb7f35b57, btl_mpool = 0x0, btl_register_error = 0, btl_ft_event = 0xb7a70b00}
(gdb) l
134         struct mca_btl_base_module_t* btl )
135     {
136         size_t i = 0;
137         /* find the btl */
138         for( i = 0; i < array->arr_size; i++ ) {
139             if( array->bml_btls[i].btl == btl ) {
140                 /* make sure not to go out of bounds */
141                 for( ; i < array->arr_size-1; i++ ) {
142                     /* move all btl's back by 1, so the found
143                        btl is "removed" */
(gdb) p array->arr_size
$4 = 69
(gdb) p array->bml_btls
$5 = (mca_bml_base_btl_t *) 0x0

Anyone else seeing problems?

--tjn

_________
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Fri, 16 May 2014, Gilles Gouaillardet wrote:

Folks,

a simple

   mpirun -np 2 -host localhost --mca btl,tcp mpi_helloworld

crashes after some of yesterday's commits (I would blame r31778 and/or r31782, but I am not 100% sure). /* a list receives a negative value, so the program takes some time before crashing; symptoms may vary from one system to another */

I dug into this and found what looks like an old bug/typo in mca_bml_r2_del_procs(). The bug has *not* been introduced by yesterday's commits; I believe this path was not executed before yesterday, which is why we (only) now hit the bug. I fixed this in r31786.

Gilles

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14814.php
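The gdb output above shows the inconsistent state at the heart of the crash: arr_size claims 69 entries while the bml_btls storage is NULL, so the search loop dereferences a null pointer. A toy Python model of the remove loop, with the guard that would make that state survivable (illustration only; the actual fix landed in r31786 in the C bml code):

```python
def btl_array_remove(array, btl):
    """array is a dict {'btls': list-or-None, 'size': int}, standing in
    for mca_bml_base_btl_array_t.  Mirrors the C loop shown in the gdb
    listing: find btl, then shift the tail back by one.  The guard on
    'btls' models what the crashed state was missing: size said 69 but
    the backing storage was NULL."""
    btls = array["btls"]
    if btls is None or array["size"] == 0:   # defensive guard
        return False
    for i in range(array["size"]):
        if btls[i] == btl:                   # pointer compare in C
            for j in range(i, array["size"] - 1):
                btls[j] = btls[j + 1]        # shift the tail down
            array["size"] -= 1
            return True
    return False
```

With the guard, a finalize path that reaches the remove with a stale descriptor fails cleanly instead of segfaulting.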
Re: [OMPI devel] RFC: remove PMI component in OMPI/RTE framework
Hi Ralph,

This component does provide an alternate reference for the ompi-rte framework. But if it is unused (unmaintained), it seems less useful in practice. I'll post another RFC for the related request.

--tjn

_____
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Sun, 25 May 2014, Ralph Castain wrote:

WHAT: remove stale and unmaintained component in ompi/rte framework

WHY: because it is unused, unmaintained, and doesn't even compile?

WHEN: without objections, after telecon on June 9

HOW: svn del ompi/rte/pmi

This was a component added by Brian as a test of the ompi/rte framework while we developed that system. It never really had any purpose other than to provide an alternative to ORTE while we tested the revised integration. So far as we know, nobody ever used it in an actual installation.

Ralph

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14838.php
[OMPI devel] RFC: add STCI component to OMPI/RTE framework
WHAT: add new component to ompi/rte framework

WHY: because it will simplify our maintenance & provide an alt. reference

WHEN: no rush, soon-ish? (June 12?)

This is a component we currently maintain outside of the ompi tree to support using OMPI with an alternate runtime system. It will also provide an alternate component to ORTE, which was the motivation for the PMI component in the related RFC. We build/test nightly, and it occasionally catches ompi-rte abstraction violations, etc.

Thomas

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184
Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework
Inline comments ... way at the bottom. ;-)

--tjn

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Tue, 27 May 2014, Ralph Castain wrote:

On May 27, 2014, at 12:50 PM, Edgar Gabriel wrote:

On 5/27/2014 2:46 PM, Ralph Castain wrote:

On May 27, 2014, at 12:27 PM, Edgar Gabriel wrote:

I'll let ORNL talk about the STCI component itself (which might have additional reasons), but keeping the code in trunk vs. an outside github/mercurial repository has two advantages in my opinion: i) it simplifies the propagation of know-how between the groups,

Afraid I don't understand that - this is just glue, right?

yes, but it's easier to look in one place vs. n places for every feature.

and ii) avoids having to keep a separate branch up to date. (We did the second part with OMPIO for a couple of years, and that was really painful).

Ah, perhaps this is the "rub" - are you saying that you expect us to propagate any changes in the RTE interface to your component? If so, then that violates the original agreement about this framework. It was solely to provide a point-of-interface for *external* groups to connect their RTEs into OMPI. We agreed that we would notify people of changes to that interface, and give them a chance to provide input on those changes - but under no conditions were we willing to accept responsibility for maintaining those branch interfaces.

Given that the interface is wholly contained in the ompi/rte component, I guess I struggle to understand the code conflict issue. There is no change in the OMPI code base that can possibly conflict with your component. The only things that could impact you are changes in the OMPI layer that require modification to your component, which is something you'd have to do regardless. We will not test nor update that component for you.

no, not at all. My point was that we invested enormous efforts at that time to just do the svn merge from the changes on trunk to our branch, that's all.
If you are on a branch that contains an svn checkout of the trunk, plus one component directory in one framework, then I'm afraid I cannot understand how you get merge conflicts. I've been doing this for years and haven't hit one yet. The only possible source of a conflict is if I touch code that is common to the two repos - i.e., outside of the area that I'm adding. In this case, that should never happen, yes? If it does, then you touched code outside your component, and you either (a) are going to encounter this no matter what because you haven't pushed it up yet, or (b) couldn't commit that up to the main repo anyway if it impacted the RTE interface. Sorry, but I'm really struggling to understand how adding only this one component, which you solely modify and control, can possibly help with maintaining your branch.

I can't speak for them, but I know that maintaining our rte/stci component often requires some attention for changes at different levels. Most notably, changes to APIs related to the modex. The "glue" code in the ompi-rte interface is described in the comments of rte.h, but in my experience it generally requires a look at rte/orte/* to know what really changed. Having a few different ompi-rte components in the tree seems like it offers a bit more information about what is required.

It also helps to clarify who is maintaining components when API changes are proposed that affect the RTE layer. These are generally announced and telegraphed, but it can be helpful to just see the directories, a reminder about who's paying attention if you see "rte/stci" and "rte/hpx".

Also, to respond to an earlier comment: we will continue to maintain our code (the rte/stci component), but it does simplify the patches/processing we maintain for integration with Open MPI work, i.e., OMPI-trunk + OMPI-RTE-SCI + STCI. I'd expect this would be the case for the HPX or other instances. I thought this was the strength of the component infrastructure.
For example, the ALPS code is external, but there are alps components in different frameworks, etc. And those who care about ALPS test that path.

--tjn

Thanks
Edgar

In addition, IANAL, but I was actually wondering about the implications of using separate code repositories outside of ompi for sharing code, and whether that is truly still covered by the contributors agreement that we all signed.

Of course not - OMPI's license only declares that anything you push into the main OMPI code repo (and hence, our official releases) is covered by that agreement. Anything you add or distribute externally is on your own. You can *choose* to license that code in accordance with the OMPI license, but you aren't *req
Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework
Sure, if it's helpful I can join a call.

--tjn

_
Thomas Naughton naught...@ornl.gov
Research Associate (865) 576-4184

On Tue, 27 May 2014, Ralph Castain wrote:

Forgot to add: would it help to discuss this over the phone instead?

On May 27, 2014, at 12:56 PM, Ralph Castain wrote:

On May 27, 2014, at 12:50 PM, Edgar Gabriel wrote:

On 5/27/2014 2:46 PM, Ralph Castain wrote:

On May 27, 2014, at 12:27 PM, Edgar Gabriel wrote:

I'll let ORNL talk about the STCI component itself (which might have additional reasons), but keeping the code in trunk vs. an outside github/mercurial repository has two advantages in my opinion: i) it simplifies the propagation of know-how between the groups,

Afraid I don't understand that - this is just glue, right?

yes, but it's easier to look in one place vs. n places for every feature.

and ii) avoids having to keep a separate branch up to date. (We did the second part with OMPIO for a couple of years, and that was really painful).

Ah, perhaps this is the "rub" - are you saying that you expect us to propagate any changes in the RTE interface to your component? If so, then that violates the original agreement about this framework. It was solely to provide a point-of-interface for *external* groups to connect their RTEs into OMPI. We agreed that we would notify people of changes to that interface, and give them a chance to provide input on those changes - but under no conditions were we willing to accept responsibility for maintaining those branch interfaces.

Given that the interface is wholly contained in the ompi/rte component, I guess I struggle to understand the code conflict issue. There is no change in the OMPI code base that can possibly conflict with your component. The only things that could impact you are changes in the OMPI layer that require modification to your component, which is something you'd have to do regardless. We will not test nor update that component for you.

no, not at all.
My point was that we invested enormous efforts at that time to just do the svn merge from the changes on trunk to our branch, that's all.

If you are on a branch that contains an svn checkout of the trunk, plus one component directory in one framework, then I'm afraid I cannot understand how you get merge conflicts. I've been doing this for years and haven't hit one yet. The only possible source of a conflict is if I touch code that is common to the two repos - i.e., outside of the area that I'm adding. In this case, that should never happen, yes? If it does, then you touched code outside your component, and you either (a) are going to encounter this no matter what because you haven't pushed it up yet, or (b) couldn't commit that up to the main repo anyway if it impacted the RTE interface. Sorry, but I'm really struggling to understand how adding only this one component, which you solely modify and control, can possibly help with maintaining your branch.

Thanks
Edgar

In addition, IANAL, but I was actually wondering about the implications of using separate code repositories outside of ompi for sharing code, and whether that is truly still covered by the contributors agreement that we all signed.

Of course not - OMPI's license only declares that anything you push into the main OMPI code repo (and hence, our official releases) is covered by that agreement. Anything you add or distribute externally is on your own. You can *choose* to license that code in accordance with the OMPI license, but you aren't *required* to do so.

Anyway, I don't have strong feelings either way as well; I just would see a couple of advantages (for us) if the code was in the trunk.

I'm still trying to understand those - sorry to be a pain, but my biggest fear at this point is that the perceived advantage is based on a misunderstanding, and I'd like to head that off before it causes problems.
Thanks Edgar On 5/27/2014 1:45 PM, Ralph Castain wrote: I think so long as we leave these components out of any release, there is a limited potential for problems (probably most importantly, we sidestep all the issues about syncing releases!). However, that said, I'm not sure what it gains anyone to include a component that *isn't* going in a release. Nobody outside your organizations is going to build against it - so what did it accomplish to push the code into the repo? Mind you, I'm not saying I'm staunchly opposed - just trying to understand how it benefits anyone. On May 27, 2014, at 11:28 AM, Edgar Gabriel wrote: To throw in my $0.02, I would see a benefit in adding the component to the trunk. As I mentioned in the last teleconf, we are currently working on adding support for the HPX runtime environment
Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework
Hi, Thanks Jeff, I think that was a pretty good summary of things. Thomas indicated there was no rush on the RFC; perhaps we can discuss this next-next-Tuesday (June 10)? Phone discussion seems like a good idea and June 10 sounds good to me. Thanks, --tjn _ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Thu, 29 May 2014, Jeff Squyres (jsquyres) wrote: I refrained from speaking up on this thread because I was on travel, and I wanted to think a bit more about this before I said anything. Let me try to summarize the arguments that have been made so far... A. Things people seem to agree on: 1. Inclusion in trunk has no correlation to being included in a release 2. Prior examples of (effectively) single-organization components B. Reasons to have STCI/HPX/etc. components in SVN trunk: 1. Multiple organizations are asking (ORNL, UTK, UH) 2. Easier to develop/merge the STCI/HPX/etc. components over time 3. Find all alternate RTE components in one place (vs. multiple internet repos) 4. More examples of how to use the RTE framework C. Reasons not to have STCI/HPX/etc. components in the SVN trunk: 1. What is the (technical) gain for being in the trunk? 2. Concerns about external release schedule pressure 3. Why have something on the trunk if it's not eventually destined for a release? In particular, I think B2 and C1 seem to be in conflict with each other. I have several thoughts about this topic, but I'm hesitant to continue this already lengthy thread on a contentious topic. I also don't want to spend the next 30 minutes writing a lengthy, carefully-worded email that will just spawn further lengthy, carefully-worded emails (each costing 15-30 minutes). Prior history has shown that we discuss and resolve issues much more rationally on the phone (vs. email hell). I would therefore like to discuss this on a weekly Tuesday call.
Next week is bad because it's the MPI Forum meeting; I suspect that some -- but not all -- of us will not be on the Tuesday call because we'll be at the Forum. Thomas indicated there was no rush on the RFC; perhaps we can discuss this next-next-Tuesday (June 10)? On May 27, 2014, at 12:25 PM, Thomas Naughton wrote: WHAT: add new component to ompi/rte framework WHY: because it will simplify our maintenance & provide an alt. reference WHEN: no rush, soon-ish? (June 12?) This is a component we currently maintain outside of the ompi tree to support using OMPI with an alternate runtime system. This will also provide an alternate component to ORTE, which was motivation for PMI component in related RFC. We build/test nightly and it occasionally catches ompi-rte abstraction violations, etc. Thomas _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14852.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14904.php
[OMPI devel] Q: Using a hostfile in managed environment?
Hi, We're trying to track down some curious behavior and decided to take a step back and check a base assumption. When running within a managed environment (job allocation): Q: Should you be able to use `--hostfile` or `--host` options to operate on a subset of the resources in the allocation? (Example: within 4 node SLURM allocation, run on just 2 nodes in allocation.) Q: Additionally, should this be the same when launching the DVM in order to run on a subset of resources using subsequent 'mpirun --hnp ...' commands? (Only 'orte-dvm' would need to have `--hostfile` or `--host` args.) There are a variety of interactions with ess/ras/rmaps and the resource manager, but the thought was that you "should" be able to use a hostfile to operate on a subset of the allocation. Is that a flawed assumption? Thanks, --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 ___ devel mailing list devel@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
Re: [OMPI devel] Q: Using a hostfile in managed environment?
Hi Ralph, OK, that's pretty much what I thought but wanted to get a sanity check. :-) I'll see if I can reproduce the issue in a more precise manner and open an issue if I find something off in the mapping. Thanks, --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Fri, 24 Feb 2017, r...@open-mpi.org wrote: On Feb 24, 2017, at 11:57 AM, Thomas Naughton wrote: Hi, We're trying to track down some curious behavior and decided to take a step back and check a base assumption. When running within a managed environment (job allocation): Q: Should you be able to use `--hostfile` or `--host` options to operate on a subset of the resources in the allocation? (Example: within 4 node SLURM allocation, run on just 2 nodes in allocation.) Yes - those options are used to “filter” the allocation prior to launch Q: Additionally, should this be the same when launching the DVM in order to run on a subset of resources using subsequent 'mpirun --hnp ...' commands? (Only 'orte-dvm' would need to have `--hostfile` or `--host` args.) Yes - only the DVM needs to know the filter. When operating with a DVM, “mpirun --hnp...” only packages up the cmd line and sends it to the DVM. All the mapping occurs in orte-dvm. There are a variety of interactions with ess/ras/rmaps and the resource manager, but the thought was that you "should" be able to use a hostfile to operate on a subset of the allocation. Is that a flawed assumption? Thanks, --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 ___ devel mailing list devel@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
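To make the filtering behavior described in this exchange concrete, here is a small sketch. The node names (node01/node02) and application name (./my_app) are hypothetical, and the mpirun/orte-dvm invocations are shown only as comments since they require an OMPI install and an active allocation:

```shell
# Hypothetical sketch of filtering a managed allocation with a hostfile.
# Assumes a 4-node allocation; we want to run on a 2-node subset.
cat > subset_hosts <<'EOF'
node01
node02
EOF

# Filter the allocation down to the two listed nodes:
#   mpirun --hostfile subset_hosts -np 8 ./my_app
# Equivalent filter via --host:
#   mpirun --host node01,node02 -np 8 ./my_app

# DVM case: only orte-dvm needs the filter; subsequent
# "mpirun --hnp ..." commands just package up the cmd line and send it
# to the (already filtered) DVM, which does all the mapping:
#   orte-dvm --hostfile subset_hosts
#   mpirun --hnp <dvm-uri> -np 8 ./my_app

echo "hostfile lists $(grep -c '' subset_hosts) nodes"
```

This matches the answer above: the filter is applied prior to launch, and in the DVM scenario it lives with orte-dvm rather than with each mpirun.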
Re: [OMPI devel] Remove prun tool from OMPI?
Hi Ralph, Is the 'prun' tool required to launch the DVM? I know that at some point things shifted to use 'prun' and didn't require the URI on the command line, but I've not tested in a few months. Thanks, --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Tue, 5 Jun 2018, r...@open-mpi.org wrote: Hey folks Does anyone have heartburn if I remove the “prun” tool from ORTE? I don’t believe anyone is using it, and it doesn’t look like it even works. I ask because the name conflicts with PRRTE and can cause problems when running OMPI against PRRTE Ralph ___ devel mailing list devel@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/devel
Re: [OMPI devel] Remove prun tool from OMPI?
Hi Ralph, All it means is that PRRTE users must be careful to have PRRTE before OMPI in their path values. Otherwise, they get the wrong “prun” and it fails. I suppose I could update the “prun” in OMPI to match the one in PRRTE, if that helps - there isn’t anything incompatible between ORTE and PRRTE. Would that make sense? Yes, if updating "OMPI prun" with latest "PRRTE prun" works ok, that seems like a reasonable way to keep DVM for OMPI usage. I agree that it does seem likely that users could easily get the wrong 'prun' but this may be something that falls out in future (based on discussion on call today). I guess the main point of interest would be to have some method for launching the DVM scenario with OMPI. Another option could be to rename the binary in OMPI? Thanks, --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Tue, 5 Jun 2018, r...@open-mpi.org wrote: I know we were headed that way - it might still work when run against the current ORTE. I can check that and see. If so, then I guess it might be advisable to retain it. All it means is that PRRTE users must be careful to have PRRTE before OMPI in their path values. Otherwise, they get the wrong “prun” and it fails. I suppose I could update the “prun” in OMPI to match the one in PRRTE, if that helps - there isn’t anything incompatible between ORTE and PRRTE. Would that make sense? FWIW: Got a similar complaint from the OpenHPC folks - I gather they also have a “prun”’ in their distribution that they use as an abstraction over all the RM launchers. I’m less concerned about that one, though. On Jun 5, 2018, at 9:55 AM, Thomas Naughton wrote: Hi Ralph, Is the 'prun' tool required to launch the DVM? I know that at some point things shifted to use 'prun' and didn't require the URI on command-line, but I've not tested in few months. 
Thanks, --tjn _________ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Tue, 5 Jun 2018, r...@open-mpi.org wrote: Hey folks Does anyone have heartburn if I remove the “prun” tool from ORTE? I don’t believe anyone is using it, and it doesn’t look like it even works. I ask because the name conflicts with PRRTE and can cause problems when running OMPI against PRRTE Ralph ___ devel mailing list devel@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/devel
Re: [OMPI devel] Remove prun tool from OMPI?
On Tue, 5 Jun 2018, r...@open-mpi.org wrote: On Jun 5, 2018, at 11:59 AM, Thomas Naughton wrote: Hi Ralph, All it means is that PRRTE users must be careful to have PRRTE before OMPI in their path values. Otherwise, they get the wrong “prun” and it fails. I suppose I could update the “prun” in OMPI to match the one in PRRTE, if that helps - there isn’t anything incompatible between ORTE and PRRTE. Would that make sense? Yes, if updating "OMPI prun" with latest "PRRTE prun" works ok, that seems like a reasonable way to keep DVM for OMPI usage. I agree that it does seem likely that users could easily get the wrong 'prun' but this may be something that falls out in future (based on discussion on call today). I guess the main point of interest would be to have some method for launching the DVM scenario with OMPI. Another option could be to rename the binary in OMPI? Yeah, that’s what the OHPC folks did in their distro - they renamed it to “ompi-prun”. If that works for you, then perhaps the best path forward is to do the rename and update it as well. Sounds good to me -- seems like a good way to avoid confusion. And having the 'ompi-prun' be in sync with (prrte) prun will make sure things run properly, i.e., easy to drop in new snapshot of the tool when updating PRRTE snapshots in OMPI. (Or however done in future) Thanks, Ralph! --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 Thanks, --tjn _________ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Tue, 5 Jun 2018, r...@open-mpi.org wrote: I know we were headed that way - it might still work when run against the current ORTE. I can check that and see. If so, then I guess it might be advisable to retain it. All it means is that PRRTE users must be careful to have PRRTE before OMPI in their path values. Otherwise, they get the wrong “prun” and it fails. 
I suppose I could update the “prun” in OMPI to match the one in PRRTE, if that helps - there isn’t anything incompatible between ORTE and PRRTE. Would that make sense? FWIW: Got a similar complaint from the OpenHPC folks - I gather they also have a “prun” in their distribution that they use as an abstraction over all the RM launchers. I’m less concerned about that one, though. On Jun 5, 2018, at 9:55 AM, Thomas Naughton wrote: Hi Ralph, Is the 'prun' tool required to launch the DVM? I know that at some point things shifted to use 'prun' and didn't require the URI on the command line, but I've not tested in a few months. Thanks, --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Tue, 5 Jun 2018, r...@open-mpi.org wrote: Hey folks Does anyone have heartburn if I remove the “prun” tool from ORTE? I don’t believe anyone is using it, and it doesn’t look like it even works. I ask because the name conflicts with PRRTE and can cause problems when running OMPI against PRRTE Ralph ___ devel mailing list devel@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/devel
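The PATH-ordering pitfall Ralph describes is easy to demonstrate. The sketch below creates two stub `prun` scripts (stand-ins for the real PRRTE and OMPI binaries; all paths and output strings here are made up for illustration) and shows that the shell simply runs whichever one comes first in PATH:

```shell
# Two fake installs, each shipping its own "prun".
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/prrte/bin" "$tmp/ompi/bin"
printf '#!/bin/sh\necho "PRRTE prun"\n' > "$tmp/prrte/bin/prun"
printf '#!/bin/sh\necho "OMPI prun"\n'  > "$tmp/ompi/bin/prun"
chmod +x "$tmp/prrte/bin/prun" "$tmp/ompi/bin/prun"

# With PRRTE earlier in PATH, PRRTE's prun is the one that runs...
first=$(env PATH="$tmp/prrte/bin:$tmp/ompi/bin:$PATH" prun)

# ...and with OMPI earlier, the user silently gets the other tool.
second=$(env PATH="$tmp/ompi/bin:$tmp/prrte/bin:$PATH" prun)

echo "PRRTE-first ran: $first"
echo "OMPI-first ran:  $second"
```

Renaming the OMPI binary (e.g., to `ompi-prun`, as the thread below notes OpenHPC did in its distro) sidesteps the collision entirely, since the two names no longer shadow each other in PATH.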
Re: [OMPI devel] Need to know your Github ID
naughtont -> naughtont3 Thanks, --tjn _ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Wed, 10 Sep 2014, Jeff Squyres (jsquyres) wrote: As the next step of the planned migration to Github, I need to know: - Your Github ID (so that you can be added to the new OMPI git repo) - Your SVN ID (so that I can map SVN->Github IDs, and therefore map Trac tickets to appropriate owners) Here's the list of SVN IDs who have committed over the past year -- I'm guessing that most of these people will need Github IDs: adrian alekseys alex alinas amikheev bbenton bosilca (done) bouteill brbarret bwesarg devendar dgoodell (done) edgar eugene ggouaillardet hadi hjelmn hpcchris hppritcha igoru jjhursey (done) jladd jroman jsquyres (done) jurenz kliteyn manjugv miked (done) mjbhaskar mpiteam (done) naughtont osvegis pasha regrant rfaucett rhc (done) rolfv (done) samuel shiqing swise tkordenbrock vasily vvenkates vvenkatesan yaeld yosefe -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2014/09/15788.php
Re: [OMPI devel] [EXTERNAL] Git submodules are coming
Hi Jeff, I'm not sure where the issue templates reside, but it might be useful to add `git submodule status` to the list of commands when reporting issues. (Once the first submodule PR is merged) beaker:$ git submodule status b94e2617df3fd9a3e83c388fa1c691c0057a77e9 opal/mca/pmix/pmix4x/openpmix (v1.1.3-2128-gb94e261) 52d498811f19be5306bd55b8433024733d3b589a prrte (dev-30165-g52d4988) beaker:$ --tjn _____ Thomas Naughton naught...@ornl.gov Research Associate (865) 576-4184 On Tue, 7 Jan 2020, Jeff Squyres (jsquyres) via devel wrote: We now have two PRs pending that will introduce the use of Git submodules (and there are probably more such PRs on the way). At least one of these first two PRs will likely be merged "Real Soon Now". We've been talking about using Git submodules forever. Now we're just about ready. ** *** DEVELOPERS: THIS AFFECTS YOU!! *** ** You cannot just "clone and build" any more: - git clone g...@github.com:open-mpi/ompi.git cd ompi && ./autogen.pl && ./configure ... - You will *have* to initialize the Git submodule(s) -- either during or after the clone. *THEN* you can build Open MPI. Go read this wiki: https://github.com/open-mpi/ompi/wiki/GitSubmodules May the force be with us! -- Jeff Squyres jsquy...@cisco.com
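For reference, the two usual ways to get submodules initialized are cloning with `--recursive`, or running `git submodule update --init --recursive` after a plain clone. The sketch below builds a throwaway superproject/submodule pair locally (repo names and file contents are hypothetical stand-ins for ompi and its openpmix/prrte submodules) to show that a plain clone leaves the submodule directory empty until it is initialized:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# A stand-in "submodule" repo (plays the role of openpmix or prrte).
git init -q sub && cd sub
git config user.email dev@example.com && git config user.name dev
echo payload > file.txt && git add file.txt && git commit -qm init
cd ..

# A stand-in superproject (plays the role of ompi) referencing it.
# (protocol.file.allow=always is needed on newer git for local-path submodules.)
git init -q super && cd super
git config user.email dev@example.com && git config user.name dev
git -c protocol.file.allow=always submodule --quiet add "$tmp/sub" sub
git commit -qm "add submodule"
cd ..

# A plain clone gets the superproject but an EMPTY submodule directory:
git clone -q "$tmp/super" clone && cd clone
test ! -e sub/file.txt && echo "submodule not populated yet"

# ...until the submodule is initialized (or you clone with --recursive):
git -c protocol.file.allow=always submodule --quiet update --init
test -e sub/file.txt && echo "submodule populated"
```

The same `git submodule status` command shown in the transcript above is what reveals which commit each submodule is pinned at, which is why it is handy in issue reports.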