[OMPI devel] ORTE
Over the next month, there will be significant changes to ORTE, both in terms of framework APIs and internal behavior. This work will focus on a few areas:

1. Launch scalability and timing. I try to review our status on this whenever we prepare for the start of a new release series, and as usual this prompted some work in the area. Most of the effort will focus on development of the async modex functionality described in a separate email thread.

2. Access to the BTLs, which we recently agreed to move to the OPAL layer.

3. Memory footprint reduction, particularly the removal or minimization of per-proc data stored in every process (e.g., retaining a complete copy of all modex info for all ranks in each process, regardless of communication needs).

It was my understanding that others interested in ORTE had forked their code bases and were not tracking the main developer's trunk. However, at the recent developers meeting it became apparent that other groups are actually attempting to track the trunk, resolving conflicts behind the scenes. To aid those groups, I thought it might help to outline what will be happening in the near future.

The biggest anticipated changes lie in the modex and RML/OOB areas. I've outlined the async modex changes in a separate email thread. One additional element of that work will be the porting of the "db" (database) framework back from the ORCM project to ORTE. This framework provides a "hook" for researchers working on distributed, high-performance databases to investigate alternative ways of scalably supporting our modex information in a fault-tolerant manner. Eventually, the work in areas such as distributed hash tables (DHTs) used in ORCM may make its way back to OMPI.

In addition to scalability, the modex work is intended to contribute to the memory reduction goal. The primary emphasis here will be on changing from having each process retain complete knowledge of the contact info, locality, etc. for every process in the job, to a strategy of only caching info for processes with which the proc is actually communicating. We may look at removing all per-proc caching of info (perhaps using a shared memory model), but that has performance implications and needs further investigation.

As part of that effort, we will be removing the nidmap/pidmap constructs and storing that info in the same database being used by the modex, thus collapsing the grpcomm and ess APIs by consolidating access to proc-related data in the "db" API (a rough sketch appears at the end of this overview). The grpcomm framework will retain responsibility for executing RTE collective operations such as the modex, but the data will be stored in the db. Likewise, the ess will no longer be used to access data such as a proc's locality - instead, that data will be obtained from the db, however and wherever it is stored.

The modex work is tentatively slated for the 1.7 series, though how much of it gets there remains to be seen. The work is being done in a bitbucket repo: https://bitbucket.org/rhc/ompi-modex

Changes to the RML/OOB are largely driven by the long-standing need to clean up and refactor that code, the need to support async progress on messaging, and the upcoming availability of the BTLs. This code has served us well for quite some time, but the to-do list has grown over the years, including the desire for better support of multi-NIC environments.
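As promised above, here is a minimal sketch of what a consolidated "db" module API might look like, assuming the ORCM port keeps a simple key/value shape. All names, signatures, and include paths below are illustrative assumptions, not the actual code:

/* Illustrative "db" module API: a single place to store and fetch
 * per-proc data (modex entries, contact info, locality) in place of
 * the grpcomm/ess/nidmap access paths. Names, signatures, and include
 * paths are assumptions for illustration, not the actual ORCM port. */

#include "opal/dss/dss_types.h"   /* opal_data_type_t (assumed path) */
#include "orte/types.h"           /* orte_process_name_t (assumed path) */

typedef struct {
    /* store a value under (proc, key); the db makes its own copy */
    int (*store)(const orte_process_name_t *proc, const char *key,
                 const void *data, opal_data_type_t type);

    /* fetch the value for (proc, key); in a distributed implementation
     * this may trigger a remote lookup, so a proc only ever caches the
     * entries it actually uses */
    int (*fetch)(const orte_process_name_t *proc, const char *key,
                 void **data, opal_data_type_t type);

    /* drop cached entries for a proc once we stop communicating with it */
    int (*remove)(const orte_process_name_t *proc, const char *key);
} orte_db_base_module_t;

Under this shape, a locality query that today goes through the ess becomes a fetch of a well-known key, leaving the storage strategy (local hash, shared memory, or a DHT) to whichever component is selected.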
The RML/OOB work will require significant changes to both framework APIs, including:

* the removal of blocking sends (a persistent source of trouble over the years)

* moving receive matching logic to the RML layer, thus simplifying the OOB components and making them look more like the BTLs (a sketch of the resulting send/receive surface follows this list)

* adding a UDP component (ported back from ORCM) to the OOB, along with creating retransmit and flow control support frameworks in OPAL (modeled after the ORCM versions) to handle unreliable transports in both the BTL (which will also receive a UDP component) and the OOB

* converting the OOB to a standalone, multi-select framework that supports multiple transports (i.e., no longer opened and initialized from inside the RML)

* allowing each OOB component to return an array of modules, one for each interface (a la the BTL) - this obviously has implications for the "comm failed" error response, as a failed connection to one OOB module may not mean complete loss of connectivity or process death

* changing the URI construct/parsing methods for the initial contact info that goes on the orted cmd line to reflect the above changes, allowing multiple OOB modules to contribute to it while retaining the ability to limit overall string size

* altering the OOBs to use the modex construct for the exchange of endpoint info

* shifting the routing responsibilities from the RML to the OOB level to accommodate connectionless transports. The OOB module will determine if routing is required and send the message accordingly. When received, the message will
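Taken together, the first two bullets imply a callback-only RML surface along these lines. The shape mirrors ORTE's existing non-blocking calls, but the exact names and signatures here are illustrative, not the final API:

/* Illustrative post-refactor RML surface: non-blocking only, with
 * (peer, tag) matching done in the RML so OOB components just move
 * bytes, BTL-style. The exact signatures are assumptions for
 * illustration, not the final API. */

#include <stdbool.h>
#include "opal/dss/dss_types.h"   /* opal_buffer_t (assumed path) */
#include "orte/types.h"           /* orte_process_name_t, orte_rml_tag_t (assumed) */

typedef void (*orte_rml_send_cbfunc_t)(int status,
                                       orte_process_name_t *peer,
                                       opal_buffer_t *buffer,
                                       orte_rml_tag_t tag,
                                       void *cbdata);

typedef void (*orte_rml_recv_cbfunc_t)(int status,
                                       orte_process_name_t *sender,
                                       opal_buffer_t *buffer,
                                       orte_rml_tag_t tag,
                                       void *cbdata);

/* Queue a buffer for delivery and return immediately; completion is
 * reported via cbfunc. With blocking sends gone, this is the only
 * send path, leaving the library free to make async progress. */
int orte_rml_send_buffer_nb(orte_process_name_t *peer,
                            opal_buffer_t *buffer,
                            orte_rml_tag_t tag,
                            orte_rml_send_cbfunc_t cbfunc, void *cbdata);

/* Post an (optionally persistent) receive; matching on (peer, tag)
 * happens here in the RML rather than in each OOB component. */
int orte_rml_recv_buffer_nb(orte_process_name_t *peer,
                            orte_rml_tag_t tag, bool persistent,
                            orte_rml_recv_cbfunc_t cbfunc, void *cbdata);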
Re: [OMPI devel] RFC: Pineapple Runtime Interposition Project
Hi Josh

I had a chance to review your code this morning, and generally it is okay with me. I see a couple of things that appear to limit it, though they may be intentional:

1. the call to pineapple_init really needs a third flag to define the process type. Locking the underlying orte_init to MPI seems to somewhat defeat your goal of allowing pineapple to be used for non-MPI purposes.

2. the barrier and other collectives are locked to the MPI_Init and MPI_Finalize procedures due to hardcoding of the collective id. You might want to consider altering the API to pass a collective id down so these functions can be used in other places.

Finally, we have to get rid of the "pineapple" name. It seems to me that the primary purpose of this work is to allow ORTE to be used more generally, and to support multiple variants of ORTE within OMPI. So how about calling it the "ORte Abstraction Layer", or ORAL? This would emphasize that we are not trying to create the ultimate generalized RTE abstraction, which I think is important for all the reasons raised at the recent meeting.

HTH
Ralph

On Jun 15, 2012, at 12:55 PM, Josh Hursey wrote:

> What: A Runtime Interposition Project - Codename Pineapple
>
> Why: Define clear API and semantics for runtime requirements of the OMPI layer.
>
> When:
> - F June 22, 2012 - Work completed
> - T June 26, 2012 - Discuss on teleconf
> - R June 28, 2012 - Commit to trunk
>
> Where: Trunk (development BitBucket branch below)
> https://bitbucket.org/jjhursey/ompi-pineapple
>
> Attached: PDF of slides presented on the June 12, 2012 teleconf. Note that the timeline was slightly adjusted above (work-completed date moved earlier).
>
> Description: Short Version
> --
> Define, in an 'rte.h', the interfaces and semantics that the OMPI layer requires of a runtime environment. Currently this interface matches the subset of ORTE functionality that is used by the OMPI layer. Runtime symbols (e.g., orte_ess.proc_get_locality) are isolated to a framework inside this project to provide linker-level protection against accidental breakage of the pineapple interposition layer.
>
> The interposition project provides researchers working on side projects above and below the 'rte.h' interface a single location in the code base to watch for interface and semantic changes that they need to be concerned about. Researchers working above the pineapple layer might explore something other than (or in addition to) OMPI (e.g., Extended OMPI, UPC+OMPI). Researchers working below the pineapple layer might explore something other than (or in addition to) ORTE under OMPI (e.g., specialized runtimes for specific environments).
>
> Description: Other notes
>
> The pineapple interface provides OMPI developers with a runtime API to program against without requiring detailed knowledge of the layout of ORTE and its frameworks. In some places in OMPI, a single source file needs to include >5 (up to 12 in one place) different header files to get all of the necessary symbols. Developers must not only know where these headers are, but must also understand the differences between the various frameworks in ORTE to use ORTE. The developer must also be aware that certain APIs and data structure fields are not available to the MPI process, and so should not be used. The pineapple project provides an API representing the small subset of ORTE that is used by OMPI.
> With this API, a developer only needs to look at a single location in the code base to understand what is provided by the runtime for use in the OMPI layer.
>
> A similar statement could be made for runtime developers trying to figure out what the OMPI layer requires from a runtime environment. Currently they need a deep understanding of the behavior of ORTE to understand the semantics of various calls to ORTE from the OMPI layer. Then they must develop a custom patch for the OMPI layer that extracts the ORTE symbols and replaces them with their own symbols. This process is messy, error prone, and tedious, to say the least. Having a single set of interfaces and semantics will allow such developers to focus their efforts on supporting the Open MPI community-defined API, and not necessarily the evolution of the ORTE or OMPI project internals. This is advantageous when porting Open MPI to an environment with a full-featured runtime already running on the machine, and for researchers exploring radical runtime designs for future systems. The pineapple API allows such projects to develop beside the mainline Open MPI trunk a little more easily than without the pineapple API.
>
> FAQ:
>
> (1) Why is this a separate project and not a framework of OMPI? or a framework of ORTE?
>
> After much deliberation between the developers, from a software engineering perspective, making the pineapple rte.h
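For concreteness, Ralph's two suggestions above might translate into header changes along these lines. This is a minimal sketch; apart from pineapple_init itself, every name here is invented for illustration and is not taken from the bitbucket branch:

/* pineapple.h (illustrative excerpt) - reflects the two suggestions in
 * Ralph's reply; everything except pineapple_init itself is an
 * invented name, not the actual branch contents. */

/* 1. A process-type argument so init is not locked to MPI procs */
typedef enum {
    PINEAPPLE_PROC_MPI,     /* classic MPI application process        */
    PINEAPPLE_PROC_NON_MPI, /* generic process using the RTE directly */
    PINEAPPLE_PROC_TOOL     /* tool/daemon process                    */
} pineapple_proc_type_t;

int pineapple_init(int *argc, char ***argv, pineapple_proc_type_t type);

/* 2. Pass the collective id down rather than hardcoding the
 * MPI_Init/MPI_Finalize ids, so the barrier is usable elsewhere */
int pineapple_barrier(int coll_id);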
Re: [OMPI devel] RFC: hide btl segment keys within btl
Looks good to me. I would add some checks regarding the number and size of the segments and the allocated space (MCA_BTL_SEG_MAX_SIZE) to make sure we never hit the corner case where there are too many segments compared with the available space. And add a huge comment in btl.h about the fact that mca_btl_base_segment_t should be used with extreme care.

  george.

On Jun 14, 2012, at 18:42, Jeff Squyres wrote:

> This sounds like a good thing to me. +1
>
> On Jun 13, 2012, at 12:58 PM, Nathan Hjelm wrote:
>
>> What: hide btl segment keys from PML/OSC code.
>>
>> Why: As it stands, new BTLs with larger segment keys (smcuda, for example) require changes in both OSC/rdma and the PMLs. This RFC will make changes in segment keys transparent to all btl users.
>>
>> When: The changes are very straightforward, so I am setting the timeout for this to June 22, 2012.
>>
>> Where: See the attached patch or check out the bitbucket:
>> http://bitbucket.org/hjelmn/ompi-btl-interface-update
>>
>> All the relevant PMLs/BTLs + OSC/rdma have been updated, with the exception of btl/wv. I have also tested the following components:
>> - ob1
>> - csum
>> - bfo
>> - ugni (now works with MPI one-sided operations)
>> - sm
>> - vader
>> - openib (in progress)
>>
>> Brian and Rolf, please take a look at your components and let me know if I screwed anything up.
>>
>> -Nathan Hjelm
>> HPC-3, LANL
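The check George suggests might look something like the following - a hedged sketch assuming MCA_BTL_SEG_MAX_SIZE bounds the space reserved for a serialized segment; the guard macro and the smcuda segment type name are illustrative, not the actual patch:

/* Compile-time guard of the kind George suggests: fail the build if a
 * BTL's extended segment type outgrows the space reserved by
 * MCA_BTL_SEG_MAX_SIZE. The negative array size makes the typedef
 * illegal whenever the condition is false (pre-C11 static assert).
 * The macro and the smcuda segment type name are illustrative. */

#include "ompi/mca/btl/btl.h"   /* MCA_BTL_SEG_MAX_SIZE (assumed path) */

#define MCA_BTL_SEGMENT_SIZE_CHECK(type)                                \
    typedef char type##_size_check                                      \
        [(sizeof(type) <= MCA_BTL_SEG_MAX_SIZE) ? 1 : -1]

/* e.g., next to the smcuda segment definition: */
MCA_BTL_SEGMENT_SIZE_CHECK(mca_btl_smcuda_segment_t);

A similar runtime assertion on the segment count, before packing a descriptor, would cover the "too many segments" half of the corner case.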