[OMPI devel] ORTE
Over the next month, there will be significant changes to ORTE, both in terms of framework APIs and internal behavior. This work will focus on a few areas:

1. Launch scalability and timing. I try to review our status on this whenever we prepare for the start of a new release series, and as usual this prompted some work in the area. Most of the effort will focus on development of the async modex functionality described in a separate email thread.

2. Access to the BTLs, which we recently agreed to move to the OPAL layer.

3. Memory footprint reduction, particularly the removal or minimization of per-proc data stored in every process (e.g., retaining a complete copy of all modex info for all ranks in each process, regardless of communication needs).

It was my understanding that others interested in ORTE had forked their code bases and were not tracking the main developer's trunk. However, at the recent developers meeting it became apparent that other groups are actually attempting to track the trunk, resolving conflicts behind the scenes. To aid those groups, I thought it might help to outline what will be happening in the near future.

The biggest anticipated changes lie in the modex and RML/OOB areas. I've outlined the async modex changes in a separate email thread. One additional element of that work will be the porting of the "db" (database) framework back from the ORCM project to ORTE. This framework provides a "hook" for researchers working on distributed, high-performance databases to investigate alternative ways of scalably supporting our modex information in a fault-tolerant manner. Eventually, the work in areas such as distributed hash tables (DHTs) used in ORCM may make its way back to OMPI.

In addition to scalability, the modex work is intended to contribute to the memory reduction goal. The primary emphasis here will be on changing from having each process retain complete knowledge of the contact info, locality, etc. for every process in the job, to a strategy of only caching info for processes with which the proc is actually communicating. We may look at removing all per-proc caching of info (perhaps using a shared memory model), but that has performance implications and needs further investigation.

As part of that effort, we will be removing the nidmap/pidmap constructs and storing that info in the same database being used by the modex, thus collapsing the grpcomm and ess APIs by consolidating access to proc-related data in the "db" API (a rough sketch appears at the end of this overview). The grpcomm framework will retain responsibility for executing RTE collective operations such as the modex, but the data will be stored in the db. Likewise, the ess will no longer be used to access data such as a proc's locality - instead, that data will be obtained from the db, however and wherever it is stored.

The modex work is tentatively slated for the 1.7 series, though how much of it gets there remains to be seen. The work is being done in a bitbucket repo: https://bitbucket.org/rhc/ompi-modex

Changes to the RML/OOB are largely driven by the long-standing need to clean up and refactor that code, the need to support async progress on messaging, and the upcoming availability of the BTLs. This code has served us well for quite some time, but the to-do list has grown over the years, including the desire for better support of multi-NIC environments.
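As promised above, here is a minimal sketch of what a consolidated "db" module API might look like, assuming the ORCM port keeps a simple key/value shape. All names, signatures, and include paths below are illustrative assumptions, not the actual code:

/* Illustrative "db" module API: a single place to store and fetch
 * per-proc data (modex entries, contact info, locality) in place of
 * the grpcomm/ess/nidmap access paths. Names, signatures, and include
 * paths are assumptions for illustration, not the actual ORCM port. */

#include "opal/dss/dss_types.h"   /* opal_data_type_t (assumed path) */
#include "orte/types.h"           /* orte_process_name_t (assumed path) */

typedef struct {
    /* store a value under (proc, key); the db makes its own copy */
    int (*store)(const orte_process_name_t *proc, const char *key,
                 const void *data, opal_data_type_t type);

    /* fetch the value for (proc, key); in a distributed implementation
     * this may trigger a remote lookup, so a proc only ever caches the
     * entries it actually uses */
    int (*fetch)(const orte_process_name_t *proc, const char *key,
                 void **data, opal_data_type_t type);

    /* drop cached entries for a proc once we stop communicating with it */
    int (*remove)(const orte_process_name_t *proc, const char *key);
} orte_db_base_module_t;

Under this shape, a locality query that today goes through the ess becomes a fetch of a well-known key, leaving the storage strategy (local hash, shared memory, or a DHT) to whichever component is selected.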
The RML/OOB work will require significant changes to both framework APIs, including:

* the removal of blocking sends (a persistent source of trouble over the years)

* moving receive matching logic to the RML layer, thus simplifying the OOB components and making them look more like the BTLs (a sketch of the resulting send/receive surface follows this list)

* adding a UDP component (ported back from ORCM) to the OOB, along with creating retransmit and flow control support frameworks in OPAL (modeled after the ORCM versions) to handle unreliable transports in both the BTL (which will also receive a UDP component) and the OOB

* converting the OOB to a standalone, multi-select framework that supports multiple transports (i.e., no longer opened and initialized from inside the RML)

* allowing each OOB component to return an array of modules, one for each interface (a la the BTL) - this obviously has implications for the "comm failed" error response, as a failed connection to one OOB module may not mean complete loss of connectivity or process death

* changing the URI construct/parsing methods for the initial contact info that goes on the orted cmd line to reflect the above changes, allowing multiple OOB modules to contribute to it while retaining the ability to limit overall string size

* altering the OOBs to use the modex construct for the exchange of endpoint info

* shifting the routing responsibilities from the RML to the OOB level to accommodate connectionless transports. The OOB module will determine if routing is required and send the message accordingly. When received, the message will
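Taken together, the first two bullets imply a callback-only RML surface along these lines. The shape mirrors ORTE's existing non-blocking calls, but the exact names and signatures here are illustrative, not the final API:

/* Illustrative post-refactor RML surface: non-blocking only, with
 * (peer, tag) matching done in the RML so OOB components just move
 * bytes, BTL-style. The exact signatures are assumptions for
 * illustration, not the final API. */

#include <stdbool.h>
#include "opal/dss/dss_types.h"   /* opal_buffer_t (assumed path) */
#include "orte/types.h"           /* orte_process_name_t, orte_rml_tag_t (assumed) */

typedef void (*orte_rml_send_cbfunc_t)(int status,
                                       orte_process_name_t *peer,
                                       opal_buffer_t *buffer,
                                       orte_rml_tag_t tag,
                                       void *cbdata);

typedef void (*orte_rml_recv_cbfunc_t)(int status,
                                       orte_process_name_t *sender,
                                       opal_buffer_t *buffer,
                                       orte_rml_tag_t tag,
                                       void *cbdata);

/* Queue a buffer for delivery and return immediately; completion is
 * reported via cbfunc. With blocking sends gone, this is the only
 * send path, leaving the library free to make async progress. */
int orte_rml_send_buffer_nb(orte_process_name_t *peer,
                            opal_buffer_t *buffer,
                            orte_rml_tag_t tag,
                            orte_rml_send_cbfunc_t cbfunc, void *cbdata);

/* Post an (optionally persistent) receive; matching on (peer, tag)
 * happens here in the RML rather than in each OOB component. */
int orte_rml_recv_buffer_nb(orte_process_name_t *peer,
                            orte_rml_tag_t tag, bool persistent,
                            orte_rml_recv_cbfunc_t cbfunc, void *cbdata);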
Re: [OMPI devel] RFC: Pineapple Runtime Interposition Project
Hi Josh

I had a chance to review your code this morning, and generally it is okay with me. I see a couple of things that appear to limit it, though they may be intentional:

1. the call to pineapple_init really needs a third flag to define the process type. Locking the underlying orte_init to MPI seems to somewhat defeat your goal of allowing pineapple to be used for non-MPI purposes.

2. the barrier and other collectives are locked to the MPI_Init and MPI_Finalize procedures due to hardcoding of the collective id. You might want to consider altering the API to pass a collective id down so these functions can be used in other places.

Finally, we have to get rid of the "pineapple" name. It seems to me that the primary purpose of this work is to allow ORTE to be used more generally, and to support multiple variants of ORTE within OMPI. So how about calling it the "ORte Abstraction Layer", or ORAL? This would emphasize that we are not trying to create the ultimate generalized RTE abstraction, which I think is important for all the reasons raised at the recent meeting.

HTH
Ralph

On Jun 15, 2012, at 12:55 PM, Josh Hursey wrote:

> What: A Runtime Interposition Project - Codename Pineapple
>
> Why: Define clear API and semantics for runtime requirements of the OMPI layer.
>
> When:
> - F June 22, 2012 - Work completed
> - T June 26, 2012 - Discuss on teleconf
> - R June 28, 2012 - Commit to trunk
>
> Where: Trunk (development BitBucket branch below)
> https://bitbucket.org/jjhursey/ompi-pineapple
>
> Attached: PDF of slides presented on the June 12, 2012 teleconf. Note that the timeline was slightly adjusted above (work-completed date moved earlier).
>
> Description: Short Version
> --
> Define, in an 'rte.h', the interfaces and semantics that the OMPI layer requires of a runtime environment. Currently this interface matches the subset of ORTE functionality that is used by the OMPI layer. Runtime symbols (e.g., orte_ess.proc_get_locality) are isolated to a framework inside this project to provide linker-level protection against accidental breakage of the pineapple interposition layer.
>
> The interposition project provides researchers working on side projects above and below the 'rte.h' interface a single location in the code base to watch for interface and semantic changes that they need to be concerned about. Researchers working above the pineapple layer might explore something other than (or in addition to) OMPI (e.g., Extended OMPI, UPC+OMPI). Researchers working below the pineapple layer might explore something other than (or in addition to) ORTE under OMPI (e.g., specialized runtimes for specific environments).
>
> Description: Other notes
>
> The pineapple interface provides OMPI developers with a runtime API to program against without requiring detailed knowledge of the layout of ORTE and its frameworks. In some places in OMPI, a single source file needs to include >5 (up to 12 in one place) different header files to get all of the necessary symbols. Developers must not only know where these headers are, but must also understand the differences between the various frameworks in ORTE to use ORTE. The developer must also be aware that certain APIs and data structure fields are not available to the MPI process, and so should not be used. The pineapple project provides an API representing the small subset of ORTE that is used by OMPI.
> With this API, a developer only needs to look at a single location in the code base to understand what is provided by the runtime for use in the OMPI layer.
>
> A similar statement could be made for runtime developers trying to figure out what the OMPI layer requires from a runtime environment. Currently they need a deep understanding of the behavior of ORTE to understand the semantics of various calls to ORTE from the OMPI layer. Then they must develop a custom patch for the OMPI layer that extracts the ORTE symbols and replaces them with their own symbols. This process is messy, error prone, and tedious, to say the least. Having a single set of interfaces and semantics will allow such developers to focus their efforts on supporting the Open MPI community-defined API, and not necessarily the evolution of the ORTE or OMPI project internals. This is advantageous when porting Open MPI to an environment with a full-featured runtime already running on the machine, and for researchers exploring radical runtime designs for future systems. The pineapple API allows such projects to develop beside the mainline Open MPI trunk a little more easily than without the pineapple API.
>
> FAQ:
>
> (1) Why is this a separate project and not a framework of OMPI? or a framework of ORTE?
>
> After much deliberation between the developers, from a software engineering perspective, making the pineapple rte.h
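For concreteness, Ralph's two suggestions above might translate into header changes along these lines. This is a minimal sketch; apart from pineapple_init itself, every name here is invented for illustration and is not taken from the bitbucket branch:

/* pineapple.h (illustrative excerpt) - reflects the two suggestions in
 * Ralph's reply; everything except pineapple_init itself is an
 * invented name, not the actual branch contents. */

/* 1. A process-type argument so init is not locked to MPI procs */
typedef enum {
    PINEAPPLE_PROC_MPI,     /* classic MPI application process        */
    PINEAPPLE_PROC_NON_MPI, /* generic process using the RTE directly */
    PINEAPPLE_PROC_TOOL     /* tool/daemon process                    */
} pineapple_proc_type_t;

int pineapple_init(int *argc, char ***argv, pineapple_proc_type_t type);

/* 2. Pass the collective id down rather than hardcoding the
 * MPI_Init/MPI_Finalize ids, so the barrier is usable elsewhere */
int pineapple_barrier(int coll_id);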
Re: [OMPI devel] RFC: hide btl segment keys within btl
Looks good to me. I would add some checks regarding the number and size of the segments and the allocated space (MCA_BTL_SEG_MAX_SIZE) to make sure we never hit the corner case where there are too many segments compared with the available space. And add a huge comment in btl.h about the fact that mca_btl_base_segment_t should be used with extreme care.

  george.

On Jun 14, 2012, at 18:42, Jeff Squyres wrote:

> This sounds like a good thing to me. +1
>
> On Jun 13, 2012, at 12:58 PM, Nathan Hjelm wrote:
>
>> What: hide btl segment keys from PML/OSC code.
>>
>> Why: As it stands, new BTLs with larger segment keys (smcuda, for example) require changes in both OSC/rdma and the PMLs. This RFC will make changes in segment keys transparent to all btl users.
>>
>> When: The changes are very straightforward, so I am setting the timeout for this to June 22, 2012.
>>
>> Where: See the attached patch or check out the bitbucket:
>> http://bitbucket.org/hjelmn/ompi-btl-interface-update
>>
>> All the relevant PMLs/BTLs + OSC/rdma have been updated, with the exception of btl/wv. I have also tested the following components:
>> - ob1
>> - csum
>> - bfo
>> - ugni (now works with MPI one-sided operations)
>> - sm
>> - vader
>> - openib (in progress)
>>
>> Brian and Rolf, please take a look at your components and let me know if I screwed anything up.
>>
>> -Nathan Hjelm
>> HPC-3, LANL
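The check George suggests might look something like the following - a hedged sketch assuming MCA_BTL_SEG_MAX_SIZE bounds the space reserved for a serialized segment; the guard macro and the smcuda segment type name are illustrative, not the actual patch:

/* Compile-time guard of the kind George suggests: fail the build if a
 * BTL's extended segment type outgrows the space reserved by
 * MCA_BTL_SEG_MAX_SIZE. The negative array size makes the typedef
 * illegal whenever the condition is false (pre-C11 static assert).
 * The macro and the smcuda segment type name are illustrative. */

#include "ompi/mca/btl/btl.h"   /* MCA_BTL_SEG_MAX_SIZE (assumed path) */

#define MCA_BTL_SEGMENT_SIZE_CHECK(type)                                \
    typedef char type##_size_check                                      \
        [(sizeof(type) <= MCA_BTL_SEG_MAX_SIZE) ? 1 : -1]

/* e.g., next to the smcuda segment definition: */
MCA_BTL_SEGMENT_SIZE_CHECK(mca_btl_smcuda_segment_t);

A similar runtime assertion on the segment count, before packing a descriptor, would cover the "too many segments" half of the corner case.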