Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup

Jeff Squyres (jsquyres) Thu, 18 Jul 2013 21:26:17 -0400

+1, but I helped come up with the idea.  :-)


On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:

> What: Change the ompi_proc_t endpoint data lookup to be more flexible
> 
> Why: As collectives and one-sided components are using transports
> directly, an old problem of endpoint tracking is resurfacing.  We need a
> fix that doesn't suck.
> 
> When: Assuming there are no major objections, I'll start writing the code
> next week...
> 
> More Info: 
> 
> Today, endpoint information is stored in one of two places on the
> ompi_proc_t: proc_pml and proc_bml.  The proc_pml pointer is an opaque
> structure having meaning only to the PML and the proc_bml pointer is an
> opaque structure having meaning only to the BML.  CM, OB1, and BFO don't
> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
> R2 uses the proc_bml to hold an opaque data structure which holds all the
> btl endpoint data.
> 
> The specific problem is the Portals 4 collective and one-sided components.
> They both need endpoint information for communication (obviously).
> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
> knew what it looked like, and were ok.  Now the data they need is possibly
> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
> 
> Jeff and I talked about this and had a number of restrictions that seemed
> to make sense for a solution:
> 
>  * Don't make ompi_proc_t bigger than absolutely necessary
>  * Avoid adding extra indirection into the endpoint resolution path
>  * Allow enough flexibility that IB or friends could use the same
> mechanism
>  * Don't break the BML / BTL interface (too much work)
> 
> What we came up with was a two pronged approach, depending on run-time
> needs.
> 
> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
> would have a proc_endpoint[] array of fixed size.  The size of the array
> would be determined at compile time based on compile-time registering of
> endpoint slots.  At compile time, a #define with a component's slot would
> be set, removing any extra indexing overhead over today's mechanism.  So
> R2 would have a call in its configure.m4 like:
> 
>  OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)
> 
> And would then find its endpoint data with a call like:
> 
>  r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
> 
> which (assuming modest compiler optimization) is instruction equivalent to:
> 
>  r2_endpoint = proc->proc_bml;
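
A minimal sketch of the static lookup described above. The struct is a
simplified stand-in for ompi_proc_t; the tag values, the
OMPI_ENDPOINT_TAG_DYNAMIC and OMPI_PROC_ENDPOINT_TAG_MAX names, and the
sketch_* helpers are all assumptions made for illustration (in the proposal
the tag #defines would be generated at configure time).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values; in the proposal these would come from
 * OMPI_REQUIRE_ENDPOINT_TAG() calls in each component's configure.m4. */
#define OMPI_ENDPOINT_TAG_BML_R2   0
#define OMPI_ENDPOINT_TAG_DYNAMIC  1  /* assumed name: last slot, dynamic table */
#define OMPI_PROC_ENDPOINT_TAG_MAX 2  /* assumed name: array size */

/* Simplified stand-in for ompi_proc_t (not the real structure). */
typedef struct sketch_proc_t {
    void *proc_endpoint[OMPI_PROC_ENDPOINT_TAG_MAX];
} sketch_proc_t;

/* The index is a compile-time constant, so the access is a load at a
 * fixed offset from proc, just like reading a named pointer field. */
static inline void *sketch_get_r2_endpoint(const sketch_proc_t *proc)
{
    assert(NULL != proc);
    return proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
}

static inline void sketch_set_r2_endpoint(sketch_proc_t *proc, void *endpoint)
{
    assert(NULL != proc);
    proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] = endpoint;
}
```

Because the tag is a constant, the array access needs no run-time index
arithmetic; the slot's offset is fixed when the code is compiled.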
> 
> To allow for dynamic indexing (something we haven't had to date), the last
> entry in the array would be a pointer to an object like an
> opal_pointer_array, but without the locking, and some allocation calls
> during init.  Since the indexes never need to be used by a remote process,
> there's no synchronization required in registering.  The dynamic indexing
> could be turned off at configure time for space-conscious builds.  For
> example, on our big systems, I disable dlopen support, so static
> allocation of endpoint slots is good enough.
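
One way the dynamic side could look, as a rough sketch only: a lazily
allocated, unlocked, growable pointer table that would hang off the last
proc_endpoint slot. Every name here (sketch_dyn_table_t and the sketch_*
functions) is hypothetical, and the code assumes tags are registered during
single-threaded init, which is why a plain counter is used.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable table of endpoint pointers (no locking). */
typedef struct sketch_dyn_table_t {
    void **slots;
    int    size;
} sketch_dyn_table_t;

static int sketch_next_dynamic_tag = 0;

/* Tags are process-local, so a plain counter is enough (assumes
 * registration happens during single-threaded init). */
static int sketch_register_dynamic_tag(void)
{
    return sketch_next_dynamic_tag++;
}

/* Store an endpoint, allocating or growing the table lazily.
 * Returns 0 on success, -1 on allocation failure. */
static int sketch_dyn_set(sketch_dyn_table_t **tablep, int tag, void *endpoint)
{
    assert(NULL != tablep && tag >= 0);
    sketch_dyn_table_t *table = *tablep;
    if (NULL == table) {
        table = calloc(1, sizeof(*table));
        if (NULL == table) return -1;
        *tablep = table;
    }
    if (tag >= table->size) {
        int new_size = tag + 1;
        void **tmp = realloc(table->slots, (size_t)new_size * sizeof(void *));
        if (NULL == tmp) return -1;
        for (int i = table->size; i < new_size; ++i) tmp[i] = NULL;
        table->slots = tmp;
        table->size = new_size;
    }
    table->slots[tag] = endpoint;
    return 0;
}

/* Look up an endpoint; an unallocated table or unset slot yields NULL. */
static void *sketch_dyn_get(const sketch_dyn_table_t *table, int tag)
{
    if (NULL == table || tag < 0 || tag >= table->size) return NULL;
    return table->slots[tag];
}
```

A dynamic lookup goes through the table pointer and then the slot, which is
the extra indirection that the static tags avoid.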
> 
> In the average build, the only tag registered would be BML_R2.  If we lazy
> allocate the pointer array element, that's two entries in the
> proc_endpoint array, so the same size as today.  I was going to have the
> CM stop using the endpoint and push that handling on the MTL.  Assuming
> all MTLs but Portals shared the same tag (easy to do), there'd be an
> 8*nprocs byte increase in space used per process if an MTL was built, but if
> you disabled R2, that disappears.
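
The space arithmetic above can be checked with a small helper. This is only
an illustration: sketch_endpoint_bytes is a made-up name, and the figure of
8 bytes per slot assumes 64-bit pointers.

```c
#include <stddef.h>

/* Bytes spent on endpoint slots across all peers in one process:
 * one pointer per slot, one slot array per ompi_proc_t. */
static size_t sketch_endpoint_bytes(size_t nslots, size_t nprocs)
{
    return nslots * sizeof(void *) * nprocs;
}
```

With two slots (BML_R2 plus the lazily allocated dynamic pointer) the total
matches today's proc_pml and proc_bml pair; adding a third slot for a shared
MTL tag costs one more pointer per peer, i.e. 8*nprocs bytes on a 64-bit
build.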
> 
> How does this solve my problem?  Rather than having Portals 4 use the MTL
> tag, it would have its own tag, shared between the MTL, BTL, OSC, and
> COLL components.  Since the chances of Portals 4 being built on a platform
> with support for another MTL is almost zero, in most cases, the size of
> the ompi_proc_t only increases by 8 bytes over today's setup.  Since most
> Portals 4 builds will be on more static platforms, I can disable dynamic
> indexing and be back at today's size, but with an easy way to deal with
> endpoint data sharing between components of different frameworks.
> 
> So, to review our original goals:
> 
>  * ompi_proc_t will remain the same size on most platforms, increase by
> 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
> static systems (by disabling dynamic indexing and building only one of
> either the MTLs or BMLs).
>  * If you're using a pre-allocated tag, there's no extra indirection or
> math, assuming basic compiler optimization.  There is a higher cost for
> dynamic tags, but that's probably ok for us.
>  * I think that IB could start registering a tag if it needed one for sharing
> QP information between frameworks, at the cost of an extra tag.  Probably
> makes the most sense for the MXM case (assuming someone writes an MXM osc
> component).
>  * The PML interface would change slightly (remove about 5 lines of code
> / pml).  The MTL would have to change a bit to look at their own tag
> instead of the proc_pml (fairly easy).  The R2 BML would need to change to
> use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that
> shouldn't be hard.  The consumers of the BML (OB1, BFO, RDMA OSC, etc.)
> would not have to change.
> 
> I know RFCs are usually sent after the code is written, but I wanted some
> thoughts before I started coding, since it's kind of a high impact change
> to a performance-critical piece of OMPI.
> 
> Thoughts?
> 
> Brian
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
