What: Change the ompi_proc_t endpoint data lookup to be more flexible

Why: As collectives and one-sided components are using transports
directly, an old problem of endpoint tracking is resurfacing.  We need a
fix that doesn't suck.

When: Assuming there are no major objections, I'll start writing the code
next week...

More Info: 

Today, endpoint information is stored in one of two places on the
ompi_proc_t: proc_pml and proc_bml.  The proc_pml pointer is an opaque
structure having meaning only to the PML, and the proc_bml pointer is an
opaque structure having meaning only to the BML.  CM, OB1, and BFO don't
use proc_pml, although the MTLs store their endpoint data on the proc_pml.
R2 uses the proc_bml to hold an opaque data structure which holds all the
BTL endpoint data.

The specific problem is the Portals 4 collective and one-sided components.
They both need endpoint information for communication (obviously).
Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
knew what it looked like, and were ok.  Now the data they need is possibly
in the proc_pml or in the (opaque) proc_bml, which poses a problem.

Jeff and I talked about this and had a number of restrictions that seemed
to make sense for a solution:

  * Don't make ompi_proc_t bigger than absolutely necessary
  * Avoid adding extra indirection into the endpoint resolution path
  * Allow enough flexibility that IB or friends could use the same mechanism
  * Don't break the BML / BTL interface (too much work)

What we came up with was a two-pronged approach, depending on run-time

First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
would have a proc_endpoint[] array of fixed size.  The size of the array
would be determined at compile time based on compile-time registering of
endpoint slots.  At compile time, a #define with a component's slot would
be set, removing any extra indexing overhead over today's mechanism.  So
R2 would have a call in its configure.m4 like:


And would then find its endpoint data with a call like:

  r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];

which (assuming modest compiler optimization) is instruction equivalent to:

  r2_endpoint = proc->proc_bml;
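
As a rough illustration of the static-slot idea (not the actual patch), the
sketch below reduces ompi_proc_t to the one field under discussion.  The
names static_sketch_proc_t, SKETCH_TAG_BML_R2, SKETCH_TAG_COUNT, and
static_sketch_r2_endpoint are made up for this example, and the tag values
are written by hand here; in the proposal they would come from configure.

```c
#include <stddef.h>

/* Hypothetical tag values.  In the proposal these #defines would be
 * generated at configure time from each component's slot registration. */
#define SKETCH_TAG_BML_R2 0
#define SKETCH_TAG_COUNT  1   /* number of static slots in this build */

/* Stand-in for ompi_proc_t, keeping only the proposed endpoint array. */
typedef struct static_sketch_proc_t {
    void *proc_endpoint[SKETCH_TAG_COUNT];
} static_sketch_proc_t;

/* The index is a compile-time constant, so the lookup is a single load
 * at a fixed offset from proc, just like reading a named pointer field. */
static inline void *static_sketch_r2_endpoint(const static_sketch_proc_t *proc)
{
    return proc->proc_endpoint[SKETCH_TAG_BML_R2];
}
```

Because the tag is a constant known to the compiler, no index arithmetic
should survive into the generated code at ordinary optimization levels.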

To allow for dynamic indexing (something we haven't had to date), the last
entry in the array would be a pointer to an object like an
opal_pointer_array, but without the locking, and some allocation calls
during init.  Since the indexes never need to be used by a remote process,
there's no synchronization required in registering.  The dynamic indexing
could be turned off at configure time for space-conscious builds.  For
example, on our big systems, I disable dlopen support, so static
allocation of endpoint slots is good enough.
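
One way the dynamic half could look is sketched below.  This is only a
guess at the shape: dyn_sketch_table_t, dyn_sketch_register_tag,
dyn_sketch_set, and dyn_sketch_get are hypothetical names, the table is a
plain growable array rather than anything derived from opal_pointer_array,
and it assumes tags are registered single-threaded during init.

```c
#include <stddef.h>
#include <stdlib.h>

#define DYN_SKETCH_STATIC_SLOTS 1                        /* e.g. BML_R2 only */
#define DYN_SKETCH_DYNAMIC_SLOT DYN_SKETCH_STATIC_SLOTS  /* last array entry */

/* Lock-free (because unshared) growable table hanging off the last slot. */
typedef struct dyn_sketch_table_t {
    size_t size;
    void **slots;
} dyn_sketch_table_t;

typedef struct dyn_sketch_proc_t {
    void *proc_endpoint[DYN_SKETCH_STATIC_SLOTS + 1];
} dyn_sketch_proc_t;

/* Tags are process-local and never sent to a peer, so a plain counter
 * is enough; no synchronization with remote processes is needed. */
static int dyn_sketch_next_tag = 0;

static int dyn_sketch_register_tag(void)
{
    return dyn_sketch_next_tag++;
}

/* Store an endpoint under a dynamic tag, allocating the table lazily so
 * procs that never use dynamic tags pay only for one NULL pointer. */
static int dyn_sketch_set(dyn_sketch_proc_t *proc, int tag, void *endpoint)
{
    dyn_sketch_table_t *t = proc->proc_endpoint[DYN_SKETCH_DYNAMIC_SLOT];
    if (tag < 0) return -1;
    if (NULL == t) {
        t = calloc(1, sizeof(*t));
        if (NULL == t) return -1;
        proc->proc_endpoint[DYN_SKETCH_DYNAMIC_SLOT] = t;
    }
    if ((size_t) tag >= t->size) {
        size_t new_size = (size_t) tag + 1;
        void **grown = realloc(t->slots, new_size * sizeof(void *));
        if (NULL == grown) return -1;
        for (size_t i = t->size; i < new_size; ++i) grown[i] = NULL;
        t->slots = grown;
        t->size = new_size;
    }
    t->slots[tag] = endpoint;
    return 0;
}

/* Dynamic lookup: one extra pointer chase plus a bounds check. */
static void *dyn_sketch_get(const dyn_sketch_proc_t *proc, int tag)
{
    const dyn_sketch_table_t *t = proc->proc_endpoint[DYN_SKETCH_DYNAMIC_SLOT];
    if (NULL == t || tag < 0 || (size_t) tag >= t->size) return NULL;
    return t->slots[tag];
}
```

The extra dereference and bounds check in dyn_sketch_get are the "higher
cost" of dynamic tags relative to the fixed-offset load of a static slot.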

In the average build, the only tag registered would be BML_R2.  If we
lazily allocate the pointer array element, that's two entries in the
proc_endpoint array, so the same size as today.  I was going to have the
CM stop using the endpoint and push that handling onto the MTL.  Assuming
all MTLs but Portals shared the same tag (easy to do), there'd be an
8*nprocs byte increase in space used per process if an MTL was built, but
if you disabled R2, that disappears.

How does this solve my problem?  Rather than having Portals 4 use the MTL
tag, it would have its own tag, shared between the MTL, BTL, OSC, and
COLL components.  Since the chance of Portals 4 being built on a platform
with support for another MTL is almost zero, in most cases the size of
the ompi_proc_t only increases by 8 bytes over today's setup.  Since most
Portals 4 builds will be on more static platforms, I can disable dynamic
indexing and be back at today's size, but with an easy way to deal with
endpoint data sharing between components of different frameworks.
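
To make the sharing concrete, here is a small sketch of a Portals 4 build
with dynamic indexing disabled.  Everything named p4_sketch_* is invented
for illustration, the endpoint struct is a placeholder rather than the real
Portals 4 process type, and "first component to initialize publishes the
endpoint" is an assumption about how the components might cooperate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical static tags for a build with R2 and Portals 4 only. */
enum {
    P4_SKETCH_TAG_BML_R2   = 0,
    P4_SKETCH_TAG_PORTALS4 = 1,
    P4_SKETCH_TAG_COUNT    = 2   /* no dynamic slot in this build */
};

/* Placeholder for whatever the Portals 4 components agree to share. */
typedef struct p4_sketch_endpoint_t {
    uint32_t nid;
    uint32_t pid;
} p4_sketch_endpoint_t;

typedef struct p4_sketch_proc_t {
    void *proc_endpoint[P4_SKETCH_TAG_COUNT];
} p4_sketch_proc_t;

/* Called by whichever Portals 4 component (MTL, BTL, OSC, COLL) sets up
 * the peer first; later callers leave the existing endpoint in place. */
static void p4_sketch_publish(p4_sketch_proc_t *proc, p4_sketch_endpoint_t *ep)
{
    if (NULL == proc->proc_endpoint[P4_SKETCH_TAG_PORTALS4]) {
        proc->proc_endpoint[P4_SKETCH_TAG_PORTALS4] = ep;
    }
}

/* Any Portals 4 component can read the endpoint without knowing the
 * layout of the PML's or BML's private data. */
static p4_sketch_endpoint_t *p4_sketch_lookup(const p4_sketch_proc_t *proc)
{
    return proc->proc_endpoint[P4_SKETCH_TAG_PORTALS4];
}
```

With two static tags and no dynamic slot, the array holds two pointers,
which matches the two pointers (proc_pml and proc_bml) carried today.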

So, to review our original goals:

  * ompi_proc_t will remain the same size on most platforms, increase by
    8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
    static systems (by disabling dynamic indexing and building only one of
    either the MTLs or BMLs).
  * If you're using a pre-allocated tag, there's no extra indirection or
    math, assuming basic compiler optimization.  There is a higher cost for
    dynamic tags, but that's probably ok for us.
  * I think that IB could start registering a tag if it needed one for
    sharing QP information between frameworks, at the cost of an extra tag.
    Probably makes the most sense for the MXM case (assuming someone writes
    an MXM osc component).
  * The PML interface would change slightly (remove about 5 lines of code
    per PML).  The MTLs would have to change a bit to look at their own tag
    instead of the proc_pml (fairly easy).  The R2 BML would need to change
    to use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but
    that shouldn't be hard.  The consumers of the BML (OB1, BFO, RDMA OSC,
    etc.) would not have to change.

I know RFCs are usually sent after the code is written, but I wanted some
thoughts before I started coding, since it's kind of a high impact change
to a performance-critical piece of OMPI.



