+1, but I helped come up with the idea. :-)
On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:

> What: Change the ompi_proc_t endpoint data lookup to be more flexible
>
> Why: As collectives and one-sided components are using transports
> directly, an old problem of endpoint tracking is resurfacing. We need a
> fix that doesn't suck.
>
> When: Assuming there are no major objections, I'll start writing the code
> next week...
>
> More Info:
>
> Today, endpoint information is stored in one of two places on the
> ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque
> structure having meaning only to the PML and the proc_bml pointer is an
> opaque structure having meaning only to the BML. CM, OB1, and BFO don't
> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
> R2 uses the proc_bml to hold an opaque data structure which holds all the
> btl endpoint data.
>
> The specific problem is the Portals 4 collective and one-sided components.
> They both need endpoint information for communication (obviously).
> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
> knew what it looked like, and were ok. Now the data they need is possibly
> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
>
> Jeff and I talked about this and had a number of restrictions that seemed
> to make sense for a solution:
>
> * Don't make ompi_proc_t bigger than absolutely necessary
> * Avoid adding extra indirection into the endpoint resolution path
> * Allow enough flexibility that IB or friends could use the same
>   mechanism
> * Don't break the BML / BTL interface (too much work)
>
> What we came up with was a two pronged approach, depending on run-time
> needs.
>
> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
> would have a proc_endpoint[] array of fixed size. The size of the array
> would be determined at compile time based on compile-time registering of
> endpoint slots.
> At compile time, a #define with a component's slot would
> be set, removing any extra indexing overhead over today's mechanism. So
> R2 would have a call in its configure.m4 like:
>
>     OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)
>
> And would then find its endpoint data with a call like:
>
>     r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
>
> which (assuming modest compiler optimization) is instruction equivalent to:
>
>     r2_endpoint = proc->proc_bml;
>
> To allow for dynamic indexing (something we haven't had to date), the last
> entry in the array would be a pointer to an object like an
> opal_pointer_array, but without the locking, and some allocation calls
> during init. Since the indexes never need to be used by a remote process,
> there's no synchronization required in registering. The dynamic indexing
> could be turned off at configure time for space-conscious builds. For
> example, on our big systems, I disable dlopen support, so static
> allocation of endpoint slots is good enough.
>
> In the average build, the only tag registered would be BML_R2. If we lazy
> allocate the pointer array element, that's two entries in the
> proc_endpoint array, so the same size as today. I was going to have the
> CM stop using the endpoint and push that handling on the MTL. Assuming
> all MTLs but Portals shared the same tag (easy to do), there'd be an
> 8*nprocs increase in space used per process if an MTL was built, but if
> you disabled R2, that disappears.
>
> How does this solve my problem? Rather than having Portals 4 use the MTL
> tag, it would have its own tag, shared between the MTL, BTL, OSC, and
> COLL components. Since the chance of Portals 4 being built on a platform
> with support for another MTL is almost zero, in most cases, the size of
> the ompi_proc_t only increases by 8 bytes over today's setup.
> Since most Portals 4 builds will be on more static platforms, I can
> disable dynamic indexing and be back at today's size, but with an easy
> way to deal with endpoint data sharing between components of different
> frameworks.
>
> So, to review our original goals:
>
> * ompi_proc_t will remain the same size on most platforms, increase by
>   8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
>   static systems (by disabling dynamic indexing and building only one of
>   either the MTLs or BMLs).
> * If you're using a pre-allocated tag, there's no extra indirection or
>   math, assuming basic compiler optimization. There is a higher cost for
>   dynamic tags, but that's probably ok for us.
> * I think that IB could start registering a tag if it needed for sharing
>   QP information between frameworks, at the cost of an extra tag. Probably
>   makes the most sense for the MXM case (assuming someone writes an MXM osc
>   component).
> * The PML interface would change slightly (remove about 5 lines of code
>   / pml). The MTL would have to change a bit to look at their own tag
>   instead of the proc_pml (fairly easy). The R2 BML would need to change to
>   use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that
>   shouldn't be hard. The consumers of the BML (OB1, BFO, RDMA OSC, etc.)
>   would not have to change.
>
> I know RFCs are usually sent after the code is written, but I wanted some
> thoughts before I started coding, since it's kind of a high impact change
> to a performance-critical piece of OMPI.
>
> Thoughts?
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories

--
Jeff Squyres
jsquy...@cisco.com
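
For readers following the quoted RFC, here is a minimal sketch of the two lookup paths it describes: fixed slots indexed by compile-time constants, plus a lazily allocated table hung off the last slot for dynamic tags. This is an illustration under stated assumptions, not the code that was written for Open MPI. Only OMPI_ENDPOINT_TAG_BML_R2 and the proc_endpoint field name come from the RFC; every other name (sketch_proc_t, sketch_dyn_table_t, the PORTALS4, DYNAMIC, and MAX tags, and the three helper functions) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Tags that the RFC says would be generated at configure time by
 * OMPI_REQUIRE_ENDPOINT_TAG(...). Only the BML_R2 tag name appears in
 * the RFC; the other three are hypothetical, for illustration only. */
enum {
    OMPI_ENDPOINT_TAG_BML_R2   = 0,
    OMPI_ENDPOINT_TAG_PORTALS4 = 1, /* hypothetical shared Portals 4 tag */
    OMPI_ENDPOINT_TAG_DYNAMIC  = 2, /* hypothetical: last slot, dynamic table */
    OMPI_ENDPOINT_TAG_MAX      = 3  /* hypothetical: number of fixed slots */
};

/* Hypothetical stand-in for "an object like an opal_pointer_array,
 * but without the locking". */
typedef struct {
    size_t size;
    void **slots;
} sketch_dyn_table_t;

/* Hypothetical stand-in for the relevant part of ompi_proc_t. */
typedef struct {
    void *proc_endpoint[OMPI_ENDPOINT_TAG_MAX];
} sketch_proc_t;

/* Tags are process-local, so a plain counter is enough (assumes
 * registration happens from a single thread during init). */
static int sketch_next_dynamic_tag = 0;

static int sketch_register_dynamic_tag(void)
{
    return sketch_next_dynamic_tag++;
}

/* Store an endpoint under a dynamic tag, allocating the table lazily
 * and growing it as needed. Returns 0 on success, -1 on failure. */
static int sketch_set_dynamic_endpoint(sketch_proc_t *proc, int tag, void *ep)
{
    assert(NULL != proc);
    if (tag < 0) {
        return -1;
    }
    sketch_dyn_table_t *t = proc->proc_endpoint[OMPI_ENDPOINT_TAG_DYNAMIC];
    if (NULL == t) {
        t = calloc(1, sizeof(*t));
        if (NULL == t) {
            return -1;
        }
        proc->proc_endpoint[OMPI_ENDPOINT_TAG_DYNAMIC] = t;
    }
    if ((size_t)tag >= t->size) {
        size_t new_size = (size_t)tag + 1;
        void **grown = realloc(t->slots, new_size * sizeof(void *));
        if (NULL == grown) {
            return -1;
        }
        for (size_t i = t->size; i < new_size; i++) {
            grown[i] = NULL;
        }
        t->slots = grown;
        t->size = new_size;
    }
    t->slots[tag] = ep;
    return 0;
}

/* Dynamic lookup: one extra pointer chase plus a bounds check. */
static void *sketch_get_dynamic_endpoint(const sketch_proc_t *proc, int tag)
{
    const sketch_dyn_table_t *t = proc->proc_endpoint[OMPI_ENDPOINT_TAG_DYNAMIC];
    if (NULL == t || tag < 0 || (size_t)tag >= t->size) {
        return NULL;
    }
    return t->slots[tag];
}
```

A static tag is read directly, as in `proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2]`, which is a constant-offset load just like a named struct field. A dynamic tag goes through `sketch_get_dynamic_endpoint()`, which is where the "higher cost for dynamic tags" in the RFC would show up.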