What: Change the ompi_proc_t endpoint data lookup to be more flexible

Why: As collective and one-sided components are using transports directly, an old problem of endpoint tracking is resurfacing. We need a fix that doesn't suck.
When: Assuming there are no major objections, I'll start writing the code next week...

More Info:

Today, endpoint information is stored in one of two places on the ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque structure with meaning only to the PML, and the proc_bml pointer is an opaque structure with meaning only to the BML. CM, OB1, and BFO don't use proc_pml, although the MTLs store their endpoint data on the proc_pml. R2 uses the proc_bml to hold an opaque data structure which holds all the BTL endpoint data.

The specific problem is the Portals 4 collective and one-sided components. They both need endpoint information for communication (obviously). Before there was a Portals 4 BTL, they peeked at the proc_pml pointer, knew what it looked like, and were ok. Now the data they need is possibly in the proc_pml or in the (opaque) proc_bml, which poses a problem.

Jeff and I talked about this and had a number of restrictions that seemed to make sense for a solution:

  * Don't make ompi_proc_t bigger than absolutely necessary
  * Avoid adding extra indirection into the endpoint resolution path
  * Allow enough flexibility that IB or friends could use the same mechanism
  * Don't break the BML / BTL interface (too much work)

What we came up with was a two-pronged approach, depending on run-time needs. First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we would have a proc_endpoint[] array of fixed size. The size of the array would be determined at compile time, based on compile-time registration of endpoint slots. At compile time, a #define with a component's slot would be set, removing any extra indexing overhead over today's mechanism. So R2 would have a call in its configure.m4 like:

  OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)

and would then find its endpoint data with a call like:

  r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];

which (assuming modest compiler optimization) is instruction-equivalent to:

  r2_endpoint = proc->proc_bml;

To allow for dynamic indexing (something we haven't had to date), the last entry in the array would be a pointer to an object like an opal_pointer_array, but without the locking, plus some allocation calls during init. Since the indexes never need to be used by a remote process, there's no synchronization required when registering. The dynamic indexing could be turned off at configure time for space-conscious builds. For example, on our big systems, I disable dlopen support, so static allocation of endpoint slots is good enough. (Sketches of both halves follow at the end of this section.)

In the average build, the only tag registered would be BML_R2. If we lazily allocate the pointer array element, that's two entries in the proc_endpoint array, so the same size as today. I was going to have the CM stop using the endpoint and push that handling onto the MTLs. Assuming all MTLs but Portals shared the same tag (easy to do), there'd be an 8*nprocs increase in space used per process if an MTL was built, but if you disabled R2, that disappears.

How does this solve my problem? Rather than having Portals 4 use the MTL tag, it would have its own tag, shared between the MTL, BTL, OSC, and COLL components. Since the chances of Portals 4 being built on a platform with support for another MTL are almost zero, in most cases the size of the ompi_proc_t only increases by 8 bytes over today's setup.
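To make the static half concrete, here's a minimal sketch of what the generated header and the lookup could look like. This is illustrative only: apart from proc_endpoint, OMPI_REQUIRE_ENDPOINT_TAG, and OMPI_ENDPOINT_TAG_BML_R2 from above, every name (the tag values, the array-size macro, the Portals 4 tag, the accessor function) is hypothetical, not a final design:

  /* Hypothetical header generated at configure time from each
   * component's OMPI_REQUIRE_ENDPOINT_TAG() call: one slot per
   * registered tag, plus one trailing slot reserved for the
   * dynamic index table. */
  #define OMPI_ENDPOINT_TAG_BML_R2    0
  #define OMPI_ENDPOINT_TAG_PORTALS4  1   /* shared by MTL/BTL/OSC/COLL */
  #define OMPI_ENDPOINT_TAG_DYNAMIC   2   /* last slot: dynamic table  */
  #define OMPI_ENDPOINT_ARRAY_SIZE    3

  struct ompi_proc_t {
      /* ... existing fields ... */
      /* replaces the proc_pml and proc_bml pointers */
      void *proc_endpoint[OMPI_ENDPOINT_ARRAY_SIZE];
  };

  static inline void *mca_bml_r2_get_endpoint(struct ompi_proc_t *proc)
  {
      /* The index is a compile-time constant, so this is a single
       * constant-offset load -- the same instruction as the old
       * proc->proc_bml dereference. */
      return proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
  }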
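The dynamic half could then be as simple as the following sketch, which builds on the struct above (again, every function and variable name here is made up for illustration). Registration is process-local, so handing out indexes needs no synchronization, and lazily allocating the per-proc table keeps ompi_proc_t at two endpoint entries -- today's size -- when nothing registers dynamically:

  #include <stdlib.h>

  /* Process-local counter: dynamic indexes are never exchanged with
   * remote processes, so handing them out requires no locking. */
  static int ompi_endpoint_num_dynamic = 0;

  int ompi_endpoint_register_dynamic(void)
  {
      return ompi_endpoint_num_dynamic++;
  }

  /* Lazily allocate the per-proc table in the reserved last slot.
   * A real version would grow the table like an opal_pointer_array
   * (minus the locking); this sketch sizes it once for simplicity. */
  static void **dynamic_table(struct ompi_proc_t *proc)
  {
      void **table = (void **) proc->proc_endpoint[OMPI_ENDPOINT_TAG_DYNAMIC];
      if (NULL == table) {
          table = calloc(ompi_endpoint_num_dynamic, sizeof(void *));
          proc->proc_endpoint[OMPI_ENDPOINT_TAG_DYNAMIC] = table;
      }
      return table;
  }

  void *ompi_endpoint_get_dynamic(struct ompi_proc_t *proc, int idx)
  {
      return dynamic_table(proc)[idx];
  }

  void ompi_endpoint_set_dynamic(struct ompi_proc_t *proc, int idx, void *ep)
  {
      dynamic_table(proc)[idx] = ep;
  }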
Since most Portals 4 builds will be on more static platforms, I can disable dynamic indexing and be back at today's size, but with an easy way to deal with endpoint data sharing between components of different frameworks.

So, to review our original goals:

  * ompi_proc_t will remain the same size on most platforms, increase by 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on static systems (by disabling dynamic indexing and building only one of either the MTLs or BMLs).
  * If you're using a pre-allocated tag, there's no extra indirection or math, assuming basic compiler optimization. There is a higher cost for dynamic tags, but that's probably ok for us.
  * I think that IB could start registering a tag if it needed to share QP information between frameworks, at the cost of an extra tag. That probably makes the most sense for the MXM case (assuming someone writes an MXM OSC component).
  * The PML interface would change slightly (removing about 5 lines of code per PML). The MTLs would have to change a bit to look at their own tag instead of the proc_pml (fairly easy). The R2 BML would need to change to use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that shouldn't be hard. The consumers of the BML (OB1, BFO, RDMA OSC, etc.) would not have to change.

I know RFCs are usually sent after the code is written, but I wanted some thoughts before I started coding, since it's kind of a high-impact change to a performance-critical piece of OMPI.

Thoughts?

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories