On Aug 30, 2008, at 10:56 AM, Eugene Loh wrote:
There's probably some law of software engineering that applies
here. Basically, upon first read, I was filled with bitter
resentment against those who had written the code. :^) Then, as I
began to feel mastery over its, um, intricacies -- to feel that I,
too, was becoming a member of the inner cabal -- I began to feel
pride and a desire to protect the code against intrusive
overhaul. :^)
:-)
I did peek at some of the Open MPI papers, and they talked about
Open MPI's modular design. The idea is that someone should be able
to play with one component of the architecture without having to
become an expert in the whole thing. The reality I seem to be
facing is that to understand one part (like the sm BTL), I have to
understand many parts (mpool, allocator, common, etc.) and the only
way to do so is to read code, step through with debugger, and ask
experts.
Ya, I think the reality turned out to be that maintaining software-layer
abstractions became more important to us than connectedness. In short,
yes, you have to have a bunch of general and/or specific knowledge, but
you can still muck around in your component fairly independently of the
rest of the system.
I believe the main rationale for doing page-size alignments was
memory affinity, since (at least on Linux; I don't know about
Solaris) you can only affinity-ize whole pages.
Solaris maps on a per-page basis.
On your big 512 proc machines, I'm assuming that the page memory
affinity will matter...?
You mean for latency? I could imagine so, but don't know for sure.
I'm no expert on this stuff. Theoretically, I could imagine a
system where some of this stuff might fly from cache-to-cache, with
the location of the backing memory not being relevant.
If locality did matter, I could imagine two reasonable choices:
FIFOs being local to the sender or to the receiver -- with the best
choice depending on the system.
To be honest, I forget the specifics; I think our FIFOs are local to
the receiver. I'm pretty sure that affinity will matter, even if
they're cache-to-cache operations.
That being said, we're certainly open to making things better.
E.g., if a few procs share a memory locality (can you detect that
in Solaris?), have them share a page or somesuch...?
Yes, I believe you can detect these things in Solaris.
This might be nifty to do. We *may* be able to leverage the existing
paffinity and/or maffinity frameworks to discover this information and
then have processes that share the same local memory be able to share
local pages intelligently. This would be a step in the right
direction, no?
I could imagine splitting the global shared memory segment up per
process. This might have two advantages:
*) If the processes are bound and there is some sort of first-touch
policy, you could manage memory locality just by having the right
process make the allocation. No need for page alignment of tiny
allocations.
I think even first-touch will make *the whole page* be local to the
process that touches it. So if you have each process take N bytes
(where N << page_size), then the 0th process will make that whole page
be local; it may be remote for others.
*) You wouldn't need to control memory allocations with a lock
(except for multithreaded apps). I haven't looked at this too
closely yet, but the 3*n*n memory allocations in shared memory
during MPI_Init are currently serialized, which sounds disturbing
when n is 100 to 500 local processes.
If I'm understanding your proposal right, you're saying that each
process would create its own shared memory space, right? Then any
other process that wants to send to that process would mmap/shmattach/
whatever to the receiver's shared memory space. Right?
(placeholder - no edit intended)
The total amount of shared memory will likely not go down, because the
OS will still likely allocate on a per-page basis, right? But per
your 2nd point, would the resources required for each process to mmap/
shmattach/whatever 511 other processes' shared memory spaces be
prohibitive?
Graham, Richard L. wrote:
I have not looked at the code in a long time, so not sure how many
things have changed ... In general what you are suggesting is
reasonable. However, especially on large machines you also need to
worry about memory locality, so should allocate from memory pools
that are appropriately located. I expect that memory allocated on
a per-socket basis would do.
Is this what "maffinity" and "memory nodes" are about? If so, I
would think memory locality should be handled there rather than in
page alignment of individual 12-byte and 64-byte allocations.
maffinity was a first stab at memory affinity; it is currently (and
has been for a long, long time) no-frills, without a lot of thought
put into it.
I see the "node id" and "bind" functions in there; I think Gleb must
have added them somewhere along the way. I'm not sure how much
thought was put into making those be truly generic functions (I see
them implemented in libnuma, which AFAIK is Linux-specific). Does
Solaris have memory affinity function calls?
--
Jeff Squyres
Cisco Systems