On Aug 30, 2008, at 10:56 AM, Eugene Loh wrote:
There's probably some law of software engineering that applies
here. Basically, upon first read, I was filled with bitter
resentment against those who had written the code. :^) Then, as I
began to feel mastery over its, um, intricacies -- to feel that I,
too, was becoming a member of the inner cabal -- I began to feel
pride and a desire to protect the code against intrusive
overhaul. :^)
:-)
I did peek at some of the Open MPI papers, and they talked about
Open MPI's modular design. The idea is that someone should be able
to play with one component of the architecture without having to
become an expert in the whole thing. The reality I seem to be
facing is that to understand one part (like the sm BTL), I have to
understand many parts (mpool, allocator, common, etc.) and the only
way to do so is to read code, step through with debugger, and ask
experts.
Ya, I think the reality turned out to be that maintaining software-layer
abstractions became more important to us than connectedness. In short,
yes, you have to have a bunch of general and/or specific knowledge, but
you can still muck around in your component fairly independently of the
rest of the system.
I believe the main rationale for doing page-size alignments was
memory affinity, since (at least on Linux; I don't know about
Solaris) you can only affinity-ize whole pages.
Solaris maps on a per-page basis.
On your big 512 proc machines, I'm assuming that the page memory
affinity will matter...?
You mean for latency? I could imagine so, but don't know for sure.
I'm no expert on this stuff. Theoretically, I could imagine a
system where some of this stuff might fly from cache-to-cache, with
the location of the backing memory not being relevant.
If locality did matter, I could imagine two reasonable choices:
FIFOs being local to the sender or to the receiver -- with the best
choice depending on the system.
To be honest, I forget the specifics; I think our FIFOs are local to
the receiver. I'm pretty sure that affinity will matter, even if
they're cache-to-cache operations.
That being said, we're certainly open to making things better.
E.g., if a few procs share a memory locality (can you detect that
in Solaris?), have them share a page or somesuch...?
Yes, I believe you can detect these things in Solaris.
This might be nifty to do. We *may* be able to leverage the existing
paffinity and/or maffinity frameworks to discover this information and
then have processes that share the same local memory be able to share
local pages intelligently. This would be a step in the right
direction, no?
I could imagine splitting the global shared memory segment up per
process. This might have two advantages:
*) If the processes are bound and there is some sort of first-touch
policy, you could manage memory locality just by having the right
process make the allocation. No need for page alignment of tiny
allocations.
I think even first-touch will make *the whole page* be local to the
process that touches it. So if you have each process take N bytes
(where N << page_size), then the 0th process will make that whole page
be local; it may be remote for others.
*) You wouldn't need to control memory allocations with a lock
(except for multithreaded apps). I haven't looked at this too
closely yet, but the 3*n*n memory allocations in shared memory
during MPI_Init are currently serialized, which sounds disturbing
when n is 100 to 500 local processes.
If I'm understanding your proposal right, you're saying that each
process would create its own shared memory space, right? Then any
other process that wants to send to that process would mmap/shmattach/
whatever to the receiver's shared memory space. Right?
(placeholder - no edit intended)
The total amount of shared memory will likely not go down, because the
OS will still likely allocate on a per-page basis, right? But per
your 2nd point, would the resources required for each process to mmap/
shmattach/whatever 511 other processes' shared memory spaces be
prohibitive?
Graham, Richard L. wrote:
I have not looked at the code in a long time, so not sure how many
things have changed ... In general what you are suggesting is
reasonable. However, especially on large machines you also need to
worry about memory locality, so should allocate from memory pools
that are appropriately located. I expect that memory allocated on
a per-socket basis would do.
Is this what "maffinity" and "memory nodes" are about? If so, I
would think memory locality should be handled there rather than in
page alignment of individual 12-byte and 64-byte allocations.
maffinity was a first stab at memory affinity; it is currently (and
has been for a long, long time) no-frills, without a lot of thought
put into it.
I see the "node id" and "bind" functions in there; I think Gleb must
have added them somewhere along the way. I'm not sure how much
thought was put into making those be truly generic functions (I see
them implemented in libnuma, which AFAIK is Linux-specific). Does
Solaris have memory affinity function calls?
--
Jeff Squyres
Cisco Systems