(I'm new to Open MPI.)
I'm looking at the sm BTL.
In mca_btl_sm_add_procs(), there's a loop over peer processes, with a
call to ompi_fifo_init(). That is, one call to ompi_fifo_init() for
each connection (sender/receiver pair).
In ompi_fifo_init(), there's an allocation of
sizeof(ompi_cb_fifo_wrapper_t), and a call to ompi_cb_fifo_init(), which
in turn has two allocations: one of a bunch of pointers and another of
sizeof(ompi_cb_fifo_ctl_t).
In short, for each connection, there are three allocations:
*) sizeof(ompi_cb_fifo_wrapper_t)... about 64 bytes on LP64
*) a bunch of pointers... about 1 Kbyte on LP64
*) sizeof(ompi_cb_fifo_ctl_t)... about 12 bytes
Put another way: for N local processes, there are N*(N-1) connections
and so 3*N*(N-1) allocations, most of which are 64 bytes or smaller.
BUT, in ompi_fifo_init() and ompi_cb_fifo_init(), we ask for page
alignment of each allocation. On top of that, mca_mpool_sm_alloc()
again forces the allocation onto a page boundary.
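
To make the padding concrete, here is a minimal sketch (not Open MPI code) that rounds the three per-connection sizes up to a given alignment. The helper names round_up() and per_connection_bytes() are hypothetical, the sizes are the approximate LP64 figures listed above, and the 64-byte cacheline is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align.
 * align must be a nonzero power of two. */
size_t round_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Bytes consumed by the three per-connection allocations when each one
 * is padded to the given alignment. Sizes are the approximate figures
 * quoted in this mail, not measured values. */
size_t per_connection_bytes(size_t align)
{
    const size_t wrapper  = 64;    /* ~sizeof(ompi_cb_fifo_wrapper_t) */
    const size_t pointers = 1024;  /* ~the array of pointers */
    const size_t ctl      = 12;    /* ~sizeof(ompi_cb_fifo_ctl_t) */

    return round_up(wrapper, align)
         + round_up(pointers, align)
         + round_up(ctl, align);
}
```

Under these assumptions, per_connection_bytes(8192) is three full pages per connection, while per_connection_bytes(64) is 64 + 1024 + 64 bytes.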
These per-connection allocations therefore become very costly as the
number of local processes increases. For 8K pages, for example, and
100 on-node processes, we're talking roughly 3*100*100*8K = 240
Mbytes. For 512 on-node processes (yes, we have nodes this big), that's
6 Gbyte... most of which is unused. (E.g., allocating more than an 8K
page when we only need 64 or 12 bytes.)
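
The estimate above can be written as a small function. This is a sketch of the back-of-the-envelope arithmetic only: sm_fifo_footprint() is a hypothetical name, it uses the same N*N approximation as the numbers above (rather than N*(N-1)), and it assumes each of the three allocations occupies exactly bytes_per_alloc bytes after alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Rough shared-memory footprint of the per-connection FIFO allocations:
 * 3 allocations per connection, about nprocs*nprocs connections, each
 * allocation padded to bytes_per_alloc bytes. */
uint64_t sm_fifo_footprint(uint64_t nprocs, uint64_t bytes_per_alloc)
{
    /* Keep both inputs below 2^20 so the product cannot overflow 64 bits. */
    assert(nprocs < (1ULL << 20) && bytes_per_alloc < (1ULL << 20));
    return 3 * nprocs * nprocs * bytes_per_alloc;
}
```

With 8K pages, sm_fifo_footprint(100, 8192) gives 245,760,000 bytes (the "240 Mbytes" above) and sm_fifo_footprint(512, 8192) gives 6,442,450,944 bytes (6 Gbyte). Passing a cacheline-sized bytes_per_alloc instead gives a lower bound on what cacheline alignment would cost, since the pointer array is larger than one cacheline.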
Okay, long intro. Let me start with a short question: do we really
need page alignment for these allocations? Would cacheline alignment be
okay?
(I imagine I'll have follow-up questions once the answers start to roll in.)