(I'm new to Open MPI.)

I'm looking at the sm BTL.

In mca_btl_sm_add_procs(), there's a loop over peer processes, with a call to ompi_fifo_init(). That is, one call to ompi_fifo_init() for each connection (sender/receiver pair).

In ompi_fifo_init(), there's an allocation of sizeof(ompi_cb_fifo_wrapper_t), and a call to ompi_cb_fifo_init(), which in turn has two allocations: one of a bunch of pointers and another of sizeof(ompi_cb_fifo_ctl_t).

In short, for each connection, there are three allocations:

*) sizeof(ompi_cb_fifo_wrapper_t)... about 64 bytes on LP64
*) a bunch of pointers... about 1 Kbyte on LP64
*) sizeof(ompi_cb_fifo_ctl_t)... about 12 bytes

Let me say this yet another way. For N local processes, there are N*(N-1) connections, each with these three allocations, two of which are 64 bytes or smaller.

BUT, in ompi_fifo_init() and ompi_cb_fifo_init(), we ask for page alignment of each allocation. On top of that, mca_mpool_sm_alloc() itself forces each allocation onto a page boundary.

As the number of local processes increases, these per-connection allocations therefore become very costly. With 8K pages and 100 on-node processes, for example, we're talking roughly 3*100*100*8K = 240 Mbytes. For 512 on-node processes (yes, we have nodes this big), that's 6 Gbyte... most of which is unused. (E.g., allocating an 8K page or more when we only need 64 or 12 bytes.)
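
To make the arithmetic above easy to replay, here is a rough sketch in C. It is not taken from the Open MPI source: the three sizes are just the approximate LP64 figures quoted earlier, the helper names (align_up, per_connection_bytes, total_bytes) are made up for illustration, and the 64-byte cacheline is an assumption. It uses the exact N*(N-1) connection count rather than the N*N shortcut.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round size up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t size, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Bytes consumed by the three per-connection allocations when each one
 * is padded out to a multiple of align. */
static uint64_t per_connection_bytes(uint64_t align)
{
    /* Approximate sizes from the post (wrapper, pointer array, ctl struct);
     * these are estimates, not values measured from the headers. */
    static const uint64_t sizes[] = { 64, 1024, 12 };
    uint64_t total = 0;

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        total += align_up(sizes[i], align);
    }
    return total;
}

/* Total across all N*(N-1) sender/receiver pairs on the node. */
static uint64_t total_bytes(uint64_t nprocs, uint64_t align)
{
    return nprocs * (nprocs - 1) * per_connection_bytes(align);
}
```

With these assumed sizes, total_bytes(100, 8192) comes to 243,302,400 bytes (about 240 Mbytes), while total_bytes(100, 64) is 11,404,800 bytes. For 512 processes the figures are 6,429,868,032 bytes with 8K alignment versus 301,400,064 bytes with 64-byte alignment.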

Okay, long intro. Let me start with a short question: do we really need page alignment for these allocations? Would cacheline alignment be okay?

(I imagine I'll have follow-up questions once the answers start to roll in.)
