Ralph Castain wrote:

I too am interested - I think we need to do something about the sm backing file situation as larger core machines are slated to become more prevalent shortly.

I think there is at least one piece of low-hanging fruit: get rid of a lot of the page alignments. Especially as one goes to large core counts, the O(n^2) number of local "connections" becomes important, and each connection starts with three page-aligned allocations, each of them tiny (and hence using only a small portion of the page+ that is allocated to it). So, most of the allocated memory is never used.

Personally, I question the rationale for the page alignment in the first place, but don't mind listening to anyone who wants to explain it to me. Presumably, in a NUMA machine, localizing FIFOs to separate physical memory improves performance. I get that basic premise. I just question the reasoning beyond that.

The page alignment appears in ompi_fifo_init and ompi_cb_fifo_init. It also comes from mca_mpool_sm_alloc. Four minor changes could switch the alignment from page size to cacheline size.
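To put rough numbers on the page-versus-cacheline point, here is a back-of-the-envelope sketch. It is not taken from the OMPI source: the function names, the assumption of one connection per ordered pair of local processes (n*n), and the 64-byte size of each tiny allocation are all hypothetical, chosen only to illustrate the scaling.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Bytes consumed by the per-connection allocations for nprocs local
 * processes.  Assumption: one connection per ordered pair (nprocs * nprocs);
 * the real count in the sm BTL may differ. */
size_t sm_conn_bytes(size_t nprocs, size_t allocs_per_conn,
                     size_t payload, size_t align)
{
    size_t connections = nprocs * nprocs;
    return connections * allocs_per_conn * align_up(payload, align);
}
```

Under those assumptions, with three allocations per connection and 4096-byte pages, 16 local processes need 3 MB for these allocations alone and 128 local processes need 192 MB. With 64-byte cacheline alignment the same cases need 48 KB and 3 MB.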

What happens when there isn't enough memory to support all this? Are we smart enough to detect this situation? Does the sm subsystem quietly shut down? Warn and shut down? Segfault?

I'm not exactly sure. I think it's a combination of three things:

*) some attempt to signal problems correctly
*) to some degree, just living with less shared memory (possibly leading to performance degradation)
*) poorly tested in any case


I have two examples so far:

1. Using a ramdisk, /tmp was set to 10MB. OMPI was run on a single node, 2ppn, with btl=openib,sm,self. The program started, but segfaulted on the first MPI_Send. No warnings were printed.

2. Again with a ramdisk, /tmp was reportedly set to 16MB (unverified - some uncertainty, could have been much larger). OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self. The program ran to completion without errors or warnings. I don't know the communication pattern - could be no local comm was performed, though that sounds doubtful.
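For cases like the first example, one way to detect the shortage up front would be to ask the filesystem how much space is left before creating the backing file. The sketch below uses the POSIX statvfs call. The function name and the idea of calling it before sm setup are my assumptions, not a description of what OMPI does today.

```c
#include <sys/statvfs.h>

/* Returns 1 if the filesystem holding `dir` has at least `needed` bytes
 * available to unprivileged users, 0 if it does not, and -1 if the
 * query itself fails (e.g. the directory does not exist).
 * Hypothetical helper, not part of the OMPI code base. */
int backing_dir_has_room(const char *dir, unsigned long long needed)
{
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        return -1;
    }

    /* f_bavail counts free blocks in units of f_frsize bytes. */
    unsigned long long avail =
        (unsigned long long)vfs.f_bavail * (unsigned long long)vfs.f_frsize;

    return avail >= needed ? 1 : 0;
}
```

A caller could print a warning and fall back to another BTL when this returns 0, rather than proceeding and faulting on first use. Note that a check like this is only advisory: the free space can change between the check and the moment the file's pages are actually touched.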

