George Bosilca wrote:

Then it looks like the safest solution is to use either ftruncate or the lseek method and then touch the first byte of all memory pages. Unfortunately, I see two problems with this. First, there is a clear performance hit on the startup time. And second, we will have to find a pretty smart way to do this or we will completely break the memory affinity stuff.

We're basically touching all the pages on start-up anyhow.

Let me explain.

The sm BTL needs to set up a shared/mmap file sized to accommodate both what's needed at MPI_Init time and whatever space you'll want for growth during the course of the run. We used to size this file "arbitrarily" (mpool_sm_per_peer_size and mpool_sm_[min|max]_size), which allocated too much shared memory for small jobs and too little for big jobs (they wouldn't start up). As part of moving to the single-queue model, I tried to size the shared memory more reasonably, at a minimum so that jobs would start up. The current formula estimates how much memory will be needed at MPI_Init time and sets the file to that size. We can argue about whether or not headroom should be included, but currently (1.3.2) none is really provided.

So, the shared area is basically filled up during MPI_Init(). For large np, most of that space is eager fragments. An eager fragment in the shared area includes a pointer back to the free list that manages that fragment. Those pointers have to be initialized. Since eager fragments by default are 4K, it turns out that basically every page is touched during MPI_Init(). (Fine print: not true of the max fragments, but there aren't very many of those.)
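
To make the page-touching argument concrete, here is a minimal sketch in C. It is not the sm BTL code: the frag_header layout, the init_fragments() name, and the temporary file under /tmp are all assumptions made up for illustration. It sizes a backing file with ftruncate(), maps it shared, writes a back-pointer into the header of every fragment, and counts how many distinct pages those writes land on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fragment header; NOT Open MPI's real layout. */
struct frag_header {
    void *free_list;   /* back-pointer to the free list that owns the fragment */
};

/*
 * Size a temporary backing file with ftruncate(), map it shared, and
 * initialize the header of each of the nfrags fragments of frag_size bytes.
 * Returns the number of distinct pages written, or -1 on error.
 * *total_pages receives the number of pages spanned by the mapping.
 */
long init_fragments(size_t nfrags, size_t frag_size, long *total_pages)
{
    assert(nfrags > 0 && frag_size >= sizeof(struct frag_header));

    long page = sysconf(_SC_PAGESIZE);
    size_t len = nfrags * frag_size;

    char path[] = "/tmp/sm_sketch_XXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file disappears once unmapped */

    if (ftruncate(fd, (off_t)len) != 0) {    /* sets the size, touches nothing */
        close(fd);
        return -1;
    }

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return -1;

    static int dummy_free_list;              /* stand-in for a real free list */
    long touched = 0;
    long last_page = -1;

    for (size_t i = 0; i < nfrags; i++) {
        size_t off = i * frag_size;
        struct frag_header *h = (struct frag_header *)(base + off);
        h->free_list = &dummy_free_list;     /* this write faults the page in */

        long p = (long)(off / (size_t)page); /* page holding this header */
        if (p != last_page) {
            touched++;
            last_page = p;
        }
    }

    *total_pages = (long)((len + (size_t)page - 1) / (size_t)page);
    munmap(base, len);
    return touched;
}
```

Calling init_fragments(1024, 4096, &total) should report that every page of the mapping was written, which is the situation with the default 4K eager fragments. Passing a fragment size larger than the page size (a stand-in for the max fragments) leaves the pages between headers untouched.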
