Could someone tell me how these settings are used in OMPI or give any
guidance on how they should or should not be used?
The background is that (on Linux? with glibc? with OMPI?) small
memory allocations are made on the heap, with brk() or sbrk() used
to move the high-water mark. So that a large, freed allocation does not
get stuck behind a small, active one and become unreturnable to the OS,
the memory allocator uses mmap() instead of brk()/sbrk() for large
allocations. There is some discussion on the internet about how mmap is
a costly way of allocating memory, but I'm concerned about something
else. With mmap,
you get page-aligned allocations back. This means that if you loop over
the elements of multiple large arrays (which is common in HPC), you can
generate a lot of cache conflicts, depending on the cache associativity.
There are multiple reasons one might want to modify the behavior of the
memory allocator, including high cost of mmap calls, wanting to register
memory for faster communications, and now this cache-conflict issue.
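To make the cache-conflict point concrete, here is a minimal sketch
(the array names and sizes are just for illustration, assuming glibc's
default behavior of serving large malloc() requests with mmap()):

    #include <stdlib.h>

    #define N (1 << 24)   /* ~16M doubles per array, far above the mmap threshold */

    int main(void)
    {
        /* Each of these is large enough to be served by its own mmap(),
           so all three arrays start at (essentially) the same offset
           within a page. */
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) return 1;

        /* STREAM-triad-style loop: a[i], b[i], and c[i] all map to the
           same cache sets, so with enough arrays like this (relative to
           the cache associativity) the loop conflicts in those sets. */
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];

        free(a); free(b); free(c);
        return 0;
    }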
The usual solution is
setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ -1
or the equivalent mallopt() calls.
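For reference, the mallopt() form would look roughly like this (glibc;
the calls have to be made before the large allocations happen):

    #include <malloc.h>

    int main(void)
    {
        /* glibc: never use mmap() to satisfy malloc(), and never trim
           (return) heap memory to the OS. */
        mallopt(M_MMAP_MAX, 0);
        mallopt(M_TRIM_THRESHOLD, -1);

        /* ... rest of the application ... */
        return 0;
    }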
This issue becomes an MPI issue for at least three reasons:
*) MPI may care about these settings due to memory registration and
pinning. (I invite you to explain to me what I mean. I'm talking over
my head here.)
*) (Related to the previous bullet), MPI performance comparisons may
reflect these effects. Specifically, in comparing performance of OMPI,
Intel MPI, Scali/Platform MPI, and MVAPICH2, some tests (such as HPCC
and SPECmpi) have shown large performance differences between the
various MPIs when, it seems, none were actually spending much time in
MPI. Rather, some MPI implementations were turning off large-malloc
mmaps and getting good performance (and sadly OMPI looked bad in
comparison).
*) These settings seem to be desirable for HPC codes, since such codes
don't do much allocation/deallocation and do tend to have loop nests that
wade through multiple large arrays at once. For best "out of the box"
performance, a software stack should turn these settings on for HPC.
Codes don't typically identify themselves as "HPC", but some indicators
include Fortran, OpenMP, and MPI.
I don't know the full scope of the problem, but I've run into this with
at least HPCC STREAM (which shouldn't depend on MPI at all, but OMPI
looks much slower than Scali/Platform on some tests) and SPECmpi
(primarily one or two codes, though it depends also on problem size).
Discussion is invited.