Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-09 Thread Jeff Squyres
I'm not sure I follow -- are you saying that Open MPI is disabling the large 
mmap allocations, and we shouldn't?

On Jan 8, 2010, at 9:25 AM, Sylvain Jeaugey wrote:

> On Thu, 7 Jan 2010, Eugene Loh wrote:
> 
> > Could someone tell me how these settings are used in OMPI or give any
> > guidance on how they should or should not be used?
> This is a very good question :-) As is this whole e-mail, though it's hard
> (in my opinion) to give it a Good (TM) answer.
> 
> > This means that if you loop over the elements of multiple large arrays
> > (which is common in HPC), you can generate a lot of cache conflicts,
> > depending on the cache associativity.
> On the other hand, high buffer alignment sometimes gives better
> performance (e.g. Infiniband QDR bandwidth).
> 
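(A minimal editorial sketch of the cache-conflict scenario above, assuming 4 KB
pages and mmap-backed large allocations; the array size and the triad loop are
illustrative, not taken from the thread:)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define N (1 << 20)   /* 8 MB per array: well above the 128K mmap threshold */

int main(void) {
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  if (!a || !b || !c) return 1;

  /* With mmap-backed malloc, all three typically print 0x10: identical
     offsets within the page, hence identical cache-set mappings. */
  printf("page offsets: %lx %lx %lx\n",
         (unsigned long)((uintptr_t)a & 0xFFF),
         (unsigned long)((uintptr_t)b & 0xFFF),
         (unsigned long)((uintptr_t)c & 0xFFF));

  /* STREAM-style triad: three identically aligned streams can thrash a
     low-associativity cache. */
  for (size_t i = 0; i < N; i++)
    a[i] = b[i] + 3.0 * c[i];

  free(a); free(b); free(c);
  return 0;
}
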
> > There are multiple reasons one might want to modify the behavior of the
> > memory allocator, including high cost of mmap calls, wanting to register
> > memory for faster communications, and now this cache-conflict issue.  The
> > usual solution is
> >
> > setenv MALLOC_MMAP_MAX_ 0
> > setenv MALLOC_TRIM_THRESHOLD_ -1
> >
> > or the equivalent mallopt() calls.
> But yes, this set of settings is the number one tweak on HPC code that I'm
> aware of.
> 
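(A minimal sketch of the mallopt() equivalent mentioned above, using glibc's
M_MMAP_MAX and M_TRIM_THRESHOLD constants; the calls must run before the
allocations they are meant to affect:)

#include <malloc.h>
#include <stdio.h>

int main(void) {
  /* Programmatic equivalent of MALLOC_MMAP_MAX_=0 and
     MALLOC_TRIM_THRESHOLD_=-1: never satisfy malloc() via mmap, and
     never trim (return) freed heap memory to the OS. */
  if (mallopt(M_MMAP_MAX, 0) == 0)
    fprintf(stderr, "mallopt(M_MMAP_MAX) failed\n");
  if (mallopt(M_TRIM_THRESHOLD, -1) == 0)
    fprintf(stderr, "mallopt(M_TRIM_THRESHOLD) failed\n");
  /* ... application allocations from here on stay on the heap ... */
  return 0;
}

The environment-variable form achieves the same without recompiling, e.g.
MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 ./app (the trailing underscores
are part of the variable names).
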
> > This issue becomes an MPI issue for at least three reasons:
> >
> > *)  MPI may care about these settings due to memory registration and 
> > pinning.
> > (I invite you to explain to me what I mean.  I'm talking over my head here.)
> Avoiding mmap is good since it avoids calls to munmap (a function we
> need to hack to prevent data corruption).
> 
> > *)  (Related to the previous bullet), MPI performance comparisons may 
> > reflect
> > these effects.  Specifically, in comparing performance of OMPI, Intel MPI,
> > Scali/Platform MPI, and MVAPICH2, some tests (such as HPCC and SPECmpi) have
> > shown large performance differences between the various MPIs when, it seems,
> > none were actually spending much time in MPI.  Rather, some MPI
> > implementations were turning off large-malloc mmaps and getting good
> > performance (and sadly OMPI looked bad in comparison).
> I don't think this bullet is related to the previous one. The first one is
> a good reason; this one is typically the Bad reason. Bad, but
> unfortunately true: competitors' MPI libraries are faster because ...
> they do much more than MPI (accelerating malloc being the main
> difference). Which I think is Bad, because all these settings should be
> left in the developer's hands. You'll always find an application where
> these settings will waste memory and prevent it from running.
> 
> > *)  These settings seem to be desirable for HPC codes since they don't do
> > much allocation/deallocation and they do tend to have loop nests that wade
> > through multiple large arrays at once.  For best "out of the box"
> > performance, a software stack should turn these settings on for HPC.  Codes
> > don't typically identify themselves as "HPC", but some indicators include
> > Fortran, OpenMP, and MPI.
> In practice, I agree. Most HPC codes benefit from it. But I also ran into
> codes where the memory waste was a problem.
> 
> > I don't know the full scope of the problem, but I've run into this with at
> > least HPCC STREAM (which shouldn't depend on MPI at all, but OMPI looks much
> > slower than Scali/Platform on some tests) and SPECmpi (primarily one or two
> > codes, though it depends also on problem size).
> I had those codes in mind as well. That's also why I don't like those MPI
> "benchmarks": they benchmark much more than MPI, and hence encourage MPI
> providers to incorporate into their libraries things that have (more or
> less) nothing to do with MPI.
> 
> But again, yes, from the (basic) user's point of view, library X seems
> faster than library Y. When there is nothing left to improve in MPI, start
> optimizing the rest ... maybe we should reimplement a faster libc inside
> MPI :-)
> 
> Sylvain


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-09 Thread Eugene Loh

Jeff Squyres wrote:


> I'm not sure I follow -- are you saying that Open MPI is disabling the
> large mmap allocations, and we shouldn't?

Basically the reverse.  The default (I think this applies on Linux, whether
with gcc, gfortran, Sun f90, etc.) is for malloc to use mmap for large
allocations.  We don't change this, but arguably we should.


Try this:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv) {
  size_t size, nextsize;
  char  *ptr, *nextptr;

  size = 1;
  ptr  = malloc(size);
  while ( size < 250000 ) {   /* run well past the 128K mmap threshold */
    nextsize = 1.1 * size + 1;
    nextptr  = malloc(nextsize);
    printf("%9zu %18zx %18lx %18lx\n", size, size,
           (unsigned long)(nextptr - ptr), (unsigned long)(uintptr_t)ptr);
    size = nextsize;
    ptr  = nextptr;
  }

  return 0;
}

Here is sample output:

  # bytes   # bytes (hex)   to next ptr (hex)   ptr (hex)

   58279   e3a7   e3b0 58f870
   64107   fa6b   fa80 59dc20
   70518  11376  11380 5ad6a0
   77570  12f02  12f10 5bea20
   85328  14d50  14d60 5d1930
   93861  16ea5  16eb0 5e6690
  103248  19350  19360 5fd540
  113573  1bba5  1bbb0 6168a0
  124931  1e803   2b3044655bc0 632450
  137425  218d1  22000   2b3044c88010
  151168  24e80  25000   2b3044caa010
  166285  2898d  29000   2b3044ccf010
  182914  2ca82  2d000   2b3044cf8010
  201206  311f6 294000   2b3044d25010
  221327  3608f  37000   2b3044fb9010
  243460  3b704  3c000   2b3044ff0010

So, for allocations below 128K, pointers are allocated at successively
higher addresses, each one just barely far enough to make room for the
allocation.  E.g., an allocation of 0xE3A7 pushes the "high-water mark"
up by 0xE3B0: the request plus a small header, rounded up to a 16-byte
multiple.


Beyond 128K, allocations are page aligned.  The pointers all end in
0x010.  That is, a whole number of pages is allocated and the returned
address is 16 bytes (0x10) into the first page.  The size of each
allocation is the requested amount, plus a few bytes of padding, rounded
up to the nearest whole multiple of the page size.
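(The rounding can be checked against the rows above; a sketch assuming 4 KB
pages and a 16-byte allocator header, using the 137425-byte row as a worked
example:)

#include <stdio.h>

int main(void) {
  long pagesize = 0x1000;   /* assuming 4 KB pages */
  long request  = 0x218d1;  /* 137425 bytes, from the table above */
  /* request + chunk header, rounded up to a whole number of pages */
  long mapped   = (request + 16 + pagesize - 1) & ~(pagesize - 1);
  printf("%lx\n", mapped);  /* prints 22000, matching "to next ptr" */
  return 0;
}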


The motivation to change, in my case, is performance.  I don't know how 
widespread this problem is, but...



On Jan 8, 2010, at 9:25 AM, Sylvain Jeaugey wrote:

> On Thu, 7 Jan 2010, Eugene Loh wrote:
>
>> setenv MALLOC_MMAP_MAX_ 0
>> setenv MALLOC_TRIM_THRESHOLD_ -1
>
> But yes, this set of settings is the number one tweak on HPC code that I'm
> aware of.


Wow!  I might vote for "compiling with -O", but let's not pick nits here.


Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-09 Thread Barrett, Brian W
We should absolutely not change this.  For simple applications, yes, things
work if large blocks are allocated on the heap.  However, ptmalloc (and most
allocators, really) can't rationally cope with repeated allocations and
deallocations of large blocks.  It would be *really bad* (as we've seen before)
to change the behavior of our version of ptmalloc from that which is provided
by Linux.  Pain and suffering is all that path has ever led to.
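
(A hypothetical sketch of the churn being described: with default settings
each iteration below costs an mmap/munmap pair plus page faults and zeroing
on first touch; 64 MB is chosen so that even glibc's sliding mmap threshold
keeps the allocation on the mmap path. Running under
strace -e trace=mmap,munmap makes the pattern visible.)

#include <stdlib.h>
#include <string.h>

#define SZ (64UL * 1024 * 1024)   /* 64 MB: stays above the mmap threshold */

int main(void) {
  for (int i = 0; i < 1000; i++) {
    char *p = malloc(SZ);
    if (!p) return 1;
    memset(p, 0, SZ);   /* touch every page: faults + zeroing each time */
    free(p);            /* munmap: the pages are gone, next malloc re-maps */
  }
  return 0;
}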

Just my $0.02, of course.

Brian


From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of 
Eugene Loh [eugene@sun.com]
Sent: Saturday, January 09, 2010 9:55 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

> Jeff Squyres wrote:
>
>> I'm not sure I follow -- are you saying that Open MPI is disabling the
>> large mmap allocations, and we shouldn't?
>
> Basically the reverse.  The default (I think this applies on Linux,
> whether with gcc, gfortran, Sun f90, etc.) is for malloc to use mmap for
> large allocations.  We don't change this, but arguably we should.
>
> [...]