Vlad has written an allocator that uses mmap to obtain memory from
the system and munmaps that memory on thread exit, if possible.
I have spent more than 3 weeks fiddling with it and discussing it
with Vlad, and this is what we both came up with:
http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz
I believe we have solved most of my needs. Below is an excerpt
from the README file for the curious.
Would anybody care to test it in his/her own environment?
If all goes well, I might TIP this for inclusion in the Tcl core
as a replacement for (or an addition to) the zippy allocator.
Although not entirely sure why, I have spent some time analyzing the
behavior and code of both of these allocators. Since I don't really
want to spend too much more, the following comments are not organized
in any particular order of importance or relevance...
a)
The test program Zoran includes biases the comparison by generating
sizes above Zippy's largest bucket, which fall through to the
"standard" (system) allocator; this never happens for VT. The
following patch "corrects" this behavior:
+++ memtest.c Sun Jan 14 16:43:23 2007
@@ -211,6 +211,7 @@
} else {
size &= 0x3FFF; /* Limit to 16K */
}
+ if (size>16000) size = 16000;
*toallocptr++ = size;
}
}
b)
The key difference between the Zippy and VT allocators arises from
their use of the shared "freed" memory pool. Zippy calls this the
shared cache, VT calls it the global cache. Zippy's goal appears to
have been to minimize memory usage (while its stated goal is to reduce
lock contention). Zippy does this by aggressively moving freed blocks
to the shared cache, allowing any thread to later allocate memory from
this shared pool. VT, meanwhile, targets speed at the expense of
bloat, returning freed blocks to the private per-thread pools. To
make this speed optimization possible, VT keeps within each "page" a
pointer to the cache that allocated it, something that could also be
done in Zippy if speed were the goal.
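To illustrate the idea with made-up names (I have not copied VT's
actual structures, so take this as a sketch only; p_cachePtr is the
one field name that does appear in the VT source):

typedef struct Cache Cache;            /* per-thread cache, opaque here */

typedef struct Page {
    Cache *p_cachePtr;                 /* thread cache that owns this page */
    /* ... blocks carved from the page follow in memory ... */
} Page;

typedef struct Block {
    Page *b_pagePtr;                   /* back-pointer to the enclosing page */
    struct Block *b_nextPtr;
} Block;

/* On free, a block finds its owning thread cache without consulting
 * any shared structure, so the common case needs no lock: */
static Cache *
OwningCache(Block *blockPtr)
{
    return blockPtr->b_pagePtr->p_cachePtr;
}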
c)
The key reason why Zippy substantially lags behind VT in performance
is actually that Zippy beats itself at its own game. While its stated
goal is to minimize lock contention, the hardcoded constants used in
Zippy completely sacrifice that goal in favor of storage. Thread-local
pools can satisfy allocations immediately, while the shared pool must
be locked by a mutex whenever an allocation is served from it. The
current Zippy configuration minimizes the amount of storage "wasted"
in per-thread pools by aggressively moving larger blocks to the shared
cache. The more threads attempt to allocate/free large blocks, the
worse the contention and the lower the performance.
Zoran's test program produces allocation sizes that are uniformly
random, so large blocks are just as likely as small ones, and
performance therefore suffers substantially. A more accurate benchmark
would take common usage patterns from Tcl/NaviServer, which I suspect
are heavily biased toward allocation of small objects.
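For what it's worth, here is a sketch of a size generator biased
toward small blocks; the exact distribution is pure guesswork on my
part, and measured traces would obviously be better:

#include <stdlib.h>

static size_t
BiasedSize(void)
{
    int class = 0;

    /* Each coin flip doubles the size class, so roughly half of all
     * allocations stay in the smallest class and large blocks become
     * geometrically rare. */
    while (class < 10 && (rand() & 1)) {
        ++class;
    }
    return 1 + (size_t) (rand() % (16 << class));
}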
The following patch makes Zippy a lot less aggressive about putting
blocks into the shared pool, bringing Zippy's performance much closer
to VT's at the expense of substantially higher memory "waste" (the
second and third columns being, as I read the code, the per-thread
free-block limit and the number of blocks moved to/from the shared
pool at a time):
@@ -128,12 +174,12 @@
{ 64, 256, 128, NULL},
{ 128, 128, 64, NULL},
{ 256, 64, 32, NULL},
- { 512, 32, 16, NULL},
- { 1024, 16, 8, NULL},
- { 2048, 8, 4, NULL},
- { 4096, 4, 2, NULL},
- { 8192, 2, 1, NULL},
- {16284, 1, 1, NULL},
+ { 512, 64, 32, NULL},
+ { 1024, 64, 32, NULL},
+ { 2048, 64, 32, NULL},
+ { 4096, 64, 32, NULL},
+ { 8192, 64, 32, NULL},
+ {16284, 64, 32, NULL},
d)
VT uses mmap by default to allocate memory; Zippy uses the system
malloc. By doing this, VT actually penalizes itself in an environment
where lots of small blocks are frequently allocated and threads are
often created/destroyed. VT releases the memory held in a thread's
local pool when the thread terminates. Since it uses mmap by default,
this means that de-allocated storage is actually released to the
operating system, forcing new threads to call mmap() again to get
memory and thereby incurring system call overhead. That overhead
could be avoided in some cases by using the system malloc, whose
implementations typically do not lower the sbrk point at each
deallocation. Using malloc() in the VT allocator should give it much
more uniform and consistent performance. Using mmap() in Zippy has
less performance impact, since memory is never released by Zippy (at
thread termination it is just placed back into the shared pool).
Another obvious downside of using mmap() for Zippy is that realloc()
must always fall back to the slow allocate/copy/free mechanism and can
never be optimized.
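For reference, the slow path in question amounts to the following
(sketched with an explicit oldSize parameter, since a bare pointer
does not carry its size):

#include <stdlib.h>
#include <string.h>

static void *
SlowRealloc(void *ptr, size_t oldSize, size_t newSize)
{
    void *newPtr = malloc(newSize);             /* allocate ... */

    if (newPtr != NULL && ptr != NULL) {
        /* ... copy the smaller of the two extents ... */
        memcpy(newPtr, ptr, oldSize < newSize ? oldSize : newSize);
        free(ptr);                              /* ... and free */
    }
    return newPtr;
}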
e)
Both allocators use an O(n) loop to compute the power-of-two "bucket"
for the requested size. This is just plain silly, since an O(log n)
(or effectively O(1)) computation would offer a non-negligible speedup
in both allocators. This is the current O(n) code:
while (bucket < NBUCKETS && globalCache.sizes[bucket].blocksize < size) {
++bucket;
}
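Assuming bucket b holds blocks of size MINALLOC << b (power-of-two
sizing, as both allocators use), the lookup reduces to finding the
highest set bit. A sketch, with MINALLOC standing in for the
allocator's smallest block size and size >= 1 assumed:

#include <stddef.h>

#define MINALLOC 16

static int
SizeToBucket(size_t size)
{
    size_t n = (size - 1) / MINALLOC;   /* MINALLOC units, rounded up, minus one */
    int bucket = 0;

    while (n != 0) {                    /* highest set bit: at most log2(n) steps */
        n >>= 1;
        ++bucket;
    }
    return bucket;                      /* callers still clamp to NBUCKETS */
}

With gcc the loop can even collapse to a single __builtin_clzl() call,
making the whole computation O(1).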
f)
Zippy uses Ptr2Block and Block2Ptr functions, whereas VT uses macros
for this. Zippy also checks MAGIC numbers on each allocation, whereas
VT performs these checks only on de-allocation. I am not sure whether
current compilers are smart enough to inline the functions in Zippy; I
did not test this. When compiled with gcc at -O0, changing Zippy to
use macros instead of function calls offers a non-trivial and
statistically significant speedup.
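To make the difference concrete, here is a hypothetical illustration;
the block layout and magic value are made up and are not Zippy's
actual ones:

typedef struct Block {
    int b_magic;
    struct Block *b_nextPtr;
} Block;

#define BLOCK_MAGIC 0xbe0b

/* Function form, as in Zippy: a call at -O0 unless the compiler
 * decides to inline it. */
static char *
Block2Ptr(Block *blockPtr)
{
    blockPtr->b_magic = BLOCK_MAGIC;
    return (char *) (blockPtr + 1);
}

/* Macro form, as in VT: always expanded in place, even at -O0. */
#define BLOCK2PTR(blockPtr) \
    ((blockPtr)->b_magic = BLOCK_MAGIC, (char *) ((blockPtr) + 1))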
g)
VT does some things that are too complicated for me to follow in
order to achieve its high-speed, no-lock-needed allocation. In
particular, the following code worries me:
if (pagePtr->p_cachePtr == cachePtr /* Dirty read! */) {
blockPtr->nextPtr = cachePtr->blocks[bucket].head;
cachePtr->blocks[bucket].head = blockPtr;
cachePtr->blocks[bucket].count++;
} else {
although to its credit, I have not figured out any case where it would break.
h)
One of the key drivers behind VT over Zippy is the "memory leak
problem" requiring a server restart after a day of operation. Although
this has been attributed to Zippy "never freeing" memory, it does not
look like that is related to the problem. Furthermore, Zoran's test in
this respect (looking at the data footprint before program exit but
after thread join) is inherently flawed. Of course Zippy will show
higher usage, since VT explicitly deallocates memory when a thread
exits while Zippy retains it in the shared pool. But this is not a
leak - Zippy should not grow in size as a result of this.
What does appear to be responsible for Zippy's growth is a design
bug. When allocating a new block, Zippy will first look in the bucket
responsible for the given size, and then look in larger buckets,
seeking a block that can be split up and placed into the smaller
buckets. Zippy, however, lacks any ability to coalesce smaller blocks
in order to promote them to a larger bucket. This one-way traffic
most likely leads to more and more allocation of large blocks when
they are needed, with a slow and gradual splitting of blocks from the
larger buckets whenever the smaller buckets run dry, and I believe it
is the most likely cause of the "needs daily restart" phenomenon. Of
course this is nothing but a theory, as I have nothing to back it up
other than reading the code, but it should be reasonably easy for
someone who experiences the daily restart problem to check by applying
the following simple patch:
@@ -910,6 +960,7 @@
blockPtr = NULL;
n = NBUCKETS;
size = 0; /* lint */
+ #if 0
while (--n > bucket) {
if (cachePtr->buckets[n].nfree > 0) {
size = binfo[n].blocksize;
@@ -919,6 +970,7 @@
break;
}
}
+ #endif
/*
* Otherwise, allocate a big new block directly.
i)
After having reviewed the code for both of the allocators, I must say
that I like VT better. Its design and implementation are simpler, it
translates to running fewer instructions per allocation, and it
actually achieves Zippy's goal of minimizing lock contention.
-----
As I've stated way up top, I've already spent a lot more time on this
than I should have, and thus I cannot even go back up to proof-read
this long mail or check it for completeness of my findings. I do
notice that I "blame" Zoran in a bunch of the above paragraphs. This
is purely accidental; I mean no offense, and the statements are meant
simply to provide an actor for describing the benchmark/methodology,
not to find a scapegoat. Zoran - if you feel offended in any way, I
apologize; it was not at all intended.
I hope the above is at least in some ways helpful.