> a)
> The test program Zoran includes biases Zippy toward the "standard"
> allocator, which it does not do for VT.  The following patch
> "corrects" this behavior:
>
> +++ memtest.c   Sun Jan 14 16:43:23 2007
> @@ -211,6 +211,7 @@
>              } else {
>                  size &= 0x3FFF; /* Limit to 16K */
>              }
> +            if (size > 16000) size = 16000;
>              *toallocptr++ = size;
>          }
>      }
>


First of all, I wanted to give Zippy a fair chance. If I increase
the max allocation size, Zippy becomes even slower than it already is.
And Zippy handles 16K pages, whereas we handle 32K pages.
Hence the

     size &= 0x3FFF; /* Limit to 16K */

which limits the allocation size to 16K max. Increasing it would
hurt Zippy even more than it would hurt us.

Zoran, I believe you misunderstood.  The "patch" above limits blocks
allocated by your tester to 16000 bytes instead of 16384.  The reason
for this is that Zippy's largest bucket is configured to be
16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo).
By making uniformly random request sizes up to 16384, you are
causing Zippy to fall back to system malloc for a small fraction of
requests, substantially penalizing its performance in those cases.
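
(A rough worked example: if sizeof(Block) were 16 bytes, the largest
request Zippy could serve from its buckets would be 16284 - 16 = 16268
bytes, so uniformly random sizes in [0, 16383] would push roughly
(16383 - 16268) / 16384, i.e. about 0.7% of requests, through system
malloc.)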

> The following patch allows Zippy to be a lot less aggressive in
> putting blocks into the shared pool, bringing the performance of Zippy
> much closer to VT, at the expense of substantially higher memory
> "waste":
>
> @@ -128,12 +174,12 @@
>      {   64,  256, 128, NULL},
>      {  128,  128,  64, NULL},
>      {  256,   64,  32, NULL},
> -    {  512,   32,  16, NULL},
> -    { 1024,   16,   8, NULL},
> -    { 2048,    8,   4, NULL},
> -    { 4096,    4,   2, NULL},
> -    { 8192,    2,   1, NULL},
> -    {16284,    1,   1, NULL},
> +    {  512,   64,  32, NULL},
> +    { 1024,   64,  32, NULL},
> +    { 2048,   64,  32, NULL},
> +    { 4096,   64,  32, NULL},
> +    { 8192,   64,  32, NULL},
> +    {16284,   64,  32, NULL},
>

I cannot comment on that. Possibly you are right, but I do not
see much benefit in it beyond speeding up Zippy to be on par
with VT, whereas the most important VT feature is not speed,
it is memory handling.

You wanted to know why Zippy is slower on your test; this is the
reason.  It has a substantial impact on FreeBSD and Linux, and my
guess is that it will have a dramatic effect on Mac OS X.
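
For readers not familiar with the table being patched, my reading of
its columns (based on the corresponding structure in Tcl's threaded
allocator, from which Zippy descends; Zippy's own field names may
differ slightly) is:

    struct binfo {
        size_t  blocksize;  /* size of the blocks held in this bucket */
        int     maxblocks;  /* free blocks a thread cache may hold before
                             * the surplus is moved to the shared pool */
        int     nmove;      /* number of blocks moved to/from the shared
                             * pool at a time */
        void   *lockPtr;    /* lock protecting the shared bucket
                             * (simplified; really a mutex pointer) */
    };

So raising maxblocks/nmove for the large buckets, as in the patch
above, lets each thread hoard more large blocks locally, trading
memory for fewer trips through the locked shared pool.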

> VT releases the memory held in a thread's
> local pool when a thread terminates.  Since it uses mmap by default,
> this means that de-allocated storage is actually released to the
> operating system, forcing new threads to call mmap() again to get
> memory, thereby incurring system call overhead that could be avoided
> in some cases if the system malloc implementation did not lower the
> sbrk point at each deallocation. Using malloc() in VT allocator
> should give it much more uniform and consisent performance.

Not necessarily.  We'd shoot ourselves in the foot by doing so,
because most OS allocators never return memory to the system, and
one of our major benefits would be gone.
What we could do: timestamp each page, return all pages to the
global cache, and prune the older ones. Or, put a size constraint on
the global cache. But then you'd have yet another knob to adjust,
and the difficulty would be finding the right setup. VT is simpler
in that it does not offer you ANY knobs you can tune
(for better or for worse). In some early stages of the design
we had a number of knobs and were not certain how to adjust them,
so we threw that away and redesigned all parts to be "self-adjusting"
where possible.
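
To make that idea concrete, a rough sketch of the timestamp-and-prune
variant (this is design speculation with made-up names, not code from
VT):

    #include <time.h>

    typedef struct Page {
        struct Page *nextPtr;
        time_t       freedAt;   /* when the page entered the global cache */
    } Page;

    extern void ReleaseToSystem(Page *pagePtr);  /* munmap or similar */

    static Page *globalPages;   /* global cache of free pages; assume the
                                 * caller holds the cache lock */

    /* Release every cached page that has been idle longer than maxAge. */
    static void
    PruneGlobalCache(time_t maxAge)
    {
        time_t now = time(NULL);
        Page **linkPtr = &globalPages;

        while (*linkPtr != NULL) {
            Page *pagePtr = *linkPtr;
            if (now - pagePtr->freedAt > maxAge) {
                *linkPtr = pagePtr->nextPtr;   /* unlink the stale page... */
                ReleaseToSystem(pagePtr);      /* ...and give it back to the OS */
            } else {
                linkPtr = &pagePtr->nextPtr;
            }
        }
    }

The alternative (a size cap on the global cache) would replace the age
test with a simple count, but either way you end up with the extra knob
mentioned above.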

The benefit of mmap() is being able to release memory back to the
system "for sure".  The drawback is that it always incurs substantial
syscall overhead compared to malloc.  You decide which you prefer (I
think I would lean slightly toward mmap() for long-lived applications,
but not by much, since the syscall introduces a lot of variance and an
average performance degradation).
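
To illustrate the trade-off (a minimal sketch, not VT's actual code;
MAP_ANON is assumed to be available, as it is on the BSDs, Linux and
Mac OS X):

    #include <sys/mman.h>
    #include <stdlib.h>

    static void *
    ChunkAlloc(size_t size, int useMmap)
    {
        if (useMmap) {
            void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANON, -1, 0);
            return (p == MAP_FAILED) ? NULL : p;
        }
        return malloc(size);
    }

    static void
    ChunkFree(void *ptr, size_t size, int useMmap)
    {
        if (useMmap) {
            munmap(ptr, size);   /* pages go back to the OS immediately */
        } else {
            free(ptr);           /* memory may stay in the process heap */
        }
    }

Every ChunkAlloc/ChunkFree pair in the mmap case costs two syscalls;
in the malloc case it usually costs none, but the process footprint
may never shrink.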

> e)
> Both allocators use an O(n) algorithm to compute the power-of-two
> "bucket" for the allocated size.  This is just plain silly, since an
> O(log n) algorithm will offer a non-negligible speed-up in both
> allocators.  This is the current O(n) code:
>
>     while (bucket < NBUCKETS
>            && globalCache.sizes[bucket].blocksize < size) {
>         ++bucket;
>     }

How about adding this into the code?

I think the most obvious replacement is just an "if" tree:

    if (size > 0xff) bucket += 8, size >>= 8;
    if (size > 0xf)  bucket += 4, size >>= 4;
    ...

It takes a minute to get the math right, but the performance gain
should be substantial.
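
Fully worked out, it could look something like this (a sketch only;
SizeToBucket and minsize are made-up names, and the real code would
use NBUCKETS and the actual smallest bucket size):

    #include <stddef.h>

    /* Map a request size to the smallest bucket whose blocks are at
     * least that large, assuming bucket k holds blocks of size
     * (minsize << k).  Runs in a fixed number of comparisons. */
    static int
    SizeToBucket(size_t size, size_t minsize)
    {
        size_t n = (size - 1) / minsize;
        int bucket = 0;

        if (n == 0) {
            return 0;                      /* fits the smallest bucket */
        }
        if (n > 0xffff) { bucket += 16; n >>= 16; }
        if (n > 0xff)   { bucket += 8;  n >>= 8;  }
        if (n > 0xf)    { bucket += 4;  n >>= 4;  }
        if (n > 0x3)    { bucket += 2;  n >>= 2;  }
        if (n > 0x1)    { bucket += 1;            }
        return bucket + 1;                 /* floor(log2(n)) + 1 */
    }

For example, with minsize = 16, a request of 17 bytes lands in bucket 1
(32-byte blocks) and a request of 16384 bytes lands in bucket 10.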

> f)
> Zippy uses Ptr2Block and Block2Ptr functions where as VT uses macros
> for this.  Zippy also does more checks on MAGIC numbers on each
> allocation which VT only performs on de-allocation.  I am not sure if
> current compilers are smart enough to inline the functions in Zippy, I
> did not test this.  When compiled with -O0 with gcc, changing Zippy to
> use macros instead of function calls offers non-trivial and
> statistically significant speedup.

I believe you must add "inline" to the function definitions for the
compiler to do so. Anyway, you do gain some percentage of speed by
inlining or by using macros, that is correct, but I would not expect
it to be a breakthrough. Bottom line: yes, everywhere that is called
repeatedly one should resort to inlining or macros, if possible.

In my tests, given how frequently these functions are called, they
contribute a 10% to 15% performance overhead.
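
For illustration, the function-versus-macro difference looks roughly
like this (the Block layout is simplified and hypothetical, not Zippy's
actual structure, which also carries the magic-number checks mentioned
above):

    #include <stddef.h>

    typedef struct Block {
        unsigned magic;      /* integrity check word */
        size_t   reqSize;    /* size requested by the caller */
    } Block;

    #define BLOCK_MAGIC 0xBEEFCAFEu

    /* Function form: whether the call overhead disappears depends on
     * the compiler, the flags, and the presence of "inline". */
    static void *
    Block2Ptr(Block *blockPtr, size_t reqSize)
    {
        blockPtr->magic = BLOCK_MAGIC;
        blockPtr->reqSize = reqSize;
        return (void *) (blockPtr + 1);
    }

    /* Macro form: always expanded in place, even at -O0, at the cost
     * of readability and of multiple evaluation of the arguments. */
    #define BLOCK2PTR(blockPtr, reqSize) \
        ((blockPtr)->magic = BLOCK_MAGIC, \
         (blockPtr)->reqSize = (reqSize), \
         (void *) ((blockPtr) + 1))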

> g)
> VT does some things that are too complicated for me to follow to
> achieve its high-speed, no-lock allocation.  In particular, the
> following code worries me:
>     if (pagePtr->p_cachePtr == cachePtr /* Dirty read! */) {
>         blockPtr->nextPtr = cachePtr->blocks[bucket].head;
>         cachePtr->blocks[bucket].head = blockPtr;
>         cachePtr->blocks[bucket].count++;
>     } else {
> although to its credit, I have not figured out any case where it
> would break.

Ha! It is pretty simple: you can atomically check pointer equivalence
without risking a core dump (at least that is my experience). You are
not expected to make far-reaching decisions based on it, though.
In this particular example, even if the test gave a stale answer, no
"harm" would be done; only a suboptimal path would be selected.
I marked that place "Dirty read!" to draw people's attention to it.
And I obviously succeeded :-)
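
In other words, the pattern is roughly this (a generic sketch with
made-up types, not VT's actual structures):

    typedef struct Block { struct Block *nextPtr; } Block;
    typedef struct Cache { Block *head; int count; } Cache;
    typedef struct Page  { struct Cache *ownerPtr; } Page;

    extern void ReturnToSharedPoolLocked(Block *blockPtr);

    static void
    FreeBlock(Cache *myCachePtr, Page *pagePtr, Block *blockPtr)
    {
        if (pagePtr->ownerPtr == myCachePtr) {      /* dirty read */
            /* Fast path: push onto this thread's local free list,
             * no lock taken. */
            blockPtr->nextPtr = myCachePtr->head;
            myCachePtr->head = blockPtr;
            myCachePtr->count++;
        } else {
            /* Slow path: take the global lock and hand the block
             * back to the shared pool. */
            ReturnToSharedPoolLocked(blockPtr);
        }
    }

The pointer that is read without a lock is only compared, never
dereferenced, which is why (per the above) a stale answer is claimed
to cost performance rather than correctness.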

The dirty read I have no problem with.  It's the possibility of
taking the head element, which could have been placed there by another
thread, that bothers me.

> Although
> this has been attributed to Zippy "never freeing" memory, it does not
> look like this is related to the problem.  Furthermore, Zoran's test
> in this respect (looking at data footprint before program exit but
> after thread join) is inherently flawed.  Of course Zippy will show
> higher usage since VT explicitly deallocates memory when a thread
> exits and Zippy retains it in the shared pool.  But this is not a
> leak - Zippy should not grow in size as a result of this.

Consider this scenario: you start a process, then run a task which
requires 50 threads to execute. After that you never again need
that many threads for the lifetime of the process. Zippy will
allocate and *hold* that memory the whole time, which is a complete
waste. VT will allocate and then release all the memory. It more
naturally follows the "landscape" of the process...

Just *one* usage pattern is better served by Zippy: you always have
a constant number of threads in the process. VT would incur more
locking without giving you any additional benefit. So, only in such
scenarios is Zippy actually better. In my experience I have not
encountered such scenarios in long-lived application servers; I can
imagine that some specialized applications may follow that pattern.

In a commercial server environment you typically want the process to
allocate the memory it will need and hold onto it; this avoids
"surprises".  In smaller-scale servers that are much greater in
number, however, this behavior may be sub-optimal, because the server
performs different tasks at different times and one might be counting
on those tasks being performed at disjoint times (e.g. running FOP to
produce PDF reports during off-peak hours on the same machine that
serves as the web server/back-end and consumes most of the machine's
memory during peak hours).  One argument for handling the latter case
is a large amount of swap, but releasing the memory when it is not in
use is certainly a more "polite" approach.

> What does appear to be responsible for Zippy's growth is a design bug.
>  When allocating a new block, Zippy will first look in the bucket
> responsible for the given size, and then look in larger buckets,
> seeking a block that can be split up and placed into the smaller
> buckets.  This is a design flaw which I believe is the most likely
> cause of the "needs daily restart" phenomenon.  Zippy lacks any
> ability to coalesce smaller blocks in order to promote them to a
> larger bucket.  This most likely leads to more and more allocation of
> large blocks when they are needed, and slow and gradual movement and
> splitting of blocks from larger buckets when smaller buckets do not
> have any available entries.  Of course this is nothing but a theory as
> I have nothing to back it up other than reading the code, but it
> should be reasonably easy for someone who experiences the daily
> restart problem to check by applying the following simple patch:
> @@ -910,6 +960,7 @@
>         blockPtr = NULL;
>         n = NBUCKETS;
>         size = 0; /* lint */
> +       #if 0
>         while (--n > bucket) {
>             if (cachePtr->buckets[n].nfree > 0) {
>                 size = binfo[n].blocksize;
> @@ -919,6 +970,7 @@
>                 break;
>             }
>         }
> +       #endif
>
>         /*
>          * Otherwise, allocate a big new block directly.

Interesting...

It sounds like you are in the best position to test this change to see
if it fixes the "unbounded" growth problem.
