On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:

On 15.01.2007 at 22:37, Zoran Vasiljevic wrote:

>
> On 15.01.2007 at 22:22, Mike wrote:
>
>>
>> Zoran, I believe you misunderstood.  The "patch" above limits the
>> blocks allocated by your tester to 16000 bytes instead of 16384.
>> The reason for this is that Zippy's "largest bucket" is configured
>> to be 16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a
>> typo).  By making uniformly random request sizes up to 16_3_84, you
>> are causing Zippy to fall back to system malloc for a small fraction
>> of requests, substantially penalizing its performance in these cases.
>
> Ah! That's right. I will fix that.
>
>>
>> You wanted to know why Zippy is slower on your test; this is the
>> reason.  This has a substantial impact on FreeBSD and Linux, and my
>> guess is that it will have a dramatic effect on Mac OS X.
>
> I will check that tomorrow on my machines.

YES. That did the trick. We have now demystified the behaviour
on the Mac. Indeed, when I limit the max alloc size to below *16284*
bytes, Zippy runs almost as fast as VT alloc. So I had overlooked
the fact that the limit is 16284 and not 16K (16384)!
I wanted to give Zippy a fair chance, but I missed by about
100 bytes, which made a huge difference. Still, it shows again one
of the weaknesses of Zippy: its dependence on a (potentially
suboptimal) system memory allocator. But that is not Zippy's fault;
the blame lies with a weak system malloc, as on the Mac. I guess the
same could have happened to us with a slow mmap()/munmap()...
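To put a rough number on "a small fraction of requests", here is a minimal sketch in C. The 16284 figure comes from the thread; BLOCK_HEADER is only an assumed stand-in for sizeof(Block), whose real value depends on the allocator's Block struct and the platform, and both function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* From the thread: the largest bucket is 16284 - sizeof(Block) bytes. */
#define MAX_BUCKET_RAW 16284u
/* ASSUMPTION: placeholder for sizeof(Block); the real value may differ. */
#define BLOCK_HEADER   16u

/* Largest request size that is still served from a bucket. */
static size_t max_bucket_request(void)
{
    return MAX_BUCKET_RAW - BLOCK_HEADER;
}

/* Number of distinct request sizes in 1..max_request that are too big
 * for the largest bucket and would therefore go to the system malloc. */
static size_t fallback_count(size_t max_request)
{
    size_t limit = max_bucket_request();

    assert(max_request > 0);
    return max_request > limit ? max_request - limit : 0;
}
```

With these assumed numbers, fallback_count(16384) covers 116 of the 16384 possible sizes, so about 0.7% of uniformly random requests would miss the buckets, while a tester capped at 16000 bytes never leaves them.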

>
>>>
>>> How about adding this into the code?
>>
>> I think the most obvious replacement is just using an if "tree":
>> if (size>0xff) bucket+=8, size&=0xff;
>> if (size>0xf) bucket+=4, size&=0xf;
>> ...
>> it takes a minute to get the math right, but the performance gain
>> should be substantial.
>
> Well, I can test that all right. I have the feeling that a tight
> loop like that (it will mostly spin 5-12 times) compiles well to
> machine code, but it is better to test.
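For reference, one way the if "tree" idea could be worked out with the math filled in is sketched below. This is not the code that was posted or committed. It assumes buckets of 16, 32, 64, ... bytes (bucket i holds 16 << i bytes), uses shifts rather than masks, and the function name and the 65536 upper bound are choices made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the smallest bucket that can hold `size` bytes,
 * ASSUMING bucket i holds (16 << i) bytes. The result is the number of
 * significant bits in (size - 1) >> 4, found with a fixed chain of
 * comparisons instead of a loop. */
static int bucket_if_tree(size_t size)
{
    size_t s;
    int bucket = 0;

    assert(size > 0 && size <= 65536);  /* keeps s below 4096 (12 bits) */

    s = (size - 1) >> 4;                /* 0 for sizes 1..16 */
    if (s > 0xFF) { bucket += 8; s >>= 8; }
    if (s > 0x0F) { bucket += 4; s >>= 4; }
    if (s > 0x03) { bucket += 2; s >>= 2; }
    if (s > 0x01) { bucket += 1; s >>= 1; }
    if (s > 0x00) { bucket += 1; }
    return bucket;
}
```

Each comparison halves the remaining range of bit positions, so a lookup costs five branches regardless of the request size.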

All right. Gustaf came up with this, and it saves about 10% of the time:

#if 0
      while (bucket<NBUCKETS && globalCache.sizes[bucket].blocksize<size) {
          ++bucket;
      }
#else
     s = (size-1) >> 4;
     while (s > 0xFF) {
         s = s >> 5;
         bucket += 5;
     }
     while (s > 0x0F) {
         s = s >> 4;
         bucket += 4;
     }
     while (s > 0x08) {
         s = s >> 3;
         bucket += 3;
     }
     while (s > 0x04) {
         s = s >> 2;
         bucket += 2;
     }
     while (s > 0x00) {
         s = s >> 1;
         bucket++;
     }
#endif

I will leave the original loop in the code under an #ifdef,
because from the shift-based version alone it is hard to understand
what is really happening. But it works, and it works fine.
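Since the shift-based version is hard to read, a small check that it picks the same bucket as the plain loop may be useful. The sketch below assumes 11 buckets of 16, 32, ..., 16384 bytes. That table, the blocksize() helper, and the two function names are assumptions standing in for globalCache.sizes[], not the real allocator data.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 11  /* ASSUMPTION: buckets of 16, 32, ..., 16384 bytes */

/* Hypothetical stand-in for globalCache.sizes[bucket].blocksize. */
static size_t blocksize(int bucket)
{
    return (size_t)16 << bucket;
}

/* The original linear search: first bucket whose block size fits. */
static int bucket_linear(size_t size)
{
    int bucket = 0;

    while (bucket < NBUCKETS && blocksize(bucket) < size) {
        ++bucket;
    }
    return bucket;
}

/* The shift-based variant from the message, wrapped in a function.
 * It counts the significant bits of (size - 1) >> 4 in large steps. */
static int bucket_shift(size_t size)
{
    size_t s;
    int bucket = 0;

    assert(size > 0);
    s = (size - 1) >> 4;
    while (s > 0xFF) { s >>= 5; bucket += 5; }
    while (s > 0x0F) { s >>= 4; bucket += 4; }
    while (s > 0x08) { s >>= 3; bucket += 3; }
    while (s > 0x04) { s >>= 2; bucket += 2; }
    while (s > 0x00) { s >>= 1; bucket++; }
    return bucket;
}
```

Under the assumed power-of-two table, the two functions agree for every size from 1 to 16384. Above that, the linear loop stops at NBUCKETS while the shift version keeps counting, so a caller would still need to treat any result of NBUCKETS or more as "too large for a bucket".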

Cheers
Zoran


Can you import this into CVS?  Top level.
