Zoran Vasiljevic wrote:

On 16.01.2007 at 10:46, Gustaf Neumann wrote:

This is most probably the best variant so far, and not complicated, so that an
optimizer can do "the right thing" easily. Sorry for the many versions..
-gustaf


   { register unsigned int s = (size-1) >> 3;
     while (s>1) { s >>= 1; bucket++; }
   }

   if (bucket > NBUCKETS) {
     bucket = NBUCKETS;
   }
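To make the fragment above easier to try out, here is a minimal sketch that wraps it in a function. The function name `size_to_bucket`, the start value `bucket = 0`, and the value chosen for `NBUCKETS` are assumptions for illustration; none of them is given in this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bucket limit; the real NBUCKETS value is not shown in this mail. */
#define NBUCKETS 16

/*
 * Map a request size in bytes to a bucket index with the simple shift loop.
 * Assumes size >= 1 (size - 1 would wrap around for 0) and that bucket
 * starts at 0, which the quoted fragment does not show.
 */
static int size_to_bucket(size_t size)
{
    int bucket = 0;
    size_t s = (size - 1) >> 3;

    /* Every halving of s moves one bucket up. */
    while (s > 1) {
        s >>= 1;
        bucket++;
    }
    if (bucket > NBUCKETS) {
        bucket = NBUCKETS;
    }
    return bucket;
}

/* Spot checks around the first bucket boundaries. */
static void check_size_to_bucket(void)
{
    assert(size_to_bucket(1) == 0);
    assert(size_to_bucket(16) == 0);
    assert(size_to_bucket(17) == 1);
    assert(size_to_bucket(32) == 1);
    assert(size_to_bucket(33) == 2);
}
```

With these assumptions, sizes 1 to 16 land in bucket 0, 17 to 32 in bucket 1, 33 to 64 in bucket 2, and very large sizes are clamped to `NBUCKETS`.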

You'd be surprised that this one
i am. that's the story of the unrolled loops.

Btw, the version you have listed as the fastest
has wrong boundary tests (but still gives the same
result).

Below is the corrected version, which needs at most
2 shift operations for sizes up to one million.

The nice thing about this code (due to the staggered whiles)
is that any of the while loops (except the last)
can be removed and the code still works correctly
(but needs more shift operations). That's the
reason why yesterday's version actually works.

If all cases are used, all but the first loop are mostly
executed once and could be changed into ifs... I will send
you a separate mail on such a variant, but I am currently
running out of battery.

   while (s >= 0x1000) {
     s >>= 12;
     bucket += 12;
   }
   while (s >= 0x0800) {
     s >>= 11;
     bucket += 11;
   }
   while (s >= 0x0400) {
     s >>= 10;
     bucket += 10;
   }
   while (s >= 0x0200) {
     s >>= 9;
     bucket += 9;
   }
   while (s >= 0x0100) {
     s >>= 8;
     bucket += 8;
   }
   while (s >= 0x80) {
     s >>= 7;
     bucket += 7;
   }
   while (s >= 0x40) {
     s >>= 6;
     bucket += 6;
   }
   while (s >= 0x20) {
     s >>= 5;
     bucket += 5;
   }
   while (s >= 0x10) {
     s >>= 4;
     bucket += 4;
   }
   while (s >= 0x08) {
     s >>= 3;
     bucket += 3;
   }
   while (s >= 0x04) {
     s >>= 2;
     bucket += 2;
   }
   while (s >= 1) {
     s >>= 1;
     bucket++;
   }
   if (bucket > NBUCKETS) {
     bucket = NBUCKETS;
   }
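The claim that any loop except the last can be dropped can be checked mechanically. The sketch below compares the full staggered version with a cut-down one and with the last loop alone. The function names, the start value `bucket = 0`, and the tested range are assumptions for illustration, and the `NBUCKETS` clamp is left out so that only the shifting logic is compared.

```c
#include <assert.h>

/* Full staggered version, thresholds and shift widths as listed above. */
static int bucket_staggered(unsigned int s)
{
    int bucket = 0;
    while (s >= 0x1000) { s >>= 12; bucket += 12; }
    while (s >= 0x0800) { s >>= 11; bucket += 11; }
    while (s >= 0x0400) { s >>= 10; bucket += 10; }
    while (s >= 0x0200) { s >>= 9;  bucket += 9; }
    while (s >= 0x0100) { s >>= 8;  bucket += 8; }
    while (s >= 0x80)   { s >>= 7;  bucket += 7; }
    while (s >= 0x40)   { s >>= 6;  bucket += 6; }
    while (s >= 0x20)   { s >>= 5;  bucket += 5; }
    while (s >= 0x10)   { s >>= 4;  bucket += 4; }
    while (s >= 0x08)   { s >>= 3;  bucket += 3; }
    while (s >= 0x04)   { s >>= 2;  bucket += 2; }
    while (s >= 1)      { s >>= 1;  bucket++; }
    return bucket;
}

/* Same idea with most loops removed: only the 12-bit, 4-bit and last loop remain. */
static int bucket_reduced(unsigned int s)
{
    int bucket = 0;
    while (s >= 0x1000) { s >>= 12; bucket += 12; }
    while (s >= 0x10)   { s >>= 4;  bucket += 4; }
    while (s >= 1)      { s >>= 1;  bucket++; }
    return bucket;
}

/* Only the last loop: one shift per bit until s reaches 0. */
static int bucket_last_loop_only(unsigned int s)
{
    int bucket = 0;
    while (s >= 1) { s >>= 1; bucket++; }
    return bucket;
}

/* Count the inputs in [0, limit) for which the three variants disagree. */
static unsigned int count_mismatches(unsigned int limit)
{
    unsigned int s, bad = 0;
    for (s = 0; s < limit; s++) {
        int full = bucket_staggered(s);
        if (full != bucket_reduced(s) || full != bucket_last_loop_only(s)) {
            bad++;
        }
    }
    return bad;
}

/* The variants are expected to agree on every input in the tested range. */
static void check_staggered(void)
{
    assert(count_mismatches(1u << 20) == 0);
}
```

The variants agree because a loop only shifts by k bits when s has more than k significant bits, so no bit is ever skipped; the removed loops merely leave more work for the ones that remain.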




Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 10098495 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)


whereas this one:

     s = (size-1) >> 3;
     while (s>1) { s >>= 1; bucket++;}

gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 9720847 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((10098495-9720847)/10098495)*100 = about 3.7% less
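The relative difference between the two throughput figures can be double-checked with a few lines of C. This is only a sketch of the arithmetic; the helper name `percent_slower` is made up for illustration.

```c
#include <assert.h>

/* Relative slowdown of `slow` against `fast`, in percent of `fast`. */
static double percent_slower(double fast, double slow)
{
    return (fast - slow) / fast * 100.0;
}

/* Check with the two ops/sec figures reported above. */
static void check_percent_slower(void)
{
    double p = percent_slower(10098495.0, 9720847.0);
    assert(p > 3.7 && p < 3.8);
}
```

Note that the subtraction has to happen before the division, so the difference is parenthesized first.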

That was all measured on Linux. I haven't done it on the Mac
and on the Sun yet. I now have all versions inside and will
play a little on each platform to see which one operates best
overall. The latest one is more appealing because of the
simplicity of the code, so we can turn a blind eye to those
few percent, I guess.

Cheers
Zoran

