Gustaf Neumann wrote:
This is most probably the best variabt so far, and not complicated, such a
optimizer can do "the right thing" easily. sorry for the many versions..
-gustaf
{ unsigned register int s = (size-1) >> 3;
while (s>1) { s >>= 1; bucket++; }
}
if (bucket > NBUCKETS) {
bucket = NBUCKETS;
}
I don't think anyone has pointed this out yet, but this is a logarithm
in base 2 (log2), and there are a fair number of implementations of this
available; for maximum performance there are assembly implementations
using 'bsr' on x86 architectures, such as this one from google's tcmalloc:
// Return floor(log2(n)) for n > 0.
#if (defined __i386__ || defined __x86_64__) && defined __GNUC__
static inline int LgFloor(size_t n) {
// "ro" for the input spec means the input can come from either a
// register ("r") or offsetable memory ("o").
size_t result;
__asm__("bsr %1, %0"
: "=r" (result) // Output spec
: "ro" (n) // Input spec
: "cc" // Clobbers condition-codes
);
return result;
}
#else
// Note: the following only works for "n"s that fit in 32-bits, but
// that is fine since we only use it for small sizes.
static inline int LgFloor(size_t n) {
int log = 0;
for (int i = 4; i >= 0; --i) {
int shift = (1 << i);
size_t x = n >> shift;
if (x != 0) {
n = x;
log += shift;
}
}
ASSERT(n == 1);
return log;
}
#endif
(Disclaimer - this comment is based on my explorations of zippy, not vt,
so the logic may be entirely different) If this log2(requested_size) is
used to translate directly index into the bucket table that necessarily
restricts you to having power-of-2 bucket sizes, meaning you allocate on
average nearly 50% more than requested (i.e., nearly 33% of allocated
memory is overhead/wasted). Adding more, closer-spaced buckets adds to
the base footprint but possibly reduces the max usage by dropping the
wasted space. I believe tcmalloc uses buckets spaced so that the
average waste is only 12.5%.
-J