Stefan Kaltenbrunner <ste...@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction.  It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm interesting - I don't have the exact numbers any more but that 
> patch(or a previous version of it) definitly showed a noticable 
> improvement when I tested with sysbench on a current generation Intel 
> Nehalem...

Hmm.  I may be overestimating the importance of the smaller size
categories.  To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine.  On 32 I got
these totals:

0       5190113
1       5663980
2       3573261
3       4476398
4       4246178
5       1100634
6       386501
7       601654
8       44884
9       52372
10      202801

and on 64 these:

0       2139534
1       5994692
2       5711479
3       3289034
4       4550931
5       2573389
6       487566
7       588470
8       155148
9       52750
10      202597

If you want to do the same in some other workload, feel free.  I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3).  (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime.  It's better on an
old PPC Mac; saves about 50%.  Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature.  However, some quick testing
didn't find an unrolling that helps much.

                        regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig  Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c       Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
                                                                  
on_proc_exit_list[on_proc_exit_index].arg);
  
        on_proc_exit_index = 0;
+ 
+       AllocSetPrintStats();
  }
  
  /* ------------------
*** src/backend/utils/mmgr/aset.c.orig  Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c       Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
  
+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+ 
+ void
+ AllocSetPrintStats()
+ {
+       int i;
+ 
+       for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+               fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+ 
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
                        size >>= 1;
                }
                Assert(idx < ALLOCSET_NUM_FREELISTS);
+               allocsizes[idx]++;
        }
  
        return idx;
/*
 * usage: time ./testit N
 *      N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS           3       /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
2139534,
5994692,
5711479,
3289034,
4550931,
2573389,
487566,
588470,
155148,
52750,
202597
#else
5190113,
5663980,
3573261,
4476398,
4246178,
1100634,
386501,
601654,
44884,
52372,
202801
#endif
};


extern int AllocSetFreeIndex(size_t size);


int
main(int argc, char **argv)
{
        int             repeat = atoi(argv[1]);

        while (repeat-- > 0)
        {
                int             k;

                for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
                {
                        int             n = numoccurs[k];
                        size_t  siz = (1 << (ALLOC_MINBITS + k));

                        while (n-- > 0)
                        {
                                AllocSetFreeIndex(siz);
                        }
                }
        }

        return 0;
}
#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS           3       /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *              Depending on the size of an allocation compute which freechunk
 *              list of the alloc set it belongs to.  Caller must have verified
 *              that size <= ALLOC_CHUNK_LIMIT.
 * ----------
 */
int
AllocSetFreeIndex(size_t size)
{
        int                     idx = 0;

        if (size > (1 << ALLOC_MINBITS))
        {
                size = (size - 1) >> ALLOC_MINBITS;

#ifdef HAVE_BUILTIN_CLZ
                idx = sizeof(unsigned int) * BITS_PER_BYTE -
                        __builtin_clz((unsigned int)size);
#else
                do
                {
                        idx++;
                        size >>= 1;
                }
                while (size != 0);
#endif
        }

        return idx;
}
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to