Stefan Kaltenbrunner <[email protected]> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction. It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf
> hmm interesting - I don't have the exact numbers any more but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current-generation Intel
> Nehalem...
Hmm. I may be overestimating the importance of the smaller size
categories. To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine. On the 32-bit
machine I got these totals:
0 5190113
1 5663980
2 3573261
3 4476398
4 4246178
5 1100634
6 386501
7 601654
8 44884
9 52372
10 202801
and on the 64-bit machine these:
0 2139534
1 5994692
2 5711479
3 3289034
4 4550931
5 2573389
6 487566
7 588470
8 155148
9 52750
10 202597
If you want to do the same in some other workload, feel free. I
wouldn't trust the results from a single-purpose benchmark too much,
though.
I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3). (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)
What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime. It's better on an
old PPC Mac; saves about 50%. Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.
I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature. However, some quick testing
didn't find an unrolling that helps much.
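For reference, one shape such an unroll could take (a sketch only, not
from the patch; the name shift_count is made up, and it assumes its
argument is the already-reduced, nonzero value (size - 1) >> ALLOC_MINBITS):

```c
#include <stddef.h>

/*
 * Hypothetical 4-bits-at-a-time unroll of the shift loop in
 * AllocSetFreeIndex.  Expects t = (size - 1) >> ALLOC_MINBITS with
 * t >= 1, and returns the bit length of t, i.e. the same result as
 * the naive do/while loop.
 */
static int
shift_count(size_t t)
{
    int     idx = 0;

    while (t >= 16)             /* consume four bits per iteration */
    {
        idx += 4;
        t >>= 4;
    }
    while (t != 0)              /* at most four single-bit steps remain */
    {
        idx++;
        t >>= 1;
    }
    return idx;
}
```

Whether this beats the naive loop depends on the branch predictor and the
actual size distribution, which is why the call counts above matter.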
regards, tom lane
*** src/backend/storage/ipc/ipc.c.orig Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
on_proc_exit_list[on_proc_exit_index].arg);
on_proc_exit_index = 0;
+
+ AllocSetPrintStats();
}
/* ------------------
*** src/backend/utils/mmgr/aset.c.orig Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
#define AllocAllocInfo(_cxt, _chunk)
#endif
+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+
+ void
+ AllocSetPrintStats()
+ {
+ int i;
+
+ for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+ fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+
/* ----------
* AllocSetFreeIndex -
*
***************
*** 277,282 ****
--- 288,294 ----
size >>= 1;
}
Assert(idx < ALLOCSET_NUM_FREELISTS);
+ allocsizes[idx]++;
}
return idx;
/*
* usage: time ./testit N
* N is a repeat count, set it large enough to get repeatable timings
*/
#include <stdio.h>
#include <stdlib.h>
#define ALLOC_MINBITS 3 /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11
/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
    2139534,
    5994692,
    5711479,
    3289034,
    4550931,
    2573389,
    487566,
    588470,
    155148,
    52750,
    202597
#else
    5190113,
    5663980,
    3573261,
    4476398,
    4246178,
    1100634,
    386501,
    601654,
    44884,
    52372,
    202801
#endif
};
extern int AllocSetFreeIndex(size_t size);
int
main(int argc, char **argv)
{
    int     repeat;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s N\n", argv[0]);
        return 1;
    }
    repeat = atoi(argv[1]);
    while (repeat-- > 0)
    {
        int     k;

        for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
        {
            int     n = numoccurs[k];
            size_t  siz = (1 << (ALLOC_MINBITS + k));

            while (n-- > 0)
            {
                AllocSetFreeIndex(siz);
            }
        }
    }
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#define ALLOC_MINBITS 3 /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11
#define BITS_PER_BYTE 8
/* ----------
* AllocSetFreeIndex -
*
* Depending on the size of an allocation compute which freechunk
* list of the alloc set it belongs to. Caller must have verified
* that size <= ALLOC_CHUNK_LIMIT.
* ----------
*/
int
AllocSetFreeIndex(size_t size)
{
    int     idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        size = (size - 1) >> ALLOC_MINBITS;
#ifdef HAVE_BUILTIN_CLZ
        idx = sizeof(unsigned int) * BITS_PER_BYTE -
            __builtin_clz((unsigned int) size);
#else
        do
        {
            idx++;
            size >>= 1;
        } while (size != 0);
#endif
    }
    return idx;
}
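As a standalone sanity check (not part of the attachment), the two
branches can be verified to compute identical indexes; the helper names
below are made up for the comparison:

```c
#include <stddef.h>

#define ALLOC_MINBITS 3         /* smallest chunk size is 8 bytes */
#define BITS_PER_BYTE 8

/* loop variant, mirroring the #else branch of AllocSetFreeIndex */
static int
free_index_loop(size_t size)
{
    int     idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        size = (size - 1) >> ALLOC_MINBITS;
        do
        {
            idx++;
            size >>= 1;
        } while (size != 0);
    }
    return idx;
}

/* clz variant, mirroring the HAVE_BUILTIN_CLZ branch (gcc-specific) */
static int
free_index_clz(size_t size)
{
    int     idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        size = (size - 1) >> ALLOC_MINBITS;
        idx = sizeof(unsigned int) * BITS_PER_BYTE -
            __builtin_clz((unsigned int) size);
    }
    return idx;
}
```

Note that __builtin_clz is never reached with a zero argument here,
since the size > (1 << ALLOC_MINBITS) guard leaves the shifted value
at least 1; clz(0) would be undefined.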
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers