Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-21 Thread Tom Lane
Jeremy Kerr <j...@ozlabs.org> writes:
> Thanks for the benchmark app, thought I'd pitch in with some ppc
> results:

It looks to me like we should go with the lookup table approach, as
being the best tradeoff of speed improvement vs platform and compiler
independence.  The float hack is right out ;-)

I wonder whether it is worth fooling with alternatives for the data type
of the lookup table entries.  unsigned char might be a tad faster
(avoid useless sign-extension work).  int might be faster yet, but it
would enlarge the table, creating a distributed overhead that the
microbenchmark would completely fail to show.  Or we could buy that
back by reducing the table to cover only 6 bits, knowing that 12 is
more than we need.
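
For concreteness, a 6-bit variant of that idea might look like the sketch
below (illustrative only --- LogTable64 and AllocSetFreeIndex_lt6 are made-up
names, and nothing like this was actually benchmarked in the thread).  One
conditional 6-bit step plus a 64-entry unsigned char table covers 12 bits of
input, and the post-shift value here never needs more than 10:

#define LT6(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

static const unsigned char LogTable64[64] =
{
	0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
	LT6(5), LT6(6), LT6(6)
};

int
AllocSetFreeIndex_lt6(size_t size)
{
	int		idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		unsigned int t;

		size = (size - 1) >> ALLOC_MINBITS;
		t = (unsigned int) size >> 6;
		idx = t ? 6 + LogTable64[t] : LogTable64[(unsigned int) size];
	}

	return idx;
}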

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-21 Thread pg
Normally I'd try a small lookup table (1-byte index to 1-byte value) in this
case. But if the bitscan instruction were even close in performance, it'd be
preferable, due to its more-reliable caching behavior; it should be possible to
capture this at code-configuration time (aligned so as to produce an optimal
result for each test case; see below).

The specific code for large-versus-small testing would be useful; did I 
overlook it?

Note that instruction alignment with respect to words is not the only potential 
instruction-alignment issue. In the past, when optimizing code to an extreme, 
I've run into cache-line issues where a small change that should've produced a 
small improvement resulted in a largish performance loss, without further work. 
Lookup tables can have an analogous issue; this could, in a simplistic test, 
explain an anomalous large-better-than-small result, if part of the large 
lookup table remains cached. (Do any modern CPUs attempt to address this??) 
This is difficult to tune in a multiplatform code base, so the numbers in a 
particular benchmark do not tell the whole tale; you'd need to make a judgment 
call, and perhaps to allow a code-configuration override.

David Hudson





Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-21 Thread Tom Lane
<p...@thetdh.com> writes:
> Normally I'd try a small lookup table (1-byte index to 1-byte value)
> in this case. But if the bitscan instruction were even close in
> performance, it'd be preferable, due to its more-reliable caching
> behavior;

Well, the problem with the bitscan instruction is we won't have it at
all on a lot of configurations.  I don't think this problem is so
important as to justify *two* hacked-up code paths.

> The specific code for large-versus-small testing would be useful; did I
> overlook it?

Sorry, I should have provided that.  What I've tested is


#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

static const char LogTable256[256] = 
{
0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
LT(5), LT(6), LT(6), LT(7), LT(7), LT(7), LT(7),
LT(8), LT(8), LT(8), LT(8), LT(8), LT(8), 
LT(8), LT(8)
};


int
AllocSetFreeIndex_lt(size_t size)
{
	int		idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		unsigned int t, tt;		// temporaries

		size = (size - 1) >> ALLOC_MINBITS;

		if ((tt = (size >> 16)))
		{
			idx = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
		}
		else
		{
			idx = (t = size >> 8) ? 8 + LogTable256[t] : LogTable256[size];
		}
	}

	return idx;
}

int
AllocSetFreeIndex_lts(size_t size)
{
	int		idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		unsigned int t;			// temporary

		size = (size - 1) >> ALLOC_MINBITS;

		t = size >> 8;
		idx = t ? 8 + LogTable256[t] : LogTable256[(unsigned int) size];
	}

	return idx;
}

plus the obvious additions to the other file for an additional test
type.
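
(Those additions aren't shown, but presumably they just mirror the
per-variant pattern already in the timing harness further down --- roughly:)

extern int AllocSetFreeIndex_lt(size_t size);
extern int AllocSetFreeIndex_lts(size_t size);

	/* ... and in main(), one more timing loop per variant: */
	gettimeofday(&before, NULL);
	for (l = 0; l < repeat; l++)
		for (m = 0; m < total; m++)
			AllocSetFreeIndex_lts(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec) / 1000000.0;
	printf("%20s %.3fs\n", "small lookup table", elapsed);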

> Note that instruction alignment with respect to words is not the only
> potential instruction-alignment issue. In the past, when optimizing
> code to an extreme, I've run into cache-line issues where a small
> change that should've produced a small improvement resulted in a
> largish performance loss, without further work. Lookup tables can have
> an analogous issue; this could, in a simplistic test, explain an
> anomalous large-better-than-small result, if part of the large lookup
> table remains cached. (Do any modern CPUs attempt to address this??)

Yeah, I've seen similar things when trying to micro-optimize code.
That's why it's so critical to get results from multiple
platforms/machines before trusting the answers too much.

> This is difficult to tune in a multiplatform code base, so the numbers
> in a particular benchmark do not tell the whole tale; you'd need to
> make a judgment call, and perhaps to allow a code-configuration
> override.

Well, my judgment call is that we should go with the lookup table.
So far, the only machine where that doesn't seem to provide a nice win
is my old Mac G4 laptop, which is hardly a platform that people are
likely to be stressing about database performance on.  Jeremy's PPC
machines like it a lot more, so there's not something fundamentally
wrong with the approach for that architecture.  The other alternatives
all have significantly greater downsides.

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-21 Thread Tom Lane
Jeremy Kerr <j...@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

Per subsequent discussion and testing, I've committed a patch that
uses a lookup table instead of depending on __builtin_clz().
http://archives.postgresql.org/message-id/20090721195312.207ca753...@cvs.postgresql.org

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Tom Lane
Jeremy Kerr <j...@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop.  And there's a problem: on
x86_64 it is not much of a win.  The code sequence that gcc generates
for __builtin_clz is basically

	bsrl	%eax, %eax
	xorl	$31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction.  It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf
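
(For reference, a sketch --- not from the thread --- of why that xorl is
there: BSR yields the bit index of the highest set bit, so for nonzero
32-bit x, clz(x) = 31 - bsr(x), and for 0 <= n <= 31, n ^ 31 == 31 - n.)

/* roughly what gcc's __builtin_clz expands to on x86; x must be nonzero */
static inline int
clz32(unsigned int x)
{
	int		hibit;

	__asm__("bsrl %1, %0" : "=r" (hibit) : "rm" (x));
	return hibit ^ 31;			/* == 31 - hibit, the leading-zero count */
}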

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32).  It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop.  That is, forget about the builtin and do something like

	while (size >= 8)
	{
		idx += 4;
		size >>= 4;
	}
	while (size)
	{
		idx++;
		size >>= 1;
	}

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10 (ALLOC_CHUNK_LIMIT is 8K, so at most 10
bits remain after the initial shift by ALLOC_MINBITS), and will usually
be quite a bit less.

I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Stefan Kaltenbrunner

Tom Lane wrote:
> Jeremy Kerr <j...@ozlabs.org> writes:
>> Rather than testing single bits in a loop, change AllocSetFreeIndex to
>> use the __builtin_clz() function to calculate the chunk index.

>> This requires a new check for __builtin_clz in the configure script.

>> Results in a ~2% performance increase on sysbench on PowerPC.

> I did some performance testing on this by extracting the
> AllocSetFreeIndex function into a standalone test program that just
> executed it a lot of times in a loop.  And there's a problem: on
> x86_64 it is not much of a win.  The code sequence that gcc generates
> for __builtin_clz is basically
>
> 	bsrl	%eax, %eax
> 	xorl	$31, %eax
>
> and it turns out that Intel hasn't seen fit to put a lot of effort into
> the BSR instruction.  It's constant time, all right, but on most of
> their CPUs that constant time is like 8 or 16 times slower than an ADD;
> cf http://www.intel.com/Assets/PDF/manual/248966.pdf



hmm interesting - I don't have the exact numbers any more, but that
patch (or a previous version of it) definitely showed a noticeable
improvement when I tested with sysbench on a current-generation Intel
Nehalem...



Stefan



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Tom Lane
Stefan Kaltenbrunner <ste...@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction.  It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm interesting - I don't have the exact numbers any more, but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current-generation Intel
> Nehalem...

Hmm.  I may be overestimating the importance of the smaller size
categories.  To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine.  On 32 I got
these totals:

0   5190113
1   5663980
2   3573261
3   4476398
4   4246178
5   1100634
6   386501
7   601654
8   44884
9   52372
10  202801

and on 64 these:

0   2139534
1   5994692
2   5711479
3   3289034
4   4550931
5   2573389
6   487566
7   588470
8   155148
9   52750
10  202597

If you want to do the same in some other workload, feel free.  I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3).  (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)
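
(Concretely: build the timing main and the AllocSetFreeIndex variants as
separate translation units, along the lines of

	gcc -O2 -o testit testit.c allocsetfreeindex.c

with illustrative file names --- without cross-file optimization gcc can't
see the function body at the call site, so nothing gets inlined or
const-folded.)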

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime.  It's better on an
old PPC Mac; saves about 50%.  Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature.  However, some quick testing
didn't find an unrolling that helps much.

regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig  Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c   Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
  
on_proc_exit_list[on_proc_exit_index].arg);
  
on_proc_exit_index = 0;
+ 
+   AllocSetPrintStats();
  }
  
  /* --
*** src/backend/utils/mmgr/aset.c.orig  Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c   Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
  
+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+ 
+ void
+ AllocSetPrintStats()
+ {
+   int i;
+ 
+ 	for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+ 		fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+ 
  /* --
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
size >>= 1;
}
Assert(idx < ALLOCSET_NUM_FREELISTS);
+   allocsizes[idx]++;
}
  
return idx;
/*
 * usage: time ./testit N
 *  N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS   3   /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
2139534,
5994692,
5711479,
3289034,
4550931,
2573389,
487566,
588470,
155148,
52750,
202597
#else
5190113,
5663980,
3573261,
4476398,
4246178,
1100634,
386501,
601654,
44884,
52372,
202801
#endif
};


extern int AllocSetFreeIndex(size_t size);


int
main(int argc, char **argv)
{
int repeat = atoi(argv[1]);

while (repeat-- > 0)
{
int k;

for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
{
int n = numoccurs[k];
size_t  siz = (1 << (ALLOC_MINBITS + k));

while (n-- > 0)
{
AllocSetFreeIndex(siz);
}
}
}

return 0;
}
#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS   3   /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11
#define BITS_PER_BYTE 8

/* --
 * AllocSetFreeIndex -
 *
 *  Depending on the size of an allocation compute which freechunk
 *  list of the alloc set it belongs to.  Caller must have verified
 *   

Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Tom Lane
I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature.  However, some quick testing
> didn't find an unrolling that helps much.

Hmm, actually this seems to work ok:

	idx++;
	size >>= 1;
	if (size != 0)
	{
		idx++;
		size >>= 1;
		if (size != 0)
		{
			idx++;
			size >>= 1;
			if (size != 0)
			{
				idx++;
				size >>= 1;
				while (size != 0)
				{
					idx++;
					size >>= 1;
				}
			}
		}
	}

(this is with the initial if (size > (1 << ALLOC_MINBITS)) test, so that
we know the starting value is nonzero)

This seems to be about a wash or a small gain on x86_64, but on my
PPC Mac laptop it's very nearly identical in speed to the __builtin_clz
code.  I also see a speedup on HPPA, for which my gcc is too old to
know about __builtin_clz.

Anyone want to see if they can beat that?  Some testing on other
architectures would help too.

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Greg Stark
On Mon, Jul 20, 2009 at 8:37 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Anyone want to see if they can beat that?  Some testing on other
> architectures would help too.

Hm, I took the three implementations so far (normal, unrolled, and
clz) as well as the two from
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
and got some very strange results:


normal:   1.494s
clz:  2.214s
unrolled: 2.966s
lookup table: 0.001s
float hack:  11.930s


I can't see why the unrolled implementation is slower than the
non-unrolled so I'm suspecting something's wrong with my #ifdefs but I
don't see it.

I do think the code I grabbed from the stanford page might be
off-by-one for our purposes but I haven't looked closely at that.

I also wonder if this microbenchmark is actually ok because it's
testing the same value over and over so any branch-prediction will
shine unrealistically well.



-- 
greg
http://mit.edu/~gsstark/resume.pdf

/*
 * usage: time ./testit N
 *  N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS   3   /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
2139534,
5994692,
5711479,
3289034,
4550931,
2573389,
487566,
588470,
155148,
52750,
202597
#else
5190113,
5663980,
3573261,
4476398,
4246178,
1100634,
386501,
601654,
44884,
52372,
202801
#endif
};


extern int AllocSetFreeIndex(size_t size);


int
main(int argc, char **argv)
{
   int repeat = atoi(argv[1]);

   while (repeat-- > 0)
   {
   int k;

   for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
   {
   int n = numoccurs[k];
   size_t  siz = (1 << (ALLOC_MINBITS + k));

   while (n-- > 0)
   {
   AllocSetFreeIndex(siz);
   }
   }
   }

   return 0;
}

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS   3   /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11
#define BITS_PER_BYTE 8

/* --
 * AllocSetFreeIndex -
 *
 *  Depending on the size of an allocation compute which freechunk
 *  list of the alloc set it belongs to.  Caller must have verified
 *  that size <= ALLOC_CHUNK_LIMIT.
 * --
 */
int
AllocSetFreeIndex(size_t size)
{
   int idx = 0;

   if (size > (1 << ALLOC_MINBITS))
   {
   size = (size - 1) >> ALLOC_MINBITS;

#if HAVE_BUILTIN_CLZ
   idx = sizeof(unsigned int) * BITS_PER_BYTE -
   __builtin_clz((unsigned int)size);

#elif TEST_FLOAT_HACK

	   union { unsigned int u[2]; double d; } t; // temp

	   t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
	   t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = size;
	   t.d -= 4503599627370496.0;
	   idx = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
	   
#elif TEST_LOOKUP_TABLE

#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

	   static const char LogTable256[256] = 
		 {
		   -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
		   LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
		   LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
		 };

	   unsigned int t, tt; // temporaries

	   if (tt = size >> 16)
		 {
		   idx = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
		 }
	   else
		 {
		   idx = (t = size >> 8) ? 8 + LogTable256[t] : LogTable256[size];
		 }

#elif TEST_UNROLL

   idx++;
   size >>= 1;
   if (size != 0)
   {
	   idx++;
	   size >>= 1;
	   if (size != 0)
	   {
		   idx++;
		   size >>= 1;
		   if (size != 0)
		   {
			   idx++;
			   size >>= 1;
			   while (size != 0)
			   {
				   idx++;
				   size >>= 1;
			   }
		   }
	   }
   }
#else 
   do
   {
   idx++;
   size >>= 1;
   }
   while (size != 0);
#endif
   }

   return idx;
}



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Greg Stark
Doh, I forgot this bit. Will rerun tests now.

On Mon, Jul 20, 2009 at 8:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> (Note: I found
> out that it's necessary to split the test into two files --- otherwise
> gcc will inline AllocSetFreeIndex and partially const-fold the work,
> leading to skewed results.)




-- 
greg
http://mit.edu/~gsstark/resume.pdf



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Tom Lane
Greg Stark <gsst...@mit.edu> writes:
> I also wonder if this microbenchmark is actually ok because it's
> testing the same value over and over so any branch-prediction will
> shine unrealistically well.

Yeah, that is a good point --- and it would benefit the unrolled loop
more than the other versions.  We should probably revise the test
harness so it mixes the size requests a bit.  I'm not sure of a suitably
simple way to do that, though.
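
(One simple approach --- a sketch only, and essentially what Greg's harness
below ends up doing --- is to expand the per-category counts into an array of
size requests and shuffle it before timing; numoccurs[] and total are as in
the existing harness:)

	size_t	   *arr = malloc(sizeof(size_t) * total);
	int			i, j, m;

	/* one entry per call, reproducing the measured distribution */
	m = 0;
	for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
		for (j = 0; j < numoccurs[i]; j++)
			arr[m++] = (size_t) 1 << (ALLOC_MINBITS + i);

	/* Fisher-Yates shuffle so the timed loop sees a mixed request stream */
	for (m = total - 1; m > 0; m--)
	{
		int			r = random() % (m + 1);
		size_t		tmp = arr[r];

		arr[r] = arr[m];
		arr[m] = tmp;
	}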

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Greg Stark
On Tue, Jul 21, 2009 at 12:07 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Greg Stark <gsst...@mit.edu> writes:
>> I also wonder if this microbenchmark is actually ok because it's
>> testing the same value over and over so any branch-prediction will
>> shine unrealistically well.
>
> Yeah, that is a good point --- and it would benefit the unrolled loop
> more than the other versions.  We should probably revise the test
> harness so it mixes the size requests a bit.  I'm not sure of a suitably
> simple way to do that, though.

Well it was a bit of a pain but I filled an array with (1/1000 scaled
down) values and then permuted them. I also went ahead and set the
low-order bits to random values since the lookup table based algorithm
might be affected by it.

The results are a bit disappointing on my machine, only the CLZ and
lookup table come out significantly ahead:

$ ./testit 10
   0:  620
   1: 4949
   2: 5378
   3: 3560
   4: 4426
   5: 4218
   6: 1098
   7:  387
   8:  599
   9:   44
  10:   52
 clz 1.530s
lookup table 1.720s
  float hack 4.424s
unrolled 5.280s
  normal 5.369s



-- 
greg
http://mit.edu/~gsstark/resume.pdf

/*
 * usage: time ./testit N
 *  N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ALLOC_MINBITS   3   /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS  11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
	2139534,
	5994692,
	5711479,
	3289034,
	4550931,
	2573389,
	487566,
	588470,
	155148,
	52750,
	202597
#else
	5190113,
	5663980,
	3573261,
	4476398,
	4246178,
	1100634,
	386501,
	601654,
	44884,
	52372,
	202801
#endif
};

extern int AllocSetFreeIndex_clz(size_t size);
extern int AllocSetFreeIndex_lt(size_t size);
extern int AllocSetFreeIndex_fh(size_t size);
extern int AllocSetFreeIndex_ur(size_t size);
extern int AllocSetFreeIndex(size_t size);

int
main(int argc, char **argv)
{
	int repeat = atoi(argv[1]);
	int k,l,m, min=1000, total=0;
	size_t *arr;
	struct timeval before, after;
	double elapsed;

//	for (k=0; k < ALLOCSET_NUM_FREELISTS; k++)
//		if (numoccurs[k] < min)
//			min = numoccurs[k];
	for (k=0; k < ALLOCSET_NUM_FREELISTS; k++)
		total += numoccurs[k]/min;

	arr = malloc(sizeof(size_t)*total);

	m = 0;
	for (k=0; k < ALLOCSET_NUM_FREELISTS; k++)
		for (l=0; l < numoccurs[k]/min; l++)
		{
			int p = 1 << (ALLOC_MINBITS + k);
			p |= (p*2 - 1) & random();
			arr[m++] = p;
		}

	for (m=0; m < total; m++)
	{
		int r = random()%total;
		if (r != m) 
		{
			int tmp = arr[r];
			arr[r] = arr[m];
			arr[m] = tmp;
		}
	}

	repeat *= min;

	static int occurs[ALLOCSET_NUM_FREELISTS];

	for (m=0; m < total; m++)
	{
		int retval_clz = AllocSetFreeIndex_clz(arr[m]);
		int retval_fh  = AllocSetFreeIndex_fh(arr[m]);
		int retval_lt  = AllocSetFreeIndex_lt(arr[m]);
		int retval_ur  = AllocSetFreeIndex_ur(arr[m]);
		int retval = AllocSetFreeIndex(arr[m]);
		
		if (retval != retval_clz ||
			retval != retval_fh ||
			retval != retval_lt ||
			retval != retval_ur)
			printf("%08x: %4d %4d %4d %4d %4d\n", arr[m], retval, retval_clz, retval_fh, retval_lt, retval_ur);
		else
			occurs[retval]++;
	}
	for (k=0; k < ALLOCSET_NUM_FREELISTS; k++)
		printf("%4d: %8d\n", k, occurs[k]);
	
	gettimeofday(&before, NULL);
	for(l=0; l < repeat; l++)
		for (m=0; m < total; m++)
			AllocSetFreeIndex_clz(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec)/1000000.0;
	printf("%20s %.3fs\n", "clz", elapsed);

	gettimeofday(&before, NULL);
	for(l=0; l < repeat; l++)
		for (m=0; m < total; m++)
			AllocSetFreeIndex_lt(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec)/1000000.0;
	printf("%20s %.3fs\n", "lookup table", elapsed);

	gettimeofday(&before, NULL);
	for(l=0; l < repeat; l++)
		for (m=0; m < total; m++)
			AllocSetFreeIndex_fh(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec)/1000000.0;
	printf("%20s %.3fs\n", "float hack", elapsed);

	gettimeofday(&before, NULL);
	for(l=0; l < repeat; l++)
		for (m=0; m < total; m++)
			AllocSetFreeIndex_ur(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec)/1000000.0;
	printf("%20s %.3fs\n", "unrolled", elapsed);

	gettimeofday(&before, NULL);
	for(l=0; l < repeat; l++)
		for (m=0; m < total; m++)
			AllocSetFreeIndex(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec)/1000000.0;
	printf("%20s %.3fs\n", "normal", elapsed);

	return 0;
}

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS   3   /* smallest chunk size is 8 bytes */
#define 

Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Tom Lane
Greg Stark <gsst...@mit.edu> writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based algorithm
> might be affected by it.

> The results are a bit disappointing on my machine, only the CLZ and
> lookup table come out significantly ahead:

>  clz 1.530s
> lookup table 1.720s
>   float hack 4.424s
> unrolled 5.280s
>   normal 5.369s

It strikes me that we could assume that the values are < 64K and hence
drop the first case in the lookup table code.  I've added that variant
and get these results on my machines:

x86_64 (Xeon):

 clz 15.357s
lookup table 16.582s
  small lookup table 16.705s
  float hack 25.138s
unrolled 64.630s
  normal 79.025s

PPC:

 clz 3.842s
lookup table 7.298s
  small lookup table 8.799s
  float hack 19.418s
unrolled 7.656s
  normal 8.949s

HPPA:
 clz (n/a)
lookup table 11.515s
  small lookup table 10.803s
  float hack 16.502s
unrolled 17.632s
  normal 19.754s

Not sure why the small lookup table variant actually seems slower
than the original on two of these machines; it can hardly be slower in
reality since it's strictly less code.  Maybe some weird code-alignment
issue?

It also seems weird that the x86_64 is now showing a much bigger gap
between clz and normal than before.  I don't see how branch prediction
would do much for the normal code.

regards, tom lane



Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-20 Thread Jeremy Kerr
Hi Greg,

Thanks for the benchmark app, thought I'd pitch in with some ppc
results:

>  clz 1.530s
> lookup table 1.720s
>   float hack 4.424s
> unrolled 5.280s
>   normal 5.369s

POWER5+:
 clz 2.046s
lookup table 2.188s
  float hack 7.786s
unrolled 6.353s
  normal 7.033s

POWER6:
 clz 1.063s
lookup table 1.199s
  float hack 3.575s
unrolled 2.290s
  normal 3.456s

Cheers,


Jeremy




[HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

2009-07-19 Thread Jeremy Kerr
Rather than testing single bits in a loop, change AllocSetFreeIndex to
use the __builtin_clz() function to calculate the chunk index.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on sysbench on PowerPC.

Signed-off-by: Jeremy Kerr <j...@ozlabs.org>

---
v4: don't use separate fls() function, remove hardcoded 32

---
 configure.in  |   13 +
 src/backend/utils/mmgr/aset.c |   14 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

*** a/configure.in
--- b/configure.in
*************** case $host_os in
*** 1361,1366 ****
--- 1361,1379 ----
AC_FUNC_FSEEKO;;
  esac
  
+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+ 
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+   [__builtin_clz(0);],
+   [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+   [Define to 1 if you have __builtin_clz().])
+   AC_MSG_RESULT(yes)],
+   [AC_MSG_RESULT(no)])
+ 
  
  #
  # Pthreads
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*************** AllocSetFreeIndex(Size size)
*** 268,281 ****
  {
int idx = 0;
  
!   if (size > 0)
{
size = (size - 1) >> ALLOC_MINBITS;
while (size != 0)
{
idx++;
size >>= 1;
}
Assert(idx < ALLOCSET_NUM_FREELISTS);
}
  
--- 268,293 ----
  {
int idx = 0;
  
!   if (size > (1 << ALLOC_MINBITS))
{
size = (size - 1) >> ALLOC_MINBITS;
+ 
+ #ifdef HAVE_BUILTIN_CLZ
+   /* If possible, use __builtin_clz to calculate the chunk index - this
+    * should be O(1) rather than O(n). The builtin takes an unsigned int,
+    * so we need to cast from the possibly 64-bit Size type. This cast
+    * won't overflow, as the limit is at 2^(32 + ALLOC_MINBITS) bytes,
+    * which is larger than ALLOC_CHUNK_LIMIT.
+    */
+   idx = sizeof(unsigned int) * BITS_PER_BYTE -
+       __builtin_clz((unsigned int)size);
+ #else
while (size != 0)
{
idx++;
size >>= 1;
}
+ #endif
+   Assert(idx < ALLOCSET_NUM_FREELISTS);
}
  
