Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Jeremy Kerr <j...@ozlabs.org> writes:
> Thanks for the benchmark app, thought I'd pitch in with some ppc
> results:

It looks to me like we should go with the lookup table approach, as
being the best tradeoff of speed improvement vs platform and compiler
independence. The float hack is right out ;-)

I wonder whether it is worth fooling with alternatives for the data
type of the lookup table entries. unsigned char might be a tad faster
(avoid useless sign-extension work). int might be faster yet, but it
would enlarge the table, creating a distributed overhead that the
microbenchmark would completely fail to show. Or we could buy that
back by reducing the table to cover only 6 bits, knowing that 12 is
more than we need.

			regards, tom lane
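To make the 6-bit idea concrete: after the (size - 1) >> ALLOC_MINBITS
step the value is at most about 10 bits wide for any legal chunk size,
so a 64-entry unsigned char table with a single conditional suffices.
A rough sketch of what such a variant might look like (the function
name and table below are ours, not from the thread, assuming
ALLOC_MINBITS = 3 and an ALLOC_CHUNK_LIMIT of 8192 as in aset.c):

	#include <stddef.h>

	#define ALLOC_MINBITS 3		/* smallest chunk size is 8 bytes */

	/* LogTable64[v] = number of significant bits in v, for 0 <= v < 64 */
	#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
	static const unsigned char LogTable64[64] =
	{
		0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
		LT(5), LT(6), LT(6)
	};

	int
	AllocSetFreeIndex_lt6(size_t size)
	{
		int			idx = 0;

		if (size > (1 << ALLOC_MINBITS))
		{
			unsigned int t;

			/* after this shift the value is at most 10 bits wide */
			size = (size - 1) >> ALLOC_MINBITS;
			t = size >> 6;
			idx = t ? 6 + LogTable64[t] : LogTable64[size];
		}
		return idx;
	}

Since the shifted value never exceeds 1023, the high part (size >> 6)
is at most 15, so both lookups stay inside the 64-entry table.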
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Normally I'd try a small lookup table (1-byte index to 1-byte value) in
this case. But if the bitscan instruction were even close in
performance, it'd be preferable, due to its more-reliable caching
behavior; it should be possible to capture this at code-configuration
time (aligned so as to produce an optimal result for each test case;
see below). The specific code for large-versus-small testing would be
useful; did I overlook it?

Note that instruction alignment with respect to words is not the only
potential instruction-alignment issue. In the past, when optimizing
code to an extreme, I've run into cache-line issues where a small
change that should've produced a small improvement resulted in a
largish performance loss, without further work. Lookup tables can have
an analogous issue; this could, in a simplistic test, explain an
anomalous large-better-than-small result, if part of the large lookup
table remains cached. (Do any modern CPUs attempt to address this??)

This is difficult to tune in a multiplatform code base, so the numbers
in a particular benchmark do not tell the whole tale; you'd need to
make a judgment call, and perhaps to allow a code-configuration
override.

David Hudson
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
p...@thetdh.com writes:
> Normally I'd try a small lookup table (1-byte index to 1-byte value)
> in this case. But if the bitscan instruction were even close in
> performance, it'd be preferable, due to its more-reliable caching
> behavior;

Well, the problem with the bitscan instruction is we won't have it at
all on a lot of configurations. I don't think this problem is so
important as to justify *two* hacked-up code paths.

> The specific code for large-versus-small testing would be useful; did
> I overlook it?

Sorry, I should have provided that. What I've tested is

#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

static const char LogTable256[256] =
{
	0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
	LT(5), LT(6), LT(6), LT(7), LT(7), LT(7), LT(7),
	LT(8), LT(8), LT(8), LT(8), LT(8), LT(8), LT(8), LT(8)
};

int
AllocSetFreeIndex_lt(size_t size)
{
	int			idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		unsigned int t, tt;		// temporaries

		size = (size - 1) >> ALLOC_MINBITS;

		if ((tt = (size >> 16)))
		{
			idx = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
		}
		else
		{
			idx = (t = size >> 8) ? 8 + LogTable256[t] : LogTable256[size];
		}
	}

	return idx;
}

int
AllocSetFreeIndex_lts(size_t size)
{
	int			idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		unsigned int t;			// temporaries

		size = (size - 1) >> ALLOC_MINBITS;

		t = size >> 8;
		idx = t ? 8 + LogTable256[t] : LogTable256[(unsigned int) size];
	}

	return idx;
}

plus the obvious additions to the other file for an additional test
type.

> Note that instruction alignment with respect to words is not the only
> potential instruction-alignment issue. In the past, when optimizing
> code to an extreme, I've run into cache-line issues where a small
> change that should've produced a small improvement resulted in a
> largish performance loss, without further work. Lookup tables can
> have an analogous issue; this could, in a simplistic test, explain an
> anomalous large-better-than-small result, if part of the large lookup
> table remains cached. (Do any modern CPUs attempt to address this??)

Yeah, I've seen similar things when trying to micro-optimize code.
That's why it's so critical to get results from multiple
platforms/machines before trusting the answers too much.

> This is difficult to tune in a multiplatform code base, so the
> numbers in a particular benchmark do not tell the whole tale; you'd
> need to make a judgment call, and perhaps to allow a
> code-configuration override.

Well, my judgment call is that we should go with the lookup table. So
far, the only machine where that doesn't seem to provide a nice win is
my old Mac G4 laptop, which is hardly a platform that people are likely
to be stressing about database performance on. Jeremy's PPC machines
like it a lot more, so there's not something fundamentally wrong with
the approach for that architecture. The other alternatives all have
significantly greater downsides.

			regards, tom lane
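For clarity, the semantics of Tom's LogTable256 above: entry v holds
the number of significant bits in v, so the branches just add the byte
offset (8, 16, or 24) for whichever byte holds the highest set bit. A
tiny generator (ours, not from the thread) that reproduces the table
may be easier to audit than the literal:

	#include <stdio.h>

	int
	main(void)
	{
		int			v;

		for (v = 0; v < 256; v++)
		{
			int			bits = 0;
			int			x = v;

			while (x != 0)		/* count significant bits in v */
			{
				bits++;
				x >>= 1;
			}
			/* bits matches LogTable256[v]: 0 for 0, 1 for 1, 2 for 2-3, ... */
			printf("%d%s", bits, (v % 16 == 15) ? ",\n" : ", ");
		}
		return 0;
	}

(Note this differs by one from the Stanford bithacks table Greg uses
later in the thread, which stores floor(log2(v)) with -1 for zero.)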
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Jeremy Kerr <j...@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex
> to use the __builtin_clz() function to calculate the chunk index.

Per subsequent discussion and testing, I've committed a patch that uses
a lookup table instead of depending on __builtin_clz().
http://archives.postgresql.org/message-id/20090721195312.207ca753...@cvs.postgresql.org

			regards, tom lane
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Jeremy Kerr <j...@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex
> to use the __builtin_clz() function to calculate the chunk index.
> This requires a new check for __builtin_clz in the configure script.
> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop. And there's a problem: on x86_64
it is not much of a win. The code sequence that gcc generates for
__builtin_clz is basically

	bsrl	%eax, %eax
	xorl	$31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction. It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32). It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel. The other problem is that this
approach is gcc-specific, which seems particularly bad for something
that's only going to be a win on non-Intel platforms; they're much more
likely to be using non-gcc compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop. That is, forget about the builtin and do something like

	while (size >= 8)
	{
		idx += 4;
		size >>= 4;
	}
	while (size)
	{
		idx++;
		size >>= 1;
	}

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.
I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.

			regards, tom lane
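Tom's point about the emitted code can be checked in isolation: BSR
yields the bit index of the most significant set bit, and XORing with
31 turns that into the leading-zero count (for i in 0..31, 31 ^ i ==
31 - i). A small self-contained check of the identity (ours, not from
the thread; gcc-specific because of the builtin):

	#include <assert.h>
	#include <stdlib.h>

	/* bit index of the most significant set bit, by naive loop;
	 * x must be nonzero */
	static int
	msb_index(unsigned int x)
	{
		int			i = 0;

		while (x >>= 1)
			i++;
		return i;
	}

	int
	main(void)
	{
		int			n;

		for (n = 0; n < 1000000; n++)
		{
			unsigned int x = ((unsigned int) random()) | 1;	/* nonzero */

			/* gcc's bsrl+xorl computes 31 - msb_index(x), i.e. clz */
			assert((31 ^ msb_index(x)) == __builtin_clz(x));
		}
		return 0;
	}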
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Tom Lane wrote:
> Jeremy Kerr <j...@ozlabs.org> writes:
>> Rather than testing single bits in a loop, change AllocSetFreeIndex
>> to use the __builtin_clz() function to calculate the chunk index.
>> This requires a new check for __builtin_clz in the configure script.
>> Results in a ~2% performance increase on sysbench on PowerPC.
>
> I did some performance testing on this by extracting the
> AllocSetFreeIndex function into a standalone test program that just
> executed it a lot of times in a loop. And there's a problem: on
> x86_64 it is not much of a win. The code sequence that gcc generates
> for __builtin_clz is basically
>
>	bsrl	%eax, %eax
>	xorl	$31, %eax
>
> and it turns out that Intel hasn't seen fit to put a lot of effort
> into the BSR instruction. It's constant time, all right, but on most
> of their CPUs that constant time is like 8 or 16 times slower than an
> ADD; cf http://www.intel.com/Assets/PDF/manual/248966.pdf

hmm interesting - I don't have the exact numbers any more, but that
patch (or a previous version of it) definitely showed a noticeable
improvement when I tested with sysbench on a current-generation Intel
Nehalem...

Stefan
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Stefan Kaltenbrunner <ste...@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort
>> into the BSR instruction. It's constant time, all right, but on most
>> of their CPUs that constant time is like 8 or 16 times slower than
>> an ADD; cf http://www.intel.com/Assets/PDF/manual/248966.pdf
>
> hmm interesting - I don't have the exact numbers any more, but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current-generation Intel
> Nehalem...

Hmm. I may be overestimating the importance of the smaller size
categories. To try to get some trustworthy numbers, I made a
quick-hack patch (attachment #1) to count the actual numbers of calls
to AllocSetFreeIndex, and measured the totals for a run of the
regression tests on both a 32-bit machine and a 64-bit machine.

On 32 I got these totals:

	 0	5190113
	 1	5663980
	 2	3573261
	 3	4476398
	 4	4246178
	 5	1100634
	 6	 386501
	 7	 601654
	 8	  44884
	 9	  52372
	10	 202801

and on 64 these:

	 0	2139534
	 1	5994692
	 2	5711479
	 3	3289034
	 4	4550931
	 5	2573389
	 6	 487566
	 7	 588470
	 8	 155148
	 9	  52750
	10	 202597

If you want to do the same in some other workload, feel free. I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2, #3). (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by very
much --- it saves maybe 25% of the runtime. It's better on an old PPC
Mac; saves about 50%. Still, these are not impressive numbers for a
microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature. However, some quick testing
didn't find an unrolling that helps much.
			regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig	Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c	Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
  						on_proc_exit_list[on_proc_exit_index].arg);
  
  	on_proc_exit_index = 0;
+ 
+ 	AllocSetPrintStats();
  }
  
  /* --------------------------------
*** src/backend/utils/mmgr/aset.c.orig	Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c	Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
  
+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+ 
+ void
+ AllocSetPrintStats()
+ {
+ 	int			i;
+ 
+ 	for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+ 		fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+ 
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
  			size >>= 1;
  		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
+ 		allocsizes[idx]++;
  	}
  
  	return idx;

/*
 * usage: time ./testit N
 * N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
	2139534, 5994692, 5711479, 3289034, 4550931, 2573389,
	487566, 588470, 155148, 52750, 202597
#else
	5190113, 5663980, 3573261, 4476398, 4246178, 1100634,
	386501, 601654, 44884, 52372, 202801
#endif
};

extern int	AllocSetFreeIndex(size_t size);

int
main(int argc, char **argv)
{
	int			repeat = atoi(argv[1]);

	while (repeat-- > 0)
	{
		int			k;

		for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		{
			int			n = numoccurs[k];
			size_t		siz = (1 << (ALLOC_MINBITS + k));

			while (n-- > 0)
			{
				AllocSetFreeIndex(siz);
			}
		}
	}

	return 0;
}

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *		Depending on the size of an allocation compute which freechunk
 *		list of the alloc set it belongs to. Caller must have verified
 *
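Reading the two distributions above as weights gives a rough feel for
why the constant-time approaches have so little headroom: the expected
result category, and hence the naive loop's iteration count, is small.
A throwaway calculation over Tom's counts (ours, not from the thread):

	#include <stdio.h>

	#define NCATS 11

	int
	main(void)
	{
		/* call counts per result category, from Tom's regression runs */
		static const long counts32[NCATS] = {5190113, 5663980, 3573261,
			4476398, 4246178, 1100634, 386501, 601654, 44884, 52372, 202801};
		static const long counts64[NCATS] = {2139534, 5994692, 5711479,
			3289034, 4550931, 2573389, 487566, 588470, 155148, 52750, 202597};
		const long *tables[2] = {counts32, counts64};
		int			t, k;

		for (t = 0; t < 2; t++)
		{
			double		wsum = 0, total = 0;

			for (k = 0; k < NCATS; k++)
			{
				wsum += (double) k * tables[t][k];
				total += tables[t][k];
			}
			/* prints roughly 2.3 for 32-bit and 2.7 for 64-bit */
			printf("%s: mean category %.2f\n", t ? "64-bit" : "32-bit",
				   wsum / total);
		}
		return 0;
	}

That works out to roughly 2.3 on the 32-bit counts and 2.7 on the
64-bit ones, i.e. the loop being replaced runs fewer than three
shift-and-test iterations on an average call.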
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature. However, some quick testing
> didn't find an unrolling that helps much.

Hmm, actually this seems to work ok:

	idx++;
	size >>= 1;
	if (size != 0)
	{
		idx++;
		size >>= 1;
		if (size != 0)
		{
			idx++;
			size >>= 1;
			if (size != 0)
			{
				idx++;
				size >>= 1;
				while (size != 0)
				{
					idx++;
					size >>= 1;
				}
			}
		}
	}

(this is with the initial "if (size > (1 << ALLOC_MINBITS))" so that we
know the starting value is nonzero)

This seems to be about a wash or a small gain on x86_64, but on my PPC
Mac laptop it's very nearly identical in speed to the __builtin_clz
code. I also see a speedup on HPPA, for which my gcc is too old to
know about __builtin_clz. Anyone want to see if they can beat that?
Some testing on other architectures would help too.

			regards, tom lane
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
On Mon, Jul 20, 2009 at 8:37 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Anyone want to see if they can beat that? Some testing on other
> architectures would help too.

Hm, I took the three implementations so far (normal, unrolled, and clz)
as well as the two from
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
and got some very strange results:

	normal:        1.494s
	clz:           2.214s
	unrolled:      2.966s
	lookup table:  0.001s
	float hack:   11.930s

I can't see why the unrolled implementation is slower than the
non-unrolled so I'm suspecting something's wrong with my #ifdefs but I
don't see it. I do think the code I grabbed from the stanford page
might be off-by-one for our purposes but I haven't looked closely at
that.

I also wonder if this microbenchmark is actually ok because it's
testing the same value over and over so any branch-prediction will
shine unrealistically well.

-- 
greg
http://mit.edu/~gsstark/resume.pdf

/*
 * usage: time ./testit N
 * N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
	2139534, 5994692, 5711479, 3289034, 4550931, 2573389,
	487566, 588470, 155148, 52750, 202597
#else
	5190113, 5663980, 3573261, 4476398, 4246178, 1100634,
	386501, 601654, 44884, 52372, 202801
#endif
};

extern int	AllocSetFreeIndex(size_t size);

int
main(int argc, char **argv)
{
	int			repeat = atoi(argv[1]);

	while (repeat-- > 0)
	{
		int			k;

		for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		{
			int			n = numoccurs[k];
			size_t		siz = (1 << (ALLOC_MINBITS + k));

			while (n-- > 0)
			{
				AllocSetFreeIndex(siz);
			}
		}
	}

	return 0;
}

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *		Depending on the size of an allocation compute which freechunk
 *		list of the alloc set it belongs to. Caller must have verified
 *		that size <= ALLOC_CHUNK_LIMIT.
 * ----------
 */
int
AllocSetFreeIndex(size_t size)
{
	int			idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		size = (size - 1) >> ALLOC_MINBITS;

#if HAVE_BUILTIN_CLZ
		idx = sizeof(unsigned int) * BITS_PER_BYTE -
			__builtin_clz((unsigned int) size);
#elif TEST_FLOAT_HACK
		union
		{
			unsigned int u[2];
			double		d;
		}			t;			// temp

		t.u[__FLOAT_WORD_ORDER == LITTLE_ENDIAN] = 0x43300000;
		t.u[__FLOAT_WORD_ORDER != LITTLE_ENDIAN] = size;
		t.d -= 4503599627370496.0;
		idx = (t.u[__FLOAT_WORD_ORDER == LITTLE_ENDIAN] >> 20) - 0x3FF;
#elif TEST_LOOKUP_TABLE
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
		static const char LogTable256[256] =
		{
			-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
			LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
			LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
		};
		unsigned int t, tt;		// temporaries

		if (tt = size >> 16)
		{
			idx = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
		}
		else
		{
			idx = (t = size >> 8) ? 8 + LogTable256[t] : LogTable256[size];
		}
#elif TEST_UNROLL
		idx++;
		size >>= 1;
		if (size != 0)
		{
			idx++;
			size >>= 1;
			if (size != 0)
			{
				idx++;
				size >>= 1;
				if (size != 0)
				{
					idx++;
					size >>= 1;
					while (size != 0)
					{
						idx++;
						size >>= 1;
					}
				}
			}
		}
#else
		do
		{
			idx++;
			size >>= 1;
		} while (size != 0);
#endif
	}

	return idx;
}
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Doh, I forgot this bit. Will rerun tests now.

On Mon, Jul 20, 2009 at 8:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> (Note: I found out that it's necessary to split the test into two
> files --- otherwise gcc will inline AllocSetFreeIndex and partially
> const-fold the work, leading to skewed results.)

-- 
greg
http://mit.edu/~gsstark/resume.pdf
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Greg Stark <gsst...@mit.edu> writes:
> I also wonder if this microbenchmark is actually ok because it's
> testing the same value over and over so any branch-prediction will
> shine unrealistically well.

Yeah, that is a good point --- and it would benefit the unrolled loop
more than the other versions. We should probably revise the test
harness so it mixes the size requests a bit. I'm not sure of a
suitably simple way to do that, though.

			regards, tom lane
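One simple way to mix the requests (Greg's revised harness in the next
message does essentially this, in fuller form): precompute an array of
sizes drawn from the measured distribution, then shuffle it before
timing. A minimal Fisher-Yates sketch (ours, not from the thread;
assumes arr[] has already been filled from the distribution):

	#include <stdlib.h>

	/*
	 * Shuffle the precomputed request sizes so consecutive calls hit
	 * different categories, defeating trivial branch prediction.
	 */
	static void
	shuffle_sizes(size_t *arr, int n)
	{
		int			i;

		for (i = n - 1; i > 0; i--)
		{
			int			j = random() % (i + 1);
			size_t		tmp = arr[i];

			arr[i] = arr[j];
			arr[j] = tmp;
		}
	}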
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
On Tue, Jul 21, 2009 at 12:07 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Greg Stark <gsst...@mit.edu> writes:
>> I also wonder if this microbenchmark is actually ok because it's
>> testing the same value over and over so any branch-prediction will
>> shine unrealistically well.
>
> Yeah, that is a good point --- and it would benefit the unrolled loop
> more than the other versions. We should probably revise the test
> harness so it mixes the size requests a bit. I'm not sure of a
> suitably simple way to do that, though.

Well it was a bit of a pain but I filled an array with (1/1000 scaled
down) values and then permuted them. I also went ahead and set the
low-order bits to random values since the lookup table based algorithm
might be affected by it.

The results are a bit disappointing on my machine, only the CLZ and
lookup table come out significantly ahead:

$ ./testit 10
   0:      620
   1:     4949
   2:     5378
   3:     3560
   4:     4426
   5:     4218
   6:     1098
   7:      387
   8:      599
   9:       44
  10:       52

	clz           1.530s
	lookup table  1.720s
	float hack    4.424s
	unrolled      5.280s
	normal        5.369s

-- 
greg
http://mit.edu/~gsstark/resume.pdf

/*
 * usage: time ./testit N
 * N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
	2139534, 5994692, 5711479, 3289034, 4550931, 2573389,
	487566, 588470, 155148, 52750, 202597
#else
	5190113, 5663980, 3573261, 4476398, 4246178, 1100634,
	386501, 601654, 44884, 52372, 202801
#endif
};

extern int	AllocSetFreeIndex_clz(size_t size);
extern int	AllocSetFreeIndex_lt(size_t size);
extern int	AllocSetFreeIndex_fh(size_t size);
extern int	AllocSetFreeIndex_ur(size_t size);
extern int	AllocSetFreeIndex(size_t size);

int
main(int argc, char **argv)
{
	int			repeat = atoi(argv[1]);
	int			k, l, m, min = 1000, total = 0;
	size_t	   *arr;
	struct timeval before, after;
	double		elapsed;

//	for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
//		if (numoccurs[k] < min)
//			min = numoccurs[k];

	for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		total += numoccurs[k] / min;

	arr = malloc(sizeof(size_t) * total);
	m = 0;
	for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		for (l = 0; l < numoccurs[k] / min; l++)
		{
			int			p = 1 << (ALLOC_MINBITS + k);

			p |= (p * 2 - 1) & random();
			arr[m++] = p;
		}

	for (m = 0; m < total; m++)
	{
		int			r = random() % total;

		if (r != m)
		{
			int			tmp = arr[r];

			arr[r] = arr[m];
			arr[m] = tmp;
		}
	}

	repeat *= min;

	static int	occurs[ALLOCSET_NUM_FREELISTS];

	for (m = 0; m < total; m++)
	{
		int			retval_clz = AllocSetFreeIndex_clz(arr[m]);
		int			retval_fh = AllocSetFreeIndex_fh(arr[m]);
		int			retval_lt = AllocSetFreeIndex_lt(arr[m]);
		int			retval_ur = AllocSetFreeIndex_ur(arr[m]);
		int			retval = AllocSetFreeIndex(arr[m]);

		if (retval != retval_clz ||
			retval != retval_fh ||
			retval != retval_lt ||
			retval != retval_ur)
			printf("%08x: %4d %4d %4d %4d %4d\n", arr[m], retval,
				   retval_clz, retval_fh, retval_lt, retval_ur);
		else
			occurs[retval]++;
	}

	for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		printf("%4d: %8d\n", k, occurs[k]);

	gettimeofday(&before, NULL);
	for (l = 0; l < repeat; l++)
		for (m = 0; m < total; m++)
			AllocSetFreeIndex_clz(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec) / 1000000.0;
	printf("%20s %.3fs\n", "clz", elapsed);

	gettimeofday(&before, NULL);
	for (l = 0; l < repeat; l++)
		for (m = 0; m < total; m++)
			AllocSetFreeIndex_lt(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec) / 1000000.0;
	printf("%20s %.3fs\n", "lookup table", elapsed);

	gettimeofday(&before, NULL);
	for (l = 0; l < repeat; l++)
		for (m = 0; m < total; m++)
			AllocSetFreeIndex_fh(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec) / 1000000.0;
	printf("%20s %.3fs\n", "float hack", elapsed);

	gettimeofday(&before, NULL);
	for (l = 0; l < repeat; l++)
		for (m = 0; m < total; m++)
			AllocSetFreeIndex_ur(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec) / 1000000.0;
	printf("%20s %.3fs\n", "unrolled", elapsed);

	gettimeofday(&before, NULL);
	for (l = 0; l < repeat; l++)
		for (m = 0; m < total; m++)
			AllocSetFreeIndex(arr[m]);
	gettimeofday(&after, NULL);
	elapsed = after.tv_sec - before.tv_sec;
	elapsed += (after.tv_usec - before.tv_usec) / 1000000.0;
	printf("%20s %.3fs\n", "normal", elapsed);

	return 0;
}

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Greg Stark <gsst...@mit.edu> writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based
> algorithm might be affected by it.
>
> The results are a bit disappointing on my machine, only the CLZ and
> lookup table come out significantly ahead:
>
>	clz           1.530s
>	lookup table  1.720s
>	float hack    4.424s
>	unrolled      5.280s
>	normal        5.369s

It strikes me that we could assume that the values are < 64K and hence
drop the first case in the lookup table code. I've added that variant
and get these results on my machines:

x86_64 (Xeon):

	clz                15.357s
	lookup table       16.582s
	small lookup table 16.705s
	float hack         25.138s
	unrolled           64.630s
	normal             79.025s

PPC:

	clz                 3.842s
	lookup table        7.298s
	small lookup table  8.799s
	float hack         19.418s
	unrolled            7.656s
	normal              8.949s

HPPA:

	clz                  (n/a)
	lookup table       11.515s
	small lookup table 10.803s
	float hack         16.502s
	unrolled           17.632s
	normal             19.754s

Not sure why the small lookup table variant actually seems slower than
the original on two of these machines; it can hardly be slower in
reality, since it's strictly less code. Maybe some weird
code-alignment issue?

It also seems weird that the x86_64 is now showing a much bigger gap
between clz and normal than before. I don't see how branch prediction
would do much for the normal code.

			regards, tom lane
Re: [HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Hi Greg,

Thanks for the benchmark app, thought I'd pitch in with some ppc
results:

> clz           1.530s
> lookup table  1.720s
> float hack    4.424s
> unrolled      5.280s
> normal        5.369s

POWER5+:

	clz           2.046s
	lookup table  2.188s
	float hack    7.786s
	unrolled      6.353s
	normal        7.033s

POWER6:

	clz           1.063s
	lookup table  1.199s
	float hack    3.575s
	unrolled      2.290s
	normal        3.456s

Cheers,

Jeremy
[HACKERS] [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Rather than testing single bits in a loop, change AllocSetFreeIndex to
use the __builtin_clz() function to calculate the chunk index.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on sysbench on PowerPC.

Signed-off-by: Jeremy Kerr <j...@ozlabs.org>

---
v4: don't use separate fls() function, remove hardcoded 32

---
 configure.in                  |   13 +
 src/backend/utils/mmgr/aset.c |   14 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 ****
--- 1361,1379 ----
  case $host_os in
  	AC_FUNC_FSEEKO;;
  esac
  
+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+ 
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+ 	[__builtin_clz(0);],
+ 	[AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+ 		[Define to 1 if you have __builtin_clz().])
+ 	 AC_MSG_RESULT(yes)],
+ 	[AC_MSG_RESULT(no)])
+ 
  #
  # Pthreads
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 268,281 ****
  AllocSetFreeIndex(Size size)
  {
  	int			idx = 0;
  
! 	if (size > 0)
  	{
  		size = (size - 1) >> ALLOC_MINBITS;
  		while (size != 0)
  		{
  			idx++;
  			size >>= 1;
  		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
--- 268,293 ----
  {
  	int			idx = 0;
  
! 	if (size > (1 << ALLOC_MINBITS))
  	{
  		size = (size - 1) >> ALLOC_MINBITS;
+ 
+ #ifdef HAVE_BUILTIN_CLZ
+ 		/* If possible, use __builtin_clz to calculate the chunk index - this
+ 		 * should be O(1) rather than O(n). The builtin takes an unsigned int,
+ 		 * so we need to cast from the possibly 64-bit Size type. This cast
+ 		 * won't overflow, as the limit is at 2^(32 + ALLOC_MINBITS) bytes,
+ 		 * which is larger than ALLOC_CHUNK_LIMIT.
+ 		 */
+ 		idx = sizeof(unsigned int) * BITS_PER_BYTE -
+ 			__builtin_clz((unsigned int) size);
+ #else
  		while (size != 0)
  		{
  			idx++;
  			size >>= 1;
  		}
+ #endif
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
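As a sanity check of the patch's arithmetic, a small standalone program
(ours, not part of the patch; needs gcc for the builtin, and assumes
ALLOC_CHUNK_LIMIT is 8192 as in aset.c) can confirm that the clz
expression and the shift loop agree for every legal chunk size:

	#include <assert.h>
	#include <stddef.h>

	#define ALLOC_MINBITS 3
	#define ALLOC_CHUNK_LIMIT 8192
	#define BITS_PER_BYTE 8

	int
	main(void)
	{
		size_t		size;

		for (size = 1; size <= ALLOC_CHUNK_LIMIT; size++)
		{
			int			idx_loop = 0;
			int			idx_clz = 0;

			if (size > (1 << ALLOC_MINBITS))
			{
				size_t		s = (size - 1) >> ALLOC_MINBITS;

				/* the patch's formula */
				idx_clz = sizeof(unsigned int) * BITS_PER_BYTE -
					__builtin_clz((unsigned int) s);

				/* the original loop */
				while (s != 0)
				{
					idx_loop++;
					s >>= 1;
				}
			}
			assert(idx_clz == idx_loop);
		}
		return 0;
	}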