Re: [HACKERS] [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerrj...@ozlabs.org wrote: Move the shift-and-test login into a separate fls() function, which can use __builtin_clz() if it's available. This requires a new check for __builtin_clz in the configure script. Results in a ~2% performance increase on PowerPC. There are some comments on this patch that have been posted to commitfest.postgresql.org rather than emailed to -hackers. Note to all: please don't do this. The purpose of commitfest.postgresql.org is to summarize the mailing list discussion for easy reference, not to replace the mailing lists. That having been said, Jeremy, you probably want to take a look at those comments and I have a few responses to them as well. Dan Colish, the round-robin reviewer assigned to this patch, added the following comment: Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement can be seen. Question: are we only doing this because of PowerPC? What is going to happen on x86 and other architectures? x86 is not the center of the universe, but at a very minimum we need to make sure that things don't go backwards on what is a very common platform. Has anyone done any benchmarking of this? A related question: the original email for this patch says that it results in a performance increase of about 2% on PPC. But since it gives no details on exactly what the test was that improved by 2%, it's hard to judge the impact of this. If this means that AllocSetFreeIndex() is 2% faster, I think we should reject this patch and move on. It's not worth introducing architecture-specific code to get a performance improvement that probably won't even move the needle on a macro-benchmark. On the other hand, if the test was something like a pgbench run, then this is really very impressive. But we don't know which it is. Zdenek Kotala added this comment: I have several objections: - inline is forbidden to use in PostgreSQL - you need exception or do it differently - type mismatch - Size vs unsigned int vs 32. you should use only Size and sizeof(Size) And general comment: Do we want to have this kind of code which is optimized for one compiler and platform. See openSSL or MySQL, they have many optimizations which work fine on one platform with specific version of compiler and specific version of OS. But if you select different version of compiler or different compiler you can get very different performance result (e.g. MySQL 30% degradation, OpenSSL RSA 3x faster and so on). I think in this case, it makes sense to have optimization here, but we should do it like spinlocks are implemented and put this code into /port. It should help to implemented special code for SPARC and SUN Studio for example. I don't have any thoughts on this part beyond what I stated above, but hopefully Jeremy does. I am going to mark this Waiting on author pending a response to all of the above, though hopefully Dan Colish will continue with his reviewing efforts in the meantime. Thanks, ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Robert Haas wrote: On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerrj...@ozlabs.org wrote: Move the shift-and-test login into a separate fls() function, which can use __builtin_clz() if it's available. This requires a new check for __builtin_clz in the configure script. Results in a ~2% performance increase on PowerPC. There are some comments on this patch that have been posted to commitfest.postgresql.org rather than emailed to -hackers. Note to all: please don't do this. The purpose of commitfest.postgresql.org is to summarize the mailing list discussion for easy reference, not to replace the mailing lists. That having been said, Jeremy, you probably want to take a look at those comments and I have a few responses to them as well. Dan Colish, the round-robin reviewer assigned to this patch, added the following comment: Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement can be seen. Question: are we only doing this because of PowerPC? What is going to happen on x86 and other architectures? x86 is not the center of the universe, but at a very minimum we need to make sure that things don't go backwards on what is a very common platform. Has anyone done any benchmarking of this? A related question: the original email for this patch says that it results in a performance increase of about 2% on PPC. But since it gives no details on exactly what the test was that improved by 2%, it's hard to judge the impact of this. If this means that AllocSetFreeIndex() is 2% faster, I think we should reject this patch and move on. It's not worth introducing architecture-specific code to get a performance improvement that probably won't even move the needle on a macro-benchmark. On the other hand, if the test was something like a pgbench run, then this is really very impressive. But we don't know which it is. There are numbers in the original thread this patch http://archives.postgresql.org/pgsql-hackers/2009-06/msg00159.php resulted in. Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Hi Robert, That having been said, Jeremy, you probably want to take a look at those comments and I have a few responses to them as well. OK, thanks for the heads-up. following comment: Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement can be seen. Question: are we only doing this because of PowerPC? No, this patch will benefit any architecture that has a gcc implementation of __builtin_clz. I know that both x86 and powerpc gcc support this. What is going to happen on x86 and other architectures? x86 is not the center of the universe, but at a very minimum we need to make sure that things don't go backwards on what is a very common platform. Has anyone done any benchmarking of this? Yes, Atsushi Ogawa did some benchmarking with this patch on x86, his numbers are here: http://archives.postgresql.org/message-id/4a2895c4.9050...@hi-ho.ne.jp In fact, Atushi originally submitted a patch using inline asm (using bsr) to do this on x86. Coincidentally, I was working on a powerpc implementation (using cntlz) at the same time, so submitted a patch using the gcc builtin that would work on both (and possibly other) architectures. A related question: the original email for this patch says that it results in a performance increase of about 2% on PPC. But since it gives no details on exactly what the test was that improved by 2%, it's hard to judge the impact of this. If this means that AllocSetFreeIndex() is 2% faster, I think we should reject this patch and move on. It's not worth introducing architecture-specific code to get a performance improvement that probably won't even move the needle on a macro-benchmark. On the other hand, if the test was something like a pgbench run, then this is really very impressive. But we don't know which it is. The 2% improvement I saw is for a sysbench OLTP run. I'm happy to re-do the run and report specific numbers if you need them. Zdenek Kotala added this comment: I have several objections: - inline is forbidden to use in PostgreSQL - you need exception or do it differently - type mismatch - Size vs unsigned int vs 32. you should use only Size and sizeof(Size) OK, happy to make these changes. What's the commitfest process here, should I redo the patch and re-send to -hackers? (inline again: should I just make this a static, the compiler can inline where possible? or do you want a macro?) And general comment: Do we want to have this kind of code which is optimized for one compiler and platform. One compiler, multiple platforms. See openSSL or MySQL, they have many optimizations which work fine on one platform with specific version of compiler and specific version of OS. But if you select different version of compiler or different compiler you can get very different performance result (e.g. MySQL 30% degradation, OpenSSL RSA 3x faster and so on). I don't think we're going to see a lot of variation here (besides the difference where __builtin_clz isn't available). Given that this is implemented with a couple of instructions, I'm confident that we won't see any degradation for platforms that support __builtin_clz. For the other cases, we just use the exiting while-loop algorithm so there should be no change (unless we mess up the inlining...). I think in this case, it makes sense to have optimization here, but we should do it like spinlocks are implemented and put this code into /port. Unless I'm missing something, spinlocks are done the same way - we have one file, src/include/storage/s_lock.h, which is mostly different implementations of the lock primitives for different platforms, separated by preprocessor tests. Essentially, this is just one line of code, protected by HAVE_BUILTIN_CLZ (which is a feature-test, rather than a specific platform or compiler test), and is only used in one place. I don't think that warrants a separate file. Cheers, Jeremy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Jeremy Kerr j...@ozlabs.org writes: - inline is forbidden to use in PostgreSQL - you need exception or do it differently (inline again: should I just make this a static, the compiler can inline where possible? or do you want a macro?) I don't know where Zdenek got the idea that we have something against inline. So far as I can see, recent versions of gcc claim to support __builtin_clz on all supported architectures. On some it might be no faster than our existing loop, but it seems unlikely to be slower. The two comments I have are * do something other than the hardwired 32 for word size; perhaps sizeof(int) * BITS_PER_BYTE. * do not use the separate fls function. On a compiler that fails to inline it, this patch would be a net performance loss, which we're not likely to tolerate for a patch that has no other reason to live than performance. Just #if the builtin right into the one place where it will be used. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers