> On Thu, Dec 20, 2018 at 6:53 AM David Rowley <david.row...@2ndquadrant.com> > wrote: > > Thomas mentions in [1], to get the GCC to use the POPCNT instruction, > we must pass -mpopcnt in the build flags. After doing a bit of > research, I found [2] which mentions that some compilers have some > pattern matching code to allow the popcnt instruction to be used even > without a macro such as __builtin_popcount(). I believe I've > correctly written the run-time test to skip using the new popcnt > function, but if there's any code around that might cause the compiler > to use the popcnt instruction from pattern matching, then that might > cause problems.
I've checked for Clang 6, it turns out that indeed it generates popcnt without any macro, but only in one place for bloom_prop_bits_set. After looking at this function it seems that it would be benefitial to actually use popcnt there too. > I am able to measure performance gains from the patch. In a 3.4GB > table containing a single column with just 10 statistics targets, I > got the following times after running ANALYZE on the table. I've tested it too a bit, and got similar results when the patched version is slightly faster. But then I wonder if popcnt is the best solution here, since after some short research I found a paper [1], where authors claim that: Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. [1]: https://arxiv.org/pdf/1611.07612.pdf