On Sat, 16 Mar 2024 at 04:06, Nathan Bossart <nathandboss...@gmail.com> wrote: > I ran John Naylor's test_popcount module [0] with the following command on > an i7-1195G7: > > time psql postgres -c 'select drive_popcount(10000000, 1024)' > > Without your patches, this seems to take somewhere around 8.8 seconds. > With your patches, it takes 0.6 seconds. (I re-compiled and re-ran the > tests a couple of times because I had a difficult time believing the amount > of improvement.) > > [0] > https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com
I think most of that will come from getting rid of the indirect function that currently exists in pg_popcount(). Using the attached quick hack, the performance using John's test module goes from: -- master postgres=# select drive_popcount(10000000, 1024); Time: 9832.845 ms (00:09.833) Time: 9844.460 ms (00:09.844) Time: 9858.608 ms (00:09.859) -- with attached hacky and untested patch postgres=# select drive_popcount(10000000, 1024); Time: 2539.029 ms (00:02.539) Time: 2598.223 ms (00:02.598) Time: 2611.435 ms (00:02.611) --- and with the avx512 patch on an AMD 7945HX CPU: postgres=# select drive_popcount(10000000, 1024); Time: 564.982 ms Time: 556.540 ms Time: 554.032 ms The following comment seems like it could do with some improvements. * Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software * loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast * methods are used this will fall back to __builtin_* or pure software. There's nothing much specific to Intel here. AMD Zen4 has AVX512. Plus "pg_bunutils.c" should be "pg_bitutils.c" and "the the" How about just: * Use AVX-512 Intrinsics on supported CPUs. Fall back the software loop in * pg_popcount_slow() when AVX-512 is unavailable. Maybe it's worth exploring something along the lines of the attached before doing the AVX512 stuff. It seems like a pretty good speed-up and will apply for CPUs without AVX512 support. David
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 640a89561a..85e45cee9b 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -305,7 +305,18 @@ pg_popcount(const char *buf, int bytes) while (bytes >= 8) { - popcnt += pg_popcount64(*words++); +#ifdef _MSC_VER + popcnt += __popcnt64(*words++); +#else + uint64 res; + + __asm__ __volatile__(" popcntq %1,%0\n" + : "=q"(res) + : "rm"(word) + : "cc"); + popcnt += (int) res; + words++; +#endif bytes -= 8; } diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build index f1d18a1b29..ae880db64c 100644 --- a/src/test/modules/meson.build +++ b/src/test/modules/meson.build @@ -26,6 +26,7 @@ subdir('test_misc') subdir('test_oat_hooks') subdir('test_parser') subdir('test_pg_dump') +subdir('test_popcount') subdir('test_predtest') subdir('test_radixtree') subdir('test_rbtree')