On Sat, 16 Mar 2024 at 04:06, Nathan Bossart <nathandboss...@gmail.com> wrote:
> I ran John Naylor's test_popcount module [0] with the following command on
> an i7-1195G7:
>
>         time psql postgres -c 'select drive_popcount(10000000, 1024)'
>
> Without your patches, this seems to take somewhere around 8.8 seconds.
> With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the
> tests a couple of times because I had a difficult time believing the amount
> of improvement.)
>
> [0] 
> https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com

I think most of that will come from getting rid of the indirect
function that currently exists in pg_popcount().

Using the attached quick hack, the performance using John's test
module goes from:

-- master
postgres=# select drive_popcount(10000000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)

-- with attached hacky and untested patch
postgres=# select drive_popcount(10000000, 1024);
Time: 2539.029 ms (00:02.539)
Time: 2598.223 ms (00:02.598)
Time: 2611.435 ms (00:02.611)

--- and with the avx512 patch on an AMD 7945HX CPU:
postgres=# select drive_popcount(10000000, 1024);
Time: 564.982 ms
Time: 556.540 ms
Time: 554.032 ms

The following comment seems like it could do with some improvements.

 * Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software
 * loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast
 * methods are used this will fall back to __builtin_* or pure software.

There's nothing much specific to Intel here.  AMD Zen4 has AVX512.
Plus "pg_bunutils.c" should be "pg_bitutils.c" and "the the"

How about just:

 * Use AVX-512 Intrinsics on supported CPUs. Fall back the software loop in
 * pg_popcount_slow() when AVX-512 is unavailable.

Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff.  It seems like a pretty good speed-up
and will apply for CPUs without AVX512 support.

David
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..85e45cee9b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -305,7 +305,18 @@ pg_popcount(const char *buf, int bytes)
 
                while (bytes >= 8)
                {
-                       popcnt += pg_popcount64(*words++);
+#ifdef _MSC_VER
+                       popcnt += __popcnt64(*words++);
+#else
+                       uint64 res;
+
+                       __asm__ __volatile__(" popcntq %1,%0\n"
+                                                                : "=q"(res)
+                                                                : "rm"(word)
+                                                                : "cc");
+                       popcnt += (int) res;
+                       words++;
+#endif
                        bytes -= 8;
                }
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index f1d18a1b29..ae880db64c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -26,6 +26,7 @@ subdir('test_misc')
 subdir('test_oat_hooks')
 subdir('test_parser')
 subdir('test_pg_dump')
+subdir('test_popcount')
 subdir('test_predtest')
 subdir('test_radixtree')
 subdir('test_rbtree')

Reply via email to