Re: Popcount optimization using AVX512
On Thu, Nov 07, 2024 at 08:38:21PM +0000, Devulapalli, Raghuveer wrote:
>>> Of course, as soon as I committed this, I noticed that it's broken. It
>>> seems that
>>> compilers are rather picky about how multiple target options are specified.
>
> Just curious, which compiler complained?

Clang.

--
nathan
RE: Popcount optimization using AVX512
> Of course, as soon as I committed this, I noticed that it's broken. It seems
> that
> compilers are rather picky about how multiple target options are specified.

Just curious, which compiler complained?

Raghuveer
Re: Popcount optimization using AVX512
On Thu, Nov 07, 2024 at 02:03:04PM -0600, Nathan Bossart wrote:
> Committed.

Of course, as soon as I committed this, I noticed that it's broken.  It
seems that compilers are rather picky about how multiple target options are
specified.  AFAICT the commonly supported syntax is to put the entire list
within one pair of quotes and to use only commas as delimiters, i.e., no
spaces.  I plan to commit the attached shortly once I've had a chance to
verify it fixes the problem on cfbot.

--
nathan

From a5ade18a867377fa424347465bbc5f631eff4f96 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 7 Nov 2024 14:28:57 -0600
Subject: [PATCH 1/1] fix __attribute__((target(...))) usage

---
 config/c-compiler.m4          | 2 +-
 configure                     | 2 +-
 meson.build                   | 2 +-
 src/port/pg_popcount_avx512.c | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index c7eb896f14..a129edb88e 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -733,7 +733,7 @@ AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
 AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
 #endif
 static int popcount_test(void)
 {
diff --git a/configure b/configure
index 3a7332f834..4b01b682b1 100755
--- a/configure
+++ b/configure
@@ -17324,7 +17324,7 @@ else
 /* end confdefs.h.  */
 #include <immintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
 #endif
 static int popcount_test(void)
 {
diff --git a/meson.build b/meson.build
index 9eddd72a27..5b0510cef7 100644
--- a/meson.build
+++ b/meson.build
@@ -2184,7 +2184,7 @@ if host_cpu == 'x86_64'
 #include <immintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
 #endif
 int main(void)
 {
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index b598e86554..1ab2847bf2 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -106,7 +106,7 @@
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
  */
-pg_attribute_target("avx512vpopcntdq", "avx512bw")
+pg_attribute_target("avx512vpopcntdq,avx512bw")
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
@@ -162,7 +162,7 @@
  * pg_popcount_masked_avx512
  *		Returns the number of 1-bits in buf after applying the mask to each byte
  */
-pg_attribute_target("avx512vpopcntdq", "avx512bw")
+pg_attribute_target("avx512vpopcntdq,avx512bw")
 uint64
 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 {
--
2.39.5 (Apple Git-154)
Re: Popcount optimization using AVX512
Committed.

--
nathan
Re: Popcount optimization using AVX512
On Thu, Nov 07, 2024 at 11:12:37AM -0500, Andres Freund wrote:
> One thing I'd like to see this being used for is to elide the indirection
> when the current target platform *already* supports the necessary
> intrinsics. Adding a bunch of indirection for short & common operations is
> decidedly not great. It doesn't have to be part of the same commit, but it
> seems like it's worth doing as part of the same series, as I think it'll
> lead to rather different looking configure checks.

The main hurdle, at least for AVX-512, is that we still need to check (at
runtime) whether the OS supports XGETBV and whether the ZMM registers are
fully enabled.  We might be able to skip those checks in limited cases
(e.g., you are building on the target machine and can perhaps just check it
once at build time), but that probably won't help packagers.

>> +/*
>> + * pg_attribute_target allows specifying different target options that the
>> + * function should be compiled with (e.g., for using special CPU
>> + * instructions).
>> + */
>> +#if __has_attribute (target)
>> +#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
>> +#else
>> +#define pg_attribute_target(...)
>> +#endif
>
> Think it'd be good to mention that there still needs to be a configure
> check to verify that the specific target attribute is understood by the
> compiler.

Will do.

--
nathan
Re: Popcount optimization using AVX512
Hi,

On 2024-11-06 20:26:47 -0600, Nathan Bossart wrote:
> From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart
> Date: Wed, 16 Oct 2024 15:57:55 -0500
> Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff

One thing I'd like to see this being used for is to elide the indirection
when the current target platform *already* supports the necessary
intrinsics. Adding a bunch of indirection for short & common operations is
decidedly not great.

It doesn't have to be part of the same commit, but it seems like it's worth
doing as part of the same series, as I think it'll lead to rather different
looking configure checks.

> diff --git a/src/include/c.h b/src/include/c.h
> index 55dec71a6d..6f5ca25542 100644
> --- a/src/include/c.h
> +++ b/src/include/c.h
> @@ -174,6 +174,16 @@
>  #define pg_attribute_nonnull(...)
>  #endif
>
> +/*
> + * pg_attribute_target allows specifying different target options that the
> + * function should be compiled with (e.g., for using special CPU
> + * instructions).
> + */
> +#if __has_attribute (target)
> +#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
> +#else
> +#define pg_attribute_target(...)
> +#endif

Think it'd be good to mention that there still needs to be a configure check
to verify that the specific target attribute is understood by the compiler.

Greetings,

Andres Freund
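[Editor's note: the macro discussed above can be exercised standalone.  The
following is a minimal, illustrative sketch, not PostgreSQL's actual code
(the real definition lives in src/include/c.h); the extra
defined(__has_attribute) guard and the "popcnt" target string are this
sketch's own choices, and the function name is made up.]

```c
#include <stdint.h>

/*
 * Sketch of the proposed pg_attribute_target macro.  The outer
 * defined(__has_attribute) test avoids problems on compilers (e.g., MSVC)
 * that lack __has_attribute entirely; on anything else without attribute
 * support, the macro expands to nothing.
 */
#if defined(__has_attribute)
#if __has_attribute(target)
#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
#endif
#endif
#ifndef pg_attribute_target
#define pg_attribute_target(...)
#endif

/*
 * On x86-64, enable the near-universal POPCNT instruction for just this
 * function; a real AVX-512 user would pass "avx512vpopcntdq,avx512bw"
 * instead (as one comma-separated string, per the fix discussed upthread).
 * Elsewhere the attribute expands to nothing and the compiler emits its
 * portable fallback for the builtin.
 */
#ifdef __x86_64__
pg_attribute_target("popcnt")
#endif
static int
popcount64(uint64_t x)
{
	return __builtin_popcountll(x);
}
```

Note that this only shows the per-function mechanism; as Andres points out,
a configure-time probe is still needed to confirm the compiler understands
the specific target string.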
Re: Popcount optimization using AVX512
rebased -- nathan >From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 64 +- configure| 167 +++ configure.ac | 17 +-- meson.build | 21 ++-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 11 files changed, 183 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..c7eb896f14 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,22 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. 
AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("xsave"))) +#endif +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +727,29 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("avx512vpopcntdq","avx512bw"))) +#endif +static int popcount_test(void) +{ + const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + return (int) popcnt; +}], + [return popcount_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_POPCNT="$1" pgac_avx512_popcnt_intrinsics=yes fi undefine([Ac_cachevar])dnl diff --git a/configure b/configure index 6e256b417b..3a7332f834 100755 --- a/configure +++ b/configure @@ -647,9 +647,6 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC -PG_POPCNT_OBJS -CFLAGS_POPCNT -CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17272,185 +17269,103 @@ fi # Check for XSAVE intrinsics # 
-CFLAGS_XSAVE="" -{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5 -$as_echo_n "ch
Re: Popcount optimization using AVX512
On Thu, Oct 31, 2024 at 07:58:06PM +0000, Devulapalli, Raghuveer wrote:
> LGTM.

Thanks.  Barring additional feedback, I plan to commit this soon.

--
nathan
Re: Popcount optimization using AVX512
On Wed, Oct 30, 2024 at 04:10:10PM -0500, Nathan Bossart wrote: > On Wed, Oct 30, 2024 at 08:53:10PM +, Raghuveer Devulapalli wrote: >> BTW, I just realized function attributes for xsave and avx512 don't work >> on MSVC (see >> https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). >> Not sure if you care about it. Its an easy fix (see >> https://gcc.godbolt.org/z/Pebdj3vMx). > > Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the > configure programs for meson. pg_attribute_target will be empty on MSVC, > and I believe we only support meson builds there. Here is an updated patch with this change. -- nathan >From 8cf7c08220a9c0a1dec809794af2dfb719981923 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v2 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 60 +- configure| 163 ++- configure.ac | 17 +-- meson.build | 21 ++-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 - 11 files changed, 175 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..aa90f8ef33 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. 
AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("xsave"))) +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("avx512vpopcntdq","avx512bw"))) +static int popcount_test(void) +{ + const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + return (int) popcnt; +}], + [return popcount_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cach
RE: Popcount optimization using AVX512
> Here is an updated patch with this change.

LGTM.

Raghuveer
RE: Popcount optimization using AVX512
> Oh, good catch.  IIUC we only need to check for #ifndef _MSC_VER in the
> configure programs for meson.  pg_attribute_target will be empty on MSVC,
> and I believe we only support meson builds there.

Right. __has_attribute (target) produces a compiler warning on MSVC:
https://gcc.godbolt.org/z/EfWGxbvj3. Might need to guard that with
#if defined(__has_attribute) to get rid of it.

>
> --
> nathan
Re: Popcount optimization using AVX512
On Wed, Oct 30, 2024 at 08:53:10PM +0000, Raghuveer Devulapalli wrote:
> BTW, I just realized function attributes for xsave and avx512 don't work
> on MSVC (see
> https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630).
> Not sure if you care about it. It's an easy fix (see
> https://gcc.godbolt.org/z/Pebdj3vMx).

Oh, good catch.  IIUC we only need to check for #ifndef _MSC_VER in the
configure programs for meson.  pg_attribute_target will be empty on MSVC,
and I believe we only support meson builds there.

--
nathan
Re: Popcount optimization using AVX512
BTW, I just realized function attributes for xsave and avx512 don't work on
MSVC (see
https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630).
Not sure if you care about it.  It's an easy fix (see
https://gcc.godbolt.org/z/Pebdj3vMx).
Re: Popcount optimization using AVX512
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, failed
Spec compliant:           tested, failed
Documentation:            tested, failed

Changes LGTM. Makes the Makefile look clean. Built and ran tests with
`make check` with gcc-13 on an ICX and gcc-11 on SKX. I built on top of
this patch and converted SSE4.2 and AVX-512 CRC32C to use function
attributes too.

The new status of this patch is: Ready for Committer
Re: Popcount optimization using AVX512
On Tue, Oct 08, 2024 at 09:36:03PM -0500, Nathan Bossart wrote:
> On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote:
>> On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
>>> I think we'd be better off enabling architectural features on a
>>> per-function basis, roughly like this:
>>>
>>> [...]
>>>
>>> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq
>>> -mavx512bw support */
>>> pg_enable_target("avx512vpopcntdq,avx512bw")
>>> uint64_t
>>> pg_popcount_avx512(const char *buf, int bytes)
>>> ...
>>
>> I remember wondering why the CRC-32C code wasn't already doing something
>> like this (old compiler versions? non-gcc-like compilers?), and I'm not
>> sure I ever discovered the reason, so out of an abundance of caution I
>> used the same approach for AVX-512.  If we can convince ourselves that
>> __attribute__((target("..."))) is standard enough at this point, +1 for
>> moving to that.
>
> [...]
>
> So, at least for the CRC code, __attribute__((target("..."))) was probably
> not widely available enough yet when it was first added.  Unfortunately,
> the ARMv8 CRC target support (without -march) is still pretty new, but it
> might be possible to switch the others to a per-function approach in v18.

Here is a first attempt at using __attribute__((target("..."))) for the
AVX-512 stuff.  Besides allowing us to consolidate the code into a single
file, this simplifies the build file changes quite a bit.
-- nathan >From c97e25e56347c90f169a5ce069a9ea06c873915b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v1 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 60 +- configure| 163 ++- configure.ac | 17 +-- meson.build | 17 +-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 - 11 files changed, 171 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..aa90f8ef33 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. 
AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("xsave"))) +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics]
Re: Popcount optimization using AVX512
On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote:
> On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
>> I think we'd be better off enabling architectural features on a
>> per-function basis, roughly like this:
>>
>> [...]
>>
>> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq
>> -mavx512bw support */
>> pg_enable_target("avx512vpopcntdq,avx512bw")
>> uint64_t
>> pg_popcount_avx512(const char *buf, int bytes)
>> ...
>
> I remember wondering why the CRC-32C code wasn't already doing something
> like this (old compiler versions? non-gcc-like compilers?), and I'm not
> sure I ever discovered the reason, so out of an abundance of caution I used
> the same approach for AVX-512.  If we can convince ourselves that
> __attribute__((target("..."))) is standard enough at this point, +1 for
> moving to that.

I looked into this some more, and found the following:

* We added SSE 4.2 CRC support in April 2015 (commit 3dc2d62).  gcc support
  for __attribute__((target("sse4.2"))) was added in 4.9.0 (April 2014).
  clang support was added in 3.8 (March 2016).

* We added ARMv8 CRC support in April 2018 (commit f044d71).  gcc support
  for __attribute__((target("+crc"))) was added in 6.3 (December 2016).  I
  didn't find precisely when clang support was added, but until 16.0.0
  (March 2023), including arm_acle.h requires the -march flag [0], and you
  had to use "crc" (plus sign omitted) as the target [1].

* We added AVX-512 support in April 2024 (commit 792752a).  gcc support for
  __attribute__((target("avx512vpopcntdq,avx512bw"))) was added in 7.1 (May
  2017).  clang support was added in 5.0.0 (September 2017).  However, the
  "xsave" target was not supported until 9.1 for gcc (May 2019) and 9.0.0
  for clang (September 2019), and we need that for our AVX-512 code, too.

So, at least for the CRC code, __attribute__((target("..."))) was probably
not widely available enough yet when it was first added.  Unfortunately,
the ARMv8 CRC target support (without -march) is still pretty new, but it
might be possible to switch the others to a per-function approach in v18.

[0] https://github.com/llvm/llvm-project/commit/30b67c6
[1] https://releases.llvm.org/16.0.0/tools/clang/docs/ReleaseNotes.html#arm-and-aarch64-support

--
nathan
Re: Popcount optimization using AVX512
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
> On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
>> As I started on this, I remembered why I needed it.  The file
>> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in
>> order to avoid inadvertently issuing any AVX-512 instructions before
>> determining we have support.  If that's not a concern, we could still
>> probably remove the XSAVE check.
>
> I think it's a valid concern - but isn't that theoretically also an issue
> with xsave itself? I guess practically the compiler won't do that, because
> there's no practical reason to emit any instructions enabled by -mxsave
> (in contrast to e.g. -mavx, which does trigger gcc to emit different
> instructions even for basic math).

Yeah, this crossed my mind.  It's certainly not the sturdiest of
assumptions...

> I think enabling options like these on a per-translation-unit basis isn't
> really a scalable approach. To actually be safe there could only be a
> single function in each TU and that function could only be called after a
> cpuid check performed in a separate TU. That a) ends up pretty unreadable
> b) requires functions to be implemented in .c files, which we really
> don't want for some of this.

Agreed.

> I think we'd be better off enabling architectural features on a
> per-function basis, roughly like this:
>
> [...]
>
> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq
> -mavx512bw support */
> pg_enable_target("avx512vpopcntdq,avx512bw")
> uint64_t
> pg_popcount_avx512(const char *buf, int bytes)
> ...

I remember wondering why the CRC-32C code wasn't already doing something
like this (old compiler versions? non-gcc-like compilers?), and I'm not
sure I ever discovered the reason, so out of an abundance of caution I used
the same approach for AVX-512.  If we can convince ourselves that
__attribute__((target("..."))) is standard enough at this point, +1 for
moving to that.

--
nathan
Re: Popcount optimization using AVX512
Hi,

On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
>> On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
>>> My point is that _xgetbv() is made available by -mavx512vpopcntdq
>>> -mavx512bw alone, without needing -mxsave:
>>
>> Oh, I see.  I'll work on a patch to remove that compiler check, then...
>
> As I started on this, I remembered why I needed it.  The file
> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
> to avoid inadvertently issuing any AVX-512 instructions before determining
> we have support.  If that's not a concern, we could still probably remove
> the XSAVE check.

I think it's a valid concern - but isn't that theoretically also an issue
with xsave itself? I guess practically the compiler won't do that, because
there's no practical reason to emit any instructions enabled by -mxsave (in
contrast to e.g. -mavx, which does trigger gcc to emit different
instructions even for basic math).

I think this is one of the few instances where msvc has the right approach -
if I use intrinsics to emit a specific instruction, the intrinsic should do
so, regardless of whether the compiler is allowed to do so on its own.

I think enabling options like these on a per-translation-unit basis isn't
really a scalable approach. To actually be safe there could only be a single
function in each TU and that function could only be called after a cpuid
check performed in a separate TU. That a) ends up pretty unreadable b)
requires functions to be implemented in .c files, which we really don't want
for some of this.

I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:

https://godbolt.org/z/a4q9Gc6Ez

For posterity, in the unlikely case anybody reads this after godbolt shuts
down: I'm thinking we'd have an attribute like this:

/*
 * GCC like compilers don't support intrinsics without those intrinsics
 * explicitly having been enabled. We can't just add these options more
 * widely, as that allows the compiler to emit such instructions more
 * widely, even if we gate reaching the code using intrinsics. So we just
 * enable the relevant support for individual functions.
 *
 * In contrast to this, msvc allows use of intrinsics independent of what
 * the compiler otherwise is allowed to emit.
 */
#ifdef __GNUC__
#define pg_enable_target(foo) __attribute__ ((__target__ (foo)))
#else
#define pg_enable_target(foo)
#endif

and then use that selectively for some functions:

/* FIXME: Should be gated by configure check of -mavx512vpopcntdq
-mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...

Greetings,

Andres Freund
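[Editor's note: to make the "choose once, then call through a pointer"
indirection being criticized here concrete, the following is a hedged,
self-contained sketch.  The function names, the stubbed-out availability
check, and the elision macro at the end are all illustrative, not
PostgreSQL's actual code; a real build would gate the choice on
CPUID/XGETBV and compile the fast path with the target attribute.]

```c
#include <stdint.h>
#include <stddef.h>

/* Always-available scalar implementation. */
static uint64_t
pg_popcount_slow(const char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		while (b)
		{
			cnt += b & 1;
			b >>= 1;
		}
	}
	return cnt;
}

/*
 * Stand-in for the runtime probe; a real build would run CPUID and XGETBV
 * here, and the AVX-512 variant would be compiled with
 * pg_attribute_target("avx512vpopcntdq,avx512bw").
 */
static int
avx512_available(void)
{
	return 0;					/* pretend the check failed */
}

static uint64_t pg_popcount_choose(const char *buf, size_t bytes);

/* First call goes through the chooser, which then caches its pick. */
static uint64_t (*pg_popcount_ptr) (const char *, size_t) = pg_popcount_choose;

static uint64_t
pg_popcount_choose(const char *buf, size_t bytes)
{
	if (avx512_available())
		pg_popcount_ptr = pg_popcount_slow;	/* would be the AVX-512 version */
	else
		pg_popcount_ptr = pg_popcount_slow;
	return pg_popcount_ptr(buf, bytes);
}

#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512BW__)
/*
 * The compilation target (-march=...) already guarantees the intrinsics,
 * so the indirection could be elided and pg_popcount() could call the
 * AVX-512 implementation directly.  (Stand-in: this sketch has no AVX-512
 * body, so it still points at the scalar one.)
 */
#define pg_popcount(buf, bytes) pg_popcount_slow(buf, bytes)
#else
#define pg_popcount(buf, bytes) pg_popcount_ptr(buf, bytes)
#endif
```

Every call after the first then pays only one indirect call; the `#if`
shows where the indirection could be dropped entirely when the baseline is
known at compile time, which is the elision being requested.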
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
>> My point is that _xgetbv() is made available by -mavx512vpopcntdq
>> -mavx512bw alone, without needing -mxsave:
>
> Oh, I see.  I'll work on a patch to remove that compiler check, then...

As I started on this, I remembered why I needed it.  The file
pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
to avoid inadvertently issuing any AVX-512 instructions before determining
we have support.  If that's not a concern, we could still probably remove
the XSAVE check.

--
nathan
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
> On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote:
>> The main purpose of the XSAVE compiler check is to determine whether we
>> need to add -mxsave in order to use _xgetbv() [0].  If that wasn't a
>> factor, we could probably skip it.  Earlier versions of the patch used
>> inline assembly in the non-MSVC path to call XGETBV, which I was trying
>> to avoid.
>
> My point is that _xgetbv() is made available by -mavx512vpopcntdq
> -mavx512bw alone, without needing -mxsave:

Oh, I see.  I'll work on a patch to remove that compiler check, then...

--
nathan
Re: Popcount optimization using AVX512
Hi,

On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote:
>> On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote:
>>> On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote:
>>>> Why are we actually checking for xsave? We're not using xsave itself
>>>> and I couldn't find a comment in 792752af4eb5 explaining what we're
>>>> using it as a proxy for? Is that just to know if _xgetbv() exists? Is
>>>> it actually possible that xsave isn't available when avx512 is?
>>>
>>> Yes, it's to verify we have XGETBV, which IIUC requires support from
>>> both the processor and the OS (see 598e011 and upthread discussion).
>>> AFAIK the way we are detecting AVX-512 support is quite literally
>>> by-the-book unless I've gotten something wrong.
>>
>> I'm basically wondering whether we need to check for compiler (not OS
>> support) support for xsave if we also check for -mavx512vpopcntdq
>> -mavx512bw support. Afaict the latter implies support for xsave.
>
> The main purpose of the XSAVE compiler check is to determine whether we
> need to add -mxsave in order to use _xgetbv() [0].  If that wasn't a
> factor, we could probably skip it.  Earlier versions of the patch used
> inline assembly in the non-MSVC path to call XGETBV, which I was trying
> to avoid.

My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw
alone, without needing -mxsave:

echo -e '#include <immintrin.h>\nint main() { return _xgetbv(0) & 0xe0; }' | time gcc -march=x86-64 -c -xc - -o /dev/null
-> fails

echo -e '#include <immintrin.h>\nint main() { return _xgetbv(0) & 0xe0;}' | time gcc -march=x86-64 -mavx512vpopcntdq -mavx512bw -c -xc - -o /dev/null
-> succeeds

Greetings,

Andres Freund
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote: > On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: >> On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: >> > Why are we actually checking for xsave? We're not using xsave itself and I >> > couldn't find a comment in 792752af4eb5 explaining what we're using it as a >> > proxy for? Is that just to know if _xgetbv() exists? Is it actually >> > possible >> > that xsave isn't available when avx512 is? >> >> Yes, it's to verify we have XGETBV, which IIUC requires support from both >> the processor and the OS (see 598e011 and upthread discussion). AFAIK the >> way we are detecting AVX-512 support is quite literally by-the-book unless >> I've gotten something wrong. > > I'm basically wondering whether we need to check for compiler (not OS support) > support for xsave if we also check for -mavx512vpopcntdq -mavx512bw > support. Afaict the latter implies support for xsave. The main purpose of the XSAVE compiler check is to determine whether we need to add -mxsave in order to use _xgetbv() [0]. If that wasn't a factor, we could probably skip it. Earlier versions of the patch used inline assembly in the non-MSVC path to call XGETBV, which I was trying to avoid. [0] https://postgr.es/m/20240330032209.GA2018686%40nathanxps13 -- nathan
Re: Popcount optimization using AVX512
Hi, On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > > Ah, I somehow thought we'd avoid the runtime check in case we determine at > > compile time we don't need any extra flags to enable the AVX512 stuff > > (similar > > to how we deal with crc32). But it looks like that's not the case - which > > seems pretty odd to me: > > > > This turns something that can be a single instruction into an indirect > > function call, even if we could know that it's guaranteed to be available > > for > > the compilation target, due to -march= > > > > It's one thing for the avx512 path to have that overhead, but it's > > particularly absurd for pg_popcount32/pg_popcount64, where > > > > a) The function call overhead is a larger proportion of the cost. > > b) the instruction is almost universally available, including in the > >architecture baseline x86-64-v2, which several distros are using as the > >x86-64 baseline. > > Yeah, pg_popcount32/64 have been doing this since v12 (02a6a54). Until v17 > (cc4826d), pg_popcount() repeatedly calls these function pointers, too. I > think it'd be awesome if we could start requiring some of these "almost > universally available" instructions, but AFAICT that brings its own > complexity [0]. I'll respond there... > > Why are we actually checking for xsave? We're not using xsave itself and I > > couldn't find a comment in 792752af4eb5 explaining what we're using it as a > > proxy for? Is that just to know if _xgetbv() exists? Is it actually > > possible > > that xsave isn't available when avx512 is? > > Yes, it's to verify we have XGETBV, which IIUC requires support from both > the processor and the OS (see 598e011 and upthread discussion). AFAIK the > way we are detecting AVX-512 support is quite literally by-the-book unless > I've gotten something wrong. 
I'm basically wondering whether we need to check for compiler (not OS support) support for xsave if we also check for -mavx512vpopcntdq -mavx512bw support. Afaict the latter implies support for xsave. andres@alap6:~$ echo|gcc -c - -march=x86-64 -xc -dM -E - -o -|grep '__XSAVE__' andres@alap6:~$ echo|gcc -c - -march=x86-64 -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep '__XSAVE__' #define __XSAVE__ 1 #define __XSAVE__ 1 Greetings, Andres Freund
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > Ah, I somehow thought we'd avoid the runtime check in case we determine at > compile time we don't need any extra flags to enable the AVX512 stuff (similar > to how we deal with crc32). But it looks like that's not the case - which > seems pretty odd to me: > > This turns something that can be a single instruction into an indirect > function call, even if we could know that it's guaranteed to be available for > the compilation target, due to -march= > > It's one thing for the avx512 path to have that overhead, but it's > particularly absurd for pg_popcount32/pg_popcount64, where > > a) The function call overhead is a larger proportion of the cost. > b) the instruction is almost universally available, including in the >architecture baseline x86-64-v2, which several distros are using as the >x86-64 baseline. Yeah, pg_popcount32/64 have been doing this since v12 (02a6a54). Until v17 (cc4826d), pg_popcount() repeatedly calls these function pointers, too. I think it'd be awesome if we could start requiring some of these "almost universally available" instructions, but AFAICT that brings its own complexity [0]. > Why are we actually checking for xsave? We're not using xsave itself and I > couldn't find a comment in 792752af4eb5 explaining what we're using it as a > proxy for? Is that just to know if _xgetbv() exists? Is it actually possible > that xsave isn't available when avx512 is? Yes, it's to verify we have XGETBV, which IIUC requires support from both the processor and the OS (see 598e011 and upthread discussion). AFAIK the way we are detecting AVX-512 support is quite literally by-the-book unless I've gotten something wrong. [0] https://postgr.es/m/ZmpG2ZzT30Q75BZO%40nathan -- nathan
Re: Popcount optimization using AVX512
On Wed, Jul 31, 2024 at 12:50 PM Andres Freund wrote: > It's one thing for the avx512 path to have that overhead, but it's > particularly absurd for pg_popcount32/pg_popcount64, where > > a) The function call overhead is a larger proportion of the cost. > b) the instruction is almost universally available, including in the >architecture baseline x86-64-v2, which several distros are using as the >x86-64 baseline. FWIW, another recent thread about that: https://www.postgresql.org/message-id/flat/CA%2BhUKGKS64zJezV9y9mPcB-J0i%2BfLGiv3FAdwSH_3SCaVdrjyQ%40mail.gmail.com
Re: Popcount optimization using AVX512
Hi, On 2024-07-30 16:32:07 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > > Now, a reasonable counter-argument would be that only some of these macros > > are > > defined for msvc ([1]). However, as it turns out, the test is broken > > today, as msvc doesn't error out when using an intrinsic that's not > > "available" by the target architecture, it seems to assume that the caller > > did > > a cpuid check ahead of time. > > > > > > Check out [2], it shows the various predefined macros for gcc, clang and > > msvc. > > > > > > ISTM that the msvc checks for xsave/avx512 being broken should be an open > > item? > > I'm not following this one. At the moment, we always do a runtime check > for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and > set the function pointers appropriately, right? We could, of course, still > fix it, though. Ah, I somehow thought we'd avoid the runtime check in case we determine at compile time we don't need any extra flags to enable the AVX512 stuff (similar to how we deal with crc32). But it looks like that's not the case - which seems pretty odd to me: This turns something that can be a single instruction into an indirect function call, even if we could know that it's guaranteed to be available for the compilation target, due to -march= It's one thing for the avx512 path to have that overhead, but it's particularly absurd for pg_popcount32/pg_popcount64, where a) The function call overhead is a larger proportion of the cost. b) the instruction is almost universally available, including in the architecture baseline x86-64-v2, which several distros are using as the x86-64 baseline. Why are we actually checking for xsave? We're not using xsave itself and I couldn't find a comment in 792752af4eb5 explaining what we're using it as a proxy for? Is that just to know if _xgetbv() exists? Is it actually possible that xsave isn't available when avx512 is? Greetings, Andres Freund
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 04:32:07PM -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: >> Afaict we could just check for predefined preprocessor macros: >> >> echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o >> -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' >> #define __AVX512BW__ 1 >> #define __AVX512VPOPCNTDQ__ 1 >> #define __XSAVE__ 1 >> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata >> 13292maxresident)k >> >> echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E >> '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' >> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata >> 10972maxresident)k > > Seems promising. I can't think of a reason that wouldn't work. > >> Now, a reasonable counter-argument would be that only some of these macros >> are >> defined for msvc ([1]). However, as it turns out, the test is broken >> today, as msvc doesn't error out when using an intrinsic that's not >> "available" by the target architecture, it seems to assume that the caller >> did >> a cpuid check ahead of time. Hm. Upon further inspection, I see that MSVC appears to be missing __XSAVE__ and __AVX512VPOPCNTDQ__, which is unfortunate. Still, I think the worst case scenario is that the CPUID check fails and we don't use AVX-512 instructions. AFAICT we aren't adding new function pointers in any builds that don't already have them, just compiling some extra unused code. -- nathan
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > I've noticed that the configure probes for this are quite slow - pretty much > the slowest step in a meson setup (and autoconf is similar). While looking > into this, I also noticed that afaict the tests don't do the right thing for > msvc. > > ... > [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES > [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES > [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES > [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO > [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES > [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : > links: NO > [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : > links: YES > [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : > links: NO > [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: > YES > [9.367] Checking if "x86_64: popcntq instruction" compiles: YES > [9.382] Has header "atomic.h" : NO > ... > > (the times here are a bit exaggerated, enabling them in meson also turns on > python profiling, which makes everything a bit slower) > > > Looks like this is largely the fault of including immintrin.h: > > echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time > gcc -mxsave -xc - -o /dev/null > 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata > 94184maxresident)k > > echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null > 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata > 86004maxresident)k Interesting. Thanks for bringing this to my attention. > Do we really need to link the generated programs? If we instead were able to > just rely on the preprocessor, it'd be vastly faster. 
> > The __sync* and __atomic* checks actually need to link, as the compiler ends > up generating calls to unimplemented functions if the compilation target > doesn't support some operation natively - but I don't think that's true for > the xsave/avx512 stuff > > Afaict we could just check for predefined preprocessor macros: > > echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep > -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' > #define __AVX512BW__ 1 > #define __AVX512VPOPCNTDQ__ 1 > #define __XSAVE__ 1 > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata > 13292maxresident)k > > echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E > '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata > 10972maxresident)k Seems promising. I can't think of a reason that wouldn't work. > Now, a reasonable counter-argument would be that only some of these macros are > defined for msvc ([1]). However, as it turns out, the test is broken > today, as msvc doesn't error out when using an intrinsic that's not > "available" by the target architecture, it seems to assume that the caller did > a cpuid check ahead of time. > > > Check out [2], it shows the various predefined macros for gcc, clang and msvc. > > > ISTM that the msvc checks for xsave/avx512 being broken should be an open > item? I'm not following this one. At the moment, we always do a runtime check for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and set the function pointers appropriately, right? We could, of course, still fix it, though. -- nathan
Re: Popcount optimization using AVX512
Hi, On 2024-04-23 11:02:07 -0500, Nathan Bossart wrote: > On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > > Makes sense, thanks. I'm planning to commit this fix sometime early next > > week. > > Committed. I've noticed that the configure probes for this are quite slow - pretty much the slowest step in a meson setup (and autoconf is similar). While looking into this, I also noticed that afaict the tests don't do the right thing for msvc. ... [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES [9.367] Checking if "x86_64: popcntq instruction" compiles: YES [9.382] Has header "atomic.h" : NO ... (the times here are a bit exaggerated, enabling them in meson also turns on python profiling, which makes everything a bit slower) Looks like this is largely the fault of including immintrin.h: echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k Do we really need to link the generated programs? If we instead were able to just rely on the preprocessor, it'd be vastly faster. 
The __sync* and __atomic* checks actually need to link, as the compiler ends up generating calls to unimplemented functions if the compilation target doesn't support some operation natively - but I don't think that's true for the xsave/avx512 stuff Afaict we could just check for predefined preprocessor macros: echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' #define __AVX512BW__ 1 #define __AVX512VPOPCNTDQ__ 1 #define __XSAVE__ 1 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k Now, a reasonable counter-argument would be that only some of these macros are defined for msvc ([1]). However, as it turns out, the test is broken today, as msvc doesn't error out when using an intrinsic that's not "available" by the target architecture, it seems to assume that the caller did a cpuid check ahead of time. Check out [2], it shows the various predefined macros for gcc, clang and msvc. ISTM that the msvc checks for xsave/avx512 being broken should be an open item? Greetings, Andres [1] https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170 [2] https://godbolt.org/z/c8Kj8r3PK
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > Makes sense, thanks. I'm planning to commit this fix sometime early next > week. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 10:11:08PM +, Devulapalli, Raghuveer wrote: >> On that note, is it necessary to also check for avx512f? At the moment, >> we are assuming that's supported if the other AVX-512 instructions are >> available. > > No, it's not needed. There are no CPU's with avx512bw/avx512popcnt > without avx512f. Unfortunately though, avx512popcnt does not mean > avx512bw (I think the deprecated Xeon Phi processors falls in this > category) which is why we need both. Makes sense, thanks. I'm planning to commit this fix sometime early next week. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> On that note, is it necessary to also check for avx512f? At the moment, we > are assuming that's supported if the other AVX-512 instructions are available. No, it's not needed. There are no CPUs with avx512bw/avx512popcnt without avx512f. Unfortunately though, avx512popcnt does not mean avx512bw (I think the deprecated Xeon Phi processors fall into this category), which is why we need both.
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 09:29:55PM +, Devulapalli, Raghuveer wrote: > (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise > zmm_regs_available() will return false.. Yes, that's a mistake. I fixed that in v3. > (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the > same cpuid leaf. You could combine them into one to avoid running cpuid > twice. My apologies, I should have mentioned this before.. Good call. The byte-and-word instructions were a late addition to the patch, so I missed this originally. On that note, is it necessary to also check for avx512f? At the moment, we are assuming that's supported if the other AVX-512 instructions are available. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From e04c348eb389c6aa1597ac35d57b5e7ae7075381 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 18 Apr 2024 15:57:56 -0500 Subject: [PATCH v3 1/1] osxsave --- src/port/pg_popcount_avx512_choose.c | 80 1 file changed, 47 insertions(+), 33 deletions(-) diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c index ae3fa3d306..b37107803a 100644 --- a/src/port/pg_popcount_avx512_choose.c +++ b/src/port/pg_popcount_avx512_choose.c @@ -34,39 +34,13 @@ #ifdef TRY_POPCNT_FAST /* - * Returns true if the CPU supports the instructions required for the AVX-512 - * pg_popcount() implementation. + * Does CPUID say there's support for XSAVE instructions? */ -bool -pg_popcount_avx512_available(void) +static inline bool +xsave_available(void) { unsigned int exx[4] = {0, 0, 0, 0}; - /* Does CPUID say there's support for AVX-512 popcount instructions? 
*/ -#if defined(HAVE__GET_CPUID_COUNT) - __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUIDEX) - __cpuidex(exx, 7, 0); -#else -#error cpuid instruction not available -#endif - if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */ - return false; - - /* Does CPUID say there's support for AVX-512 byte and word instructions? */ - memset(exx, 0, sizeof(exx)); -#if defined(HAVE__GET_CPUID_COUNT) - __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUIDEX) - __cpuidex(exx, 7, 0); -#else -#error cpuid instruction not available -#endif - if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */ - return false; - - /* Does CPUID say there's support for XSAVE instructions? */ - memset(exx, 0, sizeof(exx)); #if defined(HAVE__GET_CPUID) __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID) @@ -74,15 +48,55 @@ pg_popcount_avx512_available(void) #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 26)) == 0) /* xsave */ - return false; + return (exx[2] & (1 << 27)) != 0; /* osxsave */ +} - /* Does XGETBV say the ZMM registers are enabled? */ +/* + * Does XGETBV say the ZMM registers are enabled? + * + * NB: Caller is responsible for verifying that xsave_available() returns true + * before calling this. + */ +static inline bool +zmm_regs_available(void) +{ #ifdef HAVE_XSAVE_INTRINSICS - return (_xgetbv(0) & 0xe0) != 0; + return (_xgetbv(0) & 0xe6) == 0xe6; #else return false; #endif } +/* + * Does CPUID say there's support for AVX-512 popcount and byte-and-word + * instructions? 
+ */ +static inline bool +avx512_popcnt_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); +#else +#error cpuid instruction not available +#endif + return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */ + (exx[1] & (1 << 30)) != 0; /* avx512-bw */ +} + +/* + * Returns true if the CPU supports the instructions required for the AVX-512 + * pg_popcount() implementation. + */ +bool +pg_popcount_avx512_available(void) +{ + return xsave_available() && + zmm_regs_available() && + avx512_popcnt_available(); +} + #endif /* TRY_POPCNT_FAST */ -- 2.25.1
RE: Popcount optimization using AVX512
> Thanks for the feedback. I've attached an updated patch. (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise zmm_regs_available() will return false. (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same cpuid leaf. You could combine them into one to avoid running cpuid twice. My apologies, I should have mentioned this before.
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 08:24:03PM +, Devulapalli, Raghuveer wrote: >> This seems to contradict the note about doing step 3 at any point, and >> given step 1 is the OSXSAVE check, I'm not following what this means, >> anyway. > > It is recommended that you run the xgetbv code before you check for cpu > features avx512-popcnt and avx512-bw. The way it is written now is the > opposite order. I would also recommend splitting the cpuid feature check > for avx512popcnt/avx512bw and xgetbv section into separate functions to > make them modular. Something like: > > static inline > int check_os_avx512_support(void) > { > // (1) run cpuid leaf 1 to check for xgetbv instruction support: > unsigned int exx[4] = {0, 0, 0, 0}; > __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); > if ((exx[2] & (1 << 27)) == 0) /* xsave */ > return false; > > /* Does XGETBV say the ZMM/YMM/XMM registers are enabled? */ > return (_xgetbv(0) & 0xe0) == 0xe0; > } > >> I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6 >> instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower >> half of some of the ZMM registers is stored in the SSE and AVX state >> [0]. I don't know how likely it is that 0xe0 would succeed but 0xe6 >> wouldn't, but we might as well make it correct. > > This is correct. It needs to check all the 3 bits (XMM/YMM and ZMM). The > way it is written is now is in-correct. Thanks for the feedback. I've attached an updated patch. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From d20b19804a17d9f6eab1d40de7e9fb10488ac6b0 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 18 Apr 2024 15:57:56 -0500 Subject: [PATCH v2 1/1] osxsave --- src/port/pg_popcount_avx512_choose.c | 89 +++- 1 file changed, 60 insertions(+), 29 deletions(-) diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c index ae3fa3d306..009f94909a 100644 --- a/src/port/pg_popcount_avx512_choose.c +++ b/src/port/pg_popcount_avx512_choose.c @@ -34,27 +34,47 @@ #ifdef TRY_POPCNT_FAST /* - * Returns true if the CPU supports the instructions required for the AVX-512 - * pg_popcount() implementation. + * Does CPUID say there's support for XSAVE instructions? */ -bool -pg_popcount_avx512_available(void) +static inline bool +xsave_available(void) { unsigned int exx[4] = {0, 0, 0, 0}; - /* Does CPUID say there's support for AVX-512 popcount instructions? */ -#if defined(HAVE__GET_CPUID_COUNT) - __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUIDEX) - __cpuidex(exx, 7, 0); +#if defined(HAVE__GET_CPUID) + __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUID) + __cpuid(exx, 1); #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */ - return false; + return (exx[2] & (1 << 27)) != 0; /* osxsave */ +} + +/* + * Does XGETBV say the ZMM registers are enabled? + * + * NB: Caller is responsible for verifying that xsave_available() returns true + * before calling this. + */ +static inline bool +zmm_regs_available(void) +{ +#ifdef HAVE_XSAVE_INTRINSICS + return (_xgetbv(0) & 0xe6) != 0xe6; +#else + return false; +#endif +} + +/* + * Does CPUID say there's support for AVX-512 popcount instructions? + */ +static inline bool +avx512_popcnt_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; - /* Does CPUID say there's support for AVX-512 byte and word instructions? 
*/ - memset(exx, 0, sizeof(exx)); #if defined(HAVE__GET_CPUID_COUNT) __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUIDEX) @@ -62,27 +82,38 @@ pg_popcount_avx512_available(void) #else #error cpuid instruction not available #endif - if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */ - return false; + return (exx[2] & (1 << 14)) != 0; /* avx512-vpopcntdq */ +} - /* Does CPUID say there's support for XSAVE instructions? */ - memset(exx, 0, sizeof(exx)); -#if defined(HAVE__GET_CPUID) - __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUID) - __cpuid(exx, 1); +/* + * Does CPUID say there's support for AVX-512 byte and word instructions? + */ +static inline bool +avx512_bw_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 26)) == 0) /* xsave */ - return false; + return (exx[1] & (1 << 30)) != 0; /* avx512-bw */ +} - /* Does XGETBV say the ZMM registers are enabled? */ -#ifdef HAVE_XSAVE_INTRINSICS - return (_xgetbv(0) & 0xe0) != 0; -#else - return false; -#endif +/* + * Returns true if the CPU supports the instructions required for the AVX-512 + * pg_popcount() implementation. + */ +bool +pg_popcount_avx5
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 06:12:22PM +, Shankaran, Akash wrote: > Good find. I confirmed after speaking with an intel expert, and from the > intel AVX-512 manual [0] section 14.3, which recommends to check bit27. From > the manual: > > "Prior to using Intel AVX, the application must identify that the operating > system supports the XGETBV instruction, > the YMM register state, in addition to processor's support for YMM state > management using XSAVE/XRSTOR and > AVX instructions. The following simplified sequence accomplishes both and is > strongly recommended. > 1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application > use1). > 2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state > are enabled by OS). > 3) detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported). > (Step 3 can be done in any order relative to 1 and 2.)" Thanks for confirming. IIUC my patch should be sufficient, then. > It also seems that step 1 and step 2 need to be done prior to the CPUID > OSXSAVE check in the popcount code. This seems to contradict the note about doing step 3 at any point, and given step 1 is the OSXSAVE check, I'm not following what this means, anyway. I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6 instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower half of some of the ZMM registers is stored in the SSE and AVX state [0]. I don't know how likely it is that 0xe0 would succeed but 0xe6 wouldn't, but we might as well make it correct. [0] https://en.wikipedia.org/wiki/Control_register#cite_ref-23 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> It was brought to my attention [0] that we probably should be checking for > the OSXSAVE bit instead of the XSAVE bit when determining whether there's > support for the XGETBV instruction. IIUC that should indicate that both the > OS and the processor have XGETBV support (not just the processor). > I've attached a one-line patch to fix this. > [0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463 Good find. I confirmed after speaking with an intel expert, and from the intel AVX-512 manual [0] section 14.3, which recommends to check bit27. From the manual: "Prior to using Intel AVX, the application must identify that the operating system supports the XGETBV instruction, the YMM register state, in addition to processor's support for YMM state management using XSAVE/XRSTOR and AVX instructions. The following simplified sequence accomplishes both and is strongly recommended. 1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use1). 2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state are enabled by OS). 3) detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported). (Step 3 can be done in any order relative to 1 and 2.)" It also seems that step 1 and step 2 need to be done prior to the CPUID OSXSAVE check in the popcount code. [0]: https://cdrdv2.intel.com/v1/dl/getContent/671200 - Akash Shankaran
Re: Popcount optimization using AVX512
It was brought to my attention [0] that we probably should be checking for the OSXSAVE bit instead of the XSAVE bit when determining whether there's support for the XGETBV instruction. IIUC that should indicate that both the OS and the processor have XGETBV support (not just the processor). I've attached a one-line patch to fix this. [0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c index ae3fa3d306..cc3e89e096 100644 --- a/src/port/pg_popcount_avx512_choose.c +++ b/src/port/pg_popcount_avx512_choose.c @@ -74,7 +74,7 @@ pg_popcount_avx512_available(void) #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 26)) == 0) /* xsave */ + if ((exx[2] & (1 << 27)) == 0) /* osxsave */ return false; /* Does XGETBV say the ZMM registers are enabled? */
Re: Popcount optimization using AVX512
Nathan Bossart writes: > On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: >> The Intel documentation for _mm256_undefined_si256() [0] >> indicates that it is intended to return "undefined elements," so it seems >> like the use of an uninitialized variable might be intentional. > See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122. Ah, interesting. That hasn't propagated to stable distros yet, evidently (and even when it does, I wonder how soon Coverity will understand it). Anyway, that does establish that it's gcc's problem not ours. Thanks for digging! regards, tom lane
Re: Popcount optimization using AVX512
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: > The Intel documentation for _mm256_undefined_si256() [0] > indicates that it is intended to return "undefined elements," so it seems > like the use of an uninitialized variable might be intentional. See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote: > Today's Coverity run produced this warning, which seemingly was > triggered by one of these commits, but I can't make much sense > of it: > > *** CID 1596255: Uninitialized variables (UNINIT) > /usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in > _mm256_undefined_si256() > 1214 extern __inline __m256i __attribute__((__gnu_inline__, > __always_inline__, __artificial__)) > 1215 _mm256_undefined_si256 (void) > 1216 { > 1217 __m256i __Y = __Y; CID 1596255: Uninitialized variables (UNINIT) Using uninitialized value "__Y". > 1218 return __Y; > 1219 } > > I see the same code in my local copy of avxintrin.h, > and I quite agree that it looks like either an undefined > value or something that properly ought to be an error. > If we are calling this, why (and from where)? Nothing in these commits uses this, or even uses the 256-bit registers. avxintrin.h is included by immintrin.h, which is probably why this is showing up. I believe you're supposed to use immintrin.h for the intrinsics used in these commits, so I don't immediately see a great way to avoid this. The Intel documentation for _mm256_undefined_si256() [0] indicates that it is intended to return "undefined elements," so it seems like the use of an uninitialized variable might be intentional. > Anyway, we can certainly just dismiss this warning if it > doesn't correspond to any real problem in our code. > But I thought I'd raise the question. That's probably the right thing to do, unless there's some action we can take to suppress this warning. [0] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_undefined_si256&ig_expand=6943 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Nathan Bossart writes: > Here is what I have staged for commit, which I intend to do shortly. Today's Coverity run produced this warning, which seemingly was triggered by one of these commits, but I can't make much sense of it: *** CID 1596255: Uninitialized variables (UNINIT) /usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in _mm256_undefined_si256() 1214 extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) 1215 _mm256_undefined_si256 (void) 1216 { 1217 __m256i __Y = __Y; >>> CID 1596255: Uninitialized variables (UNINIT) >>> Using uninitialized value "__Y". 1218 return __Y; 1219 } I see the same code in my local copy of avxintrin.h, and I quite agree that it looks like either an undefined value or something that properly ought to be an error. If we are calling this, why (and from where)? Anyway, we can certainly just dismiss this warning if it doesn't correspond to any real problem in our code. But I thought I'd raise the question. regards, tom lane
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote: > Here is what I have staged for commit, which I intend to do shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote: > On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: >> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: >> > Won't Valgrind complain about this? >> > >> > +pg_popcount_avx512(const char *buf, int bytes) >> > >> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); >> > >> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); >> >> I haven't been able to generate any complaints, at least with some simple >> tests. But I see your point. If this did cause such complaints, ISTM we'd >> just want to add it to the suppression file. Otherwise, I think we'd have >> to go back to the non-maskz approach (which I really wanted to avoid >> because of the weird function overhead juggling) or find another way to do >> a partial load into an __m512i. > > [1] seems to think it's ok. If this is true then the following > shouldn't segfault: > > The following seems to run without any issue and if I change the mask > to 1 it crashes, as you'd expect. Cool. Here is what I have staged for commit, which I intend to do shortly. At some point, I'd like to revisit converting TRY_POPCNT_FAST to a configure-time check and maybe even moving the "fast" and "slow" implementations to their own files, but since that's mostly for code neatness and we are rapidly approaching the v17 deadline, I'm content to leave that for v18. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9eea49555cbd14c7871085e159c9b0b78e92 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v28 1/2] Optimize pg_popcount() with AVX-512 instructions. Presently, pg_popcount() processes data in 32-bit or 64-bit chunks when possible. Newer hardware that supports AVX-512 instructions can perform these tasks in 512-bit chunks, which can provide a nice speedup, especially for larger buffers. 
This commit introduces the infrastructure required to detect both compiler and CPU support for the required AVX-512 intrinsic functions, and it makes use of that infrastructure in a new pg_popcount() implementation. If CPU support for this optimized implementation is detected at runtime, a function pointer is updated so that it is used for subsequent calls to pg_popcount(). Most of the existing in-tree calls to pg_popcount() should benefit nicely from these instructions, and calls for smaller buffers should not regress when compared to v16. The new infrastructure introduced by this commit can also be used to optimized visibilitymap_count(), but that work is left for a follow-up commit. Co-authored-by: Paul Amonson, Ants Aasma Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 82 + src/port/pg_popcount_avx512_choose.c | 87 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 696 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..cfff48c1bc 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. 
+# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# --
Re: Popcount optimization using AVX512
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote:
>
> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
> > Won't Valgrind complain about this?
> >
> > +pg_popcount_avx512(const char *buf, int bytes)
> >
> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
> >
> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
>
> I haven't been able to generate any complaints, at least with some simple
> tests. But I see your point. If this did cause such complaints, ISTM we'd
> just want to add it to the suppression file. Otherwise, I think we'd have
> to go back to the non-maskz approach (which I really wanted to avoid
> because of the weird function overhead juggling) or find another way to do
> a partial load into an __m512i.

[1] seems to think it's ok. If this is true then the following shouldn't
segfault. It runs without any issue, and if I change the mask to 1 it
crashes, as you'd expect.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
	__m512i val;

	val = _mm512_maskz_loadu_epi8((__mmask64) 0, NULL);
	printf("%llu\n", _mm512_reduce_add_epi64(val));
	return 0;
}

gcc avx512.c -o avx512 -O0 -mavx512f -march=native

David

[1] https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: > Won't Valgrind complain about this? > > +pg_popcount_avx512(const char *buf, int bytes) > > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); > > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); I haven't been able to generate any complaints, at least with some simple tests. But I see your point. If this did cause such complaints, ISTM we'd just want to add it to the suppression file. Otherwise, I think we'd have to go back to the non-maskz approach (which I really wanted to avoid because of the weird function overhead juggling) or find another way to do a partial load into an __m512i. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Sat, 6 Apr 2024 at 04:38, Nathan Bossart wrote: > This seems to provide a small performance boost, so I've incorporated it > into v27. Won't Valgrind complain about this? +pg_popcount_avx512(const char *buf, int bytes) + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); David
Re: Popcount optimization using AVX512
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote: > On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: >> The main issue I saw was that clang was able to peel off the first >> iteration of the loop and then eliminate the mask assignment and >> replace masked load with a memory operand for vpopcnt. I was not able >> to convince gcc to do that regardless of optimization options. >> Generated code for the inner loop: >> >> clang: >> : >> 50: add rdx, 64 >> 54: cmp rdx, rdi >> 57: jae >> 59: vpopcntq zmm1, zmmword ptr [rdx] >> 5f: vpaddq zmm0, zmm1, zmm0 >> 65: jmp >> >> gcc: >> : >> 38: kmovq k1, rdx >> 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] >> 43: add rax, 64 >> 47: mov rdx, -1 >> 4e: vpopcntq zmm0, zmm0 >> 54: vpaddq zmm0, zmm0, zmm1 >> 5a: vmovdqa64 zmm1, zmm0 >> 60: cmp rax, rsi >> 63: jb >> >> I'm not sure how much that matters in practice. Attached is a patch to >> do this manually giving essentially the same result in gcc. As most >> distro packages are built using gcc I think it would make sense to >> have the extra code if it gives a noticeable benefit for large cases. > > Yeah, I did see this, but I also wasn't sure if it was worth further > complicating the code. I can test with and without your fix and see if it > makes any difference in the benchmarks. This seems to provide a small performance boost, so I've incorporated it into v27. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9fc4b7556b72d51fce676db84b446099767efff3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v27 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 82 + src/port/pg_popcount_avx512_choose.c | 81 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 690 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..892b3c9580 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq +# -mavx512bw). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed va
Re: Popcount optimization using AVX512
On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: > The main issue I saw was that clang was able to peel off the first > iteration of the loop and then eliminate the mask assignment and > replace masked load with a memory operand for vpopcnt. I was not able > to convince gcc to do that regardless of optimization options. > Generated code for the inner loop: > > clang: > : > 50: add rdx, 64 > 54: cmp rdx, rdi > 57: jae > 59: vpopcntq zmm1, zmmword ptr [rdx] > 5f: vpaddq zmm0, zmm1, zmm0 > 65: jmp > > gcc: > : > 38: kmovq k1, rdx > 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] > 43: add rax, 64 > 47: mov rdx, -1 > 4e: vpopcntq zmm0, zmm0 > 54: vpaddq zmm0, zmm0, zmm1 > 5a: vmovdqa64 zmm1, zmm0 > 60: cmp rax, rsi > 63: jb > > I'm not sure how much that matters in practice. Attached is a patch to > do this manually giving essentially the same result in gcc. As most > distro packages are built using gcc I think it would make sense to > have the extra code if it gives a noticeable benefit for large cases. Yeah, I did see this, but I also wasn't sure if it was worth further complicating the code. I can test with and without your fix and see if it makes any difference in the benchmarks. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart wrote: > Here is an updated patch set. IMHO this is in decent shape and is > approaching committable. I checked the code generation on various gcc and clang versions. It looks mostly fine starting from versions where avx512 is supported, gcc-7.1 and clang-5. The main issue I saw was that clang was able to peel off the first iteration of the loop and then eliminate the mask assignment and replace masked load with a memory operand for vpopcnt. I was not able to convince gcc to do that regardless of optimization options. Generated code for the inner loop: clang: : 50: add rdx, 64 54: cmp rdx, rdi 57: jae 59: vpopcntq zmm1, zmmword ptr [rdx] 5f: vpaddq zmm0, zmm1, zmm0 65: jmp gcc: : 38: kmovq k1, rdx 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] 43: add rax, 64 47: mov rdx, -1 4e: vpopcntq zmm0, zmm0 54: vpaddq zmm0, zmm0, zmm1 5a: vmovdqa64 zmm1, zmm0 60: cmp rax, rsi 63: jb I'm not sure how much that matters in practice. Attached is a patch to do this manually giving essentially the same result in gcc. As most distro packages are built using gcc I think it would make sense to have the extra code if it gives a noticeable benefit for large cases. The visibility map patch has the same issue, otherwise looks good. Regards, Ants Aasma diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index dacc7553d29..f6e718b86e9 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes) * Iterate through all but the final iteration. Starting from second * iteration, the start index mask is ignored. 
*/ - for (; buf < final; buf += sizeof(__m512i)) + if (buf < final) { val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); + buf += sizeof(__m512i); mask = ~UINT64CONST(0); + + for (; buf < final; buf += sizeof(__m512i)) + { + val = _mm512_load_si512((const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + } } /* Final iteration needs to ignore bytes that are not within the length */
Re: Popcount optimization using AVX512
Here is an updated patch set. IMHO this is in decent shape and is approaching committable. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From df59d3e78604e4530f5096bafc08ac94e13d82d2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v26 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 74 src/port/pg_popcount_avx512_choose.c | 81 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 682 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..892b3c9580 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq +# -mavx512bw). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 
36feeafbb2..72d20d3945 100755 --- a/configure +++ b/configure @@ -647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as_echo "$pgac_cv__get_cpuid_count" >&6; } +if test x"$pgac_cv__get_cpuid_count" = x"yes"; then +
Re: Popcount optimization using AVX512
On Thu, Apr 04, 2024 at 04:02:53PM +0300, Ants Aasma wrote: > Speaking of which, what does bumping up the inlined version threshold > to 16 do with and without AVX-512 available? Linearly extrapolating > the 2 and 4 byte numbers it might just come ahead in both cases, > making the choice easy. IIRC the inlined version starts losing pretty quickly after 8 bytes. As I noted in my previous message, I think we have enough data to switch to your approach already, so I think it's a moot point. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
>
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?

Good point. When compared with REL_16_STABLE, Ants's idea still wins:

  bytes       v25   v25+ants   REL_16_STABLE
      2  1108.205   1033.132        2039.342
      4  1311.227   1289.373        3207.217
      8  1927.954   2360.113        3200.238
     16  2281.091   2365.408        4457.769
     32  3856.992   2390.688        6206.689
     64   3648.72   3242.498        9619.403
    128  4108.549   3607.148       17912.081
    256  4910.076   4496.852       33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call. For 8 bytes, REL_16_STABLE is doing
a function call as well as a call to a function pointer. At 16 bytes, it's
doing a function call and two calls to a function pointer. With Ants's
approach, both 8 and 16 bytes require a single call to a function pointer,
and of course we are using the AVX-512 implementation for both.

I think this is sufficient to justify switching approaches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
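For context, the "call to a function pointer" counted above refers to the runtime-dispatch pattern pg_popcount() uses, which can be sketched roughly as follows. The names and the portable-only fallback are illustrative assumptions, not the actual pg_bitutils code, and __builtin_popcount assumes a GCC/Clang-compatible compiler:

```c
#include <stddef.h>

static int	popcount_portable(const char *buf, size_t n);
static int	popcount_choose(const char *buf, size_t n);

/* Starts out pointing at the chooser; the first call redirects it. */
static int	(*popcount_impl) (const char *, size_t) = popcount_choose;

static int
popcount_portable(const char *buf, size_t n)
{
	int			cnt = 0;

	for (size_t i = 0; i < n; i++)
		cnt += __builtin_popcount((unsigned char) buf[i]);
	return cnt;
}

static int
popcount_choose(const char *buf, size_t n)
{
	/*
	 * The real code probes CPUID/XGETBV here and may select an AVX-512
	 * implementation; this sketch always falls back to the portable one.
	 * Subsequent calls through popcount_impl skip the probe entirely.
	 */
	popcount_impl = popcount_portable;
	return popcount_impl(buf, n);
}
```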
Re: Popcount optimization using AVX512
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. The approach I posted does not rely on masking performing page fault suppression. All loads are 64 byte aligned and always contain at least one byte of the buffer and therefore are guaranteed to be within a valid page. I personally don't mind it being slower for the very small cases, because when performance on those sizes really matters it makes much more sense to shoot for an inlined version instead. Speaking of which, what does bumping up the inlined version threshold to 16 do with and without AVX-512 available? Linearly extrapolating the 2 and 4 byte numbers it might just come ahead in both cases, making the choice easy. Regards, Ants Aasma
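The aligned-access idea described above (align the start pointer down to the load width, then mask off bytes outside the buffer so every load is aligned and stays inside valid pages) can be illustrated with a 64-bit-word analogue. This is a sketch under assumptions, not the AVX-512 patch itself: it uses the GCC/Clang __builtin_popcountll, and because plain C has no masked loads to suppress out-of-range bytes, the caller must ensure the underlying allocation spans whole words:

```c
#include <stdint.h>
#include <string.h>

/*
 * Count set bits using only aligned 8-byte loads.  Bytes before "buf" in
 * the first word and past "buf + bytes" in the last word are masked to
 * zero.  The real AVX-512 code relies on masked vector loads instead of
 * requiring a padded allocation.
 */
static int
popcount_aligned(const char *buf, size_t bytes)
{
	const size_t word = sizeof(uint64_t);
	const char *p = (const char *) ((uintptr_t) buf & ~(uintptr_t) (word - 1));
	const char *end = buf + bytes;
	int			cnt = 0;

	for (; p < end; p += word)
	{
		uint64_t	val;

		memcpy(&val, p, word);	/* stand-in for an aligned vector load */
		if (p < buf)			/* first word: drop leading bytes */
			val &= ~UINT64_C(0) << (8 * (size_t) (buf - p));
		if (end < p + word)		/* last word: drop trailing bytes */
			val &= ~UINT64_C(0) >> (8 * (size_t) (p + word - end));
		cnt += __builtin_popcountll(val);
	}
	return cnt;
}
```

Anything that fits in one aligned word takes exactly one masked load here, mirroring the one-cache-line constant-speed claim for the vector version.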
Re: Popcount optimization using AVX512
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. > > Thoughts? If we're worried about regressions with some narrow range of byte values, wouldn't it make more sense to compare that to cc4826dd5~1 at the latest rather than to some version that's already probably faster than PG16? David
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising. IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions. I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
>
> After sleeping on the problem, I think we can avoid this question
> altogether while making the code faster by using aligned accesses.
> Loads that straddle cache line boundaries run internally as 2 load
> operations. Gut feel says that there are enough out-of-order resources
> available to make it not matter in most cases. But even so, not doing
> the extra work is surely better. Attached is another approach that
> does aligned accesses, and thereby avoids going outside bounds.
>
> Would be interesting to see how well that fares in the small use case.
> Anything that fits into one aligned cache line should be constant
> speed, and there is only one branch, but the mask setup and folding
> the separate popcounts together should add up to about 20-ish cycles
> of overhead.

I tested your patch in comparison to v25 and saw the following:

  bytes       v25   v25+ants
      2  1108.205   1033.132
      4  1311.227   1289.373
      8  1927.954   2360.113
     16  2281.091   2365.408
     32  3856.992   2390.688
     64   3648.72   3242.498
    128  4108.549   3607.148
    256  4910.076   4496.852

For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise.
At 8 bytes, we are calling the function pointer, and there is a small regression with the masking approach. However, by 16 bytes, the masking approach is on par with v25, and it wins for all larger buffers, although the gains seem to taper off a bit. If we can verify this approach won't cause segfaults and can stomach the regression between 8 and 16 bytes, I'd happily pivot to this approach so that we can avoid the function call dance that I have in v25. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote: > I committed v23-0001. Here is a rebased version of the remaining patches. > I intend to test the masking idea from Ants next. 0002 was missing a cast that is needed for the 32-bit builds. I've fixed that in v25. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From fe001e38b3f209c2fe615a2c4c64109d5e4d3da1 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v25 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 29 ++- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 673 insertions(+), 5 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..b48ed7f271 100755 --- a/configure +++ b/configure @@ 
-647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: re
Re: Popcount optimization using AVX512
I committed v23-0001. Here is a rebased version of the remaining patches. I intend to test the masking idea from Ants next. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v24 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 29 ++- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 673 insertions(+), 5 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@
-647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote: > Sorry for the noise. I noticed a couple of silly mistakes immediately > after sending v21. Sigh... I missed a line while rebasing these patches, which seems to have grossly offended cfbot. Apologies again for the noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From bfe2b3158378fd822c17fb251178df7557065cfd Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 2 Apr 2024 15:54:49 -0500 Subject: [PATCH v23 1/3] inline pg_popcount for small numbers of bytes --- src/include/port/pg_bitutils.h | 34 -- src/port/pg_bitutils.c | 12 ++-- 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..1f487a4bc3 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num) /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); -extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes); +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); extern int pg_popcount64(uint64 word); -extern uint64 pg_popcount(const char *buf, int bytes); +extern uint64 pg_popcount_optimized(const char *buf, int bytes); #endif /* TRY_POPCNT_FAST */ +/* + * Returns the number of 1-bits in buf. + * + * If there aren't many bytes to process, the function call overhead of the + * optimized versions isn't worth taking, so we inline a loop that consults + * pg_number_of_ones in that case. If there are many bytes to process, we + * accept the function call overhead because the optimized versions are likely + * to be faster. 
+ */ +static inline uint64 +pg_popcount(const char *buf, int bytes) +{ + /* + * We use 8 bytes as the threshold because that's where we'll first use + * special instructions on 64-bit systems. A threshold of 4 bytes might + * make more sense on 32-bit systems, but it seems unlikely to make a + * tremendous difference. + */ + if (bytes < 8) + { + uint64 popcnt = 0; + + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + return popcnt; + } + + return pg_popcount_optimized(buf, bytes); +} + /* * Rotate the bits of "word" to the right/left by n bits. */ diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 28312f3dd9..6271acea60 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes); int (*pg_popcount32) (uint32 word) = pg_popcount32_choose; int (*pg_popcount64) (uint64 word) = pg_popcount64_choose; -uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; #endif /* TRY_POPCNT_FAST */ #ifdef TRY_POPCNT_FAST @@ -155,13 +155,13 @@ choose_popcount_functions(void) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; + pg_popcount_optimized = pg_popcount_fast; } else { pg_popcount32 = pg_popcount32_slow; pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; + pg_popcount_optimized = pg_popcount_slow; } } @@ -183,7 +183,7 @@ static uint64 pg_popcount_choose(const char *buf, int bytes) { choose_popcount_functions(); - return pg_popcount(buf, bytes); + return pg_popcount_optimized(buf, bytes); } /* @@ -387,11 +387,11 @@ pg_popcount64(uint64 word) } /* - * pg_popcount + * pg_popcount_optimized * Returns the number of 1-bits in buf */ uint64 -pg_popcount(const char *buf, int bytes) +pg_popcount_optimized(const char *buf, int bytes) { return pg_popcount_slow(buf, bytes); } -- 2.25.1 >From 
da744d0614021cf002e4d9e292e5c874bd81a84e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v23 2/3] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 29 ++- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote: > In v21, 0001 is just the above inlining idea, which seems worth doing > independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've > modified similarly to 0001, i.e., I've inlined the "fast" version in the > function pointer to avoid the function call overhead when there are fewer > than 64 bytes. All of this overhead juggling should result in choosing the > optimal popcount implementation depending on how many bytes there are to > process, roughly speaking. Sorry for the noise. I noticed a couple of silly mistakes immediately after sending v21. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cfc5e9fe77f96225ec67a044377b10113c98ce0d Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 2 Apr 2024 15:54:49 -0500 Subject: [PATCH v22 1/3] inline pg_popcount for small numbers of bytes --- src/include/port/pg_bitutils.h | 34 -- src/port/pg_bitutils.c | 12 ++-- 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..1f487a4bc3 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num) /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); -extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes); +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); extern int pg_popcount64(uint64 word); -extern uint64 pg_popcount(const char *buf, int bytes); +extern uint64 pg_popcount_optimized(const char *buf, int bytes); #endif /* TRY_POPCNT_FAST */ +/* + * Returns the number of 1-bits in buf. 
+ * + * If there aren't many bytes to process, the function call overhead of the + * optimized versions isn't worth taking, so we inline a loop that consults + * pg_number_of_ones in that case. If there are many bytes to process, we + * accept the function call overhead because the optimized versions are likely + * to be faster. + */ +static inline uint64 +pg_popcount(const char *buf, int bytes) +{ + /* + * We use 8 bytes as the threshold because that's where we'll first use + * special instructions on 64-bit systems. A threshold of 4 bytes might + * make more sense on 32-bit systems, but it seems unlikely to make a + * tremendous difference. + */ + if (bytes < 8) + { + uint64 popcnt = 0; + + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + return popcnt; + } + + return pg_popcount_optimized(buf, bytes); +} + /* * Rotate the bits of "word" to the right/left by n bits. */ diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 28312f3dd9..6271acea60 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes); int (*pg_popcount32) (uint32 word) = pg_popcount32_choose; int (*pg_popcount64) (uint64 word) = pg_popcount64_choose; -uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; #endif /* TRY_POPCNT_FAST */ #ifdef TRY_POPCNT_FAST @@ -155,13 +155,13 @@ choose_popcount_functions(void) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; + pg_popcount_optimized = pg_popcount_fast; } else { pg_popcount32 = pg_popcount32_slow; pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; + pg_popcount_optimized = pg_popcount_slow; } } @@ -183,7 +183,7 @@ static uint64 pg_popcount_choose(const char *buf, int bytes) { choose_popcount_functions(); - return pg_popcount(buf, bytes); + return 
pg_popcount_optimized(buf, bytes); } /* @@ -387,11 +387,11 @@ pg_popcount64(uint64 word) } /* - * pg_popcount + * pg_popcount_optimized * Returns the number of 1-bits in buf */ uint64 -pg_popcount(const char *buf, int bytes) +pg_popcount_optimized(const char *buf, int bytes) { return pg_popcount_slow(buf, bytes); } -- 2.25.1 >From a8024ebcc54b4ac0d3d145ade5d7cd85eb192afc Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v22 2/3] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 +
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: >> I don't like the double evaluation of the macro argument. Seems like >> you could get the same results more safely with >> >> static inline uint64 >> pg_popcount(const char *buf, int bytes) >> { >> if (bytes < 64) >> { >> uint64 popcnt = 0; >> >> while (bytes--) >> popcnt += pg_number_of_ones[(unsigned char) >> *buf++]; >> >> return popcnt; >> } >> return pg_popcount_optimized(buf, bytes); >> } > > Yeah, I like that better. I'll do some testing to see what the threshold > really should be before posting an actual patch. My testing shows that inlining wins with fewer than 8 bytes for the current "fast" implementation. The "fast" implementation wins with fewer than 64 bytes compared to the AVX-512 implementation. These results are pretty intuitive because those are the points at which the optimizations kick in. In v21, 0001 is just the above inlining idea, which seems worth doing independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've modified similarly to 0001, i.e., I've inlined the "fast" version in the function pointer to avoid the function call overhead when there are fewer than 64 bytes. All of this overhead juggling should result in choosing the optimal popcount implementation depending on how many bytes there are to process, roughly speaking. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From ce1180d557cbdf8cff33842ea2f1a22ba6676725 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 2 Apr 2024 15:54:49 -0500 Subject: [PATCH v21 1/3] inline pg_popcount for small numbers of bytes --- src/include/port/pg_bitutils.h | 34 -- src/port/pg_bitutils.c | 10 +- 2 files changed, 37 insertions(+), 7 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..1f487a4bc3 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num) /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); -extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes); +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); extern int pg_popcount64(uint64 word); -extern uint64 pg_popcount(const char *buf, int bytes); +extern uint64 pg_popcount_optimized(const char *buf, int bytes); #endif /* TRY_POPCNT_FAST */ +/* + * Returns the number of 1-bits in buf. + * + * If there aren't many bytes to process, the function call overhead of the + * optimized versions isn't worth taking, so we inline a loop that consults + * pg_number_of_ones in that case. If there are many bytes to process, we + * accept the function call overhead because the optimized versions are likely + * to be faster. + */ +static inline uint64 +pg_popcount(const char *buf, int bytes) +{ + /* + * We use 8 bytes as the threshold because that's where we'll first use + * special instructions on 64-bit systems. A threshold of 4 bytes might + * make more sense on 32-bit systems, but it seems unlikely to make a + * tremendous difference. 
+ */ + if (bytes < 8) + { + uint64 popcnt = 0; + + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + return popcnt; + } + + return pg_popcount_optimized(buf, bytes); +} + /* * Rotate the bits of "word" to the right/left by n bits. */ diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 28312f3dd9..4720f8e419 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes); int (*pg_popcount32) (uint32 word) = pg_popcount32_choose; int (*pg_popcount64) (uint64 word) = pg_popcount64_choose; -uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; #endif /* TRY_POPCNT_FAST */ #ifdef TRY_POPCNT_FAST @@ -155,13 +155,13 @@ choose_popcount_functions(void) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; + pg_popcount_optimized = pg_popcount_fast; } else { pg_popcount32 = pg_popcount32_slow; pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; + pg_popcount_optimized = pg_popcount_slow; } } @@ -183,7 +183,7 @@ static uint64 pg_popcount_choose(const char *buf, int bytes) { choose_pop
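The small-buffer path inlined in the patches above is a byte-wise walk over a 256-entry table of per-byte popcounts. A self-contained sketch of that idea, for readers following along: the table here is computed on first use, whereas PostgreSQL ships it precomputed as `pg_number_of_ones`; the function names are mine, not the tree's.

```c
#include <stdint.h>

uint8_t number_of_ones[256];

/* Fill the per-byte popcount table via the recurrence n[i] = n[i/2] + (i & 1). */
void init_number_of_ones(void)
{
    for (int i = 1; i < 256; i++)
        number_of_ones[i] = (uint8_t) (number_of_ones[i / 2] + (i & 1));
}

/* Count 1-bits in buf by byte-wise table lookup; this is the form that
 * wins for very short buffers, where function-pointer call overhead
 * would dominate. */
uint64_t table_popcount(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    if (number_of_ones[255] == 0)   /* build the table on first use */
        init_number_of_ones();
    while (bytes--)
        popcnt += number_of_ones[(unsigned char) *buf++];
    return popcnt;
}
```

The 8-byte threshold in the patch is exactly the point at which the word-at-a-time instructions start being used, which is why the crossover lands there in Nathan's measurements.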
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To allow byte level granularity > > masking, -mavx512bw is needed. Based on wikipedia this will only > > disable this fast path on Knights Mill (Xeon Phi), in all other cases > > VPOPCNTQ implies availability of BW. > > Sounds promising. IMHO we should really be sure that these kinds of loads > won't generate segfaults and the like due to the masked-out portions. I > searched around a little bit but haven't found anything that seemed > definitive. After sleeping on the problem, I think we can avoid this question altogether while making the code faster by using aligned accesses. Loads that straddle cache line boundaries run internally as 2 load operations. Gut feel says that there are enough out-of-order resources available to make it not matter in most cases. But even so, not doing the extra work is surely better. Attached is another approach that does aligned accesses, and thereby avoids going outside bounds. Would be interesting to see how well that fares in the small use case. Anything that fits into one aligned cache line should be constant speed, and there is only one branch, but the mask setup and folding the separate popcounts together should add up to about 20-ish cycles of overhead. 
Regards, Ants Aasma diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index f86558d1ee5..e1fbd98fa14 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -30,20 +30,44 @@ uint64 pg_popcount_avx512(const char *buf, int bytes) { - uint64 popcnt; + __m512i val, cnt; __m512i accum = _mm512_setzero_si512(); + const char *final; + int tail_idx; + __mmask64 mask = -1; - for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) - { - const __m512i val = _mm512_loadu_si512((const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); + /* + * Align buffer down to avoid double load overhead from unaligned access. + * Calculate a mask to ignore preceding bytes. Find start offset of final + * iteration and number of valid bytes making sure that final iteration + * is not empty. + */ + mask <<= ((uintptr_t) buf) % sizeof(__m512i); + tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1; + final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1); + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + /* + * Iterate through all but the final iteration. Starting from second + * iteration, the start index mask is ignored. + */ + for (; buf < final; buf += sizeof(__m512i)) + { + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); - buf += sizeof(__m512i); + + mask = -1; } - popcnt = _mm512_reduce_add_epi64(accum); - return popcnt + pg_popcount_fast(buf, bytes); + /* Final iteration needs to ignore bytes that are not within the length */ + mask &= ((~0ULL) >> (64 - tail_idx)); + + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + + return _mm512_reduce_add_epi64(accum); } #endif /* TRY_POPCNT_FAST */
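The head/tail mask bookkeeping in the patch above is easy to get wrong at chunk boundaries. A portable scalar model of just the mask arithmetic (one mask bit per byte lane, 64-byte "vectors"; names are mine, not the patch's) lets one check, without AVX-512 hardware, that the selected lanes across all iterations cover exactly the requested byte range:

```c
#include <stdint.h>

/* Count the set bits in a 64-bit mask (one bit per byte lane). */
int mask_popcount(uint64_t m)
{
    int n = 0;
    while (m) { m &= m - 1; n++; }
    return n;
}

/*
 * Model of the patch's loop structure for bytes >= 1: align the start
 * address down, mask off leading lanes on the first iteration and
 * trailing lanes on the last, and return how many byte lanes were
 * selected in total.  A correct masking scheme selects exactly `bytes`.
 */
int selected_lanes(uint64_t addr, int bytes)
{
    uint64_t mask = ~UINT64_C(0) << (addr % 64);    /* skip leading lanes */
    int tail_idx = (int) ((addr + bytes - 1) % 64) + 1;
    uint64_t p = addr - addr % 64;                  /* align down */
    uint64_t final = (addr + bytes - 1) - (addr + bytes - 1) % 64;
    int total = 0;

    for (; p < final; p += 64)
    {
        total += mask_popcount(mask);
        mask = ~UINT64_C(0);                        /* full vectors in between */
    }
    /* final iteration: also drop lanes past the end of the buffer */
    mask &= ~UINT64_C(0) >> (64 - tail_idx);
    total += mask_popcount(mask);
    return total;
}
```

Note how the single-chunk case works out: when the loop body never runs, the head and tail masks are simply ANDed together, matching the patch's control flow.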
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: > Alvaro Herrera writes: >> On 2024-Apr-02, Nathan Bossart wrote: >>> Another idea I had is to turn pg_popcount() into a macro that just uses the >>> pg_number_of_ones array when called for few bytes: >>> >>> static inline uint64 >>> pg_popcount_inline(const char *buf, int bytes) >>> { >>> uint64 popcnt = 0; >>> >>> while (bytes--) >>> popcnt += pg_number_of_ones[(unsigned char) *buf++]; >>> >>> return popcnt; >>> } >>> >>> #define pg_popcount(buf, bytes) \ >>> ((bytes < 64) ? \ >>> pg_popcount_inline(buf, bytes) : \ >>> pg_popcount_optimized(buf, bytes)) >>> >>> But again, I'm not sure this is really worth it for the current use-cases. > >> Eh, that seems simple enough, and then you can forget about that case. > > I don't like the double evaluation of the macro argument. Seems like > you could get the same results more safely with > > static inline uint64 > pg_popcount(const char *buf, int bytes) > { > if (bytes < 64) > { > uint64 popcnt = 0; > > while (bytes--) > popcnt += pg_number_of_ones[(unsigned char) > *buf++]; > > return popcnt; > } > return pg_popcount_optimized(buf, bytes); > } Yeah, I like that better. I'll do some testing to see what the threshold really should be before posting an actual patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Alvaro Herrera writes: > On 2024-Apr-02, Nathan Bossart wrote: >> Another idea I had is to turn pg_popcount() into a macro that just uses the >> pg_number_of_ones array when called for few bytes: >> >> static inline uint64 >> pg_popcount_inline(const char *buf, int bytes) >> { >> uint64 popcnt = 0; >> >> while (bytes--) >> popcnt += pg_number_of_ones[(unsigned char) *buf++]; >> >> return popcnt; >> } >> >> #define pg_popcount(buf, bytes) \ >> ((bytes < 64) ? \ >> pg_popcount_inline(buf, bytes) : \ >> pg_popcount_optimized(buf, bytes)) >> >> But again, I'm not sure this is really worth it for the current use-cases. > Eh, that seems simple enough, and then you can forget about that case. I don't like the double evaluation of the macro argument. Seems like you could get the same results more safely with static inline uint64 pg_popcount(const char *buf, int bytes) { if (bytes < 64) { uint64 popcnt = 0; while (bytes--) popcnt += pg_number_of_ones[(unsigned char) *buf++]; return popcnt; } return pg_popcount_optimized(buf, bytes); } regards, tom lane
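The double evaluation Tom objects to becomes observable whenever the macro argument has side effects. A contrived sketch (names hypothetical, not PostgreSQL's) contrasting the macro form with the inline-function form:

```c
#include <stdint.h>

/* Tiny popcount table with only the entries this demo uses. */
static const uint8_t ones[256] = {
    [0x01] = 1, [0x03] = 2, [0x07] = 3, [0x0f] = 4 };

uint64_t count_bits(const char *buf, int bytes)
{
    uint64_t n = 0;
    while (bytes--)
        n += ones[(unsigned char) *buf++];
    return n;
}

/* Macro form: `bytes` appears twice, so it is evaluated twice. */
#define POPCOUNT_MACRO(buf, bytes) \
    (((bytes) < 64) ? count_bits((buf), (bytes)) : count_bits((buf), (bytes)))

/* Inline-function form: the argument is evaluated exactly once. */
static inline uint64_t popcount_inline(const char *buf, int bytes)
{
    return count_bits(buf, bytes);
}
```

With `int n = 3;`, `POPCOUNT_MACRO(data, n++)` increments `n` twice (once in the comparison, once in the call), so it ends at 5 and the call counts four bytes instead of three, while `popcount_inline(data, n++)` counts three bytes and leaves `n` at 4.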
Re: Popcount optimization using AVX512
On 2024-Apr-02, Nathan Bossart wrote: > Another idea I had is to turn pg_popcount() into a macro that just uses the > pg_number_of_ones array when called for few bytes: > > static inline uint64 > pg_popcount_inline(const char *buf, int bytes) > { > uint64 popcnt = 0; > > while (bytes--) > popcnt += pg_number_of_ones[(unsigned char) *buf++]; > > return popcnt; > } > > #define pg_popcount(buf, bytes) \ > ((bytes < 64) ? \ >pg_popcount_inline(buf, bytes) : \ >pg_popcount_optimized(buf, bytes)) > > But again, I'm not sure this is really worth it for the current use-cases. Eh, that seems simple enough, and then you can forget about that case. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "No hay hombre que no aspire a la plenitud, es decir, la suma de experiencias de que un hombre es capaz"
Re: Popcount optimization using AVX512
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote: > Here is a v19 of the patch set. I moved out the refactoring of the > function pointer selection code to 0001. I think this is a good change > independent of $SUBJECT, and I plan to commit this soon. In 0002, I > changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones > instead. This is standard practice elsewhere where the popcount functions > are unlikely to win. I'll probably commit this one soon, too, as it's even > more trivial than 0001. > > 0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are > no changes from v18. 0004 is an early proof-of-concept for using AVX512 > for the visibility map code. The code is missing comments, and I haven't > performed any benchmarking yet, but I figured I'd post it because it > demonstrates how it's possible to build upon 0003 in other areas. I've committed the first two patches, and I've attached a rebased version of the latter two. > AFAICT the main open question is the function call overhead in 0003 that > Alvaro brought up earlier. After 0002 is committed, I believe the only > in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm > not sure it's worth expending too much energy to make sure there are > absolutely no regressions there. However, I'm happy to do so if folks feel > that it is necessary, and I'd be grateful for thoughts on how to proceed on > this one. Another idea I had is to turn pg_popcount() into a macro that just uses the pg_number_of_ones array when called for few bytes: static inline uint64 pg_popcount_inline(const char *buf, int bytes) { uint64 popcnt = 0; while (bytes--) popcnt += pg_number_of_ones[(unsigned char) *buf++]; return popcnt; } #define pg_popcount(buf, bytes) \ ((bytes < 64) ? \ pg_popcount_inline(buf, bytes) : \ pg_popcount_optimized(buf, bytes)) But again, I'm not sure this is really worth it for the current use-cases. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 3c5c3fdaffd623b513bcc476ee7c15f6379af1e7 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v20 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 7 +- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 651 insertions(+), 5 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACH
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out portions of a load instruction >> > will not generate an exception. To allow byte level granularity >> > masking, -mavx512bw is needed. Based on wikipedia this will only >> > disable this fast path on Knights Mill (Xeon Phi), in all other cases >> > VPOPCNTQ implies availability of BW. >> >> Sounds promising. IMHO we should really be sure that these kinds of loads >> won't generate segfaults and the like due to the masked-out portions. I >> searched around a little bit but haven't found anything that seemed >> definitive. > > Interestingly the Intel software developer manual is not exactly > crystal clear on how memory faults with masks work, but volume 2A > chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb > that supports memory fault suppression on page fault. Perhaps Paul or Akash could chime in here... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
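For readers following the fault-suppression question: the semantics being relied on are that a masked byte load zeroes masked-out lanes and, per the E4.nb exception class cited above, never faults on them. A scalar reference model of those load semantics (it can demonstrate the zeroing behavior, though obviously not the hardware fault suppression itself):

```c
#include <stdint.h>

/*
 * Scalar model of a 64-byte masked load (_mm512_maskz_loadu_epi8):
 * lane i receives buf[i] when bit i of mask is set, and 0 otherwise.
 * Crucially, masked-out lanes are never dereferenced, which is what
 * would permit loading a short tail without reading past the buffer.
 */
void maskz_load_model(uint8_t out[64], uint64_t mask, const uint8_t *buf)
{
    for (int i = 0; i < 64; i++)
        out[i] = ((mask >> i) & 1) ? buf[i] : 0;
}
```

Because masked-out lanes contribute zero, a popcount over the loaded vector counts only the bytes the mask selects, which is why the tail can share the main loop's code path.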
Re: Popcount optimization using AVX512
Here is a v19 of the patch set. I moved out the refactoring of the function pointer selection code to 0001. I think this is a good change independent of $SUBJECT, and I plan to commit this soon. In 0002, I changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones instead. This is standard practice elsewhere where the popcount functions are unlikely to win. I'll probably commit this one soon, too, as it's even more trivial than 0001. 0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are no changes from v18. 0004 is an early proof-of-concept for using AVX512 for the visibility map code. The code is missing comments, and I haven't performed any benchmarking yet, but I figured I'd post it because it demonstrates how it's possible to build upon 0003 in other areas. AFAICT the main open question is the function call overhead in 0003 that Alvaro brought up earlier. After 0002 is committed, I believe the only in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm not sure it's worth expending too much energy to make sure there are absolutely no regressions there. However, I'm happy to do so if folks feel that it is necessary, and I'd be grateful for thoughts on how to proceed on this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cedad23b7b35e77fde164b1d577c37fb07a578c6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:37:53 -0500 Subject: [PATCH v19 1/4] refactor popcount function choosing --- src/port/pg_bitutils.c | 37 + 1 file changed, 9 insertions(+), 28 deletions(-) diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 1197696e97..28312f3dd9 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -148,8 +148,8 @@ pg_popcount_available(void) * the function pointers so that subsequent calls are routed directly to * the chosen implementation. 
*/ -static int -pg_popcount32_choose(uint32 word) +static inline void +choose_popcount_functions(void) { if (pg_popcount_available()) { @@ -163,45 +163,26 @@ pg_popcount32_choose(uint32 word) pg_popcount64 = pg_popcount64_slow; pg_popcount = pg_popcount_slow; } +} +static int +pg_popcount32_choose(uint32 word) +{ + choose_popcount_functions(); return pg_popcount32(word); } static int pg_popcount64_choose(uint64 word) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount64(word); } static uint64 pg_popcount_choose(const char *buf, int bytes) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount(buf, bytes); } -- 2.25.1 >From 038b74045b006c5d8a5470364f2041370ec0b083 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:47:22 -0500 Subject: [PATCH v19 2/4] use pg_number_of_ones instead of pg_popcount for single byte --- src/backend/postmaster/syslogger.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index 08efe74cc9..437947dbb9 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (p.nuls[0] == '\0' && p.nuls[1] == '\0' && p.len > 0 && p.len <= PIPE_MAX_PAYLOAD && p.pid != 0 && - pg_popcount((char *) &dest_flags, 1) == 1) + pg_number_of_ones[dest_flags] == 1) { List *buffer_list; ListCell *cell; -- 2.25.1 
>From 73ee8d6018b047856e63ad075641a0dcfe889417 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v19 3/4] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 7 +- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To allow byte level granularity > > masking, -mavx512bw is needed. Based on wikipedia this will only > > disable this fast path on Knights Mill (Xeon Phi), in all other cases > > VPOPCNTQ implies availability of BW. > > Sounds promising. IMHO we should really be sure that these kinds of loads > won't generate segfaults and the like due to the masked-out portions. I > searched around a little bit but haven't found anything that seemed > definitive. Interestingly the Intel software developer manual is not exactly crystal clear on how memory faults with masks work, but volume 2A chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb that supports memory fault suppression on page fault. Regards, Ants Aasma [1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > What about using the masking capabilities of AVX-512 to handle the > tail in the same code path? Masked out portions of a load instruction > will not generate an exception. To allow byte level granularity > masking, -mavx512bw is needed. Based on wikipedia this will only > disable this fast path on Knights Mill (Xeon Phi), in all other cases > VPOPCNTQ implies availability of BW. Sounds promising. IMHO we should really be sure that these kinds of loads won't generate segfaults and the like due to the masked-out portions. I searched around a little bit but haven't found anything that seemed definitive. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote: > > On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > > On 2024-Mar-31, Nathan Bossart wrote: > >> +popcnt = _mm512_reduce_add_epi64(accum); > >> +return popcnt + pg_popcount_fast(buf, bytes); > > > > Hmm, doesn't this arrangement cause an extra function call to > > pg_popcount_fast to be used here? Given the level of micro-optimization > > being used by this code, I would have thought that you'd have tried to > > avoid that. (At least, maybe avoid the call if bytes is 0, no?) > > Yes, it does. I did another benchmark on very small arrays and can see the > overhead. This is the time in milliseconds to run pg_popcount() on an > array 1 billion times:
>
> size (bytes)  HEAD      AVX512-POPCNT
> 1             1707.685  3480.424
> 2             1926.694  4606.182
> 4             3210.412  5284.506
> 8             1920.703  3640.968
> 16            2936.91   4045.586
> 32            3627.956  5538.418
> 64            5347.213  3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is > the earliest point where there are enough bytes for ZMM registers. What about using the masking capabilities of AVX-512 to handle the tail in the same code path? Masked out portions of a load instruction will not generate an exception. To allow byte level granularity masking, -mavx512bw is needed. Based on wikipedia this will only disable this fast path on Knights Mill (Xeon Phi), in all other cases VPOPCNTQ implies availability of BW. Attached is an example of what I mean. I did not have a machine to test it with, but the code generated looks sane. I added the clang pragma because it insisted on unrolling otherwise and based on how the instruction dependencies look that is probably not too helpful even for large cases (needs to be tested). The configure check and compile flags of course need to be amended for BW.
Regards, Ants Aasma diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index f86558d1ee5..7fb2ada16c9 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -30,20 +30,27 @@ uint64 pg_popcount_avx512(const char *buf, int bytes) { - uint64 popcnt; + __m512i val, cnt; + __mmask64 remaining_mask; __m512i accum = _mm512_setzero_si512(); - for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) + #pragma clang loop unroll(disable) + for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i)) { - const __m512i val = _mm512_loadu_si512((const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); + val = _mm512_loadu_si512((const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); buf += sizeof(__m512i); } - popcnt = _mm512_reduce_add_epi64(accum); - return popcnt + pg_popcount_fast(buf, bytes); + remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes); + val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + + accum = _mm512_add_epi64(accum, cnt); + + return _mm512_reduce_add_epi64(accum); } #endif /* TRY_POPCNT_FAST */
Re: Popcount optimization using AVX512
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > On 2024-Mar-31, Nathan Bossart wrote: >> +popcnt = _mm512_reduce_add_epi64(accum); >> +return popcnt + pg_popcount_fast(buf, bytes); > > Hmm, doesn't this arrangement cause an extra function call to > pg_popcount_fast to be used here? Given the level of micro-optimization > being used by this code, I would have thought that you'd have tried to > avoid that. (At least, maybe avoid the call if bytes is 0, no?) Yes, it does. I did another benchmark on very small arrays and can see the overhead. This is the time in milliseconds to run pg_popcount() on an array 1 billion times:

size (bytes)  HEAD      AVX512-POPCNT
1             1707.685  3480.424
2             1926.694  4606.182
4             3210.412  5284.506
8             1920.703  3640.968
16            2936.91   4045.586
32            3627.956  5538.418
64            5347.213  3748.212

I suspect that anything below 64 bytes will see this regression, as that is the earliest point where there are enough bytes for ZMM registers. We could avoid the call if there are no remaining bytes, but the numbers for the smallest arrays probably wouldn't improve much, and that might actually add some overhead due to branching. The other option to avoid this overhead is to put most of pg_bitutils.c into its header file so that we can inline the call. Reviewing the current callers of pg_popcount(), IIUC the only ones that are passing very small arrays are the bit_count() implementations and a call in the syslogger for a single byte. I don't know how much to worry about the overhead for bit_count() since there's presumably a bunch of other overhead, and the syslogger one could probably be fixed via an inline function that pulled the value from pg_number_of_ones (which would probably be an improvement over the status quo, anyway). But this is all to save a couple of nanoseconds... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On 2024-Mar-31, Nathan Bossart wrote: > +uint64 > +pg_popcount_avx512(const char *buf, int bytes) > +{ > + uint64 popcnt; > + __m512i accum = _mm512_setzero_si512(); > + > + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) > + { > + const __m512i val = _mm512_loadu_si512((const __m512i > *) buf); > + const __m512i cnt = _mm512_popcnt_epi64(val); > + > + accum = _mm512_add_epi64(accum, cnt); > + buf += sizeof(__m512i); > + } > + > + popcnt = _mm512_reduce_add_epi64(accum); > + return popcnt + pg_popcount_fast(buf, bytes); > +} Hmm, doesn't this arrangement cause an extra function call to pg_popcount_fast to be used here? Given the level of micro-optimization being used by this code, I would have thought that you'd have tried to avoid that. (At least, maybe avoid the call if bytes is 0, no?) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Re: Popcount optimization using AVX512
On Sat, Mar 30, 2024 at 03:03:29PM -0500, Nathan Bossart wrote: > My current plan is to add some new tests for > pg_popcount() with many bytes, and then I'll give it a few more days for > any additional feedback before committing. Here is a v18 with a couple of new tests. Otherwise, it is the same as v17. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 86a571721ed3ed4ca7e04134b9541fc3ac43b9f1 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v18 1/1] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 +++--- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 666 insertions(+), 39 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..b48ed7f271 100755 --- a/configure +++ b/configure @@ 
-647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo
Re: Popcount optimization using AVX512
I used John Naylor's test_popcount module [0] to put together the attached graphs (note that the "small arrays" one is semi-logarithmic). For both graphs, the X-axis is the number of 64-bit words in the array, and Y-axis is the amount of time in milliseconds to run pg_popcount() on it 100,000 times (along with a bit of overhead). This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes. There isn't a ton of use of pg_popcount() in Postgres, but I do see a few places that call it with enough bytes for the AVX512 optimization to take effect. There may be more callers in the future, though, and it seems generally useful to have some of the foundational work for using AVX512 instructions in place. My current plan is to add some new tests for pg_popcount() with many bytes, and then I'll give it a few more days for any additional feedback before committing. [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=g...@mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Here's a v17 of the patch. This one has configure checks for everything (i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT availability, and we call XGETBV to ensure the ZMM registers are enabled). I restricted the AVX512 configure checks to x86_64 since we know we won't have TRY_POPCNT_FAST on 32-bit, and we rely on pg_popcount_fast() as our fallback implementation in the AVX512 version. Finally, I removed the inline assembly in favor of using the _xgetbv() intrinsic on all systems. It looks like that's available on gcc, clang, and msvc, although it sometimes requires -mxsave, so that's applied to pg_popcount_avx512_choose.o as needed. I doubt this will lead to SIGILLs, but it's admittedly a little shaky. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From a26b209927cc6b266b33f74fd734772eff87bff9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v17 1/1] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 +++--- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 13 files changed, 638 insertions(+), 39 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic 
function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..b48ed7f271 100755 --- a/configure +++ b/configure @@ -647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote: >> +#if defined(HAVE__GET_CPUID) >> +__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); >> +#elif defined(HAVE__CPUID) >> +__cpuidex(exx, 7, 0); > > Is there any reason we can't use __get_cpuid() and __cpuid() here, given > the sub-leaf is 0? The answer to this seems to be "no." After additional research, __get_cpuid_count/__cpuidex seem new enough that we probably want configure checks for them, so I'll add those back in the next version of the patch. Apologies for the stream of consciousness today... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote: > * If the compiler understands AVX512 intrinsics, we assume that it also > knows about the required CPUID and XGETBV intrinsics, and we assume that > the conditions for TRY_POPCNT_FAST are true. Bleh, cfbot's 32-bit build is unhappy with this [0]. It looks like it's trying to build the AVX512 stuff, but TRY_POPCNT_FAST isn't set. [19:39:11.306] ../src/port/pg_popcount_avx512.c:39:18: warning: implicit declaration of function ‘pg_popcount_fast’; did you mean ‘pg_popcount’? [-Wimplicit-function-declaration] [19:39:11.306] 39 | return popcnt + pg_popcount_fast(buf, bytes); [19:39:11.306] | ^~~~ [19:39:11.306] | pg_popcount There's also a complaint about the inline assembly: [19:39:11.443] ../src/port/pg_popcount_avx512_choose.c:55:1: error: inconsistent operand constraints in an ‘asm’ [19:39:11.443] 55 | __asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr)); [19:39:11.443] | ^~~ I'm looking into this... > +#if defined(HAVE__GET_CPUID) > + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); > +#elif defined(HAVE__CPUID) > + __cpuidex(exx, 7, 0); Is there any reason we can't use __get_cpuid() and __cpuid() here, given the sub-leaf is 0? [0] https://cirrus-ci.com/task/5475113447981056 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Okay, here is a slightly different approach that I've dubbed the "maximum assumption" approach. In short, I wanted to see how much we could simplify the patch by making all possibly-reasonable assumptions about the compiler and CPU. These include: * If the compiler understands AVX512 intrinsics, we assume that it also knows about the required CPUID and XGETBV intrinsics, and we assume that the conditions for TRY_POPCNT_FAST are true. * If this is x86_64, CPUID will be supported by the CPU. * If CPUID indicates AVX512 POPCNT support, the CPU also supports XGETBV. Do any of these assumptions seem unreasonable or unlikely to be true for all practical purposes? I don't mind adding back some or all of the configure/runtime checks if they seem necessary. I guess the real test will be the buildfarm... Another big change in this version is that I've moved pg_popcount_avx512_available() to its own file so that we only compile pg_popcount_avx512() with the special compiler flags. This is just an oversight in previous versions. Finally, I've modified the build scripts so that the AVX512 popcount stuff is conditionally built based on the configure checks for both autoconf/meson. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From d7864391c455ea77b8e555e40a358c59de1bd702 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v16 1/1] AVX512 popcount support --- config/c-compiler.m4 | 34 + configure| 100 +++ configure.ac | 14 meson.build | 35 ++ src/Makefile.global.in | 4 ++ src/include/pg_config.h.in | 3 + src/include/port/pg_bitutils.h | 17 + src/makefiles/meson.build| 3 +- src/port/Makefile| 6 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 ++- src/port/pg_popcount_avx512.c| 40 +++ src/port/pg_popcount_avx512_choose.c | 61 13 files changed, 340 insertions(+), 39 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..7d13368b23 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..86c471f4ec 100755 --- a/configure +++ b/configure @@ -647,6 +647,8 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +CFLAGS_POPCNT +PG_POPCNT_OBJS LIBOBJS OPENSSL ZSTD @@ -17438,6 +17440,104 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h fi +# Check for AVX512 popcount intrinsics +# +PG_POPCNT_OBJS="" +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5 +$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; } +if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS " +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +const char buf[sizeof(__m512i)]; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(va
RE: Popcount optimization using AVX512
> A counterexample is the CRC32C code. AFAICT we assume the presence of > CPUID in that code (and #error otherwise). I imagine its probably safe to > assume the compiler understands CPUID if it understands AVX512 intrinsics, > but that is still mostly a guess. If AVX-512 intrinsics are available, then yes you will have CPUID. CPUID is much older in the hardware/software timeline than AVX-512. Thanks, Paul
RE: Popcount optimization using AVX512
> On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > > We don't do MSVC via autoconf/Make. We used to have a special build > > framework for MSVC which parsed Makefiles to produce "solution" files, > > but it was removed as soon as Meson was mature enough to build. See > > commit 1301c80b2167. If it builds with Meson, you're good. > > The latest cfbot build for this seems to indicate that at least newer MSVC > knows AVX512 intrinsics without any special compiler flags [0], so maybe > what I had in v14 is good enough. A previous version of the patch set [1] had > the following lines: > > + if host_system == 'windows' > +test_flags = ['/arch:AVX512'] > + endif > > I'm not sure if this is needed for older MSVC or something else. IIRC I > couldn't > find any other examples of this sort of thing in the meson scripts, either. > Paul, > do you recall why you added this? I asked internal folks here in-the-know and they suggested I add it. I personally am not a Windows guy. If it works without it and you are comfortable not including the lines, I am fine with it. Thanks, Paul
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote: > Nathan Bossart writes: >>> I see google web references to the xgetbv instruction as far back as 2009 >>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >>> _xgetbv() MSVC built-in. How far back do you need to go? > >> Hm. It seems unlikely that a compiler would understand AVX512 intrinsics >> and not XGETBV then. I guess the other question is whether CPUID >> indicating AVX512 is enabled implies the availability of XGETBV on the CPU. >> If that's not safe, we might need to add another CPUID test. > > Some quick googling says that (1) XGETBV predates AVX and (2) if you > are worried about old CPUs, you should check CPUID to verify whether > XGETBV exists before trying to use it. I did not look for the > bit-level details on how to do that. That extra CPUID check should translate to exactly one additional line of code, so I think I'm inclined to just add it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> From: Nathan Bossart
> Sent: Friday, March 29, 2024 9:17 AM
> To: Amonson, Paul D

On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote:
>> Yeah, I understand that much, but I want to know how portable the
>> XGETBV instruction is. Unless I can assume that all x86_64 systems
>> and compilers support that instruction, we might need an additional
>> configure check and/or CPUID check. It looks like MSVC has had
>> support for the _xgetbv intrinsic for quite a while, but I'm still
>> researching the other cases.
>
> I see google web references to the xgetbv instruction as far back as
> 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could
> test for _xgetbv() MSVC built-in. How far back do you need to go?

> Hm. It seems unlikely that a compiler would understand AVX512 intrinsics and
> not XGETBV then. I guess the other question is whether CPUID indicating
> AVX512 is enabled implies the availability of XGETBV on the CPU.
> If that's not safe, we might need to add another CPUID test.

> It would probably be easy enough to add a couple of tests for this, but if we
> don't have reason to believe there's any practical case to do so, I don't
> know why we would. I'm curious what others think about this.

This seems unlikely. Machines supporting XGETBV would support AVX512
intrinsics. The XGETBV instruction seems to be part of the XSAVE feature set
per the Intel developer manual [2]. XGETBV/XSAVE came first and appears to
have been available in all x86 systems shipped since 2011, starting with the
Intel Sandy Bridge architecture and the AMD Opteron Gen 4 [0]. AVX512 first
appeared in a product in 2016 [1].

[0]: https://kb.vmware.com/s/article/1005764
[1]: https://en.wikipedia.org/wiki/AVX-512
[2]: https://cdrdv2-public.intel.com/774475/252046-sdm-change-document.pdf

- Akash Shankaran
Re: Popcount optimization using AVX512
Nathan Bossart writes: >> I see google web references to the xgetbv instruction as far back as 2009 >> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >> _xgetbv() MSVC built-in. How far back do you need to go? > Hm. It seems unlikely that a compiler would understand AVX512 intrinsics > and not XGETBV then. I guess the other question is whether CPUID > indicating AVX512 is enabled implies the availability of XGETBV on the CPU. > If that's not safe, we might need to add another CPUID test. Some quick googling says that (1) XGETBV predates AVX and (2) if you are worried about old CPUs, you should check CPUID to verify whether XGETBV exists before trying to use it. I did not look for the bit-level details on how to do that. regards, tom lane
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote:
> It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
> builds, too, but AFAICT we still need to wrap most of that code with
> macros, so I'm not sure it's worth the trouble. I'll take another look at
> this...

If we assumed that TRY_POPCNT_FAST would be set and either
HAVE__GET_CPUID_COUNT or HAVE__CPUIDEX would be set whenever
USE_AVX512_POPCNT_WITH_RUNTIME_CHECK is set, we could probably remove the
surrounding macros and just compile pg_popcount_avx512.c conditionally based
on USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. However, the surrounding code seems
to be pretty cautious about these assumptions (e.g., the CPUID macros are
checked before setting TRY_POPCNT_FAST), so this would stray from the nearby
precedent a bit.

A counterexample is the CRC32C code. AFAICT we assume the presence of CPUID
in that code (and #error otherwise). I imagine it's probably safe to assume
the compiler understands CPUID if it understands AVX512 intrinsics, but that
is still mostly a guess.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote: >> Yeah, I understand that much, but I want to know how portable the XGETBV >> instruction is. Unless I can assume that all x86_64 systems and compilers >> support that instruction, we might need an additional configure check and/or >> CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic >> for >> quite a while, but I'm still researching the other cases. > > I see google web references to the xgetbv instruction as far back as 2009 > for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for > _xgetbv() MSVC built-in. How far back do you need to go? Hm. It seems unlikely that a compiler would understand AVX512 intrinsics and not XGETBV then. I guess the other question is whether CPUID indicating AVX512 is enabled implies the availability of XGETBV on the CPU. If that's not safe, we might need to add another CPUID test. It would probably be easy enough to add a couple of tests for this, but if we don't have reason to believe there's any practical case to do so, I don't know why we would. I'm curious what others think about this. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> -Original Message- > > Cool. I think we should run the benchmarks again to be safe, though. Ok, sure go ahead. :) > >> I forgot to mention that I also want to understand whether we can > >> actually assume availability of XGETBV when CPUID says we support > >> AVX512: > > > > You cannot assume as there are edge cases where AVX-512 was found on > > system one during compile but it's not actually available in a kernel > > on a second system at runtime despite the CPU actually having the > > hardware feature. > > Yeah, I understand that much, but I want to know how portable the XGETBV > instruction is. Unless I can assume that all x86_64 systems and compilers > support that instruction, we might need an additional configure check and/or > CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic for > quite a while, but I'm still researching the other cases. I see google web references to the xgetbv instruction as far back as 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for _xgetbv() MSVC built-in. How far back do you need to go? Thanks, Paul
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 10:29:47PM +, Amonson, Paul D wrote: > I see in the meson.build you added the new file twice? > > @@ -7,6 +7,7 @@ pgport_sources = [ >'noblock.c', >'path.c', >'pg_bitutils.c', > + 'pg_popcount_avx512.c', >'pg_strong_random.c', >'pgcheckdir.c', >'pgmkdirp.c', > @@ -84,6 +85,7 @@ replace_funcs_pos = [ >['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'], >['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], >['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], > + ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', > 'avx512_popcnt'], > > I was putting the file with special flags ONLY in the second section and all > seemed to work. :) Ah, yes, I think that's a mistake, and without looking closely, might explain the MSVC warnings [0]: [22:05:47.444] pg_popcount_avx512.c.obj : warning LNK4006: pg_popcount_avx512_available already defined in pg_popcount_a... It might be nice if we conditionally built pg_popcount_avx512.o in autoconf builds, too, but AFAICT we still need to wrap most of that code with macros, so I'm not sure it's worth the trouble. I'll take another look at this... 
[0] http://commitfest.cputube.org/highlights/all.html#4883 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From c924b57f8479e51aa30c8e3cfe194a2ab85497ff Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v15 1/1] AVX512 popcount support --- config/c-compiler.m4 | 34 +++ configure | 165 + configure.ac | 34 +++ meson.build| 59 src/Makefile.global.in | 1 + src/include/pg_config.h.in | 9 ++ src/include/port/pg_bitutils.h | 20 src/makefiles/meson.build | 1 + src/port/Makefile | 6 ++ src/port/meson.build | 5 +- src/port/pg_bitutils.c | 56 --- src/port/pg_popcount_avx512.c | 98 12 files changed, 450 insertions(+), 38 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..f881e7ec28 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_AVX512_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..189264b86e 100755 --- a/configure +++ b/configure @@ -647,6 +647,7 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +CFLAGS_AVX512_POPCNT LIBOBJS OPENSSL ZSTD @@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +# Check for x86 cpuid_count instruction +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > We don't do MSVC via autoconf/Make. We used to have a special build > framework for MSVC which parsed Makefiles to produce "solution" files, > but it was removed as soon as Meson was mature enough to build. See > commit 1301c80b2167. If it builds with Meson, you're good. The latest cfbot build for this seems to indicate that at least newer MSVC knows AVX512 intrinsics without any special compiler flags [0], so maybe what I had in v14 is good enough. A previous version of the patch set [1] had the following lines: + if host_system == 'windows' +test_flags = ['/arch:AVX512'] + endif I'm not sure if this is needed for older MSVC or something else. IIRC I couldn't find any other examples of this sort of thing in the meson scripts, either. Paul, do you recall why you added this? [0] https://cirrus-ci.com/task/5787206636273664?logs=configure#L159 [1] https://postgr.es/m/attachment/158206/v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 10:03:04PM +, Amonson, Paul D wrote: >> * I think we need to verify there isn't a huge performance regression for >> smaller arrays. IIUC those will still require an AVX512 instruction or >> two as well as a function call, which might add some noticeable overhead. > > Not considering your changes, I had already tested small buffers. At less > than 512 bytes there was no measurable regression (there was one extra > condition check) and for 512+ bytes it moved from no regression to some > gains between 512 and 4096 bytes. Assuming you introduced no extra > function calls, it should be the same. Cool. I think we should run the benchmarks again to be safe, though. >> I forgot to mention that I also want to understand whether we can >> actually assume availability of XGETBV when CPUID says we support >> AVX512: > > You cannot assume as there are edge cases where AVX-512 was found on > system one during compile but it's not actually available in a kernel on > a second system at runtime despite the CPU actually having the hardware > feature. Yeah, I understand that much, but I want to know how portable the XGETBV instruction is. Unless I can assume that all x86_64 systems and compilers support that instruction, we might need an additional configure check and/or CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic for quite a while, but I'm still researching the other cases. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> -Original Message- > From: Amonson, Paul D > Sent: Thursday, March 28, 2024 3:03 PM > To: Nathan Bossart > ... > I will review the new patch to see if there are anything that jumps out at me. I see in the meson.build you added the new file twice? @@ -7,6 +7,7 @@ pgport_sources = [ 'noblock.c', 'path.c', 'pg_bitutils.c', + 'pg_popcount_avx512.c', 'pg_strong_random.c', 'pgcheckdir.c', 'pgmkdirp.c', @@ -84,6 +85,7 @@ replace_funcs_pos = [ ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'], ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], + ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'], I was putting the file with special flags ONLY in the second section and all seemed to work. :) Everything else seems good to me. Thanks, Paul
Re: Popcount optimization using AVX512
On 2024-Mar-28, Amonson, Paul D wrote: > > -Original Message- > > From: Nathan Bossart > > Sent: Thursday, March 28, 2024 2:39 PM > > To: Amonson, Paul D > > > > * The latest patch set from Paul Amonson appeared to support MSVC in the > > meson build, but not the autoconf one. I don't have much expertise here, > > so the v14 patch doesn't have any autoconf/meson support for MSVC, which > > I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds > > can always compile the x86_64 popcount code, but I don't know whether > > that's safe for AVX512. > > I also do not know how to integrate MSVC+Autoconf, the CI uses > MSVC+Meson+Ninja so I stuck with that. We don't do MSVC via autoconf/Make. We used to have a special build framework for MSVC which parsed Makefiles to produce "solution" files, but it was removed as soon as Meson was mature enough to build. See commit 1301c80b2167. If it builds with Meson, you're good. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere."(Lamar Owen)
RE: Popcount optimization using AVX512
> -Original Message- > From: Nathan Bossart > Sent: Thursday, March 28, 2024 2:39 PM > To: Amonson, Paul D > > * The latest patch set from Paul Amonson appeared to support MSVC in the > meson build, but not the autoconf one. I don't have much expertise here, > so the v14 patch doesn't have any autoconf/meson support for MSVC, which > I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds > can always compile the x86_64 popcount code, but I don't know whether > that's safe for AVX512. I also do not know how to integrate MSVC+Autoconf, the CI uses MSVC+Meson+Ninja so I stuck with that. > * I think we need to verify there isn't a huge performance regression for > smaller arrays. IIUC those will still require an AVX512 instruction or > two as well as a function call, which might add some noticeable overhead. Not considering your changes, I had already tested small buffers. At less than 512 bytes there was no measurable regression (there was one extra condition check) and for 512+ bytes it moved from no regression to some gains between 512 and 4096 bytes. Assuming you introduced no extra function calls, it should be the same. > I forgot to mention that I also want to understand whether we can actually > assume availability of XGETBV when CPUID says we support AVX512: You cannot assume as there are edge cases where AVX-512 was found on system one during compile but it's not actually available in a kernel on a second system at runtime despite the CPU actually having the hardware feature. I will review the new patch to see if there are anything that jumps out at me. Thanks, Paul
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote: > Here is a v14 of the patch that I think is beginning to approach something > committable. Besides general review and testing, there are two things that > I'd like to bring up: > > * The latest patch set from Paul Amonson appeared to support MSVC in the > meson build, but not the autoconf one. I don't have much expertise here, > so the v14 patch doesn't have any autoconf/meson support for MSVC, which > I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds > can always compile the x86_64 popcount code, but I don't know whether > that's safe for AVX512. > > * I think we need to verify there isn't a huge performance regression for > smaller arrays. IIUC those will still require an AVX512 instruction or > two as well as a function call, which might add some noticeable overhead. I forgot to mention that I also want to understand whether we can actually assume availability of XGETBV when CPUID says we support AVX512: > + /* > + * We also need to check that the OS has enabled support for > the ZMM > + * registers. > + */ > +#ifdef _MSC_VER > + return (_xgetbv(0) & 0xe0) != 0; > +#else > + uint64 xcr = 0; > + uint32 high; > + uint32 low; > + > +__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr)); > + return (low & 0xe0) != 0; > +#endif -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Here is a v14 of the patch that I think is beginning to approach something committable. Besides general review and testing, there are two things that I'd like to bring up: * The latest patch set from Paul Amonson appeared to support MSVC in the meson build, but not the autoconf one. I don't have much expertise here, so the v14 patch doesn't have any autoconf/meson support for MSVC, which I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds can always compile the x86_64 popcount code, but I don't know whether that's safe for AVX512. * I think we need to verify there isn't a huge performance regression for smaller arrays. IIUC those will still require an AVX512 instruction or two as well as a function call, which might add some noticeable overhead. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9b5725e36aa8cff7caeb8683e11cd09bd5bda745 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v14 1/1] AVX512 popcount support --- config/c-compiler.m4 | 34 +++ configure | 165 + configure.ac | 34 +++ meson.build| 59 src/Makefile.global.in | 1 + src/include/pg_config.h.in | 9 ++ src/include/port/pg_bitutils.h | 20 src/makefiles/meson.build | 1 + src/port/Makefile | 6 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 --- src/port/pg_popcount_avx512.c | 98 12 files changed, 451 insertions(+), 38 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..f881e7ec28 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. 
+# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_AVX512_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..189264b86e 100755 --- a/configure +++ b/configure @@ -647,6 +647,7 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +CFLAGS_AVX512_POPCNT LIBOBJS OPENSSL ZSTD @@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +# Check for x86 cpuid_count instruction +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as_echo "$pgac_cv__get_cpuid_count" >&6; } +if test x"$pgac_cv__get_cpuid_count" = x"yes"; then + +$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h + +fi + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5 $as_echo_n "checking for __cpuid... " >&6; } if ${pgac_cv__cpuid+:} false; then : @@ -17438,6 +17474,135 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5 +$as_echo_n "checking
RE: Popcount optimization using AVX512
> -Original Message- > From: Nathan Bossart > Sent: Wednesday, March 27, 2024 3:00 PM > To: Amonson, Paul D > > ... (I realize that I'm essentially > recanting much of my previous feedback, which I apologize for.) It happens. LOL As long as the algorithm for AVX-512 is not altered I am confident that your new refactor will be fine. :) Thanks, Paul
Re: Popcount optimization using AVX512
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote: > On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: >> Ok, CI turned green after my re-post of the patches. Can this please get >> merged? > > Thanks for the new patches. I intend to take another look soon. Thanks for your patience. I spent most of my afternoon looking into the latest patch set, but I needed to do a CHECKPOINT and take a break. I am in the middle of doing some rather heavy editorialization, but the core of your changes will remain the same (and so I still intend to give you authorship credit). I've attached what I have so far, which is still missing the configuration checks and the changes to make sure the extra compiler flags make it to the right places. Unless something pops up while I work on the remainder of this patch, I think we'll end up going with a simpler approach. I originally set out to make this look like the CRC32C stuff (e.g., a file per implementation), but that seemed primarily useful if we can choose which files need to be compiled at configure-time. However, the TRY_POPCNT_FAST macro is defined at compile-time (AFAICT for good reason [0]), so we end up having to compile all the files in many cases anyway, and we continue to need to surround lots of code with "#ifdef TRY_POPCNT_FAST" or similar. So, my current thinking is that we should only move the AVX512 stuff to its own file for the purposes of compiling it with special flags when possible. (I realize that I'm essentially recanting much of my previous feedback, which I apologize for.) 
[0] https://postgr.es/m/CAApHDvrONNcYxGV6C0O3ZmaL0BvXBWY%2BrBOCBuYcQVUOURwhkA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 031eb4a365665edd304f0281ad7e412341504749 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v13 1/1] AVX512 popcount support --- src/include/port/pg_bitutils.h | 16 +++ src/port/Makefile | 1 + src/port/meson.build | 1 + src/port/pg_bitutils.c | 53 src/port/pg_popcount_avx512.c | 88 ++ 5 files changed, 125 insertions(+), 34 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..4b1e4d92b4 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -298,6 +298,22 @@ pg_ceil_log2_64(uint64 num) #endif #endif +/* + * We can also try to use the AVX512 popcount instruction on some systems. + * The implementation of that is located in its own file because it may + * require special compiler flags that we don't want to apply to any other + * files. 
+ */ +#if defined(TRY_POPCNT_FAST) && \ + defined(HAVE__IMMINTRIN) && \ + defined(HAVE__AVX512_POPCNT) +#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX) +#define TRY_POPCNT_AVX512 1 +extern bool pg_popcount_avx512_available(void); +extern uint64 pg_popcount_avx512(const char *buf, int bytes); +#endif +#endif + #ifdef TRY_POPCNT_FAST /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); diff --git a/src/port/Makefile b/src/port/Makefile index dcc8737e68..eb1e56fe41 100644 --- a/src/port/Makefile +++ b/src/port/Makefile @@ -44,6 +44,7 @@ OBJS = \ noblock.o \ path.o \ pg_bitutils.o \ + pg_popcount_avx512.o \ pg_strong_random.o \ pgcheckdir.o \ pgmkdirp.o \ diff --git a/src/port/meson.build b/src/port/meson.build index 92b593e6ef..c77bbd3168 100644 --- a/src/port/meson.build +++ b/src/port/meson.build @@ -7,6 +7,7 @@ pgport_sources = [ 'noblock.c', 'path.c', 'pg_bitutils.c', + 'pg_popcount_avx512.c', 'pg_strong_random.c', 'pgcheckdir.c', 'pgmkdirp.c', diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 1197696e97..2f9a6690e0 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -142,20 +142,18 @@ pg_popcount_available(void) return (exx[2] & (1 << 23)) != 0; /* POPCNT */ } -/* - * These functions get called on the first call to pg_popcount32 etc. - * They detect whether we can use the asm implementations, and replace - * the function pointers so that subsequent calls are routed directly to - * the chosen implementation. 
- */ -static int -pg_popcount32_choose(uint32 word) +static inline void +choose_popcount_functions(void) { if (pg_popcount_available()) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; pg_popcount = pg_popcount_fast; +#ifdef TRY_POPCNT_AVX512 + if (pg_popcount_avx512_available()) + pg_popcount = pg_popcount_avx512; +#endif } else { @@ -163,45 +161,32 @@ pg_popcount32_choose(uint32 word) pg_popcount64 = pg_popcount64_slow; pg_popcount = pg_popcount_slow; } +} +/* + * These functions get called on the first call to pg_popcount32 etc. + * They detect w
Re: Popcount optimization using AVX512
On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: > Ok, CI turned green after my re-post of the patches. Can this please get > merged? Thanks for the new patches. I intend to take another look soon. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> -Original Message- > From: Amonson, Paul D > Sent: Monday, March 25, 2024 8:20 AM > To: Tom Lane > Cc: David Rowley ; Nathan Bossart > ; Andres Freund ; Alvaro > Herrera ; Shankaran, Akash > ; Noah Misch ; Matthias > van de Meent ; pgsql- > hack...@lists.postgresql.org > Subject: RE: Popcount optimization using AVX512 > Ok, CI turned green after my re-post of the patches. Can this please get merged? Thanks, Paul
Re: Popcount optimization using AVX512
On 3/25/24 11:12, Tom Lane wrote: "Amonson, Paul D" writes: I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time. Just for a note --- the cfbot will re-test existing patches every so often without needing a bump. The current cycle period seems to be about two days. Just an FYI -- there seems to be an issue with all three of the macos cfbot runners (mine included). I spent time over the weekend working with Thomas Munro (added to CC list) trying different fixes to no avail. Help from macos CI wizards would be gratefully accepted... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com