Re: Popcount optimization using AVX512
On Thu, Nov 07, 2024 at 08:38:21PM +0000, Devulapalli, Raghuveer wrote:
>>> Of course, as soon as I committed this, I noticed that it's broken. It
>>> seems that
>>> compilers are rather picky about how multiple target options are specified.
>
> Just curious, which compiler complained?

Clang.

--
nathan
RE: Popcount optimization using AVX512
> Of course, as soon as I committed this, I noticed that it's broken. It seems
> that
> compilers are rather picky about how multiple target options are specified.

Just curious, which compiler complained?

Raghuveer
Re: Popcount optimization using AVX512
On Thu, Nov 07, 2024 at 02:03:04PM -0600, Nathan Bossart wrote:
> Committed.

Of course, as soon as I committed this, I noticed that it's broken.  It
seems that compilers are rather picky about how multiple target options are
specified.  AFAICT the commonly supported syntax is to put the entire list
within one pair of quotes and to use only commas as delimiters, i.e., no
spaces.  I plan to commit the attached shortly once I've had a chance to
verify it fixes the problem on cfbot.

--
nathan

From a5ade18a867377fa424347465bbc5f631eff4f96 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 7 Nov 2024 14:28:57 -0600
Subject: [PATCH 1/1] fix __attribute__((target(...))) usage

---
 config/c-compiler.m4          | 2 +-
 configure                     | 2 +-
 meson.build                   | 2 +-
 src/port/pg_popcount_avx512.c | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index c7eb896f14..a129edb88e 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -733,7 +733,7 @@ AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
 AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
 #endif
 static int popcount_test(void)
 {
diff --git a/configure b/configure
index 3a7332f834..4b01b682b1 100755
--- a/configure
+++ b/configure
@@ -17324,7 +17324,7 @@ else
 /* end confdefs.h.  */
 #include <immintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
 #endif
 static int popcount_test(void)
 {
diff --git a/meson.build b/meson.build
index 9eddd72a27..5b0510cef7 100644
--- a/meson.build
+++ b/meson.build
@@ -2184,7 +2184,7 @@ if host_cpu == 'x86_64'
 #include <immintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
 #endif
 int main(void)
 {
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index b598e86554..1ab2847bf2 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -106,7 +106,7 @@
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
  */
-pg_attribute_target("avx512vpopcntdq", "avx512bw")
+pg_attribute_target("avx512vpopcntdq,avx512bw")
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
@@ -162,7 +162,7 @@
  * pg_popcount_masked_avx512
  *		Returns the number of 1-bits in buf after applying the mask to each byte
  */
-pg_attribute_target("avx512vpopcntdq", "avx512bw")
+pg_attribute_target("avx512vpopcntdq,avx512bw")
 uint64
 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 {
--
2.39.5 (Apple Git-154)
Re: Popcount optimization using AVX512
Committed.

--
nathan
Re: Popcount optimization using AVX512
On Thu, Nov 07, 2024 at 11:12:37AM -0500, Andres Freund wrote:
> One thing I'd like to see this being used for is to elide the indirection
> when the current target platform *already* supports the necessary
> intrinsics. Adding a bunch of indirection for short & common operations is
> decidedly not great. It doesn't have to be part of the same commit, but it
> seems like it's worth doing as part of the same series, as I think it'll
> lead to rather different looking configure checks.

The main hurdle, at least for AVX-512, is that we still need to check (at
runtime) whether the OS supports XGETBV and whether the ZMM registers are
fully enabled.  We might be able to skip those checks in limited cases
(e.g., you are building on the target machine and can perhaps just check it
once at build time), but that probably won't help packagers.

>> +/*
>> + * pg_attribute_target allows specifying different target options that the
>> + * function should be compiled with (e.g., for using special CPU
>> + * instructions).
>> + */
>> +#if __has_attribute (target)
>> +#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
>> +#else
>> +#define pg_attribute_target(...)
>> +#endif
>
> Think it'd be good to mention that there still needs to be a configure
> check to verify that the specific target attribute is understood by the
> compiler.

Will do.

--
nathan
Re: Popcount optimization using AVX512
Hi,

On 2024-11-06 20:26:47 -0600, Nathan Bossart wrote:
> From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart
> Date: Wed, 16 Oct 2024 15:57:55 -0500
> Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff

One thing I'd like to see this being used for is to elide the indirection
when the current target platform *already* supports the necessary
intrinsics. Adding a bunch of indirection for short & common operations is
decidedly not great.

It doesn't have to be part of the same commit, but it seems like it's worth
doing as part of the same series, as I think it'll lead to rather different
looking configure checks.

> diff --git a/src/include/c.h b/src/include/c.h
> index 55dec71a6d..6f5ca25542 100644
> --- a/src/include/c.h
> +++ b/src/include/c.h
> @@ -174,6 +174,16 @@
>  #define pg_attribute_nonnull(...)
>  #endif
>
> +/*
> + * pg_attribute_target allows specifying different target options that the
> + * function should be compiled with (e.g., for using special CPU
> + * instructions).
> + */
> +#if __has_attribute (target)
> +#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
> +#else
> +#define pg_attribute_target(...)
> +#endif

Think it'd be good to mention that there still needs to be a configure check
to verify that the specific target attribute is understood by the compiler.

Greetings,

Andres Freund
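[Editor's note: the macro discussed above can be exercised standalone.  The
following is a minimal, illustrative sketch, not PostgreSQL's actual code
(the real definition lives in src/include/c.h); the extra
defined(__has_attribute) guard and the "popcnt" target string are this
sketch's own choices, and the function name is made up.]

```c
#include <stdint.h>

/*
 * Sketch of the proposed pg_attribute_target macro.  The outer
 * defined(__has_attribute) test avoids problems on compilers (e.g., MSVC)
 * that lack __has_attribute entirely; on anything else without attribute
 * support, the macro expands to nothing.
 */
#if defined(__has_attribute)
#if __has_attribute(target)
#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
#endif
#endif
#ifndef pg_attribute_target
#define pg_attribute_target(...)
#endif

/*
 * On x86-64, enable the near-universal POPCNT instruction for just this
 * function; a real AVX-512 user would pass "avx512vpopcntdq,avx512bw"
 * instead (as one comma-separated string, per the fix discussed upthread).
 * Elsewhere the attribute expands to nothing and the compiler emits its
 * portable fallback for the builtin.
 */
#ifdef __x86_64__
pg_attribute_target("popcnt")
#endif
static int
popcount64(uint64_t x)
{
	return __builtin_popcountll(x);
}
```

Note that this only shows the per-function mechanism; as Andres points out,
a configure-time probe is still needed to confirm the compiler understands
the specific target string.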
Re: Popcount optimization using AVX512
rebased -- nathan >From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 64 +- configure| 167 +++ configure.ac | 17 +-- meson.build | 21 ++-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 11 files changed, 183 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..c7eb896f14 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,22 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. 
AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("xsave"))) +#endif +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +727,29 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("avx512vpopcntdq","avx512bw"))) +#endif +static int popcount_test(void) +{ + const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + return (int) popcnt; +}], + [return popcount_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_POPCNT="$1" pgac_avx512_popcnt_intrinsics=yes fi undefine([Ac_cachevar])dnl diff --git a/configure b/configure index 6e256b417b..3a7332f834 100755 --- a/configure +++ b/configure @@ -647,9 +647,6 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC -PG_POPCNT_OBJS -CFLAGS_POPCNT -CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17272,185 +17269,103 @@ fi # Check for XSAVE intrinsics # 
-CFLAGS_XSAVE="" -{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5 -$as_echo_n "ch
Re: Popcount optimization using AVX512
On Thu, Oct 31, 2024 at 07:58:06PM +0000, Devulapalli, Raghuveer wrote:
> LGTM.

Thanks.  Barring additional feedback, I plan to commit this soon.

--
nathan
Re: Popcount optimization using AVX512
On Wed, Oct 30, 2024 at 04:10:10PM -0500, Nathan Bossart wrote: > On Wed, Oct 30, 2024 at 08:53:10PM +, Raghuveer Devulapalli wrote: >> BTW, I just realized function attributes for xsave and avx512 don't work >> on MSVC (see >> https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). >> Not sure if you care about it. Its an easy fix (see >> https://gcc.godbolt.org/z/Pebdj3vMx). > > Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the > configure programs for meson. pg_attribute_target will be empty on MSVC, > and I believe we only support meson builds there. Here is an updated patch with this change. -- nathan >From 8cf7c08220a9c0a1dec809794af2dfb719981923 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v2 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 60 +- configure| 163 ++- configure.ac | 17 +-- meson.build | 21 ++-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 - 11 files changed, 175 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..aa90f8ef33 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. 
AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("xsave"))) +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("avx512vpopcntdq","avx512bw"))) +static int popcount_test(void) +{ + const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + return (int) popcnt; +}], + [return popcount_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cach
RE: Popcount optimization using AVX512
> Here is an updated patch with this change.

LGTM.

Raghuveer
RE: Popcount optimization using AVX512
> Oh, good catch.  IIUC we only need to check for #ifndef _MSC_VER in the
> configure programs for meson.  pg_attribute_target will be empty on MSVC,
> and I believe we only support meson builds there.

Right. __has_attribute (target) produces a compiler warning on MSVC:
https://gcc.godbolt.org/z/EfWGxbvj3. Might need to guard that with
#if defined(__has_attribute) to get rid of it.

>
> --
> nathan
Re: Popcount optimization using AVX512
On Wed, Oct 30, 2024 at 08:53:10PM +0000, Raghuveer Devulapalli wrote:
> BTW, I just realized function attributes for xsave and avx512 don't work
> on MSVC (see
> https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630).
> Not sure if you care about it. It's an easy fix (see
> https://gcc.godbolt.org/z/Pebdj3vMx).

Oh, good catch.  IIUC we only need to check for #ifndef _MSC_VER in the
configure programs for meson.  pg_attribute_target will be empty on MSVC,
and I believe we only support meson builds there.

--
nathan
Re: Popcount optimization using AVX512
BTW, I just realized function attributes for xsave and avx512 don't work on
MSVC (see
https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630).
Not sure if you care about it.  It's an easy fix (see
https://gcc.godbolt.org/z/Pebdj3vMx).
Re: Popcount optimization using AVX512
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, failed
Spec compliant:           tested, failed
Documentation:            tested, failed

Changes LGTM. Makes the Makefile look clean. Built and ran tests with
`make check` with gcc-13 on an ICX and gcc-11 on SKX. I built on top of
this patch and converted SSE4.2 and AVX-512 CRC32C to use function
attributes too.

The new status of this patch is: Ready for Committer
Re: Popcount optimization using AVX512
On Tue, Oct 08, 2024 at 09:36:03PM -0500, Nathan Bossart wrote:
> On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote:
>> On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
>>> I think we'd be better off enabling architectural features on a
>>> per-function basis, roughly like this:
>>>
>>> [...]
>>>
>>> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq
>>> -mavx512bw support */
>>> pg_enable_target("avx512vpopcntdq,avx512bw")
>>> uint64_t
>>> pg_popcount_avx512(const char *buf, int bytes)
>>> ...
>>
>> I remember wondering why the CRC-32C code wasn't already doing something
>> like this (old compiler versions? non-gcc-like compilers?), and I'm not
>> sure I ever discovered the reason, so out of an abundance of caution I
>> used the same approach for AVX-512.  If we can convince ourselves that
>> __attribute__((target("..."))) is standard enough at this point, +1 for
>> moving to that.
>
> [...]
>
> So, at least for the CRC code, __attribute__((target("..."))) was probably
> not widely available enough yet when it was first added.  Unfortunately,
> the ARMv8 CRC target support (without -march) is still pretty new, but it
> might be possible to switch the others to a per-function approach in v18.

Here is a first attempt at using __attribute__((target("..."))) for the
AVX-512 stuff.  Besides allowing us to consolidate the code into a single
file, this simplifies the build file changes quite a bit.
-- nathan >From c97e25e56347c90f169a5ce069a9ea06c873915b Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v1 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 60 +- configure| 163 ++- configure.ac | 17 +-- meson.build | 17 +-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 - 11 files changed, 171 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..aa90f8ef33 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. 
AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("xsave"))) +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics]
Re: Popcount optimization using AVX512
On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote:
> On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
>> I think we'd be better off enabling architectural features on a
>> per-function basis, roughly like this:
>>
>> [...]
>>
>> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq
>> -mavx512bw support */
>> pg_enable_target("avx512vpopcntdq,avx512bw")
>> uint64_t
>> pg_popcount_avx512(const char *buf, int bytes)
>> ...
>
> I remember wondering why the CRC-32C code wasn't already doing something
> like this (old compiler versions? non-gcc-like compilers?), and I'm not
> sure I ever discovered the reason, so out of an abundance of caution I used
> the same approach for AVX-512.  If we can convince ourselves that
> __attribute__((target("..."))) is standard enough at this point, +1 for
> moving to that.

I looked into this some more, and found the following:

* We added SSE 4.2 CRC support in April 2015 (commit 3dc2d62).  gcc support
  for __attribute__((target("sse4.2"))) was added in 4.9.0 (April 2014).
  clang support was added in 3.8 (March 2016).

* We added ARMv8 CRC support in April 2018 (commit f044d71).  gcc support
  for __attribute__((target("+crc"))) was added in 6.3 (December 2016).  I
  didn't find precisely when clang support was added, but until 16.0.0
  (March 2023), including arm_acle.h requires the -march flag [0], and you
  had to use "crc" (plus sign omitted) as the target [1].

* We added AVX-512 support in April 2024 (commit 792752a).  gcc support for
  __attribute__((target("avx512vpopcntdq,avx512bw"))) was added in 7.1 (May
  2017).  clang support was added in 5.0.0 (September 2017).  However, the
  "xsave" target was not supported until 9.1 for gcc (May 2019) and 9.0.0
  for clang (September 2019), and we need that for our AVX-512 code, too.

So, at least for the CRC code, __attribute__((target("..."))) was probably
not widely available enough yet when it was first added.  Unfortunately,
the ARMv8 CRC target support (without -march) is still pretty new, but it
might be possible to switch the others to a per-function approach in v18.

[0] https://github.com/llvm/llvm-project/commit/30b67c6
[1] https://releases.llvm.org/16.0.0/tools/clang/docs/ReleaseNotes.html#arm-and-aarch64-support

--
nathan
Re: Popcount optimization using AVX512
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
> On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
>> As I started on this, I remembered why I needed it.  The file
>> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in
>> order to avoid inadvertently issuing any AVX-512 instructions before
>> determining we have support.  If that's not a concern, we could still
>> probably remove the XSAVE check.
>
> I think it's a valid concern - but isn't that theoretically also an issue
> with xsave itself? I guess practically the compiler won't do that, because
> there's no practical reason to emit any instructions enabled by -mxsave
> (in contrast to e.g. -mavx, which does trigger gcc to emit different
> instructions even for basic math).

Yeah, this crossed my mind.  It's certainly not the sturdiest of
assumptions...

> I think enabling options like these on a per-translation-unit basis isn't
> really a scalable approach. To actually be safe there could only be a
> single function in each TU and that function could only be called after a
> cpuid check performed in a separate TU. That a) ends up pretty unreadable
> b) requires functions to be implemented in .c files, which we really
> don't want for some of this.

Agreed.

> I think we'd be better off enabling architectural features on a
> per-function basis, roughly like this:
>
> [...]
>
> /* FIXME: Should be gated by configure check of -mavx512vpopcntdq
> -mavx512bw support */
> pg_enable_target("avx512vpopcntdq,avx512bw")
> uint64_t
> pg_popcount_avx512(const char *buf, int bytes)
> ...

I remember wondering why the CRC-32C code wasn't already doing something
like this (old compiler versions? non-gcc-like compilers?), and I'm not
sure I ever discovered the reason, so out of an abundance of caution I used
the same approach for AVX-512.  If we can convince ourselves that
__attribute__((target("..."))) is standard enough at this point, +1 for
moving to that.

--
nathan
Re: Popcount optimization using AVX512
Hi,

On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
>> On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
>>> My point is that _xgetbv() is made available by -mavx512vpopcntdq
>>> -mavx512bw alone, without needing -mxsave:
>>
>> Oh, I see.  I'll work on a patch to remove that compiler check, then...
>
> As I started on this, I remembered why I needed it.  The file
> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
> to avoid inadvertently issuing any AVX-512 instructions before determining
> we have support.  If that's not a concern, we could still probably remove
> the XSAVE check.

I think it's a valid concern - but isn't that theoretically also an issue
with xsave itself? I guess practically the compiler won't do that, because
there's no practical reason to emit any instructions enabled by -mxsave (in
contrast to e.g. -mavx, which does trigger gcc to emit different
instructions even for basic math).

I think this is one of the few instances where msvc has the right approach -
if I use intrinsics to emit a specific instruction, the intrinsic should do
so, regardless of whether the compiler is allowed to do so on its own.

I think enabling options like these on a per-translation-unit basis isn't
really a scalable approach. To actually be safe there could only be a single
function in each TU and that function could only be called after a cpuid
check performed in a separate TU. That a) ends up pretty unreadable b)
requires functions to be implemented in .c files, which we really don't want
for some of this.

I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:

https://godbolt.org/z/a4q9Gc6Ez

For posterity, in the unlikely case anybody reads this after godbolt shuts
down: I'm thinking we'd have an attribute like this:

/*
 * GCC like compilers don't support intrinsics without those intrinsics
 * explicitly having been enabled. We can't just add these options more
 * widely, as that allows the compiler to emit such instructions more
 * widely, even if we gate reaching the code using intrinsics. So we just
 * enable the relevant support for individual functions.
 *
 * In contrast to this, msvc allows use of intrinsics independent of what
 * the compiler otherwise is allowed to emit.
 */
#ifdef __GNUC__
#define pg_enable_target(foo) __attribute__ ((__target__ (foo)))
#else
#define pg_enable_target(foo)
#endif

and then use that selectively for some functions:

/* FIXME: Should be gated by configure check of -mavx512vpopcntdq
-mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...

Greetings,

Andres Freund
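[Editor's note: to make the "choose once, then call through a pointer"
indirection being criticized here concrete, the following is a hedged,
self-contained sketch.  The function names, the stubbed-out availability
check, and the elision macro at the end are all illustrative, not
PostgreSQL's actual code; a real build would gate the choice on
CPUID/XGETBV and compile the fast path with the target attribute.]

```c
#include <stdint.h>
#include <stddef.h>

/* Always-available scalar implementation. */
static uint64_t
pg_popcount_slow(const char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		while (b)
		{
			cnt += b & 1;
			b >>= 1;
		}
	}
	return cnt;
}

/*
 * Stand-in for the runtime probe; a real build would run CPUID and XGETBV
 * here, and the AVX-512 variant would be compiled with
 * pg_attribute_target("avx512vpopcntdq,avx512bw").
 */
static int
avx512_available(void)
{
	return 0;					/* pretend the check failed */
}

static uint64_t pg_popcount_choose(const char *buf, size_t bytes);

/* First call goes through the chooser, which then caches its pick. */
static uint64_t (*pg_popcount_ptr) (const char *, size_t) = pg_popcount_choose;

static uint64_t
pg_popcount_choose(const char *buf, size_t bytes)
{
	if (avx512_available())
		pg_popcount_ptr = pg_popcount_slow;	/* would be the AVX-512 version */
	else
		pg_popcount_ptr = pg_popcount_slow;
	return pg_popcount_ptr(buf, bytes);
}

#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512BW__)
/*
 * The compilation target (-march=...) already guarantees the intrinsics,
 * so the indirection could be elided and pg_popcount() could call the
 * AVX-512 implementation directly.  (Stand-in: this sketch has no AVX-512
 * body, so it still points at the scalar one.)
 */
#define pg_popcount(buf, bytes) pg_popcount_slow(buf, bytes)
#else
#define pg_popcount(buf, bytes) pg_popcount_ptr(buf, bytes)
#endif
```

Every call after the first then pays only one indirect call; the `#if`
shows where the indirection could be dropped entirely when the baseline is
known at compile time, which is the elision being requested.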
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
>> My point is that _xgetbv() is made available by -mavx512vpopcntdq
>> -mavx512bw alone, without needing -mxsave:
>
> Oh, I see.  I'll work on a patch to remove that compiler check, then...

As I started on this, I remembered why I needed it.  The file
pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
to avoid inadvertently issuing any AVX-512 instructions before determining
we have support.  If that's not a concern, we could still probably remove
the XSAVE check.

--
nathan
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
> On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote:
>> The main purpose of the XSAVE compiler check is to determine whether we
>> need to add -mxsave in order to use _xgetbv() [0].  If that wasn't a
>> factor, we could probably skip it.  Earlier versions of the patch used
>> inline assembly in the non-MSVC path to call XGETBV, which I was trying
>> to avoid.
>
> My point is that _xgetbv() is made available by -mavx512vpopcntdq
> -mavx512bw alone, without needing -mxsave:

Oh, I see.  I'll work on a patch to remove that compiler check, then...

--
nathan
Re: Popcount optimization using AVX512
Hi,

On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote:
>> On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote:
>>> On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote:
>>>> Why are we actually checking for xsave? We're not using xsave itself
>>>> and I couldn't find a comment in 792752af4eb5 explaining what we're
>>>> using it as a proxy for? Is that just to know if _xgetbv() exists? Is
>>>> it actually possible that xsave isn't available when avx512 is?
>>>
>>> Yes, it's to verify we have XGETBV, which IIUC requires support from
>>> both the processor and the OS (see 598e011 and upthread discussion).
>>> AFAIK the way we are detecting AVX-512 support is quite literally
>>> by-the-book unless I've gotten something wrong.
>>
>> I'm basically wondering whether we need to check for compiler (not OS
>> support) support for xsave if we also check for -mavx512vpopcntdq
>> -mavx512bw support. Afaict the latter implies support for xsave.
>
> The main purpose of the XSAVE compiler check is to determine whether we
> need to add -mxsave in order to use _xgetbv() [0].  If that wasn't a
> factor, we could probably skip it.  Earlier versions of the patch used
> inline assembly in the non-MSVC path to call XGETBV, which I was trying
> to avoid.

My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw
alone, without needing -mxsave:

echo -e '#include <immintrin.h>\nint main() { return _xgetbv(0) & 0xe0; }' | time gcc -march=x86-64 -c -xc - -o /dev/null
-> fails

echo -e '#include <immintrin.h>\nint main() { return _xgetbv(0) & 0xe0;}' | time gcc -march=x86-64 -mavx512vpopcntdq -mavx512bw -c -xc - -o /dev/null
-> succeeds

Greetings,

Andres Freund
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote: > On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: >> On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: >> > Why are we actually checking for xsave? We're not using xsave itself and I >> > couldn't find a comment in 792752af4eb5 explaining what we're using it as a >> > proxy for? Is that just to know if _xgetbv() exists? Is it actually >> > possible >> > that xsave isn't available when avx512 is? >> >> Yes, it's to verify we have XGETBV, which IIUC requires support from both >> the processor and the OS (see 598e011 and upthread discussion). AFAIK the >> way we are detecting AVX-512 support is quite literally by-the-book unless >> I've gotten something wrong. > > I'm basically wondering whether we need to check for compiler (not OS support) > support for xsave if we also check for -mavx512vpopcntdq -mavx512bw > support. Afaict the latter implies support for xsave. The main purpose of the XSAVE compiler check is to determine whether we need to add -mxsave in order to use _xgetbv() [0]. If that wasn't a factor, we could probably skip it. Earlier versions of the patch used inline assembly in the non-MSVC path to call XGETBV, which I was trying to avoid. [0] https://postgr.es/m/20240330032209.GA2018686%40nathanxps13 -- nathan
Re: Popcount optimization using AVX512
Hi, On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > > Ah, I somehow thought we'd avoid the runtime check in case we determine at > > compile time we don't need any extra flags to enable the AVX512 stuff > > (similar > > to how we deal with crc32). But it looks like that's not the case - which > > seems pretty odd to me: > > > > This turns something that can be a single instruction into an indirect > > function call, even if we could know that it's guaranteed to be available > > for > > the compilation target, due to -march= > > > > It's one thing for the avx512 path to have that overhead, but it's > > particularly absurd for pg_popcount32/pg_popcount64, where > > > > a) The function call overhead is a larger proportion of the cost. > > b) the instruction is almost universally available, including in the > >architecture baseline x86-64-v2, which several distros are using as the > >x86-64 baseline. > > Yeah, pg_popcount32/64 have been doing this since v12 (02a6a54). Until v17 > (cc4826d), pg_popcount() repeatedly calls these function pointers, too. I > think it'd be awesome if we could start requiring some of these "almost > universally available" instructions, but AFAICT that brings its own > complexity [0]. I'll respond there... > > Why are we actually checking for xsave? We're not using xsave itself and I > > couldn't find a comment in 792752af4eb5 explaining what we're using it as a > > proxy for? Is that just to know if _xgetbv() exists? Is it actually > > possible > > that xsave isn't available when avx512 is? > > Yes, it's to verify we have XGETBV, which IIUC requires support from both > the processor and the OS (see 598e011 and upthread discussion). AFAIK the > way we are detecting AVX-512 support is quite literally by-the-book unless > I've gotten something wrong. 
I'm basically wondering whether we need to check for compiler (not OS support) support for xsave if we also check for -mavx512vpopcntdq -mavx512bw support. Afaict the latter implies support for xsave. andres@alap6:~$ echo|gcc -c - -march=x86-64 -xc -dM -E - -o -|grep '__XSAVE__' andres@alap6:~$ echo|gcc -c - -march=x86-64 -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep '__XSAVE__' #define __XSAVE__ 1 #define __XSAVE__ 1 Greetings, Andres Freund
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > Ah, I somehow thought we'd avoid the runtime check in case we determine at > compile time we don't need any extra flags to enable the AVX512 stuff (similar > to how we deal with crc32). But it looks like that's not the case - which > seems pretty odd to me: > > This turns something that can be a single instruction into an indirect > function call, even if we could know that it's guaranteed to be available for > the compilation target, due to -march= > > It's one thing for the avx512 path to have that overhead, but it's > particularly absurd for pg_popcount32/pg_popcount64, where > > a) The function call overhead is a larger proportion of the cost. > b) the instruction is almost universally available, including in the >architecture baseline x86-64-v2, which several distros are using as the >x86-64 baseline. Yeah, pg_popcount32/64 have been doing this since v12 (02a6a54). Until v17 (cc4826d), pg_popcount() repeatedly calls these function pointers, too. I think it'd be awesome if we could start requiring some of these "almost universally available" instructions, but AFAICT that brings its own complexity [0]. > Why are we actually checking for xsave? We're not using xsave itself and I > couldn't find a comment in 792752af4eb5 explaining what we're using it as a > proxy for? Is that just to know if _xgetbv() exists? Is it actually possible > that xsave isn't available when avx512 is? Yes, it's to verify we have XGETBV, which IIUC requires support from both the processor and the OS (see 598e011 and upthread discussion). AFAIK the way we are detecting AVX-512 support is quite literally by-the-book unless I've gotten something wrong. [0] https://postgr.es/m/ZmpG2ZzT30Q75BZO%40nathan -- nathan
Re: Popcount optimization using AVX512
On Wed, Jul 31, 2024 at 12:50 PM Andres Freund wrote: > It's one thing for the avx512 path to have that overhead, but it's > particularly absurd for pg_popcount32/pg_popcount64, where > > a) The function call overhead is a larger proportion of the cost. > b) the instruction is almost universally available, including in the >architecture baseline x86-64-v2, which several distros are using as the >x86-64 baseline. FWIW, another recent thread about that: https://www.postgresql.org/message-id/flat/CA%2BhUKGKS64zJezV9y9mPcB-J0i%2BfLGiv3FAdwSH_3SCaVdrjyQ%40mail.gmail.com
Re: Popcount optimization using AVX512
Hi, On 2024-07-30 16:32:07 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > > Now, a reasonable counter-argument would be that only some of these macros > > are > > defined for msvc ([1]). However, as it turns out, the test is broken > > today, as msvc doesn't error out when using an intrinsic that's not > > "available" by the target architecture, it seems to assume that the caller > > did > > a cpuid check ahead of time. > > > > > > Check out [2], it shows the various predefined macros for gcc, clang and > > msvc. > > > > > > ISTM that the msvc checks for xsave/avx512 being broken should be an open > > item? > > I'm not following this one. At the moment, we always do a runtime check > for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and > set the function pointers appropriately, right? We could, of course, still > fix it, though. Ah, I somehow thought we'd avoid the runtime check in case we determine at compile time we don't need any extra flags to enable the AVX512 stuff (similar to how we deal with crc32). But it looks like that's not the case - which seems pretty odd to me: This turns something that can be a single instruction into an indirect function call, even if we could know that it's guaranteed to be available for the compilation target, due to -march= It's one thing for the avx512 path to have that overhead, but it's particularly absurd for pg_popcount32/pg_popcount64, where a) The function call overhead is a larger proportion of the cost. b) the instruction is almost universally available, including in the architecture baseline x86-64-v2, which several distros are using as the x86-64 baseline. Why are we actually checking for xsave? We're not using xsave itself and I couldn't find a comment in 792752af4eb5 explaining what we're using it as a proxy for? Is that just to know if _xgetbv() exists? Is it actually possible that xsave isn't available when avx512 is? Greetings, Andres Freund
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 04:32:07PM -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: >> Afaict we could just check for predefined preprocessor macros: >> >> echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o >> -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' >> #define __AVX512BW__ 1 >> #define __AVX512VPOPCNTDQ__ 1 >> #define __XSAVE__ 1 >> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata >> 13292maxresident)k >> >> echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E >> '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' >> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata >> 10972maxresident)k > > Seems promising. I can't think of a reason that wouldn't work. > >> Now, a reasonable counter-argument would be that only some of these macros >> are >> defined for msvc ([1]). However, as it turns out, the test is broken >> today, as msvc doesn't error out when using an intrinsic that's not >> "available" by the target architecture, it seems to assume that the caller >> did >> a cpuid check ahead of time. Hm. Upon further inspection, I see that MSVC appears to be missing __XSAVE__ and __AVX512VPOPCNTDQ__, which is unfortunate. Still, I think the worst case scenario is that the CPUID check fails and we don't use AVX-512 instructions. AFAICT we aren't adding new function pointers in any builds that don't already have them, just compiling some extra unused code. -- nathan
Re: Popcount optimization using AVX512
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > I've noticed that the configure probes for this are quite slow - pretty much > the slowest step in a meson setup (and autoconf is similar). While looking > into this, I also noticed that afaict the tests don't do the right thing for > msvc. > > ... > [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES > [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES > [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES > [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO > [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES > [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : > links: NO > [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : > links: YES > [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : > links: NO > [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: > YES > [9.367] Checking if "x86_64: popcntq instruction" compiles: YES > [9.382] Has header "atomic.h" : NO > ... > > (the times here are a bit exaggerated, enabling them in meson also turns on > python profiling, which makes everything a bit slower) > > > Looks like this is largely the fault of including immintrin.h: > > echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time > gcc -mxsave -xc - -o /dev/null > 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata > 94184maxresident)k > > echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null > 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata > 86004maxresident)k Interesting. Thanks for bringing this to my attention. > Do we really need to link the generated programs? If we instead were able to > just rely on the preprocessor, it'd be vastly faster. 
> > The __sync* and __atomic* checks actually need to link, as the compiler ends > up generating calls to unimplemented functions if the compilation target > doesn't support some operation natively - but I don't think that's true for > the xsave/avx512 stuff > > Afaict we could just check for predefined preprocessor macros: > > echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep > -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' > #define __AVX512BW__ 1 > #define __AVX512VPOPCNTDQ__ 1 > #define __XSAVE__ 1 > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata > 13292maxresident)k > > echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E > '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata > 10972maxresident)k Seems promising. I can't think of a reason that wouldn't work. > Now, a reasonable counter-argument would be that only some of these macros are > defined for msvc ([1]). However, as it turns out, the test is broken > today, as msvc doesn't error out when using an intrinsic that's not > "available" by the target architecture, it seems to assume that the caller did > a cpuid check ahead of time. > > > Check out [2], it shows the various predefined macros for gcc, clang and msvc. > > > ISTM that the msvc checks for xsave/avx512 being broken should be an open > item? I'm not following this one. At the moment, we always do a runtime check for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and set the function pointers appropriately, right? We could, of course, still fix it, though. -- nathan
Re: Popcount optimization using AVX512
Hi, On 2024-04-23 11:02:07 -0500, Nathan Bossart wrote: > On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > > Makes sense, thanks. I'm planning to commit this fix sometime early next > > week. > > Committed. I've noticed that the configure probes for this are quite slow - pretty much the slowest step in a meson setup (and autoconf is similar). While looking into this, I also noticed that afaict the tests don't do the right thing for msvc. ... [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES [9.367] Checking if "x86_64: popcntq instruction" compiles: YES [9.382] Has header "atomic.h" : NO ... (the times here are a bit exaggerated, enabling them in meson also turns on python profiling, which makes everything a bit slower) Looks like this is largely the fault of including immintrin.h: echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k Do we really need to link the generated programs? If we instead were able to just rely on the preprocessor, it'd be vastly faster. 
The __sync* and __atomic* checks actually need to link, as the compiler ends up generating calls to unimplemented functions if the compilation target doesn't support some operation natively - but I don't think that's true for the xsave/avx512 stuff Afaict we could just check for predefined preprocessor macros: echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' #define __AVX512BW__ 1 #define __AVX512VPOPCNTDQ__ 1 #define __XSAVE__ 1 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k Now, a reasonable counter-argument would be that only some of these macros are defined for msvc ([1]). However, as it turns out, the test is broken today, as msvc doesn't error out when using an intrinsic that's not "available" by the target architecture, it seems to assume that the caller did a cpuid check ahead of time. Check out [2], it shows the various predefined macros for gcc, clang and msvc. ISTM that the msvc checks for xsave/avx512 being broken should be an open item? Greetings, Andres [1] https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170 [2] https://godbolt.org/z/c8Kj8r3PK
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > Makes sense, thanks. I'm planning to commit this fix sometime early next > week. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 10:11:08PM +, Devulapalli, Raghuveer wrote: >> On that note, is it necessary to also check for avx512f? At the moment, >> we are assuming that's supported if the other AVX-512 instructions are >> available. > > No, it's not needed. There are no CPU's with avx512bw/avx512popcnt > without avx512f. Unfortunately though, avx512popcnt does not mean > avx512bw (I think the deprecated Xeon Phi processors falls in this > category) which is why we need both. Makes sense, thanks. I'm planning to commit this fix sometime early next week. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> On that note, is it necessary to also check for avx512f? At the moment, we > are assuming that's supported if the other AVX-512 instructions are available. No, it's not needed. There are no CPUs with avx512bw/avx512popcnt without avx512f. Unfortunately though, avx512popcnt does not mean avx512bw (I think the deprecated Xeon Phi processors fall into this category), which is why we need both.
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 09:29:55PM +, Devulapalli, Raghuveer wrote: > (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise > zmm_regs_available() will return false.. Yes, that's a mistake. I fixed that in v3. > (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the > same cpuid leaf. You could combine them into one to avoid running cpuid > twice. My apologies, I should have mentioned this before.. Good call. The byte-and-word instructions were a late addition to the patch, so I missed this originally. On that note, is it necessary to also check for avx512f? At the moment, we are assuming that's supported if the other AVX-512 instructions are available. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From e04c348eb389c6aa1597ac35d57b5e7ae7075381 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 18 Apr 2024 15:57:56 -0500 Subject: [PATCH v3 1/1] osxsave --- src/port/pg_popcount_avx512_choose.c | 80 1 file changed, 47 insertions(+), 33 deletions(-) diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c index ae3fa3d306..b37107803a 100644 --- a/src/port/pg_popcount_avx512_choose.c +++ b/src/port/pg_popcount_avx512_choose.c @@ -34,39 +34,13 @@ #ifdef TRY_POPCNT_FAST /* - * Returns true if the CPU supports the instructions required for the AVX-512 - * pg_popcount() implementation. + * Does CPUID say there's support for XSAVE instructions? */ -bool -pg_popcount_avx512_available(void) +static inline bool +xsave_available(void) { unsigned int exx[4] = {0, 0, 0, 0}; - /* Does CPUID say there's support for AVX-512 popcount instructions? 
*/ -#if defined(HAVE__GET_CPUID_COUNT) - __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUIDEX) - __cpuidex(exx, 7, 0); -#else -#error cpuid instruction not available -#endif - if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */ - return false; - - /* Does CPUID say there's support for AVX-512 byte and word instructions? */ - memset(exx, 0, sizeof(exx)); -#if defined(HAVE__GET_CPUID_COUNT) - __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUIDEX) - __cpuidex(exx, 7, 0); -#else -#error cpuid instruction not available -#endif - if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */ - return false; - - /* Does CPUID say there's support for XSAVE instructions? */ - memset(exx, 0, sizeof(exx)); #if defined(HAVE__GET_CPUID) __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID) @@ -74,15 +48,55 @@ pg_popcount_avx512_available(void) #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 26)) == 0) /* xsave */ - return false; + return (exx[2] & (1 << 27)) != 0; /* osxsave */ +} - /* Does XGETBV say the ZMM registers are enabled? */ +/* + * Does XGETBV say the ZMM registers are enabled? + * + * NB: Caller is responsible for verifying that xsave_available() returns true + * before calling this. + */ +static inline bool +zmm_regs_available(void) +{ #ifdef HAVE_XSAVE_INTRINSICS - return (_xgetbv(0) & 0xe0) != 0; + return (_xgetbv(0) & 0xe6) == 0xe6; #else return false; #endif } +/* + * Does CPUID say there's support for AVX-512 popcount and byte-and-word + * instructions? 
+ */ +static inline bool +avx512_popcnt_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); +#else +#error cpuid instruction not available +#endif + return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */ + (exx[1] & (1 << 30)) != 0; /* avx512-bw */ +} + +/* + * Returns true if the CPU supports the instructions required for the AVX-512 + * pg_popcount() implementation. + */ +bool +pg_popcount_avx512_available(void) +{ + return xsave_available() && + zmm_regs_available() && + avx512_popcnt_available(); +} + #endif /* TRY_POPCNT_FAST */ -- 2.25.1
RE: Popcount optimization using AVX512
> Thanks for the feedback. I've attached an updated patch. (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise zmm_regs_available() will return false. (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same cpuid leaf. You could combine them into one to avoid running cpuid twice. My apologies, I should have mentioned this before.
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 08:24:03PM +, Devulapalli, Raghuveer wrote: >> This seems to contradict the note about doing step 3 at any point, and >> given step 1 is the OSXSAVE check, I'm not following what this means, >> anyway. > > It is recommended that you run the xgetbv code before you check for cpu > features avx512-popcnt and avx512-bw. The way it is written now is the > opposite order. I would also recommend splitting the cpuid feature check > for avx512popcnt/avx512bw and xgetbv section into separate functions to > make them modular. Something like: > > static inline > int check_os_avx512_support(void) > { > // (1) run cpuid leaf 1 to check for xgetbv instruction support: > unsigned int exx[4] = {0, 0, 0, 0}; > __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); > if ((exx[2] & (1 << 27)) == 0) /* xsave */ > return false; > > /* Does XGETBV say the ZMM/YMM/XMM registers are enabled? */ > return (_xgetbv(0) & 0xe0) == 0xe0; > } > >> I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6 >> instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower >> half of some of the ZMM registers is stored in the SSE and AVX state >> [0]. I don't know how likely it is that 0xe0 would succeed but 0xe6 >> wouldn't, but we might as well make it correct. > > This is correct. It needs to check all the 3 bits (XMM/YMM and ZMM). The > way it is written is now is in-correct. Thanks for the feedback. I've attached an updated patch. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From d20b19804a17d9f6eab1d40de7e9fb10488ac6b0 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 18 Apr 2024 15:57:56 -0500 Subject: [PATCH v2 1/1] osxsave --- src/port/pg_popcount_avx512_choose.c | 89 +++- 1 file changed, 60 insertions(+), 29 deletions(-) diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c index ae3fa3d306..009f94909a 100644 --- a/src/port/pg_popcount_avx512_choose.c +++ b/src/port/pg_popcount_avx512_choose.c @@ -34,27 +34,47 @@ #ifdef TRY_POPCNT_FAST /* - * Returns true if the CPU supports the instructions required for the AVX-512 - * pg_popcount() implementation. + * Does CPUID say there's support for XSAVE instructions? */ -bool -pg_popcount_avx512_available(void) +static inline bool +xsave_available(void) { unsigned int exx[4] = {0, 0, 0, 0}; - /* Does CPUID say there's support for AVX-512 popcount instructions? */ -#if defined(HAVE__GET_CPUID_COUNT) - __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUIDEX) - __cpuidex(exx, 7, 0); +#if defined(HAVE__GET_CPUID) + __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUID) + __cpuid(exx, 1); #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */ - return false; + return (exx[2] & (1 << 27)) != 0; /* osxsave */ +} + +/* + * Does XGETBV say the ZMM registers are enabled? + * + * NB: Caller is responsible for verifying that xsave_available() returns true + * before calling this. + */ +static inline bool +zmm_regs_available(void) +{ +#ifdef HAVE_XSAVE_INTRINSICS + return (_xgetbv(0) & 0xe6) != 0xe6; +#else + return false; +#endif +} + +/* + * Does CPUID say there's support for AVX-512 popcount instructions? + */ +static inline bool +avx512_popcnt_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; - /* Does CPUID say there's support for AVX-512 byte and word instructions? 
*/ - memset(exx, 0, sizeof(exx)); #if defined(HAVE__GET_CPUID_COUNT) __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUIDEX) @@ -62,27 +82,38 @@ pg_popcount_avx512_available(void) #else #error cpuid instruction not available #endif - if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */ - return false; + return (exx[2] & (1 << 14)) != 0; /* avx512-vpopcntdq */ +} - /* Does CPUID say there's support for XSAVE instructions? */ - memset(exx, 0, sizeof(exx)); -#if defined(HAVE__GET_CPUID) - __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); -#elif defined(HAVE__CPUID) - __cpuid(exx, 1); +/* + * Does CPUID say there's support for AVX-512 byte and word instructions? + */ +static inline bool +avx512_bw_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 26)) == 0) /* xsave */ - return false; + return (exx[1] & (1 << 30)) != 0; /* avx512-bw */ +} - /* Does XGETBV say the ZMM registers are enabled? */ -#ifdef HAVE_XSAVE_INTRINSICS - return (_xgetbv(0) & 0xe0) != 0; -#else - return false; -#endif +/* + * Returns true if the CPU supports the instructions required for the AVX-512 + * pg_popcount() implementation. + */ +bool +pg_popcount_avx5
Re: Popcount optimization using AVX512
On Thu, Apr 18, 2024 at 06:12:22PM +, Shankaran, Akash wrote: > Good find. I confirmed after speaking with an intel expert, and from the > intel AVX-512 manual [0] section 14.3, which recommends to check bit27. From > the manual: > > "Prior to using Intel AVX, the application must identify that the operating > system supports the XGETBV instruction, > the YMM register state, in addition to processor's support for YMM state > management using XSAVE/XRSTOR and > AVX instructions. The following simplified sequence accomplishes both and is > strongly recommended. > 1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application > use1). > 2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state > are enabled by OS). > 3) detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported). > (Step 3 can be done in any order relative to 1 and 2.)" Thanks for confirming. IIUC my patch should be sufficient, then. > It also seems that step 1 and step 2 need to be done prior to the CPUID > OSXSAVE check in the popcount code. This seems to contradict the note about doing step 3 at any point, and given step 1 is the OSXSAVE check, I'm not following what this means, anyway. I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6 instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower half of some of the ZMM registers is stored in the SSE and AVX state [0]. I don't know how likely it is that 0xe0 would succeed but 0xe6 wouldn't, but we might as well make it correct. [0] https://en.wikipedia.org/wiki/Control_register#cite_ref-23 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> It was brought to my attention [0] that we probably should be checking for > the OSXSAVE bit instead of the XSAVE bit when determining whether there's > support for the XGETBV instruction. IIUC that should indicate that both the > OS and the processor have XGETBV support (not just the processor). > I've attached a one-line patch to fix this. > [0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463 Good find. I confirmed after speaking with an intel expert, and from the intel AVX-512 manual [0] section 14.3, which recommends to check bit27. From the manual: "Prior to using Intel AVX, the application must identify that the operating system supports the XGETBV instruction, the YMM register state, in addition to processor's support for YMM state management using XSAVE/XRSTOR and AVX instructions. The following simplified sequence accomplishes both and is strongly recommended. 1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use1). 2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state are enabled by OS). 3) detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported). (Step 3 can be done in any order relative to 1 and 2.)" It also seems that step 1 and step 2 need to be done prior to the CPUID OSXSAVE check in the popcount code. [0]: https://cdrdv2.intel.com/v1/dl/getContent/671200 - Akash Shankaran
Re: Popcount optimization using AVX512
It was brought to my attention [0] that we probably should be checking for the OSXSAVE bit instead of the XSAVE bit when determining whether there's support for the XGETBV instruction. IIUC that should indicate that both the OS and the processor have XGETBV support (not just the processor). I've attached a one-line patch to fix this. [0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c index ae3fa3d306..cc3e89e096 100644 --- a/src/port/pg_popcount_avx512_choose.c +++ b/src/port/pg_popcount_avx512_choose.c @@ -74,7 +74,7 @@ pg_popcount_avx512_available(void) #else #error cpuid instruction not available #endif - if ((exx[2] & (1 << 26)) == 0) /* xsave */ + if ((exx[2] & (1 << 27)) == 0) /* osxsave */ return false; /* Does XGETBV say the ZMM registers are enabled? */
Re: Popcount optimization using AVX512
Nathan Bossart writes: > On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: >> The Intel documentation for _mm256_undefined_si256() [0] >> indicates that it is intended to return "undefined elements," so it seems >> like the use of an uninitialized variable might be intentional. > See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122. Ah, interesting. That hasn't propagated to stable distros yet, evidently (and even when it does, I wonder how soon Coverity will understand it). Anyway, that does establish that it's gcc's problem not ours. Thanks for digging! regards, tom lane
Re: Popcount optimization using AVX512
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: > The Intel documentation for _mm256_undefined_si256() [0] > indicates that it is intended to return "undefined elements," so it seems > like the use of an uninitialized variable might be intentional. See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote: > Today's Coverity run produced this warning, which seemingly was > triggered by one of these commits, but I can't make much sense > of it: > > *** CID 1596255: Uninitialized variables (UNINIT) > /usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in > _mm256_undefined_si256() > 1214 extern __inline __m256i __attribute__((__gnu_inline__, > __always_inline__, __artificial__)) > 1215 _mm256_undefined_si256 (void) > 1216 { > 1217 __m256i __Y = __Y; CID 1596255: Uninitialized variables (UNINIT) Using uninitialized value "__Y". > 1218 return __Y; > 1219 } > > I see the same code in my local copy of avxintrin.h, > and I quite agree that it looks like either an undefined > value or something that properly ought to be an error. > If we are calling this, why (and from where)? Nothing in these commits uses this, or even uses the 256-bit registers. avxintrin.h is included by immintrin.h, which is probably why this is showing up. I believe you're supposed to use immintrin.h for the intrinsics used in these commits, so I don't immediately see a great way to avoid this. The Intel documentation for _mm256_undefined_si256() [0] indicates that it is intended to return "undefined elements," so it seems like the use of an uninitialized variable might be intentional. > Anyway, we can certainly just dismiss this warning if it > doesn't correspond to any real problem in our code. > But I thought I'd raise the question. That's probably the right thing to do, unless there's some action we can take to suppress this warning. [0] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_undefined_si256&ig_expand=6943 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Nathan Bossart writes: > Here is what I have staged for commit, which I intend to do shortly. Today's Coverity run produced this warning, which seemingly was triggered by one of these commits, but I can't make much sense of it: *** CID 1596255: Uninitialized variables (UNINIT) /usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in _mm256_undefined_si256() 1214 extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) 1215 _mm256_undefined_si256 (void) 1216 { 1217 __m256i __Y = __Y; >>> CID 1596255: Uninitialized variables (UNINIT) >>> Using uninitialized value "__Y". 1218 return __Y; 1219 } I see the same code in my local copy of avxintrin.h, and I quite agree that it looks like either an undefined value or something that properly ought to be an error. If we are calling this, why (and from where)? Anyway, we can certainly just dismiss this warning if it doesn't correspond to any real problem in our code. But I thought I'd raise the question. regards, tom lane
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote: > Here is what I have staged for commit, which I intend to do shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote: > On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: >> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: >> > Won't Valgrind complain about this? >> > >> > +pg_popcount_avx512(const char *buf, int bytes) >> > >> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); >> > >> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); >> >> I haven't been able to generate any complaints, at least with some simple >> tests. But I see your point. If this did cause such complaints, ISTM we'd >> just want to add it to the suppression file. Otherwise, I think we'd have >> to go back to the non-maskz approach (which I really wanted to avoid >> because of the weird function overhead juggling) or find another way to do >> a partial load into an __m512i. > > [1] seems to think it's ok. If this is true then the following > shouldn't segfault: > > The following seems to run without any issue and if I change the mask > to 1 it crashes, as you'd expect. Cool. Here is what I have staged for commit, which I intend to do shortly. At some point, I'd like to revisit converting TRY_POPCNT_FAST to a configure-time check and maybe even moving the "fast" and "slow" implementations to their own files, but since that's mostly for code neatness and we are rapidly approaching the v17 deadline, I'm content to leave that for v18. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9eea49555cbd14c7871085e159c9b0b78e92 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v28 1/2] Optimize pg_popcount() with AVX-512 instructions. Presently, pg_popcount() processes data in 32-bit or 64-bit chunks when possible. Newer hardware that supports AVX-512 instructions can perform these tasks in 512-bit chunks, which can provide a nice speedup, especially for larger buffers. 
This commit introduces the infrastructure required to detect both compiler and CPU support for the required AVX-512 intrinsic functions, and it makes use of that infrastructure in a new pg_popcount() implementation. If CPU support for this optimized implementation is detected at runtime, a function pointer is updated so that it is used for subsequent calls to pg_popcount(). Most of the existing in-tree calls to pg_popcount() should benefit nicely from these instructions, and calls for smaller buffers should not regress when compared to v16. The new infrastructure introduced by this commit can also be used to optimized visibilitymap_count(), but that work is left for a follow-up commit. Co-authored-by: Paul Amonson, Ants Aasma Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 82 + src/port/pg_popcount_avx512_choose.c | 87 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 696 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..cfff48c1bc 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. 
+# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# --
Re: Popcount optimization using AVX512
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote:
>
> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
> > Won't Valgrind complain about this?
> >
> > +pg_popcount_avx512(const char *buf, int bytes)
> >
> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
> >
> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
>
> I haven't been able to generate any complaints, at least with some simple
> tests. But I see your point. If this did cause such complaints, ISTM we'd
> just want to add it to the suppression file. Otherwise, I think we'd have
> to go back to the non-maskz approach (which I really wanted to avoid
> because of the weird function overhead juggling) or find another way to do
> a partial load into an __m512i.

[1] seems to think it's ok. If this is true then the following shouldn't
segfault. It runs without any issue, and if I change the mask to 1 it
crashes, as you'd expect.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
	__m512i val;

	val = _mm512_maskz_loadu_epi8((__mmask64) 0, NULL);
	printf("%llu\n", _mm512_reduce_add_epi64(val));
	return 0;
}

gcc avx512.c -o avx512 -O0 -mavx512f -march=native

David

[1] https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: > Won't Valgrind complain about this? > > +pg_popcount_avx512(const char *buf, int bytes) > > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); > > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); I haven't been able to generate any complaints, at least with some simple tests. But I see your point. If this did cause such complaints, ISTM we'd just want to add it to the suppression file. Otherwise, I think we'd have to go back to the non-maskz approach (which I really wanted to avoid because of the weird function overhead juggling) or find another way to do a partial load into an __m512i. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Sat, 6 Apr 2024 at 04:38, Nathan Bossart wrote: > This seems to provide a small performance boost, so I've incorporated it > into v27. Won't Valgrind complain about this? +pg_popcount_avx512(const char *buf, int bytes) + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); David
Re: Popcount optimization using AVX512
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote: > On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: >> The main issue I saw was that clang was able to peel off the first >> iteration of the loop and then eliminate the mask assignment and >> replace masked load with a memory operand for vpopcnt. I was not able >> to convince gcc to do that regardless of optimization options. >> Generated code for the inner loop: >> >> clang: >> : >> 50: add rdx, 64 >> 54: cmp rdx, rdi >> 57: jae >> 59: vpopcntq zmm1, zmmword ptr [rdx] >> 5f: vpaddq zmm0, zmm1, zmm0 >> 65: jmp >> >> gcc: >> : >> 38: kmovq k1, rdx >> 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] >> 43: add rax, 64 >> 47: mov rdx, -1 >> 4e: vpopcntq zmm0, zmm0 >> 54: vpaddq zmm0, zmm0, zmm1 >> 5a: vmovdqa64 zmm1, zmm0 >> 60: cmp rax, rsi >> 63: jb >> >> I'm not sure how much that matters in practice. Attached is a patch to >> do this manually giving essentially the same result in gcc. As most >> distro packages are built using gcc I think it would make sense to >> have the extra code if it gives a noticeable benefit for large cases. > > Yeah, I did see this, but I also wasn't sure if it was worth further > complicating the code. I can test with and without your fix and see if it > makes any difference in the benchmarks. This seems to provide a small performance boost, so I've incorporated it into v27. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9fc4b7556b72d51fce676db84b446099767efff3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v27 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 82 + src/port/pg_popcount_avx512_choose.c | 81 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 690 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..892b3c9580 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq +# -mavx512bw). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed va
Re: Popcount optimization using AVX512
On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: > The main issue I saw was that clang was able to peel off the first > iteration of the loop and then eliminate the mask assignment and > replace masked load with a memory operand for vpopcnt. I was not able > to convince gcc to do that regardless of optimization options. > Generated code for the inner loop: > > clang: > : > 50: add rdx, 64 > 54: cmp rdx, rdi > 57: jae > 59: vpopcntq zmm1, zmmword ptr [rdx] > 5f: vpaddq zmm0, zmm1, zmm0 > 65: jmp > > gcc: > : > 38: kmovq k1, rdx > 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] > 43: add rax, 64 > 47: mov rdx, -1 > 4e: vpopcntq zmm0, zmm0 > 54: vpaddq zmm0, zmm0, zmm1 > 5a: vmovdqa64 zmm1, zmm0 > 60: cmp rax, rsi > 63: jb > > I'm not sure how much that matters in practice. Attached is a patch to > do this manually giving essentially the same result in gcc. As most > distro packages are built using gcc I think it would make sense to > have the extra code if it gives a noticeable benefit for large cases. Yeah, I did see this, but I also wasn't sure if it was worth further complicating the code. I can test with and without your fix and see if it makes any difference in the benchmarks. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart wrote: > Here is an updated patch set. IMHO this is in decent shape and is > approaching committable. I checked the code generation on various gcc and clang versions. It looks mostly fine starting from versions where avx512 is supported, gcc-7.1 and clang-5. The main issue I saw was that clang was able to peel off the first iteration of the loop and then eliminate the mask assignment and replace masked load with a memory operand for vpopcnt. I was not able to convince gcc to do that regardless of optimization options. Generated code for the inner loop: clang: : 50: add rdx, 64 54: cmp rdx, rdi 57: jae 59: vpopcntq zmm1, zmmword ptr [rdx] 5f: vpaddq zmm0, zmm1, zmm0 65: jmp gcc: : 38: kmovq k1, rdx 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] 43: add rax, 64 47: mov rdx, -1 4e: vpopcntq zmm0, zmm0 54: vpaddq zmm0, zmm0, zmm1 5a: vmovdqa64 zmm1, zmm0 60: cmp rax, rsi 63: jb I'm not sure how much that matters in practice. Attached is a patch to do this manually giving essentially the same result in gcc. As most distro packages are built using gcc I think it would make sense to have the extra code if it gives a noticeable benefit for large cases. The visibility map patch has the same issue, otherwise looks good. Regards, Ants Aasma diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index dacc7553d29..f6e718b86e9 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes) * Iterate through all but the final iteration. Starting from second * iteration, the start index mask is ignored. 
*/ - for (; buf < final; buf += sizeof(__m512i)) + if (buf < final) { val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); + buf += sizeof(__m512i); mask = ~UINT64CONST(0); + + for (; buf < final; buf += sizeof(__m512i)) + { + val = _mm512_load_si512((const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + } } /* Final iteration needs to ignore bytes that are not within the length */
Re: Popcount optimization using AVX512
Here is an updated patch set. IMHO this is in decent shape and is approaching committable. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From df59d3e78604e4530f5096bafc08ac94e13d82d2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v26 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 74 src/port/pg_popcount_avx512_choose.c | 81 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 682 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..892b3c9580 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq +# -mavx512bw). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 
36feeafbb2..72d20d3945 100755 --- a/configure +++ b/configure @@ -647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as_echo "$pgac_cv__get_cpuid_count" >&6; } +if test x"$pgac_cv__get_cpuid_count" = x"yes"; then +
Re: Popcount optimization using AVX512
On Thu, Apr 04, 2024 at 04:02:53PM +0300, Ants Aasma wrote: > Speaking of which, what does bumping up the inlined version threshold > to 16 do with and without AVX-512 available? Linearly extrapolating > the 2 and 4 byte numbers it might just come ahead in both cases, > making the choice easy. IIRC the inlined version starts losing pretty quickly after 8 bytes. As I noted in my previous message, I think we have enough data to switch to your approach already, so I think it's a moot point. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
>
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?

Good point. When compared with REL_16_STABLE, Ants's idea still wins:

  bytes       v25   v25+ants   REL_16_STABLE
      2  1108.205   1033.132        2039.342
      4  1311.227   1289.373        3207.217
      8  1927.954   2360.113        3200.238
     16  2281.091   2365.408        4457.769
     32  3856.992   2390.688        6206.689
     64   3648.72   3242.498        9619.403
    128  4108.549   3607.148       17912.081
    256  4910.076   4496.852       33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call. For 8 bytes, REL_16_STABLE is doing
a function call as well as a call to a function pointer. At 16 bytes, it's
doing a function call and two calls to a function pointer. With Ants's
approach, both 8 and 16 bytes require a single call to a function pointer,
and of course we are using the AVX-512 implementation for both.

I think this is sufficient to justify switching approaches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
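For context, the "call to a function pointer" counted above refers to the runtime-dispatch pattern pg_popcount() uses, which can be sketched roughly as follows. The names and the portable-only fallback are illustrative assumptions, not the actual pg_bitutils code, and __builtin_popcount assumes a GCC/Clang-compatible compiler:

```c
#include <stddef.h>

static int	popcount_portable(const char *buf, size_t n);
static int	popcount_choose(const char *buf, size_t n);

/* Starts out pointing at the chooser; the first call redirects it. */
static int	(*popcount_impl) (const char *, size_t) = popcount_choose;

static int
popcount_portable(const char *buf, size_t n)
{
	int			cnt = 0;

	for (size_t i = 0; i < n; i++)
		cnt += __builtin_popcount((unsigned char) buf[i]);
	return cnt;
}

static int
popcount_choose(const char *buf, size_t n)
{
	/*
	 * The real code probes CPUID/XGETBV here and may select an AVX-512
	 * implementation; this sketch always falls back to the portable one.
	 * Subsequent calls through popcount_impl skip the probe entirely.
	 */
	popcount_impl = popcount_portable;
	return popcount_impl(buf, n);
}
```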
Re: Popcount optimization using AVX512
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. The approach I posted does not rely on masking performing page fault suppression. All loads are 64 byte aligned and always contain at least one byte of the buffer and therefore are guaranteed to be within a valid page. I personally don't mind it being slower for the very small cases, because when performance on those sizes really matters it makes much more sense to shoot for an inlined version instead. Speaking of which, what does bumping up the inlined version threshold to 16 do with and without AVX-512 available? Linearly extrapolating the 2 and 4 byte numbers it might just come ahead in both cases, making the choice easy. Regards, Ants Aasma
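The aligned-access idea described above (align the start pointer down to the load width, then mask off bytes outside the buffer so every load is aligned and stays inside valid pages) can be illustrated with a 64-bit-word analogue. This is a sketch under assumptions, not the AVX-512 patch itself: it uses the GCC/Clang __builtin_popcountll, and because plain C has no masked loads to suppress out-of-range bytes, the caller must ensure the underlying allocation spans whole words:

```c
#include <stdint.h>
#include <string.h>

/*
 * Count set bits using only aligned 8-byte loads.  Bytes before "buf" in
 * the first word and past "buf + bytes" in the last word are masked to
 * zero.  The real AVX-512 code relies on masked vector loads instead of
 * requiring a padded allocation.
 */
static int
popcount_aligned(const char *buf, size_t bytes)
{
	const size_t word = sizeof(uint64_t);
	const char *p = (const char *) ((uintptr_t) buf & ~(uintptr_t) (word - 1));
	const char *end = buf + bytes;
	int			cnt = 0;

	for (; p < end; p += word)
	{
		uint64_t	val;

		memcpy(&val, p, word);	/* stand-in for an aligned vector load */
		if (p < buf)			/* first word: drop leading bytes */
			val &= ~UINT64_C(0) << (8 * (size_t) (buf - p));
		if (end < p + word)		/* last word: drop trailing bytes */
			val &= ~UINT64_C(0) >> (8 * (size_t) (p + word - end));
		cnt += __builtin_popcountll(val);
	}
	return cnt;
}
```

Anything that fits in one aligned word takes exactly one masked load here, mirroring the one-cache-line constant-speed claim for the vector version.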
Re: Popcount optimization using AVX512
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. > > Thoughts? If we're worried about regressions with some narrow range of byte values, wouldn't it make more sense to compare that to cc4826dd5~1 at the latest rather than to some version that's already probably faster than PG16? David
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising. IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions. I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
>
> After sleeping on the problem, I think we can avoid this question
> altogether while making the code faster by using aligned accesses.
> Loads that straddle cache line boundaries run internally as 2 load
> operations. Gut feel says that there are enough out-of-order resources
> available to make it not matter in most cases. But even so, not doing
> the extra work is surely better. Attached is another approach that
> does aligned accesses, and thereby avoids going outside bounds.
>
> Would be interesting to see how well that fares in the small use case.
> Anything that fits into one aligned cache line should be constant
> speed, and there is only one branch, but the mask setup and folding
> the separate popcounts together should add up to about 20-ish cycles
> of overhead.

I tested your patch in comparison to v25 and saw the following:

  bytes       v25   v25+ants
      2  1108.205   1033.132
      4  1311.227   1289.373
      8  1927.954   2360.113
     16  2281.091   2365.408
     32  3856.992   2390.688
     64   3648.72   3242.498
    128  4108.549   3607.148
    256  4910.076   4496.852

For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise.
At 8 bytes, we are calling the function pointer, and there is a small regression with the masking approach. However, by 16 bytes, the masking approach is on par with v25, and it wins for all larger buffers, although the gains seem to taper off a bit. If we can verify this approach won't cause segfaults and can stomach the regression between 8 and 16 bytes, I'd happily pivot to this approach so that we can avoid the function call dance that I have in v25. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote: > I committed v23-0001. Here is a rebased version of the remaining patches. > I intend to test the masking idea from Ants next. 0002 was missing a cast that is needed for the 32-bit builds. I've fixed that in v25. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From fe001e38b3f209c2fe615a2c4c64109d5e4d3da1 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v25 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 29 ++- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 673 insertions(+), 5 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..b48ed7f271 100755 --- a/configure +++ b/configure @@ 
-647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: re
Re: Popcount optimization using AVX512
I committed v23-0001. Here is a rebased version of the remaining patches. I intend to test the masking idea from Ants next. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v24 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 29 ++- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 673 insertions(+), 5 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@
-647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote: > Sorry for the noise. I noticed a couple of silly mistakes immediately > after sending v21. Sigh... I missed a line while rebasing these patches, which seems to have grossly offended cfbot. Apologies again for the noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From bfe2b3158378fd822c17fb251178df7557065cfd Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 2 Apr 2024 15:54:49 -0500 Subject: [PATCH v23 1/3] inline pg_popcount for small numbers of bytes --- src/include/port/pg_bitutils.h | 34 -- src/port/pg_bitutils.c | 12 ++-- 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..1f487a4bc3 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num) /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); -extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes); +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); extern int pg_popcount64(uint64 word); -extern uint64 pg_popcount(const char *buf, int bytes); +extern uint64 pg_popcount_optimized(const char *buf, int bytes); #endif /* TRY_POPCNT_FAST */ +/* + * Returns the number of 1-bits in buf. + * + * If there aren't many bytes to process, the function call overhead of the + * optimized versions isn't worth taking, so we inline a loop that consults + * pg_number_of_ones in that case. If there are many bytes to process, we + * accept the function call overhead because the optimized versions are likely + * to be faster. 
+ */ +static inline uint64 +pg_popcount(const char *buf, int bytes) +{ + /* + * We use 8 bytes as the threshold because that's where we'll first use + * special instructions on 64-bit systems. A threshold of 4 bytes might + * make more sense on 32-bit systems, but it seems unlikely to make a + * tremendous difference. + */ + if (bytes < 8) + { + uint64 popcnt = 0; + + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + return popcnt; + } + + return pg_popcount_optimized(buf, bytes); +} + /* * Rotate the bits of "word" to the right/left by n bits. */ diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 28312f3dd9..6271acea60 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes); int (*pg_popcount32) (uint32 word) = pg_popcount32_choose; int (*pg_popcount64) (uint64 word) = pg_popcount64_choose; -uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; #endif /* TRY_POPCNT_FAST */ #ifdef TRY_POPCNT_FAST @@ -155,13 +155,13 @@ choose_popcount_functions(void) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; + pg_popcount_optimized = pg_popcount_fast; } else { pg_popcount32 = pg_popcount32_slow; pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; + pg_popcount_optimized = pg_popcount_slow; } } @@ -183,7 +183,7 @@ static uint64 pg_popcount_choose(const char *buf, int bytes) { choose_popcount_functions(); - return pg_popcount(buf, bytes); + return pg_popcount_optimized(buf, bytes); } /* @@ -387,11 +387,11 @@ pg_popcount64(uint64 word) } /* - * pg_popcount + * pg_popcount_optimized * Returns the number of 1-bits in buf */ uint64 -pg_popcount(const char *buf, int bytes) +pg_popcount_optimized(const char *buf, int bytes) { return pg_popcount_slow(buf, bytes); } -- 2.25.1 >From 
da744d0614021cf002e4d9e292e5c874bd81a84e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v23 2/3] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 29 ++- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote: > In v21, 0001 is just the above inlining idea, which seems worth doing > independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've > modified similarly to 0001, i.e., I've inlined the "fast" version in the > function pointer to avoid the function call overhead when there are fewer > than 64 bytes. All of this overhead juggling should result in choosing the > optimal popcount implementation depending on how many bytes there are to > process, roughly speaking. Sorry for the noise. I noticed a couple of silly mistakes immediately after sending v21. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cfc5e9fe77f96225ec67a044377b10113c98ce0d Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 2 Apr 2024 15:54:49 -0500 Subject: [PATCH v22 1/3] inline pg_popcount for small numbers of bytes --- src/include/port/pg_bitutils.h | 34 -- src/port/pg_bitutils.c | 12 ++-- 2 files changed, 38 insertions(+), 8 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..1f487a4bc3 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num) /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); -extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes); +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); extern int pg_popcount64(uint64 word); -extern uint64 pg_popcount(const char *buf, int bytes); +extern uint64 pg_popcount_optimized(const char *buf, int bytes); #endif /* TRY_POPCNT_FAST */ +/* + * Returns the number of 1-bits in buf. 
+ * + * If there aren't many bytes to process, the function call overhead of the + * optimized versions isn't worth taking, so we inline a loop that consults + * pg_number_of_ones in that case. If there are many bytes to process, we + * accept the function call overhead because the optimized versions are likely + * to be faster. + */ +static inline uint64 +pg_popcount(const char *buf, int bytes) +{ + /* + * We use 8 bytes as the threshold because that's where we'll first use + * special instructions on 64-bit systems. A threshold of 4 bytes might + * make more sense on 32-bit systems, but it seems unlikely to make a + * tremendous difference. + */ + if (bytes < 8) + { + uint64 popcnt = 0; + + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + return popcnt; + } + + return pg_popcount_optimized(buf, bytes); +} + /* * Rotate the bits of "word" to the right/left by n bits. */ diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 28312f3dd9..6271acea60 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes); int (*pg_popcount32) (uint32 word) = pg_popcount32_choose; int (*pg_popcount64) (uint64 word) = pg_popcount64_choose; -uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; #endif /* TRY_POPCNT_FAST */ #ifdef TRY_POPCNT_FAST @@ -155,13 +155,13 @@ choose_popcount_functions(void) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; + pg_popcount_optimized = pg_popcount_fast; } else { pg_popcount32 = pg_popcount32_slow; pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; + pg_popcount_optimized = pg_popcount_slow; } } @@ -183,7 +183,7 @@ static uint64 pg_popcount_choose(const char *buf, int bytes) { choose_popcount_functions(); - return pg_popcount(buf, bytes); + return 
pg_popcount_optimized(buf, bytes); } /* @@ -387,11 +387,11 @@ pg_popcount64(uint64 word) } /* - * pg_popcount + * pg_popcount_optimized * Returns the number of 1-bits in buf */ uint64 -pg_popcount(const char *buf, int bytes) +pg_popcount_optimized(const char *buf, int bytes) { return pg_popcount_slow(buf, bytes); } -- 2.25.1 >From a8024ebcc54b4ac0d3d145ade5d7cd85eb192afc Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v22 2/3] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 +
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: >> I don't like the double evaluation of the macro argument. Seems like >> you could get the same results more safely with >> >> static inline uint64 >> pg_popcount(const char *buf, int bytes) >> { >> if (bytes < 64) >> { >> uint64 popcnt = 0; >> >> while (bytes--) >> popcnt += pg_number_of_ones[(unsigned char) >> *buf++]; >> >> return popcnt; >> } >> return pg_popcount_optimized(buf, bytes); >> } > > Yeah, I like that better. I'll do some testing to see what the threshold > really should be before posting an actual patch. My testing shows that inlining wins with fewer than 8 bytes for the current "fast" implementation. The "fast" implementation wins with fewer than 64 bytes compared to the AVX-512 implementation. These results are pretty intuitive because those are the points at which the optimizations kick in. In v21, 0001 is just the above inlining idea, which seems worth doing independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've modified similarly to 0001, i.e., I've inlined the "fast" version in the function pointer to avoid the function call overhead when there are fewer than 64 bytes. All of this overhead juggling should result in choosing the optimal popcount implementation depending on how many bytes there are to process, roughly speaking. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From ce1180d557cbdf8cff33842ea2f1a22ba6676725 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 2 Apr 2024 15:54:49 -0500 Subject: [PATCH v21 1/3] inline pg_popcount for small numbers of bytes --- src/include/port/pg_bitutils.h | 34 -- src/port/pg_bitutils.c | 10 +- 2 files changed, 37 insertions(+), 7 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..1f487a4bc3 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num) /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); -extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes); +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); extern int pg_popcount64(uint64 word); -extern uint64 pg_popcount(const char *buf, int bytes); +extern uint64 pg_popcount_optimized(const char *buf, int bytes); #endif /* TRY_POPCNT_FAST */ +/* + * Returns the number of 1-bits in buf. + * + * If there aren't many bytes to process, the function call overhead of the + * optimized versions isn't worth taking, so we inline a loop that consults + * pg_number_of_ones in that case. If there are many bytes to process, we + * accept the function call overhead because the optimized versions are likely + * to be faster. + */ +static inline uint64 +pg_popcount(const char *buf, int bytes) +{ + /* + * We use 8 bytes as the threshold because that's where we'll first use + * special instructions on 64-bit systems. A threshold of 4 bytes might + * make more sense on 32-bit systems, but it seems unlikely to make a + * tremendous difference. 
+ */ + if (bytes < 8) + { + uint64 popcnt = 0; + + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + return popcnt; + } + + return pg_popcount_optimized(buf, bytes); +} + /* * Rotate the bits of "word" to the right/left by n bits. */ diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 28312f3dd9..4720f8e419 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes); int (*pg_popcount32) (uint32 word) = pg_popcount32_choose; int (*pg_popcount64) (uint64 word) = pg_popcount64_choose; -uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; #endif /* TRY_POPCNT_FAST */ #ifdef TRY_POPCNT_FAST @@ -155,13 +155,13 @@ choose_popcount_functions(void) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; + pg_popcount_optimized = pg_popcount_fast; } else { pg_popcount32 = pg_popcount32_slow; pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; + pg_popcount_optimized = pg_popcount_slow; } } @@ -183,7 +183,7 @@ static uint64 pg_popcount_choose(const char *buf, int bytes) { choose_pop
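The small-buffer path inlined in the patches above is a byte-wise walk over a 256-entry table of per-byte popcounts. A self-contained sketch of that idea, for readers following along: the table here is computed on first use, whereas PostgreSQL ships it precomputed as `pg_number_of_ones`; the function names are mine, not the tree's.

```c
#include <stdint.h>

uint8_t number_of_ones[256];

/* Fill the per-byte popcount table via the recurrence n[i] = n[i/2] + (i & 1). */
void init_number_of_ones(void)
{
    for (int i = 1; i < 256; i++)
        number_of_ones[i] = (uint8_t) (number_of_ones[i / 2] + (i & 1));
}

/* Count 1-bits in buf by byte-wise table lookup; this is the form that
 * wins for very short buffers, where function-pointer call overhead
 * would dominate. */
uint64_t table_popcount(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    if (number_of_ones[255] == 0)   /* build the table on first use */
        init_number_of_ones();
    while (bytes--)
        popcnt += number_of_ones[(unsigned char) *buf++];
    return popcnt;
}
```

The 8-byte threshold in the patch is exactly the point at which the word-at-a-time instructions start being used, which is why the crossover lands there in Nathan's measurements.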
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To allow byte level granularity > > masking, -mavx512bw is needed. Based on wikipedia this will only > > disable this fast path on Knights Mill (Xeon Phi), in all other cases > > VPOPCNTQ implies availability of BW. > > Sounds promising. IMHO we should really be sure that these kinds of loads > won't generate segfaults and the like due to the masked-out portions. I > searched around a little bit but haven't found anything that seemed > definitive. After sleeping on the problem, I think we can avoid this question altogether while making the code faster by using aligned accesses. Loads that straddle cache line boundaries run internally as 2 load operations. Gut feel says that there are enough out-of-order resources available to make it not matter in most cases. But even so, not doing the extra work is surely better. Attached is another approach that does aligned accesses, and thereby avoids going outside bounds. Would be interesting to see how well that fares in the small use case. Anything that fits into one aligned cache line should be constant speed, and there is only one branch, but the mask setup and folding the separate popcounts together should add up to about 20-ish cycles of overhead. 
Regards, Ants Aasma diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index f86558d1ee5..e1fbd98fa14 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -30,20 +30,44 @@ uint64 pg_popcount_avx512(const char *buf, int bytes) { - uint64 popcnt; + __m512i val, cnt; __m512i accum = _mm512_setzero_si512(); + const char *final; + int tail_idx; + __mmask64 mask = -1; - for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) - { - const __m512i val = _mm512_loadu_si512((const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); + /* + * Align buffer down to avoid double load overhead from unaligned access. + * Calculate a mask to ignore preceding bytes. Find start offset of final + * iteration and number of valid bytes making sure that final iteration + * is not empty. + */ + mask <<= ((uintptr_t) buf) % sizeof(__m512i); + tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1; + final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1); + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + /* + * Iterate through all but the final iteration. Starting from second + * iteration, the start index mask is ignored. + */ + for (; buf < final; buf += sizeof(__m512i)) + { + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); - buf += sizeof(__m512i); + + mask = -1; } - popcnt = _mm512_reduce_add_epi64(accum); - return popcnt + pg_popcount_fast(buf, bytes); + /* Final iteration needs to ignore bytes that are not within the length */ + mask &= ((~0ULL) >> (64 - tail_idx)); + + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + + return _mm512_reduce_add_epi64(accum); } #endif /* TRY_POPCNT_FAST */
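The head/tail mask bookkeeping in the patch above is easy to get wrong at chunk boundaries. A portable scalar model of just the mask arithmetic (one mask bit per byte lane, 64-byte "vectors"; names are mine, not the patch's) lets one check, without AVX-512 hardware, that the selected lanes across all iterations cover exactly the requested byte range:

```c
#include <stdint.h>

/* Count the set bits in a 64-bit mask (one bit per byte lane). */
int mask_popcount(uint64_t m)
{
    int n = 0;
    while (m) { m &= m - 1; n++; }
    return n;
}

/*
 * Model of the patch's loop structure for bytes >= 1: align the start
 * address down, mask off leading lanes on the first iteration and
 * trailing lanes on the last, and return how many byte lanes were
 * selected in total.  A correct masking scheme selects exactly `bytes`.
 */
int selected_lanes(uint64_t addr, int bytes)
{
    uint64_t mask = ~UINT64_C(0) << (addr % 64);    /* skip leading lanes */
    int tail_idx = (int) ((addr + bytes - 1) % 64) + 1;
    uint64_t p = addr - addr % 64;                  /* align down */
    uint64_t final = (addr + bytes - 1) - (addr + bytes - 1) % 64;
    int total = 0;

    for (; p < final; p += 64)
    {
        total += mask_popcount(mask);
        mask = ~UINT64_C(0);                        /* full vectors in between */
    }
    /* final iteration: also drop lanes past the end of the buffer */
    mask &= ~UINT64_C(0) >> (64 - tail_idx);
    total += mask_popcount(mask);
    return total;
}
```

Note how the single-chunk case works out: when the loop body never runs, the head and tail masks are simply ANDed together, matching the patch's control flow.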
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: > Alvaro Herrera writes: >> On 2024-Apr-02, Nathan Bossart wrote: >>> Another idea I had is to turn pg_popcount() into a macro that just uses the >>> pg_number_of_ones array when called for few bytes: >>> >>> static inline uint64 >>> pg_popcount_inline(const char *buf, int bytes) >>> { >>> uint64 popcnt = 0; >>> >>> while (bytes--) >>> popcnt += pg_number_of_ones[(unsigned char) *buf++]; >>> >>> return popcnt; >>> } >>> >>> #define pg_popcount(buf, bytes) \ >>> ((bytes < 64) ? \ >>> pg_popcount_inline(buf, bytes) : \ >>> pg_popcount_optimized(buf, bytes)) >>> >>> But again, I'm not sure this is really worth it for the current use-cases. > >> Eh, that seems simple enough, and then you can forget about that case. > > I don't like the double evaluation of the macro argument. Seems like > you could get the same results more safely with > > static inline uint64 > pg_popcount(const char *buf, int bytes) > { > if (bytes < 64) > { > uint64 popcnt = 0; > > while (bytes--) > popcnt += pg_number_of_ones[(unsigned char) > *buf++]; > > return popcnt; > } > return pg_popcount_optimized(buf, bytes); > } Yeah, I like that better. I'll do some testing to see what the threshold really should be before posting an actual patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Alvaro Herrera writes: > On 2024-Apr-02, Nathan Bossart wrote: >> Another idea I had is to turn pg_popcount() into a macro that just uses the >> pg_number_of_ones array when called for few bytes: >> >> static inline uint64 >> pg_popcount_inline(const char *buf, int bytes) >> { >> uint64 popcnt = 0; >> >> while (bytes--) >> popcnt += pg_number_of_ones[(unsigned char) *buf++]; >> >> return popcnt; >> } >> >> #define pg_popcount(buf, bytes) \ >> ((bytes < 64) ? \ >> pg_popcount_inline(buf, bytes) : \ >> pg_popcount_optimized(buf, bytes)) >> >> But again, I'm not sure this is really worth it for the current use-cases. > Eh, that seems simple enough, and then you can forget about that case. I don't like the double evaluation of the macro argument. Seems like you could get the same results more safely with static inline uint64 pg_popcount(const char *buf, int bytes) { if (bytes < 64) { uint64 popcnt = 0; while (bytes--) popcnt += pg_number_of_ones[(unsigned char) *buf++]; return popcnt; } return pg_popcount_optimized(buf, bytes); } regards, tom lane
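The double evaluation Tom objects to becomes observable whenever the macro argument has side effects. A contrived sketch (names hypothetical, not PostgreSQL's) contrasting the macro form with the inline-function form:

```c
#include <stdint.h>

/* Tiny popcount table with only the entries this demo uses. */
static const uint8_t ones[256] = {
    [0x01] = 1, [0x03] = 2, [0x07] = 3, [0x0f] = 4 };

uint64_t count_bits(const char *buf, int bytes)
{
    uint64_t n = 0;
    while (bytes--)
        n += ones[(unsigned char) *buf++];
    return n;
}

/* Macro form: `bytes` appears twice, so it is evaluated twice. */
#define POPCOUNT_MACRO(buf, bytes) \
    (((bytes) < 64) ? count_bits((buf), (bytes)) : count_bits((buf), (bytes)))

/* Inline-function form: the argument is evaluated exactly once. */
static inline uint64_t popcount_inline(const char *buf, int bytes)
{
    return count_bits(buf, bytes);
}
```

With `int n = 3;`, `POPCOUNT_MACRO(data, n++)` increments `n` twice (once in the comparison, once in the call), so it ends at 5 and the call counts four bytes instead of three, while `popcount_inline(data, n++)` counts three bytes and leaves `n` at 4.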
Re: Popcount optimization using AVX512
On 2024-Apr-02, Nathan Bossart wrote: > Another idea I had is to turn pg_popcount() into a macro that just uses the > pg_number_of_ones array when called for few bytes: > > static inline uint64 > pg_popcount_inline(const char *buf, int bytes) > { > uint64 popcnt = 0; > > while (bytes--) > popcnt += pg_number_of_ones[(unsigned char) *buf++]; > > return popcnt; > } > > #define pg_popcount(buf, bytes) \ > ((bytes < 64) ? \ >pg_popcount_inline(buf, bytes) : \ >pg_popcount_optimized(buf, bytes)) > > But again, I'm not sure this is really worth it for the current use-cases. Eh, that seems simple enough, and then you can forget about that case. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "No hay hombre que no aspire a la plenitud, es decir, la suma de experiencias de que un hombre es capaz"
Re: Popcount optimization using AVX512
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote: > Here is a v19 of the patch set. I moved out the refactoring of the > function pointer selection code to 0001. I think this is a good change > independent of $SUBJECT, and I plan to commit this soon. In 0002, I > changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones > instead. This is standard practice elsewhere where the popcount functions > are unlikely to win. I'll probably commit this one soon, too, as it's even > more trivial than 0001. > > 0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are > no changes from v18. 0004 is an early proof-of-concept for using AVX512 > for the visibility map code. The code is missing comments, and I haven't > performed any benchmarking yet, but I figured I'd post it because it > demonstrates how it's possible to build upon 0003 in other areas. I've committed the first two patches, and I've attached a rebased version of the latter two. > AFAICT the main open question is the function call overhead in 0003 that > Alvaro brought up earlier. After 0002 is committed, I believe the only > in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm > not sure it's worth expending too much energy to make sure there are > absolutely no regressions there. However, I'm happy to do so if folks feel > that it is necessary, and I'd be grateful for thoughts on how to proceed on > this one. Another idea I had is to turn pg_popcount() into a macro that just uses the pg_number_of_ones array when called for few bytes: static inline uint64 pg_popcount_inline(const char *buf, int bytes) { uint64 popcnt = 0; while (bytes--) popcnt += pg_number_of_ones[(unsigned char) *buf++]; return popcnt; } #define pg_popcount(buf, bytes) \ ((bytes < 64) ? \ pg_popcount_inline(buf, bytes) : \ pg_popcount_optimized(buf, bytes)) But again, I'm not sure this is really worth it for the current use-cases. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 3c5c3fdaffd623b513bcc476ee7c15f6379af1e7 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v20 1/2] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 7 +- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 651 insertions(+), 5 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACH
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out portions of a load instruction >> > will not generate an exception. To allow byte level granularity >> > masking, -mavx512bw is needed. Based on wikipedia this will only >> > disable this fast path on Knights Mill (Xeon Phi), in all other cases >> > VPOPCNTQ implies availability of BW. >> >> Sounds promising. IMHO we should really be sure that these kinds of loads >> won't generate segfaults and the like due to the masked-out portions. I >> searched around a little bit but haven't found anything that seemed >> definitive. > > Interestingly the Intel software developer manual is not exactly > crystal clear on how memory faults with masks work, but volume 2A > chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb > that supports memory fault suppression on page fault. Perhaps Paul or Akash could chime in here... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
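For readers following the fault-suppression question: the semantics being relied on are that a masked byte load zeroes masked-out lanes and, per the E4.nb exception class cited above, never faults on them. A scalar reference model of those load semantics (it can demonstrate the zeroing behavior, though obviously not the hardware fault suppression itself):

```c
#include <stdint.h>

/*
 * Scalar model of a 64-byte masked load (_mm512_maskz_loadu_epi8):
 * lane i receives buf[i] when bit i of mask is set, and 0 otherwise.
 * Crucially, masked-out lanes are never dereferenced, which is what
 * would permit loading a short tail without reading past the buffer.
 */
void maskz_load_model(uint8_t out[64], uint64_t mask, const uint8_t *buf)
{
    for (int i = 0; i < 64; i++)
        out[i] = ((mask >> i) & 1) ? buf[i] : 0;
}
```

Because masked-out lanes contribute zero, a popcount over the loaded vector counts only the bytes the mask selects, which is why the tail can share the main loop's code path.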
Re: Popcount optimization using AVX512
Here is a v19 of the patch set. I moved out the refactoring of the function pointer selection code to 0001. I think this is a good change independent of $SUBJECT, and I plan to commit this soon. In 0002, I changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones instead. This is standard practice elsewhere where the popcount functions are unlikely to win. I'll probably commit this one soon, too, as it's even more trivial than 0001. 0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are no changes from v18. 0004 is an early proof-of-concept for using AVX512 for the visibility map code. The code is missing comments, and I haven't performed any benchmarking yet, but I figured I'd post it because it demonstrates how it's possible to build upon 0003 in other areas. AFAICT the main open question is the function call overhead in 0003 that Alvaro brought up earlier. After 0002 is committed, I believe the only in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm not sure it's worth expending too much energy to make sure there are absolutely no regressions there. However, I'm happy to do so if folks feel that it is necessary, and I'd be grateful for thoughts on how to proceed on this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cedad23b7b35e77fde164b1d577c37fb07a578c6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:37:53 -0500 Subject: [PATCH v19 1/4] refactor popcount function choosing --- src/port/pg_bitutils.c | 37 + 1 file changed, 9 insertions(+), 28 deletions(-) diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 1197696e97..28312f3dd9 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -148,8 +148,8 @@ pg_popcount_available(void) * the function pointers so that subsequent calls are routed directly to * the chosen implementation. 
*/ -static int -pg_popcount32_choose(uint32 word) +static inline void +choose_popcount_functions(void) { if (pg_popcount_available()) { @@ -163,45 +163,26 @@ pg_popcount32_choose(uint32 word) pg_popcount64 = pg_popcount64_slow; pg_popcount = pg_popcount_slow; } +} +static int +pg_popcount32_choose(uint32 word) +{ + choose_popcount_functions(); return pg_popcount32(word); } static int pg_popcount64_choose(uint64 word) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount64(word); } static uint64 pg_popcount_choose(const char *buf, int bytes) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount(buf, bytes); } -- 2.25.1 >From 038b74045b006c5d8a5470364f2041370ec0b083 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:47:22 -0500 Subject: [PATCH v19 2/4] use pg_number_of_ones instead of pg_popcount for single byte --- src/backend/postmaster/syslogger.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index 08efe74cc9..437947dbb9 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (p.nuls[0] == '\0' && p.nuls[1] == '\0' && p.len > 0 && p.len <= PIPE_MAX_PAYLOAD && p.pid != 0 && - pg_popcount((char *) &dest_flags, 1) == 1) + pg_number_of_ones[dest_flags] == 1) { List *buffer_list; ListCell *cell; -- 2.25.1 
>From 73ee8d6018b047856e63ad075641a0dcfe889417 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v19 3/4] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 7 +- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To allow byte level granularity > > masking, -mavx512bw is needed. Based on wikipedia this will only > > disable this fast path on Knights Mill (Xeon Phi), in all other cases > > VPOPCNTQ implies availability of BW. > > Sounds promising. IMHO we should really be sure that these kinds of loads > won't generate segfaults and the like due to the masked-out portions. I > searched around a little bit but haven't found anything that seemed > definitive. Interestingly the Intel software developer manual is not exactly crystal clear on how memory faults with masks work, but volume 2A chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb that supports memory fault suppression on page fault. Regards, Ants Aasma [1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > What about using the masking capabilities of AVX-512 to handle the > tail in the same code path? Masked out portions of a load instruction > will not generate an exception. To allow byte level granularity > masking, -mavx512bw is needed. Based on wikipedia this will only > disable this fast path on Knights Mill (Xeon Phi), in all other cases > VPOPCNTQ implies availability of BW. Sounds promising. IMHO we should really be sure that these kinds of loads won't generate segfaults and the like due to the masked-out portions. I searched around a little bit but haven't found anything that seemed definitive. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote: > > On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > > On 2024-Mar-31, Nathan Bossart wrote: > >> +popcnt = _mm512_reduce_add_epi64(accum); > >> +return popcnt + pg_popcount_fast(buf, bytes); > > > > Hmm, doesn't this arrangement cause an extra function call to > > pg_popcount_fast to be used here? Given the level of micro-optimization > > being used by this code, I would have thought that you'd have tried to > > avoid that. (At least, maybe avoid the call if bytes is 0, no?) > > Yes, it does. I did another benchmark on very small arrays and can see the > overhead. This is the time in milliseconds to run pg_popcount() on an > array 1 billion times:
>
> size (bytes)  HEAD      AVX512-POPCNT
> 1             1707.685  3480.424
> 2             1926.694  4606.182
> 4             3210.412  5284.506
> 8             1920.703  3640.968
> 16            2936.91   4045.586
> 32            3627.956  5538.418
> 64            5347.213  3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is > the earliest point where there are enough bytes for ZMM registers. What about using the masking capabilities of AVX-512 to handle the tail in the same code path? Masked out portions of a load instruction will not generate an exception. To allow byte level granularity masking, -mavx512bw is needed. Based on wikipedia this will only disable this fast path on Knights Mill (Xeon Phi), in all other cases VPOPCNTQ implies availability of BW. Attached is an example of what I mean. I did not have a machine to test it with, but the code generated looks sane. I added the clang pragma because it insisted on unrolling otherwise and based on how the instruction dependencies look that is probably not too helpful even for large cases (needs to be tested). The configure check and compile flags of course need to be amended for BW.
Regards, Ants Aasma diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index f86558d1ee5..7fb2ada16c9 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -30,20 +30,27 @@ uint64 pg_popcount_avx512(const char *buf, int bytes) { - uint64 popcnt; + __m512i val, cnt; + __mmask64 remaining_mask; __m512i accum = _mm512_setzero_si512(); - for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) + #pragma clang loop unroll(disable) + for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i)) { - const __m512i val = _mm512_loadu_si512((const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); + val = _mm512_loadu_si512((const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); buf += sizeof(__m512i); } - popcnt = _mm512_reduce_add_epi64(accum); - return popcnt + pg_popcount_fast(buf, bytes); + remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes); + val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + + accum = _mm512_add_epi64(accum, cnt); + + return _mm512_reduce_add_epi64(accum); } #endif /* TRY_POPCNT_FAST */
Re: Popcount optimization using AVX512
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > On 2024-Mar-31, Nathan Bossart wrote: >> +popcnt = _mm512_reduce_add_epi64(accum); >> +return popcnt + pg_popcount_fast(buf, bytes); > > Hmm, doesn't this arrangement cause an extra function call to > pg_popcount_fast to be used here? Given the level of micro-optimization > being used by this code, I would have thought that you'd have tried to > avoid that. (At least, maybe avoid the call if bytes is 0, no?) Yes, it does. I did another benchmark on very small arrays and can see the overhead. This is the time in milliseconds to run pg_popcount() on an array 1 billion times:

size (bytes)  HEAD      AVX512-POPCNT
1             1707.685  3480.424
2             1926.694  4606.182
4             3210.412  5284.506
8             1920.703  3640.968
16            2936.91   4045.586
32            3627.956  5538.418
64            5347.213  3748.212

I suspect that anything below 64 bytes will see this regression, as that is the earliest point where there are enough bytes for ZMM registers. We could avoid the call if there are no remaining bytes, but the numbers for the smallest arrays probably wouldn't improve much, and that might actually add some overhead due to branching. The other option to avoid this overhead is to put most of pg_bitutils.c into its header file so that we can inline the call. Reviewing the current callers of pg_popcount(), IIUC the only ones that are passing very small arrays are the bit_count() implementations and a call in the syslogger for a single byte. I don't know how much to worry about the overhead for bit_count() since there's presumably a bunch of other overhead, and the syslogger one could probably be fixed via an inline function that pulled the value from pg_number_of_ones (which would probably be an improvement over the status quo, anyway). But this is all to save a couple of nanoseconds... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On 2024-Mar-31, Nathan Bossart wrote: > +uint64 > +pg_popcount_avx512(const char *buf, int bytes) > +{ > + uint64 popcnt; > + __m512i accum = _mm512_setzero_si512(); > + > + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) > + { > + const __m512i val = _mm512_loadu_si512((const __m512i > *) buf); > + const __m512i cnt = _mm512_popcnt_epi64(val); > + > + accum = _mm512_add_epi64(accum, cnt); > + buf += sizeof(__m512i); > + } > + > + popcnt = _mm512_reduce_add_epi64(accum); > + return popcnt + pg_popcount_fast(buf, bytes); > +} Hmm, doesn't this arrangement cause an extra function call to pg_popcount_fast to be used here? Given the level of micro-optimization being used by this code, I would have thought that you'd have tried to avoid that. (At least, maybe avoid the call if bytes is 0, no?) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Re: Popcount optimization using AVX512
On Sat, Mar 30, 2024 at 03:03:29PM -0500, Nathan Bossart wrote: > My current plan is to add some new tests for > pg_popcount() with many bytes, and then I'll give it a few more days for > any additional feedback before committing. Here is a v18 with a couple of new tests. Otherwise, it is the same as v17. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 86a571721ed3ed4ca7e04134b9541fc3ac43b9f1 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v18 1/1] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 +++--- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 666 insertions(+), 39 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. 
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..b48ed7f271 100755 --- a/configure +++ b/configure @@ 
-647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo
Re: Popcount optimization using AVX512
I used John Naylor's test_popcount module [0] to put together the attached graphs (note that the "small arrays" one is semi-logarithmic). For both graphs, the X-axis is the number of 64-bit words in the array, and Y-axis is the amount of time in milliseconds to run pg_popcount() on it 100,000 times (along with a bit of overhead). This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes. There isn't a ton of use of pg_popcount() in Postgres, but I do see a few places that call it with enough bytes for the AVX512 optimization to take effect. There may be more callers in the future, though, and it seems generally useful to have some of the foundational work for using AVX512 instructions in place. My current plan is to add some new tests for pg_popcount() with many bytes, and then I'll give it a few more days for any additional feedback before committing. [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=g...@mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Here's a v17 of the patch. This one has configure checks for everything (i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT availability, and we call XGETBV to ensure the ZMM registers are enabled). I restricted the AVX512 configure checks to x86_64 since we know we won't have TRY_POPCNT_FAST on 32-bit, and we rely on pg_popcount_fast() as our fallback implementation in the AVX512 version. Finally, I removed the inline assembly in favor of using the _xgetbv() intrinsic on all systems. It looks like that's available on gcc, clang, and msvc, although it sometimes requires -mxsave, so that's applied to pg_popcount_avx512_choose.o as needed. I doubt this will lead to SIGILLs, but it's admittedly a little shaky. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From a26b209927cc6b266b33f74fd734772eff87bff9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v17 1/1] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 +++--- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 13 files changed, 638 insertions(+), 39 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..5fb60775ca 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic 
function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..b48ed7f271 100755 --- a/configure +++ b/configure @@ -647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote: >> +#if defined(HAVE__GET_CPUID) >> +__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); >> +#elif defined(HAVE__CPUID) >> +__cpuidex(exx, 7, 0); > > Is there any reason we can't use __get_cpuid() and __cpuid() here, given > the sub-leaf is 0? The answer to this seems to be "no." After additional research, __get_cpuid_count/__cpuidex seem new enough that we probably want configure checks for them, so I'll add those back in the next version of the patch. Apologies for the stream of consciousness today... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote: > * If the compiler understands AVX512 intrinsics, we assume that it also > knows about the required CPUID and XGETBV intrinsics, and we assume that > the conditions for TRY_POPCNT_FAST are true. Bleh, cfbot's 32-bit build is unhappy with this [0]. It looks like it's trying to build the AVX512 stuff, but TRY_POPCNT_FAST isn't set. [19:39:11.306] ../src/port/pg_popcount_avx512.c:39:18: warning: implicit declaration of function ‘pg_popcount_fast’; did you mean ‘pg_popcount’? [-Wimplicit-function-declaration] [19:39:11.306] 39 | return popcnt + pg_popcount_fast(buf, bytes); [19:39:11.306] | ^~~~ [19:39:11.306] | pg_popcount There's also a complaint about the inline assembly: [19:39:11.443] ../src/port/pg_popcount_avx512_choose.c:55:1: error: inconsistent operand constraints in an ‘asm’ [19:39:11.443] 55 | __asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr)); [19:39:11.443] | ^~~ I'm looking into this... > +#if defined(HAVE__GET_CPUID) > + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); > +#elif defined(HAVE__CPUID) > + __cpuidex(exx, 7, 0); Is there any reason we can't use __get_cpuid() and __cpuid() here, given the sub-leaf is 0? [0] https://cirrus-ci.com/task/5475113447981056 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Okay, here is a slightly different approach that I've dubbed the "maximum assumption" approach. In short, I wanted to see how much we could simplify the patch by making all possibly-reasonable assumptions about the compiler and CPU. These include: * If the compiler understands AVX512 intrinsics, we assume that it also knows about the required CPUID and XGETBV intrinsics, and we assume that the conditions for TRY_POPCNT_FAST are true. * If this is x86_64, CPUID will be supported by the CPU. * If CPUID indicates AVX512 POPCNT support, the CPU also supports XGETBV. Do any of these assumptions seem unreasonable or unlikely to be true for all practical purposes? I don't mind adding back some or all of the configure/runtime checks if they seem necessary. I guess the real test will be the buildfarm... Another big change in this version is that I've moved pg_popcount_avx512_available() to its own file so that we only compile pg_popcount_avx512() with the special compiler flags. This is just an oversight in previous versions. Finally, I've modified the build scripts so that the AVX512 popcount stuff is conditionally built based on the configure checks for both autoconf/meson. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From d7864391c455ea77b8e555e40a358c59de1bd702 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v16 1/1] AVX512 popcount support --- config/c-compiler.m4 | 34 + configure| 100 +++ configure.ac | 14 meson.build | 35 ++ src/Makefile.global.in | 4 ++ src/include/pg_config.h.in | 3 + src/include/port/pg_bitutils.h | 17 + src/makefiles/meson.build| 3 +- src/port/Makefile| 6 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 ++- src/port/pg_popcount_avx512.c| 40 +++ src/port/pg_popcount_avx512_choose.c | 61 13 files changed, 340 insertions(+), 39 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..7d13368b23 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..86c471f4ec 100755 --- a/configure +++ b/configure @@ -647,6 +647,8 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +CFLAGS_POPCNT +PG_POPCNT_OBJS LIBOBJS OPENSSL ZSTD @@ -17438,6 +17440,104 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h fi +# Check for AVX512 popcount intrinsics +# +PG_POPCNT_OBJS="" +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5 +$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; } +if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS " +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include +int +main () +{ +const char buf[sizeof(__m512i)]; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(va
RE: Popcount optimization using AVX512
> A counterexample is the CRC32C code. AFAICT we assume the presence of > CPUID in that code (and #error otherwise). I imagine its probably safe to > assume the compiler understands CPUID if it understands AVX512 intrinsics, > but that is still mostly a guess. If AVX-512 intrinsics are available, then yes you will have CPUID. CPUID is much older in the hardware/software timeline than AVX-512. Thanks, Paul
RE: Popcount optimization using AVX512
> On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > > We don't do MSVC via autoconf/Make. We used to have a special build > > framework for MSVC which parsed Makefiles to produce "solution" files, > > but it was removed as soon as Meson was mature enough to build. See > > commit 1301c80b2167. If it builds with Meson, you're good. > > The latest cfbot build for this seems to indicate that at least newer MSVC > knows AVX512 intrinsics without any special compiler flags [0], so maybe > what I had in v14 is good enough. A previous version of the patch set [1] had > the following lines: > > + if host_system == 'windows' > +test_flags = ['/arch:AVX512'] > + endif > > I'm not sure if this is needed for older MSVC or something else. IIRC I > couldn't > find any other examples of this sort of thing in the meson scripts, either. > Paul, > do you recall why you added this? I asked internal folks here in-the-know and they suggested I add it. I personally am not a Windows guy. If it works without it and you are comfortable not including the lines, I am fine with it. Thanks, Paul
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote: > Nathan Bossart writes: >>> I see google web references to the xgetbv instruction as far back as 2009 >>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >>> _xgetbv() MSVC built-in. How far back do you need to go? > >> Hm. It seems unlikely that a compiler would understand AVX512 intrinsics >> and not XGETBV then. I guess the other question is whether CPUID >> indicating AVX512 is enabled implies the availability of XGETBV on the CPU. >> If that's not safe, we might need to add another CPUID test. > > Some quick googling says that (1) XGETBV predates AVX and (2) if you > are worried about old CPUs, you should check CPUID to verify whether > XGETBV exists before trying to use it. I did not look for the > bit-level details on how to do that. That extra CPUID check should translate to exactly one additional line of code, so I think I'm inclined to just add it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> From: Nathan Bossart
> Sent: Friday, March 29, 2024 9:17 AM
> To: Amonson, Paul D

On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote:
>> Yeah, I understand that much, but I want to know how portable the
>> XGETBV instruction is. Unless I can assume that all x86_64 systems
>> and compilers support that instruction, we might need an additional
>> configure check and/or CPUID check. It looks like MSVC has had
>> support for the _xgetbv intrinsic for quite a while, but I'm still
>> researching the other cases.
>
> I see google web references to the xgetbv instruction as far back as
> 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could
> test for _xgetbv() MSVC built-in. How far back do you need to go?

> Hm. It seems unlikely that a compiler would understand AVX512 intrinsics and
> not XGETBV then. I guess the other question is whether CPUID indicating
> AVX512 is enabled implies the availability of XGETBV on the CPU.
> If that's not safe, we might need to add another CPUID test.

> It would probably be easy enough to add a couple of tests for this, but if we
> don't have reason to believe there's any practical case to do so, I don't
> know why we would. I'm curious what others think about this.

This seems unlikely. Machines supporting XGETBV would support AVX512
intrinsics. The XGETBV instruction seems to be part of the XSAVE feature set
per the Intel developer manual [2]. XGETBV/XSAVE came first and appears to
have been available in all x86 systems shipped since 2011, starting with the
Intel Sandy Bridge architecture and the AMD Opteron Gen 4 [0]. AVX512 first
appeared in a product in 2016 [1].

[0]: https://kb.vmware.com/s/article/1005764
[1]: https://en.wikipedia.org/wiki/AVX-512
[2]: https://cdrdv2-public.intel.com/774475/252046-sdm-change-document.pdf

- Akash Shankaran
Re: Popcount optimization using AVX512
Nathan Bossart writes: >> I see google web references to the xgetbv instruction as far back as 2009 >> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >> _xgetbv() MSVC built-in. How far back do you need to go? > Hm. It seems unlikely that a compiler would understand AVX512 intrinsics > and not XGETBV then. I guess the other question is whether CPUID > indicating AVX512 is enabled implies the availability of XGETBV on the CPU. > If that's not safe, we might need to add another CPUID test. Some quick googling says that (1) XGETBV predates AVX and (2) if you are worried about old CPUs, you should check CPUID to verify whether XGETBV exists before trying to use it. I did not look for the bit-level details on how to do that. regards, tom lane
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote:
> It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
> builds, too, but AFAICT we still need to wrap most of that code with
> macros, so I'm not sure it's worth the trouble. I'll take another look at
> this...

If we assumed that TRY_POPCNT_FAST would be set and either
HAVE__GET_CPUID_COUNT or HAVE__CPUIDEX would be set whenever
USE_AVX512_POPCNT_WITH_RUNTIME_CHECK is set, we could probably remove the
surrounding macros and just compile pg_popcount_avx512.c conditionally based
on USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. However, the surrounding code seems
to be pretty cautious about these assumptions (e.g., the CPUID macros are
checked before setting TRY_POPCNT_FAST), so this would stray from the nearby
precedent a bit.

A counterexample is the CRC32C code. AFAICT we assume the presence of CPUID
in that code (and #error otherwise). I imagine it's probably safe to assume
the compiler understands CPUID if it understands AVX512 intrinsics, but that
is still mostly a guess.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote: >> Yeah, I understand that much, but I want to know how portable the XGETBV >> instruction is. Unless I can assume that all x86_64 systems and compilers >> support that instruction, we might need an additional configure check and/or >> CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic >> for >> quite a while, but I'm still researching the other cases. > > I see google web references to the xgetbv instruction as far back as 2009 > for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for > _xgetbv() MSVC built-in. How far back do you need to go? Hm. It seems unlikely that a compiler would understand AVX512 intrinsics and not XGETBV then. I guess the other question is whether CPUID indicating AVX512 is enabled implies the availability of XGETBV on the CPU. If that's not safe, we might need to add another CPUID test. It would probably be easy enough to add a couple of tests for this, but if we don't have reason to believe there's any practical case to do so, I don't know why we would. I'm curious what others think about this. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> -Original Message- > > Cool. I think we should run the benchmarks again to be safe, though. Ok, sure go ahead. :) > >> I forgot to mention that I also want to understand whether we can > >> actually assume availability of XGETBV when CPUID says we support > >> AVX512: > > > > You cannot assume as there are edge cases where AVX-512 was found on > > system one during compile but it's not actually available in a kernel > > on a second system at runtime despite the CPU actually having the > > hardware feature. > > Yeah, I understand that much, but I want to know how portable the XGETBV > instruction is. Unless I can assume that all x86_64 systems and compilers > support that instruction, we might need an additional configure check and/or > CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic for > quite a while, but I'm still researching the other cases. I see google web references to the xgetbv instruction as far back as 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for _xgetbv() MSVC built-in. How far back do you need to go? Thanks, Paul
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 10:29:47PM +, Amonson, Paul D wrote: > I see in the meson.build you added the new file twice? > > @@ -7,6 +7,7 @@ pgport_sources = [ >'noblock.c', >'path.c', >'pg_bitutils.c', > + 'pg_popcount_avx512.c', >'pg_strong_random.c', >'pgcheckdir.c', >'pgmkdirp.c', > @@ -84,6 +85,7 @@ replace_funcs_pos = [ >['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'], >['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], >['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], > + ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', > 'avx512_popcnt'], > > I was putting the file with special flags ONLY in the second section and all > seemed to work. :) Ah, yes, I think that's a mistake, and without looking closely, might explain the MSVC warnings [0]: [22:05:47.444] pg_popcount_avx512.c.obj : warning LNK4006: pg_popcount_avx512_available already defined in pg_popcount_a... It might be nice if we conditionally built pg_popcount_avx512.o in autoconf builds, too, but AFAICT we still need to wrap most of that code with macros, so I'm not sure it's worth the trouble. I'll take another look at this... 
[0] http://commitfest.cputube.org/highlights/all.html#4883 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From c924b57f8479e51aa30c8e3cfe194a2ab85497ff Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v15 1/1] AVX512 popcount support --- config/c-compiler.m4 | 34 +++ configure | 165 + configure.ac | 34 +++ meson.build| 59 src/Makefile.global.in | 1 + src/include/pg_config.h.in | 9 ++ src/include/port/pg_bitutils.h | 20 src/makefiles/meson.build | 1 + src/port/Makefile | 6 ++ src/port/meson.build | 5 +- src/port/pg_bitutils.c | 56 --- src/port/pg_popcount_avx512.c | 98 12 files changed, 450 insertions(+), 38 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..f881e7ec28 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_AVX512_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..189264b86e 100755 --- a/configure +++ b/configure @@ -647,6 +647,7 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +CFLAGS_AVX512_POPCNT LIBOBJS OPENSSL ZSTD @@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +# Check for x86 cpuid_count instruction +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > We don't do MSVC via autoconf/Make. We used to have a special build > framework for MSVC which parsed Makefiles to produce "solution" files, > but it was removed as soon as Meson was mature enough to build. See > commit 1301c80b2167. If it builds with Meson, you're good. The latest cfbot build for this seems to indicate that at least newer MSVC knows AVX512 intrinsics without any special compiler flags [0], so maybe what I had in v14 is good enough. A previous version of the patch set [1] had the following lines: + if host_system == 'windows' +test_flags = ['/arch:AVX512'] + endif I'm not sure if this is needed for older MSVC or something else. IIRC I couldn't find any other examples of this sort of thing in the meson scripts, either. Paul, do you recall why you added this? [0] https://cirrus-ci.com/task/5787206636273664?logs=configure#L159 [1] https://postgr.es/m/attachment/158206/v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 10:03:04PM +, Amonson, Paul D wrote: >> * I think we need to verify there isn't a huge performance regression for >> smaller arrays. IIUC those will still require an AVX512 instruction or >> two as well as a function call, which might add some noticeable overhead. > > Not considering your changes, I had already tested small buffers. At less > than 512 bytes there was no measurable regression (there was one extra > condition check) and for 512+ bytes it moved from no regression to some > gains between 512 and 4096 bytes. Assuming you introduced no extra > function calls, it should be the same. Cool. I think we should run the benchmarks again to be safe, though. >> I forgot to mention that I also want to understand whether we can >> actually assume availability of XGETBV when CPUID says we support >> AVX512: > > You cannot assume as there are edge cases where AVX-512 was found on > system one during compile but it's not actually available in a kernel on > a second system at runtime despite the CPU actually having the hardware > feature. Yeah, I understand that much, but I want to know how portable the XGETBV instruction is. Unless I can assume that all x86_64 systems and compilers support that instruction, we might need an additional configure check and/or CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic for quite a while, but I'm still researching the other cases. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> -Original Message- > From: Amonson, Paul D > Sent: Thursday, March 28, 2024 3:03 PM > To: Nathan Bossart > ... > I will review the new patch to see if there are anything that jumps out at me. I see in the meson.build you added the new file twice? @@ -7,6 +7,7 @@ pgport_sources = [ 'noblock.c', 'path.c', 'pg_bitutils.c', + 'pg_popcount_avx512.c', 'pg_strong_random.c', 'pgcheckdir.c', 'pgmkdirp.c', @@ -84,6 +85,7 @@ replace_funcs_pos = [ ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'], ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], + ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'], I was putting the file with special flags ONLY in the second section and all seemed to work. :) Everything else seems good to me. Thanks, Paul
Re: Popcount optimization using AVX512
On 2024-Mar-28, Amonson, Paul D wrote: > > -Original Message- > > From: Nathan Bossart > > Sent: Thursday, March 28, 2024 2:39 PM > > To: Amonson, Paul D > > > > * The latest patch set from Paul Amonson appeared to support MSVC in the > > meson build, but not the autoconf one. I don't have much expertise here, > > so the v14 patch doesn't have any autoconf/meson support for MSVC, which > > I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds > > can always compile the x86_64 popcount code, but I don't know whether > > that's safe for AVX512. > > I also do not know how to integrate MSVC+Autoconf, the CI uses > MSVC+Meson+Ninja so I stuck with that. We don't do MSVC via autoconf/Make. We used to have a special build framework for MSVC which parsed Makefiles to produce "solution" files, but it was removed as soon as Meson was mature enough to build. See commit 1301c80b2167. If it builds with Meson, you're good. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere."(Lamar Owen)
RE: Popcount optimization using AVX512
> -Original Message- > From: Nathan Bossart > Sent: Thursday, March 28, 2024 2:39 PM > To: Amonson, Paul D > > * The latest patch set from Paul Amonson appeared to support MSVC in the > meson build, but not the autoconf one. I don't have much expertise here, > so the v14 patch doesn't have any autoconf/meson support for MSVC, which > I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds > can always compile the x86_64 popcount code, but I don't know whether > that's safe for AVX512. I also do not know how to integrate MSVC+Autoconf, the CI uses MSVC+Meson+Ninja so I stuck with that. > * I think we need to verify there isn't a huge performance regression for > smaller arrays. IIUC those will still require an AVX512 instruction or > two as well as a function call, which might add some noticeable overhead. Not considering your changes, I had already tested small buffers. At less than 512 bytes there was no measurable regression (there was one extra condition check) and for 512+ bytes it moved from no regression to some gains between 512 and 4096 bytes. Assuming you introduced no extra function calls, it should be the same. > I forgot to mention that I also want to understand whether we can actually > assume availability of XGETBV when CPUID says we support AVX512: You cannot assume as there are edge cases where AVX-512 was found on system one during compile but it's not actually available in a kernel on a second system at runtime despite the CPU actually having the hardware feature. I will review the new patch to see if there are anything that jumps out at me. Thanks, Paul
Re: Popcount optimization using AVX512
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote: > Here is a v14 of the patch that I think is beginning to approach something > committable. Besides general review and testing, there are two things that > I'd like to bring up: > > * The latest patch set from Paul Amonson appeared to support MSVC in the > meson build, but not the autoconf one. I don't have much expertise here, > so the v14 patch doesn't have any autoconf/meson support for MSVC, which > I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds > can always compile the x86_64 popcount code, but I don't know whether > that's safe for AVX512. > > * I think we need to verify there isn't a huge performance regression for > smaller arrays. IIUC those will still require an AVX512 instruction or > two as well as a function call, which might add some noticeable overhead. I forgot to mention that I also want to understand whether we can actually assume availability of XGETBV when CPUID says we support AVX512: > + /* > + * We also need to check that the OS has enabled support for > the ZMM > + * registers. > + */ > +#ifdef _MSC_VER > + return (_xgetbv(0) & 0xe0) != 0; > +#else > + uint64 xcr = 0; > + uint32 high; > + uint32 low; > + > +__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr)); > + return (low & 0xe0) != 0; > +#endif -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Here is a v14 of the patch that I think is beginning to approach something committable. Besides general review and testing, there are two things that I'd like to bring up: * The latest patch set from Paul Amonson appeared to support MSVC in the meson build, but not the autoconf one. I don't have much expertise here, so the v14 patch doesn't have any autoconf/meson support for MSVC, which I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds can always compile the x86_64 popcount code, but I don't know whether that's safe for AVX512. * I think we need to verify there isn't a huge performance regression for smaller arrays. IIUC those will still require an AVX512 instruction or two as well as a function call, which might add some noticeable overhead. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9b5725e36aa8cff7caeb8683e11cd09bd5bda745 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v14 1/1] AVX512 popcount support --- config/c-compiler.m4 | 34 +++ configure | 165 + configure.ac | 34 +++ meson.build| 59 src/Makefile.global.in | 1 + src/include/pg_config.h.in | 9 ++ src/include/port/pg_bitutils.h | 20 src/makefiles/meson.build | 1 + src/port/Makefile | 6 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 56 --- src/port/pg_popcount_avx512.c | 98 12 files changed, 451 insertions(+), 38 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..f881e7ec28 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# - +# Check if the compiler supports the AVX512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. 
+# +# An optional compiler flag can be passed as argument +# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT. +AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_loadu_si512((const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_AVX512_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..189264b86e 100755 --- a/configure +++ b/configure @@ -647,6 +647,7 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +CFLAGS_AVX512_POPCNT LIBOBJS OPENSSL ZSTD @@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +# Check for x86 cpuid_count instruction +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as_echo "$pgac_cv__get_cpuid_count" >&6; } +if test x"$pgac_cv__get_cpuid_count" = x"yes"; then + +$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h + +fi + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5 $as_echo_n "checking for __cpuid... " >&6; } if ${pgac_cv__cpuid+:} false; then : @@ -17438,6 +17474,135 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5 +$as_echo_n "checking
RE: Popcount optimization using AVX512
> -Original Message- > From: Nathan Bossart > Sent: Wednesday, March 27, 2024 3:00 PM > To: Amonson, Paul D > > ... (I realize that I'm essentially > recanting much of my previous feedback, which I apologize for.) It happens. LOL As long as the algorithm for AVX-512 is not altered I am confident that your new refactor will be fine. :) Thanks, Paul
Re: Popcount optimization using AVX512
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote: > On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: >> Ok, CI turned green after my re-post of the patches. Can this please get >> merged? > > Thanks for the new patches. I intend to take another look soon. Thanks for your patience. I spent most of my afternoon looking into the latest patch set, but I needed to do a CHECKPOINT and take a break. I am in the middle of doing some rather heavy editorialization, but the core of your changes will remain the same (and so I still intend to give you authorship credit). I've attached what I have so far, which is still missing the configuration checks and the changes to make sure the extra compiler flags make it to the right places. Unless something pops up while I work on the remainder of this patch, I think we'll end up going with a simpler approach. I originally set out to make this look like the CRC32C stuff (e.g., a file per implementation), but that seemed primarily useful if we can choose which files need to be compiled at configure-time. However, the TRY_POPCNT_FAST macro is defined at compile-time (AFAICT for good reason [0]), so we end up having to compile all the files in many cases anyway, and we continue to need to surround lots of code with "#ifdef TRY_POPCNT_FAST" or similar. So, my current thinking is that we should only move the AVX512 stuff to its own file for the purposes of compiling it with special flags when possible. (I realize that I'm essentially recanting much of my previous feedback, which I apologize for.) 
[0] https://postgr.es/m/CAApHDvrONNcYxGV6C0O3ZmaL0BvXBWY%2BrBOCBuYcQVUOURwhkA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 031eb4a365665edd304f0281ad7e412341504749 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v13 1/1] AVX512 popcount support --- src/include/port/pg_bitutils.h | 16 +++ src/port/Makefile | 1 + src/port/meson.build | 1 + src/port/pg_bitutils.c | 53 src/port/pg_popcount_avx512.c | 88 ++ 5 files changed, 125 insertions(+), 34 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 53e5239717..4b1e4d92b4 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -298,6 +298,22 @@ pg_ceil_log2_64(uint64 num) #endif #endif +/* + * We can also try to use the AVX512 popcount instruction on some systems. + * The implementation of that is located in its own file because it may + * require special compiler flags that we don't want to apply to any other + * files. 
+ */ +#if defined(TRY_POPCNT_FAST) && \ + defined(HAVE__IMMINTRIN) && \ + defined(HAVE__AVX512_POPCNT) +#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX) +#define TRY_POPCNT_AVX512 1 +extern bool pg_popcount_avx512_available(void); +extern uint64 pg_popcount_avx512(const char *buf, int bytes); +#endif +#endif + #ifdef TRY_POPCNT_FAST /* Attempt to use the POPCNT instruction, but perform a runtime check first */ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); diff --git a/src/port/Makefile b/src/port/Makefile index dcc8737e68..eb1e56fe41 100644 --- a/src/port/Makefile +++ b/src/port/Makefile @@ -44,6 +44,7 @@ OBJS = \ noblock.o \ path.o \ pg_bitutils.o \ + pg_popcount_avx512.o \ pg_strong_random.o \ pgcheckdir.o \ pgmkdirp.o \ diff --git a/src/port/meson.build b/src/port/meson.build index 92b593e6ef..c77bbd3168 100644 --- a/src/port/meson.build +++ b/src/port/meson.build @@ -7,6 +7,7 @@ pgport_sources = [ 'noblock.c', 'path.c', 'pg_bitutils.c', + 'pg_popcount_avx512.c', 'pg_strong_random.c', 'pgcheckdir.c', 'pgmkdirp.c', diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 1197696e97..2f9a6690e0 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -142,20 +142,18 @@ pg_popcount_available(void) return (exx[2] & (1 << 23)) != 0; /* POPCNT */ } -/* - * These functions get called on the first call to pg_popcount32 etc. - * They detect whether we can use the asm implementations, and replace - * the function pointers so that subsequent calls are routed directly to - * the chosen implementation. 
- */ -static int -pg_popcount32_choose(uint32 word) +static inline void +choose_popcount_functions(void) { if (pg_popcount_available()) { pg_popcount32 = pg_popcount32_fast; pg_popcount64 = pg_popcount64_fast; pg_popcount = pg_popcount_fast; +#ifdef TRY_POPCNT_AVX512 + if (pg_popcount_avx512_available()) + pg_popcount = pg_popcount_avx512; +#endif } else { @@ -163,45 +161,32 @@ pg_popcount32_choose(uint32 word) pg_popcount64 = pg_popcount64_slow; pg_popcount = pg_popcount_slow; } +} +/* + * These functions get called on the first call to pg_popcount32 etc. + * They detect w
Re: Popcount optimization using AVX512
On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: > Ok, CI turned green after my re-post of the patches. Can this please get > merged? Thanks for the new patches. I intend to take another look soon. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
RE: Popcount optimization using AVX512
> -Original Message- > From: Amonson, Paul D > Sent: Monday, March 25, 2024 8:20 AM > To: Tom Lane > Cc: David Rowley ; Nathan Bossart > ; Andres Freund ; Alvaro > Herrera ; Shankaran, Akash > ; Noah Misch ; Matthias > van de Meent ; pgsql- > hack...@lists.postgresql.org > Subject: RE: Popcount optimization using AVX512 > Ok, CI turned green after my re-post of the patches. Can this please get merged? Thanks, Paul
Re: Popcount optimization using AVX512
On 3/25/24 11:12, Tom Lane wrote: "Amonson, Paul D" writes: I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time. Just for a note --- the cfbot will re-test existing patches every so often without needing a bump. The current cycle period seems to be about two days. Just an FYI -- there seems to be an issue with all three of the macos cfbot runners (mine included). I spent time over the weekend working with Thomas Munro (added to CC list) trying different fixes to no avail. Help from macos CI wizards would be gratefully accepted... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com