On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote:
> On Sat, 6 Apr 2024 at 14:17, Nathan Bossart <nathandboss...@gmail.com> wrote:
>> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
>> > Won't Valgrind complain about this?
>> >
>> > +pg_popcount_avx512(const char *buf, int bytes)
>> >
>> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
>> >
>> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
>>
>> I haven't been able to generate any complaints, at least with some simple
>> tests. But I see your point. If this did cause such complaints, ISTM we'd
>> just want to add it to the suppression file. Otherwise, I think we'd have
>> to go back to the non-maskz approach (which I really wanted to avoid
>> because of the weird function overhead juggling) or find another way to do
>> a partial load into an __m512i.
>
> [1] seems to think it's ok. If this is true then the following
> shouldn't segfault:
>
> The following seems to run without any issue and if I change the mask
> to 1 it crashes, as you'd expect.

Cool. Here is what I have staged for commit, which I intend to do shortly.
At some point, I'd like to revisit converting TRY_POPCNT_FAST to a
configure-time check and maybe even moving the "fast" and "slow"
implementations to their own files, but since that's mostly for code
neatness and we are rapidly approaching the v17 deadline, I'm content to
leave that for v18.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

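For readers following the masked-load question quoted above, here is a minimal scalar sketch of the head/tail mask arithmetic used by pg_popcount_avx512() in the attached patch. It is not part of the patch. BLOCK stands in for sizeof(__m512i), and masked_block_popcount() is a hypothetical stand-in for the _mm512_maskz_loadu_epi8() plus _mm512_popcnt_epi64() pair: it only dereferences bytes whose mask bit is set, which is the behavior the thread assumes for the real instruction. The sketch also assumes bytes > 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64                /* stands in for sizeof(__m512i) */

/* Count the 1-bits in a single byte. */
int
popcount8(unsigned char b)
{
    int         cnt = 0;

    for (; b != 0; b >>= 1)
        cnt += b & 1;
    return cnt;
}

/*
 * Hypothetical scalar model of a zero-masked 64-byte load followed by a
 * popcount: bytes whose mask bit is clear are never read.
 */
uint64_t
masked_block_popcount(uint64_t mask, uintptr_t block)
{
    uint64_t    cnt = 0;

    for (int i = 0; i < BLOCK; i++)
    {
        if (mask & (UINT64_C(1) << i))
            cnt += popcount8(*(const unsigned char *) (block + i));
    }
    return cnt;
}

/* Same block/mask arithmetic as the patch, with the scalar model above. */
uint64_t
popcount_aligned_blocks(const char *buf, int bytes)
{
    uintptr_t   start = (uintptr_t) buf;
    uintptr_t   last;
    uintptr_t   final;
    uintptr_t   cur;
    uint64_t    mask = ~UINT64_C(0);
    uint64_t    total = 0;
    int         tail_idx;

    assert(bytes > 0);          /* assumption of this sketch */
    last = start + (uintptr_t) bytes - 1;   /* address of last valid byte */

    mask <<= start % BLOCK;                 /* ignore bytes before buf */
    tail_idx = (int) (last % BLOCK) + 1;    /* valid bytes in final block */
    final = last - last % BLOCK;            /* start of final block */
    cur = start - start % BLOCK;            /* buf aligned down */

    /* All blocks but the final one; only the first uses the start mask. */
    for (; cur < final; cur += BLOCK)
    {
        total += masked_block_popcount(mask, cur);
        mask = ~UINT64_C(0);
    }

    /* The final block must also ignore bytes past the end of the buffer. */
    mask &= ~UINT64_C(0) >> (BLOCK - tail_idx);
    return total + masked_block_popcount(mask, cur);
}

/* Byte-at-a-time reference used to check the result. */
uint64_t
naive_popcount(const char *buf, int bytes)
{
    uint64_t    cnt = 0;

    for (int i = 0; i < bytes; i++)
        cnt += popcount8((unsigned char) buf[i]);
    return cnt;
}
```

When the whole buffer fits inside one aligned block, the loop is skipped and the start mask and tail mask are combined in the single final load, which is why the start mask is only reset inside the loop.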
From 9eea492222555cbd14c7871085e159c9b0b78e92 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nat...@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v28 1/2] Optimize pg_popcount() with AVX-512 instructions.

Presently, pg_popcount() processes data in 32-bit or 64-bit chunks when
possible. Newer hardware that supports AVX-512 instructions can perform
these tasks in 512-bit chunks, which can provide a nice speedup,
especially for larger buffers. This commit introduces the infrastructure
required to detect both compiler and CPU support for the required
AVX-512 intrinsic functions, and it makes use of that infrastructure in
a new pg_popcount() implementation. If CPU support for this optimized
implementation is detected at runtime, a function pointer is updated so
that it is used for subsequent calls to pg_popcount(). Most of the
existing in-tree calls to pg_popcount() should benefit nicely from these
instructions, and calls for smaller buffers should not regress when
compared to v16. The new infrastructure introduced by this commit can
also be used to optimize visibilitymap_count(), but that work is left
for a follow-up commit.

Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
---
 config/c-compiler.m4                 |  58 ++++++
 configure                            | 252 +++++++++++++++++++++++++++
 configure.ac                         |  51 ++++++
 meson.build                          |  87 +++++++++
 src/Makefile.global.in               |   5 +
 src/include/pg_config.h.in           |  12 ++
 src/include/port/pg_bitutils.h       |  11 ++
 src/makefiles/meson.build            |   4 +-
 src/port/Makefile                    |  11 ++
 src/port/meson.build                 |   6 +-
 src/port/pg_bitutils.c               |   5 +
 src/port/pg_popcount_avx512.c        |  82 +++++++++
 src/port/pg_popcount_avx512_choose.c |  87 +++++++++
 src/test/regress/expected/bit.out    |  24 +++
 src/test/regress/sql/bit.sql         |   4 +
 15 files changed, 696 insertions(+), 3 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..cfff48c1bc 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# ----------------------------- +# Check if the compiler supports the AVX-512 POPCNT instructions using the +# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, +# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. +# +# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq +# -mavx512bw). If the intrinsics are supported, sets +# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. 
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>], + [const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_POPCNT="$1" + pgac_avx512_popcnt_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX512_POPCNT_INTRINSICS diff --git a/configure b/configure index 36feeafbb2..cfbd2a096f 100755 --- a/configure +++ b/configure @@ -647,6 +647,9 @@ MSGFMT_FLAGS MSGFMT PG_CRC32C_OBJS CFLAGS_CRC +PG_POPCNT_OBJS +CFLAGS_POPCNT +CFLAGS_XSAVE LIBOBJS OPENSSL ZSTD @@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5 +$as_echo_n "checking for __get_cpuid_count... " >&6; } +if ${pgac_cv__get_cpuid_count+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include <cpuid.h> +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__get_cpuid_count="yes" +else + pgac_cv__get_cpuid_count="no" +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5 +$as_echo "$pgac_cv__get_cpuid_count" >&6; } +if test x"$pgac_cv__get_cpuid_count" = x"yes"; then + +$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h + +fi + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5 $as_echo_n "checking for __cpuid... " >&6; } if ${pgac_cv__cpuid+:} false; then : @@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5 +$as_echo_n "checking for __cpuidex... " >&6; } +if ${pgac_cv__cpuidex+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <intrin.h> +int +main () +{ +unsigned int exx[4] = {0, 0, 0, 0}; + __cpuidex(exx, 7, 0); + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__cpuidex="yes" +else + pgac_cv__cpuidex="no" +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5 +$as_echo "$pgac_cv__cpuidex" >&6; } +if test x"$pgac_cv__cpuidex" = x"yes"; then + +$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h + +fi + +# Check for XSAVE intrinsics +# +CFLAGS_XSAVE="" +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5 +$as_echo_n "checking for _xgetbv with CFLAGS=... 
" >&6; } +if ${pgac_cv_xsave_intrinsics_+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS " +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <immintrin.h> +int +main () +{ +return _xgetbv(0) & 0xe0; + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_xsave_intrinsics_=yes +else + pgac_cv_xsave_intrinsics_=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +CFLAGS="$pgac_save_CFLAGS" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5 +$as_echo "$pgac_cv_xsave_intrinsics_" >&6; } +if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then + CFLAGS_XSAVE="" + pgac_xsave_intrinsics=yes +fi + +if test x"$pgac_xsave_intrinsics" != x"yes"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5 +$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; } +if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS -mxsave" +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include <immintrin.h> +int +main () +{ +return _xgetbv(0) & 0xe0; + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_xsave_intrinsics__mxsave=yes +else + pgac_cv_xsave_intrinsics__mxsave=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +CFLAGS="$pgac_save_CFLAGS" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5 +$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; } +if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then + CFLAGS_XSAVE="-mxsave" + pgac_xsave_intrinsics=yes +fi + +fi +if test x"$pgac_xsave_intrinsics" = x"yes"; then + +$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h + +fi + + +# Check for AVX-512 popcount intrinsics +# +CFLAGS_POPCNT="" +PG_POPCNT_OBJS="" +if test x"$host_cpu" = x"x86_64"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5 +$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; } +if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS " +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include <immintrin.h> +int +main () +{ +const char buf[sizeof(__m512i)]; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0; + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_avx512_popcnt_intrinsics_=yes +else + pgac_cv_avx512_popcnt_intrinsics_=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +CFLAGS="$pgac_save_CFLAGS" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5 +$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; } +if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then + CFLAGS_POPCNT="" + pgac_avx512_popcnt_intrinsics=yes +fi + + if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5 +$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; } +if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then : + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw" +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include <immintrin.h> +int +main () +{ +const char buf[sizeof(__m512i)]; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0; + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes +else + pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +CFLAGS="$pgac_save_CFLAGS" +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5 +$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; } +if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then + CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw" + pgac_avx512_popcnt_intrinsics=yes +fi + + fi + if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then + PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o" + +$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h + + fi +fi + + + # Check for Intel SSE 4.2 intrinsics to do CRC calculations. 
# # First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used diff --git a/configure.ac b/configure.ac index 57f734879e..67e738d92b 100644 --- a/configure.ac +++ b/configure.ac @@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.]) fi +AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>], + [[unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + ]])], + [pgac_cv__get_cpuid_count="yes"], + [pgac_cv__get_cpuid_count="no"])]) +if test x"$pgac_cv__get_cpuid_count" = x"yes"; then + AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.]) +fi + AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid], [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>], [[unsigned int exx[4] = {0, 0, 0, 0}; @@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.]) fi +AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>], + [[unsigned int exx[4] = {0, 0, 0, 0}; + __cpuidex(exx, 7, 0); + ]])], + [pgac_cv__cpuidex="yes"], + [pgac_cv__cpuidex="no"])]) +if test x"$pgac_cv__cpuidex" = x"yes"; then + AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.]) +fi + +# Check for XSAVE intrinsics +# +CFLAGS_XSAVE="" +PGAC_XSAVE_INTRINSICS([]) +if test x"$pgac_xsave_intrinsics" != x"yes"; then + PGAC_XSAVE_INTRINSICS([-mxsave]) +fi +if test x"$pgac_xsave_intrinsics" = x"yes"; then + AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.]) +fi +AC_SUBST(CFLAGS_XSAVE) + +# Check for AVX-512 popcount intrinsics +# +CFLAGS_POPCNT="" +PG_POPCNT_OBJS="" +if test x"$host_cpu" = x"x86_64"; then + PGAC_AVX512_POPCNT_INTRINSICS([]) + if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then + 
PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw]) + fi + if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then + PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o" + AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.]) + fi +fi +AC_SUBST(CFLAGS_POPCNT) +AC_SUBST(PG_POPCNT_OBJS) + # Check for Intel SSE 4.2 intrinsics to do CRC calculations. # # First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used diff --git a/meson.build b/meson.build index 87437960bc..5acf083ce3 100644 --- a/meson.build +++ b/meson.build @@ -1783,6 +1783,30 @@ elif cc.links(''' endif +# Check for __get_cpuid_count() and __cpuidex() in a similar fashion. +if cc.links(''' + #include <cpuid.h> + int main(int arg, char **argv) + { + unsigned int exx[4] = {0, 0, 0, 0}; + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); + } + ''', name: '__get_cpuid_count', + args: test_c_args) + cdata.set('HAVE__GET_CPUID_COUNT', 1) +elif cc.links(''' + #include <intrin.h> + int main(int arg, char **argv) + { + unsigned int exx[4] = {0, 0, 0, 0}; + __cpuidex(exx, 7, 0); + } + ''', name: '__cpuidex', + args: test_c_args) + cdata.set('HAVE__CPUIDEX', 1) +endif + + # Defend against clang being used on x86-32 without SSE2 enabled. As current # versions of clang do not understand -fexcess-precision=standard, the use of # x87 floating point operations leads to problems like isinf possibly returning @@ -1996,6 +2020,69 @@ int main(void) endif +############################################################### +# Check for the availability of XSAVE intrinsics. 
+############################################################### + +cflags_xsave = [] +if host_cpu == 'x86' or host_cpu == 'x86_64' + + prog = ''' +#include <immintrin.h> + +int main(void) +{ + return _xgetbv(0) & 0xe0; +} +''' + + if cc.links(prog, name: 'XSAVE intrinsics without -mxsave', + args: test_c_args) + cdata.set('HAVE_XSAVE_INTRINSICS', 1) + elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave', + args: test_c_args + ['-mxsave']) + cdata.set('HAVE_XSAVE_INTRINSICS', 1) + cflags_xsave += '-mxsave' + endif + +endif + + +############################################################### +# Check for the availability of AVX-512 popcount intrinsics. +############################################################### + +cflags_popcnt = [] +if host_cpu == 'x86_64' + + prog = ''' +#include <immintrin.h> + +int main(void) +{ + const char buf[sizeof(__m512i)]; + INT64 popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + /* return computed value, to prevent the above being optimized away */ + return popcnt == 0; +} +''' + + if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw', + args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))]) + cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1) + elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw', + args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw']) + cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1) + cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw'] + endif + +endif + ############################################################### # Select CRC-32C implementation. 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 8b3f8c24e0..36d880d225 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@ CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@ CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@ +CFLAGS_POPCNT = @CFLAGS_POPCNT@ CFLAGS_CRC = @CFLAGS_CRC@ +CFLAGS_XSAVE = @CFLAGS_XSAVE@ PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@ CXXFLAGS = @CXXFLAGS@ @@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@ # files needed for the chosen CRC-32C implementation PG_CRC32C_OBJS = @PG_CRC32C_OBJS@ +# files needed for the chosen popcount implementation +PG_POPCNT_OBJS = @PG_POPCNT_OBJS@ + LIBS := -lpgcommon -lpgport $(LIBS) # to make ws2_32.lib the last library diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 591e1ca3df..f8d3e3b6b8 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -513,6 +513,9 @@ /* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */ #undef HAVE_X86_64_POPCNTQ +/* Define to 1 if you have XSAVE intrinsics. */ +#undef HAVE_XSAVE_INTRINSICS + /* Define to 1 if the system has the type `_Bool'. */ #undef HAVE__BOOL @@ -555,9 +558,15 @@ /* Define to 1 if you have __cpuid. */ #undef HAVE__CPUID +/* Define to 1 if you have __cpuidex. */ +#undef HAVE__CPUIDEX + /* Define to 1 if you have __get_cpuid. */ #undef HAVE__GET_CPUID +/* Define to 1 if you have __get_cpuid_count. */ +#undef HAVE__GET_CPUID_COUNT + /* Define to 1 if your compiler understands _Static_assert. */ #undef HAVE__STATIC_ASSERT @@ -680,6 +689,9 @@ /* Define to 1 to build with assertion checks. (--enable-cassert) */ #undef USE_ASSERT_CHECKING +/* Define to 1 to use AVX-512 popcount instructions with a runtime check. */ +#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK + /* Define to 1 to build with Bonjour support. 
(--with-bonjour) */ #undef USE_BONJOUR diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index de480da71e..b453f84d8f 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -304,6 +304,17 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word); extern PGDLLIMPORT int (*pg_popcount64) (uint64 word); extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes); +/* + * We can also try to use the AVX-512 popcount instruction on some systems. + * The implementation of that is located in its own file because it may + * require special compiler flags that we don't want to apply to any other + * files. + */ +#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK +extern bool pg_popcount_avx512_available(void); +extern uint64 pg_popcount_avx512(const char *buf, int bytes); +#endif + #else /* Use a portable implementation -- no need for a function pointer. */ extern int pg_popcount32(uint32 word); diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build index b0f4178b3d..5618050b30 100644 --- a/src/makefiles/meson.build +++ b/src/makefiles/meson.build @@ -100,8 +100,10 @@ pgxs_kv = { ' '.join(cflags_no_decl_after_statement), 'CFLAGS_CRC': ' '.join(cflags_crc), + 'CFLAGS_POPCNT': ' '.join(cflags_popcnt), 'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags), 'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags), + 'CFLAGS_XSAVE': ' '.join(cflags_xsave), 'LDFLAGS': var_ldflags, 'LDFLAGS_EX': var_ldflags_ex, @@ -177,7 +179,7 @@ pgxs_empty = [ 'WANTED_LANGUAGES', # Not needed because we don't build the server / PLs with the generated makefile - 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS', + 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS', 'DTRACEFLAGS', # only server has dtrace probes 'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp', diff --git a/src/port/Makefile b/src/port/Makefile index dcc8737e68..db7c02117b 100644 --- a/src/port/Makefile +++ 
b/src/port/Makefile @@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS) OBJS = \ $(LIBOBJS) \ $(PG_CRC32C_OBJS) \ + $(PG_POPCNT_OBJS) \ bsearch_arg.o \ chklocale.o \ inet_net_ntop.o \ @@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC) pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC) pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC) +# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE +pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE) +pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE) +pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE) + +# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT +pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT) +pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT) +pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT) + # # Shared library versions of object files # diff --git a/src/port/meson.build b/src/port/meson.build index 92b593e6ef..fd9ee199d1 100644 --- a/src/port/meson.build +++ b/src/port/meson.build @@ -84,6 +84,8 @@ replace_funcs_pos = [ ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'], ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'], + ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'], + ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'], # arm / aarch64 ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'], @@ -98,8 +100,8 @@ replace_funcs_pos = [ ['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'], ] -pgport_cflags = {'crc': cflags_crc} -pgport_sources_cflags = {'crc': []} +pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave} +pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []} foreach f : replace_funcs_neg func = f.get(0) diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 6271acea60..411be90f73 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -163,6 +163,11 @@ choose_popcount_functions(void) pg_popcount64 
= pg_popcount64_slow; pg_popcount_optimized = pg_popcount_slow; } + +#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK + if (pg_popcount_avx512_available()) + pg_popcount_optimized = pg_popcount_avx512; +#endif } static int diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c new file mode 100644 index 0000000000..0040361cf6 --- /dev/null +++ b/src/port/pg_popcount_avx512.c @@ -0,0 +1,82 @@ +/*------------------------------------------------------------------------- + * + * pg_popcount_avx512.c + * Holds the pg_popcount() implementation that uses AVX-512 instructions. + * + * Copyright (c) 2024, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/port/pg_popcount_avx512.c + * + *------------------------------------------------------------------------- + */ +#include "c.h" + +#include <immintrin.h> + +#include "port/pg_bitutils.h" + +/* + * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to + * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on + * the function pointers that are only used when TRY_POPCNT_FAST is set. + */ +#ifdef TRY_POPCNT_FAST + +/* + * pg_popcount_avx512 + * Returns the number of 1-bits in buf + */ +uint64 +pg_popcount_avx512(const char *buf, int bytes) +{ + __m512i val, + cnt; + __m512i accum = _mm512_setzero_si512(); + const char *final; + int tail_idx; + __mmask64 mask = ~UINT64CONST(0); + + /* + * Align buffer down to avoid double load overhead from unaligned access. + * Calculate a mask to ignore preceding bytes. Find start offset of final + * iteration and number of valid bytes making sure that final iteration is + * not empty. + */ + mask <<= ((uintptr_t) buf) % sizeof(__m512i); + tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1; + final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1); + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + + /* + * Iterate through all but the final iteration. 
Starting from second + * iteration, the start index mask is ignored. + */ + if (buf < final) + { + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + + buf += sizeof(__m512i); + mask = ~UINT64CONST(0); + + for (; buf < final; buf += sizeof(__m512i)) + { + val = _mm512_load_si512((const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + } + } + + /* Final iteration needs to ignore bytes that are not within the length */ + mask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx)); + + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + + return _mm512_reduce_add_epi64(accum); +} + +#endif /* TRY_POPCNT_FAST */ diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c new file mode 100644 index 0000000000..d54147b88c --- /dev/null +++ b/src/port/pg_popcount_avx512_choose.c @@ -0,0 +1,87 @@ +/*------------------------------------------------------------------------- + * + * pg_popcount_avx512_choose.c + * Test whether we can use AVX-512 POPCNT instructions. + * + * Copyright (c) 2024, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/port/pg_popcount_avx512_choose.c + * + *------------------------------------------------------------------------- + */ +#include "c.h" + +#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) +#include <cpuid.h> +#endif + +#ifdef HAVE_XSAVE_INTRINSICS +#include <immintrin.h> +#endif + +#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX) +#include <intrin.h> +#endif + +#include "port/pg_bitutils.h" + +/* + * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to + * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on + * the function pointers that are only used when TRY_POPCNT_FAST is set. 
+ */ +#ifdef TRY_POPCNT_FAST + +/* + * Returns true if the CPU supports AVX-512 POPCNT. + */ +bool +pg_popcount_avx512_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + + /* does CPUID say there's support for AVX-512 POPCNT? */ +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); +#else +#error cpuid instruction not available +#endif + if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */ + return false; + + /* does CPUID say there's support for AVX-512 BW? */ + memset(exx, 0, sizeof(exx)); +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); +#else +#error cpuid instruction not available +#endif + if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */ + return false; + + /* does CPUID say there's support for XGETBV? */ + memset(exx, 0, sizeof(exx)); +#if defined(HAVE__GET_CPUID) + __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUID) + __cpuid(exx, 1); +#else +#error cpuid instruction not available +#endif + if ((exx[2] & (1 << 26)) == 0) /* xsave */ + return false; + + /* does XGETBV say the ZMM registers are enabled? 
*/ +#ifdef HAVE_XSAVE_INTRINSICS + return (_xgetbv(0) & 0xe0) != 0; +#else + return false; +#endif +} + +#endif /* TRY_POPCNT_FAST */ diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out index e17cbf42ca..6a436288bb 100644 --- a/src/test/regress/expected/bit.out +++ b/src/test/regress/expected/bit.out @@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10)); 10 (1 row) +SELECT bit_count(repeat('0', 100)::bit(100)); + bit_count +----------- + 0 +(1 row) + +SELECT bit_count(repeat('1', 100)::bit(100)); + bit_count +----------- + 100 +(1 row) + +SELECT bit_count(repeat('01', 500)::bit(1000)); + bit_count +----------- + 500 +(1 row) + +SELECT bit_count(repeat('10101', 200)::bit(1000)); + bit_count +----------- + 600 +(1 row) + -- This table is intentionally left around to exercise pg_dump/pg_upgrade CREATE TABLE bit_defaults( b1 bit(4) DEFAULT '1001', diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql index 34230b99fb..8ba6facd03 100644 --- a/src/test/regress/sql/bit.sql +++ b/src/test/regress/sql/bit.sql @@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20); -- bit_count SELECT bit_count(B'0101011100'::bit(10)); SELECT bit_count(B'1111111111'::bit(10)); +SELECT bit_count(repeat('0', 100)::bit(100)); +SELECT bit_count(repeat('1', 100)::bit(100)); +SELECT bit_count(repeat('01', 500)::bit(1000)); +SELECT bit_count(repeat('10101', 200)::bit(1000)); -- This table is intentionally left around to exercise pg_dump/pg_upgrade CREATE TABLE bit_defaults( -- 2.25.1
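
The runtime selection that patch 1/2 hooks into choose_popcount_functions() can be sketched in isolation as below. This is only an illustration of the function-pointer pattern the commit message describes, not the patch's code: every name is hypothetical, fake_fast_available stands in for a real probe such as pg_popcount_avx512_available(), and both implementations are portable scalar loops so the example builds anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

uint64_t    popcount_choose(const char *buf, int bytes);

/* Starts out pointing at the chooser; replaced on the first call. */
popcount_fn popcount_impl = popcount_choose;

/* Hypothetical stand-in for a CPUID/XGETBV-based capability probe. */
bool        fake_fast_available = true;

/* Fallback: examine every bit of every byte. */
uint64_t
popcount_slow(const char *buf, int bytes)
{
    uint64_t    cnt = 0;

    while (bytes-- > 0)
    {
        unsigned char b = (unsigned char) *buf++;

        for (; b != 0; b >>= 1)
            cnt += b & 1;
    }
    return cnt;
}

/* "Optimized" variant: clears the lowest set bit on each iteration. */
uint64_t
popcount_fast(const char *buf, int bytes)
{
    uint64_t    cnt = 0;

    while (bytes-- > 0)
    {
        unsigned char b = (unsigned char) *buf++;

        for (; b != 0; b &= (unsigned char) (b - 1))
            cnt++;
    }
    return cnt;
}

/*
 * First-call trampoline: pick an implementation once, store it in the
 * function pointer, then forward the call.  Later calls go straight to
 * the chosen implementation.
 */
uint64_t
popcount_choose(const char *buf, int bytes)
{
    popcount_impl = fake_fast_available ? popcount_fast : popcount_slow;
    assert(popcount_impl != popcount_choose);
    return popcount_impl(buf, bytes);
}
```

Callers always go through popcount_impl(buf, bytes), so the capability probe runs once, on the first call, and every later call costs a single indirect call.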
From 01e8c3fc481fda518b7d92cb6af044c6cda410e3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nat...@postgresql.org>
Date: Sat, 6 Apr 2024 11:55:48 -0500
Subject: [PATCH v28 2/2] Optimize visibilitymap_count() with AVX-512 instructions.

Thanks to the infrastructure added by commit XXXXXXXXXX, we can
pretty easily optimize this function with AVX-512 intrinsic
functions. A new pg_popcount_masked() function is introduced that
applies a bitmask to every byte in the buffer prior to calculating
the population count, which is used to filter out the all-visible
or all-frozen bits as needed. Platforms without AVX-512 support
should also see a nice speedup due to the reduced number of calls
to a function pointer.

Co-authored-by: Ants Aasma
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
---
 src/backend/access/heap/visibilitymap.c |  25 +----
 src/include/port/pg_bitutils.h          |  34 +++++++
 src/port/pg_bitutils.c                  | 126 ++++++++++++++++++++++++
 src/port/pg_popcount_avx512.c           |  61 ++++++++++++
 4 files changed, 226 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
 #define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
 
 /* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64	UINT64CONST(0x5555555555555555) /* The lower bit of each
-														 * bit pair */
-#define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
-														 * bit pair */
+#define VISIBLE_MASK8	(0x55)	/* The lower bit of each bit pair */
+#define FROZEN_MASK8	(0xaa)	/* The upper bit of each bit pair */
 
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
 	{
 		Buffer		mapBuffer;
 		uint64	   *map;
-		int			i;
 
 		/*
 		 * Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
 		 */
 		map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
 
-		StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
-						 "unsupported MAPSIZE");
-		if (all_frozen == NULL)
-		{
-			for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
-				nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
-		}
-		else
-		{
-			for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
-			{
-				nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
-				nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
-			}
-		}
+		nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+		if (all_frozen)
+			nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
 
 		ReleaseBuffer(mapBuffer);
 	}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index b453f84d8f..4d88478c9c 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,6 +303,7 @@ pg_ceil_log2_64(uint64 num)
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
 extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
 
 /*
  * We can also try to use the AVX-512 popcount instruction on some systems.
@@ -313,6 +314,7 @@ extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
 #else
@@ -320,6 +322,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
 extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
 
 #endif							/* TRY_POPCNT_FAST */
 
@@ -357,6 +360,37 @@ pg_popcount(const char *buf, int bytes)
 	return pg_popcount_optimized(buf, bytes);
 }
 
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+	/*
+	 * We set the threshold to the point at which we'll first use special
+	 * instructions in the optimized version.
+	 */
+#if SIZEOF_VOID_P >= 8
+	int			threshold = 8;
+#else
+	int			threshold = 4;
+#endif
+
+	if (bytes < threshold)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+		return popcnt;
+	}
+
+	return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 411be90f73..88bc5cdbb1 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,19 +106,23 @@ const uint8 pg_number_of_ones[256] = {
 static inline int pg_popcount32_slow(uint32 word);
 static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
 static inline int pg_popcount32_fast(uint32 word);
 static inline int pg_popcount64_fast(uint64 word);
 static uint64 pg_popcount_fast(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
 #endif							/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -156,17 +160,22 @@ choose_popcount_functions(void)
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
 		pg_popcount_optimized = pg_popcount_fast;
+		pg_popcount_masked_optimized = pg_popcount_masked_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
 		pg_popcount_optimized = pg_popcount_slow;
+		pg_popcount_masked_optimized = pg_popcount_masked_slow;
 	}
 
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 	if (pg_popcount_avx512_available())
+	{
 		pg_popcount_optimized = pg_popcount_avx512;
+		pg_popcount_masked_optimized = pg_popcount_masked_avx512;
+	}
 #endif
 }
 
@@ -191,6 +200,13 @@ pg_popcount_choose(const char *buf, int bytes)
 	return pg_popcount_optimized(buf, bytes);
 }
 
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions();
+	return pg_popcount_masked(buf, bytes, mask);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
@@ -271,6 +287,56 @@ pg_popcount_fast(const char *buf, int bytes)
 	return popcnt;
 }
 
+/*
+ * pg_popcount_masked_fast
+ *		Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+	uint64		popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+	/* Process in 64-bit chunks if the buffer is aligned */
+	uint64		maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+	if (buf == (const char *) TYPEALIGN(8, buf))
+	{
+		const uint64 *words = (const uint64 *) buf;
+
+		while (bytes >= 8)
+		{
+			popcnt += pg_popcount64_fast(*words++ & maskv);
+			bytes -= 8;
+		}
+
+		buf = (const char *) words;
+	}
+#else
+	/* Process in 32-bit chunks if the buffer is aligned. */
+	uint32		maskv = ~((uint32) 0) / 0xFF * mask;
+
+	if (buf == (const char *) TYPEALIGN(4, buf))
+	{
+		const uint32 *words = (const uint32 *) buf;
+
+		while (bytes >= 4)
+		{
+			popcnt += pg_popcount32_fast(*words++ & maskv);
+			bytes -= 4;
+		}
+
+		buf = (const char *) words;
+	}
+#endif
+
+	/* Process any remaining bytes */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+	return popcnt;
+}
+
 #endif							/* TRY_POPCNT_FAST */
 
 
@@ -370,6 +436,56 @@ pg_popcount_slow(const char *buf, int bytes)
 	return popcnt;
 }
 
+/*
+ * pg_popcount_masked_slow
+ *		Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+	uint64		popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+	/* Process in 64-bit chunks if the buffer is aligned */
+	uint64		maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+	if (buf == (const char *) TYPEALIGN(8, buf))
+	{
+		const uint64 *words = (const uint64 *) buf;
+
+		while (bytes >= 8)
+		{
+			popcnt += pg_popcount64_slow(*words++ & maskv);
+			bytes -= 8;
+		}
+
+		buf = (const char *) words;
+	}
+#else
+	/* Process in 32-bit chunks if the buffer is aligned. */
+	uint32		maskv = ~((uint32) 0) / 0xFF * mask;
+
+	if (buf == (const char *) TYPEALIGN(4, buf))
+	{
+		const uint32 *words = (const uint32 *) buf;
+
+		while (bytes >= 4)
+		{
+			popcnt += pg_popcount32_slow(*words++ & maskv);
+			bytes -= 4;
+		}
+
+		buf = (const char *) words;
+	}
+#endif
+
+	/* Process any remaining bytes */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+	return popcnt;
+}
+
 #ifndef TRY_POPCNT_FAST
 
 /*
@@ -401,4 +517,14 @@ pg_popcount_optimized(const char *buf, int bytes)
 	return pg_popcount_slow(buf, bytes);
 }
 
+/*
+ * pg_popcount_masked_optimized
+ *		Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
 #endif							/* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 0040361cf6..a52615eb8b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -79,4 +79,65 @@ pg_popcount_avx512(const char *buf, int bytes)
 	return _mm512_reduce_add_epi64(accum);
 }
 
+/*
+ * pg_popcount_masked_avx512
+ *		Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+	__m512i		val,
+				vmasked,
+				cnt;
+	__m512i		accum = _mm512_setzero_si512();
+	const char *final;
+	int			tail_idx;
+	__mmask64	bmask = ~UINT64CONST(0);
+	const __m512i maskv = _mm512_set1_epi8(mask);
+
+	/*
+	 * Align buffer down to avoid double load overhead from unaligned access.
+	 * Calculate a mask to ignore preceding bytes. Find start offset of final
+	 * iteration and number of valid bytes making sure that final iteration is
+	 * not empty.
+	 */
+	bmask <<= ((uintptr_t) buf) % sizeof(__m512i);
+	tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+	final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+	buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+	/*
+	 * Iterate through all but the final iteration. Starting from second
+	 * iteration, the start index mask is ignored.
+	 */
+	if (buf < final)
+	{
+		val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+		vmasked = _mm512_and_si512(val, maskv);
+		cnt = _mm512_popcnt_epi64(vmasked);
+		accum = _mm512_add_epi64(accum, cnt);
+
+		buf += sizeof(__m512i);
+		bmask = ~UINT64CONST(0);
+
+		for (; buf < final; buf += sizeof(__m512i))
+		{
+			val = _mm512_load_si512((const __m512i *) buf);
+			vmasked = _mm512_and_si512(val, maskv);
+			cnt = _mm512_popcnt_epi64(vmasked);
+			accum = _mm512_add_epi64(accum, cnt);
+		}
+	}
+
+	/* Final iteration needs to ignore bytes that are not within the length */
+	bmask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+	val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+	vmasked = _mm512_and_si512(val, maskv);
+	cnt = _mm512_popcnt_epi64(vmasked);
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
+}
+
 #endif							/* TRY_POPCNT_FAST */
-- 
2.25.1

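For anyone reading along who wants to convince themselves of the
per-byte mask trick used in the non-AVX-512 paths, here is a small
standalone sketch.  It is not the patch code: it uses plain C99 types, a
portable bit-clearing loop instead of pg_popcount64(), and memcpy()
instead of the alignment check, and all function names are hypothetical.
It only illustrates that dividing an all-ones word by 0xFF yields
0x0101...01, so multiplying by the 8-bit mask replicates that mask into
every byte, and that the word-at-a-time count then agrees with the
byte-at-a-time count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable popcount of one 64-bit word (clears the lowest set bit each pass). */
static int
sketch_popcount64(uint64_t word)
{
	int			n = 0;

	while (word != 0)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

/* ~0 / 0xFF is 0x0101010101010101; times mask copies mask into each byte. */
static uint64_t
sketch_replicate_mask(uint8_t mask)
{
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Reference version: apply the mask and count one byte at a time. */
static uint64_t
sketch_popcount_masked_bytes(const unsigned char *buf, size_t bytes,
							 uint8_t mask)
{
	uint64_t	popcnt = 0;

	while (bytes--)
		popcnt += sketch_popcount64((uint64_t) (*buf++ & mask));
	return popcnt;
}

/* Word version: 8 bytes per step, then the leftover tail bytewise. */
static uint64_t
sketch_popcount_masked_words(const unsigned char *buf, size_t bytes,
							 uint8_t mask)
{
	uint64_t	maskv = sketch_replicate_mask(mask);
	uint64_t	popcnt = 0;

	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* no alignment assumption */
		popcnt += sketch_popcount64(word & maskv);
		buf += 8;
		bytes -= 8;
	}
	return popcnt + sketch_popcount_masked_bytes(buf, bytes, mask);
}

/* Self-check: 19 bytes of 0xFF have 4 "lower" and 4 "upper" bits each. */
static void
sketch_self_check(void)
{
	unsigned char buf[19];

	memset(buf, 0xFF, sizeof(buf));
	assert(sketch_replicate_mask(0x55) == UINT64_C(0x5555555555555555));
	assert(sketch_popcount_masked_words(buf, sizeof(buf), 0x55) == 76);
	assert(sketch_popcount_masked_words(buf, sizeof(buf), 0xaa) == 76);
}
```

With the 0x55 and 0xaa masks this is the same split that
visibilitymap_count() makes between the lower and upper bit of each bit
pair.  The byte order of the loaded word does not matter here because
the replicated mask is identical in every byte.
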