[Bug target/111874] Missed mask_fold_left_plus with AVX512

2023-11-12 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

Andrew Pinski  changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
   Last reconfirmed|            |2023-11-12
           Severity|normal      |enhancement
             Status|UNCONFIRMED |NEW
     Ever confirmed|0           |1

--- Comment #4 from Andrew Pinski  ---
.

[Bug target/111874] Missed mask_fold_left_plus with AVX512

2023-10-23 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #3 from Hongtao.liu  ---
> For the case of conditional (or loop masked) fold-left reductions the scalar
> fallback isn't implemented.  But AVX512 has vpcompress that could be used
> to implement a more efficient sequence for a masked fold-left, possibly
> using a loop and population count of the mask.
There are extra kmov + vpcompress + popcnt instructions; I'm afraid the
performance could be worse than the scalar version.
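
A minimal sketch of the sequence being weighed here, for the FP fold-left
case (hypothetical helper, assuming AVX-512F; this is not code from GCC,
just an illustration of the vpcompress + popcount idea):

#include <immintrin.h>

/* Illustration only: masked in-order (fold-left) add of the active lanes
   of an 8 x double vector.  vcompresspd packs the active lanes to the
   front, popcnt of the mask gives their count, and the scalar loop keeps
   strict left-to-right FP semantics.  */
static double
masked_fold_left_add_pd (double acc, __m512d v, __mmask8 m)
{
  double tmp[8];
  __m512d packed = _mm512_maskz_compress_pd (m, v);   /* vcompresspd */
  _mm512_storeu_pd (tmp, packed);
  int n = _mm_popcnt_u32 ((unsigned) m);              /* popcnt of the mask */
  for (int i = 0; i < n; ++i)
    acc += tmp[i];                                    /* in-order adds */
  return acc;
}

Per vector iteration this costs the kmov + vpcompress + popcnt mentioned
above, plus a store and a data-dependent scalar loop, which is the overhead
being compared against the plain scalar version.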

[Bug target/111874] Missed mask_fold_left_plus with AVX512

2023-10-19 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #2 from Richard Biener  ---
(In reply to Hongtao.liu from comment #1)
> For the integer case, we have _mm512_mask_reduce_add_epi32 defined as
> 
> extern __inline int
> __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
> {
>   __A = _mm512_maskz_mov_epi32 (__U, __A);
>   __MM512_REDUCE_OP (+);
> }
> 
> #undef __MM512_REDUCE_OP
> #define __MM512_REDUCE_OP(op) \
>   __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);  \
>   __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);  \
>   __m256i __T3 = (__m256i) (__T1 op __T2);\
>   __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);  \
>   __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);  \
>   __v4si __T6 = __T4 op __T5; \
>   __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 });\
>   __v4si __T8 = __T6 op __T7; \
>   return __T8[0] op __T8[1]
> 
> There's a corresponding floating point version, but it doesn't do in-order adds.

It also doesn't handle signed zeros correctly; that would require not using
_mm512_maskz_mov_epi32 but instead merge masking with { -0.0, -0.0, ... }
for FP.  Of course, since it's not doing in-order processing anyway, not
handling signed zeros correctly might be a minor thing.
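
Roughly, that merge-masking variant would look like this (a hypothetical
sketch, not the actual avx512fintrin.h code; -0.0 is the additive identity
under round-to-nearest, so x + (-0.0) == x even for x == -0.0):

#include <immintrin.h>

/* Sketch: instead of zero-masking (which puts +0.0 into masked-out lanes
   and can turn a -0.0 result into +0.0), merge the masked-out lanes with
   -0.0 before reducing.  */
static __m512d
mask_with_neg_zero_pd (__mmask8 m, __m512d v)
{
  const __m512d neg_zero = _mm512_set1_pd (-0.0);
  /* Lanes whose mask bit is set keep v; the rest become -0.0.  */
  return _mm512_mask_mov_pd (neg_zero, m, v);
}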

So yes, we're looking for vectorization, at -O3 without -ffast-math, of a
conditional reduction that's currently not supported (correctly).

double a[1024];
double foo()
{
  double res = 0.0;
  for (int i = 0; i < 1024; ++i)
    {
      if (a[i] < 0.)
        res += a[i];
    }
  return res;
}

should be vectorizable also with -frounding-math (where the trick using
-0.0 for masked elements doesn't work).  Currently we are using 0.0 for
them (but there's a pending patch).
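
The reason -0.0 stops being a safe neutral element under -frounding-math is
the directed rounding modes: with rounding toward negative infinity,
+0.0 + (-0.0) yields -0.0, so padding a reduction whose active lanes sum to
+0.0 can flip the sign of the result.  A small stand-alone demonstration
(not from this PR; compile with -frounding-math so the add isn't folded):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Under round-to-nearest, x + (-0.0) == x for every x, so -0.0 is a safe
   neutral element.  Under FE_DOWNWARD that no longer holds for x == +0.0:
   the exactly-zero sum is given a negative sign.  */
int
main (void)
{
  fesetround (FE_DOWNWARD);
  volatile double x = 0.0;
  double r = x + (-0.0);
  printf ("signbit = %d\n", signbit (r) != 0);   /* prints 1 */
  return 0;
}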

Maybe we don't care about -frounding-math, in which case the -0.0 adds are OK.
With znver4 we get something like the following; it could be that trying
to optimize the case of a sparse mask with vcompress isn't worth it:

.L2:
vmovapd (%rax), %zmm1
addq    $64, %rax
vminpd  %zmm5, %zmm1, %zmm1
valignq $3, %ymm1, %ymm1, %ymm2
vunpckhpd   %xmm1, %xmm1, %xmm3
vaddsd  %xmm1, %xmm0, %xmm0
vaddsd  %xmm3, %xmm0, %xmm0
vextractf64x2   $1, %ymm1, %xmm3
vextractf64x4   $0x1, %zmm1, %ymm1
vaddsd  %xmm3, %xmm0, %xmm0
vaddsd  %xmm2, %xmm0, %xmm0
vunpckhpd   %xmm1, %xmm1, %xmm2
vaddsd  %xmm1, %xmm0, %xmm0
vaddsd  %xmm2, %xmm0, %xmm0
vextractf64x2   $1, %ymm1, %xmm2
valignq $3, %ymm1, %ymm1, %ymm1
vaddsd  %xmm2, %xmm0, %xmm0
vaddsd  %xmm1, %xmm0, %xmm0
cmpq    $a+8192, %rax
jne .L2

[Bug target/111874] Missed mask_fold_left_plus with AVX512

2023-10-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #1 from Hongtao.liu  ---
For the integer case, we have _mm512_mask_reduce_add_epi32 defined as

extern __inline int
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
{
  __A = _mm512_maskz_mov_epi32 (__U, __A);
  __MM512_REDUCE_OP (+);
}

#undef __MM512_REDUCE_OP
#define __MM512_REDUCE_OP(op) \
  __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);\
  __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);\
  __m256i __T3 = (__m256i) (__T1 op __T2);  \
  __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);\
  __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);\
  __v4si __T6 = __T4 op __T5;   \
  __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 });  \
  __v4si __T8 = __T6 op __T7;   \
  return __T8[0] op __T8[1]

There's a corresponding floating point version, but it doesn't do in-order adds.
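
For reference, a minimal (hypothetical) use of that intrinsic, summing only
the lanes below a threshold; for integers the order of the adds does not
matter, which is why the tree-shaped __MM512_REDUCE_OP expansion is fine:

#include <immintrin.h>

/* Illustration only: sum the elements of a[0..15] that are < threshold
   using the masked reduction intrinsic quoted above.  */
static int
masked_sum_below (const int *a, int threshold)
{
  __m512i v = _mm512_loadu_si512 ((const void *) a);
  __mmask16 m = _mm512_cmplt_epi32_mask (v, _mm512_set1_epi32 (threshold));
  return _mm512_mask_reduce_add_epi32 (m, v);
}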