https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
--- Comment #2 from Richard Biener ---
(In reply to Hongtao.liu from comment #1)
> For integers, we have _mm512_mask_reduce_add_epi32 defined as
>
> extern __inline int
> __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
> {
>   __A = _mm512_maskz_mov_epi32 (__U, __A);
>   __MM512_REDUCE_OP (+);
> }
>
> #undef __MM512_REDUCE_OP
> #define __MM512_REDUCE_OP(op) \
>   __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1); \
>   __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0); \
>   __m256i __T3 = (__m256i) (__T1 op __T2); \
>   __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1); \
>   __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0); \
>   __v4si __T6 = __T4 op __T5; \
>   __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 }); \
>   __v4si __T8 = __T6 op __T7; \
>   return __T8[0] op __T8[1]
>
> There's a corresponding floating point version, but it doesn't use in-order adds.
It also doesn't handle signed zeros correctly; that would require not
using _mm512_maskz_mov_epi32 but merge masking with { -0.0, -0.0, ... }
for FP. Of course, since it's not doing in-order processing anyway, not
handling signed zeros correctly might be a minor thing.
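A minimal sketch of that merge-masking idea (illustration only, not the
actual header code): inactive lanes are filled from a vector of -0.0,
which is the additive identity under the default rounding mode, so they
drop out of the subsequent reduction adds without disturbing signed zeros.

#include <immintrin.h>

static inline __m512d
mask_fp_add_neutral (__mmask8 __U, __m512d __A)
{
  /* Lanes selected by __U keep __A; the rest become -0.0.
     x + -0.0 == x for every x under round-to-nearest, whereas
     using +0.0 would turn a -0.0 result into +0.0.  */
  return _mm512_mask_mov_pd (_mm512_set1_pd (-0.0), __U, __A);
}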
So yes, we're looking for -O3 without -ffast-math vectorization of
a conditional reduction that's currently not supported (correctly).
double a[1024];
double foo()
{
  double res = 0.0;
  for (int i = 0; i < 1024; ++i)
    {
      if (a[i] < 0.)
        res += a[i];
    }
  return res;
}
should be vectorizable also with -frounding-math (where the trick of using
-0.0 for the masked elements doesn't work). Currently we are using 0.0 for
them (but there's a pending patch).
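The reason the -0.0 trick breaks with -frounding-math: IEEE 754 gives an
exactly-zero sum the sign -0 when rounding toward negative infinity, so
+0.0 + -0.0 evaluates to -0.0 there, and -0.0 is no longer a neutral
element for the masked lanes. A small stand-alone demonstration (compile
with -frounding-math, link with -lm):

#include <fenv.h>
#include <stdio.h>

int
main (void)
{
  volatile double p0 = 0.0, m0 = -0.0;   /* volatile blocks constant folding */
  fesetround (FE_DOWNWARD);
  printf ("%g\n", p0 + m0);   /* prints -0 */
  fesetround (FE_TONEAREST);
  printf ("%g\n", p0 + m0);   /* prints 0 */
  return 0;
}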
Maybe we don't care about -frounding-math, and so -0.0 adds are OK. We
get something like the following with znver4; it could be that trying
to optimize the case of a sparse mask with vcompress isn't worth it:
.L2:
vmovapd (%rax), %zmm1
addq    $64, %rax
vminpd %zmm5, %zmm1, %zmm1
valignq $3, %ymm1, %ymm1, %ymm2
vunpckhpd %xmm1, %xmm1, %xmm3
vaddsd %xmm1, %xmm0, %xmm0
vaddsd %xmm3, %xmm0, %xmm0
vextractf64x2 $1, %ymm1, %xmm3
vextractf64x4 $0x1, %zmm1, %ymm1
vaddsd %xmm3, %xmm0, %xmm0
vaddsd %xmm2, %xmm0, %xmm0
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm1, %xmm0, %xmm0
vaddsd %xmm2, %xmm0, %xmm0
vextractf64x2 $1, %ymm1, %xmm2
valignq $3, %ymm1, %ymm1, %ymm1
vaddsd %xmm2, %xmm0, %xmm0
vaddsd %xmm1, %xmm0, %xmm0
cmpq    $a+8192, %rax
jne .L2
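In intrinsics the generated loop corresponds roughly to the following
hand transliteration (my sketch, not compiler output; %zmm5 presumably
holds a vector of 0.0): vminpd implements the a[i] < 0. selection by
replacing non-negative lanes with 0.0, and the eight lane adds are kept
in source order.

#include <immintrin.h>

extern double a[1024];

double
foo_vectorized (void)
{
  double res = 0.0;
  const __m512d zero = _mm512_setzero_pd ();
  for (int i = 0; i < 1024; i += 8)
    {
      /* min(a[i], 0.0): negative elements survive, others become 0.0
         (this is where the 0.0-instead-of--0.0 issue comes from).  */
      __m512d v = _mm512_min_pd (_mm512_loadu_pd (&a[i]), zero);
      double lane[8];
      _mm512_storeu_pd (lane, v);
      for (int j = 0; j < 8; ++j)
        res += lane[j];   /* in-order adds, matching the vaddsd chain */
    }
  return res;
}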