https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122746

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to vekumar from comment #1)
> GCC 16 (1/1/26)
> --Snip--
> .L4:
>         vmovupd (%rsi,%rax), %zmm3
>         vmovupd 64(%rsi,%rax), %zmm2
>         vaddpd  %xmm3, %xmm1, %xmm1
>         vextracti32x4   $1, %zmm3, %xmm4
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vextracti32x4   $2, %zmm3, %xmm4
>         vextracti32x4   $3, %zmm3, %xmm3
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vaddpd  %xmm3, %xmm1, %xmm1
> --Snip--        
>       
> On Zen4/5, the GCC trunk code is poor: it generates the high-latency
> vextracti32x4 (5 cycles) to perform the in-order reduction. On these
> targets, wider-to-narrow operations should be costed higher and avoided.
> 
> GCC 15 uses "vinsertf64x2". Inserts are cheaper, and vectorizing at the
> YMM level seems better here.
> 
> vmovsd  (%rdx), %xmm0
> vmovhpd 8(%rdx), %xmm0, %xmm2         <== This can be optimized to a single load.
> vmovupd (%rax), %xmm0
> vinsertf64x2    $0x1, %xmm2, %ymm0, %ymm0
> vaddpd  %ymm1, %ymm0, %ymm0

This isn't code for the testcase in this bug?
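
For reference, a minimal illustrative sketch of the kind of in-order
double-precision reduction under discussion; this is a hypothetical
example, not necessarily the testcase attached to this PR, and the exact
codegen will depend on flags and target (compiled without -ffast-math so
the additions must stay in source order):

  /* Hypothetical in-order reduction sketch, not the PR's testcase.
     Without -ffast-math the vectorizer must preserve the order of the
     FP additions (a fold-left reduction), which with 512-bit vectors
     involves extracting sub-vectors from the wide load and adding them
     to the accumulator sequentially.  */
  double
  sum (const double *a, long n)
  {
    double s = 0.0;
    for (long i = 0; i < n; i++)
      s += a[i];
    return s;
  }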
