https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #16)
> On an unrelated note, why does the loop overhead use so many instructions?
> Assuming that I am as misguided as gcc about load-op combining, I would
> write it as:
>         sub          %rax, %rdx
> .L3:
>         vmovupd      (%rdx,%rax), %ymm1
>         vmovupd      32(%rdx,%rax), %ymm0
>         vfmadd213pd  32(%rax), %ymm3, %ymm1
>         vfnmadd213pd (%rax), %ymm2, %ymm0
>         vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0
>         vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
>         vmovupd      %ymm0, (%rax)
>         vmovupd      %ymm1, 32(%rax)
>         addq         $64, %rax
>         decl         %esi
>         jb           .L3
>
> The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
> vs. 5 x86 instructions == 4 macro-ops in the gcc variant.
> Also, in the gcc variant all memory accesses have a displacement, which
> makes them 1 byte longer. In my variant only half of the accesses have a
> displacement.
>
> I think in the past I have seen cases where gcc generates optimal or
> near-optimal code sequences for loop overhead. I wonder why it cannot do
> it here.

I don't think we currently consider IVs based on the difference of two
addresses.  The cost benefit of no displacement is only size; otherwise I
have no idea why we have biased the %rax accesses by -32.

Why we fail to consider decrement-to-zero for the counter IV is probably
because IVCANON would add such an IV, but the vectorizer replaces it and
IVOPTs doesn't consider re-adding it.