https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86029

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #2 from ktkachov at gcc dot gnu.org ---
(In reply to Tavian Barnes from comment #1)
> Maybe a dupe of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70291?  In the
> -O3 version, __mulsc3() dominates the profile.
> 
>        │                 for(int i=0; i<=decimate_taps_length; i++) decim += samplebuf[i] * decimate_taps[i];
>   0.20 │430:┌─→vmovss 0x4(%r13,%rbx,1),%xmm1
>   3.63 │    │  vmovss 0x0(%r13,%rbx,1),%xmm0
>  12.35 │    │  vmovss 0x4(%r12,%rbx,1),%xmm3
>   0.31 │    │  vmovss (%r12,%rbx,1),%xmm2
>   0.02 │    │  add    $0x8,%rbx
>  36.48 │    │→ callq  __mulsc3
>   0.01 │    │  vmovss -0x78(%rbp),%xmm6
>   0.00 │    │  vmovss -0x80(%rbp),%xmm4
>  23.70 │    │  vmovq  %xmm0,-0x68(%rbp)
>  14.25 │    │  vaddss -0x68(%rbp),%xmm6,%xmm5
>   1.54 │    │  vaddss -0x64(%rbp),%xmm4,%xmm7
>   0.48 │    │  vmovss %xmm5,-0x78(%rbp)
>   5.92 │    │  vmovss %xmm7,-0x80(%rbp)
>        │    ├──cmp    $0x2590,%rbx
>   0.01 │    └──jne    430
> 
> At -Ofast,
> 
>        │                 for(int i=0; i<=decimate_taps_length; i++) decim += samplebuf[i] * decimate_taps[i];
>   9.36 │5e0:   vpermilps $0xf5,(%r12,%rax,1),%ymm0
>  15.56 │       vpermilps $0xa0,(%r12,%rax,1),%ymm1
>  11.24 │       vmulps (%rbx,%rax,1),%ymm0,%ymm0
>  17.55 │       vpermilps $0xb1,(%rbx,%rax,1),%ymm4
>   3.31 │       add    $0x20,%rax
>   2.11 │       vmovaps %ymm1,%ymm3
>   6.62 │       vfmadd132ps %ymm4,%ymm0,%ymm3
>   3.79 │       vfmsub231ps %ymm4,%ymm1,%ymm0
>   2.91 │       vblendps $0xaa,%ymm0,%ymm3,%ymm0
>  10.75 │       vaddps %ymm0,%ymm6,%ymm6
>        │       cmp    $0x2580,%rax
>   5.59 │     ↑ jne    5e0
>   0.01 │       vmovss 0x258c(%rbx),%xmm0
>   0.01 │       vmovss -0x70(%rbp),%xmm7
>   0.01 │       vmovss %xmm5,-0xd0(%rbp)
>   0.05 │       vextractf128 $0x1,%ymm6,%xmm3
>   0.01 │       vmovss 0x2588(%rbx),%xmm8
>   0.03 │       vshufps $0xff,%xmm3,%xmm3,%xmm13

Looks like it. Could you try this with current trunk?
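
For anyone who wants to compare the two code paths outside the original program, here is a minimal sketch of the same loop shape. The function name cdot and its parameters are made up for illustration and are not taken from the reporter's code. Built with gcc -O3 -S, the multiply would be expected to go through __mulsc3, which handles the NaN/Inf corner cases of complex multiplication. -Ofast turns on -ffast-math, which includes -fcx-limited-range, so the multiply can be expanded inline and vectorized as in the second profile. Adding -fcx-limited-range alone to -O3 may be enough to drop the call, but I have not checked that against this testcase.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop in the quoted profile: a complex
 * single-precision multiply-accumulate over two arrays. The name and
 * signature are illustrative only. */
float complex cdot(const float complex *a, const float complex *b, size_t n)
{
    float complex acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Without -fcx-limited-range (or -ffast-math), GCC may emit a
         * call to __mulsc3 for this multiply. */
        acc += a[i] * b[i];
    }
    return acc;
}
```

Grepping the generated assembly for __mulsc3 at each optimization level should show whether the library call is still present.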
