https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86029
ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #2 from ktkachov at gcc dot gnu.org ---
(In reply to Tavian Barnes from comment #1)
> Maybe a dupe of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70291?  In the
> -O3 version, __mulsc3() dominates the profile.
>
>        │      for(int i=0; i<=decimate_taps_length; i++) decim +=
> samplebuf[i] * decimate_taps[i];
>   0.20 │430:┌─→vmovss 0x4(%r13,%rbx,1),%xmm1
>   3.63 │    │  vmovss 0x0(%r13,%rbx,1),%xmm0
>  12.35 │    │  vmovss 0x4(%r12,%rbx,1),%xmm3
>   0.31 │    │  vmovss (%r12,%rbx,1),%xmm2
>   0.02 │    │  add    $0x8,%rbx
>  36.48 │    │→ callq  __mulsc3
>   0.01 │    │  vmovss -0x78(%rbp),%xmm6
>   0.00 │    │  vmovss -0x80(%rbp),%xmm4
>  23.70 │    │  vmovq  %xmm0,-0x68(%rbp)
>  14.25 │    │  vaddss -0x68(%rbp),%xmm6,%xmm5
>   1.54 │    │  vaddss -0x64(%rbp),%xmm4,%xmm7
>   0.48 │    │  vmovss %xmm5,-0x78(%rbp)
>   5.92 │    │  vmovss %xmm7,-0x80(%rbp)
>        │    ├──cmp    $0x2590,%rbx
>   0.01 │    └──jne    430
>
> At -Ofast,
>
>        │      for(int i=0; i<=decimate_taps_length; i++) decim +=
> samplebuf[i] * decimate_taps[i];
>   9.36 │5e0:   vpermilps    $0xf5,(%r12,%rax,1),%ymm0
>  15.56 │       vpermilps    $0xa0,(%r12,%rax,1),%ymm1
>  11.24 │       vmulps       (%rbx,%rax,1),%ymm0,%ymm0
>  17.55 │       vpermilps    $0xb1,(%rbx,%rax,1),%ymm4
>   3.31 │       add          $0x20,%rax
>   2.11 │       vmovaps      %ymm1,%ymm3
>   6.62 │       vfmadd132ps  %ymm4,%ymm0,%ymm3
>   3.79 │       vfmsub231ps  %ymm4,%ymm1,%ymm0
>   2.91 │       vblendps     $0xaa,%ymm0,%ymm3,%ymm0
>  10.75 │       vaddps       %ymm0,%ymm6,%ymm6
>        │       cmp          $0x2580,%rax
>   5.59 │     ↑ jne          5e0
>   0.01 │       vmovss       0x258c(%rbx),%xmm0
>   0.01 │       vmovss       -0x70(%rbp),%xmm7
>   0.01 │       vmovss       %xmm5,-0xd0(%rbp)
>   0.05 │       vextractf128 $0x1,%ymm6,%xmm3
>   0.01 │       vmovss       0x2588(%rbx),%xmm8
>   0.03 │       vshufps      $0xff,%xmm3,%xmm3,%xmm13

Looks like so. Could you try this out with current trunk?
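
For anyone retesting, a reduced test case along the following lines may be a convenient starting point. It is only a sketch modelled on the loop in the quoted profile, not the reporter's actual source: the function name dot_product and its parameters are made up for illustration, and the loop bound is simplified to i < n.

```c
#include <complex.h>

/* Hypothetical reduction of the loop shown in the quoted profile:
   a single-precision complex multiply-accumulate over two buffers.
   The name and signature are assumptions, not the reporter's code. */
float complex dot_product(const float complex *samplebuf,
                          const float complex *taps, int n)
{
    float complex decim = 0.0f;
    for (int i = 0; i < n; i++)
        decim += samplebuf[i] * taps[i];  /* complex float multiply */
    return decim;
}
```

Comparing the output of "gcc -O3 -S" and "gcc -Ofast -S" on this file, and searching each for __mulsc3, should show whether a given compiler still emits the libcall. -Ofast implies -ffast-math, which enables -fcx-limited-range, so "gcc -O3 -fcx-limited-range -S" is another data point worth checking. Whether the call shows up at plain -O3 is expected to depend on the GCC version.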