[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

rguenth at gcc dot gnu.org Mon, 11 Mar 2019 07:01:38 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561


--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-checking today: we reject AVX vectorization via the cost model but still do
SSE vectorization.  With versioning for alias we could also SLP vectorize this,
keeping the loop body smaller and avoiding an epilogue, especially since we
end up without any vector load or store anyway.

Of course SLP analysis requires a grouped store, which we do not have since
we do not identify XPQKL(MPQ,MKL) and XPQKL(MRS,MKL) as such (they are not a
group when MPQ == MRS, but the runtime alias check ensures that is not the case).
That is, we miss "strided group" detection or, more generally, SLP forming via
a different mechanism.
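
To make the versioning idea concrete, here is a minimal C sketch of the source-level
shape, not what GCC emits.  The function names, the arguments and the simplified
update formula are hypothetical stand-ins for the Fortran kernel; the sketch assumes
0 <= mpq, mrs < ld and that x and v do not overlap, so that mpq != mrs is a
sufficient alias check.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: two strided read-modify-write updates per iteration,
   standing in for XPQKL(MPQ,MKL) and XPQKL(MRS,MKL). */
static void update_scalar(double *x, ptrdiff_t ld, ptrdiff_t mpq, ptrdiff_t mrs,
                          const double *v, ptrdiff_t n, double a, double b)
{
    for (ptrdiff_t k = 0; k < n; ++k) {
        x[mpq + k * ld] += a * v[k];
        x[mrs + k * ld] += b * v[k];
    }
}

/* Versioned loop: a runtime alias check selects between a copy in which
   the two stores are known to be independent and the unchanged loop. */
static void update_versioned(double *x, ptrdiff_t ld, ptrdiff_t mpq, ptrdiff_t mrs,
                             const double *v, ptrdiff_t n, double a, double b)
{
    /* Assumed precondition: both indices lie within the leading dimension. */
    assert(0 <= mpq && mpq < ld && 0 <= mrs && mrs < ld);

    if (mpq != mrs) {
        /* No alias: both loads may be hoisted above both stores, so the
           two updates form one two-lane group per iteration. */
        for (ptrdiff_t k = 0; k < n; ++k) {
            double p = x[mpq + k * ld];
            double r = x[mrs + k * ld];
            p += a * v[k];
            r += b * v[k];
            x[mpq + k * ld] = p;
            x[mrs + k * ld] = r;
        }
    } else {
        /* mpq == mrs: the second update depends on the first, keep order. */
        update_scalar(x, ld, mpq, mrs, v, n, a, b);
    }
}
```

In the no-alias copy the two updates sit side by side with no dependence between
them, which is the pairing a "strided group" detection would have to discover.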

That said, I have a hard time thinking of a heuristic aligning with reality
(it's of course possible to come up with a hack).

Generally we'd need to work towards doing the versioning / cost model checks
on outer loops, but a better versioning condition would be a
prerequisite for this.

I'm out of ideas suitable for GCC 9 (besides reverting the patch, which would
return us to a bogus state).

Scalar inner loop assembly:

.L8:
        vmulsd  (%rax,%rdi,8), %xmm3, %xmm0
        incl    %ecx
        vfmadd231sd     (%rax), %xmm4, %xmm0
        vfmadd213sd     (%rdx), %xmm6, %xmm0
        vmovsd  %xmm0, (%rdx)
        vmulsd  (%rax,%r8,8), %xmm1, %xmm0
        vfmadd231sd     (%rax,%r10,8), %xmm2, %xmm0
        addq    %r15, %rax
        vfmadd213sd     (%rdx,%rsi,8), %xmm5, %xmm0
        vmovsd  %xmm0, (%rdx,%rsi,8)
        addq    %rbp, %rdx
        cmpl    %r9d, %ecx
        jne     .L8

Vectorized inner loop assembly:

.L9:
        vmovsd  (%r10,%rcx), %xmm13
        vmovsd  (%rdx), %xmm0
        incl    %r14d
        vmovhpd (%r10,%rsi), %xmm13, %xmm13
        vmovhpd (%rdx,%r13), %xmm0, %xmm14
        vmovsd  (%rdi,%rcx), %xmm0
        vmulpd  %xmm9, %xmm13, %xmm13
        vmovhpd (%rdi,%rsi), %xmm0, %xmm0
        vfmadd132pd     %xmm10, %xmm13, %xmm0
        vfmadd132pd     %xmm12, %xmm14, %xmm0
        vmovlpd %xmm0, (%rdx)
        vmovhpd %xmm0, (%rdx,%r13)
        vmovsd  (%r8,%rcx), %xmm13
        vmovsd  (%rax), %xmm0
        addq    %r11, %rdx
        vmovhpd (%r8,%rsi), %xmm13, %xmm13
        vmovhpd (%rax,%r13), %xmm0, %xmm14
        vmovsd  (%r9,%rcx), %xmm0
        addq    %rbx, %rcx
        vmulpd  %xmm7, %xmm13, %xmm13
        vmovhpd (%r9,%rsi), %xmm0, %xmm0
        addq    %rbx, %rsi
        vfmadd132pd     %xmm8, %xmm13, %xmm0
        vfmadd132pd     %xmm11, %xmm14, %xmm0
        vmovlpd %xmm0, (%rax)
        vmovhpd %xmm0, (%rax,%r13)
        addq    %r11, %rax
        cmpl    %r14d, %r15d
        jne     .L9

Only the outer loop context and knowledge of the low trip count make this bad.

The cost model doesn't know that the scalar loop can execute as if it were
vectorized, given the CPU's plentiful resources (speculating
non-dependence), whereas the vector variant introduces more constraints
on pipelining due to data dependences from using vectors.  But
even IACA doesn't tell us the differences are big.
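
As a rough illustration of where the extra dependences come from, the following C
sketch uses SSE2 intrinsics to mimic the movsd/movhpd/movlpd pattern of the
vectorized loop on a simplified strided update.  The function names and the update
formula are hypothetical; x and v are assumed not to overlap, and the sketch is
x86-only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar reference: each iteration is an independent load, FP op, store. */
static void axpy_strided_scalar(double *x, ptrdiff_t ld, const double *v,
                                ptrdiff_t vs, ptrdiff_t n, double a)
{
    for (ptrdiff_t k = 0; k < n; ++k)
        x[k * ld] += a * v[k * vs];
}

/* Two iterations per vector, but with no real vector load or store. */
static void axpy_strided_sse2(double *x, ptrdiff_t ld, const double *v,
                              ptrdiff_t vs, ptrdiff_t n, double a)
{
    const __m128d va = _mm_set1_pd(a);
    ptrdiff_t k = 0;

    for (; k + 1 < n; k += 2) {
        /* Each vector is assembled from two scalar loads (movsd + movhpd);
           the high-half insert depends on the low-half load. */
        __m128d xv = _mm_loadh_pd(_mm_load_sd(&x[k * ld]), &x[(k + 1) * ld]);
        __m128d vv = _mm_loadh_pd(_mm_load_sd(&v[k * vs]), &v[(k + 1) * vs]);

        /* The arithmetic now waits for both lanes of both operands. */
        xv = _mm_add_pd(xv, _mm_mul_pd(va, vv));

        /* The result is torn apart again (movlpd + movhpd). */
        _mm_storel_pd(&x[k * ld], xv);
        _mm_storeh_pd(&x[(k + 1) * ld], xv);
    }

    for (; k < n; ++k)  /* scalar epilogue for an odd trip count */
        x[k * ld] += a * v[k * vs];
}
```

In the scalar version the two iterations share no registers, so an out-of-order core
is free to overlap them; in the SSE2 version they are tied together through xv and vv,
and a low trip count leaves little work to amortize the packing and unpacking.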
