https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291

            Bug ID: 98291
           Summary: multiple scalar FP accumulators auto-vectorize worse
                    than scalar, including vector load + merge instead of
                    scalar + high-half insert
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess,
instead of effectively mapping each scalar accumulator to an element of one
vector accumulator.  (Unless we use -ffast-math, in which case that does
happen.  clang gets it right even without -ffast-math.)

double dotprod(const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0;
  double d2 = 0.0;

  for (unsigned long long i = 0; i < n; i += 2) {
    d1 += a[i] * b[i];
    d2 += a[i + 1] * b[i + 1];
  }

  return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math we get the nice, sane loop we expect:

.L3:
        movupd  (%rsi,%rax), %xmm0
        movupd  (%rdi,%rax), %xmm3
        addq    $1, %rdx
        addq    $16, %rax
        mulpd   %xmm3, %xmm0
        addpd   %xmm0, %xmm1
        cmpq    %rcx, %rdx
        jb      .L3


Without -ffast-math:

...
main loop
.L4:
        movupd  (%rcx,%rax), %xmm1        # 16-byte load
        movupd  (%rsi,%rax), %xmm3 
        movhpd  16(%rcx,%rax), %xmm1      # overwrite the high half of it!!
        movhpd  16(%rsi,%rax), %xmm3
        mulpd   %xmm3, %xmm1
        movupd  16(%rsi,%rax), %xmm3
        movlpd  8(%rsi,%rax), %xmm3
        addsd   %xmm1, %xmm2
        unpckhpd        %xmm1, %xmm1
        addsd   %xmm1, %xmm2
        movupd  16(%rcx,%rax), %xmm1
        movlpd  8(%rcx,%rax), %xmm1
        addq    $32, %rax
        mulpd   %xmm3, %xmm1
        addsd   %xmm1, %xmm0
        unpckhpd        %xmm1, %xmm1
        addsd   %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L4

The overall strategy is insane, but even some of the details are insane on
their own.  e.g. doing a 16-byte load into XMM1 and then overwriting the high
half of it with a different double before ever reading it.  Manually gathering
2 doubles is bad enough, but you'd expect movsd / movhpd for that, rather than
introducing the possibility of a cache-line split load for zero benefit.
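
A minimal hand-written sketch of that movsd / movhpd gather (not compiler
output; the register and offset choices just mirror the first pair in the
loop above):

        movsd   (%rcx,%rax), %xmm1        # 8-byte load of element 0, high half zeroed
        movhpd  16(%rcx,%rax), %xmm1      # insert element 2 into the high half
        movsd   (%rsi,%rax), %xmm3
        movhpd  16(%rsi,%rax), %xmm3
        mulpd   %xmm3, %xmm1

Each load is only 8 bytes, so (with the usual 8-byte alignment for double)
none of them can split across a cache-line boundary the way the 16-byte
movupd can.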

Similarly, movupd / movlpd should have just loaded in the other order: a
narrow movsd load of the low element first, then insert the high element.
(Or since they're contiguous, movupd  8(%rsi,%rax), %xmm3 / shufpd.)
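
One reading of that parenthetical, sketched by hand for one of the two input
arrays (not compiler output; register numbers are arbitrary): load the whole
4-double chunk with two contiguous movupd and deinterleave it into the
even/odd pairs the two accumulators need.

        movupd  (%rsi,%rax), %xmm3        # {x0, x1}
        movupd  16(%rsi,%rax), %xmm4      # {x2, x3}
        movapd  %xmm3, %xmm5
        shufpd  $0, %xmm4, %xmm5          # {x0, x2}: even elements, for d1
        shufpd  $3, %xmm4, %xmm3          # {x1, x3}: odd elements, for d2

That's two loads instead of four per input array, at the cost of one register
copy and two shuffles, and every byte loaded is actually used.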

So beyond the bad overall strategy (which is likely worse than unrolled
scalar), it might be worth checking for this kind of smaller-scale insanity
somewhere later in the pipeline, to make the code less bad if some other
inputs can trigger similar behaviour.

(Detecting this movupd / movhpd pattern and using movsd / movhpd instead could
be a separate bug, but if it's just a symptom of something that should never
happen in the first place then it's not really its own bug at all.)
