https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115709

            Bug ID: 115709
           Summary: missed optimisation: vperms not reordered to eliminate
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

#include <complex.h>
void foo(double complex *a, double *b, int n){
  int i;

  for(i=0;i<n;i++)
    b[i]=creal(a[i])*creal(a[i])+cimag(a[i])*cimag(a[i]);
}

Compiling this with "gcc-14 -mavx2 -mfma -Ofast" produces a loop which ends with

        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm1, %ymm1
        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

However, if the two identical vperms were delayed until after the vmul and
vfmadd, then just one on ymm1 would be needed. I believe that

        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

is equivalent, given that the contents of ymm0 are not used again.
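
The equivalence relies on vmulpd and vfmadd132pd working lane by lane: applying the same permutation to both inputs should give the same lanes as applying it once to the result. A minimal scalar sketch of that argument follows. The helper names permute4, sum_squares4 and permute_commutes are hypothetical, and the code models the instructions in plain C rather than executing them.

```c
/* Scalar model of vpermpd on four doubles: result lane i takes
 * source lane (imm >> 2*i) & 3. For imm = 216 (binary 11 01 10 00)
 * the source lanes selected are 0, 2, 1, 3. */
static void permute4(double dst[4], const double src[4], unsigned imm)
{
    double tmp[4];   /* temporary so that dst may alias src */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}

/* Lane-wise x*x + y*y, standing in for the vmulpd/vfmadd132pd pair. */
static void sum_squares4(double dst[4], const double x[4], const double y[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = x[i] * x[i] + y[i] * y[i];
}

/* Returns 1 if permuting both inputs before the arithmetic gives the
 * same lanes as permuting the result once afterwards, else 0. */
static int permute_commutes(const double x[4], const double y[4], unsigned imm)
{
    double px[4], py[4], before[4], after[4];

    /* Order in the generated code: two permutes, then arithmetic. */
    permute4(px, x, imm);
    permute4(py, y, imm);
    sum_squares4(before, px, py);

    /* Proposed order: arithmetic first, then a single permute. */
    sum_squares4(after, x, y);
    permute4(after, after, imm);

    for (int i = 0; i < 4; i++)
        if (before[i] != after[i])
            return 0;
    return 1;
}
```

Calling permute_commutes(x, y, 216) on any two four-element arrays compares the two orderings. The sketch only covers the last five instructions quoted above and says nothing about how ymm0 and ymm1 are produced earlier in the loop.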

subroutine foo(a,b,n)
  complex(kind(1d0))::a(*)
  real(kind(1d0))::b(*)
  integer::i,n

  do i=1,n
     b(i)=real(a(i))*real(a(i))+aimag(a(i))*aimag(a(i))
  end do
end subroutine foo

This Fortran version has the same issue. Eliminating one vperm gives a
measurable speed increase.
