https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116022

            Bug ID: 116022
           Summary: complete (early) unrolling foils vectorizer for vector
                    initialization
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amylaar at gcc dot gnu.org
  Target Milestone: ---

#define LENGTH 4
typedef unsigned uint32v_t __attribute ((vector_size (LENGTH * 4)));

uint32v_t vdup_u32(uint32v_t a, unsigned b)
{
  uint32v_t r;
  int i;
  for (i = 0; i < LENGTH; i++)
    r[i] = b;
  return r;
}

For x86_64-pc-linux-gnu, with -O1 -ftree-vectorize, we get:

vdup_u32:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm1
        pshufd  $0, %xmm1, %xmm0
        ret

which is fine.

However, with -O3, the complete unroller is run before the vectorizer, and
instead we get:
vdup_u32:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm0
        movd    %edi, %xmm1
        pshufd  $225, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $225, %xmm0, %xmm0
        pshufd  $198, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $198, %xmm0, %xmm0
        pshufd  $39, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $39, %xmm0, %xmm0
        ret

making the code both larger and slower.

According to https://gcc.gnu.org/projects/tree-ssa/vectorization.htm , this was
supposed to be handled by SLP, but apparently that is not happening.
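
For illustration, here is a minimal sketch of the two source-level shapes
involved. It is not taken from the dump files. The function names
vdup_u32_unrolled and vdup_u32_ctor are made up for this example, and the
unused first parameter of the testcase is dropped. The first function is
roughly what the loop looks like once it has been completely unrolled, i.e.
four scalar stores into the vector that SLP would have to recognize as a
splat. The second uses a GNU C vector compound literal to express the splat
directly, and one would expect it to compile to the short movd/pshufd
sequence.

```c
typedef unsigned uint32v_t __attribute__ ((vector_size (16)));

/* Hand-unrolled form of the loop in vdup_u32: one scalar store per
   element.  This only approximates what the unroller leaves behind.  */
uint32v_t vdup_u32_unrolled (unsigned b)
{
  uint32v_t r;
  r[0] = b;
  r[1] = b;
  r[2] = b;
  r[3] = b;
  return r;
}

/* The same splat written as a single vector constructor.  */
uint32v_t vdup_u32_ctor (unsigned b)
{
  return (uint32v_t) { b, b, b, b };
}
```

Both functions return the same value, so comparing their assembly at -O3
should show whether the element-wise stores are being combined.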

See the dump files produced by -fdump-tree-rebuild_frequencies,
-fdump-tree-cunrolli and -fdump-tree-vect for details.
