https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116022
Bug ID: 116022
Summary: complete (early) unrolling foils vectorizer for vector initialization
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amylaar at gcc dot gnu.org
Target Milestone: ---

#define LENGTH 4

typedef unsigned uint32v_t __attribute ((vector_size (LENGTH * 4)));

uint32v_t
vdup_u32(uint32v_t a, unsigned b)
{
  uint32v_t r;
  int i;

  for (i = 0; i < LENGTH; i++)
    r[i] = b;
  return r;
}

For x86_64-pc-linux-gnu, with -O1 -ftree-vectorize, we get:

vdup_u32:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm1
        pshufd  $0, %xmm1, %xmm0
        ret

which is fine. However, with -O3, the complete unroller is run before the
vectorizer, and instead we get:

vdup_u32:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm0
        movd    %edi, %xmm1
        pshufd  $225, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $225, %xmm0, %xmm0
        pshufd  $198, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $198, %xmm0, %xmm0
        pshufd  $39, %xmm0, %xmm0
        movss   %xmm1, %xmm0
        pshufd  $39, %xmm0, %xmm0
        ret

making the code both larger and slower.

According to https://gcc.gnu.org/projects/tree-ssa/vectorization.htm , this was
supposed to be handled by SLP, but apparently that is not happening.

See dump files produced by
-fdump-tree-rebuild_frequencies -fdump-tree-cunrolli -fdump-tree-vect
for details.
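
For comparison, here is a minimal sketch that puts the loop form from the testcase next to a broadcast written as a vector constructor, which states the splat directly instead of relying on the unroller and vectorizer to recover it. The names vdup_u32_loop, vdup_u32_ctor and lanes_all_equal are hypothetical and not part of the original testcase; the unused vector parameter is dropped, and unsigned is assumed to be 4 bytes. Whether the constructor form avoids the -O3 code sequence above on a given GCC revision has not been checked here.

```c
/* Sketch only: needs GCC or Clang, since vector_size is a GNU extension.
   Assumes sizeof (unsigned) == 4, so the vector is 16 bytes wide.  */
#define LENGTH 4

typedef unsigned uint32v_t __attribute__ ((vector_size (LENGTH * 4)));

/* Loop form, as in the testcase above (unused vector parameter dropped).  */
uint32v_t
vdup_u32_loop (unsigned b)
{
  uint32v_t r;
  int i;

  for (i = 0; i < LENGTH; i++)
    r[i] = b;
  return r;
}

/* Hypothetical alternative: express the broadcast as a vector constructor,
   one element per lane, so no loop is involved.  */
uint32v_t
vdup_u32_ctor (unsigned b)
{
  return (uint32v_t) { b, b, b, b };
}

/* Helper for checking results: returns 1 if every lane of V equals B,
   0 otherwise.  */
int
lanes_all_equal (uint32v_t v, unsigned b)
{
  int i;

  for (i = 0; i < LENGTH; i++)
    if (v[i] != b)
      return 0;
  return 1;
}
```

Both functions should produce the same vector value, so compiling each at -O3 and comparing the assembly gives a quick way to see whether only the loop form is affected.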