https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-09
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. While we manage to analyze for the "perfect" solution, we fail
because dependence testing doesn't handle one piece of it; this throws away
half of the vectorization. We do actually see that we'll retain the scalar
loads and computations, but doing two vector loads, a vector add and a
vector store still seems cheaper than doing four scalar stores:

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note: Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note: Basic block will be vectorized using SLP

Now, fortunately, GCC 11 will improve on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

which no longer re-does the scalar loads/adds but instead uses the vector
result. Still, the same dependence issue is present:

t.C:16:11: missed: can't determine dependence between y1_3(D)->b and
x_2(D)->a
t.C:16:11: note: removing SLP instance operations starting from: x_2(D)->a =
_6;

The scalar code before vectorization looks like

  <bb 2> [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;    <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;      <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;

Using

void test(A& __restrict x, A const& y1, A const& y2)
{
  x += y1;
  x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdx), %xmm1
        movdqu  (%rdi), %xmm2
        psubd   %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret

Note that I think we should be able to handle the dependences even without
the __restrict annotation.
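
For reference, a minimal sketch of the full testcase these dumps imply. Only
the four int members and the signature of test() are visible above; the
operator bodies are an assumed reconstruction, not copied from the PR:

struct A {
  int a, b, c, d;

  // Assumed element-wise definitions; the dumps only show the resulting
  // GIMPLE, not the original member functions.
  A& operator+=(A const& o)
  { a += o.a; b += o.b; c += o.c; d += o.d; return *this; }
  A& operator-=(A const& o)
  { a -= o.a; b -= o.b; c -= o.c; d -= o.d; return *this; }
};

// Without __restrict, GCC cannot prove x does not alias y1/y2, so the
// second SLP instance (the subtractions) is thrown away.
void test(A& x, A const& y1, A const& y2)
{
  x += y1;
  x -= y2;
}

Compiling this sketch at -O3 (with -fopt-info-vec-missed to surface the
dependence message) should reproduce the partially vectorized code shown
above.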