[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #11 from rguenther at suse dot de ---
On Fri, 23 Apr 2021, andysem at mail dot ru wrote:

> --- Comment #10 from andysem at mail dot ru ---
> Thanks. Will this be backported to 10 and 11 branches?

I don't plan to, since as far as I know it isn't a regression. It doesn't
apply to GCC 10, so definitely not there. I'll consider it for GCC 11.
--- Comment #10 from andysem at mail dot ru ---
Thanks. Will this be backported to 10 and 11 branches?
Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------
             Status|ASSIGNED    |RESOLVED
         Resolution|---         |FIXED
      Known to work|            |12.0

--- Comment #9 from Richard Biener ---
Fixed for GCC 12.
--- Comment #8 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:700e542971251b11623cce877075567815f72965

commit r12-79-g700e542971251b11623cce877075567815f72965
Author: Richard Biener
Date:   Fri Apr 9 09:35:51 2021 +0200

    tree-optimization/99971 - improve BB vect dependence analysis

    We can use TBAA even when we have a DR, do so.  For the testcase
    that means fully vectorizing it instead of only vectorizing the
    first store group resulting in suboptimal code.

    2021-04-09  Richard Biener

            PR tree-optimization/99971
            * tree-vect-data-refs.c (vect_slp_analyze_node_dependences):
            Always use TBAA for loads.

            * g++.dg/vect/slp-pr99971.cc: New testcase.
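[Note: the sketch below illustrates the kind of disambiguation the fix
enables; it is not GCC-internal code and not the PR testcase, and struct S
and f are made-up names. The dependence the vectorizer gives up on in
comment #2 below (y1_3(D)->b vs. x_2(D)->a) has exactly this shape: two
accesses to distinct members, which the TBAA/alias oracle can prove
disjoint even when the two pointers address the same object.]

  struct S { int a; int b; };

  void f(S *x, S *y)
  {
      x->a = 1;       // store through access path S::a
      int t = y->b;   // load through access path S::b; even if x == y,
                      // the two members occupy disjoint bytes, so this
                      // load can never read the store above
      x->a = t;       // hence the load may be reordered/vectorized freely
  }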
Dávid Bolvanský changed:

           What    |Removed     |Added
----------------------------------------------------------------
                 CC|            |david.bolvansky at gmail dot com

--- Comment #7 from Dávid Bolvanský ---
Still bad for -O3 -march=skylake-avx512: https://godbolt.org/z/azb8aTG43
--- Comment #6 from andysem at mail dot ru ---
Hmm, it looks like the original code has changed enough that the problem
no longer reproduces, with or without __restrict__. I don't have the older
version of the code, so I can't tell exactly what changed. Data alignment
most probably did change, but the data layout of A (its equivalent in the
original code), as well as the operation on it, certainly didn't. Sorry
for the confusion.
--- Comment #5 from Richard Biener ---
(In reply to Richard Biener from comment #4)
> (In reply to andysem from comment #3)
> > I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> > original larger code base and it didn't help. The compiler (gcc 10.2)
> > would still generate the same half-vectorized code.
>
> Hmm, that's odd. I suppose the equivalent of test() was inlined in the
> larger code base?
>
> I'd be interested in preprocessed source of a translation unit that
> exhibits this issue (and a pointer to the point in the source that is
> relevant).
>
> Note for GCC 12 I have a patch to improve things w/o requiring the use
> of __restrict (and I'm curious whether that helps for the larger code
> base).

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567805.html is the
patch, which applies to current master.
--- Comment #4 from Richard Biener ---
(In reply to andysem from comment #3)
> I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> original larger code base and it didn't help. The compiler (gcc 10.2)
> would still generate the same half-vectorized code.

Hmm, that's odd. I suppose the equivalent of test() was inlined in the
larger code base?

I'd be interested in preprocessed source of a translation unit that
exhibits this issue (and a pointer to the point in the source that is
relevant).

Note for GCC 12 I have a patch to improve things w/o requiring the use of
__restrict (and I'm curious whether that helps for the larger code base).
--- Comment #3 from andysem at mail dot ru ---
I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
original larger code base and it didn't help. The compiler (gcc 10.2)
would still generate the same half-vectorized code.
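[Note: for concreteness, "adding __restrict__ to the equivalents of x, y1
and y2" means restrict-qualifying the reference parameters, as in the
hypothetical signature below; GCC accepts __restrict__ on references as
an extension. The parameter names mirror the reduced testcase quoted in
comment #2 below.]

  void test(A& __restrict__ x,
            A const& __restrict__ y1,
            A const& __restrict__ y2);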
Richard Biener changed:

           What    |Removed                      |Added
----------------------------------------------------------------
           Keywords|                             |missed-optimization
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                            |1
   Last reconfirmed|                             |2021-04-09
             Status|UNCONFIRMED                  |ASSIGNED

--- Comment #2 from Richard Biener ---
Confirmed. While we manage to analyze for the "perfect" solution, we fail
because dependence testing doesn't handle one piece, and this throws away
half of the vectorization. We do actually see that we'll retain the scalar
loads and computations, but still, doing two vector loads, a vector add
and a vector store seems cheaper than doing four scalar stores:

  0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
  0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
  0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
  0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
  0x1fddcb0 _15 1 times scalar_store costs 12 in body
  0x1fddcb0 _18 1 times scalar_store costs 12 in body
  0x1fddcb0 _21 1 times scalar_store costs 12 in body
  0x1fddcb0 _24 1 times scalar_store costs 12 in body
  t.C:28:1: note: Cost model analysis:
    Vector inside of basic block cost: 40
    Vector prologue cost: 0
    Vector epilogue cost: 0
    Scalar cost of basic block: 48
  t.C:28:1: note: Basic block will be vectorized using SLP

Now, fortunately, GCC 11 will improve on this [a bit] and we'll produce

  _Z4testR1ARKS_S2_:
  .LFB2:
          .cfi_startproc
          movdqu  (%rsi), %xmm0
          movdqu  (%rdi), %xmm1
          paddd   %xmm1, %xmm0
          movups  %xmm0, (%rdi)
          movd    %xmm0, %eax
          subl    (%rdx), %eax
          movl    %eax, (%rdi)
          pextrd  $1, %xmm0, %eax
          subl    4(%rdx), %eax
          movl    %eax, 4(%rdi)
          pextrd  $2, %xmm0, %eax
          subl    8(%rdx), %eax
          movl    %eax, 8(%rdi)
          pextrd  $3, %xmm0, %eax
          subl    12(%rdx), %eax
          movl    %eax, 12(%rdi)
          ret

which no longer re-does the scalar loads/adds but instead uses the vector
result. Still, the same dependence issue is present:

  t.C:16:11: missed: can't determine dependence between y1_3(D)->b and x_2(D)->a
  t.C:16:11: note: removing SLP instance operations starting from: x_2(D)->a = _6;

The scalar code before vectorization looks like

  <bb 2> [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;    <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;      <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;

Using

  void test(A& __restrict x, A const& y1, A const& y2)
  {
    x += y1;
    x -= y2;
  }

produces optimal assembly even with GCC 10:

  _Z4testR1ARKS_S2_:
  .LFB2:
          .cfi_startproc
          movdqu  (%rsi), %xmm0
          movdqu  (%rdx), %xmm1
          movdqu  (%rdi), %xmm2
          psubd   %xmm1, %xmm0
          paddd   %xmm2, %xmm0
          movups  %xmm0, (%rdi)
          ret

Note that I think we should be able to handle the dependences even
without the __restrict annotation.
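[Note: the thread never spells out struct A. The sketch below is
reconstructed from the GIMPLE and assembly in comment #2 and assumes four
int members with member-wise += and -=; the committed testcase is
g++.dg/vect/slp-pr99971.cc, which may differ in detail.]

  // Reconstruction of the reduced testcase shape; an assumption
  // consistent with the dumps above, not the verbatim testcase source.
  struct A
  {
      int a, b, c, d;

      A& operator+=(A const& y)
      {
          a += y.a; b += y.b; c += y.c; d += y.d;
          return *this;
      }

      A& operator-=(A const& y)
      {
          a -= y.a; b -= y.b; c -= y.c; d -= y.d;
          return *this;
      }
  };

  void test(A& x, A const& y1, A const& y2)
  {
      x += y1;  // this store group is vectorized even by GCC 10
      x -= y2;  // this half stayed scalar until the GCC 12 fix
  }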
--- Comment #1 from andysem at mail dot ru ---
For reference, an ideal version of this code should look something like
this:

  test(A&, A const&, A const&):
          movdqu  (%rsi), %xmm0
          movdqu  (%rdi), %xmm1
          movdqu  (%rdx), %xmm2
          paddd   %xmm1, %xmm0
          psubd   %xmm2, %xmm0
          movups  %xmm0, (%rdi)
          ret
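[Note: for readers who prefer source over assembly, below is a hand-written
SSE2 intrinsics rendering of the ideal sequence above. It assumes the
hypothetical struct A sketched after comment #2; test_ideal is my name,
not from the thread.]

  #include <emmintrin.h>  // SSE2: _mm_add_epi32 / _mm_sub_epi32

  // Computes x = x + y1 - y2 with one vector add, one vector subtract
  // and one store, matching the movdqu/paddd/psubd/movups sequence above.
  void test_ideal(A& x, A const& y1, A const& y2)
  {
      __m128i vx  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x));
      __m128i vy1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y1));
      __m128i vy2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y2));
      __m128i r   = _mm_sub_epi32(_mm_add_epi32(vx, vy1), vy2);
      _mm_storeu_si128(reinterpret_cast<__m128i*>(&x), r);
  }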