[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #16 from CVS Commits --- The master branch has been updated by Richard Biener : https://gcc.gnu.org/g:5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5 commit r13-3273-g5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5 Author: Richard Biener Date: Thu Oct 13 14:24:05 2022 +0200 tree-optimization/107160 - avoid reusing multiple accumulators Epilogue vectorization is not set up to re-use a vectorized accumulator consisting of more than one vector. For non-SLP we always reduce to a single but for SLP that isn't happening. In such case we currenlty miscompile the epilog so avoid this. PR tree-optimization/107160 * tree-vect-loop.cc (vect_create_epilog_for_reduction): Do not register accumulator if we failed to reduce it to a single vector. * gcc.dg/vect/pr107160.c: New testcase.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 Richard Biener changed: What|Removed |Added Assignee|linkw at gcc dot gnu.org |rguenth at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #15 from Richard Biener --- I think that either vect_find_reusable_accumulator should punt on this or vect_create_epilog_for_reduction shouldn't register this accumulator. The caller of vect_find_reusable_accumulator checks for vec_num == 1 (in the epilog) but we register for vec_num == 2. For SLP reductions we "fail" to reduce to a single accumulator. int vec_num = vec_oprnds0.length (); gcc_assert (vec_num == 1 || slp_node); The following fixes this: diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 1996ecfee7a..b1442a93581 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -6232,7 +6232,8 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, } /* Record this operation if it could be reused by the epilogue loop. */ - if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION) + if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION + && vec_num == 1) loop_vinfo->reusable_accumulators.put (scalar_results[0], { orig_reduc_input, reduc_info });
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #14 from Richard Biener --- Aha, so the issue is that we have a vectorized epilogue here and the epilogue of _that_ ends up doing [local count: 94607391]: # sum0_48 = PHI # sum1_47 = PHI # sum2_46 = PHI # sum3_45 = PHI # vect_sum3_31.16_101 = PHI # vect_sum3_31.16_102 = PHI # vect_sum3_31.16_103 = PHI # vect_sum3_31.16_104 = PHI _105 = BIT_FIELD_REF ; _106 = BIT_FIELD_REF ; ... this is from the main vect [local count: 81467476]: # sum0_135 = PHI # sum1_136 = PHI # sum2_137 = PHI # sum3_138 = PHI # vect_sum3_50.25_159 = PHI this is from the epilogue [local count: 105119324]: # sum0_15 = PHI # sum1_77 = PHI # sum2_75 = PHI # sum3_13 = PHI # _160 = PHI _161 = BIT_FIELD_REF <_160, 32, 0>; _162 = BIT_FIELD_REF <_160, 32, 32>; _163 = BIT_FIELD_REF <_160, 32, 64>; _164 = BIT_FIELD_REF <_160, 32, 96>; _74 = _161 + _162; _76 = _74 + _163; _78 = _76 + _164; so we fail to accumulate the main loops accumulators and just use the first one. On x86 the vectorized epilogue uses a smaller vector size but the same number of accumulators. It seems it's simply unexpected to have the unrolled SLP reduction and a vectorized epilogue with the same vector mode (but not unrolled). I can reproduce the failure when patching the x86 cost model to force unrolling by 2 (maybe we want a --param to force that to aid debugging...).
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #13 from Martin Liška --- A bit reduced test-case that can be compiled with cross compiler: $ cat pr107160.c #define N 16 float fl[N]; __attribute__ ((noipa, optimize (0))) void init () { for (int i = 0; i < N; i++) fl[i] = 1 << i; } __attribute__ ((noipa)) float foo (int n) { float sum0, sum1, sum2, sum3; sum0 = sum1 = sum2 = sum3 = 0.0f; for (int i = 0; i < n; i += 4) { sum0 += __builtin_fabs (fl[i]); sum1 += __builtin_fabs (fl[i + 1]); sum2 += __builtin_fabs (fl[i + 2]); sum3 += __builtin_fabs (fl[i + 3]); } return sum0 + sum1 + sum2 + sum3; } __attribute__ ((optimize (0))) int main () { init (); float res = foo (N); __builtin_printf ("res:%f\n", res); return 0; } x86_64: ./a.out res:65535.00 ppc64le: ./a.out res:15.00 so the result is wrong, it sums only first 4 elements (and not 16).
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #12 from rguenther at suse dot de --- On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 > > --- Comment #11 from Kewen Lin --- > > > > Btw, I've fixed a SLP reduction issue two days ago in > > > > r13-3226-gee467644c53ee2 > > > > though that looks unrelated? > > > > > > Thanks for the information, I'll double check it. > > > > > To rebase to r13-3226 or even the latest trunk doesn't help. :/ > > > > I do - the epilog is even vectorized and it works fine at runtime. > > You meant the code on x86 is all fine? and the case doesn't show the "res" > difference? hmm, it's weird. Yes. I'll try to look at cross compiler IL.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #11 from Kewen Lin --- > > > Btw, I've fixed a SLP reduction issue two days ago in > > > r13-3226-gee467644c53ee2 > > > though that looks unrelated? > > > > Thanks for the information, I'll double check it. > > To rebase to r13-3226 or even the latest trunk doesn't help. > > I do - the epilog is even vectorized and it works fine at runtime. You meant the code on x86 is all fine? and the case doesn't show the "res" difference? hmm, it's weird.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #10 from rguenther at suse dot de --- On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 > > --- Comment #9 from Kewen Lin --- > > > > The above doesn't look wrong (but may miss the rest of the IL). On > > x86_64 this looks like > > > >[local count: 105119324]: > > # sum0_41 = PHI > > # sum1_39 = PHI > > # sum2_37 = PHI > > # sum3_35 = PHI > > # vect_sum3_31.11_59 = PHI > > _58 = BIT_FIELD_REF ; > > _57 = BIT_FIELD_REF ; > > _56 = BIT_FIELD_REF ; > > _55 = BIT_FIELD_REF ; > > _74 = _58 + _57; > > _76 = _56 + _74; > > _78 = _55 + _76; > > > >[local count: 118111600]: > > # prephitmp_79 = PHI <_78(4), 0.0(2)> > > return prephitmp_79; > > > > Yeah, it looks expected without unrolling. > > > when unrolling is applied, thus with a larger VF, you should ideally > > see the vectors accumulated. > > > > Btw, I've fixed a SLP reduction issue two days ago in > > r13-3226-gee467644c53ee2 > > though that looks unrelated? > > Thanks for the information, I'll double check it. > > > > > When I force a larger VF on x86 by adding a int store in the loop I see > > > >[local count: 94607391]: > > # sum0_48 = PHI > > # sum1_36 = PHI > > # sum2_35 = PHI > > # sum3_24 = PHI > > # vect_sum3_32.16_110 = PHI > > # vect_sum3_32.16_111 = PHI > > # vect_sum3_32.16_112 = PHI > > # vect_sum3_32.16_113 = PHI > > _114 = BIT_FIELD_REF ; > > _115 = BIT_FIELD_REF ; > > _116 = BIT_FIELD_REF ; > > _117 = BIT_FIELD_REF ; > > _118 = BIT_FIELD_REF ; > > _119 = BIT_FIELD_REF ; > > _120 = BIT_FIELD_REF ; > > _121 = BIT_FIELD_REF ; > > _122 = BIT_FIELD_REF ; > > _123 = BIT_FIELD_REF ; > > _124 = BIT_FIELD_REF ; > > _125 = BIT_FIELD_REF ; > > _126 = BIT_FIELD_REF ; > > _127 = BIT_FIELD_REF ; > > _128 = BIT_FIELD_REF ; > > _129 = BIT_FIELD_REF ; > > _130 = _114 + _118; > > _131 = _115 + _119; > > _132 = _116 + _120; > > _133 = _117 + _121; > > _134 = _130 + _122; > > _135 = _131 + _123; > > _136 = _132 + _124; > > _137 = _133 + _125; > > _138 = _134 + _126; > > > > see how the lanes from the different vectors are accumulated? (yeah, > > we should simply add the vectors!) > > Yes, it's the same as what I saw on ppc64le, but the closely following dce6 > removes the three vect_sum3_32 (in your dump, they are > vect_sum3_32.16_10{7,8,9}) as the subsequent joints don't actually use the > separated accumulated lane values (_138 -> sum0 ...) but only use > vect_sum3_32.16_110. I do - the epilog is even vectorized and it works fine at runtime.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #9 from Kewen Lin --- > > The above doesn't look wrong (but may miss the rest of the IL). On > x86_64 this looks like > >[local count: 105119324]: > # sum0_41 = PHI > # sum1_39 = PHI > # sum2_37 = PHI > # sum3_35 = PHI > # vect_sum3_31.11_59 = PHI > _58 = BIT_FIELD_REF ; > _57 = BIT_FIELD_REF ; > _56 = BIT_FIELD_REF ; > _55 = BIT_FIELD_REF ; > _74 = _58 + _57; > _76 = _56 + _74; > _78 = _55 + _76; > >[local count: 118111600]: > # prephitmp_79 = PHI <_78(4), 0.0(2)> > return prephitmp_79; > Yeah, it looks expected without unrolling. > when unrolling is applied, thus with a larger VF, you should ideally > see the vectors accumulated. > > Btw, I've fixed a SLP reduction issue two days ago in > r13-3226-gee467644c53ee2 > though that looks unrelated? Thanks for the information, I'll double check it. > > When I force a larger VF on x86 by adding a int store in the loop I see > >[local count: 94607391]: > # sum0_48 = PHI > # sum1_36 = PHI > # sum2_35 = PHI > # sum3_24 = PHI > # vect_sum3_32.16_110 = PHI > # vect_sum3_32.16_111 = PHI > # vect_sum3_32.16_112 = PHI > # vect_sum3_32.16_113 = PHI > _114 = BIT_FIELD_REF ; > _115 = BIT_FIELD_REF ; > _116 = BIT_FIELD_REF ; > _117 = BIT_FIELD_REF ; > _118 = BIT_FIELD_REF ; > _119 = BIT_FIELD_REF ; > _120 = BIT_FIELD_REF ; > _121 = BIT_FIELD_REF ; > _122 = BIT_FIELD_REF ; > _123 = BIT_FIELD_REF ; > _124 = BIT_FIELD_REF ; > _125 = BIT_FIELD_REF ; > _126 = BIT_FIELD_REF ; > _127 = BIT_FIELD_REF ; > _128 = BIT_FIELD_REF ; > _129 = BIT_FIELD_REF ; > _130 = _114 + _118; > _131 = _115 + _119; > _132 = _116 + _120; > _133 = _117 + _121; > _134 = _130 + _122; > _135 = _131 + _123; > _136 = _132 + _124; > _137 = _133 + _125; > _138 = _134 + _126; > > see how the lanes from the different vectors are accumulated? (yeah, > we should simply add the vectors!) Yes, it's the same as what I saw on ppc64le, but the closely following dce6 removes the three vect_sum3_32 (in your dump, they are vect_sum3_32.16_10{7,8,9}) as the subsequent joints don't actually use the separated accumulated lane values (_138 -> sum0 ...) but only use vect_sum3_32.16_110.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #8 from Richard Biener --- (In reply to Kewen Lin from comment #7) > One reduced test case is: > > > > #include > #include > > #define N 128 > float fl[N]; > > __attribute__ ((noipa, optimize (0))) void > init () > { > for (int i = 0; i < N; i++) > fl[i] = i; > } > > __attribute__ ((noipa)) float > foo (int n1) > { > float sum0, sum1, sum2, sum3; > sum0 = sum1 = sum2 = sum3 = 0.0f; > > int n = (n1 / 4) * 4; > for (int i = 0; i < n; i += 4) > { > sum0 += fabs (fl[i]); > sum1 += fabs (fl[i + 1]); > sum2 += fabs (fl[i + 2]); > sum3 += fabs (fl[i + 3]); > } > > return sum0 + sum1 + sum2 + sum3; > } > > __attribute__ ((optimize (0))) int > main () > { > init (); > float res = foo (80); > __builtin_printf ("res:%f\n", res); > return 0; > } > > > incorrect result "res:670.00" vs expected result "res:3160.00" > > It looks it exposes one bug in vectorization reduction support. The > reduction epilogue handling looks wrong, it generates gimple code like: > > # vect_sum3_31.16_101 = PHI > # vect_sum3_31.16_102 = PHI > # vect_sum3_31.16_103 = PHI > # vect_sum3_31.16_104 = PHI > _105 = BIT_FIELD_REF ; > _106 = BIT_FIELD_REF ; > _107 = BIT_FIELD_REF ; > _108 = BIT_FIELD_REF ; > _109 = BIT_FIELD_REF ; > _110 = BIT_FIELD_REF ; > _111 = BIT_FIELD_REF ; > _112 = BIT_FIELD_REF ; > ... > > it doesn't consider the reduced results vect_sum3_31.16_10{1,2,3,4} from the > loop can be reduced again in loop exit block as they are in the same slp > group. The above doesn't look wrong (but may miss the rest of the IL). On x86_64 this looks like [local count: 105119324]: # sum0_41 = PHI # sum1_39 = PHI # sum2_37 = PHI # sum3_35 = PHI # vect_sum3_31.11_59 = PHI _58 = BIT_FIELD_REF ; _57 = BIT_FIELD_REF ; _56 = BIT_FIELD_REF ; _55 = BIT_FIELD_REF ; _74 = _58 + _57; _76 = _56 + _74; _78 = _55 + _76; [local count: 118111600]: # prephitmp_79 = PHI <_78(4), 0.0(2)> return prephitmp_79; when unrolling is applied, thus with a larger VF, you should ideally see the vectors accumulated. Btw, I've fixed a SLP reduction issue two days ago in r13-3226-gee467644c53ee2 though that looks unrelated? When I force a larger VF on x86 by adding a int store in the loop I see [local count: 94607391]: # sum0_48 = PHI # sum1_36 = PHI # sum2_35 = PHI # sum3_24 = PHI # vect_sum3_32.16_110 = PHI # vect_sum3_32.16_111 = PHI # vect_sum3_32.16_112 = PHI # vect_sum3_32.16_113 = PHI _114 = BIT_FIELD_REF ; _115 = BIT_FIELD_REF ; _116 = BIT_FIELD_REF ; _117 = BIT_FIELD_REF ; _118 = BIT_FIELD_REF ; _119 = BIT_FIELD_REF ; _120 = BIT_FIELD_REF ; _121 = BIT_FIELD_REF ; _122 = BIT_FIELD_REF ; _123 = BIT_FIELD_REF ; _124 = BIT_FIELD_REF ; _125 = BIT_FIELD_REF ; _126 = BIT_FIELD_REF ; _127 = BIT_FIELD_REF ; _128 = BIT_FIELD_REF ; _129 = BIT_FIELD_REF ; _130 = _114 + _118; _131 = _115 + _119; _132 = _116 + _120; _133 = _117 + _121; _134 = _130 + _122; _135 = _131 + _123; _136 = _132 + _124; _137 = _133 + _125; _138 = _134 + _126; see how the lanes from the different vectors are accumulated? (yeah, we should simply add the vectors!)
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 Kewen Lin changed: What|Removed |Added CC||rguenth at gcc dot gnu.org, ||rsandifo at gcc dot gnu.org --- Comment #7 from Kewen Lin --- One reduced test case is: #include #include #define N 128 float fl[N]; __attribute__ ((noipa, optimize (0))) void init () { for (int i = 0; i < N; i++) fl[i] = i; } __attribute__ ((noipa)) float foo (int n1) { float sum0, sum1, sum2, sum3; sum0 = sum1 = sum2 = sum3 = 0.0f; int n = (n1 / 4) * 4; for (int i = 0; i < n; i += 4) { sum0 += fabs (fl[i]); sum1 += fabs (fl[i + 1]); sum2 += fabs (fl[i + 2]); sum3 += fabs (fl[i + 3]); } return sum0 + sum1 + sum2 + sum3; } __attribute__ ((optimize (0))) int main () { init (); float res = foo (80); __builtin_printf ("res:%f\n", res); return 0; } incorrect result "res:670.00" vs expected result "res:3160.00" It looks it exposes one bug in vectorization reduction support. The reduction epilogue handling looks wrong, it generates gimple code like: # vect_sum3_31.16_101 = PHI # vect_sum3_31.16_102 = PHI # vect_sum3_31.16_103 = PHI # vect_sum3_31.16_104 = PHI _105 = BIT_FIELD_REF ; _106 = BIT_FIELD_REF ; _107 = BIT_FIELD_REF ; _108 = BIT_FIELD_REF ; _109 = BIT_FIELD_REF ; _110 = BIT_FIELD_REF ; _111 = BIT_FIELD_REF ; _112 = BIT_FIELD_REF ; ... it doesn't consider the reduced results vect_sum3_31.16_10{1,2,3,4} from the loop can be reduced again in loop exit block as they are in the same slp group.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2022-10-12 Ever confirmed|0 |1 --- Comment #6 from Kewen Lin --- Confirmed. The option can be reduced to: -m64 -O3 -mcpu=power8 -ffast-math The culprit loop is the one in function l1_norm of file vector.cc (actually it's from the header file). The resulting gimple after loop vect pass looks weird to me, it seems not to expose one issue related to math but one actual vect issue. I'll check it further and post one reduced test case.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #5 from seurer at gcc dot gnu.org --- I added that option and 554.roms_r now runs OK.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 Kewen Lin changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org --- Comment #4 from Kewen Lin --- Just back from holiday, thanks for reporting! I'll take a look. (In reply to seurer from comment #3) > Note that 554.roms_r from spec2017 also fails after this commit. I suspected what you saw on SPEC2017 554.roms_r is related to PR103320. The culprit commit can make some loops unrolled (they are not previously), PR103320 showed that loops unrolling matters. So if you did add option "-fno-unsafe-math-optimizations" for the run with "-funroll-loops", then you need it since this commit too. Could you double check?
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #3 from seurer at gcc dot gnu.org --- Note that 554.roms_r from spec2017 also fails after this commit.
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 --- Comment #2 from seurer at gcc dot gnu.org --- One of the compilation commands: /home/seurer/gcc/git/install/gcc-test/bin/g++ -c -o auto_derivative_function.o -DSPEC_CPU -DNDEBUG -Iinclude -DBOOST_DISABLE_THREADS -Ddeal_II_dimension=3 -m64 -O3 -mcpu=power8 -ffast-math -funroll-loops -fpeel-loops -fvect-cost-model -mpopcntd -mrecip=rsqrt-DSPEC_CPU_LP64 -DSPEC_CPU_LINUX -include cstddef -std=gnu++98 auto_derivative_function.cc
[Bug target/107160] [13 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160 Richard Biener changed: What|Removed |Added Keywords||wrong-code Target Milestone|--- |13.0 --- Comment #1 from Richard Biener --- what other flags do you use?