[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #30 from GCC Commits ---
The master branch has been updated by Tamar Christina:

https://gcc.gnu.org/g:f438acf7ce2e6cb862cf62f2543c36639e2af233

commit r14-9997-gf438acf7ce2e6cb862cf62f2543c36639e2af233
Author: Tamar Christina
Date:   Tue Apr 16 20:56:26 2024 +0100

testsuite: Fix data check loop on vect-early-break_124-pr114403.c

The testcase had the wrong indices in the buffer check loop.

gcc/testsuite/ChangeLog:

    PR tree-optimization/114403
    * gcc.dg/vect/vect-early-break_124-pr114403.c: Fix check loop.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

Tamar Christina changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #29 from Tamar Christina ---
Fixed, thanks for the report!
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #28 from GCC Commits ---
The master branch has been updated by Tamar Christina:

https://gcc.gnu.org/g:85002f8085c25bb3e74ab013581a74e7c7ae006b

commit r14-9969-g85002f8085c25bb3e74ab013581a74e7c7ae006b
Author: Tamar Christina
Date:   Mon Apr 15 12:06:21 2024 +0100

middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].

This fixes a bug with the interaction between peeling for gaps and early break.

Before I go further, I'll first explain how I understand this to work for loops with a single exit.

When peeling for gaps we peel N < VF iterations to scalar. This happens by removing N iterations from the calculation of niters such that vect_iters * VF == niters is always false.

In other words, when we exit the vector loop we always fall to the scalar loop. The loop bounds adjustment guarantees this. Because of this we potentially execute one vector loop iteration fewer. That is, if you're at the boundary condition on niters % VF, then by peeling one or more scalar iterations the vector loop executes one iteration fewer.

This is accounted for by the adjustments in vect_transform_loop. This adjustment happens differently based on whether the vector loop can be partial or not.

Peeling for gaps sets the bias to 0 and then:

  when not partial: we take the floor of (scalar_upper_bound / VF) - 1 to get the vector latch iteration count.

  when loop is partial: for a single exit this means the loop is masked; we take the ceil to account for the fact that the loop can handle the final partial iteration using masking.

Note that there's no difference between ceil and floor on the boundary condition. There is a difference, however, when you're slightly above it, i.e. if scalar iterates 14 times and VF = 4 and we peel 1 iteration for gaps.

The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations, and in effect the partial iteration is ignored and it's done as scalar.
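The floor/ceil arithmetic described above can be checked with a small sketch. The helper name `vector_latch_bound` and its parameters are hypothetical and chosen only for illustration; this models the formula as stated in the message and is not the code in tree-vect-loop.cc.

```c
#include <assert.h>

/* Hypothetical helper, not a GCC function: models the vector latch
   iteration bound as described in the commit message.
   scalar_latch_bound: upper bound on scalar latch iterations
                       (a loop that iterates 14 times has a bound of 13).
   bias:               0 when peeling for gaps, per the description.
   vf:                 vectorization factor.
   partial:            nonzero if the final vector iteration may be partial.  */
static long
vector_latch_bound (long scalar_latch_bound, long bias, long vf, int partial)
{
  assert (vf > 0);
  long n = scalar_latch_bound + bias;
  assert (n >= vf);                       /* keep the sketch away from underflow */
  long q = partial ? (n + vf - 1) / vf    /* ceil  */
                   : n / vf;              /* floor */
  return q - 1;
}
```

With a scalar bound of 13, a bias of 0 and VF = 4, the floor form reproduces the value 2 quoted above, while the ceil form would give 3. On the boundary (a scalar bound of 12) both forms agree on 2.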
Ignoring the partial iteration is fine because the niters modification has capped the vector iteration at 2, so that when we reduce the induction values you end up entering the scalar code with ind_var.2 = ind_var.1 + 2 * VF.

Now let's look at early breaks. To make it easier I'll focus on the specific testcase:

    char buffer[64];

    __attribute__ ((noipa))
    buff_t *copy (buff_t *first, buff_t *last)
    {
      char *buffer_ptr = buffer;
      char *const buffer_end = &buffer[SZ-1];
      int store_size = sizeof(first->Val);
      while (first != last && (buffer_ptr + store_size) <= buffer_end)
        {
          const char *value_data = (const char *)(&first->Val);
          __builtin_memcpy(buffer_ptr, value_data, store_size);
          buffer_ptr += store_size;
          ++first;
        }

      if (first == last)
        return 0;

      return first;
    }

Here the first, early exit is on the condition:

    (buffer_ptr + store_size) <= buffer_end

and the main exit is on the condition:

    first != last

This is important, as this bug only manifests itself when the first exit has a known constant iteration count that's lower than the latch exit count.

Because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16 bytes per iteration. So the exit has a known bound of 8 + 1.

The vectorizer correctly analyzes this:

    Statement (exit)if (ivtmp_21 != 0)
     is executed at most 8 (bounded by 8) + 1 times in loop 1.

and as a consequence the IV is bound by 9:

    # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
    ...
    vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615 };
    mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
    if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
      goto ; [88.89%]
    else
      goto ; [11.11%]

The important bits are these:

In this example the value of last - first = 416.

The calculated vector iteration count is:

    x = (((ptr2 - ptr1) - 16) / 16) + 1 = 26

The bounds generated, adjusting for gaps:

    x == (((x - 1) >> 2) << 2)

which means we'll always fall through to the scalar code, as intended.
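The two expressions quoted above can be evaluated directly. The following is a minimal sketch with hypothetical helper names; the divisor 16 and the shift by 2 are taken verbatim from the message and are not derived here.

```c
#include <assert.h>

/* Hypothetical helpers modelling the two expressions quoted in the
   commit message; none of these names exist in GCC.  */

/* x = (((ptr2 - ptr1) - 16) / 16) + 1, pointer difference given in bytes.  */
static long
vector_niters (long byte_distance)
{
  assert (byte_distance >= 16);
  return ((byte_distance - 16) / 16) + 1;
}

/* ((x - 1) >> 2) << 2: x - 1 rounded down to a multiple of 4.  */
static long
niters_after_gap_peel (long x)
{
  assert (x >= 1);
  return ((x - 1) >> 2) << 2;
}

/* Nonzero when the vector loop would cover every iteration, i.e. when
   the scalar loop could be skipped.  */
static int
skips_scalar_loop (long x)
{
  return x == niters_after_gap_peel (x);
}
```

For a distance of 416 bytes the first expression evaluates to 26 (the value also given in comment #21), and rounding x - 1 down yields 24. Since x - 1 rounded down to a multiple of 4 is always strictly less than x, the comparison is false for every positive x, which is the "always fall through to the scalar code" property.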
Here are two key things to note:

1. In this loop, the early exit will always be the one taken. When it's taken we enter the scalar loop with the correct induction value to apply the gap peeling.

2. If the main exit is taken, the induction values assume you've finished all vector iterations, i.e. they assume you have completed 24 iterations, as we treat the main exit the same for normal loop vect and early break when not PEELED. This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 * VF;
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #27 from Richard Biener --- I think that adjusting an existing upper bound by -1 because of gap peeling is wrong when that upper bound may not apply to the IV exit. Because gap peeling only affects the IV exit test and not the early exit test.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #26 from Tamar Christina ---
(In reply to Richard Biener from comment #25)
> That means, when the loop takes the early exit we _must_ take that during
> the vector iterations. Peeling for gaps means if we would take the early
> exit during one of the gap peeled iterations this is a conflicting
> requirement.
> Now - the current analysis guarantees that the early exit conditions can
> be safely evaluated even for the gap iterations, but not the following
> code when the early exit is _not_ taken.
>
> So peeling for gaps and early exit vect are not compatible?

I don't see why not, as my email explains for the early exits we always go to the scalar loop, which already adheres to the condition of peeling for gaps.

I just think that peeling for gaps should not force it to exit from the main exit.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #25 from Richard Biener ---
That means, when the loop takes the early exit we _must_ take that during the vector iterations. Peeling for gaps means if we would take the early exit during one of the gap peeled iterations this is a conflicting requirement.

Now - the current analysis guarantees that the early exit conditions can be safely evaluated even for the gap iterations, but not the following code when the early exit is _not_ taken.

So peeling for gaps and early exit vect are not compatible?
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #24 from Tamar Christina ---
(In reply to Richard Biener from comment #23)
> Maybe easier to understand testcase:
>
> with -O3 -msse4.1 -fno-vect-cost-model we return 20 instead of 8. Adding
> -fdisable-tree-cunroll avoids the issue. The upper bound we set on the
> vector loop causes us to force taking the IV exit which continues
> with i == (niter - 1) / VF * VF, but 'niter' is 20 here.

Yes, indeed, that's what my patch was arguing last time, but I didn't explain it well enough. I'm about to send out v2 (waiting for regtest to finish), which hopefully articulates this better.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #23 from Richard Biener ---
Maybe easier to understand testcase:

    long x[9];
    long a[20];
    struct { long x; long b[40]; } b;

    int __attribute__((noipa))
    foo (int n)
    {
      int i = 0;
      int k = 0;
      do
        {
          if (x[k++])  // early exit, loop upper bound is 8 because of this
            break;
          a[i] = b.b[2*i]; // the misaligned 2*i access causes peeling for gaps
        }
      while (++i < n);
      return i;
    }

    int main()
    {
      x[8] = 1;
      if (foo (20) != 8)
        __builtin_abort ();
      return 0;
    }

with -O3 -msse4.1 -fno-vect-cost-model we return 20 instead of 8. Adding -fdisable-tree-cunroll avoids the issue. The upper bound we set on the vector loop causes us to force taking the IV exit which continues with i == (niter - 1) / VF * VF, but 'niter' is 20 here.
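To see why a forced IV exit is harmful here, the resume index from the expression in this comment can be evaluated on its own. This is only a sketch: the helper names are hypothetical, and the comment does not state the VF, so the values 2 and 4 used in the note below are assumptions.

```c
#include <assert.h>

/* Hypothetical helper: the scalar index at which execution continues
   after the IV exit, per the expression (niter - 1) / VF * VF.  */
static int
iv_exit_resume_index (int niter, int vf)
{
  assert (niter > 0 && vf > 0);
  return (niter - 1) / vf * vf;
}

/* Nonzero if resuming at that index lands past the iteration at which
   the early exit would have fired.  */
static int
resumes_past_early_exit (int niter, int vf, int early_exit_index)
{
  return iv_exit_resume_index (niter, vf) > early_exit_index;
}
```

With niter = 20 the resume index is 16 for an assumed VF of 4 and 18 for an assumed VF of 2. Both are past index 8, where the testcase expects the early exit to trigger.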
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #22 from Tamar Christina ---
Note that due to the secondary exit the actual full vector iteration count is 8 scalar elements at VF = 4, i.e. 2. And it's this boundary condition where we fail, since ceil (8/4) == 2. Any other value would have done the partial vector iteration. Basically final_iter_may_be_partial ends up being ignored.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #21 from Tamar Christina ---
Created attachment 57932
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57932=edit
loop.c

Attached is a reduced testcase that reproduces the issue and also checks the buffer position and copied values.

As discussed on IRC, when peeling for gaps we need to either adjust the upper bounds of the vector loop or force the vector loop to get to the scalar loop. However, we already go to the scalar loop, just with the wrong induction value, because we were never supposed to take the main exit.

Whether we go to the scalar loop depends on

    x = (((ptr2 - ptr1) - 16) / 16) + 1
    x == (((x - 1) >> 2) << 2)

In this case x == 26, so we do go to the scalar code already, but through the main exit.

Exiting through the main exit assumes you've done all vector iterations, in this case 6 iterations based on the main exit condition, which is first != last. In this case the induction values will be set from niters_vector_mult, so in this case first += 24.

But that's wrong, since the secondary exit has a known iteration count of 9, due to (buffer_ptr + store_size) <= buffer_end:

    Statement (exit)if (ivtmp_21 != 0)
     is executed at most 8 (bounded by 8) + 1 times in loop 1.

So we will always exit through it, as 9 < 24. That means that when we calculate the upper bounds of the vector loop, we must add a bias so that in this boundary condition we do an extra partial vector iteration.

I think the discussion on IRC went off track for a bit, and hopefully this testcase and the explanation above show that for all early break and all epilogue peeling reasons, we must bias up the upper bound to give the secondary exits a chance to trigger.
So I really do think the correct patch is:

    diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
    index 4375ebdcb49..0973b952c70 100644
    --- a/gcc/tree-vect-loop.cc
    +++ b/gcc/tree-vect-loop.cc
    @@ -12144,6 +12144,9 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
          -min_epilogue_iters to remove iterations that cannot be performed
            by the vector code.  */
       int bias_for_lowest = 1 - min_epilogue_iters;
    +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
    +    bias_for_lowest = 1;
    +
       int bias_for_assumed = bias_for_lowest;
       int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
       if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))

for the reasons described above. There's no way for us to take the main exit, which signifies that we've reached the end of all iterations we can possibly do as vector, and get the correct induction values in this case.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #20 from Tamar Christina ---
This is a bad interaction between early break and peeling for gaps.

When peeling for gaps we set bias_for_lowest to 0, which then negates the ceil for the upper bound calculation when the div is exact.

For a loop that does:

    Analyzing # of iterations of loop 1
      exit condition [8, + , 18446744073709551615] != 0
      bounds on difference of bases: -8 ... -8
      result:
        # of iterations 8, bounded by 8

and a VF = 4, we end up calculating:

    Loop 1 iterates at most 1 times.
    Loop 1 likely iterates at most 1 times.
    Analyzing # of iterations of loop 1
      exit condition [1, + , 1](no_overflow) < bnd.5505_39
      bounds on difference of bases: 0 ... 4611686018427387902
    Matching expression match.pd:2011, generic-match-8.cc:27
    Applying pattern match.pd:2067, generic-match-1.cc:4813
      result:
        # of iterations bnd.5505_39 + 18446744073709551615, bounded by 4611686018427387902
    Estimating sizes for loop 1
    ...
       Induction variable computation will be folded away.
      size:   2 if (ivtmp_312 < bnd.5505_39)
       Exit condition will be eliminated in last copy.
    size: 24-3, last_iteration: 24-5
      Loop size: 24
      Estimated size after unrolling: 26
    ;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.

The upper bound should be 2, not 1.

I have a working patch and am trying to create a standalone testcase for it.
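The "2, not 1" observation follows from the arithmetic. A minimal sketch, assuming the vector bound is computed as ceil ((scalar_bound + bias) / VF) - 1, the form described elsewhere in this thread; the helper name is hypothetical and this is not the GCC implementation.

```c
#include <assert.h>

/* Hypothetical helper: ceil ((scalar_bound + bias) / vf) - 1, the
   assumed form of the vector loop upper bound when the final vector
   iteration may be partial.  */
static long
vector_upper_bound_ceil (long scalar_bound, long bias, long vf)
{
  assert (vf > 0 && scalar_bound + bias > 0);
  return (scalar_bound + bias + vf - 1) / vf - 1;
}
```

With a scalar bound of 8 and VF = 4, a bias of 0 gives 1 because the division is exact and the ceil has no effect. A bias of 1 gives 2, which is the value this comment says the bound should have.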
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

Tamar Christina changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2024-04-02
           Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #19 from Tamar Christina ---
Thanks! Back from holidays and looking into it now. Mine.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #18 from Richard Biener ---
Just as a hint: we've had wrong upper bounds on vectorized loops/epilogues which would trigger wrong unrolling. But then unrolling also always hints at possibly having wrong range-info.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #17 from Sam James --- Created attachment 57780 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57780=edit EarlyCSE.cpp.cpp.182t.cunroll-bad
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #16 from Sam James --- -fdisable-tree-cunroll seems to help.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #15 from Richard Biener --- The valgrind output might be because we vectorize the loads a[i], a[i+8], ... as full vector loads at a[i], a[i+8] but the last we access as scalar. So the uninit load might be harmless.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #14 from Richard Biener ---
There are a few vectorizations in the dumps but only one early-exit where we vectorize

    [local count: 102053600]:
    first$I_39 = MEM[(struct value_op_iterator *)];
    last$I_40 = MEM[(struct value_op_iterator *)];
    seed_15 = llvm::hashing::detail::get_execution_seed ();
    if (first$I_39 != last$I_40)
      goto ; [94.50%]

    [local count: 96440652]:

    [local count: 179229733]:
    # buffer_ptr_22 = PHI <_20(24), (22)>
    # first$I_24 = PHI <_29(24), first$I_39(22)>
    # ivtmp_226 = PHI
    _20 = buffer_ptr_22 + 8;
    ivtmp_216 = ivtmp_226 - 1;
    if (ivtmp_216 == 0)
      goto ; [51.12%]
    else
      goto ; [48.88%]

    [local count: 87607493]:
    _30 = MEM[(const struct Use *)first$I_24].Val;
    _35 = (unsigned long) _30;
    MEM [(char * {ref-all})buffer_ptr_22] = _35;
    _29 = first$I_24 + 32;
    if (_29 != last$I_40)
      goto ; [94.50%]
    else
      goto ; [5.50%]

    [local count: 82789081]:
    goto ; [100.00%]

    [local count: 96440652]:
    # buffer_ptr_248 = PHI <_20(4), buffer_ptr_22(3)>
    # first$I_175 = PHI
    if (last$I_40 == first$I_175)
    ...

As far as I can see that's a non-peeled case, and from what I see it looks OK how we process that.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #13 from Sam James --- Created attachment 5 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=5=edit valgrind output when broken
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #12 from Sam James --- Created attachment 57776 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57776=edit EarlyCSE.cpp.cpp.179t.vect-bad
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #11 from Sam James --- Created attachment 57775 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57775=edit EarlyCSE.cpp.cpp.178t.ifcvt-bad
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #10 from Sam James ---
Created attachment 57774
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57774=edit
EarlyCSE.cpp.cpp.177t.ch_vect-bad

optimize("O2") on `template hash_code hash_combine_range_impl(InputIteratorT first, InputIteratorT last)` works, but O3 is broken.

Unfortunately, novector pragmas don't work on the while()s in there. I get an "ignored" warning.

Attached those dumps w/ -fdbg-cnt=vect_loop:7 (so just the one bad loop). I can tarball up the 6 vs 7 if useful. Thanks.

Will try disabling those passes next.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
                 CC|                            |law at gcc dot gnu.org
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #9 from Richard Biener --- Nothing obviously suspicious here ... I wonder if you can attach 177t.ch_vect, 178t.ifcvt and 179t.vect for the case with the single vectorized bad loop? Maybe we're running into a latent issue downstream? What happens if you disable most followup passes?
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #8 from Sam James ---
Created attachment 57770
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57770=edit
EarlyCSE.cpp.cpp.179t.vect.diff

(In reply to Sam James from comment #7)
> I'll go back to trying to see which specific loop it is.

tamar and richi both separately suggested debug counters.

lbound: 6
ubound: 7

Attached the diff for EarlyCSE.cpp.cpp.179t.vect. Further suggestions?
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #7 from Sam James --- I'll go back to trying to see which specific loop it is.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #6 from Sam James ---
Modifying llvm/include/llvm/ADT/iterator.h like so helps (!):
```
#pragma GCC push_options
#pragma GCC optimize ("O0")
  friend bool operator==(const iterator_adaptor_base &LHS,
                         const iterator_adaptor_base &RHS) {
    return LHS.I == RHS.I;
  }
#pragma GCC pop_options
```
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #5 from Sam James --- I'm narrowing it down in there, currently several headers deep. I'll finish that tomorrow.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 --- Comment #4 from Sam James --- Created attachment 57752 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57752=edit EarlyCSE.cpp.ii.xz The bad object seems to be EarlyCSE.cpp.o. Building it with -O0 makes things work.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

Sam James changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[14 regression] LLVM        |[14 regression] LLVM
                   |miscompiled with -O3        |miscompiled with -O3
                   |-march=znver2               |-march=znver2
                   |-fno-vect-cost-model        |-fno-vect-cost-model since
                   |                            |r14-6822-g01f4251b8775c8
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Sam James ---
r14-6822-g01f4251b8775c8 so far; isolating it is a pain because sometimes llvm-tblgen will segfault during the build (it's built-and-then-run to generate machine descriptions).