[Bug tree-optimization/113774] wrong code with _BitInt() arithmetics at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

Andrew Pinski changed:

           What            |Removed     |Added
  Target Milestone         |---         |14.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

Jakub Jelinek changed:

           What            |Removed     |Added
         Resolution        |---         |FIXED
             Status        |NEW         |RESOLVED

--- Comment #9 from Jakub Jelinek ---
Fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #8 from GCC Commits ---
The master branch has been updated by Jakub Jelinek:

https://gcc.gnu.org/g:97e49bf00d1a7b7a2a02531a1c5362fad27348d9

commit r14-8894-g97e49bf00d1a7b7a2a02531a1c5362fad27348d9
Author: Jakub Jelinek
Date:   Fri Feb 9 11:06:00 2024 +0100

    lower-bitint: Attempt not to emit always true conditions in handle_cast [PR113774]

    The following patch is the optimization part of PR113774: in handle_cast
    we emit some conditionals which are always true, and presumably VRP would
    figure that out later and clean them up, except that instead thread1 is
    invoked and threads everything through the conditions, so we end up with
    really ugly code which is hard to clean up later, and then we run into the
    PR113831 VN bug and miscompile stuff.

    handle_cast computes low and high as limb indexes.  For idx < low no
    special treatment is needed, the operand's limb is used directly.  For
    idx >= high all the bits in the limb are an extension (so, for an unsigned
    widening cast all those bits are 0; for a signed widening cast all those
    bits are equal to the sign mask computed in earlier code; narrowing casts
    don't trigger this code).  The idx == low && idx < high case, if it
    exists, needs special treatment (some bits are copied, others extended,
    or all bits are copied but the sign mask needs to be computed).

    The code already attempted to optimize away some unneeded casts.  In the
    first hunk below, e.g. for a 257 -> 321 bit extension, where low is 4 and
    high is 5 and we use a loop handling the first 4 limbs (2 iterations)
    with m_upwards_2limb 4, no special handling is needed in the loop; the
    special handling is done on the first limb after the loop, and then the
    last limb after the loop gets the extension only.  Or, in the second
    hunk, we can emit a single comparison instead of 2, e.g. for the
    low == high case - that must be a zero extension from a multiple of limb
    bits, say 192 -> 328 - or for the case where we know the idx == low case
    happens in the other limb processed in the loop, not the current one.

    But the testcase shows further cases where we always know some of the
    comparisons can be folded to true/false.  In particular, there is a
    255 -> 257 bit zero extension, so low 3, high 4, m_upwards_2limb 4.
    The loop handles 2 limbs at a time, and for the first limb we were
    emitting
      idx < low ? operand[idx] : 0;
    but because idx goes from 0 with step 2 for 2 iterations, idx < 3 is
    always true, so we can just emit operand[idx].  This is handled in the
    first hunk.  In addition to fixing it (that is the " - m_first" part in
    there) I've rewritten it using low to make it more readable.

    Similarly, in the other limb we were emitting
      idx + 1 <= low ? (idx + 1 == low ? operand[idx] & 0x7ffff : operand[idx]) : 0
    but idx + 1 <= 3 is always true in the loop, so all we should emit is
      idx + 1 == low ? operand[idx] & 0x7ffff : operand[idx]
    Unfortunately for the latter, when single_comparison is true, we emit
    just one comparison, but the code which fills the branches will fill
    them with the operand[idx] and 0 cases (for zero extension; similarly
    for sign extension), not the operand[idx] (aka copy) and
    operand[idx] & 0x7ffff (aka most significant limb of the narrower
    precision) cases.  Instead of making the code less readable by using
    single_comparison for that and handling it differently in the later
    code, I've chosen to just emit a condition which will be always true
    and let cfg cleanup clean it up.

    2024-02-09  Jakub Jelinek

            PR tree-optimization/113774
            * gimple-lower-bitint.cc (bitint_large_huge::handle_cast): Don't
            emit any comparison if m_first and low + 1 is equal to
            m_upwards_2limb; simplify the condition for that.  If not
            single_comparison, not m_first, and we can prove that the
            idx <= low comparison will be always true, emit a low <= low
            comparison instead of the idx <= low comparison, such that cfg
            cleanup will optimize it at the end of the pass.
            * gcc.dg/torture/bitint-57.c: New test.
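The per-limb classification the commit message describes (plain copy below low, a partial limb at low, pure extension at or above high) can be sketched in plain C for the 255 -> 257 bit zero-extension case from the testcase. This is an illustrative model, not GCC code; the helper name and layout are invented here.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of a 255-bit -> 257-bit zero extension over 64-bit
   limbs, mirroring the low = 3, high = 4 case from the commit message:
   limbs with idx < low are copied, the idx == low limb keeps only the
   63 valid bits of the 255-bit source (255 = 3*64 + 63), and limbs with
   idx >= high are all zero.  Hypothetical helper, not GCC internals.  */
static void
zext_255_to_257 (const uint64_t src[4], uint64_t dst[5])
{
  const unsigned low = 3, high = 4;
  for (unsigned idx = 0; idx < 5; idx++)
    {
      if (idx < low)
        dst[idx] = src[idx];                          /* plain copy */
      else if (idx < high)
        dst[idx] = src[idx] & 0x7fffffffffffffffULL;  /* partial top limb */
      else
        dst[idx] = 0;                                 /* zero extension */
    }
}
```

In the lowered loop idx steps by 2 over 2 iterations, so for the first limb idx is only ever 0 or 2; the idx < 3 comparison the patch elides is exactly the one that can never be false here.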
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #7 from Richard Biener ---
(In reply to Jakub Jelinek from comment #6)
> Thanks.
> The #c5 reduced testcase started to be miscompiled with
> r9-398-g6b9fc1782effc67dd9f6def16207653d79647553
> Perhaps we should move that to a separate bug so that it can be marked
> [11/12/13/14 Regression] and leave this just for the bitint lowering
> enhancements not to emit clearly always true or always false conditions
> if possible.

PR113831
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #6 from Jakub Jelinek ---
Thanks.

The #c5 reduced testcase started to be miscompiled with
r9-398-g6b9fc1782effc67dd9f6def16207653d79647553

Perhaps we should move that to a separate bug so that it can be marked
[11/12/13/14 Regression] and leave this just for the bitint lowering
enhancements not to emit clearly always true or always false conditions
if possible.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #5 from Richard Biener ---
This must go wrong during alias disambiguation, somehow figuring we can
ignore the backedge?!  The ref we hoist is

  _68 = VIEW_CONVERT_EXPR(b)[_146];

where _146 is _49 + 1, but _49 is an IV:

  _134 = _105 & 1;
  MEM [(unsigned _BitInt(257) *) + 32B] = _134;

  [local count: 1073741824]:
  # _49 = PHI <0(4), _50(28)>

it's also odd that we seem to arrive at b + 32B.

  Value numbering stmt = _146 = PHI <_145(8), _140(31)>
  Setting value number of _146 to _140 (changed)
  Making available beyond BB10 _146 for value _140
  ...
  Value numbering stmt = .MEM_150 = PHI <.MEM_149(8), .MEM_139(31)>
  Setting value number of .MEM_150 to .MEM_150 (changed)
  Value numbering stmt = _68 = VIEW_CONVERT_EXPR(b)[_146];
  Setting value number of _68 to _134 (changed)

huh.  Hmm.  But we have

  # RANGE [irange] sizetype [4, 4][6, +INF] MASK 0xfffe VALUE 0x1
  _140 = _49 + 1;
  # RANGE [irange] sizetype [1, 2][4, 4][6, +INF] MASK 0xfffe VALUE 0x1
  # _146 = PHI <_145(8), _140(6)>

we should look at the range of _146.

Hmm, I _think_ I know what happens.  We have

  [local count: 1073741824]:
  # _49 = PHI <0(4), _50(28)>
  # _55 = PHI <0(4), _56(28)>
  _51 = VIEW_CONVERT_EXPR(b)[_49];
  if (_49 <= 2)
    goto ; [80.00%]
  else
    goto ; [20.00%]

  [local count: 214748360]:
  _135 = .USUBC (0, _51, _55);
  _136 = IMAGPART_EXPR <_135>;
  _137 = REALPART_EXPR <_135>;
  _138 = _51 | _137;
  bitint.6[_49] = _138;
  _140 = _49 + 1;
  _141 = VIEW_CONVERT_EXPR(b)[_140];

and this is the "same" valueized ref (what gets recorded in the hashtable),
but here we can see that _140 >= 4, which makes it known to be 4 based on
the array extent.  This matches it up with the store of _134:

  Value numbering stmt = _141 = VIEW_CONVERT_EXPR(b)[_140];
  Setting value number of _141 to _134 (changed)
  _134 is available for _134

we record the expression with the VUSE of the definition.  Later, when we
look up the same expression from the later block (where _140 isn't known
to be 4), we find the very same expression when looking with the VUSE of
the definition, and thus we take the expression already in the hashtable
which has been assigned the value _134 - and then boom.

Something like the following is miscompiled at -O2 by FRE:

int a[3];

int __attribute__((noipa))
foo (int i, int x)
{
  int tem = 0;
  a[2] = x;
  if (i < 1)
    ++i;
  else
    {
      ++i;
      tem = a[i];
    }
  tem += a[i];
  return tem;
}

int
main ()
{
  if (foo (0, 7) != 0)
    __builtin_abort ();
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #4 from Jakub Jelinek ---
Created attachment 57359
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57359&action=edit
gcc14-pr113774.patch

So far lightly tested optimization on the bitint lowering side which emits
more optimal code for that.  For the unsigned 255 -> 257 bit extension with
m_upwards_2limb, low is 3 and high is 4, so when processing the first limb
in the loop, the idx < 3 condition is always true (as idx is 0 or 2), and
when processing the second limb in the loop, similarly the idxp1 <= 3
condition is always true (as idxp1 is 1 or 3), while idxp1 == 3 still
needs to be compared.

This makes the PRE (or VN?) bug latent.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #3 from Jakub Jelinek ---
unsigned long long a[32], b[32], v[32], r[32];

void
foo (unsigned int n)
{
  unsigned long long c = 0;
  for (unsigned long i = 0; i < n; i += 2)
    {
      unsigned long j = i + 1;
      b[i] = __builtin_addcll (b[i], v[i], c, &c);
      b[j] = __builtin_addcll (b[j], v[j], c, &c);
    }
  b[4] = __builtin_addcll (b[4] & 1, v[4] & 1, c, &c) & 1;
  c = 0;
  for (unsigned long i = 0; i < n; i += 2)
    {
      unsigned long j = i + 1;
      unsigned long long k = i < 3 ? a[i] : 0;
      r[i] = b[i] | __builtin_subcll (k, b[i], c, &c);
      unsigned long long l = b[j];
      if (j <= 3)
        {
          if (j == 3)
            k = a[3] & 0x7fffULL;
          else
            k = a[j];
        }
      else
        k = 0;
      r[j] = l | __builtin_subcll (k, b[j], c, &c);
    }
  r[4] = (b[4] | __builtin_subcll (0, b[4] & 1, c, &c)) & 1;
}

might help understand better what bitintlower emits there, except that it
uses PARM_DECLs or automatic VAR_DECLs instead of the global ones (except
for v), n is 4 (I've used a function argument only to avoid VRP figuring
out earlier that certain paths are dead), and the array sizes are actually
just 4 (for a) or 5 elements.

That said, even _134 in the #c2 testcase at -O2 in the PHI argument is
fishy, but the point is that bb 6 is really dead, it just isn't known to
the compiler yet; it is reached if _49 <= 2 is false, but given that _49
is incremented in steps of 2 and the array size is 5, maybe PRE knows
that _49 would then have to be 4.  bb 6 either jumps directly to bb 10
(if _140 (aka _49 + 1) > 3) or hops through bb 8 to bb 10.
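The sketch above leans on __builtin_addcll / __builtin_subcll. For readers unfamiliar with them, here is a plain-C model of the add-with-carry semantics; the helper name is invented, and this is an illustration of the documented behavior rather than the builtin itself.

```c
#include <assert.h>

/* Plain-C model of __builtin_addcll semantics: returns a + b + carry_in
   modulo 2^64 and stores the carry-out (0 or 1) through *carry_out.
   Illustrative helper, not the actual builtin.  */
static unsigned long long
addcll_model (unsigned long long a, unsigned long long b,
              unsigned long long carry_in, unsigned long long *carry_out)
{
  unsigned long long s = a + b;
  unsigned long long c1 = s < a;        /* overflow in a + b */
  unsigned long long r = s + carry_in;
  unsigned long long c2 = r < s;        /* overflow when adding the carry */
  *carry_out = c1 | c2;                 /* at most one of the two can be set */
  return r;
}
```

Chaining the carry output of one call into the next, as the loops above do, yields multi-limb addition over arrays of 64-bit limbs.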
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

Jakub Jelinek changed:

           What            |Removed     |Added
     Ever confirmed        |0           |1
                 CC        |            |rguenth at gcc dot gnu.org
             Status        |UNCONFIRMED |NEW
   Last reconfirmed        |            |2024-02-07

--- Comment #2 from Jakub Jelinek ---
/* PR tree-optimization/113774 */
/* { dg-do run { target bitint } } */
/* { dg-options "-std=c23 -pedantic-errors" } */
/* { dg-skip-if "" { ! run_expensive_tests } { "*" } { "-O0" "-O2" } } */
/* { dg-skip-if "" { ! run_expensive_tests } { "-flto" } { "" } } */

#if __BITINT_MAXWIDTH__ >= 512
unsigned _BitInt(512) u;
unsigned _BitInt(512) v;

void
foo (unsigned _BitInt(255) a, unsigned _BitInt(257) b, unsigned _BitInt(512) *r)
{
  b += v;
  b |= a - b;
  unsigned _BitInt(512) c = b * 6;
  unsigned _BitInt(512) h = c >> u;
  *r = h;
}
#endif

int
main ()
{
#if __BITINT_MAXWIDTH__ >= 512
  unsigned _BitInt(512) x;
  foo (0x1wb, 0x10001wb, &x);
  if (x != 0x1fffawb)
    __builtin_abort ();
#endif
  return 0;
}

This looks like a PRE bug to me.  The bug at runtime is that b |= a - b
(i.e. the operand of the multiplication), which I think ought to be
0x01uwb, is different.

What bitint lowering emits to compute this are 2 separate loops plus
straight-line code after each to handle the most significant limb of the
257-bit unsigned _BitInts.  The first computes b += v; and, because v is
0, it doesn't change anything and uses the same underlying variable (the
PARM_DECL) for both the input and the result.  The second one performs
the b |= a - b, which is complicated because there needs to be a zero
extension from the 255-bit a to 257 bits.
This loop handles 2 limbs at a time, so it iterates twice:

  [local count: 1073741824]:
  # _49 = PHI <0(4), _50(11)>
  # _55 = PHI <0(4), _56(11)>
  _51 = VIEW_CONVERT_EXPR(b)[_49];
  if (_49 < 3)
    goto ; [80.00%]
  else
    goto ; [20.00%]

  [local count: 1073741824]:
  _52 = VIEW_CONVERT_EXPR(a)[_49];

  [local count: 1073741824]:
  # _53 = PHI <_52(6), 0(5)>
  _54 = VIEW_CONVERT_EXPR(b)[_49];
  _57 = .USUBC (_53, _54, _55);
  _58 = IMAGPART_EXPR <_57>;
  _59 = REALPART_EXPR <_57>;
  _60 = _59 | _51;
  bitint.6[_49] = _60;
  _61 = _49 + 1;
  _62 = VIEW_CONVERT_EXPR(b)[_61];
  if (_61 <= 3)
    goto ; [80.00%]
  else
    goto ; [20.00%]

  [local count: 1073741824]:
  if (_61 == 3)
    goto ; [20.00%]
  else
    goto ; [80.00%]

  [local count: 1073741824]:
  _63 = VIEW_CONVERT_EXPR(a)[_61];
  goto ; [100.00%]

  [local count: 214748360]:
  _64 = MEM [(unsigned _BitInt(255) *) + 24B];
  _65 = () _64;
  _66 = (unsigned long) _65;

  [local count: 1073741824]:
  # _67 = PHI <_63(9), 0(7), _66(10)>
  _68 = VIEW_CONVERT_EXPR(b)[_61];
  _69 = .USUBC (_67, _68, _58);
  _56 = IMAGPART_EXPR <_69>;
  _70 = REALPART_EXPR <_69>;
  _71 = _70 | _62;
  bitint.6[_61] = _71;
  _50 = _49 + 2;
  if (_50 != 4)
    goto ; [0.05%]
  else
    goto ; [99.95%]

  [local count: 1073741824]:
  _72 = MEM [(unsigned _BitInt(257) *) + 32B];
  _73 = () _72;
  _74 = MEM [(unsigned _BitInt(257) *) + 32B];
  _75 = () _74;
  _76 = (unsigned long) _75;
  _77 = .USUBC (0, _76, _56);
  _78 = IMAGPART_EXPR <_77>;
  _79 = REALPART_EXPR <_77>;
  _80 = () _79;
  _81 = _80 | _73;
  _82 = (unsigned long) _81;
  MEM[(unsigned long *) + 32B] = _82;
  a ={v} {CLOBBER(eos)};

bitint.6 is the result of b | (zext (a) - b), which is then used as an
argument to __mulbitint3.

Now, there is something I should improve here, even though later
optimizations like VRP IMHO ought to be able to figure it out: because
the loop iterates just twice, _49 will be in the [0, 2] range (actually
0 or 2), so the _49 < 3 condition is always true, and similarly _61 <= 3
is also always true (because _61 is in the [1, 3] range (actually 1 or
3)).
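The limb-wise b | (zext (a) - b) computation the lowered loop performs can be modeled in plain C over 64-bit limbs. This is a sketch under the assumption of a 255-bit a and 257-bit b/r in five limbs; the helper names are invented and this is not the GCC-internal code.

```c
#include <assert.h>
#include <stdint.h>

/* Borrow-propagating single-limb subtract: a - b - borrow_in, with the
   borrow-out stored through *borrow_out.  Models what .USUBC computes in
   the dump above.  Illustrative helper, not GCC internals.  */
static uint64_t
usubc_model (uint64_t a, uint64_t b, uint64_t borrow_in, uint64_t *borrow_out)
{
  uint64_t d = a - b;
  uint64_t b1 = a < b;            /* borrow from a - b */
  uint64_t r = d - borrow_in;
  uint64_t b2 = d < borrow_in;    /* borrow from subtracting borrow_in */
  *borrow_out = b1 | b2;
  return r;
}

/* Limb-wise model of r = b | (zext (a) - b) for a 255-bit a and 257-bit
   b/r stored in 64-bit limbs.  */
static void
or_sub_model (const uint64_t a[4], const uint64_t b[5], uint64_t r[5])
{
  uint64_t a_ext[5], borrow = 0;
  for (int i = 0; i < 3; i++)
    a_ext[i] = a[i];
  a_ext[3] = a[3] & 0x7fffffffffffffffULL;  /* 255 bits: top limb has 63 bits */
  a_ext[4] = 0;                             /* zero extension to 257 bits */
  for (int i = 0; i < 5; i++)
    r[i] = b[i] | usubc_model (a_ext[i], b[i], borrow, &borrow);
  r[4] &= 1;                                /* 257 bits: last limb has 1 bit */
}
```

The two branches guarded by _49 < 3 and _61 == 3 in the dump correspond to the copy vs. masked-top-limb cases of a_ext above; the unrolled tail after the loop handles limb 4.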
Guess I should, in the handle_cast handling, look at the exact
m_upwards_2limb value and figure out which conditions will be always true
or always false.

Anyway, with the above, which is IMHO correct but non-optimal, let's see
what happens to the processing of the second least significant limb, i.e.
the second half of the first iteration, which is what is IMHO miscompiled
during PRE.  First, the thread1 pass goes wild and threads everything it
thinks it is possible to thread.  The second least significant limb is
handled in

  # _67 = PHI <_63(8), 0(6)>
  # _144 = PHI <_143(8), _136(6)>
  # _146 = PHI <_145(8), _140(6)>
  # _148 = PHI <_147(8), _141(6)>
  _68 = VIEW_CONVERT_EXPR(b)[_146];
  _69 = .USUBC (_67, _68, _144);
  _56 = IMAGPART_EXPR <_69>;
  _70 = REALPART_EXPR <_69>;
  _71 = _70 | _148;
  bitint.6[_146] = _71;

and the important thing to look at is the second .USUBC argument there, so
_68, the load from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113774

--- Comment #1 from Zdenek Sojka ---
Created attachment 57341
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57341&action=edit
another testcase, failing at -O1

I didn't check if the PR113753 patch fixes this.

Output:
$ x86_64-pc-linux-gnu-gcc -O testcase.c
$ ./a.out
Aborted