[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

Uroš Bizjak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|14.0                        |12.4
             Status|ASSIGNED                    |RESOLVED

--- Comment #20 from Uroš Bizjak ---
Fixed for gcc-12.4+.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #19 from CVS Commits ---
The releases/gcc-12 branch has been updated by Uros Bizjak:

https://gcc.gnu.org/g:eeb8e9a36d7aa9bc4ac8b0d7abe1e84e9afc4250

commit r12-9774-geeb8e9a36d7aa9bc4ac8b0d7abe1e84e9afc4250
Author: Uros Bizjak
Date:   Fri Jul 14 11:46:22 2023 +0200

    cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

    cprop1 pass does not consider paradoxical subreg and for (insn 22) claims
    that it equals 8 elements of HImode by setting REG_EQUAL note:

    (insn 21 19 22 4 (set (reg:V4QI 98)
            (mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0 S4 A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
         (expr_list:REG_EQUAL (const_vector:V4QI [
                    (const_int -52 [0xffcc]) repeated x4
                ])
            (nil)))
    (insn 22 21 23 4 (set (reg:V8HI 100)
            (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))) "pr110206.c":12:42 7471 {sse4_1_zero_extendv8qiv8hi2}
         (expr_list:REG_EQUAL (const_vector:V8HI [
                    (const_int 204 [0xcc]) repeated x8
                ])
            (expr_list:REG_DEAD (reg:V4QI 98)
                (nil))))

    We rely on the "undefined" vals to have a specific value (from the earlier
    REG_EQUAL note) but actual code generation doesn't ensure this (it doesn't
    need to).  That said, the issue isn't the constant folding per-se but that
    we do not actually constant fold but register an equality that doesn't hold.

            PR target/110206

    gcc/ChangeLog:

            * fwprop.cc (contains_paradoxical_subreg_p): Move to ...
            * rtlanal.cc (contains_paradoxical_subreg_p): ... here.
            * rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
            * cprop.cc (try_replace_reg): Do not set REG_EQUAL note
            when the original source contains a paradoxical subreg.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr110206.c: New test.

    (cherry picked from commit 1815e313a8fb519a77c94a908eb6dafc4ce51ffe)
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #18 from CVS Commits ---
The releases/gcc-13 branch has been updated by Uros Bizjak:

https://gcc.gnu.org/g:bef95ba085b0ae9bf3eb79a8eed685236d773116

commit r13-7565-gbef95ba085b0ae9bf3eb79a8eed685236d773116
Author: Uros Bizjak
Date:   Fri Jul 14 11:46:22 2023 +0200

    cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

    cprop1 pass does not consider paradoxical subreg and for (insn 22) claims
    that it equals 8 elements of HImode by setting REG_EQUAL note:

    (insn 21 19 22 4 (set (reg:V4QI 98)
            (mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0 S4 A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
         (expr_list:REG_EQUAL (const_vector:V4QI [
                    (const_int -52 [0xffcc]) repeated x4
                ])
            (nil)))
    (insn 22 21 23 4 (set (reg:V8HI 100)
            (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))) "pr110206.c":12:42 7471 {sse4_1_zero_extendv8qiv8hi2}
         (expr_list:REG_EQUAL (const_vector:V8HI [
                    (const_int 204 [0xcc]) repeated x8
                ])
            (expr_list:REG_DEAD (reg:V4QI 98)
                (nil))))

    We rely on the "undefined" vals to have a specific value (from the earlier
    REG_EQUAL note) but actual code generation doesn't ensure this (it doesn't
    need to).  That said, the issue isn't the constant folding per-se but that
    we do not actually constant fold but register an equality that doesn't hold.

            PR target/110206

    gcc/ChangeLog:

            * fwprop.cc (contains_paradoxical_subreg_p): Move to ...
            * rtlanal.cc (contains_paradoxical_subreg_p): ... here.
            * rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
            * cprop.cc (try_replace_reg): Do not set REG_EQUAL note
            when the original source contains a paradoxical subreg.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr110206.c: New test.

    (cherry picked from commit 1815e313a8fb519a77c94a908eb6dafc4ce51ffe)
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #17 from CVS Commits ---
The master branch has been updated by Uros Bizjak:

https://gcc.gnu.org/g:1815e313a8fb519a77c94a908eb6dafc4ce51ffe

commit r14-2525-g1815e313a8fb519a77c94a908eb6dafc4ce51ffe
Author: Uros Bizjak
Date:   Fri Jul 14 11:46:22 2023 +0200

    cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

    cprop1 pass does not consider paradoxical subreg and for (insn 22) claims
    that it equals 8 elements of HImode by setting REG_EQUAL note:

    (insn 21 19 22 4 (set (reg:V4QI 98)
            (mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0 S4 A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
         (expr_list:REG_EQUAL (const_vector:V4QI [
                    (const_int -52 [0xffcc]) repeated x4
                ])
            (nil)))
    (insn 22 21 23 4 (set (reg:V8HI 100)
            (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
                    (parallel [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))) "pr110206.c":12:42 7471 {sse4_1_zero_extendv8qiv8hi2}
         (expr_list:REG_EQUAL (const_vector:V8HI [
                    (const_int 204 [0xcc]) repeated x8
                ])
            (expr_list:REG_DEAD (reg:V4QI 98)
                (nil))))

    We rely on the "undefined" vals to have a specific value (from the earlier
    REG_EQUAL note) but actual code generation doesn't ensure this (it doesn't
    need to).  That said, the issue isn't the constant folding per-se but that
    we do not actually constant fold but register an equality that doesn't hold.

            PR target/110206

    gcc/ChangeLog:

            * fwprop.cc (contains_paradoxical_subreg_p): Move to ...
            * rtlanal.cc (contains_paradoxical_subreg_p): ... here.
            * rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
            * cprop.cc (try_replace_reg): Do not set REG_EQUAL note
            when the original source contains a paradoxical subreg.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr110206.c: New test.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #16 from Uroš Bizjak ---
v2 patch at [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624491.html
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #15 from Uroš Bizjak ---
Created attachment 55537
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55537&action=edit
Proposed patch.

v2 patch in testing.

This version prevents emission of an invalid REG_EQUAL note in
cprop.cc/try_replace_reg when the original, non-simplified RTX contains a
paradoxical SUBREG.

The patch is in effect a one-liner:

@@ -795,7 +796,8 @@ try_replace_reg (rtx from, rtx to, rtx_insn *insn)
       /* If we've failed perform the replacement, have a single SET to
 	 a REG destination and don't yet have a note, add a REG_EQUAL note
 	 to not lose information.  */
-      if (!success && note == 0 && set != 0 && REG_P (SET_DEST (set)))
+      if (!success && note == 0 && set != 0 && REG_P (SET_DEST (set))
+	  && !contains_paradoxical_subreg_p (SET_SRC (set)))
 	note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (src));
     }

but we have to move contains_paradoxical_subreg_p to rtlanal.cc.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #14 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #10)
> (In reply to Uroš Bizjak from comment #9)
> > and simplify_replace_rtx simplifies the above to:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> >         (const_int 204 [0xcc]) repeated x8
> >     ])
>
> Patched compiler simplifies to:
>
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
>         (const_int 204 [0xcc]) repeated x4
>         (const_int 0 [0]) repeated x4
>     ])

The patched compiler puts the above in REG_EQUAL note. While the value is
"more correct", I don't think the compiler has the right to set REG_EQUAL note
when the top 4 bytes are actually undefined (as a result of an operation with
an undefined input, which is the case with paradoxical subreg).
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #13 from Uroš Bizjak ---
(In reply to Richard Biener from comment #12)
> I can see cprop1 adds the REG_EQUAL note:
>
> (insn 22 21 23 4 (set (reg:V8HI 100)
>         (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
>                 (parallel [
>                         (const_int 0 [0])
>                         (const_int 1 [0x1])
>                         (const_int 2 [0x2])
>                         (const_int 3 [0x3])
>                         (const_int 4 [0x4])
>                         (const_int 5 [0x5])
>                         (const_int 6 [0x6])
>                         (const_int 7 [0x7])
>                     ])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
> -     (expr_list:REG_DEAD (reg:V4QI 98)
> -        (nil)))
> +     (expr_list:REG_EQUAL (const_vector:V8HI [
> +                (const_int 204 [0xcc]) repeated x8
> +            ])
> +        (expr_list:REG_DEAD (reg:V4QI 98)
> +            (nil))))
>
> but I don't see yet what the actual wrong transform based on this REG_EQUAL
> note is?

We constant fold V4QImode const_vector to a V8HImode const_vector with 8
defined elements. We started with undefined top four bytes, but now we
magically define them.

>
> It looks like we CSE the above with
>
> -   46: r122:V8QI=[`*.LC3']
> -      REG_EQUAL const_vector
> -   48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
> -      REG_EQUAL const_vector
> -      REG_DEAD r122:V8QI
> -   49: r126:V8HI=r124:V8HI*r125:V8HI
> -      REG_DEAD r125:V8HI
> +   49: r126:V8HI=r124:V8HI*r100:V8HI
>
> but otherwise do nothing.  So the issue is that we rely on the "undefined"
> vals to have a specific value (from the earlier REG_EQUAL note) but actual
> code generation doesn't ensure this (it doesn't need to).  That said,
> the issue isn't the constant folding per-se but that we do not actually
> constant fold but register an equality that doesn't hold.

The above CSE is the consequence of REG_EQUAL note that compiler set on the
insn. Compiler claims that the value of (insn 22) equals an array of 8 consts
{ 204 , ... , 204 }, but in reality (c.f. Comment #3) the value in the register
%xmm4 before VPMULLW insn is { 0, 0, 0, 0, 204, 204, 204, 204 }.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #12 from Richard Biener ---
I can see cprop1 adds the REG_EQUAL note:

(insn 22 21 23 4 (set (reg:V8HI 100)
        (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
                (parallel [
                        (const_int 0 [0])
                        (const_int 1 [0x1])
                        (const_int 2 [0x2])
                        (const_int 3 [0x3])
                        (const_int 4 [0x4])
                        (const_int 5 [0x5])
                        (const_int 6 [0x6])
                        (const_int 7 [0x7])
                    ])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
-     (expr_list:REG_DEAD (reg:V4QI 98)
-        (nil)))
+     (expr_list:REG_EQUAL (const_vector:V8HI [
+                (const_int 204 [0xcc]) repeated x8
+            ])
+        (expr_list:REG_DEAD (reg:V4QI 98)
+            (nil))))

but I don't see yet what the actual wrong transform based on this REG_EQUAL
note is?

It looks like we CSE the above with

-   46: r122:V8QI=[`*.LC3']
-      REG_EQUAL const_vector
-   48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
-      REG_EQUAL const_vector
-      REG_DEAD r122:V8QI
-   49: r126:V8HI=r124:V8HI*r125:V8HI
-      REG_DEAD r125:V8HI
+   49: r126:V8HI=r124:V8HI*r100:V8HI

but otherwise do nothing.  So the issue is that we rely on the "undefined"
vals to have a specific value (from the earlier REG_EQUAL note) but actual
code generation doesn't ensure this (it doesn't need to).  That said,
the issue isn't the constant folding per-se but that we do not actually
constant fold but register an equality that doesn't hold.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

Uroš Bizjak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |
           Assignee|unassigned at gcc dot gnu.org |ubizjak at gmail dot com
             Status|NEW                         |ASSIGNED

--- Comment #11 from Uroš Bizjak ---
Patch at [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623933.html
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #10 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #9)
> and simplify_replace_rtx simplifies the above to:
>
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
>         (const_int 204 [0xcc]) repeated x8
>     ])

Patched compiler simplifies to:

(gdb) p debug_rtx (src)
(const_vector:V8HI [
        (const_int 204 [0xcc]) repeated x4
        (const_int 0 [0]) repeated x4
    ])
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #9 from Uroš Bizjak ---
Some more digging through the code:

In cprop.cc/try_replace_reg, we try to simplify the source of the set given
our substitution:

Breakpoint 1, try_replace_reg (from=0x7fffe9f0b7f8, to=0x7fffe9f099e0,
insn=0x7fffea01b6c0) at ../../git/gcc/gcc/cprop.cc:789
789           src = simplify_replace_rtx (SET_SRC (set), from, to);
(gdb) list
784       if (!success && set && reg_mentioned_p (from, SET_SRC (set)))
785         {
786           /* If above failed and this is a single set, try to simplify the source
787              of the set given our substitution.  We could perhaps try this for
788              multiple SETs, but it probably won't buy us anything.  */
789           src = simplify_replace_rtx (SET_SRC (set), from, to);
(gdb) p debug_rtx (set)
(set (reg:V8HI 100)
    (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
            (parallel [
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 4 [0x4])
                    (const_int 5 [0x5])
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                ]))))
(gdb) p debug_rtx (from)
(reg:V4QI 98)
(gdb) p debug_rtx (to)
(const_vector:V4QI [
        (const_int -52 [0xffcc]) repeated x4
    ])

and simplify_replace_rtx simplifies the above to:

(gdb) p debug_rtx (src)
(const_vector:V8HI [
        (const_int 204 [0xcc]) repeated x8
    ])

which is obviously wrong: we have a V4QImode input register holding a
V4QImode constant, so only four of the eight result elements are defined.

Tracing through simplify-rtx.cc brings us to a recursive
simplify_replace_fn_rtx, which gets us to:

Breakpoint 1, simplify_replace_fn_rtx (x=0x7fffe9f0b888,
old_rtx=0x7fffe9f0b7f8, fn=0x0, data=0x7fffe9f099e0) at
../../git/gcc/gcc/simplify-rtx.cc:474
474           op0 = simplify_gen_subreg (GET_MODE (x), op0,
(gdb) list
469       if (code == SUBREG)
470         {
471           op0 = simplify_replace_fn_rtx (SUBREG_REG (x), old_rtx, fn, data);
472           if (op0 == SUBREG_REG (x))
473             return x;
474           op0 = simplify_gen_subreg (GET_MODE (x), op0,
475                                      GET_MODE (SUBREG_REG (x)),
476                                      SUBREG_BYTE (x));
477           return op0 ? op0 : x;
478         }
(gdb) p debug_rtx (op0)
(const_vector:V4QI [
        (const_int -52 [0xffcc]) repeated x4
    ])
(gdb) p debug_rtx (x)
(subreg:V16QI (reg:V4QI 98) 0)

and simplify_gen_subreg with the above arguments returns:

(gdb) p debug_rtx (op0)
(const_vector:V16QI [
        (const_int -52 [0xffcc]) repeated x16
    ])

No way! It is not possible to get V16QImode vector from V4QImode vector, even
when all elements are duplicates.

Tracing even deeper to simplify_context::simplify_subreg, we found the
following:

Breakpoint 1, simplify_context::simplify_subreg (this=0x7fffd528,
outermode=E_V16QImode, op=0x7fffe9f099e0, innermode=E_V4QImode, byte=...) at
../../git/gcc/gcc/simplify-rtx.cc:7561
7561        return gen_vec_duplicate (outermode, elt);
(gdb) list
7556      rtx elt;
7557
7558      if (VECTOR_MODE_P (outermode)
7559          && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
7560          && vec_duplicate_p (op, &elt))
7561        return gen_vec_duplicate (outermode, elt);
7562
7563      if (outermode == GET_MODE_INNER (innermode)
7564          && vec_duplicate_p (op, &elt))
7565        return elt;
(gdb) p outermode
$1 = E_V16QImode
(gdb) p debug_rtx (elt)
(const_int -52 [0xffcc])
(gdb) fin
Run till exit from #0  simplify_context::simplify_subreg (this=0x7fffd528,
outermode=E_V16QImode, op=0x7fffe9f099e0, innermode=E_V4QImode, byte=...) at
../../git/gcc/gcc/simplify-rtx.cc:7561
0x00eb24d3 in simplify_subreg (byte=..., innermode=E_V4QImode, op=,
outermode=) at ../../git/gcc/gcc/rtl.h:3513
3513      return simplify_context ().simplify_subreg (outermode, op, innermode, byte);
Value returned is $4 = (rtx_def *) 0x7fffe9f09c10
(gdb) p debug_rtx ($4)
(const_vector:V16QI [
        (const_int -52 [0xffcc]) repeated x16
    ])

Nope. This transformation is valid only for non-paradoxical subregs.

Patch is then obvious:

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index d7315d82aa3..87ca25086dc 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7557,6 +7557,7 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op,

   if (VECTOR_MODE_P (outermode)
       && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
+      && !paradoxical_subreg_p (outermode, innermode)
       && vec_duplicate_p (op, &elt))
     return gen_vec_duplicate (outermode, elt);
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #8 from Uroš Bizjak ---
The testcase needs __attribute__((noinline)) to suppress unwanted constant
propagation with recent gcc.

void
__attribute__((noinline))
foo (U u, u16 c, V *r)
...
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #7 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #4)
> cprop1 pass does not consider paradoxical subreg and for (insn 22) claims
> that it equals 8 elements of QImode:

8 elements of HImode.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #6 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #3)
> However, VPMULLW needs all 8 QImode elements, but %xmm4 only has 4 loaded;

To be consistent, VPSRLVW and VPMULLW use HImode elements.
[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

Uroš Bizjak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |rtl-optimization

--- Comment #5 from Uroš Bizjak ---
Recategorized as generic RTL optimization problem.