[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #16 from luoxhu at gcc dot gnu.org ---
The attached files are all built with -mcpu=power8, and the case also fails on
P8LE. I also verified that the code produces the expected output on P8BE.
('Aborted' there is caused by BE returning 0x41 instead of the 0x98 expected
for LE.)

P8LE:
luoxhu@gcc135 build $ ./q.bad
B0: 0, 0,0,0
Aborted

P8BE:
luoxhu@gcc203:~/workspace/build$ ./q.bad
B0: 41fcef98, 91648e8b,7dca18c6,61707865
Aborted

P8BE seems to generate better code with the patch:

luoxhu@gcc203:~/workspace/build$ diff q.good.S q.bad.S -U5
--- q.good.S 2022-07-26 09:19:32.487216946 +0300
+++ q.bad.S 2022-07-26 09:15:58.006770996 +0300
@@ -1,6 +1,7 @@
        .file "q.C"
+       .machine power8
        .section ".text"
        .section .rodata.str1.8,"aMS",@progbits,1
        .align 3
 .LC0:
        .string "B0: %x, %x,%x,%x\n"
@@ -24,19 +25,17 @@
        .cfi_def_cfa_offset 128
        .cfi_offset 65, 16
        .cfi_offset 30, -16
        .cfi_offset 31, -8
        mr %r30,%r3
-       vmrghw %v2,%v2,%v4
-       vmrghw %v5,%v3,%v5
-       vmrghw %v5,%v2,%v5
-       vspltw %v0,%v5,3
+       vspltw %v0,%v5,0
        mfvsrwz %r7,%vs32
-       vspltw %v0,%v5,2
+       vspltw %v0,%v4,0
        mfvsrwz %r6,%vs32
-       mfvsrwz %r5,%vs37
-       vspltw %v0,%v5,0
+       vspltw %v0,%v3,0
+       mfvsrwz %r5,%vs32
+       vspltw %v0,%v2,0
        mfvsrwz %r31,%vs32
        rldicl %r7,%r7,0,32
        rldicl %r6,%r6,0,32
        rldicl %r5,%r5,0,32
        rldicl %r4,%r31,0,32
@@ -169,6 +168,6 @@
        .set .LANCHOR1,. + 0
        .type res, @object
        .size res, 1
 res:
        .zero 1
-       .ident "GCC: (Debian 9.5.0-1) 9.5.0"
+       .ident "GCC: (GNU) 13.0.0 20220726 (experimental)"
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #15 from luoxhu at gcc dot gnu.org ---
In combine, the vec_select(vec_concat(...)) and the following vec_select are
combined into a single extract instruction, which seems reasonable for both LE
and BE?

R146: 0 1 2 3
R141: 4 5 6 7
R150: 2 6 3 7  // vec_select(vec_concat(r146:V4SI,r141:V4SI),[2 6 3 7])
R151: R150[3]  // vec_select(r150:V4SI,3)
=>
R151: R141[3]  // vec_select(r141:V4SI,3)

Trying 21 -> 24:
   21: r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
      REG_DEAD r146:V4SI
      REG_DEAD r141:V4SI
   24: {r151:SI=vec_select(r150:V4SI,parallel);clobber scratch;}
Failed to match this instruction:
(parallel [
        (set (reg:SI 151)
            (vec_select:SI (reg:V4SI 141)
                (parallel [ (const_int 3 [0x3]) ])))
        (clobber (scratch:SI))
        (set (reg:V4SI 150)
            (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146) (reg:V4SI 141))
                (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ])))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:SI 151)
            (vec_select:SI (reg:V4SI 141)
                (parallel [ (const_int 3 [0x3]) ])))
        (set (reg:V4SI 150)
            (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146) (reg:V4SI 141))
                (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ])))
    ])
Successfully matched this instruction:
(set (reg:V4SI 150)
    (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146) (reg:V4SI 141))
        (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ])))
Successfully matched this instruction:
(set (reg:SI 151)
    (vec_select:SI (reg:V4SI 141)
        (parallel [ (const_int 3 [0x3]) ])))
allowing combination of insns 21 and 24
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2 21: r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
      REG_DEAD r146:V4SI
deferring rescan insn with uid = 21.
modifying insn i3 24: {r151:SI=vec_select(r141:V4SI,parallel);clobber scratch;}
      REG_DEAD r141:V4SI
deferring rescan insn with uid = 24.

I guess the previous unspec implementation bypassed the LE + LE swap check, so
now, in split2, should we generate vextuwlx instead of vextuwrx on little
endian?
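
The element bookkeeping behind that R151 rewrite can be checked with a small model. This is only a sketch in plain C++, not GCC code: `vec_concat` and `vec_select4` are hypothetical helpers that mimic the RTL operators using RTL element numbering, and the register contents 0..7 are just the labels used above.

```cpp
#include <array>
#include <cstddef>

using V4 = std::array<int, 4>;
using V8 = std::array<int, 8>;

// Model of RTL vec_concat: the elements of a followed by the elements of b.
constexpr V8 vec_concat(const V4 &a, const V4 &b) {
    return {a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]};
}

// Model of RTL vec_select with a four-entry selector.
constexpr V4 vec_select4(const V8 &v, const std::array<std::size_t, 4> &sel) {
    return {v[sel[0]], v[sel[1]], v[sel[2]], v[sel[3]]};
}

constexpr V4 r146{0, 1, 2, 3};
constexpr V4 r141{4, 5, 6, 7};
// Insn 21: r150 = vec_select(vec_concat(r146, r141), [2 6 3 7]).
constexpr V4 r150 = vec_select4(vec_concat(r146, r141), {2, 6, 3, 7});

// Insn 24 reads r150[3]; combine rewrites it to read r141[3] instead.
static_assert(r150[3] == r141[3], "element 3 of r150 comes from r141[3]");
```

In RTL element numbering the rewrite therefore looks consistent; the model says nothing about how those element numbers map onto hardware lanes on LE versus BE, which is the open question here.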
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #14 from luoxhu at gcc dot gnu.org --- Created attachment 53354 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53354&action=edit split2
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #13 from luoxhu at gcc dot gnu.org --- Created attachment 53353 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53353&action=edit after combine
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #12 from luoxhu at gcc dot gnu.org --- Created attachment 53352 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53352&action=edit combine
[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293 --- Comment #5 from luoxhu at gcc dot gnu.org --- r12-6086
[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #4 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
> well. -fopt-info doesn't reveal anything interesting besides
>
> -fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 32987933)
> +fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 129072791)
>
> obviously the slowdown is in P7Viterbi. There's only minimal changes on the
> GIMPLE side, one notable:
>
> niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729 | _2041 =
> niters.203_438 & 3;
> _2408 = (int) niters_vector_mult_vf.205_2406; | if (_2041
> == 0)
> tmp.206_2407 = k_384 + _2408; | goto 66>; [25.00%]
> _2300 = niters.203_442 & 3; <
> if (_2300 == 0) <
> goto ; [25.00%]<
> elseelse
> goto ; [75.00%] goto 36>; [75.00%]
> >[local count: 41646173]:|
> [local count: 177683003]:
> # k_2403 = PHI |
> niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
> # DEBUG k => k_2403 | _2411 =
> (int) niters_vector_mult_vf.205_2409;
>
>
> tmp.206_2410 = k_382 + _2411;
>
>
>
>
> [local count: 162950122]:
>
> # k_2406 =
> PHI
>
> the sink pass now does the transform where it did not do so before.
>
> That's appearantly because of
>
>   /* If BEST_BB is at the same nesting level, then require it to have
>      significantly lower execution frequency to avoid gratuitous movement.
>      */
>   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>       /* If result of comparsion is unknown, prefer EARLY_BB.
>          Thus use !(...>=..) rather than (...<...) */
>       && !(best_bb->count * 100 >= early_bb->count * threshold))
>     return best_bb;
>
>   /* No better block found, so return EARLY_BB, which happens to be the
>      statement's original block.
>      */
>   return early_bb;
>
> where the SRC count is 96726596 before, 236910671 after and the
> destination count is 72544947 before, 177683003 at the destination after.
> The edge probabilities are 75% vs 25% and param_sink_frequency_threshold
> is exactly 75 as well. Since 236910671*0.75
> is rounded down it passes the test while the previous state has an exact
> match defeating it.
>
> It's a little bit of an arbitrary choice,
>
> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
> index 2e744d6ae50..9b368e13463 100644
> --- a/gcc/tree-ssa-sink.cc
> +++ b/gcc/tree-ssa-sink.cc
> @@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
>    if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>        /* If result of comparsion is unknown, prefer EARLY_BB.
>          Thus use !(...>=..) rather than (...<...) */
> -      && !(best_bb->count * 100 >= early_bb->count * threshold))
> +      && !(best_bb->count * 100 > early_bb->count * threshold))
>      return best_bb;
>
>    /* No better block found, so return EARLY_BB, which happens to be the
>
> fixes the missed sinking but not the regression :/
>
> The count differences start to appear in when LC PHI blocks are added
> only for virtuals and then pre-existing 'Invalid sum of incoming counts'
> eventually lead to mismatches. The 'Invalid sum of incoming counts'
> start with the loop splitting pass.
>
> fast_algorithms.c:145:10: optimized: loop split
>
> Xionghu Lou did profile count updates there, not sure if that made things
> worse in this case.
>
> At least with broken BB counts splitting/unsplitting an edge can propagate
> bogus counts elsewhere it seems.

:( Could you please try reverting cd5ae148c47c6dee05adb19acd6a523f7187be7f and
see whether the performance comes back?
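
For reference, the threshold arithmetic quoted above can be replayed with the counts from the comment. This is only a sketch: plain 64-bit integers stand in for GCC's profile_count (whose comparisons can also be "unknown"), and the two helper names are made up for illustration.

```cpp
#include <cstdint>

// Simplified model of the same-loop-depth test in select_best_block.
// Returns true when the statement would be sunk to best_bb.
constexpr bool sinks_with_ge(std::uint64_t best, std::uint64_t early,
                             std::uint64_t threshold) {
    return !(best * 100 >= early * threshold);
}

// Same test with the `>` comparison from the quoted patch.
constexpr bool sinks_with_gt(std::uint64_t best, std::uint64_t early,
                             std::uint64_t threshold) {
    return !(best * 100 > early * threshold);
}

// Before: 72544947 * 100 == 96726596 * 75, an exact match, so no sinking.
static_assert(!sinks_with_ge(72544947, 96726596, 75), "old counts: no sinking");
// After: 177683003 * 100 < 236910671 * 75 (by 25), so the test passes.
static_assert(sinks_with_ge(177683003, 236910671, 75), "new counts: sinking");
// With `>` instead of `>=`, the exact match also sinks.
static_assert(sinks_with_gt(72544947, 96726596, 75), "patched test: sinking");
```

So with these numbers the decision flips purely because of rounding in the destination count, which matches the explanation in the quoted comment.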
[Bug tree-optimization/105740] missed optimization switch transformation for conditions with duplicate conditions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740 --- Comment #10 from luoxhu at gcc dot gnu.org --- (In reply to Martin Liška from comment #9) > (In reply to luoxhu from comment #8) > > (In reply to rguent...@suse.de from comment #6) > > > On Tue, 21 Jun 2022, jakub at gcc dot gnu.org wrote: > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740 > > > > > > > > --- Comment #5 from Jakub Jelinek --- > > > > The problem with switch-conversion done multiple times is that when it > > > > is done > > > > early, it can do worse job than when it is done late, e.g. we can have > > > > better > > > > range information later which allows (unfortunately switch-conversion > > > > doesn't > > > > use that yet, there is a PR about it) to ignore some never reachable > > > > values > > > > etc. > > > > So ideally we either need to be able to undo switch-conversion and redo > > > > it if > > > > things have changed, or do it only late and for e.g. inlining costs > > > > perform it > > > > only in analysis mode and record somewhere what kind of lowering would > > > > be done > > > > and how much it would cost. > > > > With multiple if-to-switch, don't we risk that we turn some ifs into > > > > switch, > > > > then > > > > switch-conversion lowers it back to ifs and then another if-to-switch > > > > matches > > > > it again and again lowers it? > > > > > > Yeah, I think ideally switch conversion would be done as part of switch > > > lowering (plus maybe an extra if-to-switch). The issue might be what > > > I said - some passes don't like switches, but they probably need to be > > > taught. As of inline cost yes, doing likely-switch-converted analysis > > > would probably work. > > > > git diff > > diff --git a/gcc/passes.def b/gcc/passes.def > > index b257307e085..1376e7cb28d 100644 > > --- a/gcc/passes.def > > +++ b/gcc/passes.def > > @@ -243,8 +243,6 @@ along with GCC; see the file COPYING3. If not see > > Clean them up. 
Failure to do so well can lead to false > > positives from warnings for erroneous code. */ > >NEXT_PASS (pass_copy_prop); > >/* Identify paths that should never be executed in a conforming > > program and isolate those paths. */ > >NEXT_PASS (pass_isolate_erroneous_paths); > > @@ -329,6 +327,7 @@ along with GCC; see the file COPYING3. If not see > >POP_INSERT_PASSES () > >NEXT_PASS (pass_simduid_cleanup); > >NEXT_PASS (pass_lower_vector_ssa); > > + NEXT_PASS (pass_if_to_switch); > >NEXT_PASS (pass_lower_switch); > >NEXT_PASS (pass_cse_reciprocals); > >NEXT_PASS (pass_reassoc, false /* early_p */); > > > > Tried this to add the second if_to_switch before lower_switch, but switch > > lowering doesn't work same as switch_conversion: > > Note the lowering expand to a decision tree where node of such tree can be > jump-tables, > bit-tests or simple comparisons. > > > > > ;; Function test2 (test2, funcdef_no=0, decl_uid=1982, cgraph_uid=1, > > symbol_order=0) > > > > beginning to process the following SWITCH statement ((null):0) : --- > > switch (_2) [INV], case 1: [INV], case 2: [INV], > > case 3: [INV], case 4: > 3> [INV], case 5: [INV], case 6: [INV]> > > > > ;; GIMPLE switch case clusters: JT(values:6 comparisons:6 range:6 density: > > 100.00%):1-6 > > So jump-table is selected. Where do you see this GIMPLE representation? This is dumped by the second run of iftoswitch after fre5. > > ... > > > > > ASM still contains indirect jump table like -fno-switch-conversion: > > > > > Is this bug of lower_switch or expected? > > What bug do you mean? Sorry, it not a bug, got to know that switch lower and switch conversion are doing two different things, different with "pass_lower_switch also performs the transforms switch-conversion does" in c#4? > > > From the code, they have different > > purpose as switch_conversion turns switch to single if-else while > > No switch_conversion expands a switch statement to a series of assignment > based on CSWITCH[index] arrays. 
> > > lower_switch expand CLUSTERS as a decision tree. Yes, rerun pass_convert_switch after the second if_to_switch could generate the CSWITCH[index]. pr105740.c.195t.switchconv2: [local count: 1073741824]: if (x_4(D) > 3) goto ; [50.00%] else goto ; [50.00%] [local count: 536870913]: _1 = f_6(D)->arr[3]; _10 = (unsigned int) _1; _2 = _10 + 4294967295; if (_2 <= 5) goto ; [INV] else goto ; [INV] [local count: 1073741822]: : _8 = 0; goto ; [100.00%] [local count: 1073741822]: : _9 = CSWTCH.4[_2]; [local count: 2147483644]: # _3 = PHI <_8(4), 0(2), _9(5)> : : return _3;
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #8 from luoxhu at gcc dot gnu.org --- init-regs: (insn 13 8 17 2 (set (reg:V4SI 141) (vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 135 [ R2 ]) (reg/v:V4SI 133 [ R0 ])) (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ]))) "q.C":22:45 1785 {altivec_vmrglw_direct_v4si} (expr_list:REG_DEAD (reg/v:V4SI 135 [ R2 ]) (expr_list:REG_DEAD (reg/v:V4SI 133 [ R0 ]) (nil (insn 17 13 21 2 (set (reg:V4SI 146) (vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 136 [ R3 ]) (reg/v:V4SI 134 [ R1 ])) (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ]))) "q.C":23:45 1785 {altivec_vmrglw_direct_v4si} (expr_list:REG_DEAD (reg/v:V4SI 136 [ R3 ]) (expr_list:REG_DEAD (reg/v:V4SI 134 [ R1 ]) (nil (insn 21 17 24 2 (set (reg:V4SI 150) (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146) (reg:V4SI 141)) (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ]))) "q.C":26:6 1785 {altivec_vmrglw_direct_v4si} (expr_list:REG_DEAD (reg:V4SI 146) (expr_list:REG_DEAD (reg:V4SI 141) (nil (insn 24 21 25 2 (parallel [ (set (reg:SI 151) (vec_select:SI (reg:V4SI 150) (parallel [ (const_int 3 [0x3]) ]))) (clobber (scratch:V4SI)) ]) "q.C":28:10 1400 {*vsx_extract_si} (nil)) (insn 25 24 26 2 (set (reg:DI 152) (zero_extend:DI (reg:SI 151))) "q.C":28:10 16 {zero_extendsidi2} (expr_list:REG_DEAD (reg:SI 151) (nil))) (insn 26 25 27 2 (parallel [ (set (reg:SI 153) (vec_select:SI (reg:V4SI 150) (parallel [ (const_int 2 [0x2]) ]))) (clobber (scratch:V4SI)) ]) "q.C":28:10 1400 {*vsx_extract_si} (nil)) (insn 27 26 28 2 (set (reg:DI 154) (zero_extend:DI (reg:SI 153))) "q.C":28:10 16 {zero_extendsidi2} (expr_list:REG_DEAD (reg:SI 153) (nil))) (insn 28 27 29 2 (parallel [ (set (reg:SI 155) (vec_select:SI (reg:V4SI 150) (parallel [ (const_int 1 [0x1]) ]))) (clobber (scratch:V4SI)) ]) "q.C":28:10 1400 {*vsx_extract_si} (nil)) (insn 29 28 30 2 (set 
(reg:DI 156) (zero_extend:DI (reg:SI 155))) "q.C":28:10 16 {zero_extendsidi2} (expr_list:REG_DEAD (reg:SI 155) (nil))) (insn 30 29 31 2 (parallel [ (set (reg:SI 157) (vec_select:SI (reg:V4SI 150) (parallel [ (const_int 0 [0]) ]))) (clobber (scratch:V4SI)) ]) "q.C":28:10 1400 {*vsx_extract_si} (expr_list:REG_DEAD (reg:V4SI 150) (nil))) combine: Trying 13 -> 28: 13: r141:V4SI=vec_select(vec_concat(r164:V4SI,r162:V4SI),parallel) REG_DEAD r164:V4SI 28: {r155:SI=vec_select(r141:V4SI,parallel);clobber scratch;} REG_DEAD r141:V4SI Successfully matched this instruction: (parallel [ (set (reg:SI 155) (vec_select:SI (reg:V4SI 164) (parallel [ (const_int 3 [0x3]) ]))) (clobber (scratch:V4SI)) ]) allowing combination of insns 13 and 28 original costs 4 + 8 = 12 replacement cost 8 deferring deletion of insn with uid = 13. modifying insn i328: {r155:SI=vec_select(r164:V4SI,parallel);clobber scratch;} REG_DEAD r164:V4SI deferring rescan insn with uid = 28. (note 7 47 8 2 NOTE_INSN_DELETED) (note 8 7 13 2 NOTE_INSN_FUNCTION_BEG) (note 13 8 17 2 NOTE_INSN_DELETED) (note 17 13 21 2 NOTE_INSN_DELETED) (note 21 17 24 2 NOTE_INSN_DELETED) (insn 24 21 25 2 (parallel [ (set (reg:SI 151) (vec_select:SI (reg:V4SI 162) (parallel [ (const_int 3 [0x3]) ]))) (clobber (scratch:V4SI)) ]) "q.C":28:10 1400 {*vsx_extract_si} (expr_list:REG_DEAD (reg:V4SI 162) (nil))) (note 25 24 26 2 NOTE_INSN_DELETED) (insn 26 25 27 2 (parallel [ (set (reg:SI 153) (vec_select:SI (reg:V4SI 163) (parallel [
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #5 from luoxhu at gcc dot gnu.org ---
Seems combine wrongly merged two vec_select instructions:

Trying 188 -> 199:
  188: r343:V4SI=vec_select(vec_concat(r168:V4SI,r338:V4SI),parallel)
      REG_DEAD r338:V4SI
      REG_DEAD r168:V4SI
  199: {r353:SI=vec_select(r343:V4SI,parallel);clobber scratch;}
Failed to match this instruction:
(parallel [
        (set (reg:SI 353)
            (vec_select:SI (reg:V4SI 338)
                (parallel [ (const_int 3 [0x3]) ])))
        (clobber (scratch:V4SI))
        (set (reg:V4SI 343)
            (vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 168 [ R02$m_simd ]) (reg:V4SI 338))
                (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ])))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:SI 353)
            (vec_select:SI (reg:V4SI 338)
                (parallel [ (const_int 3 [0x3]) ])))
        (set (reg:V4SI 343)
            (vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 168 [ R02$m_simd ]) (reg:V4SI 338))
                (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ])))
    ])
Successfully matched this instruction:
(set (reg:V4SI 343)
    (vec_select:V4SI (vec_concat:V8SI (reg/v:V4SI 168 [ R02$m_simd ]) (reg:V4SI 338))
        (parallel [ (const_int 2 [0x2]) (const_int 6 [0x6]) (const_int 3 [0x3]) (const_int 7 [0x7]) ])))
Successfully matched this instruction:
(set (reg:SI 353)
    (vec_select:SI (reg:V4SI 338)
        (parallel [ (const_int 3 [0x3]) ])))
allowing combination of insns 188 and 199
original costs 4 + 8 = 12
replacement costs 4 + 8 = 12
modifying insn i2 188: r343:V4SI=vec_select(vec_concat(r168:V4SI,r338:V4SI),parallel)
      REG_DEAD r168:V4SI
deferring rescan insn with uid = 188.
modifying insn i3 199: {r353:SI=vec_select(r338:V4SI,parallel);clobber scratch;}
      REG_DEAD r338:V4SI
deferring rescan insn with uid = 199.
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Reduced to:

#include <stdio.h>
extern "C" void *memcpy(void *, const void *, unsigned long);
typedef __attribute__((altivec(vector__))) unsigned native_simd_type;

union {
  native_simd_type V;
  int R[4];
} store_le_vec;

struct S {
  S() = default;
  S(unsigned B0) {
    native_simd_type val{B0};
    m_simd = val;
  }
  void store_le(unsigned char out[]) {
    store_le_vec.V = m_simd;
    unsigned int x0 = store_le_vec.R[0];
    memcpy(out, &x0, 1);
  }
  static void transpose(S &B0, S B1, S B2, S B3) {
    native_simd_type T0 = __builtin_vec_mergeh(B0.m_simd, B2.m_simd);
    native_simd_type T1 = __builtin_vec_mergeh(B1.m_simd, B3.m_simd);
    native_simd_type T2 = __builtin_vec_mergel(B0.m_simd, B2.m_simd);
    native_simd_type T3 = __builtin_vec_mergel(B1.m_simd, B3.m_simd);
    B0 = __builtin_vec_mergeh(T0, T1);
    B3 = __builtin_vec_mergel(T2, T3);
    printf ("B0: %x, %x,%x,%x\n", B0.m_simd[0], B0.m_simd[1], B0.m_simd[2],
            B0.m_simd[3]);
  }
  S(native_simd_type x) : m_simd(x) {}
  native_simd_type m_simd;
};

void
foo (unsigned char output[], unsigned state[], native_simd_type R0,
     native_simd_type R1, native_simd_type R2, native_simd_type R3)
{
  S R00; R00.m_simd = R0;
  S R01; R01.m_simd = R1;
  S R02; R02.m_simd = R2;
  S R03; R03.m_simd = R3;
  S::transpose(R00, R01, R02, R03);
  R00.store_le(output);
}

unsigned char res[1];
unsigned main_state[]{1634760805, 60878, 2036477234, 6,
                      0, 825562964, 1471091955, 1346092787,
                      506976774, 4197066702, 518848283, 118491664,
                      0, 0, 0, 0};

int
main ()
{
  native_simd_type R0 = native_simd_type {0x41fcef98, 0,0,0};
  native_simd_type R1 = native_simd_type {0x91648e8b, 0,0,0};
  native_simd_type R2 = native_simd_type {0x7dca18c6, 0,0,0};
  native_simd_type R3 = native_simd_type {0x61707865, 0,0,0};
  foo (res, main_state, R0, R1, R2, R3);
  if (res[0] != 152)
    __builtin_abort();
}
[Bug tree-optimization/106126] [12 Regression] tree check fail in useless_type_conversion_p, at gimple-expr.cc:87 since r13-1184-g57424087e82db140
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106126

--- Comment #13 from luoxhu at gcc dot gnu.org ---
Otherwise we need to record first_bb when conditions_in_bbs->is_empty (), then
check it in is_beneficial and ordered_remove the info entry if that bb is not
the first "if condition" and has a statement with side effects in it. The fix
would be as below, but I am not sure whether it is worth doing this to handle
both PR105740 and PR106126?

git diff
diff --git a/gcc/gimple-if-to-switch.cc b/gcc/gimple-if-to-switch.cc
index f7b0b02628b..44bb0228856 100644
--- a/gcc/gimple-if-to-switch.cc
+++ b/gcc/gimple-if-to-switch.cc
@@ -63,7 +63,7 @@ struct condition_info
   condition_info (gcond *cond): m_cond (cond), m_bb (gimple_bb (cond)),
     m_forwarder_bb (NULL), m_ranges (),
     m_true_edge (NULL), m_false_edge (NULL),
-    m_true_edge_phi_mapping (), m_false_edge_phi_mapping ()
+    m_true_edge_phi_mapping (), m_false_edge_phi_mapping (), first_bb(false)
   {
     m_ranges.create (0);
   }
@@ -80,6 +80,7 @@ struct condition_info
   edge m_false_edge;
   mapping_vec m_true_edge_phi_mapping;
   mapping_vec m_false_edge_phi_mapping;
+  bool first_bb;
 };

 /* Recond PHI mapping for an original edge E and save these into vector VEC. */
@@ -194,6 +195,16 @@ if_chain::is_beneficial ()
   auto_vec clusters;
   clusters.create (m_entries.length ());

+  for (unsigned i = 0; i < m_entries.length (); i++)
+    {
+      condition_info *info = m_entries[i];
+      if (info->first_bb && i != 0 && !no_side_effect_bb (info->m_bb))
+        {
+          m_entries.ordered_remove (i);
+          break;
+        }
+    }
+
   for (unsigned i = 0; i < m_entries.length (); i++)
     {
       condition_info *info = m_entries[i];
@@ -397,6 +408,8 @@ find_conditions (basic_block bb,
   tree_code code = gimple_cond_code (cond);

   condition_info *info = new condition_info (cond);
+  if (conditions_in_bbs->is_empty ())
+    info->first_bb = true;
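
To make the intent of the new loop in is_beneficial easier to follow, here is a rough stand-alone model of it. It is not GCC code: `CondInfo` and `drop_misplaced_first_bb` are hypothetical names, `std::vector::erase` stands in for `ordered_remove`, and `has_side_effects` stands in for `!no_side_effect_bb (info->m_bb)`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for condition_info: only the fields the loop reads.
struct CondInfo {
    int bb_index;           // which basic block the condition lives in
    bool first_bb;          // recorded while conditions_in_bbs was still empty
    bool has_side_effects;  // models !no_side_effect_bb (info->m_bb)
};

// Drops the entry that was recorded as first_bb but did not end up at
// position 0 of the chain and contains side effects. Like the patch, it
// removes at most one entry and keeps the order of the rest.
inline bool drop_misplaced_first_bb(std::vector<CondInfo> &entries) {
    for (std::size_t i = 0; i < entries.size(); ++i) {
        const CondInfo &info = entries[i];
        if (info.first_bb && i != 0 && info.has_side_effects) {
            entries.erase(entries.begin() + static_cast<std::ptrdiff_t>(i));
            return true;
        }
    }
    return false;
}
```

A chain whose flagged entry sits at index 0 is left untouched, which is the case the existing code already handles.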
[Bug tree-optimization/106126] [12 Regression] tree check fail in useless_type_conversion_p, at gimple-expr.cc:87 since r13-1184-g57424087e82db140
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106126 --- Comment #12 from luoxhu at gcc dot gnu.org --- conditions_in_bbs->is_empty () doesn't mean that the range is at the start of the switch condition :(, so we can't assume it is safe to skip the no_side_effect_bb check?
[Bug tree-optimization/106126] [12 Regression] tree check fail in useless_type_conversion_p, at gimple-expr.cc:87 since r13-1184-g57424087e82db140
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106126 --- Comment #11 from luoxhu at gcc dot gnu.org --- Sorry for breaking, my bugzilla account is luo...@gcc.gnu.org. The patch seems reasonable to fold 65-90 ('A'-'Z') to switch statement, 4,6c4,6 < ;; Canonical GIMPLE case clusters: 33 60 62 126 < ;; BT can be built: BT(values:3 comparisons:6 range:30 density: 20.00%):33-62 126 < pr106126.c:3:28: optimized: Condition chain with 4 BBs transformed into a switch statement. --- > ;; Canonical GIMPLE case clusters: 33 60 62 65-90 126 > ;; BT can be built: BT(values:3 comparisons:6 range:30 density: 20.00%):33-62 > 65-90 126 > pr106126.c:3:28: optimized: Condition chain with 5 BBs transformed into a > switch statement. ... 96,97c108,109 <: < switch (_13) [INV], case 33: [INV], case 60: [INV], case 62: [INV], case 126: [INV]> --- >: > switch (_13) [INV], case 33: [INV], case 60: > [INV], case 62: [INV], case 65 ... 90: [INV], case 126: > [INV]> complete pr106126.bad.c.046t.iftoswitch: ;; Function pool_conda_matchspec (pool_conda_matchspec, funcdef_no=0, decl_uid=1979, cgraph_uid=1, symbol_order=1) ;; Canonical GIMPLE case clusters: 33 60 62 65-90 126 ;; BT can be built: BT(values:3 comparisons:6 range:30 density: 20.00%):33-62 65-90 126 pr106126.c:3:28: optimized: Condition chain with 5 BBs transformed into a switch statement. 
Removing basic block 9 ;; basic block 9, loop depth 2 ;; pred: if (_13 != 62) goto ; [INV] else goto ; [INV] ;; succ: 10 ;; 12 Removing basic block 10 ;; basic block 10, loop depth 2 ;; pred: if (_13 != 33) goto ; [INV] else goto ; [INV] ;; succ: 11 ;; 12 Removing basic block 11 ;; basic block 11, loop depth 2 ;; pred: if (_13 != 126) goto ; [INV] else goto ; [INV] ;; succ: 3 ;; 12 Removing basic block 3 ;; basic block 3, loop depth 2 ;; pred: _3 = (unsigned char) _13; _4 = _3 + 191; if (_4 <= 25) goto ; [INV] else goto ; [INV] ;; succ: 14 ;; 13 Expanded into a new gimple STMT: switch (_13) [INV], case 33: [INV], case 60: [INV], case 62: [INV], case 65 ... 90: [INV], case 126: [INV]> Removing basic block 13 ;; basic block 13, loop depth 2 ;; pred: : goto ; [100.00%] ;; succ: 6 Removing basic block 14 ;; basic block 14, loop depth 1 ;; pred: : ;; succ: 4 fix_loop_structure: fixing up loops for function void pool_conda_matchspec () { unsigned char _8; char _10; char * var_1.3_11; char _13; unsigned char _14; char * var_1.3_15; : goto ; [INV] : # _14 = PHI <_3(7)> # var_1.3_15 = PHI : _8 = _14 + 65; _10 = (char) _8; *var_1.3_15 = _10; : : : var_1.3_11 = var_1; if (var_1.3_11 != 0B) goto ; [INV] else goto ; [INV] : _13 = *var_1.3_11; if (_13 != 0) goto ; [INV] else goto ; [INV] : switch (_13) [INV], case 33: [INV], case 60: [INV], case 62: [INV], case 65 ... 90: [INV], case 126: [INV]> : : return; _8 = _14 + 65; _10 = (char) _8; *var_1.3_15 = _10; : : : var_1.3_11 = var_1; if (var_1.3_11 != 0B) goto ; [INV] else goto ; [INV] : _13 = *var_1.3_11; if (_13 != 0) goto ; [INV] else goto ; [INV] : switch (_13) [INV], case 33: [INV], case 60: [INV], case 62: [INV], case 65 ... 90: [INV], case 126: [INV]> : : return; } The problem is _3 is removed in basic block 3, but _14 is still using it.
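
The removed basic block 3 is where the 65 ... 90 case range comes from: `_4 = _3 + 191; if (_4 <= 25)` is the usual biased unsigned range check. A quick sketch that replays it in C++ (the two helper names are made up for illustration):

```cpp
// Mirrors the GIMPLE in basic block 3:
//   _3 = (unsigned char) _13;  _4 = _3 + 191;  if (_4 <= 25)
constexpr bool in_range_biased(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    // 191 == 256 - 65, so this subtracts 'A' modulo 256.
    const unsigned char biased = static_cast<unsigned char>(u + 191);
    return biased <= 25;
}

// The range the switch ends up with: case 65 ... 90.
constexpr bool in_range_plain(char c) {
    return c >= 65 && c <= 90;
}

static_assert(in_range_biased(65) && in_range_biased(90), "A and Z are inside");
static_assert(!in_range_biased(64) && !in_range_biased(91), "neighbours are outside");
```

Since `_3` is defined only in that block, folding the check into the switch and deleting the block leaves the PHI for `_14` without a definition, which is the failure described above.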
[Bug tree-optimization/105903] Missed optimization for __synth3way
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105903

--- Comment #2 from luoxhu at gcc dot gnu.org ---
diff --git a/gcc/match.pd b/gcc/match.pd
index 4a570894b2e..f6b5415a351 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5718,6 +5718,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     (bit_xor (convert (rshift @0 {shifter;})) @1)
     (bit_not (bit_xor (convert (rshift @0 {shifter;})) @1)))

+/* X >= Y ? X > Y : 0 into X > Y. */
+(simplify
+  (cond (ge @0 @1) (gt @0 @1) integer_zerop)
+   (if (INTEGRAL_TYPE_P (type)
+        && POINTER_TYPE_P (TREE_TYPE (@0))
+        && POINTER_TYPE_P (TREE_TYPE (@1)))
+    (gt @0 @1)))
+
+/* X < Y ? 0 : X > Y into X > Y. */
+(simplify
+  (cond (lt @0 @1) integer_zerop (gt @0 @1))
+   (if (INTEGRAL_TYPE_P (type)
+        && POINTER_TYPE_P (TREE_TYPE (@0))
+        && POINTER_TYPE_P (TREE_TYPE (@1)))
+    (gt @0 @1)))
+

These two patterns let phiopt4 fold the PHI for the two greater3way cases and
generate the expected results.
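
As a sanity check on the two identities themselves (not on the match.pd syntax), here is a small C++ sketch. The helper names are made up, and pointers into one array are used because the patterns are restricted to pointer operands.

```cpp
// X >= Y ? X > Y : 0
inline bool ge_then_gt(const int *x, const int *y) {
    return x >= y ? x > y : false;
}

// X < Y ? 0 : X > Y
inline bool lt_else_gt(const int *x, const int *y) {
    return x < y ? false : x > y;
}
```

For any two pointers into the same array both helpers return exactly `x > y`: when the guard rejects, `x > y` is false anyway, and when it accepts, the inner comparison is returned unchanged.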
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #2 from luoxhu at gcc dot gnu.org --- Could you also paste the ASM difference, please? (I don't have an environment at hand at the moment.)
[Bug tree-optimization/105740] missed optimization switch transformation for conditions with duplicate conditions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740 --- Comment #8 from luoxhu at gcc dot gnu.org --- (In reply to rguent...@suse.de from comment #6) > On Tue, 21 Jun 2022, jakub at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740 > > > > --- Comment #5 from Jakub Jelinek --- > > The problem with switch-conversion done multiple times is that when it is > > done > > early, it can do worse job than when it is done late, e.g. we can have > > better > > range information later which allows (unfortunately switch-conversion > > doesn't > > use that yet, there is a PR about it) to ignore some never reachable values > > etc. > > So ideally we either need to be able to undo switch-conversion and redo it > > if > > things have changed, or do it only late and for e.g. inlining costs perform > > it > > only in analysis mode and record somewhere what kind of lowering would be > > done > > and how much it would cost. > > With multiple if-to-switch, don't we risk that we turn some ifs into switch, > > then > > switch-conversion lowers it back to ifs and then another if-to-switch > > matches > > it again and again lowers it? > > Yeah, I think ideally switch conversion would be done as part of switch > lowering (plus maybe an extra if-to-switch). The issue might be what > I said - some passes don't like switches, but they probably need to be > taught. As of inline cost yes, doing likely-switch-converted analysis > would probably work. git diff diff --git a/gcc/passes.def b/gcc/passes.def index b257307e085..1376e7cb28d 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -243,8 +243,6 @@ along with GCC; see the file COPYING3. If not see Clean them up. Failure to do so well can lead to false positives from warnings for erroneous code. */ NEXT_PASS (pass_copy_prop); /* Identify paths that should never be executed in a conforming program and isolate those paths. 
*/ NEXT_PASS (pass_isolate_erroneous_paths); @@ -329,6 +327,7 @@ along with GCC; see the file COPYING3. If not see POP_INSERT_PASSES () NEXT_PASS (pass_simduid_cleanup); NEXT_PASS (pass_lower_vector_ssa); + NEXT_PASS (pass_if_to_switch); NEXT_PASS (pass_lower_switch); NEXT_PASS (pass_cse_reciprocals); NEXT_PASS (pass_reassoc, false /* early_p */); Tried this to add the second if_to_switch before lower_switch, but switch lowering doesn't work same as switch_conversion: ;; Function test2 (test2, funcdef_no=0, decl_uid=1982, cgraph_uid=1, symbol_order=0) beginning to process the following SWITCH statement ((null):0) : --- switch (_2) [INV], case 1: [INV], case 2: [INV], case 3: [INV], case 4: [INV], case 5: [INV], case 6: [INV]> ;; GIMPLE switch case clusters: JT(values:6 comparisons:6 range:6 density: 100.00%):1-6 Removing basic block 11 ;; basic block 11, loop depth 0 ;; pred: switch (_2) [INV], case 1: [INV], case 2: [INV], case 3: [INV], case 4: [INV], case 5: [INV], case 6: [INV]> ;; succ: 4 ;; 5 ;; 6 ;; 7 ;; 8 ;; 9 ;; 10 Updating SSA: Registering new PHI nodes in block #0 Registering new PHI nodes in block #2 Updating SSA information for statement _1 = f_10(D)->len; Registering new PHI nodes in block #3 Updating SSA information for statement _2 = f_10(D)->arr[3]; ... 
int test2 (struct fs * f) { int _1; int _2; int _8; [local count: 1073741824]: _1 = f_10(D)->len; if (_1 > 3) goto ; [50.00%] else goto ; [50.00%] [local count: 536870913]: _2 = f_10(D)->arr[3]; switch (_2) [0.00%], case 1: [16.67%], case 2: [16.67%], case 3: [16.67%], case 4: [16.67%], case 5: [16.67%], case 6: [16.67%]> [local count: 67108864]: : goto ; [100.00%] [local count: 62914560]: : goto ; [100.00%] [local count: 58982400]: : goto ; [100.00%] [local count: 55296000]: : goto ; [100.00%] [local count: 5184]: : goto ; [100.00%] [local count: 4860]: : [local count: 1073741824]: # _8 = PHI <12(4), 27(5), 38(6), 18(7), 58(8), 68(9), 0(3), 0(2)> : return _8; } ASM still contains indirect jump table like -fno-switch-conversion: test2: .LFB0: .cfi_startproc xorl%eax, %eax cmpl$3, (%rdi) jle .L1 cmpl$6, 16(%rdi) ja .L3 movl16(%rdi), %eax jmp *.L5(,%rax,8) .section.rodata .align 8 .align 4 .L5: .quad .L3 .quad .L11 .quad .L9 .quad .L8 .quad .L7 .quad .L6 .quad .L4 .text .p2align 4,,10 .p2align 3 .L11: movl$12, %eax .L1: ret
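
For contrast with the jump table above, this is roughly the shape switch conversion aims for when every case only selects a constant: a range check plus a load from a constant array. The sketch is plain C++, not compiler output. The constants are the PHI arguments from the dump; only `case 1 -> 12` is confirmed by the assembly, so the assignment of the other constants to case labels is an assumption, as are the names `via_switch`, `via_table` and `cswtch`.

```cpp
#include <cstdint>

// Source-level view: six consecutive cases, each selecting a constant.
inline int via_switch(int v) {
    switch (v) {
    case 1: return 12;
    case 2: return 27;
    case 3: return 38;
    case 4: return 18;
    case 5: return 58;
    case 6: return 68;
    default: return 0;
    }
}

// Table view: bias the index to start at 0, range-check it, then load.
inline int via_table(int v) {
    static const int cswtch[6] = {12, 27, 38, 18, 58, 68};
    // Adding 4294967295 in unsigned arithmetic is the same as subtracting 1.
    const std::uint32_t idx = static_cast<std::uint32_t>(v) + 4294967295u;
    return idx <= 5u ? cswtch[idx] : 0;
}
```

Both functions agree for every input; the second form has no indirect jump, which is the difference from the lowering result shown in the assembly.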
[Bug tree-optimization/105740] missed optimization switch transformation for conditions with duplicate conditions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105740

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Running if_to_switch and convert_switch again after copyprop2 could remove the
redundant statement and expose the opportunity for if-to-switch again. Is this
reasonable, or should if-to-switch/switch-conversion simply be moved later and
run only once?

diff --git a/gcc/gimple-if-to-switch.cc b/gcc/gimple-if-to-switch.cc
index f7b0b02628b..8f55d0e2f75 100644
--- a/gcc/gimple-if-to-switch.cc
+++ b/gcc/gimple-if-to-switch.cc
@@ -484,6 +484,8 @@ public:
            || bit_test_cluster::is_enabled ());
   }

+  opt_pass *clone () { return new pass_if_to_switch (m_ctxt); }
+
   virtual unsigned int execute (function *);

 }; // class pass_if_to_switch
diff --git a/gcc/passes.def b/gcc/passes.def
index 375d3d62d51..b257307e085 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -243,6 +243,8 @@ along with GCC; see the file COPYING3.  If not see
         Clean them up.  Failure to do so well can lead to false
         positives from warnings for erroneous code.  */
       NEXT_PASS (pass_copy_prop);
+      NEXT_PASS (pass_if_to_switch);
+      NEXT_PASS (pass_convert_switch);
       /* Identify paths that should never be executed in a conforming
         program and isolate those paths.  */
       NEXT_PASS (pass_isolate_erroneous_paths);
diff --git a/gcc/tree-switch-conversion.cc b/gcc/tree-switch-conversion.cc
index 50a17927f39..d5c8262785e 100644
--- a/gcc/tree-switch-conversion.cc
+++ b/gcc/tree-switch-conversion.cc
@@ -2429,6 +2429,9 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_switch_conversion != 0; }
+
+  opt_pass *clone () { return new pass_convert_switch (m_ctxt); }
+
   virtual unsigned int execute (function *);
[Bug ipa/100034] missed optimization for dead code elimination at -O3 (vs. -O1, -Os, -O2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100034

--- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> Looks related to PR1 - we do an IPA SRA clone but fail to inline it and
> thus we end up with
>
> void d.isra ()
> {
>   int D.1980;
>   int g.2_1;
>
>   [local count: 10631108]:
>
>   [local count: 96646437]:
>   g.2_1 = 0;
>   if (g.2_1 != 0)
>     goto ; [89.00%]
>   else
>     goto ; [11.00%]
>
>   [local count: 1073741824]:
>   foo ();
>   goto ; [100.00%]
>
> }
>
> int main ()
> {
>   int a.0_2;
>   int b.1_3;
>
>   [local count: 59461674]:
>   goto ; [100.00%]
>
>   [local count: 1014686025]:
>   a.0_2 = a;
>   if (a.0_2 == 0)
>     goto ; [99.96%]
>   else
>     goto ; [0.04%]
>
>   [local count: 1014280151]:
>   // predicted unlikely by continue predictor.
>   goto ; [100.00%]
>
>   [local count: 405874]:
>   d.isra ();
>
>   [local count: 1073741824]:
>   b.1_3 = b;
>   if (b.1_3 != 0)
>     goto ; [94.50%]
>   else
>     goto ; [5.50%]
>
>   [local count: 59055800]:
>   return 0;
>
> }
>
> where we optimize main to 'return 0' but fail to elide the unused d.isra.
>
> So also a dup of the cases where a late IPA function reclaim is missing.

At -O3 the early inliner inlines e into main because
param_early_inlining_insns is 14 at -O3 but 6 at -O2, so
want_early_inline_function_p returns a different result. ipa-inline then fails
to inline d.isra via inline_functions_called_once, because d.isra is called
from two callers: e->d.isra and main->d.isra.

However, both d.isra calls are removed by the GIMPLE pass 102t.ccp2, which runs
after all IPA passes. From pr100034.O3.c.103t.objsz1:

;; Function d.isra (d.isra.0, funcdef_no=4, decl_uid=2014, cgraph_uid=7, symbol_order=10) (executed once)

void d.isra ()
{
  int D.2016;

  [local count: 10631108]:

  [local count: 1073741824]:
  foo ();
  goto ; [100.00%]

}

;; Function e (e, funcdef_no=2, decl_uid=1994, cgraph_uid=3, symbol_order=6)

void e ()
{
  [local count: 59461674]:
  return;

}

;; Function main (main, funcdef_no=3, decl_uid=1999, cgraph_uid=4, symbol_order=7) (executed once)

int main ()
{
  [local count: 59461674]:
  return 0;

}

Currently all IPA passes run before the GIMPLE optimizations. Is it possible to
run some passes, such as pass_rebuild_cgraph_edges and
pass_ipa_remove_symbols, again after the GIMPLE optimizations have exposed new
opportunities?
[Bug ipa/93318] [10 regression] Firefox LTO+FDO ICEs in speculative_call_info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93318 --- Comment #10 from luoxhu at gcc dot gnu.org --- And the Profile id of that node is streamed to many objects after lto partition: grep -- "19598949" ** db_server.ltrans0.000i.cgraph: Profile id: 19598949 db_server.ltrans0.000i.cgraph: Profile id: 19598949 db_server.ltrans0.000i.cgraph: Profile id: 19598949 db_server.ltrans0.079i.inline: Profile id: 19598949 db_server.ltrans0.079i.inline: Profile id: 19598949 db_server.ltrans12.000i.cgraph: Profile id: 19598949 db_server.ltrans12.000i.cgraph: Profile id: 19598949 db_server.ltrans12.000i.cgraph: Profile id: 19598949 db_server.ltrans14.000i.cgraph: Profile id: 19598949 db_server.ltrans26.000i.cgraph: Profile id: 19598949 db_server.ltrans26.000i.cgraph: Profile id: 19598949 db_server.ltrans26.000i.cgraph: Profile id: 19598949 db_server.ltrans31.000i.cgraph: Profile id: 19598949 db_server.ltrans32.000i.cgraph: Profile id: 19598949 db_server.wpa.000i.cgraph: Profile id: 19598949 db_server.wpa.001i.lto-link: Profile id: 19598949 db_server.wpa.003i.lto-partition: Profile id: 19598949 db_server.wpa.070i.whole-program: Profile id: 19598949 db_server.wpa.071i.profile_estimate: Profile id: 19598949 db_server.wpa.072i.icf: Profile id: 19598949 db_server.wpa.073i.devirt: Profile id: 19598949 db_server.wpa.074i.cp: Profile id: 19598949 db_server.wpa.075i.sra: Profile id: 19598949 db_server.wpa.078i.fnsummary: Profile id: 19598949 db_server.wpa.079i.inline: Profile id: 19598949 db_server.wpa.080i.pure-const: Profile id: 19598949 db_server.wpa.080i.pure-const: Profile id: 19598949 db_server.wpa.080i.pure-const: Profile id: 19598949 db_server.wpa.080i.pure-const: Profile id: 19598949 db_server.wpa.082i.static-var: Profile id: 19598949 db_server.wpa.082i.static-var: Profile id: 19598949
[Bug ipa/93318] [10 regression] Firefox LTO+FDO ICEs in speculative_call_info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93318

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
I have a test case that ICEs at:

external/com_google_protobuf/src/google/protobuf/message_lite.h:515:68:
internal compiler error: Segmentation fault
0xde2816 crash_signal
        ../../gcc/toplev.c:328
0xe82370 copy_bb
        ../../gcc/tree-inline.c:2204
0xe84afa copy_cfg_body
        ../../gcc/tree-inline.c:3022
0xe855ea copy_body
        ../../gcc/tree-inline.c:3270
0xe8945b expand_call_inline
        ../../gcc/tree-inline.c:5061
0xe8a055 gimple_expand_calls_inline
        ../../gcc/tree-inline.c:5251
0xe8a831 optimize_inline_calls(tree_node*)
        ../../gcc/tree-inline.c:5424
0xb976ea inline_transform(cgraph_node*)
        ../../gcc/ipa-inline-transform.c:736
0xd1a147 execute_one_ipa_transform_pass
        ../../gcc/passes.c:2233
0xd1a2a1 execute_all_ipa_transforms(bool)
        ../../gcc/passes.c:2272
0x901809 cgraph_node::expand()
        ../../gcc/cgraphunit.c:2293
0x901e4a expand_all_functions
        ../../gcc/cgraphunit.c:2471
0x9028dd symbol_table::compile()
        ../../gcc/cgraphunit.c:2822
0x834fbc lto_main()
        ../../gcc/lto/lto.c:653

tree-inline.c:2204
2204:  cgraph_edge *indirect = old_edge->speculative_call_indirect_edge ();
2205:  profile_count indir_cnt = indirect->count;

The returned indirect is 0, which causes the failure on line 2205.

(gdb) p old_edge->caller->debug()
_ZNK6google8protobuf11MessageLite23IsInitializedWithErrorsEv/15805768 (IsInitializedWithErrors) @0x76d44438
  Type: function definition analyzed
  Visibility: external public visibility_specified visibility:hidden
  References: _ZNK4trpc15RequestProtocol13IsInitializedEv/15470318 (addr) (speculative)
  Referring:
  Function IsInitializedWithErrors/15805768 is inline copy in OnExtendedInfosReceive/3878638
  Availability: local
  Unit id: 1201
  Function flags: count:26415 (adjusted) first_run:577 body local hot
  Called by: _ZN7yottadb2ds18BoundedReadWatcher22OnExtendedInfosReceiveERKSs/3878638 (inlined) (26415 (adjusted),1.00 per call) (can throw external)
  Calls: _ZNK6google8protobuf11MessageLite29LogInitializationErrorMessageEv/15806151 (0 (guessed),0.00 per call) (can throw external)
         _ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16350633 (speculative) (inlined) (12547 (adjusted),0.47 per call) (can throw external)
         _ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16375492 (inlined) (indirect_inlining) (13868 (adjusted),0.52 per call) (can throw external)
$84 = void

(gdb) p old_edge->callee->debug()
_ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16350633 (IsInitialized.constprop) @0x76d44b40
  Type: function definition analyzed
  Visibility: artificial
  References:
  Referring:
  Read from file: db_server.ltrans32.o
  Function IsInitialized.constprop/16350633 is inline copy in OnExtendedInfosReceive/3878638
  Availability: local
  Unit id: 116
  Function flags: count:12547 (adjusted) first_run:8235 body local icf_merged nonfreeing_fn
  Called by: _ZNK6google8protobuf11MessageLite23IsInitializedWithErrorsEv/15805768 (speculative) (inlined) (12547 (adjusted),0.47 per call) (can throw external)
  Calls:

In wpa.079i.inline, the node has TWO *polymorphic indirect call* speculative
targets. I wrote a similar test case, but it passed.
_ZNK6google8protobuf11MessageLite23IsInitializedWithErrorsEv/15805768 (IsInitializedWithErrors) @0x7efdc479a2d0 Type: function definition analyzed Visibility: prevailing_def_ironly previous sharing asm name: 16375490 References: _ZNK4trpc15RequestProtocol13IsInitializedEv/15470318 (addr) (speculative) _ZNK7yottadb3rpc17RunCommandRequest13IsInitializedEv/9954194 (addr) (speculative) Referring: Read from file: bazel-out/k8-dbg/bin/external/com_google_protobuf/libprotobuf_lite.a Availability: local Profile id: 19598949 Unit id: 1200 Function flags: count:1072 (adjusted) first_run:577 local Called by: _ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1ESsEEbRKT0_/16456195 (1824663 (estimated locally),0.00 per call) (can throw external) _ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1EPNS0_2io19ZeroCopyInputStreamEEEbRKT0_/15806727 (14 (adjusted),1.00 per call) (can throw external) _ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1ESsEEbRKT0_/15806733 (1006 (adjusted),1.00 per call) (can throw external) _ZN6google8protobuf11MessageLite9ParseFromILNS1_10ParseFlagsE1ENS0_11StringPieceEEEbRKT0_/15806735 (52 (precise),1.00 per call) (can throw external) Calls: _ZNK7yottadb2ds28AppendLogRequestExtendedInfo13IsInitializedEv.constprop.0/16365519 (speculative) (inlined) (456 (adjusted),0.43 per call) (
[Bug lto/105133] lto/gold: lto failed to link --start-lib/--end-lib in gold for duplicate libraries
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133 --- Comment #2 from luoxhu at gcc dot gnu.org --- (In reply to Richard Biener from comment #1) > (In reply to luoxhu from comment #0) > > > > cat hellow.res > > 3 > > hello.o 2 > > 192 ccb9165e03755470 PREVAILING_DEF main > > 197 ccb9165e03755470 PREVAILING_DEF_IRONLY s > > ./B/libhello.c.o 1 > > 205 68e0b97e93a52d7a PREEMPTED_REG hello > > ./C/libhello.c.o 1 > > 205 18fe2d3482bfb511 PREEMPTED_REG hello > > This looks like a gold bug - we have 'hello' pre-empted twice but no > prevailing > symbol in the IR - are you ending up with fat LTO objects? It is not fat LTO objects since I didn't add -ffat-lto-objects when generating lib: nm libhello.a libhello.c.o: nm: libhello.c.o: plugin needed to handle lto object 0001 C __gnu_lto_slim > > OTOH PREEMPTED_REG seems then handled wrongly by LTO as well - it should > throw away both copies since the linker told us it found a preempting > definition in a non-IR object file. So I'd expect a unresolved reference > to 'hello' rather than LTO complaining about multiple definitions ... Will you fix it? :) > > Note gold is really unmaintained, so you should probably avoid using it. Thanks. Will try lld instead.
[Bug lto/105133] New: lto/gold: lto failed to link --start-lib/--end-lib in gold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133

            Bug ID: 105133
           Summary: lto/gold: lto failed to link --start-lib/--end-lib in
                    gold
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: lto
          Assignee: unassigned at gcc dot gnu.org
          Reporter: luoxhu at gcc dot gnu.org
                CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Hi, the gold linker supports --start-lib and --end-lib, which "mimics the
semantics of static libraries, but without needing to actually create the
archive file" (https://reviews.llvm.org/D66848).
A large application may pull in multiple libraries from different repositories
that contain the same source code, and all of them end up linked into one
binary. I recently hit a link error with gold as the linker and reduced it to
the example below:

cat hello.c

extern int hello(int a);
int main(void)
{
  return 0;
  /* hello(10); */
}

cat ./B/libhello.c

#include <stdio.h>
int hello(int a)
{
  puts("Hello");
  return 0;
}

cat ./C/libhello.c

#include <stdio.h>
int hello(int a)
{
  puts("Hello");
  return 0;
}

(1) Non-LTO link with gold is OK:

gcc -O2 -o ./B/libhello.c.o -c ./B/libhello.c
gcc-ar qc ./B/libhello.a ./B/libhello.c.o
gcc-ranlib ./B/libhello.a

gcc -O2 -o ./C/libhello.c.o -c ./C/libhello.c
gcc-ar qc ./C/libhello.a ./C/libhello.c.o
gcc-ranlib ./C/libhello.a

gcc hello.c -o hello.o -c -O2
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o -Wl,--end-lib -Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -fuse-ld=gold

(2) LTO link with gold fails with a redefinition error:

gcc -O2 -flto -o ./B/libhello.c.o -c ./B/libhello.c
gcc-ar qc ./B/libhello.a ./B/libhello.c.o
gcc-ranlib ./B/libhello.a

gcc -O2 -flto -o ./C/libhello.c.o -c ./C/libhello.c
gcc-ar qc ./C/libhello.a ./C/libhello.c.o
gcc-ranlib ./C/libhello.a

gcc hello.c -o hello.o -c -O2 -flto
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o -Wl,--end-lib -Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -flto -fuse-ld=gold

./B/libhello.c:5:5: error: 'hello' has already been defined
    5 | int hello(int a)
      |     ^
./B/libhello.c:5:5: note: previously defined here
lto1: fatal error: errors during merging of translation units
compilation terminated.
lto-wrapper: fatal error: gcc returned 1 exit status
compilation terminated.
/usr/bin/ld.gold: fatal error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The error is raised in lto_symtab_resolve_symbols (gcc/lto/lto-symtab.c).
Simply removing the error_at call makes the link succeed, but that may not be
a reasonable fix.

  /* Find the single non-replaceable prevailing symbol and
     diagnose ODR violations.  */
  for (e = first; e; e = e->next_sharing_asm_name)
    {
      if (!lto_symtab_resolve_can_prevail_p (e))
        continue;

      /* If we have a non-replaceable definition it prevails.  */
      if (!lto_symtab_resolve_replaceable_p (e))
        {
          if (prevailing)
            {
              error_at (DECL_SOURCE_LOCATION (e->decl),
                        "%qD has already been defined", e->decl);
              inform (DECL_SOURCE_LOCATION (prevailing->decl),
                      "previously defined here");
            }
          prevailing = e;
        }
    }

cat hellow.res

3
hello.o 2
192 ccb9165e03755470 PREVAILING_DEF main
197 ccb9165e03755470 PREVAILING_DEF_IRONLY s
./B/libhello.c.o 1
205 68e0b97e93a52d7a PREEMPTED_REG hello
./C/libhello.c.o 1
205 18fe2d3482bfb511 PREEMPTED_REG hello

Secondly, if hello.c actually calls hello(10), no error is reported. The
difference is that the resolution type changes from PREEMPTED_REG to
RESOLVED_IR/PREVAILING_DEF_IRONLY:

3
hello.o 3
192 19ef867d12f62129 PREVAILING_DEF main
197 19ef867d12f62129 PREVAILING_DEF_IRONLY s
201 19ef867d12f62129 RESOLVED_IR hello
./B/libhello.c.o 1
205 23c5c855935478ce PREVAILING_DEF_IRONLY hello
./C/libhello.c.o 1
205 abbf050f5c23b448 PREEMPTED_REG hello

Is this a valid bug? Thanks.
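The control flow of the quoted loop can be modelled in a few lines of C to show why two copies of `hello` yield exactly one diagnostic. This is only a simplified sketch: `struct sym_entry`, `count_redefinitions` and `check_redefinitions` are hypothetical names, the two flags stand in for the results of `lto_symtab_resolve_can_prevail_p` and `lto_symtab_resolve_replaceable_p`, and it assumes that both copies pass both checks, which is what the reported error implies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a symtab entry; only the two predicates
   consulted by the quoted loop are modelled.  */
struct sym_entry
{
  int can_prevail;   /* models lto_symtab_resolve_can_prevail_p (e) */
  int replaceable;   /* models lto_symtab_resolve_replaceable_p (e) */
  struct sym_entry *next_sharing_asm_name;
};

/* Walk the chain like the quoted loop.  Stores the last non-replaceable
   candidate in *PREVAILING and returns how many times the
   "has already been defined" error would be emitted.  */
int count_redefinitions (struct sym_entry *first,
                         struct sym_entry **prevailing)
{
  int errors = 0;
  *prevailing = NULL;
  for (struct sym_entry *e = first; e; e = e->next_sharing_asm_name)
    {
      if (!e->can_prevail)
        continue;
      if (!e->replaceable)
        {
          if (*prevailing)
            errors++;
          *prevailing = e;
        }
    }
  return errors;
}

/* Two non-replaceable copies give one error; hiding one gives none.  */
void check_redefinitions (void)
{
  struct sym_entry c = { 1, 0, NULL };   /* ./C/libhello.c.o */
  struct sym_entry b = { 1, 0, &c };     /* ./B/libhello.c.o */
  struct sym_entry *p;

  assert (count_redefinitions (&b, &p) == 1);
  assert (p == &c);

  c.can_prevail = 0;
  assert (count_redefinitions (&b, &p) == 0);
  assert (p == &b);
}
```

In this model the diagnostic disappears as soon as one of the two copies is filtered out before the non-replaceable check, which is the behaviour one would expect for a symbol the linker has marked as preempted.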
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #13 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from luoxhu at gcc dot gnu.org ---
Fixed by:

The master branch has been updated by Xiong Hu Luo :

https://gcc.gnu.org/g:0552605b7b27dc6beed62e71bd05bc1efd191c0d

commit r12-6430-g0552605b7b27dc6beed62e71bd05bc1efd191c0d
Author: Xionghu Luo
Date:   Mon Jan 10 20:05:56 2022 -0600

    testsuite: Fix regression on m32 by r12-6087 [PR103820]

    r12-6087 will avoid move cold bb out of hot loop, while the original
    intent of this testcase is to hoist divides out of loop and CSE them to
    only one divide.  So increase the loop count to turn the cold bb to hot
    bb again.  Then the 3 divides could be rewritten with same reciptmp.

    Tested pass on Power-Linux {32,64}, x86 {64,32} and i686-linux.

    gcc/testsuite/ChangeLog:

            PR testsuite/103820
            * gcc.dg/tree-ssa/recip-3.c: Adjust.
[Bug bootstrap/103820] [12 Regression] i686 failed to bootstrap with ada by r12-6077
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103820

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |luoxhu at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
(In reply to CVS Commits from comment #6)
> The master branch has been updated by Xiong Hu Luo :
>
> https://gcc.gnu.org/g:0552605b7b27dc6beed62e71bd05bc1efd191c0d
>
> commit r12-6430-g0552605b7b27dc6beed62e71bd05bc1efd191c0d
> Author: Xionghu Luo
> Date:   Mon Jan 10 20:05:56 2022 -0600
>
>     testsuite: Fix regression on m32 by r12-6087 [PR103820]
>
>     r12-6087 will avoid move cold bb out of hot loop, while the original
>     intent of this testcase is to hoist divides out of loop and CSE them to
>     only one divide.  So increase the loop count to turn the cold bb to hot
>     bb again.  Then the 3 divides could be rewritten with same reciptmp.
>
>     Tested pass on Power-Linux {32,64}, x86 {64,32} and i686-linux.
>
>     gcc/testsuite/ChangeLog:
>
>             PR testsuite/103820
>             * gcc.dg/tree-ssa/recip-3.c: Adjust.

Typo: the PR number in the commit should be PR103802.
[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #6 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #5)
> So the point is that P is invariant but we do not hoist it because it's
> computed in a (estimated) cold block?  I notice that the condition is
> invariant, too, so
> in principle we could hoist as
>
>   if (d > 0.01)
>     P = ( W < E ) ? (W - E)/d : (E - W)/d;
>   for (i=0; i < 2; i++ )
>     if( d > 0.01 )
>       F[i] += P;

Yes. But this loop only iterates twice, so the bbs in the loop are colder than
the preheader. -funswitch-loops should move the condition out of the loop, but
that also requires increasing the loop iteration count:

"/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c:16:14:
note: Not unswitching, loop is not expected to iterate"

>
> alternatively one might argue that invariant expressions (unconditionally
> computed or in a special way under invariant conditions) should be costed
> differently.
>
> I think best would be to restore the original intent of the testcase which
> was added with the fix for PRs 23109, 23948 and 24123.  I suppose there
> we saw the invariant hoisted(?) and the loop unrolled so I would suggest
> to either apply the hoisting or the unrolling manually to the testcase.
> (just look at the PRs whether you get a better idea of the origin of the
> testcase).

To restore the original intent of the test case, increasing the loop count is
better than "either apply the hoisting or unrolling". Changing it from "2" to
at least "5" turns the cold bb into a hot bb, so the two divides can be
hoisted out by the LIM pass again. (I verified that the change below passes on
both power-m32 and x86-i686.)

(It is more reasonable than the other two directions because the loop
iteration count is not essential to the test code.)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 641c91e..a1d2d87 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-O1 -fno-trapping-math -funsafe-math-optimizations -fdump-tree-recip" } */

-double F[2] = { 0.0, 0.0 }, e;
+double F[5] = { 0.0, 0.0 }, e;

 /* In this case the optimization is interesting.  */
 float h ()
@@ -13,7 +13,7 @@ float h ()
   d = 2.*e;
   E = 1. - d;

-  for( i=0; i < 2; i++ )
+  for( i=0; i < 5; i++ )
     if( d > 0.01 )
       {
        P = ( W < E ) ? (W - E)/d : (E - W)/d;
@@ -23,4 +23,4 @@ float h ()
   F[0] += E / d;
 }

-/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
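The manual hoisting suggested in the quoted reply can be written out as a small stand-alone C sketch to make clear which work is loop-invariant. This is a simplification, not the test case itself: the functions accumulate into a scalar instead of `F[i]`, `d`, `W` and `E` are passed as parameters, and the names `sum_in_loop`, `sum_hoisted` and `check_hoisting` are hypothetical.

```c
#include <assert.h>

/* Loop as in the test case: P is recomputed on every iteration even
   though d, W and E never change inside the loop.  */
double sum_in_loop (double d, double W, double E, int n)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    if (d > 0.01)
      {
        double P = (W < E) ? (W - E) / d : (E - W) / d;
        acc += P;
      }
  return acc;
}

/* Manually hoisted form: the invariant condition and the divide are
   evaluated once, in front of the loop.  */
double sum_hoisted (double d, double W, double E, int n)
{
  double P = 0.0;
  if (d > 0.01)
    P = (W < E) ? (W - E) / d : (E - W) / d;

  double acc = 0.0;
  for (int i = 0; i < n; i++)
    if (d > 0.01)
      acc += P;
  return acc;
}

/* Inputs chosen so every intermediate value is exact in binary
   floating point.  */
void check_hoisting (void)
{
  assert (sum_in_loop (0.5, 1.0, 0.5, 5) == sum_hoisted (0.5, 1.0, 0.5, 5));
  assert (sum_hoisted (0.5, 1.0, 0.5, 5) == -5.0);
  assert (sum_in_loop (0.005, 1.0, 0.99, 5) == 0.0);
  assert (sum_hoisted (0.005, 1.0, 0.99, 5) == 0.0);
}
```

Whether the compiler performs this transformation on its own depends on how hot it estimates the guarded block to be, which is what raising the trip count from 2 to 5 influences.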
[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Or restore the previous recip count check by commenting out the if condition,
so that the bb in the loop does not turn cold?

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 641c91e719e..d3c3053486d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -14,7 +14,13 @@ float h ()
   E = 1. - d;

   for( i=0; i < 2; i++ )
-    if( d > 0.01 )
+    // if( d > 0.01 )
+    /* The if condition will make followed bb cold (profile count
+       less than the loop preheader), while r12-6087 is an
+       optimization that avoids move COLD invariant expression out
+       of loop, since this test case is to test recip expression
+       could be CSE and eliminated, so comment the condition to keep
+       the test point.  */
       {
        P = ( W < E ) ? (W - E)/d : (E - W)/d;
        F[i] += P;
@@ -23,4 +29,4 @@ float h ()
   F[0] += E / d;
 }

-/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
[Bug tree-optimization/103793] [12 Regression] ICE: in to_reg_br_prob_base, at profile-count.h:277 with -O3 -fno-guess-branch-probability since r12-6086-gcd5ae148c47c6dee
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103793

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug rtl-optimization/94790] Failure to use andn in specific pattern in which it is available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94790

--- Comment #4 from luoxhu at gcc dot gnu.org ---
I just noticed that these are different cases: scalar vs. vector...
[Bug rtl-optimization/94790] Failure to use andn in specific pattern in which it is available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94790

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
On Power, '(~mask & a) | (b & mask)' is better than 'a ^ ((a ^ b) & mask)',
because the first form can be generated as the single instruction 'xxsel', as
PR90323 shows.
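That the two expressions compute the same bitwise select can be checked with a short C sketch. The function names `select_andn`, `select_xor` and `check_select` are hypothetical, and the sketch uses 64-bit scalars purely for illustration; it says nothing about which instruction a given target emits.

```c
#include <assert.h>
#include <stdint.h>

/* Select form: take bits of b where mask is 1, bits of a elsewhere.  */
uint64_t select_andn (uint64_t a, uint64_t b, uint64_t mask)
{
  return (~mask & a) | (b & mask);
}

/* Xor form: where mask is 1, a ^ (a ^ b) gives b; elsewhere a ^ 0
   gives a.  */
uint64_t select_xor (uint64_t a, uint64_t b, uint64_t mask)
{
  return a ^ ((a ^ b) & mask);
}

/* Both forms agree, and the all-zero and all-one masks pick a and b.  */
void check_select (void)
{
  uint64_t a = 0x00ff00ff00ff00ffULL;
  uint64_t b = 0x123456789abcdef0ULL;
  uint64_t m = 0x0f0f0f0f0f0f0f0fULL;

  assert (select_andn (a, b, m) == select_xor (a, b, m));
  assert (select_andn (a, b, 0) == a);
  assert (select_xor (a, b, 0) == a);
  assert (select_andn (a, b, ~0ULL) == b);
  assert (select_xor (a, b, ~0ULL) == b);
}
```

Because the results are identical for every input, the choice between the two forms is purely a matter of which one maps to cheaper instructions on the target.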
[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Adding -funroll-loops works around this. Is that a reasonable fix?
[Bug tree-optimization/103802] [12 regression] recip-3.c fails after r12-6087 on Power m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

--- Comment #1 from luoxhu at gcc dot gnu.org ---
MOVE_MAX_PIECES is 4 with -m32 but 8 with -m64, so estimate_move_cost differs
between the two (2 vs. 1) for
"((size + MOVE_MAX_PIECES - 1) / MOVE_MAX_PIECES)".

recip-3.m32.c.172t.cunroll:

 BB: 11, after_exit: 0
 BB: 7, after_exit: 0
  size:   2 _4 = F[i_23];
  size:   1 _5 = _4 + iftmp.1_10;
  size:   2 F[i_23] = _5;
 BB: 5, after_exit: 0
  size:   1 _2 = d_14 + 1.00088817841970012523233890533447265625e-1;
  size:   1 reciptmp_12 = 1.0e+0 / d_14;
  size:   1 iftmp.1_18 = reciptmp_12 * _2;
 BB: 6, after_exit: 0
  size:   1 _3 = -1.00088817841970012523233890533447265625e-1 - d_14;
  size:   1 reciptmp_25 = 1.0e+0 / d_14;
  size:   1 iftmp.1_17 = reciptmp_25 * _3;
 BB: 4, after_exit: 0
  size:   2 if (e.0_1 < -5.00444089209850062616169452667236328125e-2)
size: 19-4, last_iteration: 19-4
  Loop size: 19
  Estimated size after unrolling: 20
Not unrolling loop 1: size would grow.

But recip-3.m64.c.172t.cunroll:

 BB: 11, after_exit: 0
 BB: 7, after_exit: 0
  size:   1 _4 = F[i_23];
  size:   1 _5 = _4 + iftmp.1_10;
  size:   1 F[i_23] = _5;
 BB: 5, after_exit: 0
  size:   1 _2 = d_14 + 1.00088817841970012523233890533447265625e-1;
  size:   1 reciptmp_12 = 1.0e+0 / d_14;
  size:   1 iftmp.1_18 = reciptmp_12 * _2;
 BB: 6, after_exit: 0
  size:   1 _3 = -1.00088817841970012523233890533447265625e-1 - d_14;
  size:   1 reciptmp_25 = 1.0e+0 / d_14;
  size:   1 iftmp.1_17 = reciptmp_25 * _3;
 BB: 4, after_exit: 0
  size:   2 if (e.0_1 < -5.00444089209850062616169452667236328125e-2)
size: 17-4, last_iteration: 17-4
  Loop size: 17
  Estimated size after unrolling: 17
Making edge 18->9 impossible by redistributing probability to other edges.
Making edge 8->10 impossible by redistributing probability to other edges.
/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c:16:14:
optimized: loop with 1 iterations completely unrolled (header execution count
357878154)
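The quoted expression is a ceiling division, and a tiny C sketch shows where the 2-vs-1 difference comes from. It models only that one expression, not the whole of `estimate_move_cost`; the function names `move_pieces` and `check_move_pieces` are hypothetical, and the 8-byte size is an assumption based on `F` being an array of `double`.

```c
#include <assert.h>

/* Number of pieces needed to move SIZE bytes when at most MAX_PIECES
   bytes can be moved at once: ceil (size / max_pieces).  */
unsigned move_pieces (unsigned size, unsigned max_pieces)
{
  return (size + max_pieces - 1) / max_pieces;
}

/* An 8-byte double takes two pieces with MOVE_MAX_PIECES == 4 (-m32)
   and one piece with MOVE_MAX_PIECES == 8 (-m64).  */
void check_move_pieces (void)
{
  assert (move_pieces (8, 4) == 2);
  assert (move_pieces (8, 8) == 1);
}
```

The load `_4 = F[i_23]` and the store `F[i_23] = _5` each cost one unit more with -m32, which accounts for the loop size of 19 versus 17 in the two dumps.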
[Bug middle-end/103802] New: [12 regression] recip-3.c fails after r12-6087 on Power m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103802

            Bug ID: 103802
           Summary: [12 regression] recip-3.c fails after r12-6087 on
                    Power m32
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

Invoking the compiler as /home/luoxhu/workspace/gcc-master_build/gcc/xgcc
-B/home/luoxhu/workspace/gcc-master_build/gcc/
/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
-fdiagnostics-plain-output -O1 -fno-trapping-math -funsafe-math-optimizations
-fdump-tree-recip -S -m32 -o recip-3.s
Executing on host: /home/luoxhu/workspace/gcc-master_build/gcc/xgcc
-B/home/luoxhu/workspace/gcc-master_build/gcc/
/home/luoxhu/workspace/gcc-master/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
-fdiagnostics-plain-output -O1 -fno-trapping-math
-funsafe-math-optimizations -fdump-tree-recip -S -m32 -o recip-3.s
(timeout = 300)
gcc.dg/tree-ssa/recip-3.c: pattern found 3 times
FAIL: gcc.dg/tree-ssa/recip-3.c scan-tree-dump-times recip " / " 5

The reason is that cunroll does not unroll the loop with -m32; from
recip-3.m32.c.172t.cunroll:

Not unrolling loop 1: size would grow.
[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug tree-optimization/103793] [12 Regression] ICE: in to_reg_br_prob_base, at profile-count.h:277 with -O3 -fno-guess-branch-probability since r12-6086-gcd5ae148c47c6dee
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103793

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Confirmed. With -fno-guess-branch-probability the edge probability can be
uninitialized, while to_reg_br_prob_base requires an initialized value, so how
about adding a guard like this?

+      if (true_edge->probability.initialized_p ())
+       {
+         edge exit_to_latch1 = single_pred_edge (loop1->latch);
+         exit_to_latch1->probability
+           = exit_to_latch1->probability.apply_scale (
+             true_edge->probability.to_reg_br_prob_base (),
+             REG_BR_PROB_BASE);
+         single_exit (loop1)->probability
+           = exit_to_latch1->probability.invert ();
+       }
[Bug middle-end/102860] [12 regression] libgomp.fortran/simd2.f90 ICEs after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102860

--- Comment #6 from luoxhu at gcc dot gnu.org ---
Fortran's modulo is floor_mod, as documented here:
https://gcc.gnu.org/onlinedocs/gfortran/MODULO.html?

Syntax:
RESULT = MODULO(A, P)

Return value:
The type and kind of the result are those of the arguments. (As a GNU
extension, kind is the largest kind of the actual arguments.)

If A and P are of type INTEGER:
MODULO(A,P) has the value R such that A=Q*P+R, where Q is an integer and R is
between 0 (inclusive) and P (exclusive).

If A and P are of type REAL:
MODULO(A,P) has the value of A - FLOOR (A / P) * P.

The returned value has the same sign as P and a magnitude less than the
magnitude of P.

program test_modulo
  print *, modulo(17,3)
  print *, modulo(17.5,5.5)

  print *, modulo(-17,3)
  print *, modulo(-17.5,5.5)

  print *, modulo(17,-3)
  print *, modulo(17.5,-5.5)
end program

LD_LIBRARY_PATH=./x86_64-pc-linux-gnu/libgfortran/.libs/ ./a.out
 2
 1.
 1
 4.5000
 -1
 -4.5000
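For readers more used to C, where `%` truncates toward zero, the floored semantics described above can be sketched as follows. The function names `floor_mod_int`, `floor_mod_real` and `check_floor_mod` are hypothetical; the real variant does the floor by hand and assumes the quotient fits in a `long long`.

```c
#include <assert.h>

/* Floored modulo for integers: the result has the sign of P.  C's %
   gives the sign of A, so adjust when the signs differ.  */
int floor_mod_int (int a, int p)
{
  int r = a % p;
  if (r != 0 && ((r < 0) != (p < 0)))
    r += p;
  return r;
}

/* A - FLOOR (A / P) * P for reals.  The cast truncates toward zero, so
   step down by one for negative non-integer quotients.  */
double floor_mod_real (double a, double p)
{
  double q = a / p;
  long long t = (long long) q;
  if ((double) t > q)
    t--;
  return a - (double) t * p;
}

/* Same six cases as the Fortran program above.  */
void check_floor_mod (void)
{
  assert (floor_mod_int (17, 3) == 2);
  assert (floor_mod_real (17.5, 5.5) == 1.0);
  assert (floor_mod_int (-17, 3) == 1);
  assert (floor_mod_real (-17.5, 5.5) == 4.5);
  assert (floor_mod_int (17, -3) == -1);
  assert (floor_mod_real (17.5, -5.5) == -4.5);
}
```

The results match the program output shown above, while plain `-17 % 3` in C evaluates to -2. That difference between floor_mod and truncating modulo is the reason the two operations must be kept apart.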
[Bug middle-end/102860] [12 regression] libgomp.fortran/simd2.f90 ICEs after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102860

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |luoxhu at gcc dot gnu.org

--- Comment #5 from luoxhu at gcc dot gnu.org ---
P8, P9 and x86 don't vectorize the floor_mod operation, so they pass. The fix
in #c2 only fixes the ICE; execution still fails because R239 is used but
never defined.
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

--- Comment #11 from luoxhu at gcc dot gnu.org ---
+(define_insn_and_split "*anddi3_insn_dot"
+  [(set (pc)
+    (if_then_else (eq (and:DI (match_operand:DI 1 "gpc_reg_operand" "%r,r")
+                             (match_operand:DI 2 "const_int_operand" "n,n"))
+                     (const_int 0))
+                 (label_ref (match_operand 3 ""))
+                 (pc)))
+  (clobber (match_scratch:DI 0 "=r,r"))]
+  "rs6000_is_valid_2insn_and (operands[2], DImode)
+   && !(rs6000_is_valid_and_mask (operands[2], DImode)
+       || logical_const_operand (operands[2], DImode))"
+  "#"
+  "&& reload_completed"
+  [(pc)]
+{
+   int nb, ne;
+   if (rs6000_is_valid_mask (operands[2], &nb, &ne, DImode) && nb >= ne)
+     {
+       unsigned HOST_WIDE_INT val = INTVAL (operands[2]);
+       int shift = 63 - nb;
+       rtx tmp = gen_rtx_ASHIFT (DImode, operands[1], GEN_INT (shift));
+       tmp = gen_rtx_AND (DImode, tmp, GEN_INT (val << shift));
+       rtx cr0 = gen_rtx_REG (CCmode, CR0_REGNO);
+       rs6000_emit_dot_insn (operands[0], tmp, 1, cr0);
+       rtx loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[3]);
+       rtx cond = gen_rtx_EQ (CCEQmode, cr0, const0_rtx);
+       rtx ite = gen_rtx_IF_THEN_ELSE (VOIDmode, cond, loc_ref, pc_rtx);
+       emit_jump_insn (gen_rtx_SET (pc_rtx, ite));
+       DONE;
+     }
+   else
+     FAIL;
+}
+  [(set_attr "type" "shift")
+   (set_attr "dot" "yes")
+   (set_attr "length" "8,12")])
+

This pattern lets combine merge the two instructions

    9: {r123:CC=cmp(r124:DI&0x6,0);clobber scratch;}
      REG_DEAD r124:DI
   10: pc={(r123:CC==0)?L15:pc}
      REG_DEAD r123:CC

into:

   10: {pc={(r124:DI&0x6==0)?L15:pc};clobber scratch;}

and split2 then splits it into a single rotate dot instruction. Is this OK?

(insn 32 9 33 2 (parallel [
            (set (reg:CC 100 0)
                (compare:CC (and:DI (ashift:DI (reg:DI 3 3 [124])
                            (const_int 29 [0x1d]))
                        (const_int -4611686018427387904 [0xc000000000000000]))
                    (const_int 0 [0])))
            (clobber (reg:DI 3 3 [125]))
        ]) "pr102239.c":4:6 239 {*rotldi3_mask_dot}
     (nil))
(jump_insn 33 32 11 2 (set (pc)
        (if_then_else (eq:CCEQ (reg:CC 100 0)
                (const_int 0 [0]))
            (label_ref 15)
            (pc))) "pr102239.c":4:6 869 {*cbranch}
     (int_list:REG_BR_PROB 536870916 (nil))
 -> 15)
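The reason the jump has to use CCEQmode can be illustrated with plain 64-bit arithmetic in C. The sketch below assumes the mask implied by the RTL above (the constant -4611686018427387904 is 0xc000000000000000, and shifting it right by 29 gives 0x600000000); `MASK` and all function names are hypothetical. It compares the field moved to the top bits, as in insn 32, with the field moved to the bottom bits, which is the right-shift alternative raised elsewhere in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask: bits 33 and 34, i.e. 0xc000000000000000 >> 29.  */
#define MASK 0x600000000ULL

/* The test the source code asks for.  */
int direct_is_zero (uint64_t x) { return (x & MASK) == 0; }

/* Field moved to the two most significant bits, as in insn 32.  */
uint64_t field_at_msb (uint64_t x) { return (x << 29) & (MASK << 29); }

/* Field moved to the two least significant bits.  */
uint64_t field_at_lsb (uint64_t x) { return (x >> 33) & (MASK >> 33); }

/* 1 when V would compare as negative if viewed as a signed value.  */
int sign_bit (uint64_t v) { return (int) (v >> 63); }

/* The zero test is the same in all three forms for every single-bit
   input; only the MSB form can produce a "negative" result.  */
void check_field_forms (void)
{
  for (int i = 0; i < 64; i++)
    {
      uint64_t x = 1ULL << i;
      assert (direct_is_zero (x) == (field_at_msb (x) == 0));
      assert (direct_is_zero (x) == (field_at_lsb (x) == 0));
      assert (sign_bit (field_at_lsb (x)) == 0);
    }
  assert (sign_bit (field_at_msb (1ULL << 34)) == 1);
}
```

So an EQ/NE test is safe on the shifted value, while an LT/GT test against zero would give a different answer than on the original `x & MASK`.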
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239 --- Comment #9 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #8) > (In reply to luoxhu from comment #6) > > > > foo: > > > > .LFB0: > > > > .cfi_startproc > > > > rldicr. 3,3,29,1 > > > > beq 0,.L2 > > > > > > This is fine, but only because it tests the EQ bit (not the LT or GT > > > bits). > > > So the generated RTL for this insn (the 2insn one) is not correct. > > > > The generated RTL in pr102239.c.300r.split2 is: > > > > (insn 32 8 33 2 (parallel [ > > (set (reg:CC 100 0 [123]) > > (compare:CC (and:DI (ashift:DI (reg:DI 3 3 [124]) > > (const_int 29 [0x1d])) > > (const_int -4611686018427387904 > > [0xc000])) > > (const_int 0 [0]))) > > (clobber (reg:DI 3 3 [125])) > > ]) "pr102239.c":4:6 238 {*rotldi3_mask_dot} > > (nil)) > > (insn 33 32 10 2 (set (reg:DI 3 3 [125]) > > (lshiftrt:DI (reg:DI 3 3 [125]) > > (const_int 29 [0x1d]))) "pr102239.c":4:6 278 {lshrdi3} > > (nil)) > > (jump_insn 10 33 11 2 (set (pc) > > (if_then_else (eq (reg:CC 100 0 [123]) > > (const_int 0 [0])) > > (label_ref 15) > > (pc))) "pr102239.c":4:6 868 {*cbranch} > > (int_list:REG_BR_PROB 536870916 (nil)) > > -> 15) > > So combine will have to look at insn 10 as well when it does the combination > (it often already does, via "other_insn") -- but also it does have to know > an "eq" is okay here, and that requires a new pattern. > > > rotldi3_mask_dot is what you mentioned in c#1, it is a shifted result and > > not matter for comparing to 0: > > It does matter, if what you are want to see is if it is smaller than zero or > greater than zero. CCmode includes those things. There is a CCEQmode for > if only the EQ bit is set correctly. Got it, thanks. As the example in c#7. If CCmode is LT, rotate data to highest bits will get negative result and set CR0 to negative, which is unexpected. 
> > > > *rotl3_mask_dot cannot do this either; the base and the dot2 of that > > > cannot be done, they return a shifted result, but that doesn't matter for > > > comparing it to 0. So we should add a specialised version. > > > > What specialized version to add? > > Some pattern that just does this as an rldicr, as a single insn. It will > have to be excluded by the 2insn thing (it is only a single insn itself!), > and it will have to have comparison mode CCEQ only. I was motivated by the clang code, and tried to rotate the data to LSB instead, it doesn't suffer from CCmode issue again? Will this be simpler than the combine & new pattern solution? diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index c9ce0550df1..d2a5b916b1d 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -11747,11 +11747,11 @@ rs6000_emit_2insn_and (machine_mode mode, rtx *operands, bool expand, int dot) } else { - rtx tmp = gen_rtx_ASHIFT (mode, operands[1], GEN_INT (shift)); - tmp = gen_rtx_AND (mode, tmp, GEN_INT (val << shift)); - emit_move_insn (operands[0], tmp); - tmp = gen_rtx_LSHIFTRT (mode, operands[0], GEN_INT (shift)); + rtx tmp = gen_rtx_LSHIFTRT (mode, operands[1], GEN_INT (ne)); + tmp = gen_rtx_AND (mode, tmp, GEN_INT (val >> ne)); rs6000_emit_dot_insn (operands[0], tmp, dot, dot ? 
operands[3] : 0); + tmp = gen_rtx_ASHIFT (mode, operands[0], GEN_INT (ne)); + emit_move_insn (operands[0], tmp); } return; RTL pr102239.c.300r.split2: (insn 32 8 33 2 (parallel [ (set (reg:CC 100 0 [123]) (compare:CC (and:DI (lshiftrt:DI (reg:DI 3 3 [124]) (const_int 33 [0x21])) (const_int 3 [0x3])) (const_int 0 [0]))) (clobber (reg:DI 3 3 [125])) ]) "pr102239.c":4:6 238 {*rotldi3_mask_dot} (nil)) (insn 33 32 10 2 (set (reg:DI 3 3 [125]) (ashift:DI (reg:DI 3 3 [125]) (const_int 33 [0x21]))) "pr102239.c":4:6 268 {ashldi3} (nil)) (jump_insn 10 33 11 2 (set (pc) (if_then_else (eq (reg:CC 100 0 [123]) (const_int 0 [0])) (label_ref 15) (pc))) "pr102239.c":4:6 868 {*cbranch} (int_list:REG_BR_PROB 536870916 (nil)) -> 15) ASM pr102239.s: foo: .LFB0: .cfi_startproc rldicl. 3,3,31,62 beq 0,.L2 #APP # 5 "pr102239.c" 1 # if # 0 "" 2 #NO_APP blr .p2align 4,,15 .L2: #APP
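The question of whether rotating the field to the LSB avoids the CCmode problem can be checked with a small model. This is only a sketch: the function names are hypothetical, and the mask 0x600000000 is taken from the decimal constant 25769803776 in the RTL of comment #2. The result of `(x >> 33) & 3` can never have bit 63 set, so its LT/EQ/GT classification against zero always matches that of the plain AND.

```c
#include <stdint.h>

/* Classify a 64-bit result against zero the way a record-form ("dot")
   instruction would: -1 = LT, 0 = EQ, 1 = GT.  Bit 63 is the sign bit. */
static int sign_class(uint64_t v)
{
  if (v >> 63)
    return -1;
  return v != 0;
}

/* The original test: AND with the contiguous mask 0x600000000. */
int class_plain_and(uint64_t x)
{
  return sign_class(x & 0x600000000ULL);
}

/* The LSB variant from the patch: shift right by 33, keep two bits.
   (Hypothetical name; models lshiftrt 33 followed by AND 3.) */
int class_lsb_variant(uint64_t x)
{
  return sign_class((x >> 33) & 3);
}
```

Under this model the two functions agree for every input, which is consistent with the `rldicl. 3,3,31,62` sequence being usable with a full CCmode compare.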
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239 --- Comment #7 from luoxhu at gcc dot gnu.org ---
Dump of assembler code for function foo:
   0x15e0 <+0>:   rldicr. r3,r3,29,1
=> 0x15e4 <+4>:   beq     0x15f0
   0x15e8 <+8>:   blr
   0x15ec <+12>:  ori     r2,r2,0
   0x15f0 <+16>:  blr
   0x15f4 <+20>:  .long 0x0
   0x15f8 <+24>:  .long 0x0
(gdb) si
0x15e4 in foo ()
1: /x $r3 = 0xc000
2: /x $cr = 0x82000282

cr0 is negative (LT set) when only rotldi3_mask_dot is emitted, whereas $cr was 0x42000282 with the master code.

BTW, clang also generates a sequence with two rotates:

foo(long):                              # @foo(long)
        rldicl 3, 3, 31, 33
        rldicl. 3, 3, 33, 29
        beq 0, .LBB0_2
        blr
.LBB0_2:
        blr
        .long 0
        .quad 0
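The difference between the two $cr values above can be reproduced with a small model of the CR0 field. This is a sketch with hypothetical helper names; it ignores the SO bit and assumes the mask is 0x600000000 (decimal 25769803776 in the RTL). Rotating the two tested bits up to positions 62 and 63 keeps the EQ answer intact but can set the sign bit, which flips GT to LT.

```c
#include <stdint.h>

/* CR0 field values as they appear in the top nibble of $cr. */
enum { CR0_LT = 0x8, CR0_GT = 0x4, CR0_EQ = 0x2 };

/* CR0 as set by a record-form instruction comparing its result with 0. */
static unsigned cr0_from(uint64_t v)
{
  if (v >> 63)
    return CR0_LT;            /* sign bit set: negative */
  return v ? CR0_GT : CR0_EQ;
}

/* Rotate left by n, for 0 < n < 64. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
  return (x << n) | (x >> (64 - n));
}

/* What the source asks for: compare (x & 0x600000000) with 0. */
unsigned cr0_and_mask(uint64_t x)
{
  return cr0_from(x & 0x600000000ULL);
}

/* What "rldicr. 3,3,29,1" computes: rotate left 29, keep bits 62..63. */
unsigned cr0_rotated(uint64_t x)
{
  return cr0_from(rotl64(x, 29) & 0xC000000000000000ULL);
}
```

For an input with both tested bits set, the first function yields 0x4 (GT) and the second 0x8 (LT), matching the 0x42000282 and 0x82000282 values seen in gdb, while both yield 0x2 (EQ) when the bits are clear. That is why the single-insn form is only safe when the consumer looks at the EQ bit alone.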
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239 --- Comment #6 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #5) > (In reply to luoxhu from comment #4) > > Simply adjust the sequence of dot instruction could produce expected code, > > is this correct? > > No it isn't. Sorry. Sorry I don't understand what is wrong... > > > foo: > > .LFB0: > > .cfi_startproc > > rldicr. 3,3,29,1 > > beq 0,.L2 > > This is fine, but only because it tests the EQ bit (not the LT or GT bits). > So the generated RTL for this insn (the 2insn one) is not correct. The generated RTL in pr102239.c.300r.split2 is: (insn 32 8 33 2 (parallel [ (set (reg:CC 100 0 [123]) (compare:CC (and:DI (ashift:DI (reg:DI 3 3 [124]) (const_int 29 [0x1d])) (const_int -4611686018427387904 [0xc000])) (const_int 0 [0]))) (clobber (reg:DI 3 3 [125])) ]) "pr102239.c":4:6 238 {*rotldi3_mask_dot} (nil)) (insn 33 32 10 2 (set (reg:DI 3 3 [125]) (lshiftrt:DI (reg:DI 3 3 [125]) (const_int 29 [0x1d]))) "pr102239.c":4:6 278 {lshrdi3} (nil)) (jump_insn 10 33 11 2 (set (pc) (if_then_else (eq (reg:CC 100 0 [123]) (const_int 0 [0])) (label_ref 15) (pc))) "pr102239.c":4:6 868 {*cbranch} (int_list:REG_BR_PROB 536870916 (nil)) -> 15) rotldi3_mask_dot is what you mentioned in c#1, it is a shifted result and not matter for comparing to 0: > *rotl3_mask_dot cannot do this either; the base and the dot2 of that > cannot be done, they return a shifted result, but that doesn't matter for > comparing it to 0. So we should add a specialised version. What specialized version to add?
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239 --- Comment #4 from luoxhu at gcc dot gnu.org ---
Simply reordering the dot instruction within the two-insn sequence produces the expected code; is this correct?

foo:
.LFB0:
        .cfi_startproc
        rldicr. 3,3,29,1
        beq 0,.L2
#APP
 # 10 "pr102239.c" 1
        # if
 # 0 "" 2
#NO_APP
        blr
        .p2align 4,,15
.L2:
#APP
 # 12 "pr102239.c" 1
        # else
 # 0 "" 2
#NO_APP
        blr

git diff
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index c9ce0550df1..2f0b5992bbf 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -11749,9 +11749,9 @@ rs6000_emit_2insn_and (machine_mode mode, rtx *operands, bool expand, int dot)
     {
       rtx tmp = gen_rtx_ASHIFT (mode, operands[1], GEN_INT (shift));
       tmp = gen_rtx_AND (mode, tmp, GEN_INT (val << shift));
-      emit_move_insn (operands[0], tmp);
-      tmp = gen_rtx_LSHIFTRT (mode, operands[0], GEN_INT (shift));
       rs6000_emit_dot_insn (operands[0], tmp, dot, dot ? operands[3] : 0);
+      tmp = gen_rtx_LSHIFTRT (mode, operands[0], GEN_INT (shift));
+      emit_move_insn (operands[0], tmp);
     }
   return;
 }
[Bug target/102239] powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #2 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #1) > Confirmed. > > So the relevant insn > > (parallel [(set (reg:CC 123) > (compare:CC (and:DI (reg:DI 124) > (const_int 25769803776 [0x6])) > (const_int 0 [0]))) >(clobber (scratch:DI))]) > > is matched by *and3_2insn but not by any pattern that ends up as just > one insn. Not *and3_mask_dot, because that doesn't do a shift first, > is just an AND and there are no machine insns to do that; but there is no > pattern for what we can do. > > *rotl3_mask_dot cannot do this either; the base and the dot2 of that > cannot be done, they return a shifted result, but that doesn't matter for > comparing it to 0. So we should add a specialised version. Seems different with what you describe, in combine, it was combined to anddi3_2insn_dot: (insn 9 8 10 2 (parallel [ (set (reg:CC 122) (compare:CC (and:DI (reg:DI 123) (const_int 25769803776 [0x6])) (const_int 0 [0]))) (clobber (scratch:DI)) ]) "pr102239.c":3:6 210 {*anddi3_2insn_dot} (expr_list:REG_DEAD (reg:DI 123) (nil))) (jump_insn 10 9 11 2 (set (pc) (if_then_else (eq (reg:CC 122) (const_int 0 [0])) (label_ref 15) (pc))) "pr102239.c":3:6 868 {*cbranch} (expr_list:REG_DEAD (reg:CC 122) (int_list:REG_BR_PROB 536870916 (nil))) Then in pr102239.c.302r.split2, it is split by "*and3_2insn_dot" to rotldi3_mask+lshrdi3_dot: Splitting with gen_split_80 (rs6000.md:3721) (insn 32 8 33 2 (set (reg:DI 3 3 [124]) (and:DI (ashift:DI (reg:DI 3 3 [123]) (const_int 29 [0x1d])) (const_int -4611686018427387904 [0xc000]))) "pr102239.c":3:6 236 {*rotldi3_mask} (nil)) (insn 33 32 10 2 (parallel [ (set (reg:CC 100 0 [122]) (compare:CC (lshiftrt:DI (reg:DI 3 3 [124]) (const_int 29 [0x1d])) (const_int 0 [0]))) (clobber (reg:DI 3 3 [124])) ]) "pr102239.c":3:6 281 {*lshrdi3_dot} (nil)) Why this difference happens? 
0x6 is not a valid mask for anddi3_2insn_dot: "(mode == Pmode || UINTVAL (operands[2]) <= 0x7fff) && rs6000_is_valid_2insn_and (operands[2], mode) && !(rs6000_is_valid_and_mask (operands[2], mode) || logical_const_operand (operands[2], mode))" (gdb) p UINTVAL (operands[2]) <= 0x7fff $84 = false (gdb) p rs6000_is_valid_2insn_and (operands[2], E_DImode) $85 = true (gdb) p logical_const_operand (operands[2], E_DImode) $86 = false (gdb) p rs6000_is_valid_and_mask (operands[2], E_DImode) $87 = false (gdb) p Pmode $88 = DImode
[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270 --- Comment #5 from luoxhu at gcc dot gnu.org --- ;; Loop 0 ;; header 0, latch 1 ;; depth 0, outer -1 ;; nodes: 0 1 2 3 4 5 6 11 7 8 10 9 ;; ;; Loop 1 ;; header 8, latch 7 ;; depth 1, outer 0 ;; nodes: 8 7 6 10 5 4 11 3 ;; ;; Loop 2 ;; header 6, latch 5 ;; depth 2, outer 1 ;; nodes: 6 5 4 11 3 ;; ;; Loop 3 ;; header 4, latch 3 ;; depth 3, outer 2 ;; nodes: 4 3 ;; 2 succs { 8 } ;; 3 succs { 4 } ;; 4 succs { 3 5 } ;; 5 succs { 6 } ;; 6 succs { 11 7 } ;; 11 succs { 4 } ;; 7 succs { 8 } ;; 8 succs { 10 9 } ;; 10 succs { 6 } ;; 9 succs { 1 } The CFG is: 2 | 8< | \ | 10 9 | || 67 6< || 11 | || 4<- | | \| | 5 3 | || -- When iterating loop 3 in predict_extra_loop_exits, exit edge is 4->5, it finds edge 3->4 for statement "if (d_8 == 0)", and set all e->src->preds with "predict_paths_leading_to_edge (e1, PRED_LOOP_EXTRA_EXIT, NOT_TAKEN);". (gdb) pbb 3 ;; basic block 3, loop depth 3 ;; pred: 4 _1 = *i_19(D); _2 = a_4 & c_6; _3 = _1 + _2; *i_19(D) = _3; ;; succ: 4 (gdb) pbb 4 ;; basic block 4, loop depth 3 ;; pred: 11 ;; 3 # c_6 = PHI # d_8 = PHI <0(11), 1(3)> if (d_8 == 0) goto ; [INV] else goto ; [INV] ;; succ: 3 ;; 5 (gdb) p e->src->preds $16 = 0x74fba140 = { 3)>}
[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270 --- Comment #4 from luoxhu at gcc dot gnu.org --- Created attachment 51851 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51851=edit Fix incorrect loop exit edge probability
[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270 --- Comment #3 from luoxhu at gcc dot gnu.org ---
The profile count is correct, but the edge probability is wrong. It turns out that r12-4526 exposes a long-standing issue in profile_estimate:predict_extra_loop_exits: when searching for extra exit edges of the inner loop, it walks out of that loop and finds an edge that belongs to the *outer loop*, and predicts that edge at 33%; predict_loops then does not reset that edge for the outer loop. I drafted a patch that ignores EDGE_DFS_BACK edges when iterating in predict_extra_loop_exits, after which the inner loop becomes hot again.

diff base/pr103270.c.047t.profile_estimate patched/pr103270.c.047t.profile_estimate -U15 Predictions for bb 5 1 edges in bb 5 predicted to even probabilities Predictions for bb 6 - first match heuristics: 33.00% - combined heuristics: 33.00% + first match heuristics: 91.67% + combined heuristics: 91.67% opcode values nonequal (on trees) heuristics of edge 6->11 (ignored): 66.00% - extra loop exit heuristics of edge 6->11: 33.00% + loop iterations heuristics of edge 6->7: 8.33% Predictions for bb 11 1 edges in bb 11 predicted to even probabilities Predictions for bb 7 1 edges in bb 7 predicted to even probabilitie … - [local count: 88915474]: + [local count: 6029625]: goto ; [100.00%] - [local count: 354334800]: + [local count: 536870913]: _1 = *i_19(D); _2 = a_4 & c_6; _3 = _1 + _2; *i_19(D) = _3; - [local count: 708669601]: + [local count: 1073741824]: # c_6 = PHI # d_8 = PHI <0(11), 1(3)> if (d_8 == 0) goto ; [50.00%] else goto ; [50.00%] - [local count: 354334800]: + [local count: 536870913]: # c_21 = PHI b_18 = b_5 + -1; - [local count: 1073741824]: + [local count: 585656064]: # b_5 = PHI <0(10), b_18(5)> # c_7 = PHI <0(10), c_21(5)> if (b_5 != -11) -goto ; [33.00%] +goto ; [91.67%] else -goto ; [67.00%] +goto ; [8.33%] - [local count: 354334800]: + [local count: 536870913]: goto ; [100.00%] - [local count: 719407024]: + [local count: 48785151]: a_16 = a_4 + 1; -
[local count: 808322498]: + [local count: 54814777]: # a_4 = PHI if (a_4 <= 4) goto ; [89.00%] else goto ; [11.00%] - [local count: 719407024]: + [local count: 48785151]: goto ; [100.00%] - [local count: 88915474]: + [local count: 6029625]: return;
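The 8.33% and 91.67% figures in the patched dump are consistent with a loop that runs 11 times, under the assumption that the loop-iterations heuristic gives the exit edge a probability of 1/(niter + 1). The sketch below only illustrates that arithmetic; the function names are hypothetical and it is not GCC's own code.

```c
/* Probability of taking the exit edge for a loop whose latch is known
   to execute niter times, assuming the exit is predicted as
   1 / (niter + 1).  This formula is an assumption made for the sketch. */
double exit_probability(unsigned niter)
{
  return 1.0 / ((double) niter + 1.0);
}

/* Probability of staying in the loop on a given header test. */
double stay_probability(unsigned niter)
{
  return 1.0 - exit_probability(niter);
}
```

With `niter` set to 11 (b_5 counts from 0 down to -11), the exit probability is 1/12, about 8.33%, and the probability of staying in the loop is about 91.67%, the values shown for edges 6->7 and 6->11.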
[Bug testsuite/103270] [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270 --- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> So you say this is a problem with loop header copying, that would mean the
> issue is really latent and general, no? Header copying uses
> gimple_duplicate_sese_region and has no own profile updating. I guess its
> profile updating code isn't designed to cope with copying a region with
> "side"-entries (we are ignoring the backedge here). Not sure if we can
> somehow generally handle those (maybe we can learn from tracer or threader
> here).
>
> Honza?

Yes, it seems to be a general issue in gimple_duplicate_sese_region. The inner loop CFG was:

8
|
3<--
| \ |
5 4

It is modified by ch_base::copy_headers->gimple_duplicate_sese_region (entry edge 8->3, exit edge 3->4) to:

8
|
12
|
4<--
| |
3---
|
5

bb 12 is the copy of bb 3 and becomes the new preheader; bb 3 is rotated to become the new exit of the loop. The counts of bb 3 and bb 12 are adjusted to "total_count - entry_count" (354334800) and "entry_count" (719407024). Finally, bb 3 and bb 4 are merged into one block by gimple_merge_blocks during TODO_cleanup_cfg, and that block has a much smaller count than the preheader.

gimple_duplicate_sese_region:

  if (total_count.initialized_p () && entry_count.initialized_p ())
    {
      scale_bbs_frequencies_profile_count (region, n_region,
                                           total_count - entry_count,
                                           total_count);
      scale_bbs_frequencies_profile_count (region_copy, n_region, entry_count,
                                           total_count);
    }

Obviously, the profile count of the region containing bb 3 shouldn't be decreased from "total_count" to "total_count - entry_count", since it executes on every iteration of the loop. Simply adjusting it back to total_count, and region_copy to entry_count, causes some other cases to fail. And at this point edge 3->4 is still not a backedge?
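The two counts quoted above follow directly from the scaling step. The sketch below models it with plain 64-bit integers; the struct and function names are hypothetical, and `scale_count` is a simplified stand-in for the real profile count scaling, which also tracks count quality.

```c
#include <stdint.h>

/* Scale count by num/den.  Simplified: assumes count * num fits in
   64 bits, which holds for the values discussed here. */
static int64_t scale_count(int64_t count, int64_t num, int64_t den)
{
  return den ? count * num / den : count;
}

typedef struct
{
  int64_t original; /* old header, kept inside the loop (bb 3) */
  int64_t copy;     /* copied header, the new preheader (bb 12) */
} header_counts;

/* Split a header whose count is total_count between the original block
   and its copy, the way the quoted code scales region and region_copy. */
header_counts split_header_count(int64_t total_count, int64_t entry_count)
{
  header_counts r;
  r.original = scale_count(total_count, total_count - entry_count, total_count);
  r.copy = scale_count(total_count, entry_count, total_count);
  return r;
}
```

Calling `split_header_count(1073741824, 719407024)` gives 354334800 for the original header and 719407024 for the copy, so the block that stays in the loop ends up colder than the preheader, which is the inversion the report describes.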
[Bug testsuite/103270] New: [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103270 Bug ID: 103270 Summary: [12 regression] gcc.dg/vect/pr96698.c inner loop turned from hot to cold after r12-4526 Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: testsuite Assignee: unassigned at gcc dot gnu.org Reporter: luoxhu at gcc dot gnu.org Target Milestone: --- For the testcase gcc.dg/vect/pr96698.c, the inner loop was hot (preheader count < loop count), but it is NOT now after r12-4526, bb 3's profile count 354334801 is only 1/2 of the preheader bb 5's profile count 719407024. But I guess it should be fixed in tree-ssa-loop-ch.c when copy_headers, there are profile count update there, this case should be handled specially when the single exit loop only has two bbs and the old header is new exit->src, no need to scale down the old header profile count to preserve the hotness of the loop. pr96698.c.138t.lim2: void test (int a, int * i) { int i__lsm.5; int c; int b; int _22; int _23; int _24; [local count: 88915474]: if (a_12(D) <= 4) goto ; [89.00%] else goto ; [11.00%] [local count: 79134772]: i__lsm.5_11 = *i_16(D); goto ; [100.00%] [local count: 116930484]: [local count: 354334801]: # b_3 = PHI # c_17 = PHI # i__lsm.5_20 = PHI _22 = i__lsm.5_20; _23 = a_2 & c_17; _24 = _22 + _23; i__lsm.5_4 = _24; b_15 = b_3 + -1; if (b_15 != -11) goto ; [33.00%] else goto ; [67.00%] [local count: 719407024]: # i__lsm.5_7 = PHI a_14 = a_2 + 1; if (a_14 <= 4) goto ; [89.00%] else goto ; [11.00%] [local count: 640272252]: [local count: 719407024]: # a_2 = PHI # i__lsm.5_1 = PHI goto ; [100.00%] [local count: 79134772]: # i__lsm.5_5 = PHI *i_16(D) = i__lsm.5_5; [local count: 88915474]: return; }
[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #9 from luoxhu at gcc dot gnu.org --- Fixed and backported to gcc-11.
[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991 --- Comment #7 from luoxhu at gcc dot gnu.org --- Fixed, will backport to gcc-11 in a week.
[Bug tree-optimization/103029] [12 regression] gcc.dg/vect/pr82436.c ICEs on r12-4818
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103029 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||ro at gcc dot gnu.org --- Comment #8 from luoxhu at gcc dot gnu.org --- *** Bug 103041 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/103041] [12 regression] gcc.dg/vect/slp-reduc-10a.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103041 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #5 from luoxhu at gcc dot gnu.org --- duplicate and fixed. *** This bug has been marked as a duplicate of bug 103029 ***
[Bug tree-optimization/103041] [12 regression] gcc.dg/vect/slp-reduc-10a.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103041 --- Comment #1 from luoxhu at gcc dot gnu.org ---
Could you please verify whether this is caused by r12-4818 rather than r12-4819? r12-4819 is an NFC patch, which makes it an unlikely cause, and r12-4818 also ICEs in PR103029, so this is possibly a duplicate of that bug.

commit f35af8df241a9eb9c2edf7da26d3c5f53d6e2511
Author: Xionghu Luo
Date: Mon Nov 1 00:12:36 2021 -0500

    Refactor loop_version
[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991 --- Comment #5 from luoxhu at gcc dot gnu.org --- P9: .L149: lxvx %vs32,%r8,%r10 vadduwm %v12,%v12,%v1 mfvsrd %r5,%vs43 mfvsrld %r4,%vs43 vadduwm %v11,%v11,%v9 stxv %vs44,112(%r1) xxperm %vs32,%vs32,%vs42 vcmpequw %v13,%v0,%v1 vadduwm %v0,%v1,%v0 xxlandc %vs45,%vs33,%vs45 // here. xxperm %vs32,%vs32,%vs42 xxlor %vs0,%vs0,%vs45 stxvx %vs32,%r8,%r10 stxv %vs0,128(%r1) addi %r8,%r8,-16 bdnz .L149 $vs43 is not changed by xxlandc
[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991 --- Comment #4 from luoxhu at gcc dot gnu.org --- vect-simd-17.p10.c.335r.final: 3379: %v1:V16QI=unspec[%v1:V16QI,%v1:V16QI,%v9:V16QI] 254 3372: {%v11:V4SI=~%v0:V4SI&%v13:V4SI|%v11:V4SI;clobber %r10:V4SI;} // wrong code. REG_DEAD %v0:V4SI REG_UNUSED %r10:V4SI 3373: [%r1:DI+0x80]=%v11:V4SI ASM: .L149: lxvx %vs32,%r9,%r8 vadduwm %v12,%v12,%v13 mfvsrd %r5,%vs42 mfvsrld %r4,%vs42 vadduwm %v10,%v10,%v8 stxv %vs44,112(%r1) xxperm %vs32,%vs32,%vs41 vadduwm %v1,%v13,%v0 vcmpequw %v0,%v0,%v13 xxperm %vs33,%vs33,%vs41 vandc %r10,%v13,%v0 // wrong code vor %v11,%r10,%v11// wrong code stxv %vs43,128(%r1) stxvx %vs33,%r9,%r8 addi %r8,%r8,-16 bdnz .L149 But the binary is (/opt/binutils-power10/bin/objdump -d vect-simd-17.p10 | less): 10002ea0: 19 42 09 7c lxvxvs32,r9,r8 10002ea4: 80 68 8c 11 vadduwm v12,v12,v13 10002ea8: 67 00 45 7d mfvrd r5,v10 10002eac: 67 02 44 7d mfvsrld r4,vs42 10002eb0: 80 40 4a 11 vadduwm v10,v10,v8 10002eb4: 7d 00 81 f5 stxvvs44,112(r1) 10002eb8: d7 48 00 f0 xxperm vs32,vs32,vs41 10002ebc: 80 00 2d 10 vadduwm v1,v13,v0 10002ec0: 86 68 00 10 vcmpequw v0,v0,v13 10002ec4: d7 48 21 f0 xxperm vs33,vs33,vs41 10002ec8: 44 04 4d 11 vandc v10,v13,v0// wrong code 10002ecc: 84 5c 6a 11 vor v11,v10,v11 // wrong code 10002ed0: 8d 00 61 f5 stxvvs43,128(r1) 10002ed4: 19 43 29 7c stxvx vs33,r9,r8 10002ed8: f0 ff 08 39 addir8,r8,-16 10002edc: c4 ff 00 42 bdnz10002ea0 %vs42 is a global constant data loaded from memory, it was modified at address 0x10002ec8, there r10 is changed to v10 from ASM to binary, which was supposed to be never change in the loop. 
(gdb) 0x10002eb4 : 7d 00 81 f5 stxvvs44,112(r1) 0x10002eb8 : d7 48 00 f0 xxperm vs32,vs32,vs41 0x10002ebc : 80 00 2d 10 vadduwm v1,v13,v0 0x10002ec0 : 86 68 00 10 vcmpequw v0,v0,v13 0x10002ec4 : d7 48 21 f0 xxperm vs33,vs33,vs41 => 0x10002ec8 : 44 04 4d 11 vandc v10,v13,v0 0x10002ecc : 84 5c 6a 11 vor v11,v10,v11 0x10002ed0 : 8d 00 61 f5 stxvvs43,128(r1) 7: $vs42.v4_int32 = {-30, -29, -28, -27} (gdb) si 0x10002eb4 : 7d 00 81 f5 stxvvs44,112(r1) 0x10002eb8 : d7 48 00 f0 xxperm vs32,vs32,vs41 0x10002ebc : 80 00 2d 10 vadduwm v1,v13,v0 0x10002ec0 : 86 68 00 10 vcmpequw v0,v0,v13 0x10002ec4 : d7 48 21 f0 xxperm vs33,vs33,vs41 0x10002ec8 : 44 04 4d 11 vandc v10,v13,v0 => 0x10002ecc : 84 5c 6a 11 vor v11,v10,v11 0x10002ed0 : 8d 00 61 f5 stxvvs43,128(r1) 7: $vs42.v4_int32 = {0, 0, 0, 0}
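Why `%r10` in the compiler output turns into `v10` in the binary can be seen by decoding the instruction bytes from the objdump listing. The sketch below assumes the standard VX-form layout (6-bit primary opcode, three 5-bit register fields, 11-bit extended opcode); the type and function names are hypothetical. The register fields hold only a number, so the assembler encodes "10" regardless of the prefix, and in a vector instruction that number names v10, which overlays vs42.

```c
#include <stdint.h>

typedef struct
{
  unsigned opcode; /* primary opcode, bits 0..5 */
  unsigned vrt;    /* target vector register */
  unsigned vra;    /* first source register */
  unsigned vrb;    /* second source register */
  unsigned xo;     /* extended opcode, low 11 bits */
} vx_fields;

/* Build the 32-bit instruction word from four bytes in memory order,
   as objdump prints them on a little-endian target. */
uint32_t word_from_le_bytes(const uint8_t b[4])
{
  return (uint32_t) b[0] | ((uint32_t) b[1] << 8)
         | ((uint32_t) b[2] << 16) | ((uint32_t) b[3] << 24);
}

/* Split a VX-form instruction word into its fields. */
vx_fields decode_vx(uint32_t insn)
{
  vx_fields f;
  f.opcode = insn >> 26;
  f.vrt = (insn >> 21) & 31;
  f.vra = (insn >> 16) & 31;
  f.vrb = (insn >> 11) & 31;
  f.xo = insn & 0x7ff;
  return f;
}
```

Decoding the bytes `44 04 4d 11` from address 0x10002ec8 gives primary opcode 4, VRT 10, VRA 13, VRB 0 and extended opcode 0x444, that is `vandc v10,v13,v0`: the scratch register that the RTL calls `%r10` lands on v10 and clobbers the loop-invariant value in vs42.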
[Bug tree-optimization/103029] [12 regression] gcc.dg/vect/pr82436.c ICEs on r12-4818
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103029 --- Comment #3 from luoxhu at gcc dot gnu.org --- This hack could restore the previous phi order to put nondfs phi args before dfs_edge args. But I am not sure whether this is the correct direction. At least it proves that the phi order matters for later vectorizer code. diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c index 455c3ef8db9..2ca256c15fa 100644 --- a/gcc/cfgloopmanip.c +++ b/gcc/cfgloopmanip.c @@ -31,6 +31,7 @@ along with GCC; see the file COPYING3. If not see #include "gimplify-me.h" #include "tree-ssa-loop-manip.h" #include "dumpfile.h" +#include "ssa.h" static void copy_loops_to (class loop **, int, class loop *); @@ -1577,6 +1578,41 @@ lv_adjust_loop_entry_edge (basic_block first_head, basic_block second_head, e1->probability = then_prob; e->probability = else_prob; + edge le, dfs = NULL, nondfs = NULL; + edge_iterator ei; + + if (EDGE_COUNT (e1->dest->preds) > 1) + { +FOR_EACH_EDGE (le, ei, e1->dest->preds) + { + if (le->flags & EDGE_DFS_BACK) + dfs = le; + else + nondfs = le; + } +if (dfs && nondfs && dfs->dest_idx < nondfs->dest_idx) + { + gphi_iterator psi; + gphi *phi; + tree dfsdef, nondfsdef; + for (psi = gsi_start_phis (e1->dest); !gsi_end_p (psi); gsi_next ()) + { + phi = psi.phi (); + dfsdef = PHI_ARG_DEF (phi, dfs->dest_idx); + nondfsdef = PHI_ARG_DEF (phi, nondfs->dest_idx); + SET_PHI_ARG_DEF (phi, dfs->dest_idx, nondfsdef); + SET_PHI_ARG_DEF (phi, nondfs->dest_idx, dfsdef); + } + + EDGE_PRED (e1->dest, dfs->dest_idx) = nondfs; + EDGE_PRED (e1->dest, nondfs->dest_idx) = dfs; + + unsigned int temp = nondfs->dest_idx; + nondfs->dest_idx = dfs->dest_idx; + dfs->dest_idx = temp; + } + } +
[Bug tree-optimization/103029] [12 regression] gcc.dg/vect/pr82436.c ICEs on r12-4818
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103029 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org, ||rguenther at suse dot de --- Comment #2 from luoxhu at gcc dot gnu.org --- Confirmed. P7's extra option -mno-allow-movmisalign makes this ICE happens. If add this option on P9 also ICEs. Reason is the phi arguments order changes if switch the sequence of loopify and lv_adjust_loop_entry_edge. the constant input argument from bb 18 is in phi index 1 now makes the followed vectorize code fail to handle? if (_42 != 0) goto ; [80.00%] else goto ; [20.00%] [local count: 67276368]: [local count: 611603351]: # i_76 = PHI // here # y_lsm.6_74 = PHI <_61(10), 0.0(18)> // here # w_lsm.7_73 = PHI <_58(10), 0.0(18)> // here i.0_72 = (unsigned int) i_76; _70 = (long unsigned int) i.0_72; _69 = _70 * 80; x_68 = r_22(D) + _69; fpred_67 = x_68->f_pred; fexp_66 = x_68->f_exp; tem_65 = fpred_67 - fexp_66; _64 = x_68->f_sigma; _63 = tem_65 / _64; _62 = ABS_EXPR <_63>; _61 = _62 + y_lsm.6_74; _60 = tem_65 / fexp_66; _59 = ABS_EXPR <_60>; _58 = _59 + w_lsm.7_73; i_57 = i_76 + 1; if (n_19(D) > i_57) goto ; [89.00%] else goto ; [11.00%] [local count: 544326983]: goto ; [100.00%] It was: if (_42 != 0) goto ; [80.00%] else goto ; [20.00%] [local count: 67276368]: [local count: 611603351]: # i_76 = PHI <1(18), i_57(10)> # y_lsm.6_74 = PHI <0.0(18), _61(10)> # w_lsm.7_73 = PHI <0.0(18), _58(10)> i.0_72 = (unsigned int) i_76; _70 = (long unsigned int) i.0_72; _69 = _70 * 80; x_68 = r_22(D) + _69; fpred_67 = x_68->f_pred; fexp_66 = x_68->f_exp; tem_65 = fpred_67 - fexp_66; _64 = x_68->f_sigma; _63 = tem_65 / _64; _62 = ABS_EXPR <_63>; _61 = _62 + y_lsm.6_74; _60 = tem_65 / fexp_66; _59 = ABS_EXPR <_60>; _58 = _59 + w_lsm.7_73; i_57 = i_76 + 1; if (n_19(D) > i_57) goto ; [89.00%] else goto ; [11.00%] [local count: 544326983]: goto ; [100.00%] The comments in function gimple_lv_adjust_loop_header_phi says /* Browse all 'second' basic block phi nodes and add 
phi args to edge 'e' for 'first' head. PHI args are always in correct order. */ Any function to fix the phi order?
[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991 --- Comment #3 from luoxhu at gcc dot gnu.org ---
(In reply to Kewen Lin from comment #2)
> (In reply to luoxhu from comment #1)
> > Couldn't reproduce on rain6p1 (P10):
> >
>
> It's weird, I can reproduce this on rain6p1.
>
> FAIL: gcc.dg/vect/vect-simd-17.c execution test
> FAIL: gcc.dg/vect/vect-simd-17.c -flto -ffat-lto-objects execution test
>
>                 === gcc Summary ===
>
> # of expected passes            2
> # of unexpected failures        2
>
> Probably due to you still specified --with-cpu=power9 instead of
> --with-cpu=power10 in gcc configuration?

Thanks, confirmed. With --with-cpu=power9 the test does not fail on either P9 or P10 with the patch. The failing run aborts at line 274 of vect-simd-17.c.
[Bug target/102991] [12 regression] gcc.dg/vect/vect-simd-17.c fails after r12-4757
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102991 --- Comment #1 from luoxhu at gcc dot gnu.org --- Couldn't reproduce on rain6p1 (P10): Test run by luoxhu on Fri Oct 29 04:08:49 2021 Native configuration is powerpc64le-unknown-linux-gnu === gcc tests === Schedule of variations: unix Running target unix Running /home/luoxhu/workspace/gcc/gcc/testsuite/gcc.dg/vect/vect.exp ... PASS: gcc.dg/vect/vect-simd-17.c (test for excess errors) PASS: gcc.dg/vect/vect-simd-17.c execution test PASS: gcc.dg/vect/vect-simd-17.c -flto -ffat-lto-objects (test for excess errors) PASS: gcc.dg/vect/vect-simd-17.c -flto -ffat-lto-objects execution test === gcc Summary === # of expected passes4 /home/luoxhu/workspace/build/gcc/xgcc version 12.0.0 20211029 (experimental) (GCC)
[Bug target/102868] Missed optimization with __builtin_shuffle and zero vector on ppc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102868 luoxhu at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from luoxhu at gcc dot gnu.org --- Fixed on master.
[Bug target/94613] S/390, powerpc: Wrong code generated for vec_sel builtin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94613 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #17 from luoxhu at gcc dot gnu.org --- Fixed on master.
[Bug target/102868] Missed optimization with __builtin_shuffle and zero vector on ppc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102868 --- Comment #1 from luoxhu at gcc dot gnu.org --- Patch submitted: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/582452.html
[Bug target/102868] New: Missed optimization with __builtin_shuffle and zero vector on ppc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102868
Bug ID: 102868
Summary: Missed optimization with __builtin_shuffle and zero vector on ppc
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: luoxhu at gcc dot gnu.org
Target Milestone: ---

Similar to PR94680 and PR100165, PPC currently generates inefficient instructions for the case below:

typedef float V __attribute__((vector_size(16)));
typedef int VI __attribute__((vector_size(16)));

V foo (V x)
{
  return __builtin_shuffle (x, (V) { 0, 0, 0, 0 }, (VI) {0, 1, 4, 5});
}

foo:
.LFB0:
        .cfi_startproc
.LCF0:
0:      addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        .localentry foo,.-foo
        addis %r9,%r2,.LC0@toc@ha
        xxspltib %vs32,0
        addi %r9,%r9,.LC0@toc@l
        lxv %vs33,0(%r9)
        xxperm %vs34,%vs32,%vs33
        blr

It would be better to produce:

foo:
.LFB0:
        .cfi_startproc
        vspltisw %v0,0
        xxpermdi %vs34,%vs32,%vs34,3
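What the shuffle in this test case computes can be written out as a portable scalar model. This is a sketch with hypothetical function names, not the vector extension itself; it assumes the documented two-input behaviour in which indices are taken modulo twice the lane count, with 0..3 selecting from the first vector and 4..7 from the second.

```c
/* Scalar model of a two-input, four-lane shuffle: index k in 0..3
   selects a[k], index k in 4..7 selects b[k - 4]. */
void shuffle2_4(const float a[4], const float b[4], const int idx[4],
                float out[4])
{
  for (int i = 0; i < 4; i++)
    {
      int k = idx[i] & 7; /* indices are taken modulo 2 * 4 */
      out[i] = k < 4 ? a[k] : b[k - 4];
    }
}

/* The case from the report: shuffle x with a zero vector using
   the selector {0, 1, 4, 5}. */
void foo_model(const float x[4], float out[4])
{
  static const float zero[4] = { 0, 0, 0, 0 };
  static const int sel[4] = { 0, 1, 4, 5 };
  shuffle2_4(x, zero, sel, out);
}
```

The result keeps the first two lanes of `x` and zero-fills the other two, so each 64-bit half comes whole from one input. That is the kind of pattern a doubleword permute can handle without loading a permute control vector from memory.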
[Bug target/97142] __builtin_fmod not optimized on POWER
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #22 from luoxhu at gcc dot gnu.org --- Fixed on master and backported to gcc-11 and gcc-10.
[Bug tree-optimization/102075] fill_always_executed_in_1 incomplete computation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102075 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #1 from luoxhu at gcc dot gnu.org --- Fixed by Richard’s r12-3313, r12-3429 and r12-3430.
[Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178 --- Comment #2 from luoxhu at gcc dot gnu.org --- Verified that 470.lbm shows no regression on Power8 with -Ofast: the runtime is 141 sec with r12-897 and 142 sec without that patch.
[Bug rtl-optimization/102008] [12 Regression] no cmov generated for loads next to each other
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102008 --- Comment #3 from luoxhu at gcc dot gnu.org --- phiopt4 and sink2 are doing reverse optimizations: pr102008.c.200t.phiopt4: Hoisting adjacent loads from 3 and 4 into 2: _6 = foo_4(D)->a; _5 = foo_4(D)->b; pr102008.c.202t.sink2: Sinking _5 = foo_4(D)->b; from bb 2 to bb 4 Sinking _6 = foo_4(D)->a; from bb 2 to bb 3
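The two passes undo each other on code of roughly the following shape. This is a hypothetical reduction written to match the dump and the `ldp`/`csel` sequence in comment #2, not the test case attached to the bug; the struct and function names are assumptions.

```c
struct Foo
{
  int a;
  int b;
};

/* Branchy form: each arm loads one field.  This is the shape sink2
   recreates by moving the loads back into bb 3 and bb 4. */
int pick_branchy(int side, const struct Foo *foo)
{
  if (side == 1)
    return foo->a;
  return foo->b;
}

/* Hoisted form: both adjacent loads happen first, then a select.
   This is the shape phiopt4 produces and the one that maps onto a
   paired load followed by a conditional select. */
int pick_hoisted(int side, const struct Foo *foo)
{
  int a = foo->a;
  int b = foo->b;
  return side == 1 ? a : b;
}
```

Both functions return the same value for every input; the only difference is where the loads sit, which is what decides whether a conditional move can be generated later.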
[Bug rtl-optimization/102008] [12 Regression] no cmov generated for loads next to each other
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102008 --- Comment #2 from luoxhu at gcc dot gnu.org ---
Confirmed: moving the sink2 pass before phiopt4 restores the previous instructions for this case:

test:
.LFB0:
        .cfi_startproc
        cmp w0, 1
        ldp w0, w1, [x1]
        csel w0, w1, w0, ne
        ret
        .cfi_endproc

diff --git a/gcc/passes.def b/gcc/passes.def
index 945d2bc797c..83b8310f1ee 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -345,10 +345,10 @@ along with GCC; see the file COPYING3. If not see
       /* After late CD DCE we rewrite no longer addressed locals into SSA
          form if possible. */
       NEXT_PASS (pass_forwprop);
+      NEXT_PASS (pass_sink_code);
       NEXT_PASS (pass_phiopt, false /* early_p */);
       NEXT_PASS (pass_fold_builtins);
       NEXT_PASS (pass_optimize_widening_mul);
-      NEXT_PASS (pass_sink_code);
       NEXT_PASS (pass_store_merging);
       NEXT_PASS (pass_tail_calls);

ls *sink*
pr102008.c.139t.sink1 pr102008.c.199t.sink2
ls *phiopt*
pr102008.c.042t.phiopt1 pr102008.c.119t.phiopt2 pr102008.c.131t.phiopt3 pr102008.c.200t.phiopt4
[Bug target/97142] __builtin_fmod not optimized on POWER
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142 --- Comment #15 from luoxhu at gcc dot gnu.org --- Patch updated: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578740.html
[Bug middle-end/102075] New: fill_always_executed_in_1 incomplete computation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102075
Bug ID: 102075
Summary: fill_always_executed_in_1 incomplete computation
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: luoxhu at gcc dot gnu.org
Target Milestone: ---

ALWAYS_EXECUTED_IN is not computed completely for nested loops. The current design stops if an inner loop doesn't dominate the outer loop's latch or exit after leaving the inner loop. This causes an early return from the outer loop, so the ALWAYS EXECUTED blocks after the inner loop are skipped. For example, x->k should be moved out of the outer loop but is not.

struct X { int i; int j; int k;};

void foo(struct X *x, int n, int l)
{
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
        {
          int *p = &x->j;
          int tem = *p;
          x->j += tem * i;
        }
      int *r = &x->k;
      int tem2 = *r;
      x->k += tem2 * j;
    }
}

Discussion lists: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577444.html
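What moving x->k out of the outer loop would amount to can be written by hand. The sketch below repeats the test case (with the address-of expressions written as `&x->j` and `&x->k`) next to a hypothetical hand-transformed version, `foo_k_in_register`, that keeps x->k in a local across the outer loop. It illustrates the expected effect only and is not the output of any GCC pass.

```c
struct X
{
  int i;
  int j;
  int k;
};

/* The test case from the report. */
void foo(struct X *x, int n, int l)
{
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
        {
          int *p = &x->j;
          int tem = *p;
          x->j += tem * i;
        }
      int *r = &x->k;
      int tem2 = *r;
      x->k += tem2 * j;
    }
}

/* Hand-written equivalent with x->k loaded once before the outer loop
   and stored once after it.  Nothing inside the loops touches x->k
   other than this update, so the value can live in a local. */
void foo_k_in_register(struct X *x, int n, int l)
{
  int k = x->k;
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
        {
          int tem = x->j;
          x->j += tem * i;
        }
      k += k * j;
    }
  x->k = k;
}
```

Both versions leave the same values in `*x` for small inputs such as `n = 3`, `l = 3` starting from `{0, 1, 1}`; larger inputs overflow `int` quickly because x->j grows by a factor of n! per outer iteration.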
[Bug tree-optimization/101250] adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101250 --- Comment #1 from luoxhu at gcc dot gnu.org --- Patch posted: [PATCH] ivopts: Don't adjust IV update statement if both operands use the IV in COND [PR101250] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573894.html
[Bug middle-end/101250] New: adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101250 Bug ID: 101250 Summary: adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: luoxhu at gcc dot gnu.org Target Milestone: ---

Test case:

unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
{
  unsigned int len = 2;
  do
    {
      len++;
    }
  while (len < maxlen && ip[len] == ref[len]);
  return len;
}

ivopts:

[local count: 1014686026]:
_3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1];
ivtmp.16_16 = ivtmp.16_15 + 1;
_19 = ref_12(D) + 18446744073709551615;
_6 = MEM[(unsigned char *)_19 + ivtmp.16_16 * 1];
if (_3 == _6) goto ; [94.50%] else goto ; [5.50%]

Disabling adjust_iv_update_pos produces:

[local count: 1014686026]:
_3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1];
_6 = MEM[(unsigned char *)ref_12(D) + ivtmp.16_15 * 1];
ivtmp.16_16 = ivtmp.16_15 + 1;
if (_3 == _6) goto ; [94.50%] else goto ; [5.50%]

Discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573709.html
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #13 from luoxhu at gcc dot gnu.org --- It is not visible in combine because the constant data is in *.LC0 and the permute is wrapped in UNSPEC_VPERM. I will shelve this and switch to other high-priority issues.

pr100866.c.277r.combine:

(note 4 0 20 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 20 4 2 2 (set (reg:V8HI 126) (reg:V8HI 66 %v2 [ a ])) "pr100866.c":18:1 1132 {vsx_movv8hi_64bit} (expr_list:REG_DEAD (reg:V8HI 66 %v2 [ a ]) (nil)))
(note 2 20 3 2 NOTE_INSN_DELETED)
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 18 2 (set (reg/f:DI 122) (unspec:DI [ (symbol_ref/u:DI ("*.LC0") [flags 0x82]) (reg:DI 2 %r2) ] UNSPEC_TOCREL)) "pr100866.c":19:13 719 {*tocrefdi} (expr_list:REG_EQUAL (symbol_ref/u:DI ("*.LC0") [flags 0x82]) (nil)))
(insn 18 6 9 2 (set (reg:V16QI 123) (mem/u/c:V16QI (and:DI (reg/f:DI 122) (const_int -16 [0xfff0])) [0 S16 A128])) "pr100866.c":19:13 1131 {vsx_movv16qi_64bit} (expr_list:REG_DEAD (reg/f:DI 122) (nil)))
(insn 9 18 10 2 (set (reg:V16QI 124) (not:V16QI (reg:V16QI 123))) "pr100866.c":19:13 508 {one_cmplv16qi2} (expr_list:REG_DEAD (reg:V16QI 123) (nil)))
(note 10 9 15 2 NOTE_INSN_DELETED)
(insn 15 10 16 2 (set (reg/i:V8HI 66 %v2) (unspec:V8HI [ (reg:V8HI 126) repeated x2 (reg:V16QI 124) ] UNSPEC_VPERM)) "pr100866.c":20:1 1830 {altivec_vperm_v8hi_direct} (expr_list:REG_DEAD (reg:V16QI 124) (expr_list:REG_DEAD (reg:V8HI 126) (nil
(insn 16 15 0 2 (use (reg/i:V8HI 66 %v2)) "pr100866.c":20:1 -1 (nil))

;; Combiner totals: 12 attempts, 12 substitutions (2 requiring new space),
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #8 from luoxhu at gcc dot gnu.org --- (In reply to Jens Seifert from comment #7)
> Regarding vec_revb for vector unsigned int. I agree that
> revb:
> .LFB0:
> .cfi_startproc
> vspltish %v1,8
> vspltisw %v0,-16
> vrlh %v2,%v2,%v1
> vrlw %v2,%v2,%v0
> blr
>
> works. But in this case, I would prefer the vperm approach assuming that the
> loaded constant for the permute vector can be re-used multiple times.
> But please get rid of the xxlnor 32,32,32. That does not make sense after
> loading a constant. Change the constant that need to be loaded.

xxlnor is an LE-specific requirement (it is not emitted when building with -mbig). We need to turn the index {0,1,2,3} into {31,30,29,28} for vperm; without it the result is incorrect:

6|0x1630 <+16>:lvx v0,0,r9
7+> 0x1634 <+20>:xxlnor vs32,vs32,vs32
8|0x1638 <+24>:vperm v2,v2,v2,v0
9|0x163c <+28>:blr

(gdb) 0x1634 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xc0d0e0f08090a0b0405060700010203
(gdb) si
0x1638 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc
(gdb) si
0x163c in revb ()
2: /x $vs34.uint128 = 0x78563442785634327856342278563412
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc

Quoted from the ISA:

vperm VRT,VRA,VRB,VRC
vsrc.qword[0] ← VSR[VRA+32]
vsrc.qword[1] ← VSR[VRB+32]
do i = 0 to 15
  index ← VSR[VRC+32].byte[i].bit[3:7]
  VSR[VRT+32].byte[i] ← src.byte[index]
end

Let the source vector be the concatenation of the contents of VSR[VRA+32] followed by the contents of VSR[VRB+32]. For each integer value i from 0 to 15, do the following. Let index be the value specified by bits 3:7 of byte element i of VSR[VRC+32]. The contents of byte element index of src are placed into byte element i of VSR[VRT+32].
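The effect of the xxlnor on the selector can be checked with a small scalar model of the vperm pseudo-code quoted above. This is only a sketch: vperm_model and complement_selector are hypothetical helper names, array index 0 stands for the ISA's byte element 0, and nothing here reflects how GCC itself represents the permute. Because vperm only looks at the low five bits of each selector byte, complementing a selector byte k gives an effective index of 31 - k, which is the {0,1,2,3} to {31,30,29,28} mapping.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of vperm: byte i of the result is byte (vrc[i] & 0x1f)
   of the 32-byte concatenation vra || vrb.  */
void vperm_model (uint8_t vrt[16], const uint8_t vra[16],
                  const uint8_t vrb[16], const uint8_t vrc[16])
{
  uint8_t src[32];
  memcpy (src, vra, 16);
  memcpy (src + 16, vrb, 16);
  for (int i = 0; i < 16; i++)
    vrt[i] = src[vrc[i] & 0x1f];
}

/* What xxlnor vs32,vs32,vs32 does to the selector: a bitwise NOT of
   every byte.  */
void complement_selector (uint8_t sel[16])
{
  for (int i = 0; i < 16; i++)
    sel[i] = (uint8_t) ~sel[i];
}
```

With a selector of {0,1,...,15}, the complemented bytes are 0xff, 0xfe, ..., 0xf0, matching the 0x...fffefdfc pattern in the gdb output, and their low five bits are 31 down to 16.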
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #6 from luoxhu at gcc dot gnu.org --- For V4SI, it is also better to use vector splat and vector rotate operations.

revb:
.LFB0:
        .cfi_startproc
        vspltish %v1,8
        vspltisw %v0,-16
        vrlh %v2,%v2,%v1
        vrlw %v2,%v2,%v0
        blr

In a small benchmark, run time improved from 7.322s to 2.445s because the load instruction is replaced. But for V2DI, we don't have "vspltisd" to splat {32,32} into a vector register before Power9, so lvx is still required?

vector unsigned long long revb_pwr7_l(vector unsigned long long a)
{
  return vec_rl(a, vec_splats((unsigned long long)32));
}

generates:

revb_pwr7_l:
.LFB1:
        .cfi_startproc
.LCF1:
0:      addis 2,12,.TOC.-.LCF1@ha
        addi 2,2,.TOC.-.LCF1@l
        .localentry revb_pwr7_l,.-revb_pwr7_l
        addis %r9,%r2,.LC0@toc@ha
        addi %r9,%r9,.LC0@toc@l
        lvx %v0,0,%r9
        vrld %v2,%v2,%v0
        blr
.LC0:
        .quad 32
        .quad 32
        .align 4
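A scalar model of one 32-bit lane may help show why the vspltish 8 / vspltisw -16 / vrlh / vrlw sequence is a byte reverse. It is a sketch only: rotl16, rotl32 and revb_word_model are hypothetical names, and the model assumes the rotate instructions use just the low 4 (halfword) or 5 (word) bits of the count, which is why splatting -16 acts as a rotate by 16.

```c
#include <stdint.h>

/* Per-halfword rotate left, as vrlh does: only the low 4 bits of the
   count are used.  */
static uint16_t rotl16 (uint16_t v, unsigned n)
{
  n &= 15;
  return (uint16_t) ((v << n) | (v >> ((16 - n) & 15)));
}

/* Per-word rotate left, as vrlw does: only the low 5 bits of the
   count are used.  */
static uint32_t rotl32 (uint32_t v, unsigned n)
{
  n &= 31;
  return (v << n) | (v >> ((32 - n) & 31));
}

/* One 32-bit lane of the sequence: rotate each halfword by 8 to swap
   its two bytes, then rotate the word by 16 to swap the halfwords.  */
uint32_t revb_word_model (uint32_t w)
{
  uint16_t hi = rotl16 ((uint16_t) (w >> 16), 8);
  uint16_t lo = rotl16 ((uint16_t) w, 8);
  uint32_t swapped = ((uint32_t) hi << 16) | lo;
  return rotl32 (swapped, (unsigned) -16);   /* -16 & 31 == 16 */
}
```

For the lane value 0x12345678 the halfword step gives 0x34127856 and the word step gives 0x78563412.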
[Bug target/93571] PPC: fmr gets used instead of faster xxlor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571 --- Comment #3 from luoxhu at gcc dot gnu.org --- BTW, I didn't see a performance difference between fmr and xxlor in a small benchmark.

Max Ops Per Cycle Latency (Min) Latency (Max)
fmr - - ALU FPR 4 2 2 1 R - - - - Floating Move Register
xxlor - - ALU VSR 2 2 2 1 V - 1 S - - VSX Vector Logical OR
[Bug target/93571] PPC: fmr gets used instead of faster xxlor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #2 from luoxhu at gcc dot gnu.org --- It is generated by "*mov<mode>_hardfloat64" (i.e. {*movdf_hardfloat64}). Switching the constraints of the fmr and xxlor alternatives could generate the expected code; is that correct?
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #5 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #4)
> This PR is specifically about the vec_revb builtin. But yes, we should
> look at what is generated for all other code (having only the builtin
> generate good code is suboptimal for a generic thing like this), and for
> other sizes as well.

Sorry, I don't quite understand what you mean. IMO vec_revb is expanded by CODE_FOR_revb_v8hi through the revb_<mode> pattern, so this is where we should change things to get better code generation...

For V8HI, it is natural to use vspltish 8 + vrlh to turn {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} into {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14}. But for V4SI, we need to use vspltish+vrlh to turn it into {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14} first, and then a "vrlw 16" to turn it into {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12}. I am not sure whether this is better than lvx+xxlnor+vperm, especially for V2DI with an additional "vrld 32", or "vrld 32"+"vrlq 64"? (Those are all operations on registers, without a load from memory like lvx.)

bt 5
#0 gen_revb_v8hi (operand0=0x74d4ce40, operand1=0x74d4cf60) at ../../gcc/gcc/config/rs6000/vsx.md:5858
#1 0x10b05360 in insn_gen_fn::operator() (this=0x130ab188 ) at ../../gcc/gcc/recog.h:407
#2 0x11aa1e30 in rs6000_expand_unop_builtin (icode=CODE_FOR_revb_v8hi, exp= , target=0x74d4ce40) at ../../gcc/gcc/config/rs6000/rs6000-call.c:9451
#3 0x11ab27a4 in rs6000_expand_builtin (exp=, target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode, ignore=0) at ../../gcc/gcc/config/rs6000/rs6000-call.c:13157
#4 0x10815268 in expand_builtin (exp=, target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode, ignore=0) at ../../gcc/gcc/builtins.c:9559
[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020 luoxhu at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #4 from luoxhu at gcc dot gnu.org --- Fixed.
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #3 from luoxhu at gcc dot gnu.org ---

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 097a127be07..35b3f1a0e1a 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -1932,7 +1932,7 @@ (define_insn "altivec_vpku<VI_char>um_direct"
 }
   [(set_attr "type" "vecperm")])

-(define_insn "*altivec_vrl<VI_char>"
+(define_insn "altivec_vrl<VI_char>"
   [(set (match_operand:VI2 0 "register_operand" "=v")
         (rotate:VI2 (match_operand:VI2 1 "register_operand" "v")
                     (match_operand:VI2 2 "register_operand" "v")))]
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 8c5865b8c34..88b34a2285a 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5849,9 +5849,18 @@ (define_expand "revb_<mode>"
       /* Want to have the elements in reverse order relative
          to the endian mode in use, i.e. in LE mode, put elements
          in BE order. */
-      rtx sel = swap_endian_selector_for_mode(<MODE>mode);
-      emit_insn (gen_altivec_vperm_<mode> (operands[0], operands[1],
-                                           operands[1], sel));
+      if (<MODE>mode == V8HImode)
+        {
+          rtx splt = gen_reg_rtx (V8HImode);
+          emit_insn (gen_altivec_vspltish (splt, GEN_INT (8)));
+          emit_insn (gen_altivec_vrlh (operands[0], operands[1], splt));
+        }
+      else
+        {
+          rtx sel = swap_endian_selector_for_mode (<MODE>mode);
+          emit_insn (gen_altivec_vperm_<mode> (operands[0], operands[1],
+                                               operands[1], sel));
+        }
     }

With above change, it could generate the expected code:

revb:
.LFB0:
        .cfi_startproc
        vspltisw 0,8
        vrlw 2,2,0
        blr
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #2 from luoxhu at gcc dot gnu.org --- But the two-instruction vspltish+vrlh byte swap only works for V8HImode. Is there no better code generation for other modes such as V4SI/V2DI/V1TI?

unsigned int swap1[16] = {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};
unsigned int swap2[16] = {7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8};
unsigned int swap4[16] = {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12};
unsigned int swap8[16] = {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14};

For V4SI, for example, we need to swap the bytes within each halfword first and then swap the halfwords within each word, which seems less straightforward than vperm?
[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||segher at gcc dot gnu.org, ||segher at kernel dot crashing.org --- Comment #1 from luoxhu at gcc dot gnu.org --- Confirmed. The BE-m32 test is a nightmare to me... :( For float128-call.c, need check target BE or LE. And for pr100085.c, vector __int128 is not supported with {-m32}, just skip it. Ok to trunk? [PATCH] rs6000: Fix test case failures by PR100085 [PR101020] gcc/testsuite/ChangeLog: PR target/101020 * gcc.target/powerpc/float128-call.c: Adjust. * gcc.target/powerpc/pr100085.c: Likewise. --- gcc/testsuite/gcc.target/powerpc/float128-call.c | 6 -- gcc/testsuite/gcc.target/powerpc/pr100085.c | 2 +- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/gcc/testsuite/gcc.target/powerpc/float128-call.c b/gcc/testsuite/gcc.target/powerpc/float128-call.c index a1f09df..b64ffc6 100644 --- a/gcc/testsuite/gcc.target/powerpc/float128-call.c +++ b/gcc/testsuite/gcc.target/powerpc/float128-call.c @@ -21,5 +21,7 @@ TYPE one (void) { return ONE; } void store (TYPE a, TYPE *p) { *p = a; } -/* { dg-final { scan-assembler "lvx 2" } } */ -/* { dg-final { scan-assembler "stvx 2" } } */ +/* { dg-final { scan-assembler {\mlxvd2x 34\M} {target be} } } */ +/* { dg-final { scan-assembler {\mstxvd2x 34\M} {target be} } } */ +/* { dg-final { scan-assembler {\mlvx 2\M} {target le} } } */ +/* { dg-final { scan-assembler {\mstvx 2\M} {target le} } } */ diff --git a/gcc/testsuite/gcc.target/powerpc/pr100085.c b/gcc/testsuite/gcc.target/powerpc/pr100085.c index 7d8b147..b6738ea 100644 --- a/gcc/testsuite/gcc.target/powerpc/pr100085.c +++ b/gcc/testsuite/gcc.target/powerpc/pr100085.c @@ -1,4 +1,4 @@ -/* { dg-do compile } */ +/* { dg-do compile {target lp64} } */ /* { dg-options "-O2 -mdejagnu-cpu=power8" } */
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #10 from luoxhu at gcc dot gnu.org --- float128 to vector __int128 is fixed by: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=f700e4b0ee3ef53b48975cf89be26b9177e3a3f3
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #9 from luoxhu at gcc dot gnu.org --- Patch sent, it could fix the __float128 to vector __int128 issue, https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571689.html But for __float128 to __int128 mentioned in #c4, need hack rs6000_modes_tieable_p to remove the stack operation in dse1. But I am not sure this is *LEGAL* since TImode is allocated to GPR, It seems not true to access TImode from ALTIVEC or VSX without copying? diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index ad11b67b125..ee69463ac46 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -1974,6 +1974,9 @@ rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2) || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode) return mode1 == mode2; + if (mode1 == TImode && ALTIVEC_OR_VSX_VECTOR_MODE (mode2)) +return true; + xxpermdi %vs0,%vs34,%vs34,3 mfvsrd %r4,%vs34 mfvsrd %r3,%vs0
[Bug target/94613] S/390, powerpc: Wrong code generated for vec_sel builtin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94613 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #14 from luoxhu at gcc dot gnu.org --- Patch submitted: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html
[Bug target/97142] __builtin_fmod not optimized on POWER
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142 --- Comment #12 from luoxhu at gcc dot gnu.org --- Patch submitted: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568143.html
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #7 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #3) > The rotates in 6 and 7 are not merged, and neither are the vec_selects in > 8 and 9. Both should be pretty easy to do, there is no unspec in sight, > etc. Should this be done in pass bswaps or combine or by peephole2? :)
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #17 from luoxhu at gcc dot gnu.org --- If the constant limitation is removed, it could be combined successfully with my new patch for PR94613. https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html And what do you mean"This is not canonical form on RTL, and it's not a useful form either" in c#7, please? Not understanding the point... Trying 11 -> 16: 11: r124:V4SI=r127:V4SI:V4SI|~r129:V4SI:V4SI REG_DEAD r128:V4SI REG_DEAD r129:V4SI REG_DEAD r127:V4SI 16: %v2:V4SI=r124:V4SI REG_DEAD r124:V4SI Successfully matched this instruction: (set (reg/i:V4SI 66 %v2) (ior:V4SI (and:V4SI (reg:V4SI 127) (reg:V4SI 129)) (and:V4SI (not:V4SI (reg:V4SI 129)) (reg:V4SI 128 allowing combination of insns 11 and 16 original costs 4 + 4 = 8 replacement cost 4 deferring deletion of insn with uid = 11. modifying insn i316: %v2:V4SI=r127:V4SI:V4SI|~r129:V4SI:V4SI REG_DEAD r127:V4SI REG_DEAD r129:V4SI REG_DEAD r128:V4SI deferring rescan insn with uid = 16. diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c index 571e2337e27..701f37eb03e 100644 --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -3405,7 +3405,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code code, machines, and also has shorter instruction path length. */ if (GET_CODE (op0) == AND && GET_CODE (XEXP (op0, 0)) == XOR - && CONST_INT_P (XEXP (op0, 1)) && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1)) { rtx a = trueop1; @@ -3419,7 +3418,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code code, /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C)) */ else if (GET_CODE (op0) == AND && GET_CODE (XEXP (op0, 0)) == XOR - && CONST_INT_P (XEXP (op0, 1)) && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1)) { rtx a = XEXP (XEXP (op0, 0), 0);
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #16 from luoxhu at gcc dot gnu.org --- > +2016-11-09 Segher Boessenkool > + > + * simplify-rtx.c (simplify_binary_operation_1): Simplify > + (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and > + (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C > + is a const_int. Is it a MUST that C be const here? For this case in PR90323, C is not a const actually. l = l & ~mask; l |= mask & r; Trying 8, 9 -> 10: 8: r127:V4SI=r124:V4SI^r131:V4SI REG_DEAD r131:V4SI 9: r122:V4SI=r127:V4SI:V4SI REG_DEAD r130:V4SI REG_DEAD r127:V4SI 10: r128:V4SI=r124:V4SI^r122:V4SI REG_DEAD r124:V4SI REG_DEAD r122:V4SI
[Bug target/97142] __builtin_fmod not optimized on POWER
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142 --- Comment #10 from luoxhu at gcc dot gnu.org --- If not built with fast-math, gimple_has_side_effects returns true, so expand_call_stmt fails to expand "_1 = fmod (x_2(D), y_3(D));" to an internal function. X86 also produces "bl fmod" for an -O3 build. xlF expands fmod to the ASM below; no FMA generated?

1900 :
1900: 8c 03 01 10 vspltisw v0,1
1904: 00 00 24 c8 lfd f1,0(r4)
1908: 00 00 03 c8 lfd f0,0(r3)
190c: e2 03 40 f0 xvcvsxwdp vs2,vs32
1910: c0 09 62 f0 xsdivdp vs3,vs2,vs1
1914: 80 19 80 f0 xsmuldp vs4,vs0,vs3
1918: 64 21 a0 f0 xsrdpiz vs5,vs4
191c: 88 2d 01 f0 xsnmsubadp vs0,vs1,vs5
1920: 18 00 20 fc frsp f1,f0
1924: 20 00 80 4e blr
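Read as scalar arithmetic, the xlF listing above appears to compute x - y * trunc(x * (1/y)). The following is a sketch of that reading, not the expansion used by any GCC patch: fmod_fast_sketch is a hypothetical name, the cast to long long stands in for xsrdpiz and so assumes the quotient fits in that type, and the final frsp rounding to single precision is not modeled.

```c
/* Scalar reading of the listing.  The reciprocal multiply is not
   correctly rounded, so results can differ from fmod in general;
   that is consistent with such a sequence being a fast-math-only
   expansion.  */
double fmod_fast_sketch (double x, double y)
{
  double recip = 1.0 / y;               /* xsdivdp: 1.0 / y             */
  double q = x * recip;                 /* xsmuldp: x * (1 / y)         */
  double t = (double) (long long) q;    /* xsrdpiz: round toward zero   */
  return x - y * t;                     /* xsnmsubadp: -(y * t - x)     */
}
```

For inputs where every intermediate value is exact, such as 7.5 and 2.0, the sketch returns the same 1.5 that fmod would.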
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #15 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #14) > (In reply to luoxhu from comment #12) > > That code was called by combine pass but fail to match. > > > > > pr newpat > > (set (reg:DI 125 [ l ]) > > (xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ]) > > (reg:DI 127)) > > (const_int 267390975 [0xff00fff])) > > (reg/v:DI 120 [ l ]))) > > Note this is 0x0ff00fff, and this is not a valid mask for rlwimi. OK, it also fails to combine for 0x0100. .cfi_startproc xor 4,3,4 rlwinm 4,4,0,7,7 xor 3,4,3 blr
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #12 from luoxhu at gcc dot gnu.org --- That code was called by combine pass but fail to match. pr newpat (set (reg:DI 125 [ l ]) (xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ]) (reg:DI 127)) (const_int 267390975 [0xff00fff])) (reg/v:DI 120 [ l ]))) Trying 8, 10 -> 11: 8: r123:DI=r120:DI^r127:DI REG_DEAD r127:DI 10: r118:DI=r123:DI&0xff00fff REG_DEAD r123:DI 11: r125:DI=r118:DI^r120:DI REG_DEAD r120:DI REG_DEAD r118:DI Failed to match this instruction: (set (reg:DI 125 [ l ]) (ior:DI (and:DI (reg/v:DI 120 [ l ]) (const_int -267390976 [0xf00ff000])) (and:DI (reg:DI 127) (const_int 267390975 [0xff00fff] Successfully matched this instruction: (set (reg:DI 118 [ _2 ]) (and:DI (reg:DI 127) (const_int 267390975 [0xff00fff]))) Failed to match this instruction: (set (reg:DI 125 [ l ]) (ior:DI (and:DI (reg/v:DI 120 [ l ]) (const_int -267390976 [0xf00ff000])) (reg:DI 118 [ _2 ])))
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #11 from luoxhu at gcc dot gnu.org --- I noticed that you added the below optimization with commit a62436c0a505155fc8becac07a8c0abe2c265bfe. But it doesn't even handle this case, cse1 pass will call simplify_binary_operation_1, both op0 and op1 are REGs instead of AND operators, do you have a test case to cover that piece of code? __attribute__ ((noinline)) long without_sel3( long l, long r) { long tmp = {0x0ff00fff}; l = ( (l ^ r) & tmp) ^ l; return l; } without_sel3: xor 4,3,4 rlwinm 4,4,0,20,11 rldicl 4,4,0,36 xor 3,4,3 blr .long 0 .byte 0,0,0,0,0,0,0,0 +2016-11-09 Segher Boessenkool + + * simplify-rtx.c (simplify_binary_operation_1): Simplify + (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and + (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C + is a const_int. diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c index 5c3dea1a349..11a2e0267c7 100644 --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -2886,6 +2886,37 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode, } } + /* If we have (xor (and (xor A B) C) A) with C a constant we can instead +do (ior (and A ~C) (and B C)) which is a machine instruction on some +machines, and also has shorter instruction path length. 
*/ + if (GET_CODE (op0) == AND + && GET_CODE (XEXP (op0, 0)) == XOR + && CONST_INT_P (XEXP (op0, 1)) + && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1)) + { + rtx a = trueop1; + rtx b = XEXP (XEXP (op0, 0), 1); + rtx c = XEXP (op0, 1); + rtx nc = simplify_gen_unary (NOT, mode, c, mode); + rtx a_nc = simplify_gen_binary (AND, mode, a, nc); + rtx bc = simplify_gen_binary (AND, mode, b, c); + return simplify_gen_binary (IOR, mode, a_nc, bc); + } + /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C)) */ + else if (GET_CODE (op0) == AND + && GET_CODE (XEXP (op0, 0)) == XOR + && CONST_INT_P (XEXP (op0, 1)) + && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1)) + { + rtx a = XEXP (XEXP (op0, 0), 0); + rtx b = trueop1; + rtx c = XEXP (op0, 1); + rtx nc = simplify_gen_unary (NOT, mode, c, mode); + rtx b_nc = simplify_gen_binary (AND, mode, b, nc); + rtx ac = simplify_gen_binary (AND, mode, a, c); + return simplify_gen_binary (IOR, mode, ac, b_nc); + }
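The two rewrites in the patch above rest on bitwise identities that can be checked directly in C. This sketch uses hypothetical function names and plain 64-bit integers in place of RTL operands. It shows that each identity holds bit by bit whether or not C is a constant; whether the rewrite is profitable or canonical for a non-constant C is a separate question that the code does not address.

```c
#include <stdint.h>

/* (xor (and (xor A B) C) A)  */
uint64_t xor_form_a (uint64_t a, uint64_t b, uint64_t c)
{
  return ((a ^ b) & c) ^ a;
}

/* (ior (and A ~C) (and B C))  */
uint64_t sel_form_a (uint64_t a, uint64_t b, uint64_t c)
{
  return (a & ~c) | (b & c);
}

/* (xor (and (xor A B) C) B)  */
uint64_t xor_form_b (uint64_t a, uint64_t b, uint64_t c)
{
  return ((a ^ b) & c) ^ b;
}

/* (ior (and A C) (and B ~C))  */
uint64_t sel_form_b (uint64_t a, uint64_t b, uint64_t c)
{
  return (a & c) | (b & ~c);
}
```

Since every operation is bitwise, checking the eight combinations of a single bit of A, B and C is enough to cover all inputs.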
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #9 from luoxhu at gcc dot gnu.org --- Then we could optimized it in match.pd diff --git a/gcc/match.pd b/gcc/match.pd index 036f92fa959..8944312c153 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -3711,6 +3711,17 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (integer_all_onesp (@1) && integer_zerop (@2)) @0 +#if GIMPLE +(simplify + (bit_xor @0 (bit_and @2 (bit_xor @0 @1))) + (if (optimize_vectors_before_lowering_p () && types_match (@0, @1) + && types_match (@0, @2) && VECTOR_TYPE_P (TREE_TYPE (@0)) + && VECTOR_TYPE_P (TREE_TYPE (@1)) && VECTOR_TYPE_P (TREE_TYPE (@2))) + (with { tree itype = truth_type_for (type); } + (vec_cond (convert:itype @2) @1 @0 +#endif in pr90323.c.033t.forwprop1, it will be optimized to: : _1 = ~mask_3(D); l_5 = _1 & l_4(D); _2 = mask_3(D) & r_6(D); _8 = l_4(D) ^ r_6(D); _10 = mask_3(D) & _8; _11 = (vector(4) ) mask_3(D); l_7 = VEC_COND_EXPR <_11, r_6(D), l_4(D)>; return l_7; Then in pr90323.c.243t.isel: [local count: 1073741824]: _6 = (vector(4) ) mask_1(D); l_4 = .VCOND_MASK (_6, r_3(D), l_2(D)); return l_4; final ASM: without_sel: .LFB11: .cfi_startproc xxsel 34,34,35,36 blr .long 0 .byte 0,0,0,0,0,0,0,0 .cfi_endproc .LFE11: .size without_sel,.-without_sel .align 2 .p2align 4,,15 .globl with_sel .type with_sel, @function with_sel: .LFB12: .cfi_startproc xxsel 34,34,35,36 blr @segher, Is this reasonable fix ???
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #8 from luoxhu at gcc dot gnu.org --- Two minor updates for the case mentioned in #c2. For VEC_SEL (ARG1, ARG2, ARG3): returns a vector containing the value of either ARG1 or ARG2 depending on the value of ARG3.

#include <stdio.h>
#include <altivec.h>

volatile vector unsigned orig = {0xebebebeb, 0x34343434, 0x76767676, 0x12121212};
volatile vector unsigned mask = {0x, 0, 0x, 0};
volatile vector unsigned fill = {0xfefefefe, 0x, 0x, 0x};
volatile vector unsigned expected = {0xfefefefe, 0x34343434, 0x, 0x12121212};

__attribute__ ((noinline))
vector unsigned without_sel(vector unsigned l, vector unsigned r, vector unsigned mask)
{
-    l = l & ~r;
+    l = l & ~mask;
     l |= mask & r;
     return l;
}

__attribute__ ((noinline))
vector unsigned with_sel(vector unsigned l, vector unsigned r, vector unsigned mask)
{
-    return vec_sel(l, mask, r);
+    return vec_sel(l, r, mask);
}

int main()
{
  vector unsigned res1 = without_sel(orig, fill, mask);
  vector unsigned res2 = with_sel(orig, fill, mask);
  if (!vec_all_eq(res1, expected))
    printf ("error1\n");
  if (!vec_all_eq(res2, expected))
    printf ("error2\n");
  return 0;
}

And the ASM would be:

without_sel:
        xxlxor 35,34,35
        xxland 35,35,36
        xxlxor 34,34,35
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
with_sel:
        xxsel 34,34,35,36
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #21 from luoxhu at gcc dot gnu.org --- Fixed on master.
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #19 from luoxhu at gcc dot gnu.org --- https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567395.html

This patch extends variable vec_insert to all 32bit VSX targets including Power7{BE} {32,64}, Power8{BE}{32, 64}, Power8{LE}{64}, Power9{LE}{64}. All of these pass the power test cases, though AIX is not tested yet.

@Segher, please review this one instead of the previous one, which disables 32 bit variable vec_insert, thanks.

Altivec targets like power5/6/G4/G5 still take the previous "vector store/scalar store/vector load" code path.

-mcpu=power6 -O2 -maltivec -c -S

f2:
.LFB0:
        .cfi_startproc
        addi 10,1,-16
        sldi 5,5,2
        li 9,32
        addi 8,1,-48
        stvx 2,8,9
        stwx 6,10,5
        lvx 2,8,9
        blr
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #15 from luoxhu at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #14) > You still have: > if (VECTOR_MEM_VSX_P (mode)) > { > if (!CONST_INT_P (elt_rtx)) > { > if ((TARGET_P9_VECTOR && TARGET_POWERPC64) || width == 8) > return ..._p9 (...); > else if (TARGET_P8_VECTOR) > return ..._p8 (...); > } > > if (mode == V2DFmode) > insn = gen_vsx_set_v2df (target, target, val, elt_rtx); > > else if (mode == V2DImode) > insn = gen_vsx_set_v2di (target, target, val, elt_rtx); > > else if (TARGET_P9_VECTOR && TARGET_POWERPC64) > { > ... > } > if (insn) > return; > } > > gcc_assert (CONST_INT_P (elt_rtx)); > > while the vector.md condition is VECTOR_MEM_ALTIVEC_OR_VSX_P (mode), > i.e. true for TARGET_ALTIVEC for many modes already (V4SI, V8HI, V16QI, V4SF > and > for TARGET_VSX also V2DF and V2DI, right). > I somehow don't see how this can work properly. > Looking at vsx_set_v2df and vsx_set_v2di, neither of them will handle > non-constant elt_rtx (it ICEs on anything but const0_rtx and const1_rtx). > > So, questions: > 1) does the rs6000_expand_vector_set_var_p9 routine for width == 8 (i.e. > V2DImode or V2DFmode?) > handle everything, even when TARGET_P9_VECTOR or TARGET_POWERPC64 is not > true, plain old VSX? Yes. V2DI/V2DF for P8 {BE,LE} {m32,m64} will call rs6000_expand_vector_set_var_p9 instead of xxx_p8. Do you mean Power7 for the plain old VSX? I verified the pr98914.c on Power7, it exactly ICEs on "gcc_assert (CONST_INT_P (elt_rtx));" for both m64 and m32. This is still not fixed by the patch in #c11 yet. For builtin call in rs6000-c.c:altivec_build_resolved_builtin, it is guarded by TARGET_P8_VECTOR, so Power7 doesn't generate IFN VEC_INSERT before. 
This ICE also comes from internal optimization gimple-isel.c:gimple_expand_vec_set_expr, can_vec_set_var_idx_p doesn't return false due to VECTOR_MEM_ALTIVEC_OR_VSX_P is true when Power7 VSX, change the "if (VECTOR_MEM_VSX_P (mode))" to "if (VECTOR_MEM_ALTIVEC_OR_VSX_P (mode))" in rs6000.c:rs6000_expand_vector_set and remove TARGET_P8_VECTOR in the else branch could fix the ICE on P7 {m32,64}, so this means even P7 VSX could benefit from this optimization, which is different from what discussed before. > 2) what happens if TARGET_P8_VECTOR is false and TARGET_VSX is true and mode > is other than V2DI/V2DF? If I read the code right, it will fall through to > gcc_assert (CONST_INT_P (elt_rtx)); Same like 1)? > 3) what happens if !TARGET_VSX (more specifically, when VECTOR_MEM_VSX_P > (mode) is false. > I see there just the assertion that would fail right away. > Perhaps I'm missing something obvious and those cases are impossible, but if > that is the case, it would still be better to add further assertion at least > to the if (...) else if (...) as else gcc_assert ... Thanks for pointing out, the "gcc_assert (CONST_INT_P (elt_rtx));" should be moved into the "if (!CONST_INT_P (elt_rtx))" condition like you said. gen_vsx_set_v2df and gen_vsx_set_v2di are supposed to handle only const elt_rtx.
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #13 from luoxhu at gcc dot gnu.org --- The performance data in #c11 is for the int variable vec_insert in 32-bit mode. The float variable vec_insert in 32-bit mode is a bit slower but still much better than the original (the extra stfs+lwz comes from insns 17 and 18 in expand, which move the SF register to an SI register by hex value): 46.677s -> 8.723s

test.c

#include <altivec.h>
#define TYPE float
vector TYPE test (vector TYPE u, TYPE i, signed int n)
{
  return vec_insert (i, u, n);
}

Expand:

1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r122:V4SF=%2:V4SF
3: r123:SF=%1:SF
4: r124:SI=%3:SI
5: NOTE_INSN_FUNCTION_BEG
8: r120:V4SF=r122:V4SF
9: r125:SI=r124:SI&0x3
10: r126:V4SF=r120:V4SF
11: r128:SI=r125:SI<<0x2
12: {r128:SI=0x14-r128:SI;clobber ca:SI;}
13: r132:SI=high(`*.LC0')
14: r131:SI=r132:SI+low(`*.LC0') REG_EQUAL `*.LC0'
15: r130:V2DI=[r131:SI] REG_EQUAL const_vector
16: r129:V16QI=r130:V2DI#0
17: [r112:SI]=r123:SF
18: r133:SI=[r112:SI]
19: r136:DI#4=r133:SI
22: {r137:SI=r133:SI>>0x1f;clobber ca:SI;}
23: r136:DI#0=r137:SI
24: r138:DI=0
25: r135:V2DI=vec_concat(r136:DI,r138:DI)
26: r134:V16QI=r135:V2DI#0
27: r139:V16QI=unspec[r128:SI] 151
28: r140:V16QI=unspec[r134:V16QI,r134:V16QI,r139:V16QI] 236
29: r141:V16QI=unspec[r129:V16QI,r129:V16QI,r139:V16QI] 236
30: r126:V4SF#0={(r141:V16QI!=const_vector)?r140:V16QI:r126:V4SF#0}
31: r119:V4SF=r126:V4SF
32: r120:V4SF=r119:V4SF

ASM:

.LFB0:
        .cfi_startproc
        stwu 1,-16(1)
        .cfi_def_cfa_offset 16
        lis 9,.LC0@ha
        rlwinm 3,3,2,28,29
        xxlxor 0,0,0
        la 9,.LC0@l(9)
        subfic 3,3,20
        lxvd2x 33,0,9
        lvsl 13,0,3
        stfs 1,8(1)
        vperm 1,1,1,13
        ori 2,2,0
        lwz 9,8(1)
        addi 1,1,16
        .cfi_def_cfa_offset 0
        srawi 10,9,31
        mtvsrwz 13,9
        mtvsrwz 12,10
        fmrgow 11,12,13
        xxpermdi 32,11,0,0
        vperm 0,0,0,13
        xxsel 34,34,32,33
        blr