On Wed, 26 Jun 2019 at 23:45, Richard Sandiford <richard.sandif...@arm.com> wrote: > > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > > On Wed, 26 Jun 2019 at 16:05, Richard Sandiford > > <richard.sandif...@arm.com> wrote: > >> > >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > >> > On Tue, 25 Jun 2019 at 20:05, Richard Sandiford > >> > <richard.sandif...@arm.com> wrote: > >> >> > >> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > >> >> > On Mon, 24 Jun 2019 at 21:41, Prathamesh Kulkarni > >> >> > <prathamesh.kulka...@linaro.org> wrote: > >> >> >> > >> >> >> On Mon, 24 Jun 2019 at 19:51, Richard Sandiford > >> >> >> <richard.sandif...@arm.com> wrote: > >> >> >> > > >> >> >> > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > >> >> >> > > @@ -1415,6 +1460,19 @@ forward_propagate_into (df_ref use) > >> >> >> > > if (!def_set) > >> >> >> > > return false; > >> >> >> > > > >> >> >> > > + if (reg_prop_only > >> >> >> > > + && !REG_P (SET_SRC (def_set)) > >> >> >> > > + && !REG_P (SET_DEST (def_set))) > >> >> >> > > + return false; > >> >> >> > > >> >> >> > This should be: > >> >> >> > > >> >> >> > if (reg_prop_only > >> >> >> > && (!REG_P (SET_SRC (def_set)) || !REG_P (SET_DEST > >> >> >> > (def_set)))) > >> >> >> > return false; > >> >> >> > > >> >> >> > so that we return false if either operand isn't a register. > >> >> >> Oops, sorry about that -:( > >> >> >> > > >> >> >> > > + > >> >> >> > > + /* Allow propagations into a loop only for reg-to-reg copies, > >> >> >> > > since > >> >> >> > > + replacing one register by another shouldn't increase the > >> >> >> > > cost. */ > >> >> >> > > + > >> >> >> > > + if (DF_REF_BB (def)->loop_father != DF_REF_BB > >> >> >> > > (use)->loop_father > >> >> >> > > + && !REG_P (SET_SRC (def_set)) > >> >> >> > > + && !REG_P (SET_DEST (def_set))) > >> >> >> > > + return false; > >> >> >> > > >> >> >> > Same here. > >> >> >> > > >> >> >> > OK with that change, thanks. > >> >> >> Thanks for the review, will make the changes and commit the patch > >> >> >> after re-testing. > >> >> > Hi, > >> >> > Testing the patch showed following failures on 32-bit x86: > >> >> > > >> >> > Executed from: g++.target/i386/i386.exp > >> >> > g++:g++.target/i386/pr88152.C scan-assembler-not > >> >> > vpcmpgt|vpcmpeq|vpsra > >> >> > Executed from: gcc.target/i386/i386.exp > >> >> > gcc:gcc.target/i386/pr66768.c scan-assembler add*.[ \t]%gs: > >> >> > gcc:gcc.target/i386/pr90178.c scan-assembler-times xorl[\\t > >> >> > ]*\\%eax,[\\t ]*%eax 1 > >> >> > > >> >> > The failure of pr88152.C is also seen on x86_64. > >> >> > > >> >> > For pr66768.c, and pr90178.c, forwprop replaces register which is > >> >> > volatile and frame related respectively. > >> >> > To avoid that, the attached patch, makes a stronger constraint that > >> >> > src and dest should be a register > >> >> > and not have frame_related or volatil flags set, which is checked in > >> >> > usable_reg_p(). > >> >> > Which avoids the failures for both the cases. > >> >> > Does it look OK ? > >> >> > >> >> That's not the reason it's a bad transform. In both cases we're > >> >> propagating r2 <- r1 even though > >> >> > >> >> (a) r1 dies in the copy and > >> >> (b) fwprop can't replace all uses of r2, because some have multiple > >> >> definitions > >> >> > >> >> This has the effect of making both values live in cases where only one > >> >> was previously. > >> >> > >> >> In the case of pr66768.c, fwprop2 is undoing the effect of > >> >> cse.c:canon_reg, which tries to pick the best register to use > >> >> (see cse.c:make_regs_eqv). fwprop1 makes the same change, > >> >> and made it even before the patch, but the cse.c choice should win. > >> >> > >> >> A (hopefully) conservative fix would be to propagate the copy only if > >> >> both registers have a single definition, which you can test with: > >> >> > >> >> (DF_REG_DEF_COUNT (regno) == 1 > >> >> && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno)) > >> >> > >> >> In that case, fwprop should see all uses of the destination, and should > >> >> be able to replace it in all cases with the source. > >> > Ah I see, thanks for the explanation! > >> > The above check works to resolve the failure. > >> > IIUC, !bitmap_bit_p (...) above checks that reg isn't used uninitialized > >> > ? > >> > >> Right. > >> > >> >> > For g++.target/i386/pr88152.C, the issue is that after the patch, > >> >> > forwprop1 does following propagation (in f10) which wasn't done > >> >> > before: > >> >> > > >> >> > In insn 10, replacing > >> >> > (unspec:SI [ > >> >> > (reg:V2DF 91) > >> >> > ] UNSPEC_MOVMSK) > >> >> > with (unspec:SI [ > >> >> > (subreg:V2DF (reg:V2DI 90) 0) > >> >> > ] UNSPEC_MOVMSK) > >> >> > > >> >> > This later defeats combine because insn 9 gets deleted. > >> >> > Without patch, the following combination takes place: > >> >> > > >> >> > Trying 7 -> 9: > >> >> > 7: r90:V2DI=r89:V2DI>r93:V2DI > >> >> > REG_DEAD r93:V2DI > >> >> > REG_DEAD r89:V2DI > >> >> > 9: r91:V2DF=r90:V2DI#0 > >> >> > REG_DEAD r90:V2DI > >> >> > Successfully matched this instruction: > >> >> > (set (subreg:V2DI (reg:V2DF 91) 0) > >> >> > (gt:V2DI (reg:V2DI 89) > >> >> > (reg:V2DI 93))) > >> >> > allowing combination of insns 7 and 9 > >> >> > > >> >> > and then: > >> >> > Trying 6, 9 -> 10: > >> >> > 6: r89:V2DI=const_vector > >> >> > 9: r91:V2DF#0=r89:V2DI>r93:V2DI > >> >> > REG_DEAD r89:V2DI > >> >> > REG_DEAD r93:V2DI > >> >> > 10: r87:SI=unspec[r91:V2DF] 43 > >> >> > REG_DEAD r91:V2DF > >> >> > Successfully matched this instruction: > >> >> > (set (reg:SI 87) > >> >> > (unspec:SI [ > >> >> > (lt:V2DF (reg:V2DI 93) > >> >> > (const_vector:V2DI [ > >> >> > (const_int 0 [0]) repeated x2 > >> >> > ])) > >> >> > ] UNSPEC_MOVMSK)) > >> >> > >> >> Eh? lt:*V2DF*? Does that mean that it's 0 for false and an all-1 NaN > >> >> for true? > >> > Well looking at .optimized dump: > >> > > >> > vector(2) long int _2; > >> > vector(2) double _3; > >> > int _6; > >> > signed long _7; > >> > long int _8; > >> > signed long _9; > >> > long int _10; > >> > > >> > <bb 2> [local count: 1073741824]: > >> > _7 = BIT_FIELD_REF <a_4(D), 64, 0>; > >> > _8 = _7 < 0 ? -1 : 0; > >> > _9 = BIT_FIELD_REF <a_4(D), 64, 64>; > >> > _10 = _9 < 0 ? -1 : 0; > >> > _2 = {_8, _10}; > >> > _3 = VIEW_CONVERT_EXPR<__m128d>(_2); > >> > _6 = __builtin_ia32_movmskpd (_3); [tail call] > >> > return _6; > >> > > >> > So AFAIU, we're using -1 for representing true and 0 for false > >> > and casting -1 (literally) to double would change it's value to -NaN ? > >> > >> Exactly. And for -ffinite-math-only, we might as well then fold the > >> condition to false. :-) > >> > >> IMO rtl condition results have to have integral modes and not folding > >> (subreg:V2DF (lt:V2DI ...) 0) is the correct behaviour. So I think this > >> is just a latent bug and shouldn't hold up the patch. > >> > >> I'm not sure whether: > >> > >> reinterpret_cast<__m128d> (a > ...) > >> > >> is something we expect users to write, or whether it was just > >> included for completeness. > > I will report the issue and commit after re-testing patch. > > Thanks a lot for the helpful reviews! > > Since it seems FP comparison results are OK, I guess it needs to be > fixed after all. I think the problem is that the simplification > is only done by gen_lowpart_for_combine: > > /* If X is a comparison operator, rewrite it in a new mode. This > probably won't match, but may allow further simplifications. */ > else if (COMPARISON_P (x)) > return gen_rtx_fmt_ee (GET_CODE (x), omode, XEXP (x, 0), XEXP (x, 1)); > > and triggers via expand_field_assignment when the subreg is on the lhs > of the SET. For it to work for a general subreg on the rhs, it probably > needs to be moved from here to simplify_subreg. > > We'll need to be careful about the conditions under which it > happens though. The above doesn't for example check whether > the new mode has the same number of elements as the original, > so could generate things like: > > (gt:V4SF (reg:V2DI X) (reg:V2DI Y)) > > for: > > (set (reg:V2DI res) (gt:V2DI (reg:V2DI X) (reg:V2DI Y))) > (subreg:V4SF (reg:V2DI res) 0) > > I think most code would expect the result of a comparison to have the > same number of elements as the inputs and could ICE if they don't, > so I don't think it's up to the target to decide whether mismatches > are OK. (But there again, I thought that last time. :-)) > > We'd also need to check the element sizes, since e.g. a lowpart > (subreg:V2HI ...) of a V2SI comparison isn't the same as a V2HI > comparison. Hi Richard, Thanks for the detailed suggestions and sorry for late response, for some reason, I missed this email earlier -;( With the changes to simplify_subreg, the test doesn't regress anymore.
IIUC it does the following change: (subreg:V2DF (lt:V2DI (reg:V2DI 93) (const_vector 0))) -> (lt:V2DF (reg:V2DI 93) (const_vector 0)) which allowed the 6, 7 -> 10 combination which was failing earlier. Does the attached version look OK ? Testing in progress. Thanks, Prathamesh > > Thanks, > Richard
diff --git a/gcc/fwprop.c b/gcc/fwprop.c index 45703fe5f01..e6f37527192 100644 --- a/gcc/fwprop.c +++ b/gcc/fwprop.c @@ -448,6 +448,18 @@ enum { PR_OPTIMIZE_FOR_SPEED = 4 }; +/* Check that X has a single def. */ + +static bool +reg_single_def_p (rtx x) +{ + if (!REG_P (x)) + return false; + + int regno = REGNO (x); + return (DF_REG_DEF_COUNT (regno) == 1 + && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (cfun)), regno)); +} /* Replace all occurrences of OLD in *PX with NEW and try to simplify the resulting expression. Replace *PX with a new RTL expression if an @@ -547,6 +559,54 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags) tem = simplify_gen_subreg (mode, op0, GET_MODE (SUBREG_REG (x)), SUBREG_BYTE (x)); } + + else + { + rtvec vec; + rtvec newvec; + const char *fmt = GET_RTX_FORMAT (code); + rtx op; + + for (int i = 0; fmt[i]; i++) + switch (fmt[i]) + { + case 'E': + vec = XVEC (x, i); + newvec = vec; + for (int j = 0; j < GET_NUM_ELEM (vec); j++) + { + op = RTVEC_ELT (vec, j); + valid_ops &= propagate_rtx_1 (&op, old_rtx, new_rtx, flags); + if (op != RTVEC_ELT (vec, j)) + { + if (newvec == vec) + { + newvec = shallow_copy_rtvec (vec); + if (!tem) + tem = shallow_copy_rtx (x); + XVEC (tem, i) = newvec; + } + RTVEC_ELT (newvec, j) = op; + } + } + break; + + case 'e': + if (XEXP (x, i)) + { + op = XEXP (x, i); + valid_ops &= propagate_rtx_1 (&op, old_rtx, new_rtx, flags); + if (op != XEXP (x, i)) + { + if (!tem) + tem = shallow_copy_rtx (x); + XEXP (tem, i) = op; + } + } + break; + } + } + break; case RTX_OBJ: @@ -1370,10 +1430,11 @@ forward_propagate_and_simplify (df_ref use, rtx_insn *def_insn, rtx def_set) /* Given a use USE of an insn, if it has a single reaching definition, try to forward propagate it into that insn. - Return true if cfg cleanup will be needed. */ + Return true if cfg cleanup will be needed. + REG_PROP_ONLY is true if we should only propagate register copies. */ static bool -forward_propagate_into (df_ref use) +forward_propagate_into (df_ref use, bool reg_prop_only = false) { df_ref def; rtx_insn *def_insn, *use_insn; @@ -1394,10 +1455,6 @@ forward_propagate_into (df_ref use) if (DF_REF_IS_ARTIFICIAL (def)) return false; - /* Do not propagate loop invariant definitions inside the loop. */ - if (DF_REF_BB (def)->loop_father != DF_REF_BB (use)->loop_father) - return false; - /* Check if the use is still present in the insn! */ use_insn = DF_REF_INSN (use); if (DF_REF_FLAGS (use) & DF_REF_IN_NOTE) @@ -1415,6 +1472,19 @@ forward_propagate_into (df_ref use) if (!def_set) return false; + if (reg_prop_only + && (!reg_single_def_p (SET_SRC (def_set)) + || !reg_single_def_p (SET_DEST (def_set)))) + return false; + + /* Allow propagations into a loop only for reg-to-reg copies, since + replacing one register by another shouldn't increase the cost. */ + + if (DF_REF_BB (def)->loop_father != DF_REF_BB (use)->loop_father + && (!reg_single_def_p (SET_SRC (def_set)) + || !reg_single_def_p (SET_DEST (def_set)))) + return false; + /* Only try one kind of propagation. If two are possible, we'll do it on the following iterations. */ if (forward_propagate_and_simplify (use, def_insn, def_set) @@ -1483,7 +1553,7 @@ gate_fwprop (void) } static unsigned int -fwprop (void) +fwprop (bool fwprop_addr_p) { unsigned i; @@ -1502,11 +1572,16 @@ fwprop (void) df_ref use = DF_USES_GET (i); if (use) - if (DF_REF_TYPE (use) == DF_REF_REG_USE - || DF_REF_BB (use)->loop_father == NULL - /* The outer most loop is not really a loop. */ - || loop_outer (DF_REF_BB (use)->loop_father) == NULL) - forward_propagate_into (use); + { + if (DF_REF_TYPE (use) == DF_REF_REG_USE + || DF_REF_BB (use)->loop_father == NULL + /* The outer most loop is not really a loop. */ + || loop_outer (DF_REF_BB (use)->loop_father) == NULL) + forward_propagate_into (use, fwprop_addr_p); + + else if (fwprop_addr_p) + forward_propagate_into (use, false); + } } fwprop_done (); @@ -1537,7 +1612,7 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { return gate_fwprop (); } - virtual unsigned int execute (function *) { return fwprop (); } + virtual unsigned int execute (function *) { return fwprop (false); } }; // class pass_rtl_fwprop @@ -1549,33 +1624,6 @@ make_pass_rtl_fwprop (gcc::context *ctxt) return new pass_rtl_fwprop (ctxt); } -static unsigned int -fwprop_addr (void) -{ - unsigned i; - - fwprop_init (); - - /* Go through all the uses. df_uses_create will create new ones at the - end, and we'll go through them as well. */ - for (i = 0; i < DF_USES_TABLE_SIZE (); i++) - { - if (!propagations_left) - break; - - df_ref use = DF_USES_GET (i); - if (use) - if (DF_REF_TYPE (use) != DF_REF_REG_USE - && DF_REF_BB (use)->loop_father != NULL - /* The outer most loop is not really a loop. */ - && loop_outer (DF_REF_BB (use)->loop_father) != NULL) - forward_propagate_into (use); - } - - fwprop_done (); - return 0; -} - namespace { const pass_data pass_data_rtl_fwprop_addr = @@ -1600,7 +1648,7 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { return gate_fwprop (); } - virtual unsigned int execute (function *) { return fwprop_addr (); } + virtual unsigned int execute (function *) { return fwprop (true); } }; // class pass_rtl_fwprop_addr diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c index 89a46a933fa..79bd0cfbd28 100644 --- a/gcc/simplify-rtx.c +++ b/gcc/simplify-rtx.c @@ -6697,6 +6697,19 @@ simplify_subreg (machine_mode outermode, rtx op, } } + /* If op is a vector comparison operator, rewrite it in a new mode. + This probably won't match, but may allow further simplifications. + Also check if number of elements and size of each element + match for outermode and innermode. */ + + if (COMPARISON_P (op) + && VECTOR_MODE_P (outermode) + && VECTOR_MODE_P (innermode) + && (known_eq (GET_MODE_NUNITS (outermode), GET_MODE_NUNITS (innermode))) + && (known_eq (GET_MODE_UNIT_SIZE (outermode), + GET_MODE_UNIT_SIZE (innermode)))) + return gen_rtx_fmt_ee (GET_CODE (op), outermode, XEXP (op, 0), XEXP (op, 1)); + return NULL_RTX; } diff --git a/gcc/testsuite/gfortran.dg/pr88833.f90 b/gcc/testsuite/gfortran.dg/pr88833.f90 new file mode 100644 index 00000000000..224e6ce5f3d --- /dev/null +++ b/gcc/testsuite/gfortran.dg/pr88833.f90 @@ -0,0 +1,9 @@ +! { dg-do assemble { target aarch64_asm_sve_ok } } +! { dg-options "-O3 -march=armv8.2-a+sve --save-temps" } + +subroutine foo(x) + real :: x(100) + x = x + 10 +end subroutine foo + +! { dg-final { scan-assembler {\twhilelo\tp[0-9]+\.s, wzr, (w[0-9]+).*\twhilelo\tp[0-9]+\.s, w[0-9]+, \1} } }