Re: [PATCH] i386: Improve chaining of _{addcarry, subborrow}_u{32, 64} [PR97387]

2020-10-14 Thread Uros Bizjak via Gcc-patches
On Wed, Oct 14, 2020 at 3:49 PM Jakub Jelinek  wrote:
>
> On Wed, Oct 14, 2020 at 03:17:03PM +0200, Uros Bizjak wrote:
> > > +(define_insn_and_split "*setcc_qi_addqi3_cconly_overflow_1_"
> > > +  [(set (reg:CCC FLAGS_REG)
> > > +   (compare:CCC (neg:QI (geu:QI (reg:CC_CCC FLAGS_REG) (const_int 
> > > 0)))
> > > +(ltu:QI (reg:CC_CCC FLAGS_REG) (const_int 0]
> > > +  "ix86_pre_reload_split ()"
> > > +  "#"
> > > +  "&& 1"
> > > +  [(const_int 0)])
> > >
> >
> > Hmm... does the above really represent a NOP?
>
> It is just what combine.c + simplify-rtx.c make out of this.
> We have:
> (insn 10 9 11 2 (set (reg:QI 88 [ _31 ])
> (ltu:QI (reg:CCC 17 flags)
> (const_int 0 [0]))) "include/adxintrin.h":69:10 785 {*setcc_qi}
>  (expr_list:REG_DEAD (reg:CCC 17 flags)
> (nil)))
> and
> (insn 17 15 18 2 (parallel [
> (set (reg:CCC 17 flags)
> (compare:CCC (plus:QI (reg:QI 88 [ _31 ])
> (const_int -1 [0x]))
> (reg:QI 88 [ _31 ])))
> (clobber (scratch:QI))
> ]) "include/adxintrin.h":69:10 350 {*addqi3_cconly_overflow_1}
>  (expr_list:REG_DEAD (reg:QI 88 [ _31 ])
> (nil)))
> So when substituting (reg:QI 88) for (ltu:QI flags 0), we initially get:
> (compare:CCC (plus:QI (ltu:QI (reg:CCC 17 flags) (const_int [0])) (const_int 
> -1 [0x]))
>  (ltu:QI (reg:CCC 17 flags) (const_int [0])))
> On this triggers simplify_binary_operation_1 rule:
>   /* (plus (comparison A B) C) can become (neg (rev-comp A B)) if
>  C is 1 and STORE_FLAG_VALUE is -1 or if C is -1 and STORE_FLAG_VALUE
>  is 1.  */
>   if (COMPARISON_P (op0)
>   && ((STORE_FLAG_VALUE == -1 && trueop1 == const1_rtx)
>   || (STORE_FLAG_VALUE == 1 && trueop1 == constm1_rtx))
>   && (reversed = reversed_comparison (op0, mode)))
> return
>   simplify_gen_unary (NEG, mode, reversed, mode);
> As STORE_FLAG_VALUE is 1 on i386, it triggers for that -1.
>
> Now, in CCCmode we have just 2 possible values, 0 and 1, CF clear and CF set.
> So, either (ltu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 0, in that case
> (geu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 1 and so
> (compare:CCC (neg:QI (geu:QI (reg:CCC FLAGS_REG) (const_int 0)))
>  (ltu:QI (reg:CCC FLAGS_REG) (const_int 0
> is (compare:CCC (neg:QI (const_int 1 [0x1])) (const_int 0 [0]))
> which is (compare:CCC (const_int -1 [0x]) (const_int 0 [0]))
> Or (ltu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 1, in that case
> (geu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 0 and so
> (compare:CCC (neg:QI (geu:QI (reg:CCC FLAGS_REG) (const_int 0)))
>  (ltu:QI (reg:CCC FLAGS_REG) (const_int 0
> is (compare:CCC (neg:QI (const_int 0 [0x0])) (const_int 1 [0x1]))
> which is (compare:CCC (const_int 0 [0]) (const_int 1 [0x1]))
> As CCCmode flags can be only used in LTU or GEU comparisons, we are asking if
> 0xffU < 0 (false, CF clear) or 0 < 1 (true, CF set).
>
> So I think the pattern is meaningful and is really a nop.

Phew ... it took me some time to wrap my mind around the above logic.

The explanation clears my concerns, and also answers the question if
this simplification should be implemented in some generic,
target-independant way. No, due to all the target-dependant stuff,
mentioned in the explanation.

OK for mainline.

Thanks,
Uros.


Re: [PATCH] i386: Improve chaining of _{addcarry, subborrow}_u{32, 64} [PR97387]

2020-10-14 Thread Jakub Jelinek via Gcc-patches
On Wed, Oct 14, 2020 at 03:17:03PM +0200, Uros Bizjak wrote:
> > +(define_insn_and_split "*setcc_qi_addqi3_cconly_overflow_1_"
> > +  [(set (reg:CCC FLAGS_REG)
> > +   (compare:CCC (neg:QI (geu:QI (reg:CC_CCC FLAGS_REG) (const_int 0)))
> > +(ltu:QI (reg:CC_CCC FLAGS_REG) (const_int 0]
> > +  "ix86_pre_reload_split ()"
> > +  "#"
> > +  "&& 1"
> > +  [(const_int 0)])
> >
> 
> Hmm... does the above really represent a NOP?

It is just what combine.c + simplify-rtx.c make out of this.
We have:
(insn 10 9 11 2 (set (reg:QI 88 [ _31 ])
(ltu:QI (reg:CCC 17 flags)
(const_int 0 [0]))) "include/adxintrin.h":69:10 785 {*setcc_qi}
 (expr_list:REG_DEAD (reg:CCC 17 flags)
(nil)))
and
(insn 17 15 18 2 (parallel [
(set (reg:CCC 17 flags)
(compare:CCC (plus:QI (reg:QI 88 [ _31 ])
(const_int -1 [0x]))
(reg:QI 88 [ _31 ])))
(clobber (scratch:QI))
]) "include/adxintrin.h":69:10 350 {*addqi3_cconly_overflow_1}
 (expr_list:REG_DEAD (reg:QI 88 [ _31 ])
(nil)))
So when substituting (reg:QI 88) for (ltu:QI flags 0), we initially get:
(compare:CCC (plus:QI (ltu:QI (reg:CCC 17 flags) (const_int [0])) (const_int -1 
[0x]))
 (ltu:QI (reg:CCC 17 flags) (const_int [0])))
On this triggers simplify_binary_operation_1 rule:
  /* (plus (comparison A B) C) can become (neg (rev-comp A B)) if
 C is 1 and STORE_FLAG_VALUE is -1 or if C is -1 and STORE_FLAG_VALUE
 is 1.  */
  if (COMPARISON_P (op0)
  && ((STORE_FLAG_VALUE == -1 && trueop1 == const1_rtx)
  || (STORE_FLAG_VALUE == 1 && trueop1 == constm1_rtx))
  && (reversed = reversed_comparison (op0, mode)))
return
  simplify_gen_unary (NEG, mode, reversed, mode);
As STORE_FLAG_VALUE is 1 on i386, it triggers for that -1.

Now, in CCCmode we have just 2 possible values, 0 and 1, CF clear and CF set.
So, either (ltu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 0, in that case
(geu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 1 and so
(compare:CCC (neg:QI (geu:QI (reg:CCC FLAGS_REG) (const_int 0)))
 (ltu:QI (reg:CCC FLAGS_REG) (const_int 0
is (compare:CCC (neg:QI (const_int 1 [0x1])) (const_int 0 [0]))
which is (compare:CCC (const_int -1 [0x]) (const_int 0 [0]))
Or (ltu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 1, in that case
(geu:QI (reg:CCC 17 flags) (const_int 0 [0])) is 0 and so
(compare:CCC (neg:QI (geu:QI (reg:CCC FLAGS_REG) (const_int 0)))
 (ltu:QI (reg:CCC FLAGS_REG) (const_int 0
is (compare:CCC (neg:QI (const_int 0 [0x0])) (const_int 1 [0x1]))
which is (compare:CCC (const_int 0 [0]) (const_int 1 [0x1]))
As CCCmode flags can be only used in LTU or GEU comparisons, we are asking if
0xffU < 0 (false, CF clear) or 0 < 1 (true, CF set).

So I think the pattern is meaningful and is really a nop.

Jakub



Re: [PATCH] i386: Improve chaining of _{addcarry, subborrow}_u{32, 64} [PR97387]

2020-10-14 Thread Uros Bizjak via Gcc-patches
On Wed, Oct 14, 2020 at 11:01 AM Jakub Jelinek  wrote:
>
> Hi!
>
> These builtins have two known issues and this patch fixes one of them.
>
> One issue is that the builtins effectively return two results and
> they make the destination addressable until expansion, which means
> a stack slot is allocated for them and e.g. with -fstack-protector*
> DSE isn't able to optimize that away.  I think for that we want to use
> the technique of returning complex value; the patch doesn't handle that
> though.  See PR93990 for that.
>
> The other problem is optimization of successive uses of the builtin
> e.g. for arbitrary precision arithmetic additions/subtractions.
> As shown PR93990, combine is able to optimize the case when the first
> argument to these builtins is 0 (the first instance when several are used
> together), and also the last one if the last one ignores its result (i.e.
> the carry/borrow is dead and thrown away in that case).
> As shown in this PR, combiner refuses to optimize the rest, where it sees:
> (insn 10 9 11 2 (set (reg:QI 88 [ _31 ])
> (ltu:QI (reg:CCC 17 flags)
> (const_int 0 [0]))) "include/adxintrin.h":69:10 785 {*setcc_qi}
>  (expr_list:REG_DEAD (reg:CCC 17 flags)
> (nil)))
> - set pseudo 88 to CF from flags, then some uninteresting insns that
> don't modify flags, and finally:
> (insn 17 15 18 2 (parallel [
> (set (reg:CCC 17 flags)
> (compare:CCC (plus:QI (reg:QI 88 [ _31 ])
> (const_int -1 [0x]))
> (reg:QI 88 [ _31 ])))
> (clobber (scratch:QI))
> ]) "include/adxintrin.h":69:10 350 {*addqi3_cconly_overflow_1}
>  (expr_list:REG_DEAD (reg:QI 88 [ _31 ])
> (nil)))
> to set CF in flags back to what we saved earlier.  The combiner just punts
> trying to combine the 10, 17 and following addcarrydi (etc.) instruction,
> because
>   if (i1 && !can_combine_p (i1, i3, i0, NULL, i2, NULL, &i1dest, &i1src))
> {
>   if (dump_file && (dump_flags & TDF_DETAILS))
> fprintf (dump_file, "Can't combine i1 into i3\n");
>   undo_all ();
>   return 0;
> }
> fails - the 3 insns aren't all adjacent and
>   || (! all_adjacent
>   && (((!MEM_P (src)
> || ! find_reg_note (insn, REG_EQUIV, src))
>&& modified_between_p (src, insn, i3))
> src (flags hard register) is modified between the first and third insn - in
> the second insn.
>
> The following patch optimizes this by optimizing just the two insns,
> 10 and 17 above, i.e. save CF into pseudo, set CF from that pseudo, into
> a nop.  The new define_insn_and_split matches how combine simplifies those
> two together (except without the ix86_cc_mode change it was choosing CCmode
> for the destination instead of CCCmode, so had to change that function too,
> and also adjust costs so that combiner understand it is beneficial).
>
> With this, all the testcases are optimized, so that the:
> setc%dl
> ...
> addb$-1, %dl
> insns in between the ad[dc][lq] or s[ub]b[lq] instructions are all optimized
> away (sure, if something would clobber flags in between they wouldn't, but
> there is nothing that can be done about that).
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2020-10-14  Jakub Jelinek  
>
> PR target/97387
> * config/i386/i386.md (CC_CCC): New mode iterator.
> (*setcc_qi_addqi3_cconly_overflow_1_): New
> define_insn_and_split.
> * config/i386/i386.c (ix86_cc_mode): Return CCCmode for
> *setcc_qi_addqi3_cconly_overflow_1_ pattern operands.
> (ix86_rtx_costs): Return true and *total = 0; for
> *setcc_qi_addqi3_cconly_overflow_1_ pattern.
>
> * gcc.target/i386/pr97387-1.c: New test.
> * gcc.target/i386/pr97387-2.c: New test.
>
> --- gcc/config/i386/i386.md.jj  2020-10-01 10:40:09.955758167 +0200
> +++ gcc/config/i386/i386.md 2020-10-13 13:38:24.644980815 +0200
> @@ -7039,6 +7039,20 @@ (define_expand "subborrow_0"
>(set (match_operand:SWI48 0 "register_operand")
>(minus:SWI48 (match_dup 1) (match_dup 2)))])]
>"ix86_binary_operator_ok (MINUS, mode, operands)")
> +
> +(define_mode_iterator CC_CCC [CC CCC])
> +
> +;; Pre-reload splitter to optimize
> +;; *setcc_qi followed by *addqi3_cconly_overflow_1 with the same QI
> +;; operand and no intervening flags modifications into nothing.
> +(define_insn_and_split "*setcc_qi_addqi3_cconly_overflow_1_"
> +  [(set (reg:CCC FLAGS_REG)
> +   (compare:CCC (neg:QI (geu:QI (reg:CC_CCC FLAGS_REG) (const_int 0)))
> +(ltu:QI (reg:CC_CCC FLAGS_REG) (const_int 0]
> +  "ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)])
>

Hmm... does the above really represent a NOP?

This is a compare of a *negation* of a reversed condition with itself, e.g:

cmp (neg (reversed cond), (cond)

we are sure that (reversed cond) 

Re: [PATCH] i386: Improve chaining of _{addcarry, subborrow}_u{32, 64} [PR97387]

2020-10-14 Thread Richard Biener via Gcc-patches
On Wed, Oct 14, 2020 at 11:49 AM Jakub Jelinek  wrote:
>
> On Wed, Oct 14, 2020 at 11:22:48AM +0200, Richard Biener wrote:
> > > +  if (mode == CCCmode
> > > + && GET_CODE (XEXP (x, 0)) == NEG
> > > + && GET_CODE (XEXP (XEXP (x, 0), 0)) == GEU
> > > + && REG_P (XEXP (XEXP (XEXP (x, 0), 0), 0))
> > > + && (GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)) == CCCmode
> > > + || GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)) == CCmode)
> > > + && REGNO (XEXP (XEXP (XEXP (x, 0), 0), 0)) == FLAGS_REG
> > > + && XEXP (XEXP (XEXP (x, 0), 0), 1) == const0_rtx
> > > + && GET_CODE (XEXP (x, 1)) == LTU
> > > + && REG_P (XEXP (XEXP (x, 1), 0))
> > > + && (GET_MODE (XEXP (XEXP (x, 1), 0))
> > > + == GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)))
> > > + && REGNO (XEXP (XEXP (x, 1), 0)) == FLAGS_REG
> > > + && XEXP (XEXP (x, 1), 1) == const0_rtx)
> >
> > Meh ;)  templates to the rescue?
> >
> >   rtx_match < un > ().matches (x)
> >
> > and with fancy metaprogramming expand it to above?  Not sure if it's easier
> > to read that way.  Maybe
>
> It would certainly not match the style used elsewhere in the backend.
> >
> >   rtx neg, geu;
> >   if (mode == CCCmode
> >   && (neg = XEXP (x, 0), GET_CODE (neg) == NEG)
> >   && (geu = XEXP (neg, 0), GET_CODE (geu) == GEU)
> > ...
> >
> > or
> >
> >   if (mode == CCCmode
> >   && GET_CODE (neg = XEXP (x, 0)) == NEG
> >
> > thus some manual CSE and naming in this matching would help?
>
> Attached are two incremental patches, one just adds op0 and op1 for the
> COMPARE operand of all the costs COMPARE handling, which replaces all the
> XEXP (x, 0) with op0 and XEXP (x, 1) with op1, the other is that plus
> the geu you've suggested.

OK, so the visual appearance is not very much improved.  I guess
the main issue is the checks do not "align" with a view of the expression
but I can't see how to improve that.  When one actually looks at
the tests the CSEd vairants are easier to match-up and I'd pick the geu one.

Eventually intermixed comments would help the casual reader?

+  rtx geu;
   /* (neg:CCC */
   if (mode == CCCmode
+ && GET_CODE (op0) == NEG
   /* (geu (reg:CC[C] cc) const0)) */
+ && GET_CODE (geu = XEXP (op0, 0)) == GEU
+ && REG_P (XEXP (geu, 0))
+ && (GET_MODE (XEXP (geu, 0)) == CCCmode
+ || GET_MODE (XEXP (geu, 0)) == CCmode)
+ && REGNO (XEXP (geu, 0)) == FLAGS_REG
+ && XEXP (geu, 1) == const0_rtx
   /* (LTU (reg:CCCmode cc) const0)) */
+ && GET_CODE (op1) == LTU
+ && REG_P (XEXP (op1, 0))
+ && GET_MODE (XEXP (op1, 0)) == GET_MODE (XEXP (geu, 0))
+ && REGNO (XEXP (op1, 0)) == FLAGS_REG
+ && XEXP (op1, 1) == const0_rtx)


I'll leave the actual review to Uros of course.

Richard.

>
> Jakub


Re: [PATCH] i386: Improve chaining of _{addcarry, subborrow}_u{32, 64} [PR97387]

2020-10-14 Thread Jakub Jelinek via Gcc-patches
On Wed, Oct 14, 2020 at 11:22:48AM +0200, Richard Biener wrote:
> > +  if (mode == CCCmode
> > + && GET_CODE (XEXP (x, 0)) == NEG
> > + && GET_CODE (XEXP (XEXP (x, 0), 0)) == GEU
> > + && REG_P (XEXP (XEXP (XEXP (x, 0), 0), 0))
> > + && (GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)) == CCCmode
> > + || GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)) == CCmode)
> > + && REGNO (XEXP (XEXP (XEXP (x, 0), 0), 0)) == FLAGS_REG
> > + && XEXP (XEXP (XEXP (x, 0), 0), 1) == const0_rtx
> > + && GET_CODE (XEXP (x, 1)) == LTU
> > + && REG_P (XEXP (XEXP (x, 1), 0))
> > + && (GET_MODE (XEXP (XEXP (x, 1), 0))
> > + == GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)))
> > + && REGNO (XEXP (XEXP (x, 1), 0)) == FLAGS_REG
> > + && XEXP (XEXP (x, 1), 1) == const0_rtx)
> 
> Meh ;)  templates to the rescue?
> 
>   rtx_match < un > ().matches (x)
> 
> and with fancy metaprogramming expand it to above?  Not sure if it's easier
> to read that way.  Maybe

It would certainly not match the style used elsewhere in the backend.
> 
>   rtx neg, geu;
>   if (mode == CCCmode
>   && (neg = XEXP (x, 0), GET_CODE (neg) == NEG)
>   && (geu = XEXP (neg, 0), GET_CODE (geu) == GEU)
> ...
> 
> or
> 
>   if (mode == CCCmode
>   && GET_CODE (neg = XEXP (x, 0)) == NEG
> 
> thus some manual CSE and naming in this matching would help?

Attached are two incremental patches, one just adds op0 and op1 for the
COMPARE operand of all the costs COMPARE handling, which replaces all the
XEXP (x, 0) with op0 and XEXP (x, 1) with op1, the other is that plus
the geu you've suggested.

Jakub
--- gcc/config/i386/i386.c.jj   2020-10-14 11:31:38.0 +0200
+++ gcc/config/i386/i386.c  2020-10-14 11:37:02.843258215 +0200
@@ -19765,44 +19765,44 @@ ix86_rtx_costs (rtx x, machine_mode mode
   return false;
 
 case COMPARE:
-  if (GET_CODE (XEXP (x, 0)) == ZERO_EXTRACT
- && XEXP (XEXP (x, 0), 1) == const1_rtx
- && CONST_INT_P (XEXP (XEXP (x, 0), 2))
- && XEXP (x, 1) == const0_rtx)
+  rtx op0, op1;
+  op0 = XEXP (x, 0);
+  op1 = XEXP (x, 1);
+  if (GET_CODE (op0) == ZERO_EXTRACT
+ && XEXP (op0, 1) == const1_rtx
+ && CONST_INT_P (XEXP (op0, 2))
+ && op1 == const0_rtx)
{
  /* This kind of construct is implemented using test[bwl].
 Treat it as if we had an AND.  */
- mode = GET_MODE (XEXP (XEXP (x, 0), 0));
+ mode = GET_MODE (XEXP (op0, 0));
  *total = (cost->add
-   + rtx_cost (XEXP (XEXP (x, 0), 0), mode, outer_code,
+   + rtx_cost (XEXP (op0, 0), mode, outer_code,
opno, speed)
+ rtx_cost (const1_rtx, mode, outer_code, opno, speed));
  return true;
}
 
-  if (GET_CODE (XEXP (x, 0)) == PLUS
- && rtx_equal_p (XEXP (XEXP (x, 0), 0), XEXP (x, 1)))
+  if (GET_CODE (op0) == PLUS && rtx_equal_p (XEXP (op0, 0), op1))
{
  /* This is an overflow detection, count it as a normal compare.  */
- *total = rtx_cost (XEXP (x, 0), GET_MODE (XEXP (x, 0)),
-COMPARE, 0, speed);
+ *total = rtx_cost (op0, GET_MODE (op0), COMPARE, 0, speed);
  return true;
}
 
   if (mode == CCCmode
- && GET_CODE (XEXP (x, 0)) == NEG
- && GET_CODE (XEXP (XEXP (x, 0), 0)) == GEU
- && REG_P (XEXP (XEXP (XEXP (x, 0), 0), 0))
- && (GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)) == CCCmode
- || GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)) == CCmode)
- && REGNO (XEXP (XEXP (XEXP (x, 0), 0), 0)) == FLAGS_REG
- && XEXP (XEXP (XEXP (x, 0), 0), 1) == const0_rtx
- && GET_CODE (XEXP (x, 1)) == LTU
- && REG_P (XEXP (XEXP (x, 1), 0))
- && (GET_MODE (XEXP (XEXP (x, 1), 0))
- == GET_MODE (XEXP (XEXP (XEXP (x, 0), 0), 0)))
- && REGNO (XEXP (XEXP (x, 1), 0)) == FLAGS_REG
- && XEXP (XEXP (x, 1), 1) == const0_rtx)
+ && GET_CODE (op0) == NEG
+ && GET_CODE (XEXP (op0, 0)) == GEU
+ && REG_P (XEXP (XEXP (op0, 0), 0))
+ && (GET_MODE (XEXP (XEXP (op0, 0), 0)) == CCCmode
+ || GET_MODE (XEXP (XEXP (op0, 0), 0)) == CCmode)
+ && REGNO (XEXP (XEXP (op0, 0), 0)) == FLAGS_REG
+ && XEXP (XEXP (op0, 0), 1) == const0_rtx
+ && GET_CODE (op1) == LTU
+ && REG_P (XEXP (op1, 0))
+ && GET_MODE (XEXP (op1, 0)) == GET_MODE (XEXP (XEXP (op0, 0), 0))
+ && REGNO (XEXP (op1, 0)) == FLAGS_REG
+ && XEXP (op1, 1) == const0_rtx)
{
  /* This is *setcc_qi_addqi3_cconly_overflow_1_* patterns, a nop.  */
  *total = 0;
@@ -19810,8 +19810,7 @@ ix86_rtx_costs (rtx x, machine_mode mode
}
 
   /* The embedded comparison operand is completely free.  */
-  if (!general_operand (XEXP (x, 0

Re: [PATCH] i386: Improve chaining of _{addcarry, subborrow}_u{32, 64} [PR97387]

2020-10-14 Thread Richard Biener via Gcc-patches
On Wed, Oct 14, 2020 at 11:01 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> Hi!
>
> These builtins have two known issues and this patch fixes one of them.
>
> One issue is that the builtins effectively return two results and
> they make the destination addressable until expansion, which means
> a stack slot is allocated for them and e.g. with -fstack-protector*
> DSE isn't able to optimize that away.  I think for that we want to use
> the technique of returning complex value; the patch doesn't handle that
> though.  See PR93990 for that.
>
> The other problem is optimization of successive uses of the builtin
> e.g. for arbitrary precision arithmetic additions/subtractions.
> As shown PR93990, combine is able to optimize the case when the first
> argument to these builtins is 0 (the first instance when several are used
> together), and also the last one if the last one ignores its result (i.e.
> the carry/borrow is dead and thrown away in that case).
> As shown in this PR, combiner refuses to optimize the rest, where it sees:
> (insn 10 9 11 2 (set (reg:QI 88 [ _31 ])
> (ltu:QI (reg:CCC 17 flags)
> (const_int 0 [0]))) "include/adxintrin.h":69:10 785 {*setcc_qi}
>  (expr_list:REG_DEAD (reg:CCC 17 flags)
> (nil)))
> - set pseudo 88 to CF from flags, then some uninteresting insns that
> don't modify flags, and finally:
> (insn 17 15 18 2 (parallel [
> (set (reg:CCC 17 flags)
> (compare:CCC (plus:QI (reg:QI 88 [ _31 ])
> (const_int -1 [0x]))
> (reg:QI 88 [ _31 ])))
> (clobber (scratch:QI))
> ]) "include/adxintrin.h":69:10 350 {*addqi3_cconly_overflow_1}
>  (expr_list:REG_DEAD (reg:QI 88 [ _31 ])
> (nil)))
> to set CF in flags back to what we saved earlier.  The combiner just punts
> trying to combine the 10, 17 and following addcarrydi (etc.) instruction,
> because
>   if (i1 && !can_combine_p (i1, i3, i0, NULL, i2, NULL, &i1dest, &i1src))
> {
>   if (dump_file && (dump_flags & TDF_DETAILS))
> fprintf (dump_file, "Can't combine i1 into i3\n");
>   undo_all ();
>   return 0;
> }
> fails - the 3 insns aren't all adjacent and
>   || (! all_adjacent
>   && (((!MEM_P (src)
> || ! find_reg_note (insn, REG_EQUIV, src))
>&& modified_between_p (src, insn, i3))
> src (flags hard register) is modified between the first and third insn - in
> the second insn.
>
> The following patch optimizes this by optimizing just the two insns,
> 10 and 17 above, i.e. save CF into pseudo, set CF from that pseudo, into
> a nop.  The new define_insn_and_split matches how combine simplifies those
> two together (except without the ix86_cc_mode change it was choosing CCmode
> for the destination instead of CCCmode, so had to change that function too,
> and also adjust costs so that combiner understand it is beneficial).
>
> With this, all the testcases are optimized, so that the:
> setc%dl
> ...
> addb$-1, %dl
> insns in between the ad[dc][lq] or s[ub]b[lq] instructions are all optimized
> away (sure, if something would clobber flags in between they wouldn't, but
> there is nothing that can be done about that).
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2020-10-14  Jakub Jelinek  
>
> PR target/97387
> * config/i386/i386.md (CC_CCC): New mode iterator.
> (*setcc_qi_addqi3_cconly_overflow_1_): New
> define_insn_and_split.
> * config/i386/i386.c (ix86_cc_mode): Return CCCmode for
> *setcc_qi_addqi3_cconly_overflow_1_ pattern operands.
> (ix86_rtx_costs): Return true and *total = 0; for
> *setcc_qi_addqi3_cconly_overflow_1_ pattern.
>
> * gcc.target/i386/pr97387-1.c: New test.
> * gcc.target/i386/pr97387-2.c: New test.
>
> --- gcc/config/i386/i386.md.jj  2020-10-01 10:40:09.955758167 +0200
> +++ gcc/config/i386/i386.md 2020-10-13 13:38:24.644980815 +0200
> @@ -7039,6 +7039,20 @@ (define_expand "subborrow_0"
>(set (match_operand:SWI48 0 "register_operand")
>(minus:SWI48 (match_dup 1) (match_dup 2)))])]
>"ix86_binary_operator_ok (MINUS, mode, operands)")
> +
> +(define_mode_iterator CC_CCC [CC CCC])
> +
> +;; Pre-reload splitter to optimize
> +;; *setcc_qi followed by *addqi3_cconly_overflow_1 with the same QI
> +;; operand and no intervening flags modifications into nothing.
> +(define_insn_and_split "*setcc_qi_addqi3_cconly_overflow_1_"
> +  [(set (reg:CCC FLAGS_REG)
> +   (compare:CCC (neg:QI (geu:QI (reg:CC_CCC FLAGS_REG) (const_int 0)))
> +(ltu:QI (reg:CC_CCC FLAGS_REG) (const_int 0]
> +  "ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)])
>
>  ;; Overflow setting add instructions
>
> --- gcc/config/i386/i386.c.jj   2020-10-01 10:40:09.951758225 +0200
> +++ gcc/config/i386/i386.c  2020-10-13 13:40:20.471300518