subject:"RFC\: allowing fwprop to propagate subregs"

Re: [PATCH] Do not handle SUBREG in apply_distributive_law (Re: RFC: allowing fwprop to propagate subregs)

2012-03-12 Thread Richard Guenther

On Wed, Mar 7, 2012 at 6:40 PM, Ulrich Weigand  wrote:
> Richard Kenner wrote:
>
>> > Given the current set of results, since I do not have any way to verify
>> > whether my simplify_set changes would actually trigger correctly, I'd
>> > rather propose to just remove the SUBREG case in apply_distributive_law
>> > (i.e. only apply the first patch below).
>> >
>> > Thoughts?
>>
>> I think that's reasonable.  But I'd replace it with a comment saying
>> what used to be there and why it was removed.
>
> Now that we're back in stage 1, I'd like to try and move forward with
> this again.  Here's a patch that implements your suggestion.
>
> Tested on arm-linux-gnueabi, i386-linux-gnu and s390-linux-gnu.
>
> OK for mainline?

Ok.

Thanks,
Richard.

> Bye,
> Ulrich
>
> 2012-03-07  Ulrich Weigand  
>
>        gcc/
>        * combine.c (apply_distributive_law): Do not distribute SUBREG.
>
> === modified file 'gcc/combine.c'
> --- gcc/combine.c       2012-02-07 15:48:52 +
> +++ gcc/combine.c       2012-02-22 11:57:19 +
> @@ -9286,36 +9286,22 @@
>       /* This is also a multiply, so it distributes over everything.  */
>       break;
>
> -    case SUBREG:
> -      /* Non-paradoxical SUBREGs distributes over all operations,
> -        provided the inner modes and byte offsets are the same, this
> -        is an extraction of a low-order part, we don't convert an fp
> -        operation to int or vice versa, this is not a vector mode,
> -        and we would not be converting a single-word operation into a
> -        multi-word operation.  The latter test is not required, but
> -        it prevents generating unneeded multi-word operations.  Some
> -        of the previous tests are redundant given the latter test,
> -        but are retained because they are required for correctness.
> -
> -        We produce the result slightly differently in this case.  */
> -
> -      if (GET_MODE (SUBREG_REG (lhs)) != GET_MODE (SUBREG_REG (rhs))
> -         || SUBREG_BYTE (lhs) != SUBREG_BYTE (rhs)
> -         || ! subreg_lowpart_p (lhs)
> -         || (GET_MODE_CLASS (GET_MODE (lhs))
> -             != GET_MODE_CLASS (GET_MODE (SUBREG_REG (lhs
> -         || paradoxical_subreg_p (lhs)
> -         || VECTOR_MODE_P (GET_MODE (lhs))
> -         || GET_MODE_SIZE (GET_MODE (SUBREG_REG (lhs))) > UNITS_PER_WORD
> -         /* Result might need to be truncated.  Don't change mode if
> -            explicit truncation is needed.  */
> -         || !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (x),
> -                                            GET_MODE (SUBREG_REG (lhs
> -       return x;
> -
> -      tem = simplify_gen_binary (code, GET_MODE (SUBREG_REG (lhs)),
> -                                SUBREG_REG (lhs), SUBREG_REG (rhs));
> -      return gen_lowpart (GET_MODE (x), tem);
> +    /* This used to handle SUBREG, but this turned out to be counter-
> +       productive, since (subreg (op ...)) usually is not handled by
> +       insn patterns, and this "optimization" therefore transformed
> +       recognizable patterns into unrecognizable ones.  Therefore the
> +       SUBREG case was removed from here.
> +
> +       It is possible that distributing SUBREG over arithmetic operations
> +       leads to an intermediate result than can then be optimized further,
> +       e.g. by moving the outer SUBREG to the other side of a SET as done
> +       in simplify_set.  This seems to have been the original intent of
> +       handling SUBREGs here.
> +
> +       However, with current GCC this does not appear to actually happen,
> +       at least on major platforms.  If some case is found where removing
> +       the SUBREG case here prevents follow-on optimizations, distributing
> +       SUBREGs ought to be re-added at that place, e.g. in simplify_set.  */
>
>     default:
>       return x;
>
>
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  ulrich.weig...@de.ibm.com
>

[PATCH] Do not handle SUBREG in apply_distributive_law (Re: RFC: allowing fwprop to propagate subregs)

2012-03-07 Thread Ulrich Weigand

Richard Kenner wrote:

> > Given the current set of results, since I do not have any way to verify
> > whether my simplify_set changes would actually trigger correctly, I'd
> > rather propose to just remove the SUBREG case in apply_distributive_law
> > (i.e. only apply the first patch below).
> > 
> > Thoughts?
> 
> I think that's reasonable.  But I'd replace it with a comment saying
> what used to be there and why it was removed.

Now that we're back in stage 1, I'd like to try and move forward with
this again.  Here's a patch that implements your suggestion.

Tested on arm-linux-gnueabi, i386-linux-gnu and s390-linux-gnu.

OK for mainline?

Bye,
Ulrich

2012-03-07  Ulrich Weigand  

gcc/
* combine.c (apply_distributive_law): Do not distribute SUBREG.

=== modified file 'gcc/combine.c'
--- gcc/combine.c   2012-02-07 15:48:52 +
+++ gcc/combine.c   2012-02-22 11:57:19 +
@@ -9286,36 +9286,22 @@
   /* This is also a multiply, so it distributes over everything.  */
   break;
 
-case SUBREG:
-  /* Non-paradoxical SUBREGs distributes over all operations,
-provided the inner modes and byte offsets are the same, this
-is an extraction of a low-order part, we don't convert an fp
-operation to int or vice versa, this is not a vector mode,
-and we would not be converting a single-word operation into a
-multi-word operation.  The latter test is not required, but
-it prevents generating unneeded multi-word operations.  Some
-of the previous tests are redundant given the latter test,
-but are retained because they are required for correctness.
-
-We produce the result slightly differently in this case.  */
-
-  if (GET_MODE (SUBREG_REG (lhs)) != GET_MODE (SUBREG_REG (rhs))
- || SUBREG_BYTE (lhs) != SUBREG_BYTE (rhs)
- || ! subreg_lowpart_p (lhs)
- || (GET_MODE_CLASS (GET_MODE (lhs))
- != GET_MODE_CLASS (GET_MODE (SUBREG_REG (lhs
- || paradoxical_subreg_p (lhs)
- || VECTOR_MODE_P (GET_MODE (lhs))
- || GET_MODE_SIZE (GET_MODE (SUBREG_REG (lhs))) > UNITS_PER_WORD
- /* Result might need to be truncated.  Don't change mode if
-explicit truncation is needed.  */
- || !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (x),
-GET_MODE (SUBREG_REG (lhs
-   return x;
-
-  tem = simplify_gen_binary (code, GET_MODE (SUBREG_REG (lhs)),
-SUBREG_REG (lhs), SUBREG_REG (rhs));
-  return gen_lowpart (GET_MODE (x), tem);
+/* This used to handle SUBREG, but this turned out to be counter-
+   productive, since (subreg (op ...)) usually is not handled by
+   insn patterns, and this "optimization" therefore transformed
+   recognizable patterns into unrecognizable ones.  Therefore the
+   SUBREG case was removed from here.
+
+   It is possible that distributing SUBREG over arithmetic operations
+   leads to an intermediate result than can then be optimized further,
+   e.g. by moving the outer SUBREG to the other side of a SET as done
+   in simplify_set.  This seems to have been the original intent of
+   handling SUBREGs here.
+
+   However, with current GCC this does not appear to actually happen,
+   at least on major platforms.  If some case is found where removing
+   the SUBREG case here prevents follow-on optimizations, distributing
+   SUBREGs ought to be re-added at that place, e.g. in simplify_set.  */
 
 default:
   return x;



-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com

Re: RFC: allowing fwprop to propagate subregs

2012-01-17 Thread Richard Kenner

> I tried to implement that suggestion, but interestingly enough I cannot
> really test it since I was unable to find any single case where that
> SUBREG case in apply_distributive_law actually causes any difference
> whatsoever in generated code.
> 
> Do you have any further suggestion of how to find a testcase (some
> particular source code and/or architecture)?

No.  It may well be irrelevant now and was only relevant in cases where
we just had SImode and not the narrower modes on some of the older
architectures such as a29k.  It may also be that the optimizations that
this caught are now done at tree level, though I think the former is
the more likely.

> Given the current set of results, since I do not have any way to verify
> whether my simplify_set changes would actually trigger correctly, I'd
> rather propose to just remove the SUBREG case in apply_distributive_law
> (i.e. only apply the first patch below).
> 
> Thoughts?

I think that's reasonable.  But I'd replace it with a comment saying
what used to be there and why it was removed.

Re: RFC: allowing fwprop to propagate subregs

2012-01-17 Thread Ulrich Weigand

Richard Kenner wrote:
> > Maybe the best solution would be to remove the SUBREG case from the generic
> > apply_distributive_law subroutine, and instead add a special check for the
> > distributed subreg case right at the above place in simplify_set; i.e. to
> > perform the inverse distribution only if it is already guaranteed that we
> > will also be able to move the subreg to the LHS ...
> 
> That could indeed work.

I tried to implement that suggestion, but interestingly enough I cannot
really test it since I was unable to find any single case where that
SUBREG case in apply_distributive_law actually causes any difference
whatsoever in generated code.

As test case I used the whole of libstdc++.so on the following set of
platforms:
  - i686-pc-linux
  - s390x-ibm-linux
  - powerpc-ibm-linux
  - arm-linux-gnueabi
and built the compiler and libstdc++.so for each of:
  - current mainline
  - current mainline plus the first patch below
  - current mainline plus both patches below

All three resulting object files were identical for every platform.

Do you have any further suggestion of how to find a testcase (some
particular source code and/or architecture)?

Given the current set of results, since I do not have any way to verify
whether my simplify_set changes would actually trigger correctly, I'd
rather propose to just remove the SUBREG case in apply_distributive_law
(i.e. only apply the first patch below).

Thoughts?

Thanks,
Ulrich

Patch A: Remove SUBREG case in apply_distributive_law

Index: gcc/combine.c
===
--- gcc/combine.c   (revision 183240)
+++ gcc/combine.c   (working copy)
@@ -9238,37 +9269,6 @@
   /* This is also a multiply, so it distributes over everything.  */
   break;
 
-case SUBREG:
-  /* Non-paradoxical SUBREGs distributes over all operations,
-provided the inner modes and byte offsets are the same, this
-is an extraction of a low-order part, we don't convert an fp
-operation to int or vice versa, this is not a vector mode,
-and we would not be converting a single-word operation into a
-multi-word operation.  The latter test is not required, but
-it prevents generating unneeded multi-word operations.  Some
-of the previous tests are redundant given the latter test,
-but are retained because they are required for correctness.
-
-We produce the result slightly differently in this case.  */
-
-  if (GET_MODE (SUBREG_REG (lhs)) != GET_MODE (SUBREG_REG (rhs))
- || SUBREG_BYTE (lhs) != SUBREG_BYTE (rhs)
- || ! subreg_lowpart_p (lhs)
- || (GET_MODE_CLASS (GET_MODE (lhs))
- != GET_MODE_CLASS (GET_MODE (SUBREG_REG (lhs
- || paradoxical_subreg_p (lhs)
- || VECTOR_MODE_P (GET_MODE (lhs))
- || GET_MODE_SIZE (GET_MODE (SUBREG_REG (lhs))) > UNITS_PER_WORD
- /* Result might need to be truncated.  Don't change mode if
-explicit truncation is needed.  */
- || !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (x),
-GET_MODE (SUBREG_REG (lhs
-   return x;
-
-  tem = simplify_gen_binary (code, GET_MODE (SUBREG_REG (lhs)),
-SUBREG_REG (lhs), SUBREG_REG (rhs));
-  return gen_lowpart (GET_MODE (x), tem);
-
 default:
   return x;
 }


Patch B: Re-implement SUBREG case specifically in simplify_set


Index: gcc/combine.c
===
--- gcc/combine.c   (revision 183240)
+++ gcc/combine.c   (working copy)
@@ -6299,6 +6299,7 @@
   rtx dest = SET_DEST (x);
   enum machine_mode mode
 = GET_MODE (src) != VOIDmode ? GET_MODE (src) : GET_MODE (dest);
+  rtx src_subreg;
   rtx other_insn;
   rtx *cc_use;
 
@@ -6496,6 +6497,10 @@
  and X being a REG or (subreg (reg)), we may be able to convert this to
  (set (subreg:m2 x) (op)).
 
+ Similarly, if we have (set x (op:m1 (subreg:m2 ...) (subreg:m2 ...))),
+ we may be able to first distribute the subreg over op, and then apply
+ the above transformation.
+
  We can always do this if M1 is narrower than M2 because that means that
  we only care about the low bits of the result.
 
@@ -6504,30 +6509,56 @@
  be undefined.  On machine where it is defined, this transformation is safe
  as long as M1 and M2 have the same number of words.  */
 
+  src_subreg = NULL_RTX;
   if (GET_CODE (src) == SUBREG && subreg_lowpart_p (src)
-  && !OBJECT_P (SUBREG_REG (src))
+  && !OBJECT_P (SUBREG_REG (src)))
+src_subreg = SUBREG_REG (src);
+  else if (GET_CODE (src) == IOR || GET_CODE (src) == XOR
+  || GET_CODE (src) == AND
+  || GET_CODE (src) == PLUS || GET_CODE (src) == MINUS)
+{
+  rtx lhs = XEXP (x, 0);
+  rtx rhs = XEXP (x, 1);
+
+  /* We can distribute non-paradoxical lowpart SUBREGs if the
+

Re: RFC: allowing fwprop to propagate subregs

2012-01-16 Thread Richard Kenner

> Maybe the best solution would be to remove the SUBREG case from the generic
> apply_distributive_law subroutine, and instead add a special check for the
> distributed subreg case right at the above place in simplify_set; i.e. to
> perform the inverse distribution only if it is already guaranteed that we
> will also be able to move the subreg to the LHS ...

That could indeed work.

Re: RFC: allowing fwprop to propagate subregs

2012-01-16 Thread Ulrich Weigand

Richard Kenner wrote:

> > > The problem appears to be that an RTX expression like
> > >
> > > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
> > >   (subreg:QI (reg:SI 66 [ b ]) 0))
> > >
> > > seems to be considered non-canonical by combine, and is instead
> > > transformed into
> > >
> > > (subreg:QI (minus:SI (reg:SI 64 [ a ])
> > >  (reg:SI 66 [ b ])) 0)
> > > On the one hand, this seems odd, as SUBREGs with a non-trivial
> > > expression inside will usually not match any insn pattern.  On
> > > the other hand, I guess this might still be useful behaviour
> > > for combine on platforms that support only word arithmetic,
> > > when the SUBREG might be considered a "split" point. 
> 
> No, I think the idea was that the outer SUBREG would be moved to the LHS
> of the assignment to make a "paradoxical SUBREG".  Except, as you point
> out, there doesn't seem to be an advantage of doing this for arithmetic
> unless you only have it in SImode (which it could, of course, check).

Right, I was mistaken about the "split" point case; this is done only
for SUBREG (MEM).  But you're correct that there is a special optimization
in simplify_set that will move the outer SUBREG to the LHS:

  /* If we have (set x (subreg:m1 (op:m2 ...) 0)) with OP being some operation,
 and X being a REG or (subreg (reg)), we may be able to convert this to
 (set (subreg:m2 x) (op)).

However, it would appear that this particular location is in fact the only
place where (subreg (op ...)) is handled; I don't see any other place where
this type of pattern would provide any benefit.

Maybe the best solution would be to remove the SUBREG case from the generic
apply_distributive_law subroutine, and instead add a special check for the
distributed subreg case right at the above place in simplify_set; i.e. to
perform the inverse distribution only if it is already guaranteed that we
will also be able to move the subreg to the LHS ...

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com

Re: RFC: allowing fwprop to propagate subregs

2012-01-12 Thread Richard Kenner

> [Richard, combine question at the bottom for you.  I've quoted Ulrich's
>  whole message in order to provide a bit of context.]

I don't remember ALL of this, but can perhaps make a few hints.

> > The problem appears to be that an RTX expression like
> >
> > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
> >   (subreg:QI (reg:SI 66 [ b ]) 0))
> >
> > seems to be considered non-canonical by combine, and is instead
> > transformed into
> >
> > (subreg:QI (minus:SI (reg:SI 64 [ a ])
> >  (reg:SI 66 [ b ])) 0)
> > On the one hand, this seems odd, as SUBREGs with a non-trivial
> > expression inside will usually not match any insn pattern.  On
> > the other hand, I guess this might still be useful behaviour
> > for combine on platforms that support only word arithmetic,
> > when the SUBREG might be considered a "split" point. 

No, I think the idea was that the outer SUBREG would be moved to the LHS
of the assignment to make a "paradoxical SUBREG".  Except, as you point
out, there doesn't seem to be an advantage of doing this for arithmetic
unless you only have it in SImode (which it could, of course, check).

> (Of course, this doesn't work for the compute-result-and-cc type patterns.)

Right.

> No immediate suggestions, sorry.  It looks like the combine case was
> added by this pre-egcs patch:
> 
> Wed Mar 18 05:54:25 1998  Richard Kenner  
> 
>   * combine.c (gen_binary): Don't make AND that does nothing.
>   (simplify_comparison, case AND): Commute AND and SUBREG.
> 
> so maybe Richard remembers (or has a record of) what this was designed
> to handle?

I suspect the two halves related to each other: we commute AND and SUBREG
in the hope that this will allow us to decide that the AND does nothing
and isn't needed.  But I don't see how this is related: this is the AND
case in a comparison and we have a MINUS.

Re: RFC: allowing fwprop to propagate subregs

2012-01-12 Thread Richard Sandiford

[Richard, combine question at the bottom for you.  I've quoted Ulrich's
 whole message in order to provide a bit of context.]

"Ulrich Weigand"  writes:
> Richard Sandiford wrote:
>> At the moment, fwprop will propagate constants and registers
>> even if no further rtl simplifications are possible:
>> 
>>   if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
>> flags |= PR_CAN_APPEAR;
>> 
>> What do you think about extending this to subregs?
>
> I've been testing your patch (in its latest version including
> the SUBREG test pointed out by H.J.), and ran into issues where
> this additional propagation suppresses other optimizations.
>
> In particular, I'm seeing this testsuite regression on x86_64:
> FAIL: gcc.target/i386/pr30315.c scan-assembler-times cmp 4
>
> The problem is that the combined "compute result and set
> condition code" insn patterns sometimes no longer match.
>
> The source code in question looks like:
>
> int c;
>
> unsigned char
> de (unsigned char a, unsigned char b)
> {
>   unsigned char difference = a - b;
>   if (difference > a)
> c--;
>   return difference;
> }
>
>
> Before your patch, we generate insns like:
>
> (insn 3 4 5 2 (set (reg/v:QI 63 [ a ])
> (subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal}
>  (expr_list:REG_DEAD (reg:SI 64 [ a ])
> (nil)))
>
> (insn 5 3 6 2 (set (reg/v:QI 65 [ b ])
> (subreg:QI (reg:SI 66 [ b ]) 0)) pr30315.i:5 66 {*movqi_internal}
>  (expr_list:REG_DEAD (reg:SI 66 [ b ])
> (nil)))
>
> (insn 9 6 10 2 (parallel [
> (set (reg/v:QI 59 [ difference ])
> (minus:QI (reg/v:QI 63 [ a ])
> (reg/v:QI 65 [ b ])))
> (clobber (reg:CC 17 flags))
> ]) pr30315.i:6 287 {*subqi_1}
>  (expr_list:REG_DEAD (reg/v:QI 65 [ b ])
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil
>
> (insn 10 9 11 2 (set (reg:CC 17 flags)
> (compare:CC (reg/v:QI 63 [ a ])
> (reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1}
>  (expr_list:REG_DEAD (reg/v:QI 63 [ a ])
> (nil)))
>
> which get combined into:
>
> (insn 4 2 3 2 (set (reg:SI 66 [ b ])
> (reg:SI 4 si [ b ])) pr30315.i:5 64 {*movsi_internal}
>  (expr_list:REG_DEAD (reg:SI 4 si [ b ])
> (nil)))
>
> (insn 3 4 5 2 (set (reg/v:QI 63 [ a ])
> (subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal}
>  (expr_list:REG_DEAD (reg:SI 64 [ a ])
> (nil)))
>
> (insn 10 9 11 2 (parallel [
> (set (reg:CCC 17 flags)
> (compare:CCC (minus:QI (reg/v:QI 63 [ a ])
> (subreg:QI (reg:SI 66 [ b ]) 0))
> (reg/v:QI 63 [ a ])))
> (set (reg/v:QI 59 [ difference ])
> (minus:QI (reg/v:QI 63 [ a ])
> (subreg:QI (reg:SI 66 [ b ]) 0)))
> ]) pr30315.i:7 322 {*subqi3_cc_overflow}
>  (expr_list:REG_DEAD (reg:SI 66 [ b ])
> (expr_list:REG_DEAD (reg/v:QI 63 [ a ])
> (nil
>
>
> Now, with your patch we have instead before combine:
>
> (insn 9 6 10 2 (parallel [
> (set (reg/v:QI 59 [ difference ])
> (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
> (subreg:QI (reg:SI 66 [ b ]) 0)))
> (clobber (reg:CC 17 flags))
> ]) pr30315.i:6 287 {*subqi_1}
>  (expr_list:REG_DEAD (reg:SI 66 [ b ])
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil
>
> (insn 10 9 11 2 (set (reg:CC 17 flags)
> (compare:CC (subreg:QI (reg:SI 64 [ a ]) 0)
> (reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1}
>  (expr_list:REG_DEAD (reg:SI 64 [ a ])
> (nil)))
>
> which combine is unfortunately unable to process:
>
> Trying 9 -> 10:
> Failed to match this instruction:
> (parallel [
> (set (reg:CC 17 flags)
> (compare:CC (subreg:QI (minus:SI (reg:SI 64 [ a ])
> (reg:SI 66 [ b ])) 0)
> (subreg:QI (reg:SI 64 [ a ]) 0)))
> (set (reg/v:QI 59 [ difference ])
> (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
> (subreg:QI (reg:SI 66 [ b ]) 0)))
> ])
>
>
> The problem appears to be that an RTX expression like
>
> (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
>   (subreg:QI (reg:SI 66 [ b ]) 0))
>
> seems to be considered non-canonical by combine, and is instead
> transformed into
>
> (subreg:QI (minus:SI (reg:SI 64 [ a ])
>  (reg:SI 66 [ b ])) 0)
>
> This happens via combine.c:apply_distributive_law.
>
>
> On the one hand, this seems odd, as SUBREGs with a non-trivial
> expression inside will usually not match any insn pattern.  On
> the other hand, I guess this might still be useful behaviour
> for combine on platforms that support only word arithmetic,
> when the SUBREG might be considered a "split" point.  (Of course,
> this doesn't work for the compute-result-and-cc type patte

Re: RFC: allowing fwprop to propagate subregs

2012-01-11 Thread Paolo Bonzini


On 01/11/2012 05:55 PM, Ulrich Weigand wrote:

In either case, it is always odd to have RTX in the insn stream
that isn't "stable" under common simplication ...   Do you have
any suggestions on how to fix this?  If we add the fwprop patch,
should we then disable apply_distributive_law for SUBREGs?


This strange canonicalization is also the cause of PR39726, where I also 
wanted to move subregs inside arithmetic operations.  See the attached 
patch (which is not enough to fix the PR; this is just the combine part).


Unfortunately, doing this also caused x86_64 regressions, I think in 
zero_extract patterns.


Paolo
change canonicalization

Index: gcc/Makefile.in
===
--- gcc/Makefile.in	(branch pr39726)
+++ gcc/Makefile.in	(working copy)
@@ -2844,7 +2844,7 @@ jump.o : jump.c $(CONFIG_H) $(SYSTEM_H) 
 simplify-rtx.o : simplify-rtx.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
$(RTL_H) $(REGS_H) hard-reg-set.h $(FLAGS_H) $(REAL_H) insn-config.h \
$(RECOG_H) $(EXPR_H) $(TOPLEV_H) output.h $(FUNCTION_H) $(GGC_H) $(TM_P_H) \
-   $(TREE_H) $(TARGET_H)
+   $(TREE_H) $(TARGET_H) $(OPTABS_H)
 cgraph.o : cgraph.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(TREE_H) \
langhooks.h $(TOPLEV_H) $(FLAGS_H) $(GGC_H) $(TARGET_H) $(CGRAPH_H) \
gt-cgraph.h output.h intl.h $(BASIC_BLOCK_H) debug.h $(HASHTAB_H) \
Index: gcc/simplify-rtx.c
===
--- gcc/simplify-rtx.c	(branch pr39726)
+++ gcc/simplify-rtx.c	(working copy)
@@ -39,6 +39,7 @@ along with GCC; see the file COPYING3.  
 #include "output.h"
 #include "ggc.h"
 #include "target.h"
+#include "optabs.h"
 
 /* Simplification and canonicalization of RTL.  */
 
@@ -5191,6 +5192,8 @@ rtx
 simplify_subreg (enum machine_mode outermode, rtx op,
 		 enum machine_mode innermode, unsigned int byte)
 {
+  int is_lowpart;
+
   /* Little bit of sanity checking.  */
   gcc_assert (innermode != VOIDmode);
   gcc_assert (outermode != VOIDmode);
@@ -5300,11 +5303,13 @@ simplify_subreg (enum machine_mode outer
   return NULL_RTX;
 }
 
+  is_lowpart = subreg_lowpart_offset (outermode, innermode) == byte;
+
   /* Merge implicit and explicit truncations.  */
 
   if (GET_CODE (op) == TRUNCATE
   && GET_MODE_SIZE (outermode) < GET_MODE_SIZE (innermode)
-  && subreg_lowpart_offset (outermode, innermode) == byte)
+  && is_lowpart)
 return simplify_gen_unary (TRUNCATE, outermode, XEXP (op, 0),
 			   GET_MODE (XEXP (op, 0)));
 
@@ -5343,7 +5348,7 @@ simplify_subreg (enum machine_mode outer
 	 The information is used only by alias analysis that can not
 	 grog partial register anyway.  */
 
-	  if (subreg_lowpart_offset (outermode, innermode) == byte)
+	  if (is_lowpart)
 	ORIGINAL_REGNO (x) = ORIGINAL_REGNO (op);
 	  return x;
 	}
@@ -5393,6 +5398,51 @@ simplify_subreg (enum machine_mode outer
   return NULL_RTX;
 }
 
+  /* Try to move a subreg inside an arithmetic operation.  */
+  if (is_lowpart && ARITHMETIC_P (op)
+  && GET_MODE_CLASS (outermode) == MODE_INT
+  && GET_MODE_CLASS (innermode) == MODE_INT)
+{
+  enum insn_code ic;
+  enum machine_mode cnt_mode;
+  switch (GET_CODE (op))
+	{
+	case ABS:
+	case NEG:
+	  return simplify_gen_unary (GET_CODE (op), outermode,
+ rtl_hooks.gen_lowpart_no_emit
+  (outermode, XEXP (op, 0)),
+ outermode);
+
+	case ASHIFT:
+	  /* i386 always uses QImode for the shift count.  Get the
+	 appropriate mode from the optab.  */
+	  ic = ashl_optab->handlers[outermode].insn_code;
+	  if (ic != CODE_FOR_nothing)
+	cnt_mode = insn_data[ic].operand[2].mode;
+	  else
+	cnt_mode = outermode;
+	  return simplify_gen_binary (GET_CODE (op), outermode,
+  rtl_hooks.gen_lowpart_no_emit
+   (outermode, XEXP (op, 0)),
+  rtl_hooks.gen_lowpart_no_emit
+   (cnt_mode, XEXP (op, 1)));
+
+	case PLUS:
+	case MINUS:
+	case AND:
+	case IOR:
+	case XOR:
+	  return simplify_gen_binary (GET_CODE (op), outermode,
+  rtl_hooks.gen_lowpart_no_emit
+   (outermode, XEXP (op, 0)),
+  rtl_hooks.gen_lowpart_no_emit
+   (outermode, XEXP (op, 1)));
+	default:
+	  break;
+	}
+}
+
   /* Optimize SUBREG truncations of zero and sign extended values.  */
   if ((GET_CODE (op) == ZERO_EXTEND
|| GET_CODE (op) == SIGN_EXTEND)
Index: gcc/combine.c
===
--- gcc/combine.c	(branch pr39726)
+++ gcc/combine.c	(working copy)
@@ -11155,47 +11155,6 @@ simplify_comparison (enum rtx_code code,
 	  continue;
 	}
 
-	  /* If this is (and:M1 (subreg:M2 X 0) (const_int C1)) where C1
-	 fits in both M1 and M2 and the SUBREG is either paradoxical
-	 or represents the low part, permute the SUBREG and the AND
-	 and try again.  */
-	  if (GET_CODE (XEXP (op0, 0)) == SUBREG)
-	{
-	  unsigned HOST_WIDE_INT c1;
-

Re: RFC: allowing fwprop to propagate subregs

2012-01-11 Thread Ulrich Weigand

Richard Sandiford wrote:

> At the moment, fwprop will propagate constants and registers
> even if no further rtl simplifications are possible:
> 
>   if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
> flags |= PR_CAN_APPEAR;
> 
> What do you think about extending this to subregs?

I've been testing your patch (in its latest version including
the SUBREG test pointed out by H.J.), and ran into issues where
this additional propagation suppresses other optimizations.

In particular, I'm seeing this testsuite regression on x86_64:
FAIL: gcc.target/i386/pr30315.c scan-assembler-times cmp 4

The problem is that the combined "compute result and set
condition code" insn patterns sometimes no longer match.

The source code in question looks like:

int c;

unsigned char
de (unsigned char a, unsigned char b)
{
  unsigned char difference = a - b;
  if (difference > a)
c--;
  return difference;
}


Before your patch, we generate insns like:

(insn 3 4 5 2 (set (reg/v:QI 63 [ a ])
(subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal}
 (expr_list:REG_DEAD (reg:SI 64 [ a ])
(nil)))

(insn 5 3 6 2 (set (reg/v:QI 65 [ b ])
(subreg:QI (reg:SI 66 [ b ]) 0)) pr30315.i:5 66 {*movqi_internal}
 (expr_list:REG_DEAD (reg:SI 66 [ b ])
(nil)))

(insn 9 6 10 2 (parallel [
(set (reg/v:QI 59 [ difference ])
(minus:QI (reg/v:QI 63 [ a ])
(reg/v:QI 65 [ b ])))
(clobber (reg:CC 17 flags))
]) pr30315.i:6 287 {*subqi_1}
 (expr_list:REG_DEAD (reg/v:QI 65 [ b ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil

(insn 10 9 11 2 (set (reg:CC 17 flags)
(compare:CC (reg/v:QI 63 [ a ])
(reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1}
 (expr_list:REG_DEAD (reg/v:QI 63 [ a ])
(nil)))

which get combined into:

(insn 4 2 3 2 (set (reg:SI 66 [ b ])
(reg:SI 4 si [ b ])) pr30315.i:5 64 {*movsi_internal}
 (expr_list:REG_DEAD (reg:SI 4 si [ b ])
(nil)))

(insn 3 4 5 2 (set (reg/v:QI 63 [ a ])
(subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal}
 (expr_list:REG_DEAD (reg:SI 64 [ a ])
(nil)))

(insn 10 9 11 2 (parallel [
(set (reg:CCC 17 flags)
(compare:CCC (minus:QI (reg/v:QI 63 [ a ])
(subreg:QI (reg:SI 66 [ b ]) 0))
(reg/v:QI 63 [ a ])))
(set (reg/v:QI 59 [ difference ])
(minus:QI (reg/v:QI 63 [ a ])
(subreg:QI (reg:SI 66 [ b ]) 0)))
]) pr30315.i:7 322 {*subqi3_cc_overflow}
 (expr_list:REG_DEAD (reg:SI 66 [ b ])
(expr_list:REG_DEAD (reg/v:QI 63 [ a ])
(nil


Now, with your patch we have instead before combine:

(insn 9 6 10 2 (parallel [
(set (reg/v:QI 59 [ difference ])
(minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
(subreg:QI (reg:SI 66 [ b ]) 0)))
(clobber (reg:CC 17 flags))
]) pr30315.i:6 287 {*subqi_1}
 (expr_list:REG_DEAD (reg:SI 66 [ b ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil

(insn 10 9 11 2 (set (reg:CC 17 flags)
(compare:CC (subreg:QI (reg:SI 64 [ a ]) 0)
(reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1}
 (expr_list:REG_DEAD (reg:SI 64 [ a ])
(nil)))

which combine is unfortunately unable to process:

Trying 9 -> 10:
Failed to match this instruction:
(parallel [
(set (reg:CC 17 flags)
(compare:CC (subreg:QI (minus:SI (reg:SI 64 [ a ])
(reg:SI 66 [ b ])) 0)
(subreg:QI (reg:SI 64 [ a ]) 0)))
(set (reg/v:QI 59 [ difference ])
(minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
(subreg:QI (reg:SI 66 [ b ]) 0)))
])


The problem appears to be that an RTX expression like

(minus:QI (subreg:QI (reg:SI 64 [ a ]) 0)
  (subreg:QI (reg:SI 66 [ b ]) 0))

seems to be considered non-canonical by combine, and is instead
transformed into

(subreg:QI (minus:SI (reg:SI 64 [ a ])
 (reg:SI 66 [ b ])) 0)

This happens via combine.c:apply_distributive_law.


On the one hand, this seems odd, as SUBREGs with a non-trivial
expression inside will usually not match any insn pattern.  On
the other hand, I guess this might still be useful behaviour
for combine on platforms that support only word arithmetic,
when the SUBREG might be considered a "split" point.  (Of course,
this doesn't work for the compute-result-and-cc type patterns.)


In either case, it is always odd to have RTX in the insn stream
that isn't "stable" under common simplication ...   Do you have
any suggestions on how to fix this?  If we add the fwprop patch,
should we then disable apply_distributive_law for SUBREGs?


Thanks,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com

Re: RFC: allowing fwprop to propagate subregs

2011-09-14 Thread Paolo Bonzini

On 09/14/2011 05:44 PM, Richard Sandiford wrote:

>  A SUBREG may not be REG nor CONSTANT. Don't you need
>  to check REG_P/CONSTANT_P on SUBREG?

Yeah, good point.  There should be a "&&  REG_P (SUBREG_REG (new_rtx))"
in there.  Probably also worth checking for non-paradoxical subregs.

I was going to suggest the former.

Paradoxical subregs are already handled elsewhere in fwprop, but it also 
shouldn't hurt to include them.

Paolo

Re: RFC: allowing fwprop to propagate subregs

2011-09-14 Thread Richard Sandiford

"H.J. Lu"  writes:
> On Wed, Sep 14, 2011 at 8:24 AM, Richard Sandiford
>  wrote:
>> At the moment, fwprop will propagate constants and registers
>> even if no further rtl simplifications are possible:
>>
>>  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
>>    flags |= PR_CAN_APPEAR;
>>
>> What do you think about extending this to subregs?  The reason for
>> asking is that on NEON, vector loads like vld4 are represented as a load
>> of a single monolithic register followed by subreg extractions of each
>> vector:
>>
>>  (set (reg:OI FULL) (...))
>>  (set (reg:V2SI V0) (subreg:V2SI (reg:OI FULL) 0))
>>  (set (reg:V2SI V1) (subreg:V2SI (reg:OI FULL) 16))
>>  (set (reg:V2SI V2) (subreg:V2SI (reg:OI FULL) 32))
>>  (set (reg:V2SI V3) (subreg:V2SI (reg:OI FULL) 48))
>>
>> Nothing ever propagates these subregs, so the separate moves
>> survive until IRA.  This has three problems:
>>
>>  - We generally want the registers allocated to V0...V3 to be the same
>>    as FULL, so that the four subreg moves become nops.  And this often
>>    happens in simple examples.  But if register pressure is relatively
>>    high, these moves can sometimes cause IRA to spill in cases where
>>    it doesn't if the subregs are used instead of each Vi.
>>
>>  - Perhaps related, register pressure becomes harder to estimate.
>>
>>  - These moves can interfere with pre-reload scheduling.
>>
>> In combination with the MODES_TIEABLE_P patch that I posted here:
>>
>>    http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00626.html
>>
>> this patch significantly improves the code generated for several libav
>> loops.  Unfortunately, I don't have a setup that can do meaningful
>> x86_64 performance measurements, but a diff of the before and after
>> output for libav showed many cases where the patch removed moves.
>>
>> What do you think?  Alternatives include propagating in lower-subreg,
>> or maybe only in the second fwprop pass.
>>
>> Richard
>>
>>
>> gcc/
>>        * fwprop.c (propagate_rtx): Also set PR_CAN_APPEAR for subregs.
>>
>> Index: gcc/fwprop.c
>> ===
>> --- gcc/fwprop.c        2011-08-26 09:58:28.829540497 +0100
>> +++ gcc/fwprop.c        2011-08-26 10:14:03.767707504 +0100
>> @@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode
>>     return NULL_RTX;
>>
>>   flags = 0;
>> -  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
>> +  if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == 
>> SUBREG)
>>     flags |= PR_CAN_APPEAR;
>>   if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
>>     flags |= PR_HANDLE_MEM;
>>
>
> A SUBREG may not be REG nor CONSTANT. Don't you need
> to check REG_P/CONSTANT_P on SUBREG?

Yeah, good point.  There should be a "&& REG_P (SUBREG_REG (new_rtx))"
in there.  Probably also worth checking for non-paradoxical subregs.

Richard

Re: RFC: allowing fwprop to propagate subregs

2011-09-14 Thread H.J. Lu

On Wed, Sep 14, 2011 at 8:24 AM, Richard Sandiford
 wrote:
> At the moment, fwprop will propagate constants and registers
> even if no further rtl simplifications are possible:
>
>  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
>    flags |= PR_CAN_APPEAR;
>
> What do you think about extending this to subregs?  The reason for
> asking is that on NEON, vector loads like vld4 are represented as a load
> of a single monolithic register followed by subreg extractions of each
> vector:
>
>  (set (reg:OI FULL) (...))
>  (set (reg:V2SI V0) (subreg:V2SI (reg:OI FULL) 0))
>  (set (reg:V2SI V1) (subreg:V2SI (reg:OI FULL) 16))
>  (set (reg:V2SI V2) (subreg:V2SI (reg:OI FULL) 32))
>  (set (reg:V2SI V3) (subreg:V2SI (reg:OI FULL) 48))
>
> Nothing ever propagates these subregs, so the separate moves
> survive until IRA.  This has three problems:
>
>  - We generally want the registers allocated to V0...V3 to be the same
>    as FULL, so that the four subreg moves become nops.  And this often
>    happens in simple examples.  But if register pressure is relatively
>    high, these moves can sometimes cause IRA to spill in cases where
>    it doesn't if the subregs are used instead of each Vi.
>
>  - Perhaps related, register pressure becomes harder to estimate.
>
>  - These moves can interfere with pre-reload scheduling.
>
> In combination with the MODES_TIEABLE_P patch that I posted here:
>
>    http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00626.html
>
> this patch significantly improves the code generated for several libav
> loops.  Unfortunately, I don't have a setup that can do meaningful
> x86_64 performance measurements, but a diff of the before and after
> output for libav showed many cases where the patch removed moves.
>
> What do you think?  Alternatives include propagating in lower-subreg,
> or maybe only in the second fwprop pass.
>
> Richard
>
>
> gcc/
>        * fwprop.c (propagate_rtx): Also set PR_CAN_APPEAR for subregs.
>
> Index: gcc/fwprop.c
> ===
> --- gcc/fwprop.c        2011-08-26 09:58:28.829540497 +0100
> +++ gcc/fwprop.c        2011-08-26 10:14:03.767707504 +0100
> @@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode
>     return NULL_RTX;
>
>   flags = 0;
> -  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
> +  if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == 
> SUBREG)
>     flags |= PR_CAN_APPEAR;
>   if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
>     flags |= PR_HANDLE_MEM;
>

A SUBREG may not be REG nor CONSTANT. Don't you need
to check REG_P/CONSTANT_P on SUBREG?


-- 
H.J.

RFC: allowing fwprop to propagate subregs

2011-09-14 Thread Richard Sandiford

At the moment, fwprop will propagate constants and registers
even if no further rtl simplifications are possible:

  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
flags |= PR_CAN_APPEAR;

What do you think about extending this to subregs?  The reason for
asking is that on NEON, vector loads like vld4 are represented as a load
of a single monolithic register followed by subreg extractions of each
vector:

  (set (reg:OI FULL) (...))
  (set (reg:V2SI V0) (subreg:V2SI (reg:OI FULL) 0))
  (set (reg:V2SI V1) (subreg:V2SI (reg:OI FULL) 16))
  (set (reg:V2SI V2) (subreg:V2SI (reg:OI FULL) 32))
  (set (reg:V2SI V3) (subreg:V2SI (reg:OI FULL) 48))

Nothing ever propagates these subregs, so the separate moves
survive until IRA.  This has three problems:

  - We generally want the registers allocated to V0...V3 to be the same
as FULL, so that the four subreg moves become nops.  And this often
happens in simple examples.  But if register pressure is relatively
high, these moves can sometimes cause IRA to spill in cases where
it doesn't if the subregs are used instead of each Vi.

  - Perhaps related, register pressure becomes harder to estimate.

  - These moves can interfere with pre-reload scheduling.

In combination with the MODES_TIEABLE_P patch that I posted here:

http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00626.html

this patch significantly improves the code generated for several libav
loops.  Unfortunately, I don't have a setup that can do meaningful
x86_64 performance measurements, but a diff of the before and after
output for libav showed many cases where the patch removed moves.

What do you think?  Alternatives include propagating in lower-subreg,
or maybe only in the second fwprop pass.

Richard


gcc/
* fwprop.c (propagate_rtx): Also set PR_CAN_APPEAR for subregs.

Index: gcc/fwprop.c
===
--- gcc/fwprop.c2011-08-26 09:58:28.829540497 +0100
+++ gcc/fwprop.c2011-08-26 10:14:03.767707504 +0100
@@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode 
 return NULL_RTX;
 
   flags = 0;
-  if (REG_P (new_rtx) || CONSTANT_P (new_rtx))
+  if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == SUBREG)
 flags |= PR_CAN_APPEAR;
   if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
 flags |= PR_HANDLE_MEM;

Re: [PATCH] Do not handle SUBREG in apply_distributive_law (Re: RFC: allowing fwprop to propagate subregs)

[PATCH] Do not handle SUBREG in apply_distributive_law (Re: RFC: allowing fwprop to propagate subregs)

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

Re: RFC: allowing fwprop to propagate subregs

RFC: allowing fwprop to propagate subregs

14 matches

Site Navigation

Mail list logo

Footer information