Re: RFC: 2->2 combine patch

2016-06-20 Thread Segher Boessenkool
On Mon, Jun 20, 2016 at 02:39:06PM +0100, Kyrill Tkachov wrote:
> >So I tried out the patch below.  It decreases code size on most targets
> >(mostly fixed length insn targets), and increases it a small bit on some
> >variable length insn targets (doing an op twice, instead of doing it once
> >and doing a move).  It looks to be all good there too, but there are so
> >many changes that it is almost impossible to really check.
> >
> >So: can people try this out with their favourite benchmarks, please?
> 
> I hope to give this a run on AArch64 but I'll probably manage to get to it 
> only next week.
> In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
> Some benchmarks decrease in size, others increase. One recurring theme I
> spotted is shifts being repeatedly combined with arithmetic operations
> rather than being computed once and the result reused. For example:
> lsl  x30, x15, 3
> add  x4, x5, x30
> add  x9, x7, x30
> add  x24, x8, x30
> add  x10, x0, x30
> add  x2, x22, x30
> 
> becoming (modulo regalloc fluctuations):
> add  x14, x2, x15, lsl 3
> add  x13, x22, x15, lsl 3
> add  x21, x4, x15, lsl 3
> add  x6, x0, x15, lsl 3
> add  x3, x30, x15, lsl 3
> 
> which, while saving one instruction, can be harmful overall because the
> extra shift operation in the arithmetic instructions can increase their
> latency. I believe the aarch64 rtx costs should convey this information.
> Do you expect RTX costs to gate this behaviour?

Yes, RTX costs are used for *all* of combine's combinations.  So it seems
your add-with-lsl patterns have the same cost as a plain add?  If they were
more expensive, combine would reject this combination.
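
For illustration, here is a minimal sketch of the kind of TARGET_RTX_COSTS
handling that would make combine back off (this is not the actual aarch64
back end code; the function name and the extra-cycle figure are made up,
only the rtl.h macros and COSTS_N_INSNS are real):

  /* Hypothetical fragment of a target rtx_costs hook.  */
  static bool
  example_rtx_costs (rtx x, machine_mode mode, int outer_code,
                     int opno, int *total, bool speed)
  {
    if (GET_CODE (x) == PLUS
        && (GET_CODE (XEXP (x, 0)) == ASHIFT
            || GET_CODE (XEXP (x, 1)) == ASHIFT))
      {
        /* Charge the add plus an assumed extra cycle for the embedded
           shift when optimizing for speed; combine then keeps the
           separate lsl and plain adds if that is cheaper overall.  */
        *total = COSTS_N_INSNS (speed ? 2 : 1);
        return true;   /* Cost fully accounted for; do not recurse into X.  */
      }
    return false;      /* Fall back to the generic costing.  */
  }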

> Some occurrences that hurt code size look like:
> cmp  x8, x11, asr 5
> 
> being transformed into:
> asr  x12, x11, 5
> cmp  x12, x8, uxtw  // zero-extend x8
> with the user of the condition code inverted to match the change in the
> order of operands to the comparison.
> I haven't looked at the RTL dumps yet to figure out why this is happening;
> it could be a backend RTL representation issue.

That could be a target thing, yes; hard to tell.  It's not clear to me
what combination is being made here (if any).


Segher


Re: RFC: 2->2 combine patch

2016-06-20 Thread Kyrill Tkachov

Hi Segher,

On 17/06/16 01:07, Segher Boessenkool wrote:

On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:

With the proposed cost change for vector construction we will end up
vectorizing the testcase in PR68961 again (on x86_64 and likely
on ppc64le as well after that target gets adjustments).  Currently
we can't optimize that away again by noticing the direct overlap of
argument and return registers.  The obstacle is

(insn 7 4 8 2 (set (reg:V2DF 93)
 (vec_concat:V2DF (reg/v:DF 91 [ a ])
 (reg/v:DF 92 [ aa ])))
...
(insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
 (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
(insn 24 21 11 2 (set (reg:DI 100 [+8 ])
 (subreg:DI (reg:TI 88 [ D.1756 ]) 8))

which we eventually optimize to DFmode subregs of (reg:V2DF 93).

First of all simplify_subreg doesn't handle the subregs of a vec_concat
(easy fix below).
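
For concreteness, a rough sketch of the kind of vec_concat handling meant
here, not the actual fix that was posted; it assumes simplify_subreg's
outermode/op/byte arguments, and the real change needs more careful
mode and offset checks:

  /* If OP is a (vec_concat A B) and the requested subreg covers exactly
     one of the two operands, return that operand directly.  */
  if (GET_CODE (op) == VEC_CONCAT)
    {
      rtx part0 = XEXP (op, 0);
      rtx part1 = XEXP (op, 1);
      unsigned int part0_size = GET_MODE_SIZE (GET_MODE (part0));

      if (byte == 0 && outermode == GET_MODE (part0))
        return part0;
      if (byte == part0_size && outermode == GET_MODE (part1))
        return part1;
    }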

Then combine doesn't like to simplify the multi-use (it tries some
parallel it seems).

Combine will not do a 2->2 combination currently.  Say it is combining
A with a later B into C, and the result of A is used again later: then
it tries a parallel of A with C.  That usually does not match an insn
for the target.

If this is a 3->2 (or 4->2) combination, or if A or C is a no-op move
(so that it will disappear in a later combine), combine will break the
parallel into two and see if that matches.  We can in fact do that for
2->2 combinations as well: this removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not make
the number of RTL insns lower.

So I tried out the patch below.  It decreases code size on most targets
(mostly fixed length insn targets), and increases it a small bit on some
variable length insn targets (doing an op twice, instead of doing it once
and doing a move).  It looks to be all good there too, but there are so
many changes that it is almost impossible to really check.

So: can people try this out with their favourite benchmarks, please?


I hope to give this a run on AArch64 but I'll probably manage to get to it only 
next week.
In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
Some benchmarks decrease in size, others increase. One recurring theme I
spotted is shifts being repeatedly combined with arithmetic operations
rather than being computed once and the result reused. For example:
lsl  x30, x15, 3
add  x4, x5, x30
add  x9, x7, x30
add  x24, x8, x30
add  x10, x0, x30
add  x2, x22, x30

becoming (modulo regalloc fluctuations):
add  x14, x2, x15, lsl 3
add  x13, x22, x15, lsl 3
add  x21, x4, x15, lsl 3
add  x6, x0, x15, lsl 3
add  x3, x30, x15, lsl 3

which, while saving one instruction, can be harmful overall because the
extra shift operation in the arithmetic instructions can increase their
latency. I believe the aarch64 rtx costs should convey this information.
Do you expect RTX costs to gate this behaviour?
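
For reference, a made-up reduced C example (not taken from SPEC2006) that
produces this shape, one shift whose result feeds several additions:

  /* With the 2->2 patch, combine may fold the shift into each add
     ("add ..., lsl 3") instead of keeping the single lsl plus plain
     adds, depending on what the rtx costs say.  */
  long
  shift_feeds_many_adds (long a, long b, long c, long d, long x)
  {
    long t = x << 3;                  /* the single lsl             */
    return (a + t) ^ (b + t)          /* each add may absorb the    */
         ^ (c + t) ^ (d + t);         /* shift after combining      */
  }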

Some occurrences that hurt code size look like:
cmp  x8, x11, asr 5

being transformed into:
asr  x12, x11, 5
cmp  x12, x8, uxtw  // zero-extend x8
with the user of the condition code inverted to match the change in the
order of operands to the comparison.
I haven't looked at the RTL dumps yet to figure out why this is happening;
it could be a backend RTL representation issue.

Kyrill




Segher


diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   && XVECLEN (newpat, 0) == 2
   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-  && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-         || set_noop_p (XVECEXP (newpat, 0, 1)))
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT





RFC: 2->2 combine patch (was: Re: [PATCH] Allow fwprop to undo vectorization harm (PR68961))

2016-06-16 Thread Segher Boessenkool
On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments).  Currently
> we can't optimize that away again by noticing the direct overlap of
> argument and return registers.  The obstacle is
> 
> (insn 7 4 8 2 (set (reg:V2DF 93)
> (vec_concat:V2DF (reg/v:DF 91 [ a ])
> (reg/v:DF 92 [ aa ]))) 
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> 
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> 
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
> 
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).

Combine will not do a 2->2 combination currently.  Say it is combining
A with a later B into C, and the result of A is used again later: then
it tries a parallel of A with C.  That usually does not match an insn
for the target.

If this is a 3->2 (or 4->2) combination, or if A or C is a no-op move
(so that it will disappear in a later combine), combine will break the
parallel into two and see if that matches.  We can in fact do that for
2->2 combinations as well: this removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not make
the number of RTL insns lower.
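
To tie this to the code being changed, a rough sketch of the shape of the
check in try_combine after the patch (the rtl.h macros are real; the final
helper name is a made-up stand-in for the existing split-and-recognize
step):

  /* Sketch only: a PARALLEL of two plain SETs may now be split into two
     insns even when no I1 is involved and neither SET is a no-op,
     i.e. for a 2->2 combination.  */
  if (GET_CODE (newpat) == PARALLEL
      && XVECLEN (newpat, 0) == 2
      && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
      && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
      /* The "(i1 || set_noop_p (...) || set_noop_p (...))" condition
         removed by the patch used to stand here.  */
      && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
      && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT)
    {
      /* Hypothetical helper: put the first SET back on I2, the second
         on I3, and check that each one matches a machine insn.  */
      try_as_two_separate_insns (XVECEXP (newpat, 0, 0),
                                 XVECEXP (newpat, 0, 1));
    }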

So I tried out the patch below.  It decreases code size on most targets
(mostly fixed length insn targets), and increases it a small bit on some
variable length insn targets (doing an op twice, instead of doing it once
and doing a move).  It looks to be all good there too, but there are so
many changes that it is almost impossible to really check.

So: can people try this out with their favourite benchmarks, please?


Segher


diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   && XVECLEN (newpat, 0) == 2
   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-  && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-         || set_noop_p (XVECEXP (newpat, 0, 1)))
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT