Hi Segher,

On 17/06/16 01:07, Segher Boessenkool wrote:
On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
With the proposed cost change for vector construction we will end up
vectorizing the testcase in PR68961 again (on x86_64 and likely
on ppc64le as well after that target gets adjustments).  Currently
we can't optimize that away again by noticing the direct overlap of
argument and return registers.  The obstacle is

(insn 7 4 8 2 (set (reg:V2DF 93)
         (vec_concat:V2DF (reg/v:DF 91 [ a ])
             (reg/v:DF 92 [ aa ])))
...
(insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
(insn 24 21 11 2 (set (reg:DI 100 [+8 ])
         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))

which we eventually optimize to DFmode subregs of (reg:V2DF 93).

First of all simplify_subreg doesn't handle the subregs of a vec_concat
(easy fix below).
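For reference, a minimal sketch of the kind of source that produces this shape (a hypothetical reduction, not the actual testcase from the PR; the type and function names are made up):

```c
/* Hypothetical reduction of the PR68961 situation: two doubles are
   packed into a pair the vectorizer can turn into a V2DF
   construction, and the aggregate is then returned through two
   64-bit integer registers, giving DImode subregs of the
   vec_concat result as in the RTL above.  */
typedef struct { double re, im; } complexd;

complexd
make_pair (double a, double aa)
{
  complexd c = { a, aa };   /* becomes (vec_concat:V2DF a aa) */
  return c;                 /* read back as two DImode subregs */
}
```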

Then combine doesn't like to simplify the multi-use case (it seems to
try some parallel).
Combine will not do a 2->2 combination currently.  Say it is combining
A with a later B into C, and the result of A is used again later, then
it tries a parallel of A with C.  That usually does not match an insn for
the target.

If this were a 3->2 (or 4->2) combination, or if A or C is a no-op move
(so that it will disappear in a later combine), combine will break the
parallel into two and see if that matches.  We can in fact do that for
2->2 combinations as well: this removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not make
the number of RTL insns lower.
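A minimal source-level sketch of that 2->2 shape (hypothetical function, not from the patch): insn A computes a shifted value whose result feeds a later insn B and is used again after that, so folding A into each user does not reduce the insn count but does remove the log_link.

```c
/* Insn A (the shift) has two users; a 2->2 combination folds the
   shift into each addition (e.g. "add xD, xN, xM, lsl 3" on
   AArch64) instead of keeping the shift in its own insn.  */
long
use_twice (long base1, long base2, long idx)
{
  long off = idx << 3;            /* insn A, result used twice */
  return (base1 + off) ^ (base2 + off);
}
```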

So I tried out the patch below.  It decreases code size on most targets
(mostly fixed length insn targets), and increases it a small bit on some
variable length insn targets (doing an op twice, instead of doing it once
and doing a move).  It looks to be all good there too, but there are so
many changes that it is almost impossible to really check.

So: can people try this out with their favourite benchmarks, please?

I hope to give this a run on AArch64, but I'll probably only manage to
get to it next week.
In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
Some benchmarks decrease in size, others increase.  One recurring theme I
spotted is shifts being repeatedly combined with arithmetic operations
rather than being computed once and the result reused.  For example:
    lsl    x30, x15, 3
    add    x4, x5, x30
    add    x9, x7, x30
    add    x24, x8, x30
    add    x10, x0, x30
    add    x2, x22, x30

becoming (modulo regalloc fluctuations):
    add    x14, x2, x15, lsl 3
    add    x13, x22, x15, lsl 3
    add    x21, x4, x15, lsl 3
    add    x6, x0, x15, lsl 3
    add    x3, x30, x15, lsl 3

which, while saving one instruction, can be harmful overall because the
extra shift operation in the arithmetic instructions can increase their
latency.  I believe the aarch64 RTX costs should convey this information.
Do you expect RTX costs to gate this behaviour?
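A plausible C shape behind codegen like the above (a hypothetical illustration, not taken from SPEC): one index, scaled by the 8-byte element size, feeding several address computations.

```c
/* Each a[i]-style access is base + (i << 3).  Before the change the
   shift is computed once (lsl) and reused by plain adds; with 2->2
   combining each access gets its own "add xD, xN, xM, lsl 3".  */
long
sum_indexed (long *a, long *b, long *c, long *d, long *e, long i)
{
  return a[i] + b[i] + c[i] + d[i] + e[i];
}
```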

Some occurrences that hurt code size look like:
    cmp    x8, x11, asr 5

being transformed into:
    asr    x12, x11, 5
    cmp    x12, x8, uxtw //zero-extend x8
with the user of the condition code inverted to match the changed order
of operands to the comparison.
I haven't looked at the RTL dumps yet to figure out why this is
happening; it could be a backend RTL representation issue.
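A hypothetical reduction of that comparison case (the function name is made up): AArch64 can fold the arithmetic shift straight into the compare ("cmp x8, x11, asr 5"), so materialising the shift in a separate asr followed by an extending cmp, as shown above, costs an extra instruction.

```c
/* Comparing against a right-shifted value; ideally this is a single
   compare with a shifted operand rather than asr + cmp.  */
int
less_than_shifted (long x, long y)
{
  return x < (y >> 5);
}
```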

Kyrill



Segher


diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
           && XVECLEN (newpat, 0) == 2
           && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
           && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-          && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-                 || set_noop_p (XVECEXP (newpat, 0, 1)))
           && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
           && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
           && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT

