Hi Segher,
On 17/06/16 01:07, Segher Boessenkool wrote:
On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
With the proposed cost change for vector construction we will end up
vectorizing the testcase in PR68961 again (on x86_64 and likely
on ppc64le as well after that target gets adjustments). Currently
we can't optimize that away again by noticing the direct overlap of
argument and return registers. The obstacle is:
(insn 7 4 8 2 (set (reg:V2DF 93)
(vec_concat:V2DF (reg/v:DF 91 [ a ])
(reg/v:DF 92 [ aa ])))
...
(insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
(subreg:DI (reg:TI 88 [ D.1756 ]) 0))
(insn 24 21 11 2 (set (reg:DI 100 [+8 ])
(subreg:DI (reg:TI 88 [ D.1756 ]) 8))
which we eventually optimize to DFmode subregs of (reg:V2DF 93).
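For reference, a function shaped roughly like this (my reconstruction from the RTL above; the actual PR68961 testcase may differ) produces that pattern, with the two incoming doubles concatenated into a V2DF and the aggregate then moved out again as two 64-bit pieces:

/* Hypothetical reduction, guessed from the RTL quoted above.  */
struct s { double a; double aa; };

struct s
f (double a, double aa)
{
  struct s r;
  r.a = a;
  r.aa = aa;
  return r;
}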
First of all, simplify_subreg doesn't handle subregs of a vec_concat
(easy fix below).
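Since that fix isn't quoted here, a minimal sketch of what it might look like, as a new case in simplify_subreg in simplify-rtx.c (my guess at its shape, not the actual patch):

/* Hypothetical sketch: a subreg of a VEC_CONCAT that lines up exactly
   with one operand can be replaced by that operand.  A real fix would
   also have to cope with VOIDmode constant operands and with subregs
   that span both operands.  */
if (GET_CODE (op) == VEC_CONCAT)
  {
    rtx lo = XEXP (op, 0);
    rtx hi = XEXP (op, 1);
    if (outermode == GET_MODE (lo) && byte == 0)
      return lo;
    if (outermode == GET_MODE (hi)
        && byte == GET_MODE_SIZE (GET_MODE (lo)))
      return hi;
  }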
Then combine doesn't like to simplify the multi-use case (it seems to try
some parallel).
Combine will not do a 2->2 combination currently. Say it is combining
A with a later B into C, and the result of A is used again later, then
it tries a parallel of A with C. That usually does not match an insn for
the target.
If this were a 3->2 (or 4->2) combination, or if A or C were a no-op move
(so that it will disappear in a later combine), combine will break the
parallel into two and see if that matches. We can in fact do that for
2->2 combinations as well: this removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not make
the number of RTL insns lower.
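To illustrate with a made-up insn stream (not taken from the PR): A computes a shift whose result is used by B and again later, so combining A into B cannot delete A:

;; Before: (reg 65) has two uses, so this is a 2->2 situation.
(insn A) (set (reg 65) (ashift (reg 64) (const_int 3)))
(insn B) (set (reg 66) (plus (reg 60) (reg 65)))
;; ... (reg 65) used again later ...

;; After: B has become C, and A stays behind for the remaining use.
(insn A) (set (reg 65) (ashift (reg 64) (const_int 3)))
(insn C) (set (reg 66) (plus (reg 60) (ashift (reg 64) (const_int 3))))

The parallel of A and C is tried as one pattern first; with this change, when that does not match, it is split back into the two separate insns.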
So I tried out the patch below. It decreases code size on most targets
(mostly fixed-length-insn targets), and increases it slightly on some
variable-length-insn targets (doing an op twice instead of doing it once
plus a move). It looks to be all good there too, but there are so many
changes that it is almost impossible to really check.
So: can people try this out with their favourite benchmarks, please?
I hope to give this a run on AArch64 but I'll probably manage to get to it
only next week.
In the meantime I've had a quick look at some SPEC2006 codegen on aarch64.
Some benchmarks decrease in size, others increase. One recurring theme I
spotted is shifts being repeatedly combined with arithmetic operations
rather than being computed once and the result reused. For example:
lsl x30, x15, 3
add x4, x5, x30
add x9, x7, x30
add x24, x8, x30
add x10, x0, x30
add x2, x22, x30
becoming (modulo regalloc fluctuations):
add x14, x2, x15, lsl 3
add x13, x22, x15, lsl 3
add x21, x4, x15, lsl 3
add x6, x0, x15, lsl 3
add x3, x30, x15, lsl 3
which, while saving one instruction, can be harmful overall because the
extra shift operation in the arithmetic instructions can increase their
latency. I believe the aarch64 rtx costs should convey this information.
Do you expect RTX costs to gate this behaviour?
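For reference, a source shape like the following (a guess at the kind of code involved, not the actual SPEC source) reproduces it: every element address needs i << 3, and with the patch the shift is folded into each add rather than computed once:

/* Hypothetical reproducer: each a[i] etc. needs base + (i << 3).  */
long
sum5 (long *a, long *b, long *c, long *d, long *e, long i)
{
  return a[i] + b[i] + c[i] + d[i] + e[i];
}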
Some occurrences that hurt code size look like:
cmp x8, x11, asr 5
being transformed into:
asr x12, x11, 5
cmp x12, x8, uxtw //zero-extend x8
with the user of the condition code inverted to match the changed order of
operands to the comparison.
I haven't looked at the RTL dumps yet to figure out why this is happening;
it could be a backend RTL representation issue.
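If it helps narrow it down, a shape like this (purely a guess, not taken from the benchmark) gives a compare against an arithmetic-shift operand of the kind shown above:

/* Hypothetical: compares a zero-extended 32-bit value against b >> 5,
   which can be emitted as a single cmp with an asr-shifted operand.  */
int
g (unsigned int a, long b)
{
  return (long) a < (b >> 5);
}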
Kyrill
Segher
diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
 && XVECLEN (newpat, 0) == 2
 && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
 && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
- && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
- || set_noop_p (XVECEXP (newpat, 0, 1)))
 && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
 && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
 && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT