On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments).  Currently
> we can't optimize that away again noticing the direct overlap of
> argument and return registers.  The obstacle is
>
> (insn 7 4 8 2 (set (reg:V2DF 93)
>         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>             (reg/v:DF 92 [ aa ])))
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
>
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).

Combine will not do a 2->2 combination currently.  Say it is combining
A with a later B into C, and the result of A is used again later; then
it tries a parallel of A with C.  That usually does not match an insn
for the target.

If this is a 3->2 (or 4->2) combination, or if A or C is a no-op move
(so that it will disappear later in combine), combine will break the
parallel into two and see if that matches.

We can in fact do that for 2->2 combinations as well: this removes a
log_link (from A to B), so combine cannot get into an infinite loop,
even though it does not make the number of RTL insns lower.

So I tried out the patch below.  It decreases code size on most targets
(mostly fixed-length insn targets), and increases it slightly on some
variable-length insn targets (doing an op twice, instead of doing it
once and doing a move).  It looks to be all good there too, but there
are so many changes that it is almost impossible to really check.

So: can people try this out with their favourite benchmarks, please?


Segher


diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
 	   && XVECLEN (newpat, 0) == 2
 	   && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
 	   && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-	   && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-		  || set_noop_p (XVECEXP (newpat, 0, 1)))
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
 	   && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
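
For readers who want to see the termination argument for the 2->2 case
in isolation, here is a toy model in plain C.  It is only a sketch and
not GCC code: toy_insn, toy_reads, toy_substitute and toy_links are
hypothetical names invented for this illustration, and an insn is
reduced to "one destination register plus the registers it reads".
Substituting A into B gives C; A is kept because its result is still
used later, so the insn count stays at two, but C no longer reads A's
destination, so the A->B link is gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_USES 8

/* Toy insn: one destination register and the registers it reads.
   A stand-in for an RTL SET, not GCC's real data structure.  */
typedef struct {
    int dest;
    int uses[TOY_MAX_USES];
    size_t n_uses;
} toy_insn;

/* True if INSN reads register REG.  */
static bool toy_reads(const toy_insn *insn, int reg)
{
    for (size_t i = 0; i < insn->n_uses; i++)
        if (insn->uses[i] == reg)
            return true;
    return false;
}

/* Build C from A and B: every read of A's destination in B is replaced
   by A's own inputs.  Returns false if A reads its own destination
   (substitution would not remove the dependency) or if the result does
   not fit in the fixed-size array.  */
static bool toy_substitute(const toy_insn *a, const toy_insn *b, toy_insn *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (toy_reads(a, a->dest))
        return false;

    c->dest = b->dest;
    c->n_uses = 0;
    for (size_t i = 0; i < b->n_uses; i++) {
        if (b->uses[i] != a->dest) {
            if (c->n_uses == TOY_MAX_USES)
                return false;
            c->uses[c->n_uses++] = b->uses[i];
            continue;
        }
        for (size_t j = 0; j < a->n_uses; j++) {
            if (c->n_uses == TOY_MAX_USES)
                return false;
            c->uses[c->n_uses++] = a->uses[j];
        }
    }
    return true;
}

/* Number of dependency links from A to B in this model:
   1 if B reads A's result, else 0.  */
static int toy_links(const toy_insn *a, const toy_insn *b)
{
    return toy_reads(b, a->dest) ? 1 : 0;
}
```

With A as "r3 = f(r1, r2)" and B as "r4 = g(r3, r5)", toy_links(A, B)
is 1 before the combination; after toy_substitute the new insn C reads
r1, r2 and r5, and toy_links(A, C) is 0.  Each successful 2->2 step
removes a link in this way, which is why repeating it cannot loop
forever in the model.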