With the proposed cost change for vector construction we will end up
vectorizing the testcase in PR68961 again (on x86_64 and likely
on ppc64le as well after that target gets adjustments).  Currently
we can't optimize that away again by noticing the direct overlap of
argument and return registers.  The obstacle is

(insn 7 4 8 2 (set (reg:V2DF 93)
        (vec_concat:V2DF (reg/v:DF 91 [ a ])
            (reg/v:DF 92 [ aa ]))) 
...
(insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
        (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
(insn 24 21 11 2 (set (reg:DI 100 [+8 ])
        (subreg:DI (reg:TI 88 [ D.1756 ]) 8))

which we eventually optimize to DFmode subregs of (reg:V2DF 93).

First of all, simplify_subreg doesn't handle subregs of a vec_concat
(easy fix below).
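
As a rough illustration of the part selection involved, here is a toy
model in plain C.  It is a sketch under stated assumptions, not the GCC
source: struct part_choice and choose_concat_part are hypothetical names,
sizes are plain byte counts rather than machine modes, and both parts of
the concat are assumed to have the same size.

```c
#include <assert.h>

/* Hypothetical stand-in for the outcome of the selection; not a GCC type.  */
struct part_choice
{
  int part_index;       /* 0 = first operand, 1 = second, -1 = no single part.  */
  unsigned int offset;  /* Byte offset of the subreg within the chosen part.  */
};

/* Toy model: decide which operand of a two-part concat is covered by a
   subreg of OUTER_SIZE bytes starting at BYTE.  PART_SIZE is the size in
   bytes of each part (assumed equal for both parts).  */
static struct part_choice
choose_concat_part (unsigned int part_size, unsigned int outer_size,
                    unsigned int byte)
{
  struct part_choice c;

  assert (part_size > 0 && outer_size > 0);

  if (byte < part_size)
    {
      /* The subreg starts inside the first part.  */
      c.part_index = 0;
      c.offset = byte;
    }
  else
    {
      /* The subreg starts inside the second part.  */
      c.part_index = 1;
      c.offset = byte - part_size;
    }

  /* A subreg that straddles the boundary between the parts cannot be
     rewritten as a subreg of just one of them.  */
  if (c.offset + outer_size > part_size)
    c.part_index = -1;

  return c;
}
```

With 8-byte parts, as for the V2DF vec_concat of two DF registers in
insn 7, an 8-byte subreg at byte 0 maps to the first operand (reg 91) and
one at byte 8 maps to the second (reg 92).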

Then combine doesn't like to simplify the multi-use case (it seems to
try some parallel).  So I went to fwprop, which eventually manages
to do this but throws away the result, (reg:DF 91) or (reg:DF 92),
because it is not a constant.  Thus I allow arbitrary simplification
results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
to be a magic flag to tell it to restrict to the case where all
uses can be simplified, nor to restrict simplifications to a REG.
But I don't see any undesirable simplifications of (subreg
([vec_]concat)).

For the testcase I'm not sure if I have to exclude some ABIs (mingw?).

Bootstrap and regtest are in progress on x86_64-unknown-linux-gnu.  I'll
install the simplify-rtx.c change if that succeeds but would like to
have opinions on the fwprop.c change.

Thanks,
Richard.

2016-06-10  Richard Biener  <rguent...@suse.de>

        PR rtl-optimization/68961
        * simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.
        * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
        to simplify to a non-constant.

        * gcc.target/i386/pr68961.c: New testcase.

Index: gcc/simplify-rtx.c
===================================================================
*** gcc/simplify-rtx.c  (revision 237286)
--- gcc/simplify-rtx.c  (working copy)
*************** simplify_subreg (machine_mode outermode,
*** 6108,6116 ****
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex values represented as CONCAT
!      of real and imaginary part.  */
!   if (GET_CODE (op) == CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
--- 6108,6117 ----
        && GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
      return adjust_address_nv (op, outermode, byte);
  
!   /* Handle complex or vector values represented as CONCAT or VEC_CONCAT
!      of two parts.  */
!   if (GET_CODE (op) == CONCAT
!       || GET_CODE (op) == VEC_CONCAT)
      {
        unsigned int part_size, final_offset;
        rtx part, res;
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c        (revision 237286)
--- gcc/fwprop.c        (working copy)
*************** propagate_rtx (rtx x, machine_mode mode,
*** 664,670 ****
        || (GET_CODE (new_rtx) == SUBREG
          && REG_P (SUBREG_REG (new_rtx))
          && (GET_MODE_SIZE (mode)
!             <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
--- 664,673 ----
        || (GET_CODE (new_rtx) == SUBREG
          && REG_P (SUBREG_REG (new_rtx))
          && (GET_MODE_SIZE (mode)
!             <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
!       || ((GET_CODE (new_rtx) == VEC_CONCAT
!          || GET_CODE (new_rtx) == CONCAT)
!         && GET_CODE (x) == SUBREG))
      flags |= PR_CAN_APPEAR;
    if (!varying_mem_p (new_rtx))
      flags |= PR_HANDLE_MEM;
Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c     (revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c     (working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */
