Re: [PATCH], PR target/80718, Improve PowerPC splat double word

2017-05-23 Thread Richard Sandiford
Michael Meissner  writes:
> When I was comparing spec 2006 numbers between GCC 6.3 and 7.1, there was one
> benchmark that was noticeably slower (milc).  In looking at the code
> generated, the #1 hot function (mult_adj_su3_mat_vec) had some cases where
> automatic vectorization generated a splat of a double from memory.
>
> The register allocator did not use the load-with-splat instruction (LXVDSX)
> because all of the loads were register+offset.  For the scalar values that it
> could load into the FPR registers, it used the normal register+offset
> (d-form) load.  For the other scalar values, which would wind up in the
> traditional Altivec registers, the register allocator decided to load the
> value into a GPR and do a direct move.
>
> Now, it turns out that while the above code is inefficient, it is not the
> cause of the slowdown in the milc benchmark.  However, there might be other
> places where using a load, direct move, and double-word permute causes a
> performance problem, so I made this patch.
>
> The patch splits the splat into a register splat and a memory splat.  This
> forces the register allocator to convert the load to the indexed form which
> the LXVDSX instruction uses.  I did a spec 2006 run with these changes, and
> there were no significant performance differences with this patch.
>
> In the mult_adj_su3_mat_vec function, there were previously 5 GPR load,
> direct move, and permute sequences along with one LXVDSX.  With this patch,
> those GPR loads have been replaced with LXVDSXs.

It sounds like that might create the opposite problem, in that if the RA
ends up having to spill the input operand of a register splat, it'll be
forced to keep the instruction as a register splat and load the spill
slot into a temporary register.

I thought the advice was normally to do what the port did before the
patch and present the register and memory versions as alternatives of a
single pattern.  If there's no current way of getting efficient code in
that case then maybe we need to tweak some of the costing infrastructure...

Thanks,
Richard


Re: [PATCH], PR target/80718, Improve PowerPC splat double word

2017-05-22 Thread Segher Boessenkool
On Mon, May 22, 2017 at 02:32:44PM -0400, Michael Meissner wrote:
> The register allocator did not use the load-with-splat instruction (LXVDSX)
> because all of the loads were register+offset.  For the scalar values that it
> could load into the FPR registers, it used the normal register+offset
> (d-form) load.  For the other scalar values, which would wind up in the
> traditional Altivec registers, the register allocator decided to load the
> value into a GPR and do a direct move.

> The patch splits the splat into a register splat and a memory splat.  This
> forces the register allocator to convert the load to the indexed form which
> the LXVDSX instruction uses.

> Can I apply this patch to the trunk, and later apply it to the GCC 7 and 6
> branches?

Please add a comment near the new instruction patterns saying why reg and
mem are separate.  Okay for trunk and backports with that.  Thanks!


Segher


> [gcc]
> 2017-05-22  Michael Meissner  
> 
>   PR target/80718
>   * config/rs6000/vsx.md (vsx_splat_<mode>, VSX_D iterator): Split
>   V2DF/V2DI splat into two separate patterns, one that handles
>   registers, and the other that only handles memory.  Drop support
>   for splatting from a GPR on ISA 2.07 and then splitting the
>   splat into direct move and splat.
>   (vsx_splat_<mode>_reg): Likewise.
>   (vsx_splat_<mode>_mem): Likewise.
> 
> [gcc/testsuite]
> 2017-05-22  Michael Meissner  
> 
>   PR target/80718
>   * gcc.target/powerpc/pr80718.c: New test.


[PATCH], PR target/80718, Improve PowerPC splat double word

2017-05-22 Thread Michael Meissner
When I was comparing spec 2006 numbers between GCC 6.3 and 7.1, there was one
benchmark that was noticeably slower (milc).  In looking at the code generated,
the #1 hot function (mult_adj_su3_mat_vec) had some cases where automatic
vectorization generated a splat of a double from memory.
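
For reference, here is a reduced example of the kind of loop involved.  It is
only an illustration (the function and variable names are made up, not taken
from milc or from the new test case):

/* Illustrative only.  When this loop is vectorized for V2DF, the
   loop-invariant doubles coef[0] and coef[1] each have to be duplicated
   (splatted) across a vector, and they start out as reg+offset (d-form)
   memory references.  */
void
scale2 (double * __restrict a, const double * __restrict b,
	const double *coef, long n)
{
  long i;

  for (i = 0; i < n; i++)
    a[i] = a[i] * coef[0] + b[i] * coef[1];
}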

The register allocator did not use the load-with-splat instruction (LXVDSX)
because all of the loads were register+offset.  For the scalar values that it
could load into the FPR registers, it used the normal register+offset (d-form)
load.  For the other scalar values, which would wind up in the traditional
Altivec registers, the register allocator decided to load the value into a
GPR and do a direct move.
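
To make the contrast concrete, the two sequences look roughly as follows for a
splat of a double at offset 16 from a base pointer (register numbers and the
exact address set-up are illustrative, not taken from the actual milc code
generation):

    ld       9,16(3)        # d-form load of the double word into a GPR
    mtvsrd   32,9           # direct move into an Altivec (upper VSX) register
    xxpermdi 32,32,32,0     # double-word permute to splat it

versus the indexed-form load-and-splat that LXVDSX provides:

    addi     9,3,16         # build the address for the x-form access
    lxvdsx   32,0,9         # load the double word and splat it in one go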

Now, it turns out that while the above code is inefficient, it is not the cause
of the slowdown in the milc benchmark.  However, there might be other places
where using a load, direct move, and double-word permute causes a performance
problem, so I made this patch.

The patch splits the splat into a register splat and a memory splat.  This
forces the register allocator to convert the load to the indexed form which the
LXVDSX instruction uses.  I did a spec 2006 run with these changes, and there
were no significant performance differences with this patch.

In the mult_adj_su3_mat_vec function, there were previously 5 GPR load, direct
move, and permute sequences along with one LXVDSX.  With this patch, those GPR
loads have been replaced with LXVDSXs.

Can I apply this patch to the trunk, and later apply it to the GCC 7 and 6
branches?

[gcc]
2017-05-22  Michael Meissner  

PR target/80718
* config/rs6000/vsx.md (vsx_splat_<mode>, VSX_D iterator): Split
V2DF/V2DI splat into two separate patterns, one that handles
registers, and the other that only handles memory.  Drop support
for splatting from a GPR on ISA 2.07 and then splitting the
splat into direct move and splat.
(vsx_splat_<mode>_reg): Likewise.
(vsx_splat_<mode>_mem): Likewise.

[gcc/testsuite]
2017-05-22  Michael Meissner  

PR target/80718
* gcc.target/powerpc/pr80718.c: New test.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 248266)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -3066,29 +3066,36 @@ (define_expand "vsx_mergeh_<mode>"
 })
 
 ;; V2DF/V2DI splat
-(define_insn_and_split "vsx_splat_<mode>"
-  [(set (match_operand:VSX_D 0 "vsx_register_operand"
-                              "=<VSa>,<VSa>,we,<VSa>")
+(define_expand "vsx_splat_<mode>"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand")
         (vec_duplicate:VSX_D
-         (match_operand:<VS_scalar> 1 "splat_input_operand"
-                              "<VS_64reg>,Z,b, wA")))]
+         (match_operand:<VS_scalar> 1 "input_operand")))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+{
+  rtx op1 = operands[1];
+  if (MEM_P (op1))
+    operands[1] = rs6000_address_for_fpconvert (op1);
+  else if (!REG_P (op1))
+    op1 = force_reg (<MODE>mode, op1);
+})
+
+(define_insn "vsx_splat_<mode>_reg"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSa>,?we")
+        (vec_duplicate:VSX_D
+         (match_operand:<VS_scalar> 1 "gpc_reg_operand" "<VS_64reg>,b")))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
   "@
    xxpermdi %x0,%x1,%x1,0
-   lxvdsx %x0,%y1
-   mtvsrdd %x0,%1,%1
-   #"
-  "&& reload_completed && TARGET_POWERPC64 && !TARGET_P9_VECTOR
-   && int_reg_operand (operands[1], <VS_scalar>mode)"
-  [(set (match_dup 2)
-        (match_dup 1))
-   (set (match_dup 0)
-        (vec_duplicate:VSX_D (match_dup 2)))]
-{
-  operands[2] = gen_rtx_REG (<VS_scalar>mode, reg_or_subregno (operands[0]));
-}
-  [(set_attr "type" "vecperm,vecload,vecperm,vecperm")
-   (set_attr "length" "4,4,4,8")])
+   mtvsrdd %x0,%1,%1"
+  [(set_attr "type" "vecperm")])
+
+(define_insn "vsx_splat_<mode>_mem"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSa>")
+        (vec_duplicate:VSX_D
+         (match_operand:<VS_scalar> 1 "memory_operand" "Z")))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+  "lxvdsx %x0,%y1"
+  [(set_attr "type" "vecload")])
 
 ;; V4SI splat support
 (define_insn "vsx_splat_v4si"
Index: gcc/testsuite/gcc.target/powerpc/pr80718.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr80718.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80718.c	(revision 0)
@@ -0,0 +1,298 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O3 -ffast-math" } */
+
+/* Taken from the Spec 2006 milc benchmark.  Ultimately, GCC wants to generate
+   a DF splat from offsettable memory.  The register allocator decided it was
+   better to do the load in