Re: [Neon intrinsics] Literal vector construction through vcombine is poor

2017-06-27 Thread Richard Earnshaw (lists)
On 27/06/17 07:13, Michael Collison wrote:
> Richard,
> 
> I reworked the patch using an assert as you suggested. Bootstrapped and 
> retested. Okay for trunk?
> 

Yes, fine thanks.

R.


RE: [Neon intrinsics] Literal vector construction through vcombine is poor

2017-06-26 Thread Michael Collison
Richard,

I reworked the patch using an assert as you suggested. Bootstrapped and 
retested. Okay for trunk?



Re: [Neon intrinsics] Literal vector construction through vcombine is poor

2017-06-23 Thread Richard Earnshaw (lists)
On 23/06/17 00:10, Michael Collison wrote:
> Richard,
> 
> I reworked the patch and retested on big endian as well as little. The 
> original code was performing two swaps in the big endian case which works out 
> to no swaps at all.
> 
> I also updated the ChangeLog per your comments. Okay for trunk?
> 
> 2017-06-19  Michael Collison  
> 
>   * config/aarch64/aarch64-simd.md (aarch64_combine): Directly
>   call aarch64_split_simd_combine.
>   * (aarch64_combine_internal): Delete pattern.
>   * config/aarch64/aarch64.c (aarch64_split_simd_combine):
>   Allow register and subreg operands.

RE: [Neon intrinsics] Literal vector construction through vcombine is poor

2017-06-22 Thread Michael Collison
Richard,

I reworked the patch and retested on big endian as well as little. The original 
code was performing two swaps in the big endian case which works out to no 
swaps at all.

I also updated the ChangeLog per your comments. Okay for trunk?

2017-06-19  Michael Collison  

* config/aarch64/aarch64-simd.md (aarch64_combine): Directly
call aarch64_split_simd_combine.
* (aarch64_combine_internal): Delete pattern.
* config/aarch64/aarch64.c (aarch64_split_simd_combine):
Allow register and subreg operands.


pr7057v4.patch
Description: pr7057v4.patch


Re: [Neon intrinsics] Literal vector construction through vcombine is poor

2017-06-19 Thread Richard Earnshaw (lists)
On 16/06/17 22:08, Michael Collison wrote:
> This patch improves code generation for literal vector construction by 
> expanding and exposing the pattern to rtl optimization earlier. The current 
> implementation delays splitting the pattern until after reload which results 
> in poor code generation for the following code:
> 
> 
> #include "arm_neon.h"
> 
> int16x8_t
> foo ()
> {
>   return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
> }
> 
> Trunk generates:
> 
> foo:
>   movi v1.2s, 0
>   movi v0.4h, 0x8
>   dup d2, v1.d[0]
>   ins v2.d[1], v0.d[0]
>   orr v0.16b, v2.16b, v2.16b
>   ret
> 
> With the patch we now generate:
> 
> foo:
>   movi v1.4h, 0x8
>   movi v0.4s, 0
>   ins v0.d[1], v1.d[0]
>   ret
> 
> Bootstrapped and tested on aarch64-linux-gnu. Okay for trunk.
> 
> 2017-06-15  Michael Collison  
> 
>   * config/aarch64/aarch64-simd.md(aarch64_combine_internal):
>   Convert from define_insn_and_split into define_expand
>   * config/aarch64/aarch64.c(aarch64_split_simd_combine):
>   Allow register and subreg operands.
> 

Your changelog entry is confusing.  You've deleted the
aarch64_combine_internal pattern entirely, having merged some of
its functionality directly into its caller (aarch64_combine).

So I think it should read:

* config/aarch64/aarch64-simd.md (aarch64_combine): Directly call
aarch64_split_simd_combine.
(aarch64_combine_internal): Delete pattern.
* ...

Note also there should be a space between the file name and the open
bracket for the first function name.

Why don't you need the big-endian code path any more?

R.

> 
> pr7057.patch
> 
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index c462164..4a253a9 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2807,27 +2807,11 @@
>        op1 = operands[1];
>        op2 = operands[2];
>      }
> -  emit_insn (gen_aarch64_combine_internal<mode> (operands[0], op1, op2));
> -  DONE;
> -}
> -)
>  
> -(define_insn_and_split "aarch64_combine_internal<mode>"
> -  [(set (match_operand:<VDBL> 0 "register_operand" "=&w")
> -        (vec_concat:<VDBL> (match_operand:VDC 1 "register_operand" "w")
> -                           (match_operand:VDC 2 "register_operand" "w")))]
> -  "TARGET_SIMD"
> -  "#"
> -  "&& reload_completed"
> -  [(const_int 0)]
> -{
> -  if (BYTES_BIG_ENDIAN)
> -    aarch64_split_simd_combine (operands[0], operands[2], operands[1]);
> -  else
> -    aarch64_split_simd_combine (operands[0], operands[1], operands[2]);
> +  aarch64_split_simd_combine (operands[0], op1, op2);
> +
>    DONE;
>  }
> -[(set_attr "type" "multiple")]
>  )
>  
>  (define_expand "aarch64_simd_combine<mode>"
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 2e385c4..46bd78b 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -1650,7 +1650,8 @@ aarch64_split_simd_combine (rtx dst, rtx src1, rtx src2)
>  
>    gcc_assert (VECTOR_MODE_P (dst_mode));
>  
> -  if (REG_P (dst) && REG_P (src1) && REG_P (src2))
> +  if (register_operand (dst, dst_mode) && register_operand (src1, src_mode)
> +      && register_operand (src2, src_mode))
>      {
>        rtx (*gen) (rtx, rtx, rtx);
>  
>