Re: [aarch64][PATCH v2] Disable reg offset in quad-word store for Falkor

2018-01-24 Thread Siddhesh Poyarekar
On Wednesday 24 January 2018 06:29 PM, Siddhesh Poyarekar wrote:
>>> +  /* Avoid register indexing for 128-bit stores when the
>>> + AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE option is set.  */
>>> +  if (!optimize_size
>>> +  && type == ADDR_QUERY_STR
>>> +  && (aarch64_tune_params.extra_tuning_flags
>>> + & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE)
>>> +  && (mode == TImode || mode == TFmode
>>> + || aarch64_vector_data_mode_p (mode)))
>>> +    allow_reg_index_p = false;
>>
>> The aarch64_classify_vector_mode code has been reworked recently for SVE
>> so I'm not entirely
>> up to date with its logic, but I believe that
>> "aarch64_classify_vector_mode (mode)" will
>> allow 64-bit vector modes, which would not be using the 128-bit Q
>> register, so you may be disabling
>> register indexing for D-register memory stores.
> 
> I check this and fix the condition if necessary.

Looking back at the patch I remember why I used
aarch64_vector_data_mode_p; this is to catch the pattern
aarch64_simd_mov which optimizes a 64-bit store pair into a
single quad word store.  It should not avoid register indexing for any
other vector modes since their patterns won't pass ADDR_QUERY_STR.  In
any case, I will be doing the CPU2017 run without -mcpu=falkor, so I'll
report results from that.

Siddhesh


Re: [aarch64][PATCH v2] Disable reg offset in quad-word store for Falkor

2018-01-24 Thread Siddhesh Poyarekar
On Wednesday 24 January 2018 05:50 PM, Kyrill Tkachov wrote:
> I would tend towards making the costs usage more intelligent and
> differentiating
> between loads and stores but I agree that is definitely GCC 9 material.
> Whether this approach is an acceptable stopgap for GCC 8 is up to the
> aarch64 maintainers
> and will depend, among other things, on the impact it has on generic
> (non-Falkor) codegen.
> A good experiment to help this approach would be to compile a large
> codebase (for example CPU2017)
> with a non-Falkor -mcpu setting and make sure that there are no assembly
> changes (or minimal).
> This would help justify the aarch64.md constraint changes.

Thanks, I'll verify with CPU2017.

> file paths don't need the gcc/ because the ChangeLog file is already in
> the gcc/ directory
> 
>>     gcc/testsuite/
>>     * gcc/testsuite/gcc.target/aarch64/pr82533.c: New test case.
> 
> Similarly, you don't need the gcc/testsuite/ prefix.
> Also, since you have a bugzilla PR entry please reference it in the
> ChangeLog
> right above the file list:
> PR target/82533
> 
> That way when the patch is committed the SVN hooks will pick it up
> automagically and update bugzilla.

Ugh, sorry, I was tardy about this - I'll fix it up.

>> @@ -5530,6 +5530,16 @@ aarch64_classify_address (struct
>> aarch64_address_info *info,
>>  || vec_flags == VEC_ADVSIMD
>>  || vec_flags == VEC_SVE_DATA));
>>
>> +  /* Avoid register indexing for 128-bit stores when the
>> + AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE option is set.  */
>> +  if (!optimize_size
>> +  && type == ADDR_QUERY_STR
>> +  && (aarch64_tune_params.extra_tuning_flags
>> + & AARCH64_EXTRA_TUNE_SLOW_REGOFFSET_QUADWORD_STORE)
>> +  && (mode == TImode || mode == TFmode
>> + || aarch64_vector_data_mode_p (mode)))
>> +    allow_reg_index_p = false;
> 
> The aarch64_classify_vector_mode code has been reworked recently for SVE
> so I'm not entirely
> up to date with its logic, but I believe that
> "aarch64_classify_vector_mode (mode)" will
> allow 64-bit vector modes, which would not be using the 128-bit Q
> register, so you may be disabling
> register indexing for D-register memory stores.

I check this and fix the condition if necessary.

Thanks,
Siddhesh


Re: [aarch64][PATCH v2] Disable reg offset in quad-word store for Falkor

2018-01-24 Thread Kyrill Tkachov

Hi Siddhesh,

On 23/01/18 15:41, Siddhesh Poyarekar wrote:

Hi,

Here's v2 of the patch to disable register offset addressing mode for
stores of 128-bit values on Falkor because they're very costly.
Differences from the last version:

 - Incorporated changes Jim made to his patch earlier that I missed,
   i.e. adding an extra tuning parameter called
   SLOW_REGOFFSET_QUADWORD_STORE instead of making it conditional on
   TUNE_FALKOR.

 - Added a new query type ADDR_QUERY_STR to indicate the queried
   address is used for a store.  This way I can use it for other
   scenarios where stores are significantly more expensive than loads,
   such as pre/post modify addressing modes.

 - Incorporated the constraint functionality into
   aarch64_legitimate_address_p and aarch64_classify_address.

I evaluated the suggestion of using costs to do this but it's not
possible with the current costs as they do not differentiate between
loads and stores.  If modifying all passes that use these costs to
identify loads vs stores is considered OK (ivopts seems to be the
biggest user) then I can volunteer to do that work for gcc9 and
evetually replace this.



I would tend towards making the costs usage more intelligent and differentiating
between loads and stores but I agree that is definitely GCC 9 material.
Whether this approach is an acceptable stopgap for GCC 8 is up to the aarch64 
maintainers
and will depend, among other things, on the impact it has on generic 
(non-Falkor) codegen.
A good experiment to help this approach would be to compile a large codebase 
(for example CPU2017)
with a non-Falkor -mcpu setting and make sure that there are no assembly 
changes (or minimal).
This would help justify the aarch64.md constraint changes.

I have a couple of comments on the patch inline.

Thanks,
Kyrill




On Falkor, because of an idiosyncracy of how the pipelines are designed, a
quad-word store using a reg+reg addressing mode is almost twice as slow as an
add followed by a quad-word store with a single reg addressing mode.  So we
get better performance if we disallow addressing modes using register offsets
with quad-word stores.

This patch improves fpspeed by 0.3% and intspeed by 0.22% in CPU2017,
with omnetpp_s (4.3%) and pop2_s (2.6%) being the biggest winners.

2018-xx-xx  Jim Wilson  
Kugan Vivenakandarajah 
Siddhesh Poyarekar 

gcc/
* gcc/config/aarch64/aarch64-protos.h (aarch64_addr_query_type):
New member ADDR_QUERY_STR.
* gcc/config/aarch64/aarch64-tuning-flags.def
(SLOW_REGOFFSET_QUADWORD_STORE): New.
* gcc/config/aarch64/aarch64.c (qdf24xx_tunings): Add
SLOW_REGOFFSET_QUADWORD_STORE to tuning flags.
(aarch64_classify_address): Avoid register indexing for quad
mode stores when SLOW_REGOFFSET_QUADWORD_STORE is set.
* gcc/config/aarch64/constraints.md (Uts): New constraint.
* gcc/config/aarch64/aarch64.md (movti_aarch64, movtf_aarch64):
Use it.
* gcc/config/aarch64/aarch64-simd.md (aarch64_simd_mov):
Likewise.



file paths don't need the gcc/ because the ChangeLog file is already in the 
gcc/ directory


gcc/testsuite/
* gcc/testsuite/gcc.target/aarch64/pr82533.c: New test case.


Similarly, you don't need the gcc/testsuite/ prefix.
Also, since you have a bugzilla PR entry please reference it in the ChangeLog
right above the file list:
PR target/82533

That way when the patch is committed the SVN hooks will pick it up 
automagically and update bugzilla.


---
 gcc/config/aarch64/aarch64-protos.h |  4 
 gcc/config/aarch64/aarch64-simd.md  |  4 ++--
 gcc/config/aarch64/aarch64-tuning-flags.def |  4 
 gcc/config/aarch64/aarch64.c| 12 +++-
 gcc/config/aarch64/aarch64.md   |  6 +++---
 gcc/config/aarch64/constraints.md   |  7 +++
 gcc/testsuite/gcc.target/aarch64/pr82533.c  | 11 +++
 7 files changed, 42 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr82533.c

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index ef1b0bc8e28..5fedc85f283 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -120,6 +120,9 @@ enum aarch64_symbol_type
ADDR_QUERY_LDP_STP
   Query what is valid for a load/store pair.

+   ADDR_QUERY_STR
+  Query what is valid for a store.
+
ADDR_QUERY_ANY
   Query what is valid for at least one memory constraint, which may
   allow things that "m" doesn't.  For example, the SVE LDR and STR
@@ -128,6 +131,7 @@ enum aarch64_symbol_type
 enum aarch64_addr_query_type {
   ADDR_QUERY_M,
   ADDR_QUERY_LDP_STP,
+  ADDR_QUERY_STR,
   ADDR_QUERY_ANY
 };

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 

[aarch64][PATCH v2] Disable reg offset in quad-word store for Falkor

2018-01-23 Thread Siddhesh Poyarekar
Hi,

Here's v2 of the patch to disable register offset addressing mode for
stores of 128-bit values on Falkor because they're very costly.
Differences from the last version:

 - Incorporated changes Jim made to his patch earlier that I missed,
   i.e. adding an extra tuning parameter called
   SLOW_REGOFFSET_QUADWORD_STORE instead of making it conditional on
   TUNE_FALKOR.

 - Added a new query type ADDR_QUERY_STR to indicate the queried
   address is used for a store.  This way I can use it for other
   scenarios where stores are significantly more expensive than loads,
   such as pre/post modify addressing modes.

 - Incorporated the constraint functionality into
   aarch64_legitimate_address_p and aarch64_classify_address.

I evaluated the suggestion of using costs to do this but it's not
possible with the current costs as they do not differentiate between
loads and stores.  If modifying all passes that use these costs to
identify loads vs stores is considered OK (ivopts seems to be the
biggest user) then I can volunteer to do that work for gcc9 and
evetually replace this.



On Falkor, because of an idiosyncracy of how the pipelines are designed, a
quad-word store using a reg+reg addressing mode is almost twice as slow as an
add followed by a quad-word store with a single reg addressing mode.  So we
get better performance if we disallow addressing modes using register offsets
with quad-word stores.

This patch improves fpspeed by 0.3% and intspeed by 0.22% in CPU2017,
with omnetpp_s (4.3%) and pop2_s (2.6%) being the biggest winners.

2018-xx-xx  Jim Wilson  
Kugan Vivenakandarajah  
Siddhesh Poyarekar  

gcc/
* gcc/config/aarch64/aarch64-protos.h (aarch64_addr_query_type):
New member ADDR_QUERY_STR.
* gcc/config/aarch64/aarch64-tuning-flags.def
(SLOW_REGOFFSET_QUADWORD_STORE): New.
* gcc/config/aarch64/aarch64.c (qdf24xx_tunings): Add
SLOW_REGOFFSET_QUADWORD_STORE to tuning flags.
(aarch64_classify_address): Avoid register indexing for quad
mode stores when SLOW_REGOFFSET_QUADWORD_STORE is set.
* gcc/config/aarch64/constraints.md (Uts): New constraint.
* gcc/config/aarch64/aarch64.md (movti_aarch64, movtf_aarch64):
Use it.
* gcc/config/aarch64/aarch64-simd.md (aarch64_simd_mov):
Likewise.

gcc/testsuite/
* gcc/testsuite/gcc.target/aarch64/pr82533.c: New test case.
---
 gcc/config/aarch64/aarch64-protos.h |  4 
 gcc/config/aarch64/aarch64-simd.md  |  4 ++--
 gcc/config/aarch64/aarch64-tuning-flags.def |  4 
 gcc/config/aarch64/aarch64.c| 12 +++-
 gcc/config/aarch64/aarch64.md   |  6 +++---
 gcc/config/aarch64/constraints.md   |  7 +++
 gcc/testsuite/gcc.target/aarch64/pr82533.c  | 11 +++
 7 files changed, 42 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr82533.c

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index ef1b0bc8e28..5fedc85f283 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -120,6 +120,9 @@ enum aarch64_symbol_type
ADDR_QUERY_LDP_STP
   Query what is valid for a load/store pair.
 
+   ADDR_QUERY_STR
+  Query what is valid for a store.
+
ADDR_QUERY_ANY
   Query what is valid for at least one memory constraint, which may
   allow things that "m" doesn't.  For example, the SVE LDR and STR
@@ -128,6 +131,7 @@ enum aarch64_symbol_type
 enum aarch64_addr_query_type {
   ADDR_QUERY_M,
   ADDR_QUERY_LDP_STP,
+  ADDR_QUERY_STR,
   ADDR_QUERY_ANY
 };
 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 3d1f6a01cb7..48d92702723 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -131,9 +131,9 @@
 
 (define_insn "*aarch64_simd_mov"
   [(set (match_operand:VQ 0 "nonimmediate_operand"
-   "=w, Umq,  m,  w, ?r, ?w, ?r, w")
+   "=w, Umq, Uts,  w, ?r, ?w, ?r, w")
(match_operand:VQ 1 "general_operand"
-   "m,  Dz, w,  w,  w,  r,  r, Dn"))]
+   "m,  Dz,w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
&& (register_operand (operands[0], mode)
|| aarch64_simd_reg_or_zero (operands[1], mode))"
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index ea9ead234cb..04baf5b6de6 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -41,4 +41,8 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", 
SLOW_UNALIGNED_LDPW)
are not considered cheap.  */
 AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
+/* Don't use a register offset in a memory address for a quad-word store.  */