Re: [PATCH ARM] Improve ARM memset inlining

2014-07-08 Thread Bin.Cheng
On Fri, Jul 4, 2014 at 1:17 PM, Bin Cheng bin.ch...@arm.com wrote:




 Hi Ramana,
 This is the rebased patch; there is no conflict against the latest trunk.  I am
 still doing some tests.  Is it OK if the tests are fine?
 Also, it depends on the patch at
 https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that
 patch too.

Hi Ramana,

Bootstrap and tests for this patch are done.  Is it ok for me to submit?

Thanks,
bin


Re: [PATCH ARM] Improve ARM memset inlining

2014-07-08 Thread Ramana Radhakrishnan


Hi Ramana,
This is the rebased patch; there is no conflict against the latest trunk.  I am
still doing some tests.  Is it OK if the tests are fine?
Also, it depends on the patch at
https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that
patch too.

Thanks,
bin



Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c(revision 212295)
+++ gcc/config/arm/arm.c(working copy)
@@ -1588,34 +1588,38 @@ const struct tune_params arm_slowmul_tune =
 {
   arm_slowmul_rtx_costs,
   NULL,
-  NULL,/* Sched adj cost.  */
-  3,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  NULL,/* Sched adj cost.  */
+  3,   /* Constant limit.  */
+  5,   /* Max cond insns.  */


Please make sure the comments stay aligned as they are today. I'm not
sure why I see the following diffs in your patch, since you really
shouldn't be touching those lines; that applies to all the cost tables. I
haven't called out in detail every place where you appear to have unrelated
formatting changes, but have done so in one cost table.


Please re-create a patch that doesn't have these hunks.


   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,/* Prefer constant pool.  */
+  true,/* Prefer constant pool.  */


Likewise.


   arm_default_branch_cost,
-  false,   /* Prefer LDRD/STRD.  */
-  {true, true},/* Prefer non short circuit.  */
-  arm_default_vec_cost,/* Vectorizer costs.  */
-  false,/* Prefer Neon for 64-bits bitops.  */
-  false, false  /* Prefer 32-bit encodings.  */
+  false,   /* Prefer LDRD/STRD.  */
+  {true, true},/* Prefer non short circuit.  */
+  arm_default_vec_cost,   /* Vectorizer costs.  */
+  false,   /* Prefer Neon for 64-bits bitops.  */
+  false, false,/* Prefer 32-bit encodings.  */


Likewise.


+  false,   /* Prefer Neon for stringops.  */
+  8/* Maximum insns to inline memset.  */
 };

 const struct tune_params arm_fastmul_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,/* Sched adj cost.  */
-  1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  NULL,/* Sched adj cost.  */
+  1,   /* Constant limit.  */
+  5,   /* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,/* Prefer constant pool.  */
+  true,/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,   /* Prefer LDRD/STRD.  */
-  {true, true},/* Prefer non short circuit.  */
-  arm_default_vec_cost,/* Vectorizer costs.  */
-  false,/* Prefer Neon for 64-bits bitops.  */
-  false, false  /* Prefer 32-bit encodings.  */
+  false,   /* Prefer LDRD/STRD.  */
+  {true, true},/* Prefer non short circuit.  */
+  arm_default_vec_cost,   /* Vectorizer costs.  */
+  false,   /* Prefer Neon for 64-bits bitops.  */
+  false, false,/* Prefer 32-bit encodings.  */
+  false,   /* Prefer Neon for stringops.  */
+  8/* Maximum insns to inline memset.  */
 };

 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1625,17 +1629,19 @@ const struct tune_params arm_strongarm_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,/* Sched adj cost.  */
-  1,   /* Constant limit.  */
-  3,   /* Max cond insns.  */
+  NULL,/* Sched adj cost.  */
+  1,   /* Constant limit.  */
+  3,   /* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,/* Prefer constant pool.  */
+  true,   

Re: [PATCH ARM] Improve ARM memset inlining

2014-06-27 Thread Ramana Radhakrishnan
On Tue, May 6, 2014 at 5:59 AM, bin.cheng bin.ch...@arm.com wrote:


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of bin.cheng
 Sent: Monday, May 05, 2014 3:21 PM
 To: Richard Earnshaw
 Cc: gcc-patches@gcc.gnu.org
 Subject: RE: [PATCH ARM] Improve ARM memset inlining

 Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
 also updated the patch.

  -Original Message-
  From: Richard Earnshaw
  Sent: Friday, May 02, 2014 10:00 PM
  To: Bin Cheng
  Cc: gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH ARM] Improve ARM memset inlining
 
  On 30/04/14 03:52, bin.cheng wrote:
   Hi,
   This patch expands small memset calls into direct memory set
   instructions by introducing setmemsi pattern.  For processors
   without NEON support, it expands memset using general store
   instruction.  For example, strd for 4-bytes aligned addresses.  For
   processors with NEON support, it expands memset using neon
   instructions like vstr and miscellaneous vst1.* instructions for
   both aligned and unaligned cases.
  
   This patch depends on
   http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
   vst1.64 will be generated for 32-bit aligned memory unit.
  
   There is also one leftover work of this patch:  Since vst1.*
   instructions only support post-increment addressing mode, the
   inlined memset for unaligned neon cases should be like:
 vmov.i32   q8, #...
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]
 
  Other than for zero, I'd expect the vmov to be vmov.i8 to move an
  arbitrary byte value into all lanes in a vector.
 I just used vmov.i32 as an example.  The element size is actually
 calculated by function neon_valid_immediate, which works as expected I
 think.

  After that, if the alignment is known to be more than 8-bit, I'd
  expect the vst1 instructions (with the exception of the last store if
  the length is not a multiple of the alignment) to use

  vst1.align {reg}, [addr-reg :align]!

  Hence, for 16-bit aligned data, we want

  vst1.16 {q8}, [r3:16]!
 Did I miss something important?  It seems to me the explicit alignment
 notes supported are 64/128/256.  So what do you mean by 16 bits
 alignment here?

 
   But for now, gcc can't do this and below code is generated:
 vmov.i32   q8, #...
 vst1.8     {q8}, [r3]
 add        r2, r3, #16
 add        r3, r2, #16
 vst1.8     {q8}, [r2]
 vst1.8     {q8}, [r3]
 add        r2, r3, #16
 vst1.8     {q8}, [r2]
  
   I investigated this issue.  The root cause lies in rtx cost returned
   by ARM backend.  Anyway, I think this is another issue and should be
   fixed in separated patch.

Ok looks like Charles B from Linaro has run into the same thing and
has some fixes to suggest in costs.

  
   Bootstrap and reg-test on cortex-a15, with or without neon support.
   Is it OK?
  
 
  Some more comments inline.
 
   Thanks,
   bin
  
  
   2014-04-29  Bin Cheng  bin.ch...@arm.com
  
 PR target/55701
 * config/arm/arm.md (setmem): New pattern.
 * config/arm/arm-protos.h (struct tune_params): New field.
 (arm_gen_setmem): New prototype.
 * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
 (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
 (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
 (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
 (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
 (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
 (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
 (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
 (arm_const_inline_cost): New function.
 (arm_block_set_max_insns): New function.
 (arm_block_set_straight_profit_p): New function.
 (arm_block_set_vect_profit_p): New function.
 (arm_block_set_unaligned_vect): New function.
 (arm_block_set_aligned_vect): New function.
 (arm_block_set_unaligned_straight): New function.
 (arm_block_set_aligned_straight): New function.
 (arm_block_set_vect, arm_gen_setmem): New functions.
  
   gcc/testsuite/ChangeLog
   2014-04-29  Bin Cheng  bin.ch...@arm.com
  
 PR target/55701
 * gcc.target/arm/memset-inline-1.c: New test.
 * gcc.target/arm/memset-inline-2.c: New test.
 * gcc.target/arm/memset-inline-3.c: New test.
 * gcc.target/arm/memset-inline-4.c: New test.
 * gcc.target/arm/memset-inline-5.c: New test.
 * gcc.target/arm/memset-inline-6.c: New test.
 * gcc.target/arm/memset-inline-7.c: New test.
 * gcc.target/arm/memset-inline-8.c: New test.
 * gcc.target/arm/memset-inline-9.c: New test.
  
  
   j1328-20140429.txt
  
  
   Index: gcc/config/arm/arm.c
  
 
 ==
  =
   --- gcc/config/arm/arm.c  (revision 209852)
   +++ gcc/config/arm/arm.c  (working

RE: [PATCH ARM] Improve ARM memset inlining

2014-06-04 Thread bin.cheng
Ping^4.

The original thread is
https://gcc.gnu.org/ml/gcc-patches/2014-05/msg00182.html, also there is some
info at https://gcc.gnu.org/ml/gcc-patches/2014-05/msg00182.html in the same
thread.

Thanks,
bin

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of bin.cheng
 Sent: Wednesday, May 28, 2014 4:53 PM
 To: Richard Earnshaw
 Cc: gcc-patches List
 Subject: RE: [PATCH ARM] Improve ARM memset inlining
 
 Ping^3
 
  -Original Message-
  From: Bin.Cheng [mailto:amker.ch...@gmail.com]
  Sent: Monday, May 19, 2014 2:40 PM
  To: Bin Cheng
  Cc: Richard Earnshaw; gcc-patches List
  Subject: Re: [PATCH ARM] Improve ARM memset inlining
 
  Ping^2
 
  Thanks,
  bin
 
  On Mon, May 12, 2014 at 11:17 AM, Bin.Cheng amker.ch...@gmail.com
  wrote:
   Ping.
  
   Thanks,
   bin
  
   On Tue, May 6, 2014 at 12:59 PM, bin.cheng bin.ch...@arm.com
 wrote:
  
  
 
   Precisely, I configured gcc with options --with-arch=armv7-a
   --with-cpu|--with-tune=cortex-a9.
   I read gcc documents and realized that -mcpu is ignored when
   -march is specified.  I don't know why gcc acts in this manner,
   but it leads to inconsistent configuration/command line behavior.
   If we configure GCC with --with-arch=armv7-a
   --with-cpu=cortex-a9, then only -march=armv7-a is passed to cc1.
   If we compile with -march=armv7-a -mcpu=cortex-a9, then gcc works
   fine and passes -march=armv7-a -mcpu=cortex-a9 to cc1.
  
   Even more weird cc1 warns that switch -mcpu=cortex-m4 conflicts
   with -march=armv7-m switch.
  
   Thanks,
   bin
  
  
  
  
  
  
  
   --
   Best Regards.
 
 
 
  --
  Best Regards.
 
 
 
 






RE: [PATCH ARM] Improve ARM memset inlining

2014-05-28 Thread bin.cheng
Ping^3

 -Original Message-
 From: Bin.Cheng [mailto:amker.ch...@gmail.com]
 Sent: Monday, May 19, 2014 2:40 PM
 To: Bin Cheng
 Cc: Richard Earnshaw; gcc-patches List
 Subject: Re: [PATCH ARM] Improve ARM memset inlining
 
 Ping^2
 
 Thanks,
 bin
 
 On Mon, May 12, 2014 at 11:17 AM, Bin.Cheng amker.ch...@gmail.com
 wrote:
  Ping.
 
  Thanks,
  bin
 
  On Tue, May 6, 2014 at 12:59 PM, bin.cheng bin.ch...@arm.com wrote:
 
 
 
  Precisely, I configured gcc with options --with-arch=armv7-a
  --with-cpu|--with-tune=cortex-a9.
  I read gcc documents and realized that -mcpu is ignored when
  -march is specified.  I don't know why gcc acts in this manner, but
  it leads to inconsistent configuration/command line behavior.
  If we configure GCC with --with-arch=armv7-a --with-cpu=cortex-a9,
  then only -march=armv7-a is passed to cc1.
  If we compile with -march=armv7-a -mcpu=cortex-a9, then gcc works
  fine and passes -march=armv7-a -mcpu=cortex-a9 to cc1.
 
  Even more weird cc1 warns that switch -mcpu=cortex-m4 conflicts with
  -march=armv7-m switch.
 
  Thanks,
  bin
 
 
 
 
 
 
 
  --
  Best Regards.
 
 
 
 --
 Best Regards.






Re: [PATCH ARM] Improve ARM memset inlining

2014-05-19 Thread Bin.Cheng
Ping^2

Thanks,
bin

On Mon, May 12, 2014 at 11:17 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 Ping.

 Thanks,
 bin

 On Tue, May 6, 2014 at 12:59 PM, bin.cheng bin.ch...@arm.com wrote:



 Precisely, I configured gcc with options --with-arch=armv7-a
 --with-cpu|--with-tune=cortex-a9.
 I read gcc documents and realized that -mcpu is ignored when -march is
 specified.  I don't know why gcc acts in this manner, but it leads to
 inconsistent configuration/command line behavior.
 If we configure GCC with --with-arch=armv7-a --with-cpu=cortex-a9, then
 only -march=armv7-a is passed to cc1.
 If we compile with -march=armv7-a -mcpu=cortex-a9, then gcc works fine and
 passes -march=armv7-a -mcpu=cortex-a9 to cc1.

 Even more weird cc1 warns that switch -mcpu=cortex-m4 conflicts with
 -march=armv7-m switch.

 Thanks,
 bin







 --
 Best Regards.



-- 
Best Regards.


Re: [PATCH ARM] Improve ARM memset inlining

2014-05-11 Thread Bin.Cheng
Ping.

Thanks,
bin

On Tue, May 6, 2014 at 12:59 PM, bin.cheng bin.ch...@arm.com wrote:


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of bin.cheng
 Sent: Monday, May 05, 2014 3:21 PM
 To: Richard Earnshaw
 Cc: gcc-patches@gcc.gnu.org
 Subject: RE: [PATCH ARM] Improve ARM memset inlining

 Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
 also updated the patch.

  -Original Message-
  From: Richard Earnshaw
  Sent: Friday, May 02, 2014 10:00 PM
  To: Bin Cheng
  Cc: gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH ARM] Improve ARM memset inlining
 
  On 30/04/14 03:52, bin.cheng wrote:
   Hi,
   This patch expands small memset calls into direct memory set
   instructions by introducing setmemsi pattern.  For processors
   without NEON support, it expands memset using general store
   instruction.  For example, strd for 4-bytes aligned addresses.  For
   processors with NEON support, it expands memset using neon
   instructions like vstr and miscellaneous vst1.* instructions for
   both aligned and unaligned cases.
  
   This patch depends on
   http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
   vst1.64 will be generated for 32-bit aligned memory unit.
  
   There is also one leftover work of this patch:  Since vst1.*
   instructions only support post-increment addressing mode, the
   inlined memset for unaligned neon cases should be like:
 vmov.i32   q8, #...
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]
 
  Other than for zero, I'd expect the vmov to be vmov.i8 to move an
  arbitrary byte value into all lanes in a vector.
 I just used vmov.i32 as an example.  The element size is actually
 calculated by function neon_valid_immediate, which works as expected I
 think.

  After that, if the alignment is known to be more than 8-bit, I'd
  expect the vst1 instructions (with the exception of the last store if
  the length is not a multiple of the alignment) to use

  vst1.align {reg}, [addr-reg :align]!

  Hence, for 16-bit aligned data, we want

  vst1.16 {q8}, [r3:16]!
 Did I miss something important?  It seems to me the explicit alignment
 notes supported are 64/128/256.  So what do you mean by 16 bits
 alignment here?

 
   But for now, gcc can't do this and below code is generated:
 vmov.i32   q8, #...
 vst1.8     {q8}, [r3]
 add        r2, r3, #16
 add        r3, r2, #16
 vst1.8     {q8}, [r2]
 vst1.8     {q8}, [r3]
 add        r2, r3, #16
 vst1.8     {q8}, [r2]
  
   I investigated this issue.  The root cause lies in rtx cost returned
   by ARM backend.  Anyway, I think this is another issue and should be
   fixed in separated patch.
  
   Bootstrap and reg-test on cortex-a15, with or without neon support.
   Is it OK?
  
 
  Some more comments inline.
 
   Thanks,
   bin
  
  
   2014-04-29  Bin Cheng  bin.ch...@arm.com
  
 PR target/55701
 * config/arm/arm.md (setmem): New pattern.
 * config/arm/arm-protos.h (struct tune_params): New field.
 (arm_gen_setmem): New prototype.
 * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
 (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
 (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
 (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
 (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
 (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
 (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
 (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
 (arm_const_inline_cost): New function.
 (arm_block_set_max_insns): New function.
 (arm_block_set_straight_profit_p): New function.
 (arm_block_set_vect_profit_p): New function.
 (arm_block_set_unaligned_vect): New function.
 (arm_block_set_aligned_vect): New function.
 (arm_block_set_unaligned_straight): New function.
 (arm_block_set_aligned_straight): New function.
 (arm_block_set_vect, arm_gen_setmem): New functions.
  
   gcc/testsuite/ChangeLog
   2014-04-29  Bin Cheng  bin.ch...@arm.com
  
 PR target/55701
 * gcc.target/arm/memset-inline-1.c: New test.
 * gcc.target/arm/memset-inline-2.c: New test.
 * gcc.target/arm/memset-inline-3.c: New test.
 * gcc.target/arm/memset-inline-4.c: New test.
 * gcc.target/arm/memset-inline-5.c: New test.
 * gcc.target/arm/memset-inline-6.c: New test.
 * gcc.target/arm/memset-inline-7.c: New test.
 * gcc.target/arm/memset-inline-8.c: New test.
 * gcc.target/arm/memset-inline-9.c: New test.
  
  
   j1328-20140429.txt
  
  
   Index: gcc/config/arm/arm.c
  
 
 ==
  =
   --- gcc/config/arm/arm.c  (revision 209852)
   +++ gcc/config/arm/arm.c  (working copy)
   @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune

RE: [PATCH ARM] Improve ARM memset inlining

2014-05-05 Thread bin.cheng


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of bin.cheng
 Sent: Monday, May 05, 2014 3:21 PM
 To: Richard Earnshaw
 Cc: gcc-patches@gcc.gnu.org
 Subject: RE: [PATCH ARM] Improve ARM memset inlining
 
 Hi Richard,  Thanks for reviewing.  I embedded answers to your comments,
 also updated the patch.
 
  -Original Message-
  From: Richard Earnshaw
  Sent: Friday, May 02, 2014 10:00 PM
  To: Bin Cheng
  Cc: gcc-patches@gcc.gnu.org
  Subject: Re: [PATCH ARM] Improve ARM memset inlining
 
  On 30/04/14 03:52, bin.cheng wrote:
   Hi,
   This patch expands small memset calls into direct memory set
   instructions by introducing setmemsi pattern.  For processors
   without NEON support, it expands memset using general store
   instruction.  For example, strd for 4-bytes aligned addresses.  For
   processors with NEON support, it expands memset using neon
   instructions like vstr and miscellaneous vst1.* instructions for
   both aligned and unaligned cases.
  
   This patch depends on
   http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise
   vst1.64 will be generated for 32-bit aligned memory unit.
  
   There is also one leftover work of this patch:  Since vst1.*
   instructions only support post-increment addressing mode, the
   inlined memset for unaligned neon cases should be like:
 vmov.i32   q8, #...
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]!
 vst1.8 {q8}, [r3]
 
  Other than for zero, I'd expect the vmov to be vmov.i8 to move an
  arbitrary byte value into all lanes in a vector.
 I just used vmov.i32 as an example.  The element size is actually
 calculated by function neon_valid_immediate, which works as expected I
 think.

  After that, if the alignment is known to be more than 8-bit, I'd
  expect the vst1 instructions (with the exception of the last store if
  the length is not a multiple of the alignment) to use

  vst1.align {reg}, [addr-reg :align]!

  Hence, for 16-bit aligned data, we want

  vst1.16 {q8}, [r3:16]!
 Did I miss something important?  It seems to me the explicit alignment
 notes supported are 64/128/256.  So what do you mean by 16 bits
 alignment here?
 
 
   But for now, gcc can't do this and below code is generated:
 vmov.i32   q8, #...
 vst1.8     {q8}, [r3]
 add        r2, r3, #16
 add        r3, r2, #16
 vst1.8     {q8}, [r2]
 vst1.8     {q8}, [r3]
 add        r2, r3, #16
 vst1.8     {q8}, [r2]
  
   I investigated this issue.  The root cause lies in rtx cost returned
   by ARM backend.  Anyway, I think this is another issue and should be
   fixed in separated patch.
  
   Bootstrap and reg-test on cortex-a15, with or without neon support.
   Is it OK?
  
 
  Some more comments inline.
 
   Thanks,
   bin
  
  
   2014-04-29  Bin Cheng  bin.ch...@arm.com
  
 PR target/55701
 * config/arm/arm.md (setmem): New pattern.
 * config/arm/arm-protos.h (struct tune_params): New field.
 (arm_gen_setmem): New prototype.
 * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
 (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
 (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
 (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
 (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
 (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
 (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
 (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
 (arm_const_inline_cost): New function.
 (arm_block_set_max_insns): New function.
 (arm_block_set_straight_profit_p): New function.
 (arm_block_set_vect_profit_p): New function.
 (arm_block_set_unaligned_vect): New function.
 (arm_block_set_aligned_vect): New function.
 (arm_block_set_unaligned_straight): New function.
 (arm_block_set_aligned_straight): New function.
 (arm_block_set_vect, arm_gen_setmem): New functions.
  
   gcc/testsuite/ChangeLog
   2014-04-29  Bin Cheng  bin.ch...@arm.com
  
 PR target/55701
 * gcc.target/arm/memset-inline-1.c: New test.
 * gcc.target/arm/memset-inline-2.c: New test.
 * gcc.target/arm/memset-inline-3.c: New test.
 * gcc.target/arm/memset-inline-4.c: New test.
 * gcc.target/arm/memset-inline-5.c: New test.
 * gcc.target/arm/memset-inline-6.c: New test.
 * gcc.target/arm/memset-inline-7.c: New test.
 * gcc.target/arm/memset-inline-8.c: New test.
 * gcc.target/arm/memset-inline-9.c: New test.
  
  
   j1328-20140429.txt
  
  
   Index: gcc/config/arm/arm.c
  
 
 ==
  =
   --- gcc/config/arm/arm.c  (revision 209852)
   +++ gcc/config/arm/arm.c  (working copy)
   @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune
 =
  true,  /* Prefer constant
  pool

Re: [PATCH ARM] Improve ARM memset inlining

2014-05-02 Thread Richard Earnshaw
On 30/04/14 03:52, bin.cheng wrote:
 Hi,
 This patch expands small memset calls into direct memory set instructions by
 introducing setmemsi pattern.  For processors without NEON support, it
 expands memset using general store instruction.  For example, strd for
 4-bytes aligned addresses.  For processors with NEON support, it expands
 memset using neon instructions like vstr and miscellaneous vst1.*
 instructions for both aligned and unaligned cases.
 
 This patch depends on
 http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise vst1.64
 will be generated for 32-bit aligned memory unit.
 
 There is also one leftover work of this patch:  Since vst1.* instructions
 only support post-increment addressing mode, the inlined memset for
 unaligned neon cases should be like:
   vmov.i32   q8, #...
   vst1.8 {q8}, [r3]!
   vst1.8 {q8}, [r3]!
   vst1.8 {q8}, [r3]!
   vst1.8 {q8}, [r3]

Other than for zero, I'd expect the vmov to be vmov.i8 to move an
arbitrary byte value into all lanes in a vector.  After that, if the
alignment is known to be more than 8-bit, I'd expect the vst1
instructions (with the exception of the last store if the length is not
a multiple of the alignment) to use

vst1.align {reg}, [addr-reg :align]!

Hence, for 16-bit aligned data, we want

vst1.16 {q8}, [r3:16]!
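
As a side note on the vmov.i8 remark, the element-size point can be illustrated in plain C.  This is only a sketch: the helper names below are made up for this note and do not come from the patch or from GCC.  It shows that a memset value is a single byte, so a 32-bit lane has to hold that byte repeated four times, and that the 8-bit and 32-bit views of the constant only look the same for trivial values such as zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value each 32-bit lane must hold so that
   every byte written to memory equals VALUE.  */
uint32_t
splat_byte_to_u32 (uint8_t value)
{
  return (uint32_t) value * UINT32_C (0x01010101);
}

/* Hypothetical helper: nonzero if a 32-bit lane pattern is just one
   byte repeated, i.e. it could equally be described with 8-bit lanes.  */
int
is_byte_splat (uint32_t lane)
{
  uint8_t low = (uint8_t) (lane & 0xff);
  return lane == splat_byte_to_u32 (low);
}

/* Small self-check of the two helpers above.  */
void
check_splat (void)
{
  assert (splat_byte_to_u32 (0x00) == 0);
  assert (splat_byte_to_u32 (0xa5) == UINT32_C (0xa5a5a5a5));
  assert (is_byte_splat (UINT32_C (0xa5a5a5a5)));
  assert (!is_byte_splat (UINT32_C (0x000000a5)));
}
```

Which element size the compiler actually picks for a given constant is decided elsewhere; the sketch only illustrates the arithmetic behind the remark.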

 But for now, gcc can't do this and below code is generated:
   vmov.i32   q8, #...
   vst1.8     {q8}, [r3]
   add        r2, r3, #16
   add        r3, r2, #16
   vst1.8     {q8}, [r2]
   vst1.8     {q8}, [r3]
   add        r2, r3, #16
   vst1.8     {q8}, [r2]
 
 I investigated this issue.  The root cause lies in rtx cost returned by ARM
 backend.  Anyway, I think this is another issue and should be fixed in
 separated patch.
 
 Bootstrap and reg-test on cortex-a15, with or without neon support.  Is it
 OK?
 

Some more comments inline.

 Thanks,
 bin
 
 
 2014-04-29  Bin Cheng  bin.ch...@arm.com
 
   PR target/55701
   * config/arm/arm.md (setmem): New pattern.
   * config/arm/arm-protos.h (struct tune_params): New field.
   (arm_gen_setmem): New prototype.
   * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
   (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
   (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
   (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
   (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
   (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
   (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
   (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
   (arm_const_inline_cost): New function.
   (arm_block_set_max_insns): New function.
   (arm_block_set_straight_profit_p): New function.
   (arm_block_set_vect_profit_p): New function.
   (arm_block_set_unaligned_vect): New function.
   (arm_block_set_aligned_vect): New function.
   (arm_block_set_unaligned_straight): New function.
   (arm_block_set_aligned_straight): New function.
   (arm_block_set_vect, arm_gen_setmem): New functions.
 
 gcc/testsuite/ChangeLog
 2014-04-29  Bin Cheng  bin.ch...@arm.com
 
   PR target/55701
   * gcc.target/arm/memset-inline-1.c: New test.
   * gcc.target/arm/memset-inline-2.c: New test.
   * gcc.target/arm/memset-inline-3.c: New test.
   * gcc.target/arm/memset-inline-4.c: New test.
   * gcc.target/arm/memset-inline-5.c: New test.
   * gcc.target/arm/memset-inline-6.c: New test.
   * gcc.target/arm/memset-inline-7.c: New test.
   * gcc.target/arm/memset-inline-8.c: New test.
   * gcc.target/arm/memset-inline-9.c: New test.
 
 
 j1328-20140429.txt
 
 
 Index: gcc/config/arm/arm.c
 ===
 --- gcc/config/arm/arm.c  (revision 209852)
 +++ gcc/config/arm/arm.c  (working copy)
 @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
    true,  /* Prefer constant pool.  */
    arm_default_branch_cost,
    false, /* Prefer LDRD/STRD.  */
 -  {true, true},  /* Prefer non short circuit.  */
 -  arm_default_vec_cost,/* Vectorizer costs.  */
 -  false,/* Prefer Neon for 64-bits bitops.  */
 -  false, false  /* Prefer 32-bit encodings.  */
 +  {true, true},  /* Prefer non short circuit.  */
 +  arm_default_vec_cost,/* Vectorizer costs.  */
 +  false,/* Prefer Neon for 64-bits bitops.  */
 +  false, false, /* Prefer 32-bit encodings.  */
 +  false /* Prefer Neon for stringops.  */
  };
  

Please make sure that all the white space before the comments is using
TAB, not spaces.  Similarly for the other 

[PATCH ARM] Improve ARM memset inlining

2014-04-29 Thread bin.cheng
Hi,
This patch expands small memset calls into direct store instructions by
introducing a setmemsi pattern.  For processors without NEON support, it
expands memset using general store instructions, for example strd for
4-byte aligned addresses.  For processors with NEON support, it expands
memset using NEON instructions such as vstr and the various vst1.*
instructions, for both aligned and unaligned cases.
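
For illustration, the kind of call this expansion targets is a memset with a small, compile-time constant length.  The following is only a sketch: the struct and function names are made up for this example and are not taken from the patch or its tests, and whether a given call is really inlined depends on the target, tuning and options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type, used only for this illustration.  */
struct packet
{
  unsigned char header[8];
  unsigned char payload[24];
};

/* Small constant-length memset with a zero value: a candidate for
   expansion into a few plain or NEON stores instead of a libcall.  */
void
clear_packet (struct packet *p)
{
  assert (p != NULL);
  memset (p, 0, sizeof *p);
}

/* Same idea with a non-zero byte value and a length (23) that is not
   a multiple of the word size, so the tail needs narrower stores.  */
void
fill_payload (struct packet *p, unsigned char value)
{
  assert (p != NULL);
  memset (p->payload, value, 23);
}
```

Comparing the assembly output for such functions with and without the patch is one way to see which store sequence was chosen.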

This patch depends on
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; otherwise vst1.64
will be generated for a 32-bit aligned memory unit.

There is also one piece of leftover work in this patch.  Since vst1.*
instructions only support the post-increment addressing mode, the inlined
memset for unaligned NEON cases should look like:
  vmov.i32   q8, #...
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]!
  vst1.8     {q8}, [r3]
But for now, GCC can't do this and the code below is generated:
  vmov.i32   q8, #...
  vst1.8     {q8}, [r3]
  add        r2, r3, #16
  add        r3, r2, #16
  vst1.8     {q8}, [r2]
  vst1.8     {q8}, [r3]
  add        r2, r3, #16
  vst1.8     {q8}, [r2]

I investigated this issue.  The root cause lies in the rtx costs returned by
the ARM backend.  Anyway, I think this is a separate issue and should be
fixed in a separate patch.
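
To make the difference between the two sequences concrete, here is a small C model of them.  It is not generated code and not part of the patch; the function names are invented for this note, and each 16-byte NEON store is simply modelled with a memset.  Both functions write the same 64 bytes, but the second one needs three explicit address computations, matching the add instructions in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VEC_BYTES = 16 };	/* Size of one q register in bytes.  */

/* Model of one "vst1.8 {q8}, [rN]" store: write 16 copies of VALUE.  */
static void
store_vec (uint8_t *addr, uint8_t value)
{
  memset (addr, value, VEC_BYTES);
}

/* Desired form: every store but the last post-increments the base.
   Returns the number of separate add instructions modelled (none).  */
int
set64_post_increment (uint8_t *r3, uint8_t value)
{
  assert (r3 != NULL);
  store_vec (r3, value); r3 += VEC_BYTES;	/* vst1.8 {q8}, [r3]!  */
  store_vec (r3, value); r3 += VEC_BYTES;	/* vst1.8 {q8}, [r3]!  */
  store_vec (r3, value); r3 += VEC_BYTES;	/* vst1.8 {q8}, [r3]!  */
  store_vec (r3, value);			/* vst1.8 {q8}, [r3]   */
  return 0;
}

/* Currently generated form: plain stores plus explicit adds.
   Returns the number of add instructions modelled.  */
int
set64_explicit_adds (uint8_t *r3, uint8_t value)
{
  uint8_t *r2;
  int adds = 0;

  assert (r3 != NULL);
  store_vec (r3, value);		/* vst1.8 {q8}, [r3]  */
  r2 = r3 + VEC_BYTES; adds++;		/* add r2, r3, #16    */
  r3 = r2 + VEC_BYTES; adds++;		/* add r3, r2, #16    */
  store_vec (r2, value);		/* vst1.8 {q8}, [r2]  */
  store_vec (r3, value);		/* vst1.8 {q8}, [r3]  */
  r2 = r3 + VEC_BYTES; adds++;		/* add r2, r3, #16    */
  store_vec (r2, value);		/* vst1.8 {q8}, [r2]  */
  return adds;
}
```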

Bootstrapped and regression-tested on cortex-a15, with and without NEON
support.  Is it OK?

Thanks,
bin


2014-04-29  Bin Cheng  bin.ch...@arm.com

PR target/55701
* config/arm/arm.md (setmem): New pattern.
* config/arm/arm-protos.h (struct tune_params): New field.
(arm_gen_setmem): New prototype.
* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
(arm_const_inline_cost): New function.
(arm_block_set_max_insns): New function.
(arm_block_set_straight_profit_p): New function.
(arm_block_set_vect_profit_p): New function.
(arm_block_set_unaligned_vect): New function.
(arm_block_set_aligned_vect): New function.
(arm_block_set_unaligned_straight): New function.
(arm_block_set_aligned_straight): New function.
(arm_block_set_vect, arm_gen_setmem): New functions.

gcc/testsuite/ChangeLog
2014-04-29  Bin Cheng  bin.ch...@arm.com

PR target/55701
* gcc.target/arm/memset-inline-1.c: New test.
* gcc.target/arm/memset-inline-2.c: New test.
* gcc.target/arm/memset-inline-3.c: New test.
* gcc.target/arm/memset-inline-4.c: New test.
* gcc.target/arm/memset-inline-5.c: New test.
* gcc.target/arm/memset-inline-6.c: New test.
* gcc.target/arm/memset-inline-7.c: New test.
* gcc.target/arm/memset-inline-8.c: New test.
* gcc.target/arm/memset-inline-9.c: New test.
Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c(revision 209852)
+++ gcc/config/arm/arm.c(working copy)
@@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
   true,/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,   /* Prefer LDRD/STRD.  */
-  {true, true},/* Prefer non short circuit.  */
-  arm_default_vec_cost,/* Vectorizer costs.  */
-  false,/* Prefer Neon for 64-bits bitops.  */
-  false, false  /* Prefer 32-bit encodings.  */
+  {true, true},/* Prefer non short circuit.  */
+  arm_default_vec_cost,/* Vectorizer costs.  */
+  false,/* Prefer Neon for 64-bits bitops.  */
+  false, false, /* Prefer 32-bit encodings.  */
+  false /* Prefer Neon for stringops.  */
 };
 
 const struct tune_params arm_fastmul_tune =
@@ -1602,10 +1603,11 @@ const struct tune_params arm_fastmul_tune =
   true,/* Prefer constant pool.  */
   arm_default_branch_cost,
   false,   /* Prefer LDRD/STRD.  */
-  {true, true},/* Prefer non short circuit.  */
-  arm_default_vec_cost,/* Vectorizer costs.  */
-  false,/* Prefer Neon for 64-bits bitops.  */
-  false, false  /* Prefer 32-bit encodings.  */
+  {true, true},