Re: [RFC, ARM] later split of symbol_refs

2012-07-13 Thread Dmitry Plotnikov
2012/6/30 Georg-Johann Lay :
> Is there a special reason to restrict it to SYMBOL_REF?
> Doesn't the same issue occur with, e.g.
> (const (plus (symbol_ref const_int))) or label_ref?

Hi!
We have added splits for symbol_ref plus const and label_ref.  With
this patch, assembly code and oprofile data look better,
but on SPEC2K INT it's about 3% slower than with the split for symbol_refs only.
We will try to find out later why this happens.
For now, we have committed the original patch.
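
For concreteness, the symbol-plus-constant case corresponds to taking the
address of an array element at a constant index.  A hypothetical sketch
(illustrative C input and assembly; the exact output depends on target,
flags, and assembler version):

/* The address of arr[10] is (const (plus (symbol_ref "arr") (const_int 40))).  */
extern int arr[100];

int *
elem_addr (void)
{
  /* With the extended split, the compound address can be built with a
     movw/movt pair that carries the offset in its relocations, e.g.:
         movw    r0, #:lower16:arr+40
         movt    r0, #:upper16:arr+40
     instead of being loaded from the literal pool.  */
  return &arr[10];
}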


symbol_plus.patch
Description: Binary data


Re: [RFC, ARM] later split of symbol_refs

2012-07-04 Thread Dmitry Melnik

On 06/29/2012 06:31 PM, Ramana Radhakrishnan wrote:

> +;; Split symbol_refs at the later stage (after cprop), instead of generating
> +;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
> +;; and lo_sum would be merged back into memory load at cprop.  However,
>
> I would rewrite part of your comment as
>
> +;; movt/movw is preferable, because it usually executes faster than a load
>
> "However, if the default is to prefer movw/movt rather than the constant
> pool, use that instead of a load from the constant pool."

Ok with this comment?

+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; if the default is to prefer movt/movw rather than a load from the constant
+;; pool, the performance is usually better.


--
Best regards,
   Dmitry

2009-05-29  Julian Brown  

gcc/
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -5472,14 +5472,6 @@
 			   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5588,6 +5580,24 @@
   "
 )
 
+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; if the default is to prefer movt/movw rather than a load from the constant
+;; pool, the performance is usually better.
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+   (match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
	(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*l*h*k"))]


Re: [RFC, ARM] later split of symbol_refs

2012-06-30 Thread Georg-Johann Lay

Dmitry Melnik wrote:

On 06/27/2012 07:53 PM, Richard Earnshaw wrote:

Please update the ChangeLog entry (it's not appropriate to mention
Sourcery G++) and add a comment as Steven has suggested.

Otherwise OK.



Updated.
Ok to commit now?

--
Best regards,
  Dmitry

2009-05-29  Julian Brown  

gcc/
* config/arm/arm.md (movsi): Don't split symbol refs here.
(define_split): New.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 0654564..98ff382 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -5472,14 +5472,6 @@
 			   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5588,6 +5580,23 @@
   "
 )
 
+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; movt/movw is preferable, because it usually executes faster than a load.
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+   (match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF


Is there a special reason to restrict it to SYMBOL_REF?
Doesn't the same issue occur with, e.g.
(const (plus (symbol_ref const_int))) or label_ref?

Johann


+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
	(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*l*h*k"))]


Re: [RFC, ARM] later split of symbol_refs

2012-06-29 Thread Ramana Radhakrishnan
On 29 June 2012 14:48, Dmitry Melnik  wrote:
>
> On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote:
>
>> I must admit that I had been suggesting to Zhenqiang about turning
>> this off by tightening the movsi_insn predicates rather than adding a
>> split, but given that it appears to produce enough benefit in this
>> case I don't have any reasons to object ...
>>
>> However it's interesting that this doesn't seem to help vpr 
>
> We retested vpr, but it just seems to be unstable:

Unfortunate but ok.

Ramana


Re: [RFC, ARM] later split of symbol_refs

2012-06-29 Thread Ramana Radhakrishnan
> +;; Split symbol_refs at the later stage (after cprop), instead of generating
> +;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
> +;; and lo_sum would be merged back into memory load at cprop.  However,

I would rewrite part of your comment as

> +;; movt/movw is preferable, because it usually executes faster than a load

"However if the default is to prefer to use movw/movt rather than the
constant pool use that. instead of a load from the constant pool."



regards,
Ramana


>
>
> --
> Best regards,
>  Dmitry
>


Re: [RFC, ARM] later split of symbol_refs

2012-06-29 Thread Dmitry Melnik

On 06/27/2012 07:53 PM, Richard Earnshaw wrote:

Please update the ChangeLog entry (it's not appropriate to mention
Sourcery G++) and add a comment as Steven has suggested.

Otherwise OK.




Updated.
Ok to commit now?


--
Best regards,
  Dmitry

2009-05-29  Julian Brown  

gcc/
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 0654564..98ff382 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -5472,14 +5472,6 @@
 			   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5588,6 +5580,23 @@
   "
 )
 
+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; movt/movw is preferable, because it usually executes faster than a load.
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+   (match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
 	(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*l*h*k"))]


Re: [RFC, ARM] later split of symbol_refs

2012-06-29 Thread Dmitry Melnik


On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote:

> I must admit that I had been suggesting to Zhenqiang about turning
> this off by tightening the movsi_insn predicates rather than adding a
> split, but given that it appears to produce enough benefit in this
> case I don't have any reasons to object ...
>
> However it's interesting that this doesn't seem to help vpr 

We retested vpr, but it just seems to be unstable:

                  base               peak
                  time               time

175.vpr    1400   502  X      1400   526  X
175.vpr    1400   500  X      1400   524  X
175.vpr    1400   516  X      1400   526  X
175.vpr    1400   492  X      1400   481  X
175.vpr    1400   496  X      1400   485  X
median            500                524

However, the minimum time is still better with the patch.

And here are all 3 runs for previously reported data:

test         base ratio  peak ratio  median base  median peak  improvement
164.gzip     284         281         284          282          -0.70%
164.gzip     284         282
164.gzip     285         283
175.vpr      329         306         323          306          -5.26%
175.vpr      323         305
175.vpr      306         307
176.gcc      542         554         542          557           2.77%
176.gcc      541         558
176.gcc      544         557
181.mcf      343         340         340          341           0.29%
181.mcf      339         342
181.mcf      340         341
186.crafty   383         399         383          391           2.09%
186.crafty   390         391
186.crafty   380         386
197.parser   254         257         254          257           1.18%
197.parser   254         257
197.parser   254         257
252.eon      591         644         594          644           8.42%
252.eon      598         644
252.eon      594         643
253.perlbmk  462         490         463          490           5.83%
253.perlbmk  463         490
253.perlbmk  463         490
254.gap      415         473         425          467           9.88%
254.gap      430         467
254.gap      425         464
255.vortex   384         430         382          430          12.57%
255.vortex   382         430
255.vortex   381         430
256.bzip2    331         354         332          354           6.63%
256.bzip2    332         354
256.bzip2    335         349
300.twolf    323         356         328          352           7.32%
300.twolf    347         337
300.twolf    328         352


--
Best regards,
  Dmitry



Re: [RFC, ARM] later split of symbol_refs

2012-06-27 Thread Ramana Radhakrishnan
On 27 June 2012 15:58, Dmitry Melnik  wrote:
> Hi,
>
> We'd like to draw attention to CodeSourcery's patch for the ARM backend,
> from which GCC mainline can gain 4% on SPEC2K INT:
> http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch
> (also the patch is attached).
>
> Originally, we noticed that GNU Go works 6% faster on cortex-a8 with
> -fno-gcse.  After profiling, we found that this is most likely caused by
> cache misses when accessing global variables.  GCC generates ldr
> instructions for them, while this can be avoided by emitting a movt/movw
> pair in such cases.  The RTL expressions for these instructions are high
> and lo_sum.  Currently, a symbol_ref expands as high and lo_sum, but then
> cprop1 decides that this is redundant and merges them into one load insn.
>
> The problem was also found by the Linaro community:
> https://bugs.launchpad.net/gcc-linaro/+bug/886124 .

The reason IIRC this isn't in our later releases is that it wasn't
thought beneficial enough to upstream. Now you've got some evidence to
the contrary.

> There is also a patch from CodeSourcery (attached), which was ported to
> Linaro GCC 4.5 but is missing in later Linaro releases.
> This patch splits symbol_refs at a later stage (after cprop),
> instead of generating movt/movw at expand.

I must admit that I had been suggesting to Zhenqiang about turning
this off by tightening the movsi_insn predicates rather than adding a
split, but given that it appears to produce enough benefit in this
case I don't have any reasons to object ...

However it's interesting that this doesn't seem to help vpr 


Ramana


Re: [RFC, ARM] later split of symbol_refs

2012-06-27 Thread Richard Earnshaw
On 27/06/12 15:58, Dmitry Melnik wrote:
> Hi,
> 
> We'd like to draw attention to CodeSourcery's patch for the ARM backend,
> from which GCC mainline can gain 4% on SPEC2K INT:
> http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch
>  
> (also the patch is attached).
> 
> Originally, we noticed that GNU Go works 6% faster on cortex-a8 with
> -fno-gcse.  After profiling, we found that this is most likely caused by
> cache misses when accessing global variables.  GCC generates ldr
> instructions for them, while this can be avoided by emitting a movt/movw
> pair in such cases.  The RTL expressions for these instructions are high
> and lo_sum.  Currently, a symbol_ref expands as high and lo_sum, but then
> cprop1 decides that this is redundant and merges them into one load insn.
> 
> The problem was also found by the Linaro community:
> https://bugs.launchpad.net/gcc-linaro/+bug/886124 .
> There is also a patch from CodeSourcery (attached), which was ported to
> Linaro GCC 4.5 but is missing in later Linaro releases.
> This patch splits symbol_refs at a later stage (after cprop),
> instead of generating movt/movw at expand.
> 
> It fixed our test case on GNU Go.  We also tested it on SPEC2K INT (ref)
> with a GCC 4.8 snapshot from May 12, 2012, on cortex-a9 with -O2 and -mthumb:
> 
>                   Base      Base      Base    Peak      Peak      Peak
> Benchmarks        Ref Time  Run Time  Ratio   Ref Time  Run Time  Ratio
> ----------------  --------  --------  ------  --------  --------  ------
> 164.gzip          1400      492       284     1400      497       282   -0.70%
> 175.vpr           1400      433       323     1400      458       306   -5.26%
> 176.gcc           1100      203       542     1100      198       557    2.77%
> 181.mcf           1800      529       340     1800      528       341    0.29%
> 186.crafty        1000      261       383     1000      256       391    2.09%
> 197.parser        1800      709       254     1800      701       257    1.18%
> 252.eon           1300      219       594     1300      202       644    8.42%
> 253.perlbmk       1800      389       463     1800      367       490    5.83%
> 254.gap           1100      259       425     1100      236       467    9.88%
> 255.vortex        1900      498       382     1900      442       430   12.57%
> 256.bzip2         1500      452       332     1500      424       354    6.63%
> 300.twolf         3000      916       328     3000      853       352    7.32%
> SPECint_base2000                      376
> SPECint2000                                                       391    3.99%
> 
> 
> SPEC2K INT improves by 4% (up to 12.5% on vortex; the vpr slowdown is
> likely due to the large variance on this test).
> 
> Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on 
> cortex-a8 (thumb2 and ARM modes).
> 
> This patch can be applied to current trunk and passes regtest 
> successfully on qemu-arm.
> Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.
> 
> --
> Best regards,
>Dmitry
> 
> 
> gcc-4.5-linaro-r99369.patch
> 

Please update the ChangeLog entry (it's not appropriate to mention
Sourcery G++) and add a comment as Steven has suggested.

Otherwise OK.

R.



Re: [RFC, ARM] later split of symbol_refs

2012-06-27 Thread Julian Brown
On Wed, 27 Jun 2012 18:58:36 +0400
Dmitry Melnik  wrote:

> This patch can be applied to current trunk and passes regtest 
> successfully on qemu-arm.
> Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.

No objection from me (as the original author), FWIW.

Thanks!

Julian


Re: [RFC, ARM] later split of symbol_refs

2012-06-27 Thread Steven Bosscher
On Wed, Jun 27, 2012 at 4:58 PM, Dmitry Melnik  wrote:
> This patch can be applied to current trunk and passes regtest successfully
> on qemu-arm.
> Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.

If the patch is approved, can you please add a brief comment before
the define_split to explain why it's there and what it does?

Ciao!
Steven


[RFC, ARM] later split of symbol_refs

2012-06-27 Thread Dmitry Melnik

Hi,

We'd like to draw attention to CodeSourcery's patch for the ARM backend,
from which GCC mainline can gain 4% on SPEC2K INT:
http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch 
(also the patch is attached).


Originally, we noticed that GNU Go works 6% faster on cortex-a8 with
-fno-gcse.  After profiling, we found that this is most likely caused by
cache misses when accessing global variables.  GCC generates ldr
instructions for them, while this can be avoided by emitting a movt/movw
pair in such cases.  The RTL expressions for these instructions are high
and lo_sum.  Currently, a symbol_ref expands as high and lo_sum, but then
cprop1 decides that this is redundant and merges them into one load insn.
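
To make the two code sequences concrete, here is a hedged illustration
(hypothetical C input and assembly; actual output varies with -march/-mtune
and compiler version):

/* Reading a global via the literal-pool sequence vs. the movw/movt pair.  */
extern int counter;

int
read_counter (void)
{
  /* Literal-pool variant (what the cprop merge produces):
         ldr     r3, .L2        @ load &counter from the constant pool
         ldr     r0, [r3]       @ load the value itself
         ...
     .L2:
         .word   counter
     The first ldr is an extra data access that can miss in the cache.

     movw/movt variant (what the late split preserves):
         movw    r3, #:lower16:counter
         movt    r3, #:upper16:counter
         ldr     r0, [r3]
     The address is built from immediates, so only the value itself is
     loaded from memory.  */
  return counter;
}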


The problem was also found by the Linaro community:
https://bugs.launchpad.net/gcc-linaro/+bug/886124 .
There is also a patch from CodeSourcery (attached), which was ported to
Linaro GCC 4.5 but is missing in later Linaro releases.
This patch splits symbol_refs at a later stage (after cprop),
instead of generating movt/movw at expand.


It fixed our test case on GNU Go.  We also tested it on SPEC2K INT (ref)
with a GCC 4.8 snapshot from May 12, 2012, on cortex-a9 with -O2 and -mthumb:


                  Base      Base      Base    Peak      Peak      Peak
Benchmarks        Ref Time  Run Time  Ratio   Ref Time  Run Time  Ratio
----------------  --------  --------  ------  --------  --------  ------
164.gzip          1400      492       284     1400      497       282   -0.70%
175.vpr           1400      433       323     1400      458       306   -5.26%
176.gcc           1100      203       542     1100      198       557    2.77%
181.mcf           1800      529       340     1800      528       341    0.29%
186.crafty        1000      261       383     1000      256       391    2.09%
197.parser        1800      709       254     1800      701       257    1.18%
252.eon           1300      219       594     1300      202       644    8.42%
253.perlbmk       1800      389       463     1800      367       490    5.83%
254.gap           1100      259       425     1100      236       467    9.88%
255.vortex        1900      498       382     1900      442       430   12.57%
256.bzip2         1500      452       332     1500      424       354    6.63%
300.twolf         3000      916       328     3000      853       352    7.32%
SPECint_base2000                      376
SPECint2000                                                       391    3.99%
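
(For reference, a SPEC ratio is 100 * reference time / run time; e.g. for the
164.gzip base run, 100 * 1400 / 492 is about 284.6, consistent with the 284
shown, since the run times in the table are themselves rounded.)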


SPEC2K INT improves by 4% (up to 12.5% on vortex; the vpr slowdown is
likely due to the large variance on this test).


Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on 
cortex-a8 (thumb2 and ARM modes).


This patch can be applied to current trunk and passes regtest 
successfully on qemu-arm.

Maybe it will be good to have it in trunk?
If everybody agrees, we can take care of committing it.

--
Best regards,
  Dmitry
2010-08-20  Jie Zhang  

	Merged from Sourcery G++ 4.4:

	gcc/
	2009-05-29  Julian Brown  
	Merged from Sourcery G++ 4.3:
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

 2010-08-18  Julian Brown  
 
 	Issue #9222

=== modified file 'gcc/config/arm/arm.md'
--- old/gcc/config/arm/arm.md	2010-08-20 16:41:37 +
+++ new/gcc/config/arm/arm.md	2010-08-23 14:39:12 +
@@ -5150,14 +5150,6 @@
 			   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5265,6 +5257,19 @@
   "
 )
 
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+	(match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*lhk")
 	(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*lhk"))]