Re: [RFC, ARM] later split of symbol_refs
2012/6/30 Georg-Johann Lay <g...@gcc.gnu.org>:
> Is there a special reason to restrict it to SYMBOL_REF?  Doesn't the
> same issue occur with, e.g. (const (plus (symbol_ref const_int))) or
> label_ref?

Hi!

We have added splits for symbol_ref plus const and for label_ref.  With
this patch, the assembly code and oprofile data look better, but on
SPEC2K INT it's about 3% slower than with the split for symbol_refs
only.  We will try to find out later why this happens.  For now, we
committed the original patch.

symbol_plus.patch
Description: Binary data
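(The symbol_plus.patch attachment is binary data in the archive, so its
contents are not visible here.  Purely as an illustrative sketch, a split for
the "symbol_ref plus const" case might look like the following; the
CONST/PLUS tests and the reuse of arm_emit_movpair are assumptions modelled
on the committed symbol_ref split, not the actual attachment:

;; Hypothetical sketch: split (const (plus (symbol_ref ...) (const_int ...)))
;; into a movw/movt pair as well, mirroring the committed symbol_ref split.
(define_split
  [(set (match_operand:SI 0 "arm_general_register_operand" "")
        (match_operand:SI 1 "general_operand" ""))]
  "TARGET_32BIT && TARGET_USE_MOVT
   && GET_CODE (operands[1]) == CONST
   && GET_CODE (XEXP (operands[1], 0)) == PLUS
   && GET_CODE (XEXP (XEXP (operands[1], 0), 0)) == SYMBOL_REF
   && CONST_INT_P (XEXP (XEXP (operands[1], 0), 1))
   && !flag_pic && !target_word_relocations
   && !arm_tls_referenced_p (operands[1])"
  [(clobber (const_int 0))]
{
  ;; arm_emit_movpair emits the movw/movt (high/lo_sum) pair; the
  ;; #:lower16:/#:upper16: relocations accept a symbol plus offset.
  arm_emit_movpair (operands[0], operands[1]);
  DONE;
})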
Re: [RFC, ARM] later split of symbol_refs
On 06/29/2012 06:31 PM, Ramana Radhakrishnan wrote:

Ok with this comment?

+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; if the default is to prefer movt/movw rather than a load from the constant
+;; pool, the performance is usually better.

>> +;; Split symbol_refs at the later stage (after cprop), instead of generating
>> +;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
>> +;; and lo_sum would be merged back into memory load at cprop.  However,
>
> I would rewrite part of your comment as
>
> +;; movt/movw is preferable, because it usually executes faster than a load
>
> However if the default is to prefer to use movw/movt rather than the
> constant pool use that.
>
> instead of
>
> a load from the constant pool.

--
Best regards,
Dmitry


2009-05-29  Julian Brown  <jul...@codesourcery.com>

	gcc/
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -5472,14 +5472,6 @@
 			      optimize && can_create_pseudo_p ());
           DONE;
         }
-
-      if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
     }
   else /* TARGET_THUMB1...  */
     {
@@ -5588,6 +5580,24 @@
 )
 
+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; if the default is to prefer movt/movw rather than a load from the constant
+;; pool, the performance is usually better.
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+	(match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
	(match_operand:SI 1 "general_operand"      "l, I,J,K,>,l,mi,l,*l*h*k"))]
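(Two notes on the pattern above.  First, because the split's C body calls
DONE, the "[(clobber (const_int 0))]" replacement pattern is discarded; it is
just a conventional placeholder, and the insns actually generated are the
ones emitted by arm_emit_movpair.  Second, purely as an illustrative sketch
with a hypothetical global "g" and register number N, the high/lo_sum pair it
emits looks roughly like:

;; movw rN, #:lower16:g   @ sets the low half, zeroing the high half
(set (reg:SI N) (high:SI (symbol_ref:SI "g")))
;; movt rN, #:upper16:g   @ fills in the high half
(set (reg:SI N) (lo_sum:SI (reg:SI N) (symbol_ref:SI "g")))

It is this two-insn form that cprop would otherwise merge back into a single
constant-pool load, which is why the split is deferred until after cprop.)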
Re: [RFC, ARM] later split of symbol_refs
Dmitry Melnik wrote:
> On 06/27/2012 07:53 PM, Richard Earnshaw wrote:
>> Please update the ChangeLog entry (it's not appropriate to mention
>> Sourcery G++) and add a comment as Steven has suggested.
>>
>> Otherwise OK.
>
> Updated.  Ok to commit now?
>
> --
> Best regards,
> Dmitry
>
> 2009-05-29  Julian Brown  <jul...@codesourcery.com>
>
> 	gcc/
> 	* config/arm/arm.md (movsi): Don't split symbol refs here.
> 	(define_split): New.
>
> diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
> index 0654564..98ff382 100644
> --- a/gcc/config/arm/arm.md
> +++ b/gcc/config/arm/arm.md
> @@ -5472,14 +5472,6 @@
>  			      optimize && can_create_pseudo_p ());
>            DONE;
>          }
> -
> -      if (TARGET_USE_MOVT && !target_word_relocations
> -	  && GET_CODE (operands[1]) == SYMBOL_REF
> -	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
> -	{
> -	  arm_emit_movpair (operands[0], operands[1]);
> -	  DONE;
> -	}
>      }
>    else /* TARGET_THUMB1...  */
>      {
> @@ -5588,6 +5580,23 @@
>  )
> 
> +;; Split symbol_refs at the later stage (after cprop), instead of generating
> +;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
> +;; and lo_sum would be merged back into memory load at cprop.  However,
> +;; movt/movw is preferable, because it usually executes faster than a load.
> +(define_split
> +  [(set (match_operand:SI 0 "arm_general_register_operand" "")
> +	(match_operand:SI 1 "general_operand" ""))]
> +  "TARGET_32BIT
> +   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF

Is there a special reason to restrict it to SYMBOL_REF?  Doesn't the
same issue occur with, e.g. (const (plus (symbol_ref const_int))) or
label_ref?

Johann

> +   && !flag_pic && !target_word_relocations
> +   && !arm_tls_referenced_p (operands[1])"
> +  [(clobber (const_int 0))]
> +{
> +  arm_emit_movpair (operands[0], operands[1]);
> +  DONE;
> +})
> +
>  (define_insn "*thumb1_movsi_insn"
>    [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
> 	 (match_operand:SI 1 "general_operand"      "l, I,J,K,>,l,mi,l,*l*h*k"))]
Re: [RFC, ARM] later split of symbol_refs
> +;; Split symbol_refs at the later stage (after cprop), instead of generating
> +;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
> +;; and lo_sum would be merged back into memory load at cprop.  However,

I would rewrite part of your comment as

+;; movt/movw is preferable, because it usually executes faster than a load

However if the default is to prefer to use movw/movt rather than the
constant pool use that.

instead of

a load from the constant pool.

regards,
Ramana

> --
> Best regards,
> Dmitry
Re: [RFC, ARM] later split of symbol_refs
On 29 June 2012 14:48, Dmitry Melnik <d...@ispras.ru> wrote:
> On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote:
>> I must admit that I had been suggesting to Zhenqiang about turning this
>> off by tightening the movsi_insn predicates rather than adding a split,
>> but given that it appears to produce enough benefit in this case I
>> don't have any reasons to object ...  However it's interesting that
>> this doesn't seem to help vpr
>
> We retested vpr, but it just seems to be unstable:

Unfortunate but ok.

Ramana
Re: [RFC, ARM] later split of symbol_refs
On Wed, Jun 27, 2012 at 4:58 PM, Dmitry Melnik <d...@ispras.ru> wrote:
> This patch can be applied to current trunk and passes regtest
> successfully on qemu-arm.  Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.

If the patch is approved, can you please add a brief comment before the
define_split to explain why it's there and what it does?

Ciao!
Steven
Re: [RFC, ARM] later split of symbol_refs
On Wed, 27 Jun 2012 18:58:36 +0400
Dmitry Melnik <d...@ispras.ru> wrote:
> This patch can be applied to current trunk and passes regtest
> successfully on qemu-arm.  Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.

No objection from me (as the original author), FWIW.

Thanks!

Julian
Re: [RFC, ARM] later split of symbol_refs
On 27/06/12 15:58, Dmitry Melnik wrote:
> Hi,
>
> We'd like to draw attention to CodeSourcery's patch for the ARM
> backend, from which GCC mainline can gain 4% on SPEC2K INT:
> http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch
> (the patch is also attached).
>
> Originally, we noticed that GNU Go works 6% faster on cortex-a8 with
> -fno-gcse.  After profiling, we found that this is most likely caused
> by cache misses when accessing global variables: GCC generates ldr
> instructions for them, while this can be avoided by emitting a
> movt/movw pair for such cases.  The RTL expressions for these
> instructions are high and lo_sum.  Currently, a symbol_ref expands as
> high and lo_sum, but then cprop1 decides that this is redundant and
> merges them back into one load insn.
>
> The problem was also found by the Linaro community:
> https://bugs.launchpad.net/gcc-linaro/+bug/886124 .  There is also a
> patch from CodeSourcery (attached), which was ported to Linaro GCC 4.5
> but is missing in later Linaro releases.
>
> This patch makes the split of symbol_refs happen at a later stage
> (after cprop), instead of generating movt/movw at expand.  It fixed our
> test case on GNU Go.  We also tested it on SPEC2K INT (ref) with a GCC
> 4.8 snapshot from May 12, 2012, on cortex-a9 with -O2 and -mthumb:
>
>                   Base      Base      Base    Peak      Peak      Peak
> Benchmarks        Ref Time  Run Time  Ratio   Ref Time  Run Time  Ratio
> ----------------  --------  --------  ------  --------  --------  ------  -------
> 164.gzip          1400      492       284     1400      497       282      -0.70%
> 175.vpr           1400      433       323     1400      458       306      -5.26%
> 176.gcc           1100      203       542     1100      198       557       2.77%
> 181.mcf           1800      529       340     1800      528       341       0.29%
> 186.crafty        1000      261       383     1000      256       391       2.09%
> 197.parser        1800      709       254     1800      701       257       1.18%
> 252.eon           1300      219       594     1300      202       644       8.42%
> 253.perlbmk       1800      389       463     1800      367       490       5.83%
> 254.gap           1100      259       425     1100      236       467       9.88%
> 255.vortex        1900      498       382     1900      442       430      12.57%
> 256.bzip2         1500      452       332     1500      424       354       6.63%
> 300.twolf         3000      916       328     3000      853       352       7.32%
> SPECint_base2000                      376
> SPECint2000                                                       391       3.99%
>
> SPEC2K INT grows by 4% (up to 12.5% on vortex; the vpr slowdown is
> likely due to the big variance on this test).  Similarly, there are
> gains of 3-4% without -mthumb on cortex-a9, and on cortex-a8 (in both
> thumb2 and ARM modes).
>
> This patch can be applied to current trunk and passes regtest
> successfully on qemu-arm.  Maybe it will be good to have it in trunk?
> If everybody agrees, we can take care of committing it.
>
> --
> Best regards,
> Dmitry
>
> gcc-4.5-linaro-r99369.patch

Please update the ChangeLog entry (it's not appropriate to mention
Sourcery G++) and add a comment as Steven has suggested.

Otherwise OK.

R.
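(An illustrative sketch, not taken from the thread: for a hypothetical global
"int counter;", the two code sequences being compared for "get the address of
counter" are roughly

;; Without the patch (after cprop merges the pair back): a literal-pool
;; load, i.e. a data-cache access just to obtain the address.
;;   ldr   r0, .LC0            @ .LC0: .word counter
;; With the patch, the address is materialized from immediates:
;;   movw  r0, #:lower16:counter
;;   movt  r0, #:upper16:counter

The literal-pool ldr can miss in the data cache, which matches the profiling
observation about -fno-gcse above.)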
Re: [RFC, ARM] later split of symbol_refs
On 27 June 2012 15:58, Dmitry Melnik <d...@ispras.ru> wrote:
> Hi,
>
> We'd like to draw attention to CodeSourcery's patch for the ARM
> backend, from which GCC mainline can gain 4% on SPEC2K INT:
> http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch
> (the patch is also attached).
>
> Originally, we noticed that GNU Go works 6% faster on cortex-a8 with
> -fno-gcse.  After profiling, we found that this is most likely caused
> by cache misses when accessing global variables: GCC generates ldr
> instructions for them, while this can be avoided by emitting a
> movt/movw pair for such cases.  The RTL expressions for these
> instructions are high and lo_sum.  Currently, a symbol_ref expands as
> high and lo_sum, but then cprop1 decides that this is redundant and
> merges them back into one load insn.
>
> The problem was also found by the Linaro community:
> https://bugs.launchpad.net/gcc-linaro/+bug/886124 .

The reason IIRC this isn't in our later releases is that it wasn't
thought beneficial enough to upstream.  Now you've got some evidence to
the contrary.

> There is also a patch from CodeSourcery (attached), which was ported
> to Linaro GCC 4.5 but is missing in later Linaro releases.
>
> This patch makes the split of symbol_refs happen at a later stage
> (after cprop), instead of generating movt/movw at expand.

I must admit that I had been suggesting to Zhenqiang about turning this
off by tightening the movsi_insn predicates rather than adding a split,
but given that it appears to produce enough benefit in this case I don't
have any reasons to object ...

However it's interesting that this doesn't seem to help vpr

Ramana
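(For context, a purely hypothetical sketch of the alternative Ramana
mentions: instead of adding a define_split, the insn condition of
"*arm_movsi_insn" could be tightened to reject bare symbol_refs whenever
movt/movw is in use, e.g. a condition along the lines of

  "TARGET_ARM
   && !(TARGET_USE_MOVT && !target_word_relocations
        && GET_CODE (operands[1]) == SYMBOL_REF
        && !flag_pic && !arm_tls_referenced_p (operands[1]))"

so that cprop could never rewrite the movw/movt pair back into a single move
of the symbol.  This fragment is illustrative only, not a posted patch; the
real insn condition also tests other target flags.)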