Re: [patch, arm] Minor optimization on thumb2 tail call

2015-01-13 Thread Ramana Radhakrishnan



On 19/11/14 02:43, Joey Ye wrote:

Currently, thumb2 -Os generates suboptimal code for the following tail call case:

int f4(int b, int a, int c, int d);
int g(int a, int b, int c, int d)
{ return f4(b, a, c, d); }

arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m3 test.c

push {r4, lr}
mov r4, r1
mov r1, r0
mov r0, r4
pop {r4, lr}
b f4

There are two issues. First, saving and restoring lr is not necessary, as
there is no return via popping pc. Second, even if we managed to avoid the
lr push/pop, ldmia.w sp!, {r4} would still be emitted, because there is no
pattern for popping a single register, so code size is still not optimal.
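
For comparison, here is the code one would expect once both issues are
fixed (a sketch of the intended result, assuming the lr save is dropped
and the new single-register pop pattern matches; not output quoted from
the submission):

push {r4}
mov r4, r1
mov r1, r0
mov r0, r4
pop {r4}
b f4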

This patch fixes both issues and introduces a shared test case. CSiBE at
thumb2 -Os shows a code size reduction across the board, except for one
case with a 4-byte regression. That case looks like:

void f ()
{
  if (...)
    ...
  else if (...)
    ...
  else
    g ();
}
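
A concrete version of that shape, purely for illustration (the names x,
h1, h2 and counter are invented here and are not from CSiBE):

extern void g (void);
extern void h1 (void), h2 (void);  /* hypothetical helpers */
extern int counter;

void f (int x)
{
  if (x == 1)
    { h1 (); counter++; }   /* ends in a non-sibcall return */
  else if (x == 2)
    { h2 (); counter--; }   /* ends in a non-sibcall return */
  else
    g ();                   /* ends in a sibcall (tail call) */
}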

There are N=2 non-sibcall returns and S=1 sibcall return. Originally each
non-sibcall return is just pop {r4, r5, pc}; now they become:
   b.n  .Lreturn

.Lreturn:
   pop {r4, r5}
   bx lr

The bytes saved on the sibcall return do not win back the regression on
the non-sibcall returns. In the general scenario, the N non-sibcall
returns each use b.n to branch to the merged tail, and the S sibcalls
each save 2 bytes by avoiding the lr pop, resulting in a regression of
4-2*S bytes. In the worst scenario, each non-sibcall return has to use
b.w to branch to the merged tail, resulting in a regression of (N-S)*2
bytes. The worst scenario is rare, according to CSiBE, and the general
scenario can regress by at most 2 bytes, so I would not introduce
additional complexity to handle the regression case.
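
To make the arithmetic concrete: with S = 1, as in the CSiBE case above,
the general scenario gives a regression of 4 - 2*1 = 2 bytes, and since
S >= 1 whenever a sibcall return exists at all, 4 - 2*S can never exceed
2 bytes.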

Make check cortex-m3: pass
thumb2 bootstrap (O2/Os): pass

 * config/arm/arm.c (arm_compute_save_reg_mask):
 Do not save lr in case of tail call.
 * config/arm/thumb2.md (*thumb2_pop_single): New pattern.

 * gcc.target/arm/thumb2-pop-single.c: New test.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 4f04707..20d0b9e 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -19190,6 +19190,7 @@ arm_compute_save_reg_mask (void)
       || (save_reg_mask
           && optimize_size
           && ARM_FUNC_TYPE (func_type) == ARM_FT_NORMAL
+          && !crtl->tail_call_emit
           && !crtl->calls_eh_return))
     save_reg_mask |= 1 << LR_REGNUM;

diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 64acfea..29cfb17 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -267,6 +267,17 @@
    (set_attr "type" "multiple")]
 )

+;; Pop a single register as its size is preferred over a post-incremental load
+(define_insn "*thumb2_pop_single"
+  [(set (match_operand:SI 0 "low_register_operand" "=r")
+        (mem:SI (post_inc:SI (reg:SI SP_REGNUM))))]
+  "TARGET_THUMB2 && (reload_in_progress || reload_completed)"
+  "pop\t{%0}"
+  [(set_attr "type" "load1")
+   (set_attr "length" "2")
+   (set_attr "predicable" "yes")]
+)
+
 ;; We have two alternatives here for memory loads (and similarly for stores)
 ;; to reflect the fact that the permissible constant pool ranges differ
 ;; between ldr instructions taking low regs and ldr instructions taking high
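
The new test, gcc.target/arm/thumb2-pop-single.c, is listed in the
ChangeLog above but its body is not quoted here. A minimal sketch of what
such a test might look like (the directives, options and scan pattern are
my assumptions, not necessarily the committed test):

/* { dg-do compile } */
/* { dg-options "-mthumb -Os" } */
/* { dg-require-effective-target arm_thumb2_ok } */

int f4 (int b, int a, int c, int d);

int g (int a, int b, int c, int d)
{
  return f4 (b, a, c, d);
}

/* { dg-final { scan-assembler "pop\t{r4}" } } */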



This is OK, thanks. Please CC me on ARM-specific patches; this one
somehow seems to have missed my filters.



Ramana

Re: [patch, arm] Minor optimization on thumb2 tail call

2015-01-12 Thread Joey Ye
Ping
