[Bug target/82227] New: ARM thumb inefficient tailcall return sequence (multiple pops)

peter at cordes dot ca Sat, 16 Sep 2017 03:55:53 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82227


            Bug ID: 82227
           Summary: ARM thumb inefficient tailcall return sequence
                    (multiple pops)
           Product: gcc
           Version: 7.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: arm*-*-*

int ext();
int tailcall_external() { return ext(); }
 // https://godbolt.org/g/W43fxw

gcc6.3 -Os -mthumb

        push    {r4, lr}
        bl      ext
        pop     {r4}
        pop     {r1}        # two separate pop instructions aren't optimal
        bx      r1

gcc6.3 -Os -mthumb -mno-thumb-interwork

        push    {r4, lr}
        bl      ext
        pop     {r4, pc}

A 16-bit Thumb pop instruction can only pop the "lo" registers (r0-r7) and PC;
it cannot pop back into LR.  That's why gcc can't use  pop {r4, lr}  / bx lr
like it does in -marm mode.
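
To make the encoding restriction concrete, here is a minimal sketch of the
16-bit pop encoding as I understand it: 8 register-list bits for r0-r7 plus
one extra bit that selects PC, and no bit for LR.  The helper name
thumb16_pop and the REG/LR/PC macros are made up for this illustration; they
are not part of gcc or any assembler.

```c
#include <stdint.h>

/* Register-list masks: bit n stands for rn (r14 = lr, r15 = pc).
 * These macros are hypothetical helpers for this sketch only. */
#define REG(n) (1u << (n))
#define LR     REG(14)
#define PC     REG(15)

/* Return the 16-bit Thumb POP encoding for reglist, or 0 if the list
 * cannot be encoded.  Only r0-r7 and pc are representable: the low
 * 8 bits hold r0-r7 and bit 8 selects pc. */
uint16_t thumb16_pop(uint32_t reglist)
{
    uint32_t allowed = 0xFFu | PC;

    if (reglist == 0 || (reglist & ~allowed) != 0)
        return 0;                       /* e.g. lr or r8-r12 in the list */

    uint32_t p = (reglist & PC) ? 1u : 0u;
    return (uint16_t)(0xBC00u | (p << 8) | (reglist & 0xFFu));
}
```

With this sketch, thumb16_pop(REG(4) | PC) gives an encoding, while
thumb16_pop(REG(4) | LR) is rejected, which is the case that forces the
separate bx.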

But there is a more efficient way:

        pop     {r1, r2}
        bx      r2

We never needed a call-preserved register; r4 was pushed only to keep the stack
aligned.  So as long as we have 2 call-clobbered regs available, we can pop the
padding that came from r4, and pop the saved lr, both into call-clobbered regs.

If we did need a call-preserved register for anything, two separate pop
instructions are presumably better than any combination of pop-multiple and
reg-reg moves.
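
For comparison, a hypothetical test case (not from the godbolt link) where a
call-preserved register presumably is needed, because x has to survive the
call.  The name needs_saved_reg is made up, and ext is given a stand-in body
here only so the snippet links and runs on its own; to look at the return
sequence, leave ext as a bare declaration so it can't be inlined.

```c
/* Stand-in definition so this compiles and runs by itself.
 * Replace with just "int ext(void);" when inspecting the asm. */
int ext(void) { return 42; }

/* x is live across the call to ext(), so it can't stay in a
 * call-clobbered register; this is not a tail call. */
int needs_saved_reg(int x)
{
    return ext() + x;
}
```

I'd expect the value restored into the call-preserved register to matter
here, so the pop {r1, r2} trick would not apply as-is.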

----

This also happens with -Os when two functions have identical bodies but
different names.  One compiles into a call to the other, using exactly the
same sequence as for a call to an external function.  (See the godbolt link
above.)
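
A minimal sketch of that second case, with made-up function names.  My
assumption is that the merging comes from identical code folding (-fipa-icf,
which the gcc docs list as enabled at -Os), but I haven't confirmed that this
is the pass responsible.  As above, ext gets a stand-in body only so the
snippet is self-contained; use a bare declaration to reproduce the asm.

```c
/* Stand-in definition; the report uses an external ext(). */
int ext(void) { return 42; }

/* Two functions with identical bodies and different names.
 * With -Os one of them may be emitted as a call to the other. */
int tailcall_a(void) { return ext(); }
int tailcall_b(void) { return ext(); }
```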

In that case, I don't understand why we can't just tail-call with a `b`
instruction (like we get with -marm).  Both functions are compiled to Thumb
code, so we can jump to the other one and let it do the interworking return,
right?  Especially with -mno-thumb-interwork, I don't understand why
tail-calls aren't optimized to a jump.

(I'm not an expert on ARM / Thumb stuff, so there might be a reason I'm
missing.)
