https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82227
Bug ID: 82227
Summary: ARM thumb inefficient tailcall return sequence (multiple pops)
Product: gcc
Version: 7.1.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: arm*-*-*

int ext();
int tailcall_external() { return ext(); }
// https://godbolt.org/g/W43fxw

gcc6.3 -Os -mthumb
        push    {r4, lr}
        bl      ext
        pop     {r4}
        pop     {r1}    # two separate pop instructions aren't optimal
        bx      r1

gcc6.3 -Os -mthumb -mno-thumb-interwork
        push    {r4, lr}
        bl      ext
        pop     {r4, pc}

A 16-bit Thumb pop instruction can only pop "lo" registers and PC, not back into LR. That's why it can't pop {r4, lr} / bx lr like it does in -marm mode. But there is a more efficient way:

        pop     {r1, r2}
        bx      r2

We never needed a call-preserved register; r4 was pushed only to keep the stack aligned. So as long as we have 2 call-clobbered regs available, we can pop the padding that came from r4, and pop the saved lr, both into call-clobbered regs.

If we did need a call-preserved register for anything, two separate pop instructions are presumably better than any combination of pop-multiple and reg-reg moves.

----

This also happens with two identical functions with different names, with -Os. One compiles into a call to the other, done exactly the same way as a call to an external function. (See the godbolt link above.)

In that case, I don't understand why we can't just tail-call with a `b` instruction (like we get with -marm). Both functions are compiled to Thumb2 code, so we can jump to the other and let it do an interworking return, right?

Especially with -mno-thumb-interwork, I don't understand why tail-calls aren't optimized to a jump. (I'm not an expert on ARM / Thumb stuff, so there might be a reason I'm missing.)
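
For the second case (two identical functions with different names), a minimal sketch of a reproducer might look like the following. The function names and the loop body are hypothetical, chosen only so that both definitions are identical and non-trivial. They are not taken from the godbolt link. The assumption is that at -Os GCC turns one definition into a call to the other, as described in the report.

```c
/* Hypothetical reproducer: names and body are illustrative only,
   not copied from the original test case. */

/* First copy of the function. */
unsigned checksum_a(const unsigned char *buf, unsigned len)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}

/* Second copy: identical body, different name. At -Os the compiler
   may emit this one as a call to checksum_a instead of duplicating
   the loop; that call/return sequence is what the report is about. */
unsigned checksum_b(const unsigned char *buf, unsigned len)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}
```

Compiling this with -Os -mthumb -S (for example with an arm-none-eabi-gcc cross toolchain, which is an assumption about the setup) and comparing against -Os -marm should show whether the folded function returns via push / bl / pop / pop / bx or via a plain `b` to the other function.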