http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49421
           Summary: [arm] suboptimal choice of working regs
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: ph...@gnu.org

If a leaf function requires one more working register than can be
accommodated in the call-clobbered set, gcc currently tends to push r4 and
use that next. However, in the specific case of a leaf function, it would be
better to push lr and use that as the working register, since the return can
then be done with a single pop.

Consider the made-up example:

    int f(int *a, int *b, int *c, int *d)
    {
      int i;

      for (i = 0; i < 4; i++)
        if (a[i] || b[i] || c[i] || d[i])
          return 1;

      return 0;
    }

which compiles (-march=armv6 -mtune=arm1136jf-s -O2) to:

    f:
            @ args = 0, pretend = 0, frame = 0
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
            mov     ip, #0
            str     r4, [sp, #-4]!
    .L3:
            ldr     r4, [r0, ip]
            cmp     r4, #0
            bne     .L7
            ldr     r4, [r1, ip]
            cmp     r4, #0
            bne     .L7
            ldr     r4, [r2, ip]
            cmp     r4, #0
            bne     .L7
            ldr     r4, [r3, ip]
            add     ip, ip, #4
            cmp     r4, #0
            bne     .L7
            cmp     ip, #16
            bne     .L3
            mov     r0, r4
    .L2:
            ldmfd   sp!, {r4}
            bx      lr
    .L7:
            mov     r0, #1
            b       .L2

If lr had been pushed instead of r4, the return could simply have been
"pop {pc}", replacing the two-instruction "ldmfd sp!, {r4}; bx lr" sequence.

Also, since this is arm11, it is no more expensive to push two words than
one. If the compiler had stacked both r4 and lr, it would have freed up an
extra register for the loop, which would probably have allowed the loads to
be scheduled better.
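
For illustration, here is a hand-written sketch of the code this report is asking for. It is not compiler output and has not been benchmarked. It assumes ARM state on ARMv6, where a load into pc is an interworking return, and it simply substitutes lr for r4 in the listing above and folds the epilogue into one instruction.

```asm
f:
        mov     ip, #0
        str     lr, [sp, #-4]!      @ save lr instead of r4
.L3:
        ldr     lr, [r0, ip]        @ lr now serves as the scratch register
        cmp     lr, #0
        bne     .L7
        ldr     lr, [r1, ip]
        cmp     lr, #0
        bne     .L7
        ldr     lr, [r2, ip]
        cmp     lr, #0
        bne     .L7
        ldr     lr, [r3, ip]
        add     ip, ip, #4
        cmp     lr, #0
        bne     .L7
        cmp     ip, #16
        bne     .L3
        mov     r0, lr              @ lr is known to be zero here
.L2:
        ldmfd   sp!, {pc}           @ restore and return in one instruction
.L7:
        mov     r0, #1
        b       .L2
```

The only differences from the generated code are the register used and the one-instruction return at .L2. Stacking both r4 and lr (for example "stmfd sp!, {r4, lr}" paired with "ldmfd sp!, {r4, pc}") would be the variant that frees a second scratch register for the loop.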