https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77491
Bug ID: 77491
Summary: Suboptimal code produced with unnecessary moving of values on/off stack
Product: gcc
Version: 6.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: dhowells at redhat dot com
Target Milestone: ---

Created attachment 39567
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39567&action=edit
Test source

The attached program produces unnecessary instructions that move register values on and off the stack. Compiled with Fedora 24 gcc-6.1.1-3 20160621, using gcc -Os, for the first function I see:

0000000000000000 <jump>:
   0:   9c                      pushfq
   1:   59                      pop    %rcx
   2:   fa                      cli
   3:   8b 07                   mov    (%rdi),%eax
   5:   89 44 24 fc             mov    %eax,-0x4(%rsp)
   9:   8b 54 24 fc             mov    -0x4(%rsp),%edx
   d:   83 fa 17                cmp    $0x17,%edx
  10:   0f 94 c0                sete   %al
  13:   75 06                   jne    1b <jump+0x1b>
  15:   c7 07 2b 00 00 00       movl   $0x2b,(%rdi)
  1b:   51                      push   %rcx
  1c:   9d                      popfq
  1d:   8b 54 24 fc             mov    -0x4(%rsp),%edx
  21:   89 16                   mov    %edx,(%rsi)
  23:   c3                      retq

The instruction at 9 is unnecessary - either the value in EAX could be moved directly to EDX, or the comparison at d could be made against EAX. The instructions at 5, 1d and 21 could be combined to place the result directly in (%rsi) rather than shuffling it on and off the stack.

Looking at the second function:

0000000000000024 <jump2>:
  24:   9c                      pushfq
  25:   58                      pop    %rax
  26:   fa                      cli
  27:   8b 17                   mov    (%rdi),%edx
  29:   89 54 24 fc             mov    %edx,-0x4(%rsp)
  2d:   8b 54 24 fc             mov    -0x4(%rsp),%edx
  31:   83 fa 17                cmp    $0x17,%edx
  34:   75 06                   jne    3c <jump2+0x18>
  36:   c7 07 2b 00 00 00       movl   $0x2b,(%rdi)
  3c:   50                      push   %rax
  3d:   9d                      popfq
  3e:   8b 44 24 fc             mov    -0x4(%rsp),%eax
  42:   89 44 24 f8             mov    %eax,-0x8(%rsp)
  46:   8b 44 24 f8             mov    -0x8(%rsp),%eax
  4a:   c3                      retq

It would be best if the flags were stashed in ECX, not EAX, as happens with the first function. This would allow the return value to be loaded into EAX by instruction 27. The comparison at 31 could then be made against EAX directly. Instructions 29, 2d, 3e, 42 and 46 are all redundant.
Changing the #if in the code to disable the inline asm doesn't show all that much improvement in either function. Doing this also allows it to be built for aarch64 - which also shows unnecessary stack shuffling:

0000000000000000 <jump>:
   0:   d10043ff        sub     sp, sp, #0x10
   4:   b9400002        ldr     w2, [x0]
   8:   b9000fe2        str     w2, [sp,#12]
   c:   b9400fe2        ldr     w2, [sp,#12]
  10:   71005c5f        cmp     w2, #0x17
  14:   1a9f17e3        cset    w3, eq
  18:   54000061        b.ne    24 <jump+0x24>
  1c:   52800562        mov     w2, #0x2b       // #43
  20:   b9000002        str     w2, [x0]
  24:   b9400fe0        ldr     w0, [sp,#12]
  28:   b9000020        str     w0, [x1]
  2c:   2a0303e0        mov     w0, w3
  30:   910043ff        add     sp, sp, #0x10
  34:   d65f03c0        ret

0000000000000038 <jump2>:
  38:   d10043ff        sub     sp, sp, #0x10
  3c:   b9400001        ldr     w1, [x0]
  40:   b9000fe1        str     w1, [sp,#12]
  44:   b9400fe1        ldr     w1, [sp,#12]
  48:   71005c3f        cmp     w1, #0x17
  4c:   54000061        b.ne    58 <jump2+0x20>
  50:   52800561        mov     w1, #0x2b       // #43
  54:   b9000001        str     w1, [x0]
  58:   b9400fe0        ldr     w0, [sp,#12]
  5c:   b9000be0        str     w0, [sp,#8]
  60:   b9400be0        ldr     w0, [sp,#8]
  64:   910043ff        add     sp, sp, #0x10
  68:   d65f03c0        ret
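
For readers without the attachment, the listings above can be read back into C roughly as follows. This is only a sketch inferred from the disassembly, not the attached test source: the parameter names, the types, and the enum names are assumptions, and the interrupt-flag save/restore (pushfq/cli/popfq) is left out, as in the variant with the inline asm disabled. It shows what the functions appear to compute; it is not claimed to reproduce the stack traffic, which presumably depends on how the attachment reads the value.

```c
#include <stdbool.h>

/* Hypothetical names; the values are the immediates seen in the listings. */
enum { EXPECTED = 0x17, REPLACEMENT = 0x2b };

/* Apparent behaviour of <jump>: load *p, replace it if it equals 0x17,
 * hand the value that was read back through *old, and return whether
 * the comparison matched. */
bool jump(int *p, int *old)
{
    int cur = *p;                     /* the single load at offset 3 / 4 */
    bool matched = (cur == EXPECTED); /* cmp + sete / cset */

    if (matched)
        *p = REPLACEMENT;             /* movl $0x2b,(%rdi) / str w2,[x0] */
    *old = cur;                       /* store through the second argument */
    return matched;
}

/* Apparent behaviour of <jump2>: same conditional replace, but the value
 * that was read is the return value. */
int jump2(int *p)
{
    int cur = *p;

    if (cur == EXPECTED)
        *p = REPLACEMENT;
    return cur;
}
```

Under this reading, each function needs one load, one compare, at most one conditional store, and no stack slot at all, which is the baseline the redundant instructions listed above are measured against.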