http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48064
Summary: Optimizer produces suboptimal code for e.g. x = x ^ (x >> 1)
Product: gcc
Version: 4.5.2
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: jasper.neum...@web.de
Target: windows 32

When I compile the following OPT.CPP with gcc 4.5.2 (mingw) under Windows-32...
===
int test(int x)
{
  x = x ^ (x >> 1);
  int x1=x;
  x = x >> 2;
  x = x ^ x1;
  return x;
}
===
...a call to
  gpp -O3 -S OPT.CPP
produces this OPT.s:
===
        .file   "OPT.CPP"
        .text
        .p2align 2,,3
.globl __Z4testi
        .def    __Z4testi;      .scl    2;      .type   32;     .endef
__Z4testi:
LFB0:
        pushl   %ebp
LCFI0:
        movl    %esp, %ebp
LCFI1:
        movl    8(%ebp), %eax
        movl    %eax, %edx
        sarl    %edx
        xorl    %eax, %edx
        movl    %edx, %eax
        sarl    $2, %eax
        xorl    %edx, %eax
        leave
LCFI2:
        ret
LFE0:
===

The problem I see is that in
        movl    %eax, %edx
        sarl    %edx
        xorl    %eax, %edx
        movl    %edx, %eax
        sarl    $2, %eax
        xorl    %edx, %eax
gcc produces code which presumably costs 6 cycles (first edx and then eax
is modified 3 times in a row, so each instruction has to wait for the
previous one), whereas the equivalent statements
        movl    %eax, %edx
        sarl    %eax
        xorl    %eax, %edx
        movl    %edx, %eax
        sarl    $2, %edx
        xorl    %edx, %eax
cost only 4 cycles, since the mov and the shift can go in parallel.

I would have expected this at least for the explicit form in
  int x1=x;
  x = x >> 2;
  x = x ^ x1;
but I found no way to get gcc to output my version.

A speed test reveals that the proposed form takes only about 2/3 of the
time on an Intel Atom N450 and 3/4 of the time on an Intel i7.

Have I missed something?

By the way: when I produce output in Intel syntax, the statement "sar eax"
should be "sar eax,1". Otherwise some assemblers will complain.
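
To check the claim that the two instruction sequences in the report are equivalent, here is a minimal C++ sketch that models each register as a plain int and replays both sequences one instruction per statement. The function names (test_reference, gcc_sequence, proposed_sequence, check_equivalence) are made up for this illustration and appear nowhere in gcc or in the report. The sketch only checks that the results agree; it says nothing about cycle counts. It also assumes that >> on a negative int is an arithmetic shift, which is what gcc does, although the language left this implementation-defined before C++20.

```cpp
#include <cassert>

// The source function from the report, renamed for this sketch.
int test_reference(int x) {
    x = x ^ (x >> 1);
    int x1 = x;
    x = x >> 2;
    x = x ^ x1;
    return x;
}

// Models the sequence gcc 4.5.2 emitted: edx is written three times
// in a row, then eax is written three times in a row.
int gcc_sequence(int eax) {
    int edx = eax;   // movl %eax, %edx
    edx >>= 1;       // sarl %edx
    edx ^= eax;      // xorl %eax, %edx
    eax = edx;       // movl %edx, %eax
    eax >>= 2;       // sarl $2, %eax
    eax ^= edx;      // xorl %edx, %eax
    return eax;
}

// Models the sequence proposed in the report: each mov/sar pair reads
// the same input and writes different registers.
int proposed_sequence(int eax) {
    int edx = eax;   // movl %eax, %edx
    eax >>= 1;       // sarl %eax
    edx ^= eax;      // xorl %eax, %edx
    eax = edx;       // movl %edx, %eax
    edx >>= 2;       // sarl $2, %edx
    eax ^= edx;      // xorl %edx, %eax
    return eax;
}

// Asserts that both modelled sequences match the source function for
// every x in [lo, hi]. The counter is a long long so that hi may be
// INT_MAX without overflowing the loop variable.
void check_equivalence(int lo, int hi) {
    for (long long i = lo; i <= hi; ++i) {
        int x = static_cast<int>(i);
        assert(gcc_sequence(x) == test_reference(x));
        assert(proposed_sequence(x) == test_reference(x));
    }
}
```

Calling check_equivalence(-100000, 100000) from a main function, in a build without NDEBUG, exercises both positive and negative inputs. In the proposed ordering the first mov and sar both depend only on the incoming eax, and the second mov and sar both depend only on the xor result in edx, which is the parallelism the report refers to.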