http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50856
Bug #: 50856
Summary: ARM: suboptimal code for absolute difference calculation
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com

gcc generates suboptimal code on ARM for "abs(a - b)" type of operation, which
is used for example in paeth png filter: http://www.w3.org/TR/PNG-Filters.html

Given the following test code:

    int absolute_difference1(unsigned char a, unsigned char b)
    {
        return a > b ? a - b : b - a;
    }

    int absolute_difference2(unsigned char a, unsigned char b)
    {
        int tmp = a;
        if ((tmp -= b) < 0)
            tmp = -tmp;
        return tmp;
    }

The current gcc svn trunk (r180383) generates the following code for -O2 and
-Os optimizations:

        .cpu arm10tdmi
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu vfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 4
        .eabi_attribute 34, 0
        .eabi_attribute 18, 4
        .file   "test.c"
        .text
        .align  2
        .global absolute_difference1
        .type   absolute_difference1, %function
    absolute_difference1:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        cmp     r0, r1
        rsbhi   r0, r1, r0
        rsbls   r0, r0, r1
        bx      lr
        .size   absolute_difference1, .-absolute_difference1
        .align  2
        .global absolute_difference2
        .type   absolute_difference2, %function
    absolute_difference2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        rsb     r0, r1, r0
        cmp     r0, #0
        rsblt   r0, r0, #0
        bx      lr
        .size   absolute_difference2, .-absolute_difference2
        .ident  "GCC: (GNU) 4.7.0 20111024 (experimental)"
        .section        .note.GNU-stack,"",%progbits

Even in the quite explicit second code variant ('absolute_difference2'
function), gcc does not generate the expected SUBS + NEGLT pair of
instructions.
Also, for ARMv6 capable processors, even a single USAD8 instruction could be
used here if both operands are known to have values in the [0-255] range and
if the high latency of this instruction can be hidden.
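
To back up the USAD8 remark, here is a minimal portable C sketch, not compiler output and not a proposed patch. It reproduces the two test functions from this report and compares them against a software model of USAD8 over all 8-bit input pairs. The names usad8_model and count_mismatches are hypothetical helpers introduced only for this illustration. The model assumes the documented USAD8 behaviour (sum of the absolute differences of the four byte lanes of two 32-bit registers); with both operands in [0-255] the upper three lanes are zero, so only the lowest lane contributes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The two variants from the report, reproduced unchanged. */
int absolute_difference1(unsigned char a, unsigned char b)
{
    return a > b ? a - b : b - a;
}

int absolute_difference2(unsigned char a, unsigned char b)
{
    int tmp = a;
    if ((tmp -= b) < 0)
        tmp = -tmp;
    return tmp;
}

/*
 * Hypothetical software model of USAD8 (an assumption for illustration,
 * not an intrinsic): sum of absolute differences of the four unsigned
 * byte lanes of rn and rm.
 */
uint32_t usad8_model(uint32_t rn, uint32_t rm)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (rn >> (8 * lane)) & 0xffu;
        uint32_t y = (rm >> (8 * lane)) & 0xffu;
        sum += x > y ? x - y : y - x;
    }
    return sum;
}

/*
 * Hypothetical helper: counts the (a, b) pairs in [0-255] x [0-255] for
 * which the three computations disagree.
 */
int count_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            int d1 = absolute_difference1((unsigned char)a, (unsigned char)b);
            int d2 = absolute_difference2((unsigned char)a, (unsigned char)b);
            uint32_t d3 = usad8_model((uint32_t)a, (uint32_t)b);
            if (d1 != d2 || (uint32_t)d1 != d3)
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main() and printing the result is enough to run the check on any host. The check only covers operands that already fit in 8 bits, which is the precondition stated above; it says nothing about the latency concern.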