http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090
Bug #: 53090
Summary: suboptimal ivopt
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: xinlian...@gmail.com

When the attached benchmark code is compiled with trunk gcc, the code
generated for the hot memory swap loop (line 60) is very inefficient: both
icc and llvm use two ivs and generate a tight loop with 9 instructions, but
gcc decides to use 3 ivs, and the loop exit test is weird and inefficient.
It ends up producing a loop with 11 instructions.

    #define XCH(x,y) { Aint t_mp; t_mp=(x); (x)=(y); (y)=t_mp; }

    for( i=1, j=k-1 ; i<j ; ++i, --j ) {
        XCH(perm[i], perm[j])
    }

The tight version:

    .LBB0_11:                       # %for.body57.i
                                    # Parent Loop BB0_1 Depth=1
                                    # Parent Loop BB0_9 Depth=2
                                    # => This Inner Loop Header: Depth=3
        movl    (%rbx,%rdi,4), %ebp
        movl    (%rbx,%rsi,4), %eax
        movl    %eax, (%rbx,%rdi,4)
        movl    %ebp, (%rbx,%rsi,4)
        decq    %rsi
        incq    %rdi
        cmpl    %edx, %edi
        leal    -1(%rdx), %edx
        jl      .LBB0_11

The gcc version:

    .L18:
        movl    (%rdx), %edi
        addl    $1, %ecx
        movl    (%rsi), %eax
        movl    %eax, (%rdx)
        addq    $4, %rdx
        movl    %edi, (%rsi)
        movl    %r8d, %edi
        subq    $4, %rsi
        subl    %ecx, %edi
        cmpl    %edi, %ecx
        jl      .L18

However, gcc does the right thing on the extracted test case:

    #define XCH(x,y) { int t_mp; t_mp=(x); (x)=(y); (y)=t_mp; }

    void foo (int *perm, int k)
    {
        int i,j;
        for( i=1, j=k-1 ; i<j ; ++i, --j ) {
            XCH(perm[i], perm[j])
        }
    }
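
For comparison, here is a minimal sketch of the same swap loop written by hand with two pointer ivs and an exit test that compares the pointers directly. The names swap_indexed, swap_pointers and variants_agree are hypothetical and not part of the attached benchmark. The sketch only shows that the two source forms are equivalent. It does not claim to match what any compiler does internally, and whether it changes the code gcc generates for the full benchmark has not been checked here.

```c
#include <string.h>

#define XCH(x,y) { int t_mp; t_mp=(x); (x)=(y); (y)=t_mp; }

/* Index form, as in the extracted test case: two integer ivs, i and j. */
void swap_indexed(int *perm, int k)
{
    int i, j;
    for (i = 1, j = k - 1; i < j; ++i, --j) {
        XCH(perm[i], perm[j])
    }
}

/* Hand-written pointer form (illustrative): two pointer ivs, lo and hi,
   with the exit test comparing them directly. */
void swap_pointers(int *perm, int k)
{
    int *lo, *hi;
    if (k < 3)
        return;            /* nothing to swap; also avoids forming perm-1 */
    lo = perm + 1;
    hi = perm + (k - 1);
    for (; lo < hi; ++lo, --hi) {
        XCH(*lo, *hi)
    }
}

/* Returns 1 when both variants leave the same array contents for a given k
   (k must not exceed 16, the size of the scratch arrays). */
int variants_agree(int k)
{
    int a[16], b[16], n;
    if (k > 16)
        return 0;
    for (n = 0; n < 16; ++n)
        a[n] = b[n] = n * 3 + 1;
    swap_indexed(a, k);
    swap_pointers(b, k);
    return memcmp(a, b, sizeof a) == 0;
}
```

Compiling both functions with -O2 -S and comparing the inner loops is one way to see which ivs the optimizer selects for each form.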