http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47010
Summary: Missed optimization: x86-64 prologue not deleted
Product: gcc
Version: 4.5.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: [email protected]
ReportedBy: [email protected]
Created attachment 22818
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22818
pre-processed bzipped source code
g++ 4.5.1 generates the following code on x86-64 (Mac OS X 10.6). The function
is static, so g++ may even have modified its argument list. I believe the three
frame-pointer instructions "pushq %rbp", "movq %rsp,%rbp", and "leave" are
unnecessary: the routine makes no calls and never touches the stack. It is
called from a compute-intensive inner loop that already has trouble fitting
into the level 1 instruction cache.
The disassembled routine is:
__ZL20PDstandardNth11_implPKdll.clone.1:
0000000000000140 pushq %rbp
0000000000000141 movupd 0x10(%rdi),%xmm3
0000000000000146 movupd 0xf0(%rdi),%xmm0
000000000000014b movupd 0x08(%rdi),%xmm2
0000000000000150 addpd %xmm3,%xmm0
0000000000000154 movupd 0xf8(%rdi),%xmm1
0000000000000159 movq %rsp,%rbp
000000000000015c addpd %xmm2,%xmm1
0000000000000160 mulpd 0x000a0578(%rip),%xmm1
0000000000000168 addpd %xmm0,%xmm1
000000000000016c movupd (%rdi),%xmm0
0000000000000170 mulpd 0x000a0578(%rip),%xmm0
0000000000000178 leave
0000000000000179 addpd %xmm1,%xmm0
000000000000017d ret
The original function is defined as:
static CCTK_REAL_VEC PDstandardNth11_impl(CCTK_REAL const* restrict const u,
ptrdiff_t const dj, ptrdiff_t const dk) __attribute__((pure))
__attribute__((noinline)) __attribute__((unused));
static CCTK_REAL_VEC PDstandardNth11_impl(CCTK_REAL const* restrict const u,
ptrdiff_t const dj, ptrdiff_t const dk)
{
  return kmadd(ToReal(30),
               vec_loadu_maybe3(0,0,0,(u)[(0)+dj*(0)+dk*(0)]),
               kmadd(ToReal(-16),
                     kadd(vec_loadu_maybe3(-1,0,0,(u)[(-1)+dj*(0)+dk*(0)]),
                          vec_loadu_maybe3(1,0,0,(u)[(1)+dj*(0)+dk*(0)])),
                     kadd(vec_loadu_maybe3(-2,0,0,(u)[(-2)+dj*(0)+dk*(0)]),
                          vec_loadu_maybe3(2,0,0,(u)[(2)+dj*(0)+dk*(0)]))));
}
where CCTK_REAL is double, and CCTK_REAL_VEC is __m128d, the SSE2 vector of
doubles. The function body contains macros that translate directly to Intel
SSE2 vector instructions.
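For readers without the attachment, here is a minimal sketch of what the
function body might reduce to. It assumes that kmadd(a,b,c) computes a*b+c,
that kadd is a vector add, that ToReal broadcasts a scalar, and that
vec_loadu_maybe3 is an unaligned load. The real macro definitions are in the
attached source and may differ, and all names below (real_vec, to_real, k_add,
k_madd, load_u, stencil11) are hypothetical stand-ins. I have not verified that
this reduced form reproduces the prologue.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (__m128d, _mm_*_pd)

// Hypothetical stand-ins for the macros in the report; the real
// definitions live in the attached pre-processed source.
typedef __m128d real_vec;

// Assumed: ToReal broadcasts one double into both vector lanes.
static inline real_vec to_real(double x) { return _mm_set1_pd(x); }

// Assumed: kadd is a plain vector addition.
static inline real_vec k_add(real_vec a, real_vec b) { return _mm_add_pd(a, b); }

// Assumed: kmadd(a, b, c) computes a * b + c.
static inline real_vec k_madd(real_vec a, real_vec b, real_vec c) {
  return _mm_add_pd(_mm_mul_pd(a, b), c);
}

// Assumed: vec_loadu_maybe3 boils down to an unaligned two-double load.
static inline real_vec load_u(const double* p) { return _mm_loadu_pd(p); }

// Five-point stencil along the first index, mirroring the shape of
// PDstandardNth11_impl: 30*u[0] - 16*(u[-1] + u[1]) + (u[-2] + u[2]).
__attribute__((noinline))
static real_vec stencil11(const double* u) {
  return k_madd(to_real(30.0), load_u(u),
                k_madd(to_real(-16.0),
                       k_add(load_u(u - 1), load_u(u + 1)),
                       k_add(load_u(u - 2), load_u(u + 2))));
}
```

Such a leaf function only reads memory through its pointer argument and keeps
all intermediate values in xmm registers, so it should not need a stack frame.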
The code was compiled with gcc 4.5.1 with the options
g++-mp-4.5 -g3 -m128bit-long-double -march=native -std=gnu++0x -O3
-funsafe-loop-optimizations -fsee -ftree-loop-linear -ftree-loop-im -fivopts
-fvect-cost-model -funroll-loops -funroll-all-loops
-fvariable-expansion-in-unroller -fprefetch-loop-arrays -ffast-math
-fassociative-math -freciprocal-math -fno-trapping-math -fexcess-precision=fast
-fopenmp -Wall -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align
-Woverloaded-virtual
I attach the complete pre-processed and bzipped source code. The source code
itself is auto-generated.