I am seeing speed regressions with gcc-4.4 when using SSE intrinsics on a Core 2 (x86_64). The code is:

namespace detail {

/** compute x1 * (1 + x2 * amount) */
__m128 inline amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1, _mm_add_ps(one, _mm_mul_ps(x2, amount)));
}

} /* namespace detail */

template <>
inline void amp_mod4(float * out, const float * in1, const float * in2,
                     const float amount, unsigned int n)
{
    n = n >> 2;
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);
    do {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;
        const __m128 result = detail::amp_mod4_loop(x1, x2, amnt, one);
        _mm_store_ps(out, result);
        out += 4;
    } while (--n);
}

The results for different compilers (using hardware performance counters) are:

compiler                    cycles  branch misses
gcc-4.4                 1416276094         425897
gcc-4.4 -march=core2    1520034636        3263912
gcc-4.3                 1548838336        5990424
gcc-4.3 -march=core2    1386605444           5609
gcc-4.2                 1321697674           3682

It seems that gcc-4.3 with -march=core2 and gcc-4.2 generate code that is friendlier to the branch predictor. Tuning for core2 on gcc-4.4 actually seems to generate worse code: it takes more cycles and more branch misses than gcc-4.4 without -march.

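The snippet above does not show detail::gen_one(), so it cannot be compiled as posted. Below is a minimal sketch for reproducing the kernel, assuming gen_one() fills every lane with 1.0f by integer shifts (all-ones shifted right by 25 and then left by 23 gives 0x3f800000, the bit pattern of 1.0f). The names amp_mod4_simd, amp_mod_scalar and amp_mod4_selfcheck are hypothetical and not part of the original code. The template specialization is replaced by a plain function so that the block stands alone.

```cpp
#include <emmintrin.h> /* SSE2 integer intrinsics, also pulls in the SSE float ones */

namespace detail {

/* ASSUMPTION: stand-in for the gen_one() used in the report.
 * all-ones >> 25 = 0x0000007f, then << 23 = 0x3f800000 = 1.0f in each lane */
inline __m128 gen_one()
{
    __m128i x = _mm_setzero_si128();
    x = _mm_cmpeq_epi32(x, x); /* x == x is true, so every bit gets set */
    x = _mm_srli_epi32(x, 25);
    x = _mm_slli_epi32(x, 23);
    return _mm_castsi128_ps(x); /* reinterpret the bits as floats */
}

/** compute x1 * (1 + x2 * amount) */
inline __m128 amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1, _mm_add_ps(one, _mm_mul_ps(x2, amount)));
}

} /* namespace detail */

/* Same loop as in the report. n must be a nonzero multiple of 4 and
 * all three pointers must be 16-byte aligned. */
inline void amp_mod4_simd(float * out, const float * in1, const float * in2,
                          const float amount, unsigned int n)
{
    n = n >> 2;
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);
    do {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;
        _mm_store_ps(out, detail::amp_mod4_loop(x1, x2, amnt, one));
        out += 4;
    } while (--n);
}

/* Scalar reference, used only to check the SIMD result. */
inline void amp_mod_scalar(float * out, const float * in1, const float * in2,
                           const float amount, unsigned int n)
{
    for (unsigned int i = 0; i != n; ++i)
        out[i] = in1[i] * (1.0f + in2[i] * amount);
}

/* Returns true if the SIMD and scalar versions agree on a small test vector.
 * The inputs are chosen so that every intermediate value is exactly
 * representable as a float, which makes an exact comparison safe. */
inline bool amp_mod4_selfcheck()
{
    const unsigned int n = 64;
    float in1[64] __attribute__((aligned(16)));  /* GCC-style alignment */
    float in2[64] __attribute__((aligned(16)));
    float simd[64] __attribute__((aligned(16)));
    float ref[64] __attribute__((aligned(16)));

    for (unsigned int i = 0; i != n; ++i) {
        in1[i] = float(i);
        in2[i] = 0.25f * float(i);
    }
    amp_mod4_simd(simd, in1, in2, 0.5f, n);
    amp_mod_scalar(ref, in1, in2, 0.5f, n);

    for (unsigned int i = 0; i != n; ++i)
        if (simd[i] != ref[i])
            return false;
    return true;
}
```

Calling amp_mod4_selfcheck() once from main() before timing amp_mod4_simd() confirms that each compiler is benchmarking the same computation. The sketch does not reproduce the benchmark harness (bench_1_simd) itself, which is not shown in the report.
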
The best code (gcc-4.2) is:

0000000000400de0 <bench_1_simd(unsigned int)>:
  400de0:  66 0f ef c0           pxor    %xmm0,%xmm0
  400de4:  c1 ef 02              shr     $0x2,%edi
  400de7:  0f 28 15 32 0f 00 00  movaps  0xf32(%rip),%xmm2        # 401d20 <_IO_stdin_used+0xb0>
  400dee:  31 c0                 xor     %eax,%eax
  400df0:  66 0f 76 c0           pcmpeqd %xmm0,%xmm0
  400df4:  66 0f 72 d0 19        psrld   $0x19,%xmm0
  400df9:  66 0f 72 f0 17        pslld   $0x17,%xmm0
  400dfe:  0f 28 c8              movaps  %xmm0,%xmm1
  400e01:  0f 28 80 e0 26 60 00  movaps  0x6026e0(%rax),%xmm0
  400e08:  0f 59 c2              mulps   %xmm2,%xmm0
  400e0b:  0f 58 c1              addps   %xmm1,%xmm0
  400e0e:  0f 59 80 e0 25 60 00  mulps   0x6025e0(%rax),%xmm0
  400e15:  0f 29 80 e0 24 60 00  movaps  %xmm0,0x6024e0(%rax)
  400e1c:  48 83 c0 10           add     $0x10,%rax
  400e20:  83 ef 01              sub     $0x1,%edi
  400e23:  75 dc                 jne     400e01 <bench_1_simd(unsigned int)+0x21>
  400e25:  f3 c3                 repz retq
  400e27:  90                    nop
  400e28:  0f 1f 84 00 00 00 00  nopl    0x0(%rax,%rax,1)

The worst gcc-4.4 code (-march=core2) is about 15% slower than that:

0000000000400e70 <bench_1_simd(unsigned int)>:
  400e70:  66 0f ef d2           pxor    %xmm2,%xmm2
  400e74:  89 fa                 mov     %edi,%edx
  400e76:  66 0f 76 d2           pcmpeqd %xmm2,%xmm2
  400e7a:  c1 ea 02              shr     $0x2,%edx
  400e7d:  66 0f 72 d2 19        psrld   $0x19,%xmm2
  400e82:  ff ca                 dec     %edx
  400e84:  66 0f 72 f2 17        pslld   $0x17,%xmm2
  400e89:  48 ff c2              inc     %rdx
  400e8c:  0f 28 0d 7d 17 00 00  movaps  0x177d(%rip),%xmm1       # 402610 <_IO_stdin_used+0xb0>
  400e93:  48 c1 e2 04           shl     $0x4,%rdx
  400e97:  31 c0                 xor     %eax,%eax
  400e99:  0f 1f 80 00 00 00 00  nopl    0x0(%rax)
  400ea0:  0f 28 c1              movaps  %xmm1,%xmm0
  400ea3:  0f 59 80 e0 36 60 00  mulps   0x6036e0(%rax),%xmm0
  400eaa:  0f 58 c2              addps   %xmm2,%xmm0
  400ead:  0f 59 80 e0 35 60 00  mulps   0x6035e0(%rax),%xmm0
  400eb4:  0f 29 80 e0 34 60 00  movaps  %xmm0,0x6034e0(%rax)
  400ebb:  48 83 c0 10           add     $0x10,%rax
  400ebf:  48 39 d0              cmp     %rdx,%rax
  400ec2:  75 dc                 jne     400ea0 <bench_1_simd(unsigned int)+0x30>
  400ec4:  f3 c3                 repz retq
  400ec6:  66 2e 0f 1f 84 00 00  nopw    %cs:0x0(%rax,%rax,1)
  400ecd:  00 00 00

--
           Summary: [4.4 Regression] speed regression with sse intrinsics
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38671