[Bug inline-asm/38671] New: [4.4 Regression] speed regression with sse intrinsics

tim at klingt dot org Tue, 30 Dec 2008 04:58:41 -0800

I am seeing a speed regression with gcc-4.4 when using SSE intrinsics on a
Core 2 (x86_64). The code is:


namespace detail
{
/** compute x1 * (1 + x2 * amount)  */
__m128 inline amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1,
                      _mm_add_ps(one,
                                 _mm_mul_ps(x2, amount)));
}
} /* namespace detail */

template <>
inline void amp_mod4(float * out, const float * in1, const float * in2,
                     const float amount, unsigned int n)
{
    n = n >> 2;
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);

    do
    {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;

        const __m128 result = detail::amp_mod4_loop(x1, x2, amnt, one);

        _mm_store_ps(out, result);
        out += 4;
    }
    while (--n);
}
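
As a cross-check for the SSE kernel above, the per-element computation
out[i] = in1[i] * (1 + in2[i] * amount) can be written as a plain scalar loop.
This is only a sketch: the name amp_mod_scalar is hypothetical and does not
appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference (not part of the original report):
// out[i] = in1[i] * (1 + in2[i] * amount) for every i in [0, n).
inline void amp_mod_scalar(float* out, const float* in1, const float* in2,
                           float amount, std::size_t n)
{
    assert(out != nullptr && in1 != nullptr && in2 != nullptr);
    for (std::size_t i = 0; i != n; ++i)
        out[i] = in1[i] * (1.0f + in2[i] * amount);
}
```

Note that the do/while loop in amp_mod4 runs its body before testing the
counter, so it relies on n being at least 4; the scalar sketch has no such
requirement and also accepts lengths that are not a multiple of 4.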

The results for the different compilers (measured with hardware performance counters) are:
gcc-4.4:
cycles: 1416276094
branch misses: 425897

gcc-4.4 -march=core2:
cycles: 1520034636
branch misses: 3263912

gcc-4.3:
cycles: 1548838336
branch misses: 5990424

gcc-4.3 -march=core2:
cycles: 1386605444
branch misses: 5609

gcc-4.2:
cycles: 1321697674
branch misses: 3682

It seems that gcc-4.3 with -march=core2 and gcc-4.2 generate code that is
friendlier to the branch predictor. Tuning for core2 on gcc-4.4 actually
seems to generate worse code.
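
To put the cycle counts side by side, the relative slowdown against the
gcc-4.2 run can be computed as (cycles - baseline) / baseline. The sketch
below uses only the numbers reported above; the constant and function names
are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycle counts copied from the measurements in this report.
constexpr std::uint64_t cycles_gcc44       = 1416276094ULL;
constexpr std::uint64_t cycles_gcc44_core2 = 1520034636ULL;
constexpr std::uint64_t cycles_gcc43       = 1548838336ULL;
constexpr std::uint64_t cycles_gcc43_core2 = 1386605444ULL;
constexpr std::uint64_t cycles_gcc42       = 1321697674ULL;

// Relative slowdown of `cycles` versus `baseline`, in percent.
inline double slowdown_percent(std::uint64_t cycles, std::uint64_t baseline)
{
    assert(baseline != 0);
    return 100.0 * (static_cast<double>(cycles) - static_cast<double>(baseline))
                 / static_cast<double>(baseline);
}
```

Relative to gcc-4.2 this gives roughly 7.2% for gcc-4.4, 15.0% for
gcc-4.4 -march=core2, 17.2% for gcc-4.3 and 4.9% for gcc-4.3 -march=core2.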

The fastest code (gcc-4.2) is:
0000000000400de0 <bench_1_simd(unsigned int)>:
  400de0:       66 0f ef c0             pxor   %xmm0,%xmm0
  400de4:       c1 ef 02                shr    $0x2,%edi
  400de7:       0f 28 15 32 0f 00 00    movaps 0xf32(%rip),%xmm2        # 401d20 <_IO_stdin_used+0xb0>
  400dee:       31 c0                   xor    %eax,%eax
  400df0:       66 0f 76 c0             pcmpeqd %xmm0,%xmm0
  400df4:       66 0f 72 d0 19          psrld  $0x19,%xmm0
  400df9:       66 0f 72 f0 17          pslld  $0x17,%xmm0
  400dfe:       0f 28 c8                movaps %xmm0,%xmm1
  400e01:       0f 28 80 e0 26 60 00    movaps 0x6026e0(%rax),%xmm0
  400e08:       0f 59 c2                mulps  %xmm2,%xmm0
  400e0b:       0f 58 c1                addps  %xmm1,%xmm0
  400e0e:       0f 59 80 e0 25 60 00    mulps  0x6025e0(%rax),%xmm0
  400e15:       0f 29 80 e0 24 60 00    movaps %xmm0,0x6024e0(%rax)
  400e1c:       48 83 c0 10             add    $0x10,%rax
  400e20:       83 ef 01                sub    $0x1,%edi
  400e23:       75 dc                   jne    400e01 <bench_1_simd(unsigned int)+0x21>
  400e25:       f3 c3                   repz retq 
  400e27:       90                      nop    
  400e28:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
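
The pcmpeqd / psrld $0x19 / pslld $0x17 sequence at the top of the listing
appears to be the inlined detail::gen_one(), whose source is not included in
this report. Shifting an all-ones 32-bit lane right by 25 leaves 0x7f, and
shifting that left by 23 gives 0x3f800000, the IEEE-754 bit pattern of 1.0f.
A portable scalar model of one lane, with hypothetical function names, could
look like this:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of one 32-bit lane of: pcmpeqd / psrld $0x19 / pslld $0x17.
inline std::uint32_t one_lane_bits()
{
    std::uint32_t lane = 0xFFFFFFFFu; // pcmpeqd %xmm0,%xmm0: all bits set
    lane >>= 25;                      // psrld $0x19: leaves 0x0000007F
    lane <<= 23;                      // pslld $0x17: gives 0x3F800000
    return lane;
}

// Reinterpret the lane bits as a float without violating aliasing rules.
inline float one_lane_value()
{
    const std::uint32_t bits = one_lane_bits();
    assert(bits == 0x3F800000u); // bit pattern of IEEE-754 single 1.0f
    float value;
    static_assert(sizeof value == sizeof bits, "float must be 32 bits wide");
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

With SSE2 intrinsics the same idea would presumably be expressed with
_mm_cmpeq_epi32, _mm_srli_epi32, _mm_slli_epi32 and _mm_castsi128_ps, which
avoids loading the constant from memory.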

The gcc-4.4 -march=core2 code, which is about 15% slower than the gcc-4.2 code, is:
0000000000400e70 <bench_1_simd(unsigned int)>:
  400e70:       66 0f ef d2             pxor   %xmm2,%xmm2
  400e74:       89 fa                   mov    %edi,%edx
  400e76:       66 0f 76 d2             pcmpeqd %xmm2,%xmm2
  400e7a:       c1 ea 02                shr    $0x2,%edx
  400e7d:       66 0f 72 d2 19          psrld  $0x19,%xmm2
  400e82:       ff ca                   dec    %edx
  400e84:       66 0f 72 f2 17          pslld  $0x17,%xmm2
  400e89:       48 ff c2                inc    %rdx
  400e8c:       0f 28 0d 7d 17 00 00    movaps 0x177d(%rip),%xmm1        # 402610 <_IO_stdin_used+0xb0>
  400e93:       48 c1 e2 04             shl    $0x4,%rdx
  400e97:       31 c0                   xor    %eax,%eax
  400e99:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400ea0:       0f 28 c1                movaps %xmm1,%xmm0
  400ea3:       0f 59 80 e0 36 60 00    mulps  0x6036e0(%rax),%xmm0
  400eaa:       0f 58 c2                addps  %xmm2,%xmm0
  400ead:       0f 59 80 e0 35 60 00    mulps  0x6035e0(%rax),%xmm0
  400eb4:       0f 29 80 e0 34 60 00    movaps %xmm0,0x6034e0(%rax)
  400ebb:       48 83 c0 10             add    $0x10,%rax
  400ebf:       48 39 d0                cmp    %rdx,%rax
  400ec2:       75 dc                   jne    400ea0 <bench_1_simd(unsigned int)+0x30>
  400ec4:       f3 c3                   repz retq 
  400ec6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400ecd:       00 00 00
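
Comparing the two listings, the loop bodies differ mainly in loop control:
the gcc-4.2 code decrements a 32-bit trip counter (sub $0x1,%edi; jne), while
the gcc-4.4 code precomputes an end byte offset before the loop (dec %edx;
inc %rdx; shl $0x4,%rdx) and compares the running offset against it
(cmp %rdx,%rax; jne). The sketch below models both shapes in C++ to show that
they perform the same number of iterations. It is a reading of the
disassembly with hypothetical function names, not an explanation of the
measured slowdown.

```cpp
#include <cassert>
#include <cstdint>

// Shape of the gcc-4.2 loop: advance the byte offset, then decrement a
// 32-bit trip counter and branch while it is non-zero.
inline std::uint64_t final_offset_countdown(std::uint32_t n)
{
    assert(n >= 4); // the original do/while also relies on this
    std::uint32_t count = n >> 2;          // shr $0x2,%edi
    std::uint64_t offset = 0;              // xor %eax,%eax
    do {
        offset += 16;                      // add $0x10,%rax
    } while (--count != 0);                // sub $0x1,%edi; jne
    return offset;
}

// Shape of the gcc-4.4 -march=core2 loop: compute the end offset once,
// then compare the running offset against it on every iteration.
inline std::uint64_t final_offset_compare(std::uint32_t n)
{
    assert(n >= 4);
    const std::uint32_t last = (n >> 2) - 1;              // shr; dec %edx
    const std::uint64_t end =
        (static_cast<std::uint64_t>(last) + 1) << 4;      // inc %rdx; shl $0x4
    std::uint64_t offset = 0;                             // xor %eax,%eax
    do {
        offset += 16;                                     // add $0x10,%rax
    } while (offset != end);                              // cmp %rdx,%rax; jne
    return offset;
}
```

Both functions return 16 * (n / 4), the number of bytes processed per array.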


-- 
           Summary: [4.4 Regression] speed regression with sse intrinsics
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tim at klingt dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38671
