With code like this (see the attachment for a complete, silly test case),
unsigned int quad = 0;
for (unsigned int dec=node.count/4; dec; --dec) {
        _mm_prefetch(1024+(const char *)&base[quad], _MM_HINT_NTA);
        sampler.countl[0] = _mm_add_epi32(sampler.countl[0],
                _mm_cmpgt_epi32(sampler.pos[0], base[quad+0]));
        sampler.countl[1] = _mm_add_epi32(sampler.countl[1],
                _mm_cmpgt_epi32(sampler.pos[1], base[quad+0]));
        sampler.countl[0] = _mm_add_epi32(sampler.countl[0],
                _mm_cmpgt_epi32(sampler.pos[0], base[quad+1]));
        sampler.countl[1] = _mm_add_epi32(sampler.countl[1],
                _mm_cmpgt_epi32(sampler.pos[1], base[quad+1]));
        sampler.countl[0] = _mm_add_epi32(sampler.countl[0],
                _mm_cmpgt_epi32(sampler.pos[0], base[quad+2]));
        sampler.countl[1] = _mm_add_epi32(sampler.countl[1],
                _mm_cmpgt_epi32(sampler.pos[1], base[quad+2]));
        sampler.countl[0] = _mm_add_epi32(sampler.countl[0],
                _mm_cmpgt_epi32(sampler.pos[0], base[quad+3]));
        sampler.countl[1] = _mm_add_epi32(sampler.countl[1],
                _mm_cmpgt_epi32(sampler.pos[1], base[quad+3]));
        quad += 4;
}

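g++ 4.2 insists on using the same register for addressing 'base[quad]' and for
prefetching 1k away from it: the register is biased by +1024, so the prefetch
gets no displacement while every load needs a 32-bit one (0xfffffc00 and up).
Horrible encoding ensues.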

With gcc-4.2-20060311/gcc-4.2-20060826 -O3 -march=k8 I get something like:
  401080:       66 0f 6f c2             movdqa %xmm2,%xmm0
  401084:       0f 18 00                prefetchnta (%eax)
  401087:       66 0f 66 80 00 fc ff ff         pcmpgtd 0xfffffc00(%eax),%xmm0
  40108f:       66 0f fe 42 20          paddd  0x20(%edx),%xmm0
  401094:       0f 29 42 20             movaps %xmm0,0x20(%edx)
  401098:       66 0f 6f c1             movdqa %xmm1,%xmm0
  40109c:       66 0f 66 80 00 fc ff ff         pcmpgtd 0xfffffc00(%eax),%xmm0
  4010a4:       66 0f fe 42 30          paddd  0x30(%edx),%xmm0
  4010a9:       0f 29 42 30             movaps %xmm0,0x30(%edx)
  4010ad:       66 0f 6f c2             movdqa %xmm2,%xmm0
  4010b1:       66 0f 66 80 10 fc ff ff         pcmpgtd 0xfffffc10(%eax),%xmm0
  4010b9:       66 0f fe 42 20          paddd  0x20(%edx),%xmm0
  4010be:       0f 29 42 20             movaps %xmm0,0x20(%edx)
  4010c2:       66 0f 6f c1             movdqa %xmm1,%xmm0
etc...

There are other issues with the code produced, e.g. gcc writing values back to
memory instead of just keeping them live in registers, but I can kludge around
them. I cannot fix that silly encoding, though.
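
For illustration, one possible kludge for the write-back issue is to keep the
counters in locals for the duration of the loop and store them once at the
end. This is only a sketch and not necessarily the workaround actually used;
the 'Sampler' struct and the 'accumulate' function are hypothetical stand-ins
for the types in the attachment.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA

// Hypothetical stand-in for the 'sampler' object of the report.
struct Sampler {
    __m128i pos[2];
    __m128i countl[2];
};

// Same accumulation as the reported loop, but the counters live in locals
// so the compiler has no reason to store them back on every step.
inline void accumulate(Sampler &sampler, const __m128i *base, unsigned int count) {
    __m128i c0 = sampler.countl[0], c1 = sampler.countl[1];
    const __m128i p0 = sampler.pos[0], p1 = sampler.pos[1];
    unsigned int quad = 0;
    for (unsigned int dec = count / 4; dec; --dec) {
        _mm_prefetch(1024 + (const char *)&base[quad], _MM_HINT_NTA);
        for (unsigned int k = 0; k < 4; ++k) {
            // cmpgt yields -1 in each lane where pos > base, 0 elsewhere
            c0 = _mm_add_epi32(c0, _mm_cmpgt_epi32(p0, base[quad + k]));
            c1 = _mm_add_epi32(c1, _mm_cmpgt_epi32(p1, base[quad + k]));
        }
        quad += 4;
    }
    // Single write-back after the loop.
    sampler.countl[0] = c0;
    sampler.countl[1] = c1;
}
```

This only addresses the redundant stores; it does nothing about the choice of
address register, which is the actual subject of this report.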

msvc8, icc9.1 and g++ 3.4.4 do a much better job; here is the g++ 3.4.4 output:
 401084:       prefetchnta 0x400(%eax)
 40108b:       movdqa %xmm5,%xmm2
 40108f:       movdqa %xmm4,%xmm1
 401093:       movdqa %xmm3,%xmm0
 401097:       pcmpgtd (%eax),%xmm2
 40109b:       paddd  %xmm2,%xmm6
 40109f:       movaps %xmm6,0x20(%edx)
 4010a3:       movdqa %xmm5,%xmm2
 4010a7:       pcmpgtd (%eax),%xmm1
 4010ab:       paddd  %xmm1,%xmm0
 4010af:       movaps %xmm0,0x30(%edx)
 4010b3:       movdqa %xmm4,%xmm1
 4010b7:       pcmpgtd 0x10(%eax),%xmm2
 4010bc:       paddd  %xmm2,%xmm6
 4010c0:       movaps %xmm6,0x20(%edx)
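
To put a number on the difference between the two listings, here is a rough
count of the displacement bytes one loop iteration needs, depending on how far
the address register is biased from 'base'. It is a simplified model of x86
addressing (no displacement, 8-bit, or 32-bit) that ignores the %ebp special
case; the helper names are made up for this illustration.

```cpp
#include <cstdint>

// Displacement bytes an x86 memory operand needs for 'disp'
// (simplified: a base of %ebp would always need at least one byte).
inline int disp_bytes(std::int32_t disp) {
    if (disp == 0) return 0;
    if (disp >= -128 && disp <= 127) return 1;
    return 4;
}

// Displacement bytes for one iteration of the loop in this report:
// one prefetch at base+1024 and two loads from each of base+0, +16,
// +32 and +48, with the address register holding base + bias.
inline int iteration_disp_bytes(std::int32_t bias) {
    int total = disp_bytes(1024 - bias);      // the prefetch
    for (std::int32_t off = 0; off < 64; off += 16)
        total += 2 * disp_bytes(off - bias);  // the two pcmpgtd loads
    return total;
}
```

Under this model iteration_disp_bytes(0) is 10 (the g++ 3.4.4 choice) and
iteration_disp_bytes(1024) is 32 (the g++ 4.2 choice), which matches the
pcmpgtd instructions growing from 4 or 5 bytes to 8 bytes each in the listings.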

Note that -fprefetch-loop-arrays's heuristic is way off the mark and
counterproductive, even for that simplified testcase.


-- 
           Summary: overzealous pointer coalescence leading to poor encoding
           Product: gcc
           Version: 4.2.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tbptbp at gmail dot com
  GCC host triplet: x86*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28919
