I was playing with -fprefetch-loop-arrays on pentium4, trying to get some speed-up with simple operations on arrays. Consider this small testcase:
#include <stdio.h>

#define NELEM 10000000
#define NITER 1000

int buf[NELEM];

int main()
{
  int i,j;
  int sum = 0;
  double ssum = 0.0;

  for (i = 0; i < NELEM; i++)
    buf[i] = i;

  for (j = 0; j < NITER; j++)
    {
      for (i = 0; i < NELEM; i++)
        sum += buf[i];
      ssum += sum;
    }

  printf ("%f\n", ssum);
  return 0;
}

gcc -O2 -march=pentium4:

time ./a.out
3347504896.000000

real    0m18.114s
user    0m17.910s
sys     0m0.072s

Using -fprefetch-loop-arrays, the run time increases drastically:

gcc -O2 -march=pentium4 -fprefetch-loop-arrays

time ./a.out
3347504896.000000

real    0m27.678s
user    0m27.611s
sys     0m0.051s

That is, a performance hit of more than 50% from using -fprefetch-loop-arrays on pentium4. The inner loop looks like:

.L5:
        prefetcht0      384(%eax)
        addl    (%eax), %edx
        addl    $4, %eax
        cmpl    %eax, %ecx
        jne     .L5

Without -fprefetch-loop-arrays, the code for the inner loop is the same (without the prefetch insn, of course).

Is everything OK with prefetches on P4?

-- 
           Summary: -fprefetch-loop-arrays increases run time considerably
           Product: gcc
           Version: 4.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: uros at kss-loka dot si
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20748
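
For comparison, the inner loop quoted above executes one prefetcht0 per 4-byte element, so every cache line is requested many times over. Whether that is the cause of the slowdown is not established here. A minimal sketch of a hand-written variant that issues one prefetch per cache line might help narrow it down. The 64-byte line size and the function names sum_plain and sum_prefetch_per_line are assumptions made for illustration. The 384-byte lookahead is copied from the generated code, and unsigned arithmetic is used only so that the sum wraps in a well-defined way. No timings have been taken with this variant.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed, not taken from the report: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64
/* Same lookahead distance as in the prefetcht0 384(%eax) insn. */
#define PREFETCH_AHEAD_BYTES 384
#define INTS_PER_LINE (CACHE_LINE_BYTES / sizeof(int))

/* Baseline: plain summation, one add per element. */
unsigned sum_plain(const int *buf, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned)buf[i];
    return sum;
}

/* Hypothetical variant: one prefetch per cache line instead of
   one per element. */
unsigned sum_prefetch_per_line(const int *buf, size_t n)
{
    unsigned sum = 0;
    size_t i = 0;

    for (; i + INTS_PER_LINE <= n; i += INTS_PER_LINE) {
        /* The address is formed with integer arithmetic so it may point
           past the end of buf; a prefetch is only a hint and does not
           fault. Arguments: 0 = read, 3 = keep in all cache levels
           (prefetcht0 on x86). */
        uintptr_t ahead = (uintptr_t)&buf[i] + PREFETCH_AHEAD_BYTES;
        __builtin_prefetch((const void *)ahead, 0, 3);

        for (size_t k = 0; k < INTS_PER_LINE; k++)
            sum += (unsigned)buf[i + k];
    }

    /* Remaining elements that do not fill a whole line. */
    for (; i < n; i++)
        sum += (unsigned)buf[i];

    return sum;
}
```

Replacing the inner loop of the testcase with a call to sum_prefetch_per_line (buf, NELEM) and building without -fprefetch-loop-arrays would show whether the per-element prefetch rate matters on this machine.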