------- Additional Comments From dank at kegel dot com 2005-06-18 22:45 -------
I asked the fellow who posted the original problem report to give me the results of 'cat /proc/cpuinfo' on the affected machine. Here it is:

    vendor_id  : GenuineIntel
    cpu family : 6
    model      : 8
    model name : Pentium III (Coppermine)
    stepping   : 10
    cpu MHz    : 896.153

This is the same as one of the two affected CPU types here.

The slow routine appears to be the buffer cleaning routine, though I haven't verified this with oprofile yet. Here's its loop:

    static char cleanse_ctr;
    ...
    while (len--) {
        *(ptr++) = cleanse_ctr;
        cleanse_ctr += (17 + (unsigned char) ((int) ptr & 0xF));
    }

and the output of -O3 -fPIC for both gcc-2.95.3 and gcc-4.0.0:

--- gcc-2.95.3 ---
.L5:
        movl [EMAIL PROTECTED](%ebx),%edi
        movb (%edi),%al
        movb %al,(%edx)
        incl %edx
        movb (%edi),%cl
        addb $17,%cl
        movb %dl,%al
        andb $15,%al
        addb %al,%cl
        movb %cl,(%edi)
        subl $1,%esi
        jnc .L5
.L4:

--- gcc-4 ---
.L4:
        movb (%esi), %al
        movb %al, (%edx)
        leal (%ecx,%edi), %eax
        andl $15, %eax
        incl %ecx
        addb (%esi), %al
        incl %edx
        addl $17, %eax
        cmpl %ecx, 12(%ebp)
        movb %al, (%esi)
        jne .L4

It's not obvious to me why the gcc-4.0.0 generated code should be slower when run on some CPUs, if in fact it is. Is it the fact that the loop condition is checked with a cmp against memory instead of a flag being set by subtracting 1 from a register?

(And where's the best place to learn about how to predict how long assembly snippets like this will take to run on various modern CPUs, anyway?)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19923
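
To measure the loop outside the full build, a small standalone harness along these lines could be compiled with each gcc at -O3 -fPIC and timed. This is only a sketch: the names cleanse_sketch and time_cleanse, the 64 KiB buffer, and the checksum read-back are made up for illustration, and it uses unsigned char and uintptr_t where the quoted loop uses char and an (int) cast, so the generated code may differ slightly from the listings above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Plays the role of the static counter in the quoted loop.
 * Assumption: unsigned char instead of plain char, so the
 * wrap-around arithmetic is the same on every platform. */
static unsigned char cleanse_ctr;

/* Hypothetical wrapper around the quoted loop. uintptr_t replaces
 * the (int) cast so this also builds cleanly on 64-bit hosts. */
void cleanse_sketch(void *buf, size_t len)
{
    unsigned char *ptr = buf;

    while (len--) {
        *(ptr++) = cleanse_ctr;
        cleanse_ctr += (unsigned char)(17 + ((uintptr_t)ptr & 0xF));
    }
}

/* Runs the loop `reps` times over `len` bytes and returns the CPU
 * time used, in seconds. One byte is read back per pass and summed
 * into *checksum so the optimizer cannot discard the stores. */
double time_cleanse(size_t len, unsigned long reps, unsigned *checksum)
{
    static unsigned char buf[65536];   /* arbitrary size for the sketch */
    unsigned sum = 0;
    clock_t start, end;
    unsigned long i;

    assert(len > 0 && len <= sizeof buf);

    start = clock();
    for (i = 0; i < reps; i++) {
        cleanse_sketch(buf, len);
        sum += buf[i % len];
    }
    end = clock();

    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_cleanse(4096, 100000, &sum) from a main() built by each compiler, on both the affected and unaffected CPUs, would show whether this loop alone accounts for the slowdown. oprofile would still be needed to confirm that it is the hot spot in the real program.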