I was wondering if anyone would care to elaborate on the advantages, if any, of moving to 64-bit?
Coincidentally, I've spent the last 5 days working on 64-bit optimizations...
I have ported prime95 to the Windows 64-bit OS. It hasn't been through QA, but it
passes the torture test.
The trial factoring code was completely rewritten. On an Opteron it is significantly
faster than the 32-bit code - nearly twice as fast.
An Intel guy has played around with using the extra registers and is reporting an 8%
speedup. All attempts I've made to use the extra registers on the Opteron have
been ineffective.
The AMD64 (both 32 and 64 bit) code was suffering from 2 bottlenecks.
1) The basic building blocks run faster on a P4 than on the AMD64. For example
the four_complex_fft macro running from the L1 data cache takes 95 clocks on a P4
and takes 129 clocks on the Opteron.
2) The P4 was accessing data from the L2 cache faster than the Opteron. The same
macro running from the L2 cache was taking 100 clocks on a P4 but 180 clocks on the
Opteron.
The bad news is that nothing I've tried has made the Opteron as fast as the P4. Neither
using extra registers, nor reordering code, nor changing the code to use the ADD/MUL/STORE
pipelines differently made any significant difference (exception: moving stores forward
as far as possible gets the time down to 117 clocks).
The good news is that the L2 cache penalty can be eliminated by using the
prefetchw instruction (but not prefetch or prefetcht0). The majority of the time
prime95 knows the addresses for the next building block and can use prefetchw
accordingly.
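
To illustrate the idea in the paragraph above, here is a minimal C sketch, not
prime95's actual code (which is hand-written assembly). It assumes a GCC or Clang
compiler, where __builtin_prefetch(addr, 1, 3) requests a prefetch with write
intent; on x86 the compiler may lower this to prefetchw when the target supports
it (for example with GCC's -mprfchw flag), and otherwise typically emits an
ordinary prefetch. The names process_block, process_all, and the 64-byte
CACHE_LINE are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Hint that every cache line in [p, p + bytes) will soon be written. */
static void prefetch_for_write(const void *p, size_t bytes)
{
    const char *c = (const char *)p;
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        __builtin_prefetch(c + off, 1, 3);  /* rw = 1: write intent, locality = 3 */
}

/* Hypothetical stand-in for one "building block" of work on a data block. */
static void process_block(double *blk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        blk[i] = blk[i] * 2.0 + 1.0;
}

/* Process consecutive blocks. Because the address of the next block is
 * known in advance, start pulling it into cache before working on the
 * current one, so the memory latency overlaps with computation. */
void process_all(double *data, size_t nblocks, size_t block_len)
{
    assert(block_len > 0);
    for (size_t b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)
            prefetch_for_write(data + (b + 1) * block_len,
                               block_len * sizeof(double));
        process_block(data + b * block_len, block_len);
    }
}
```

The prefetch is only a hint and never changes the results, so the effect shows up
purely in timing; whether it helps on a given CPU has to be measured.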
I'm close to finishing a version of prime95 that uses the prefetchw optimization, which
gives about a 15% gain. I'll also investigate whether this idea can help 32-bit Athlons.
