Quoting John R Pierce <[EMAIL PROTECTED]>:
ok, the 128-bit version of PMULUDQ has a 4 clock latency and can execute every other clock, so a 2.4GHz amd64 can, in theory, execute 2.4 BILLION 64x64->128 bit integer multiplies per second. 4 of these do a complete 128x128->256 bit, 16 do 256x256->512 bit, etc etc.
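The "4 of these" and "16 of these" counts are what schoolbook multiplication gives when each operand is split into 64-bit limbs. A minimal sketch, assuming Python integers stand in for the hardware registers; `schoolbook_mul` is a made-up name for illustration, and the additions and carry handling that real code needs are folded into Python's big-integer arithmetic rather than counted:

```python
def schoolbook_mul(a, b, limbs, limb_bits=64):
    """Multiply two (limbs * limb_bits)-bit integers limb by limb.

    Returns (product, number_of_limb_multiplies).
    """
    mask = (1 << limb_bits) - 1
    # Split each operand into little-endian limbs.
    a_limbs = [(a >> (limb_bits * i)) & mask for i in range(limbs)]
    b_limbs = [(b >> (limb_bits * i)) & mask for i in range(limbs)]

    product = 0
    multiplies = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            # One limb_bits x limb_bits -> 2*limb_bits multiply,
            # shifted into the column it belongs to.
            product += (x * y) << (limb_bits * (i + j))
            multiplies += 1
    return product, multiplies
```

With `limbs=2` the loop runs 4 limb multiplies (128x128->256), with `limbs=4` it runs 16 (256x256->512): the count grows as the square of the number of limbs.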
tell me that's not better than the FPU stuff, where there are rounding problems?
That's not better than the FPU where there's rounding problems.
An FPU multiply is one instruction. Using integer multiplies in an FFT-like setting requires several multiplies and several auxiliary operations. For platforms with high-performance integer multiplication, the speed difference is only a factor of 3-5. Multiply by 10 for platforms with crappy integer multiply speed (of which there are many).
but the 64-bit FPU multiply has only 53 bits of significance (52 stored plus an implicit leading bit; the rest go to the sign and exponent), and it generates a 53-bit result, so doing a high precision multiply requires MORE operations.
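The significand limit is easy to check directly. A small sketch, assuming an IEEE 754 double (which is what Python's `float` is on common platforms); the helper names are made up for illustration:

```python
import sys

# IEEE 754 double: 53 significant bits (52 stored + 1 implicit).
MANT_BITS = sys.float_info.mant_dig

def exact_in_double(n):
    """True if the integer n survives a round trip through a double."""
    return int(float(n)) == n

def max_exact_limb_bits(mant_bits=MANT_BITS):
    """Widest limb whose limb x limb product always fits the significand."""
    return mant_bits // 2
```

Here `exact_in_double(2**53 + 1)` is false, and so is `exact_in_double((2**27 - 1)**2)`, while the square of a 26-bit value is always held exactly. So an exact double-precision product carries at most 26 bits per operand, against 64 bits per operand for the integer multiply, which is why the same precision takes more floating-point pieces.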
but, I think I made a mistake there.... PMULUDQ does two 32x32->64's in 2 clocks. A 64x64->128 needs four of those 32x32->64 products, i.e. two PMULUDQs, so it would take 4 clocks, which is the same as the regular MUL 64,64->128.
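For reference, here is how a 64x64->128 product breaks down into four 32x32->64 partial products. This is only a sketch: Python integers with explicit masks stand in for 32- and 64-bit registers, `mul_64x64_128` is a made-up name, and a real SSE2 version would also need shuffles and adds that are not counted in the clock figures above.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def mul_64x64_128(a, b):
    """Return (hi, lo) 64-bit halves of a * b using four 32x32->64 multiplies."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four partial products; each one fits in 64 bits.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Sum the middle column (bits 32..63); the total stays below 2^34.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)

    lo = ((mid & MASK32) << 32) | (ll & MASK32)
    # Carry out of the middle column feeds the high half.
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi & MASK64, lo
```

For any 64-bit `a` and `b`, `(hi << 64) | lo` equals `a * b`. One PMULUDQ supplies two of the four partial products, so two instructions cover all four.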
On a benchmark test we ran at work, doing general purpose non-numerical operations that are memory and processor intensive, a quad Opteron 2.2GHz in 64-bit mode was 4X faster than a dual Xeon 2.8GHz. It was also over 2X faster than a 20 CPU Sun Enterprise 10000 doing the exact same workload.
This particular benchmark was a CPU bound Oracle job, using extensive PL/SQL database programming and very little disk IO (ok, heavy disk writes in the background, but virtually no read activity). It was also heavily multithreaded (dozens of Java process threads hammering on the Oracle stored procedures).
_______________________________________________
Prime mailing list
http://hogranch.com/mailman/listinfo/prime
