> Hi all,
>
> The mailing list has been quiet. I hope everyone enjoyed
> a happy Thanksgiving (or at least a good weekend for non-U.S. readers).
The focus has been on double-checking, by a different mechanism
than that which made the original count, alas in a non-Mersenne context.
> I've received 2 queries about the recently released Pentium 4
> and prime95. I have no timings at this point, but I figured some folks
> would like to know how the architecture helps or hurts our cause. I've
> downloaded the manuals and have the following observations:
(stuff deleted)
> 2) The P4 introduces SSE2 instructions. Intel hopes new programs
> stop using the old FPU instructions and start using these new instructions.
> The SSE2 instructions work on 2 floating point values at the same time!
> An ADD takes 4 clocks, but can only issue every other clock cycle. A
> MUL takes 6 clocks and also can be issued every other clock cycle.
>
> The theoretical maximum throughput for SSE2 is one ADD *AND* one
> MUL every clock cycle. The average latency is 2 for an ADD and 3 for
> a MUL.
>
> Summary: If a program can be effectively recoded to use SSE2,
> then it can have greater throughput than even the Athlon. Of course,
> months ago I had hoped that the P4 would be able to get a throughput
> of 2 ADDs and 2 MULs per clock cycle. Maybe in a few years, a
> future P4 or AMD chip will do this.
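
As a quick check on the figures quoted above, here is a small sketch of the
arithmetic. It assumes (my reading, not something the quoted text states
outright) that "average latency" means instruction latency divided by the
two values each SSE2 instruction handles. The function name and its
parameters are mine, chosen only for illustration.

```python
from fractions import Fraction

def per_value_figures(latency_clocks, issue_interval_clocks, values_per_instruction):
    """Return (values completed per clock, latency averaged over the packed values)."""
    throughput = Fraction(values_per_instruction, issue_interval_clocks)
    avg_latency = Fraction(latency_clocks, values_per_instruction)
    return throughput, avg_latency

# Quoted figures: ADD takes 4 clocks, MUL takes 6 clocks,
# each issues every other clock and works on 2 values.
add_throughput, add_avg_latency = per_value_figures(4, 2, 2)
mul_throughput, mul_avg_latency = per_value_figures(6, 2, 2)

print("ADD:", add_throughput, "per clock, average latency", add_avg_latency)
print("MUL:", mul_throughput, "per clock, average latency", mul_avg_latency)
```

Under that reading the numbers agree with the quoted summary: one ADD result
and one MUL result per clock, with average latencies of 2 and 3.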
I understand that the SSE2 instructions operate only on
64-bit (and 32-bit) floating point data, whereas the
FPU registers support 80-bit intermediate results.
How will the loss of precision affect the FFT length?
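
To put a number on the gap, here is a minimal sketch. The 64-bit significand
of the x87 80-bit extended format is entered as a constant rather than
measured, and the variable names are mine. It only quantifies the difference
in significand width; it does not answer the FFT length question.

```python
import sys

# Significand width of a 64-bit double as reported by the Python runtime.
DOUBLE_SIGNIFICAND_BITS = sys.float_info.mant_dig

# Significand width of the x87 80-bit extended format (stated, not measured).
X87_EXTENDED_SIGNIFICAND_BITS = 64

bits_lost = X87_EXTENDED_SIGNIFICAND_BITS - DOUBLE_SIGNIFICAND_BITS
print("significand bits given up per intermediate result:", bits_lost)
```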
Vector processors such as the Cray typically support both

    vector op vector -> vector
    vector op scalar -> vector

opcodes, so one can (for example) form all b[i]^2 - 4.0*a[i]*c[i]
when solving several quadratic equations.
[We need two vector*vector multiplications,
one vector*scalar multiplication,
one vector-vector subtraction.]
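
The operation count above can be sketched in scalar Python, with plain lists
standing in for vector registers. The helper names (vec_mul, vec_scale,
vec_sub, discriminants) are hypothetical and do not correspond to any real
instruction set; this is an illustration of the two opcode forms, not of how
a Cray or the P4 would be programmed.

```python
def vec_mul(x, y):
    """vector op vector -> vector (elementwise product)."""
    return [xi * yi for xi, yi in zip(x, y)]

def vec_scale(s, x):
    """vector op scalar -> vector (one scalar applied to every element)."""
    return [s * xi for xi in x]

def vec_sub(x, y):
    """vector op vector -> vector (elementwise difference)."""
    return [xi - yi for xi, yi in zip(x, y)]

def discriminants(a, b, c):
    """Form b[i]^2 - 4.0*a[i]*c[i] for every i."""
    b_squared = vec_mul(b, b)          # vector*vector
    ac = vec_mul(a, c)                 # vector*vector
    four_ac = vec_scale(4.0, ac)       # vector*scalar
    return vec_sub(b_squared, four_ac) # vector-vector

# Three quadratics: x^2 - 3x + 2, x^2 + 2x + 1, 2x^2 + 4x + 2
a = [1.0, 1.0, 2.0]
b = [-3.0, 2.0, 4.0]
c = [2.0, 1.0, 2.0]
result = discriminants(a, b, c)
print(result)  # [1.0, 0.0, 0.0]
```

Note that the constant 4.0 appears once, as a scalar operand, and is never
stored as a vector.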
I find it strange that the MMX, SSE (XMM), and SSE2 instruction sets
lack vector*scalar operations and also lack a way to make multiple
copies of the constant 4.0, other than to store multiple copies
in memory. While data replication in memory
(or adding a[i]*c[i] to itself twice) may be acceptable here,
we don't want multiple copies of the table of roots of unity,
for example.
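
For the 4.0 case, the "add it to itself twice" workaround can be sketched as
follows, again with Python lists standing in for vector registers and with
hypothetical helper names. Only vector+vector additions are used, and no
copy of the constant 4.0 is needed.

```python
def vec_add(x, y):
    """vector op vector -> vector (elementwise sum)."""
    return [xi + yi for xi, yi in zip(x, y)]

def times_four_by_doubling(x):
    """Compute 4*x[i] using two vector+vector additions and no constant."""
    twice = vec_add(x, x)          # 2*x
    return vec_add(twice, twice)   # 4*x

ac = [1.5, -2.0, 0.1]
quadrupled = times_four_by_doubling(ac)
print(quadrupled)
```

Doubling a binary floating point value is exact (short of overflow), so this
gives the same result as multiplying by 4.0. The trick does not carry over to
a general table of constants such as the roots of unity.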
Peter Montgomery
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.exu.ilstu.edu/mersenne/faq-mers.txt