> Hi all,
> 
> The mailing list has been quiet.  I hope everyone enjoyed
> a happy Thanksgiving (or at least a good weekend for non-U.S. readers).

    The focus has been on double-checking, by a different mechanism
than the one that made the original count, alas in a non-Mersenne context.

>       I've received 2 queries about the recently released Pentium 4
> and prime95.  I have no timings at this point, but I figured some folks
> would like to know how the architecture helps or hurts our cause.  I've
> downloaded the manuals and have the following observations:

     (stuff deleted)

> 2)  The P4 introduces SSE2 instructions.  Intel hopes new programs
> stop using the old FPU instructions and start using these new instructions.
> The SSE2 instructions work on 2 floating point values at the same time!
> An ADD takes 4 clocks, but can only issue every other clock cycle.  A
> MUL takes 6 clocks and also can be issued every other clock cycle.
> 
> The theoretical maximum throughput for SSE2 is one ADD *AND* one
> MUL every clock cycle.  The average latency is 2 for an ADD and 3 for
> a MUL.
> 
> Summary:  If a program can be effectively recoded to use SSE2,
> then it can have greater throughput than even the Athlon.  Of course,
> months ago I had hoped that the P4 would be able to get a throughput
> of 2 ADDs and 2 MULs per clock cycle.  Maybe in a few years, a
> future P4 or AMD chip will do this.
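
     To make the quoted arithmetic explicit, here is a minimal sketch.
It is only one reading of the figures above: it assumes the "one ADD
per clock" throughput comes from 2 values per instruction divided by
one issue every 2 clocks, and that the "average latency" is the
instruction latency divided by the 2 values produced.  The names are
made up for illustration; nothing here is measured on real hardware.

```python
# Figures quoted above for SSE2 on the P4 (illustrative constants only).
LANES = 2            # floating point values handled per SSE2 instruction
ISSUE_INTERVAL = 2   # clocks between successive issues of the same operation
ADD_LATENCY = 4      # clocks for one packed ADD
MUL_LATENCY = 6      # clocks for one packed MUL


def results_per_clock(lanes, issue_interval):
    """Peak number of scalar results per clock for one kind of operation."""
    return lanes / issue_interval


def latency_per_value(latency, lanes):
    """Instruction latency spread across the values the instruction produces."""
    return latency / lanes


add_rate = results_per_clock(LANES, ISSUE_INTERVAL)
mul_rate = results_per_clock(LANES, ISSUE_INTERVAL)
print("ADD results per clock:", add_rate)
print("MUL results per clock:", mul_rate)
print("ADD latency per value:", latency_per_value(ADD_LATENCY, LANES))
print("MUL latency per value:", latency_per_value(MUL_LATENCY, LANES))
```

Under these assumptions the sketch reproduces the quoted numbers: one
ADD and one MUL result per clock, with per-value latencies of 2 and 3.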

     I understand that the SSE2 instructions operate only on
64-bit (and 32-bit) floating point data, whereas the 
FPU registers support 80-bit intermediate results.
How will the loss of precision affect the FFT length?

    Vector processors such as the Cray typically support both

              vector op vector -> vector
              vector op scalar -> vector

opcodes, so one can (for example) form all b[i]^2 - 4.0*a[i]*c[i]
when solving several quadratic equations.
[We need two vector*vector multiplications,
one vector*scalar multiplication,
and one vector-vector subtraction.]
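
    The operation count above can be sketched as follows.  This is a
plain Python model, not Cray code: the helper names (vec_mul_vec,
vec_mul_scalar, vec_sub_vec, discriminants) are hypothetical, and lists
stand in for vector registers.

```python
def vec_mul_vec(x, y):
    """vector op vector -> vector (elementwise multiply)."""
    return [xi * yi for xi, yi in zip(x, y)]


def vec_mul_scalar(x, s):
    """vector op scalar -> vector (multiply every element by s)."""
    return [xi * s for xi in x]


def vec_sub_vec(x, y):
    """vector op vector -> vector (elementwise subtract)."""
    return [xi - yi for xi, yi in zip(x, y)]


def discriminants(a, b, c):
    """Form b[i]^2 - 4.0*a[i]*c[i] for every quadratic at once."""
    b2 = vec_mul_vec(b, b)                # vector*vector multiplication no. 1
    ac = vec_mul_vec(a, c)                # vector*vector multiplication no. 2
    four_ac = vec_mul_scalar(ac, 4.0)     # the single vector*scalar multiplication
    return vec_sub_vec(b2, four_ac)       # the single vector-vector subtraction


# Three quadratics: x^2 - 3x + 2, x^2 + 2x + 1, 2x^2 + 4x + 2.
a = [1.0, 1.0, 2.0]
b = [-3.0, 2.0, 4.0]
c = [2.0, 1.0, 2.0]
print(discriminants(a, b, c))  # prints [1.0, 0.0, 0.0]
```

The scalar 4.0 appears exactly once, as an operand of the vector*scalar
step; no array of 4.0s is ever built.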
I find it strange that the MMX, XMM, and SSE2 instruction sets
lack vector*scalar operations and also lack a way to make multiple
copies of the constant 4.0, other than to store multiple copies
in memory.  While data replication in memory 
(or adding a[i]*c[i] to itself twice) may be acceptable here, 
we don't want multiple copies of the table of roots of unity,
for example.
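
    The two workarounds mentioned above can be sketched like this.
Again it is a plain Python model under stated assumptions: a 2-element
list stands in for one packed register, and packed_mul, packed_add,
FOUR_PACKED, four_ac_replicated and four_ac_doubling are hypothetical
names, not real instructions or library calls.

```python
WIDTH = 2  # values per packed register in this model


def packed_mul(x, y):
    """Packed multiply: both operands must be full-width packed values."""
    assert len(x) == WIDTH and len(y) == WIDTH
    return [x[i] * y[i] for i in range(WIDTH)]


def packed_add(x, y):
    """Packed add: both operands must be full-width packed values."""
    assert len(x) == WIDTH and len(y) == WIDTH
    return [x[i] + y[i] for i in range(WIDTH)]


# Workaround 1: keep one copy of the constant per lane in memory.
FOUR_PACKED = [4.0] * WIDTH


def four_ac_replicated(a, c):
    """4*a*c using the replicated constant."""
    return packed_mul(FOUR_PACKED, packed_mul(a, c))


# Workaround 2: no constant at all; double a*c twice.
def four_ac_doubling(a, c):
    """4*a*c as (ac + ac) + (ac + ac), using two packed adds."""
    ac = packed_mul(a, c)
    two_ac = packed_add(ac, ac)
    return packed_add(two_ac, two_ac)


a = [1.5, -2.0]
c = [2.0, 3.0]
print(four_ac_replicated(a, c))
print(four_ac_doubling(a, c))
```

Both routes give the same values here, since doubling a floating point
number is exact barring overflow.  Neither trick carries over to a
table of roots of unity: there every entry is a different constant, so
the replicated layout would double the size of the table in this
2-wide model.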

        Peter Montgomery


_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.exu.ilstu.edu/mersenne/faq-mers.txt
