RE: Mersenne: K7 vs. x86

Willmore, David Mon, 23 Aug 1999 13:52:23 -0700
> From: Brian J. Beesley [SMTP:[EMAIL PROTECTED]]
> This is _still_ remarkable, since the "consumer" Athlons starting to 
> trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache, 
> like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock 
> speed (c.f. full clock speed for Xeon & 1/2 clock speed for PII/PIII)
> 
Well, they do run the point-to-point bus clock at 100MHz, but they
send/receive data on each clock transition, so it's 200M-transfers/sec.
And, I thought the production Athlons were using 1/2 speed cache--it was
only the pre-production (evaluation) processors which were using the 1/3
speed.  Of course, I could easily be wrong.

> The high performance Athlons with 128 bit FSB @ 200 MHz & larger, 
> relatively faster L2 cache should be really impressive. Maybe 
> starting to approach what Alpha has been doing for a while ;-) (The 
> critical difference here is that Athlon does run native IA32 code!)
> 
Yes, 400 M-transfers/second at 16 bytes each is a nice 6.4GB/s. :)  Run the
8MB L2 at 1:1 with the processor at 800MHz and get twice that. :)  I can't
find a line in the document stating the width of the L2 bus, but I would be
surprised if it's < 128 bits.  256 for a server version would be nice. 12.8
GB/s to 25.6 GB/s, geezz....
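
The arithmetic above can be sketched as follows.  This is only a
back-of-envelope check of peak numbers, not a measurement.  The function
name is made up for illustration, 1 GB is taken as 10**9 bytes, and the L2
figures assume one transfer per core clock, which the document does not
confirm.

```python
def peak_bandwidth_gb_s(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s, taking 1 GB = 10**9 bytes (hypothetical helper)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_sec * bytes_per_transfer / 1e9


# 64-bit bus clocked at 100 MHz, data on both clock edges: 200 M-transfers/s
consumer_fsb = peak_bandwidth_gb_s(2 * 100e6, 64)

# 128-bit bus clocked at 200 MHz, data on both clock edges: 400 M-transfers/s
wide_fsb = peak_bandwidth_gb_s(2 * 200e6, 128)

# L2 at 1:1 with an 800 MHz core; one transfer per clock is an assumption
l2_128_bit = peak_bandwidth_gb_s(800e6, 128)
l2_256_bit = peak_bandwidth_gb_s(800e6, 256)

for label, value in [
    ("64-bit FSB, 200 MT/s", consumer_fsb),
    ("128-bit FSB, 400 MT/s", wide_fsb),
    ("128-bit L2, 800 MHz", l2_128_bit),
    ("256-bit L2, 800 MHz", l2_256_bit),
]:
    print(f"{label}: {value:.1f} GB/s")
```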

> > If you were pipelining fmuls, the Athlon could spit them out in 1 clock
> > cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle
> > latency) on the PIII, so it would be REAL important to get lots of yummy
> > pipelined fmuls to the Athlon to really let it strut its stuff.
> 
> Am I missing something here? I thought that the _throughput_ was 1 
> FMUL per clock, but there was a 4 clock period between the 
> instruction entering the execution unit (meaning that it has already 
> had to be prefetched, decoded and the operands made available) and 
> the result of the operation becoming available. So, provided there is 
> no delay in fetching instructions, there is capacity in the decoders 
> and there is no stall due to operands being unavailable, you _should_ 
> get a throughput of 1 FMUL per clock (assuming that you aren't also 
> scheduling other instructions which block the multiplier execution 
> unit, or use its pipeline).
> 
I believe what he's saying is that there is a bit of 'granularity' to the
pipe in the PII.  It sounds like an FMUL can only enter the pipe every other
cycle--to emerge five cycles later.  If so, that's what used to be called
'superpipelined'.  Hmmm, no, that would be backwards.  Maybe it's
'subpipelined'.  Think of it as a 2.5 cycle long pipe running at half core
speed.  This can result from a not fully pipelined stage in the middle of
the pipe.  Say single precision goes:

stageA, stageB, stageC

but double precision goes:

stageA, stageB, stageB, stageC, stageC

That way, stage B is used for two successive cycles by the same operand.
This will force stage A to stall waiting for the following stage to clear.
This isn't all that unlikely a design when the stages are normally built
for single precision.
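
A toy model of that guess, for anyone who wants to see the numbers fall out.
The stage layout is the hypothetical one above, not documented PII behaviour,
and the function and variable names are made up.  One simplification: instead
of holding an operation in stage A while it waits, the model just delays its
entry into the pipe, which gives the same issue rate and latency.

```python
def schedule(stage_sequence, n_ops):
    """Return (issue_cycle, finish_cycle) for n_ops identical ops issued in order.

    stage_sequence names the stage an op occupies on each cycle after issue.
    An op enters the pipe at the earliest cycle where none of its stage slots
    collides with a slot already claimed by an earlier op.
    """
    busy = set()  # (stage, cycle) pairs already claimed by earlier ops
    results = []
    start = 0
    for _ in range(n_ops):
        # Slide the entry cycle forward until every stage slot is free.
        while any((stage, start + i) in busy
                  for i, stage in enumerate(stage_sequence)):
            start += 1
        for i, stage in enumerate(stage_sequence):
            busy.add((stage, start + i))
        results.append((start, start + len(stage_sequence)))
        start += 1  # at most one op may enter per cycle
    return results


SINGLE = ["A", "B", "C"]            # fully pipelined: each stage used once
DOUBLE = ["A", "B", "B", "C", "C"]  # stages B and C each held for two cycles

single_times = schedule(SINGLE, 4)
double_times = schedule(DOUBLE, 4)

print("single:", single_times)
print("double:", double_times)
```

With these stage lists the single precision ops enter on every cycle, while
the double precision ops enter every second cycle and each one finishes five
cycles after it went in.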

Cheers,
David


_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers