Re: Mersenne: What I've learned about the P4 thusfar (long)

George Woltman Mon, 01 Jan 2001 19:44:36 -0800
Hi Jason,

At 10:02 PM 1/1/2001 -0500, Jason Stratos Papadopoulos wrote:
> > The cause is the new 64-byte cache line.  The L1 cache is write-back.
> > Any changes to the L1 cache are written back to the L2 cache.  On the P3
>
>Two things here; first, doesn't the P4 have 128 byte cache lines? Also,
>do you mean write *back* or write *through*?

The L2 cache line is 128 bytes, but the L1 cache line is 64 bytes.

I get back/through confused all the time.  I meant write-through: when you
write to the L1 cache the data is also written to the L2 cache, but not to
main memory (until the line is evicted from the L2 cache).

> > P4 has a 64 byte L1 cache line.  The update of the L2 cache is done with two
> > writes of 32 bytes each taking 7 clocks.  Thus, the P4 macro takes 8*14=112
> > clocks.
>
>Are you sure of this timing?

Yes.

>  The memory bus to L2 is supposed to be 256
>bits wide,

256 bits is 32 bytes which is why it takes 2 writes to copy an
L1 cache line to the L2 cache.
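
For anyone who wants to check the arithmetic, here is a minimal sketch
using only the figures quoted in this thread.  The constant names are
mine, and LINES_PER_MACRO = 8 is an assumption read off the 8*14 figure
(eight L1 lines written per macro invocation), not something measured.

```python
# Back-of-envelope model of the L1 -> L2 update cost discussed above.
# All figures come from this thread; LINES_PER_MACRO is an assumption.
L1_LINE_BYTES = 64       # L1 cache line size
L2_WRITE_BYTES = 32      # 256-bit path to the L2 cache
CLOCKS_PER_WRITE = 7     # figure attributed to Tom's Hardware Guide
LINES_PER_MACRO = 8      # assumed: the "8" in 8*14


def writes_per_line(line_bytes=L1_LINE_BYTES, write_bytes=L2_WRITE_BYTES):
    """Number of L2 writes needed to copy one L1 line (ceiling division)."""
    return -(-line_bytes // write_bytes)


def clocks_per_line(clocks_per_write=CLOCKS_PER_WRITE):
    """Clocks to push one full L1 line out to the L2 cache."""
    return writes_per_line() * clocks_per_write


def macro_clocks(lines=LINES_PER_MACRO):
    """Total clocks if the macro writes `lines` distinct L1 lines."""
    return lines * clocks_per_line()


if __name__ == "__main__":
    print(writes_per_line(), clocks_per_line(), macro_clocks())
```

This only reproduces the multiplication; it says nothing about whether
the writes can overlap with other work.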

>  and is supposed to deliver data once per clock (or maybe every
>other clock, the Coppermine P3 does that). This would mean a cache line
>write back would take at least four processor clocks.

The 7 clock figure came from Tom's Hardware Guide; I'm sure it is also
in the Intel manual somewhere.  It explained the 112 clock result I was
seeing so neatly that I've not looked for the Intel reference.

>  Or is that
>super-speed delivery only supposed to be for reads, with writes being
>buffered?

I'm not sure what the latency is for a read from the L2 cache.  Your
theory could be correct.  I'm sure that Intel engineers want to optimize
the read path more than the write path.

>Since the new L1 cache is tiny and the L2 is very large and almost as
>fast, would your life become any easier if you coded for L2 cache latency
>instead of L1? Most of the work gets done in a single pass through main
>memory, then the "rows" of the transform get processed in a second pass
>that can barely tolerate L2 latency.

My plan is to work through main memory in 128KB chunks.  Within
these 128KB chunks I'll make several passes working on 2KB sub-chunks
to maximize L1 cache hits.
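
A rough sketch of that visiting order, under one reading of the plan:
each pass over a 128KB chunk walks it one 2KB sub-chunk at a time.  The
function name, the passes_per_chunk parameter and the contiguous
sub-chunk layout are all hypothetical; the real code may well use
strided sub-chunks and a different number of passes.

```python
# Hypothetical visiting order for the two-level blocking described above.
CHUNK_BYTES = 128 * 1024      # sized to stay resident in the L2 cache
SUB_CHUNK_BYTES = 2 * 1024    # sized to stay resident in the L1 cache


def blocked_schedule(total_bytes, passes_per_chunk):
    """Yield (pass_index, start, end) byte ranges in visiting order.

    Main memory is walked once, chunk by chunk.  Each chunk is swept
    passes_per_chunk times, one sub-chunk at a time (an assumption).
    """
    for chunk_start in range(0, total_bytes, CHUNK_BYTES):
        chunk_end = min(chunk_start + CHUNK_BYTES, total_bytes)
        for pass_index in range(passes_per_chunk):
            for sub_start in range(chunk_start, chunk_end, SUB_CHUNK_BYTES):
                sub_end = min(sub_start + SUB_CHUNK_BYTES, chunk_end)
                yield pass_index, sub_start, sub_end


if __name__ == "__main__":
    # Example: 256KB of data, 3 passes per chunk (both values made up).
    for step in list(blocked_schedule(256 * 1024, 3))[:4]:
        print(step)
```

The point of the ordering is that main memory is touched once per
chunk, while the repeated passes hit data that is already in cache.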

>  Or is the latency of SSE2
>instructions so high that you run out of registers and rely on renaming
>already?

The latency is high, but so far I've not had too much trouble with register
pressure.  The four_complex_unfft macro used 3 temporary memory
locations due to register pressure.  Fortunately, store-forwarding seems
to be keeping the cost to a minimum.

Regards,
George

_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers