> There have previously been several posts discussing the performance penalty
> one suffers when running multiple LL tests on a multiprocessor system
> with a single shared system bus. It would be interesting to see whether
> this penalty could be alleviated in a reasonably cost-effective fashion
> through use of larger L2 caches.
> 
This is a fundamental problem in processor/system design.  The quick answer
is no.

> Here's the idea: if the L2 caches are much smaller than the dataset size,
> the system bus will be heavily used by each process, leading to memory
> delays. If each processor has an L2 cache large enough to hold the full
> dataset (between 4 and 5MB for an LL test of an exponent ~10M), there
> will be essentially no memory traffic and no shared-bus penalty. Such
> large caches are expensive. But it seems that between these two extremes
> (very small and very large caches) there should be some kind of optimal
> compromise, where the cache size is just large enough that the resulting
> reduction in memory traffic allows each process to access the main memory
> at basically the same speed it would if there were no other jobs competing
> for bus bandwidth, i.e. where the bus traffic just begins to saturate.
> 
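For reference, a quick back-of-the-envelope check on that 4-5MB figure, in a
few lines of Python (just a sketch; the 512K-point FFT size is my assumption
for an exponent near 10M, and the real footprint also includes trig tables
and scratch space):

    # Rough LL working-set estimate (sketch only).
    fft_length = 512 * 1024            # points, assumed for a ~10M exponent
    bytes_per_point = 8                # double precision
    working_set = fft_length * bytes_per_point
    print(working_set / 2**20, "MB")   # -> 4.0 MB, in line with the 4-5MB above
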
The other processor's load on the memory bus may only use half of the
available bandwidth (for a 2-CPU system), but the latency due to bus
contention will still cost you performance.  The curve of performance loss
versus foreign bus utilization is not very sharp; you start paying well
before the bus saturates.  With modern processors supporting many
outstanding loads from main memory, it's not getting any sharper.
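
To illustrate what I mean by the curve not being sharp, here is a toy
single-queue approximation in Python (my own oversimplification, not a real
model of a shared front-side bus): even at moderate foreign utilization,
your own memory requests already wait noticeably longer than on an idle bus.

    # Toy model: effective memory latency vs. bus utilization by the
    # *other* CPU, treating the bus as a single M/M/1-style server.
    # This is an illustration, not a measurement.
    base_latency = 1.0                       # idle-bus latency, arbitrary units
    for util in (0.0, 0.25, 0.5, 0.75, 0.9):
        effective = base_latency / (1.0 - util)
        print(f"foreign utilization {util:.0%}: ~{effective:.1f}x idle latency")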

> I suspect for LL tests in the ~10M range, this happy medium may be as
> 'small' as 1-2MB. Are PC systems with L2 caches in this size range
> available? If so, how much of a premium does one pay for the extra cache?
> 
The cheap answer these days is a faster shared bus.  Next to that is a
switched memory fabric (throw out the shared bus idea).  Only after that is
increasing the caches on package or on chip.  External caches are getting
less and less popular, for reasons similar to what I've mentioned above.
The fundamental problem of caches (and layers of them) is that smaller and
'closer' (in terms of latency) competes with larger and 'farther'.  Off chip
and off package is just *so* far away these days that it doesn't buy you
much unless you put a ton of fast memory there.  At that point, you have to
worry that it's just eating bandwidth that could be carrying data from main
memory.  So, you put it on a dedicated bus, but now you have to ask 'would
it have just been better to make the pipe to main memory wider with these
pins?'

*stops for breath*

Okay, the quick answer is that, for PC hardware, the Intel Xeon systems are
the only ones with more than 512K of cache on the package.  Those guys only
run in expensive motherboards.  So, the simple solution is to not run on SMP
systems where you run into this problem.  The cheapest way is to use
uniprocessor systems with a reasonable memory architecture.

But, for those of us with SMP systems--I have one because it was cheap and
I've always wanted one--we just have to keep in mind what kind of demands
our applications put on the shared memory bus.  My dual PII/333 system with
a 66MHz, 64-bit shared bus does *not* like running two copies of LL testing
at any one time.  One LL and one factoring is just fine, though, since
factoring stays on the CPU and hardly touches main memory.
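
For the record, the peak numbers on that box work out like this (straight
arithmetic; sustained throughput is well below peak once you count
contention and bus turnaround):

    # Peak bandwidth of a 66MHz, 64-bit shared front-side bus.
    bus_hz = 66 * 10**6
    bus_width_bytes = 8                          # 64 bits
    peak = bus_hz * bus_width_bytes
    print(peak / 10**6, "MB/s peak on the shared bus")       # 528.0
    print(peak / 2 / 10**6, "MB/s per CPU if split evenly")  # 264.0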

That's just something that I need to live with.  As numbers get bigger, that
dependency on main memory BW will get worse and worse as the cache becomes
less and less effective.  At some point, I'll probably retire the machine to
some task that's less demanding of memory BW--like building kernels or (may
I die before this happens) RC5 cracking.

Cheers,
David
