Re: Mersenne: SMT

2001-11-04 Thread Michael Vang

George Woltman wrote:

> This is a prime95 problem, not a SMT problem.  Prime95 is designed to
> run efficiently in 128KB of L2 cache.

George-

Are there any gains to be had if you code it to fit a 256KB L2? If so,
maybe we should have 2 versions?  :)

Also, how do you think the new .13 micron Tualatin CPUs will do for
Prime95? I read something about a new pre-fetch mechanism (?) but I
didn't quite grasp what it meant...

The newer PIIIs (1.13 & 1.26GHz, IIRC...) have an option for 512KB...

Thanks!


Xyzzy [81/117.943/94/9.801/801.750] http://www.teamprimerib.com/
_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Re: Mersenne: SMT

2001-11-04 Thread George Woltman

At 09:35 PM 11/4/2001 +, [EMAIL PROTECTED] wrote:
>I'm not sure I fully understand the way in which a SMT processor
>would utilise cache.

This is a prime95 problem, not a SMT problem.  Prime95 is designed to
run efficiently in 128KB of L2 cache.  If I split the current FFT into 2
threads, then either each thread is going to want 128KB of L2 cache space
(more cache contention) or I must recode the "passes" of prime95 to run
efficiently in just 64KB of cache (this might be a bad idea as using less
L2 cache may require adding another "pass" over the FFT data - at least
for some FFT sizes).  That is why I fear that SMT may not be helpful to
prime95.  Another way of saying this is that prime95 may be more
constrained by L2 cache sizes than it is constrained by micro-op
scheduling.
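
A toy model of the cache argument above, offered only as a sketch and not
as prime95's actual pass structure. It assumes a plain radix-2 FFT in which
one pass over the data can finish as many butterfly stages as log2 of the
number of elements that fit in a thread's share of L2. The function name,
the 8-byte element size, and the FFT lengths are assumptions chosen for
illustration.

```python
def passes_needed(fft_len, cache_bytes, bytes_per_element=8):
    """Estimate passes over the data for a radix-2 FFT (toy model).

    Assumes fft_len is a power of two and that one pass can complete
    floor(log2(elements that fit in cache)) butterfly stages.
    """
    elements_in_cache = cache_bytes // bytes_per_element
    stages_per_pass = elements_in_cache.bit_length() - 1   # floor(log2)
    total_stages = (fft_len - 1).bit_length()              # ceil(log2)
    return -(-total_stages // stages_per_pass)             # ceiling division


if __name__ == "__main__":
    # Hypothetical FFT lengths; compare a full 128KB share with a 64KB share.
    for log_len in (20, 24, 27):
        n = 1 << log_len
        print(log_len,
              passes_needed(n, 128 * 1024),
              passes_needed(n, 64 * 1024))
```

In this model, halving the cache share leaves the pass count unchanged for
some lengths and adds a pass for others, which is the "at least for some
FFT sizes" caveat.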

However, I had not considered the idea of factoring and LL testing.  Even
better, if we make the factoring code use mostly integer instructions then
the LL test thread would keep the FPU units busy and the factoring thread
would keep the integer units busy!  Then the only question becomes "Does
the user want to slow down his LL tests in order to increase his total
(LL + factoring) throughput?"
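
To show why trial factoring suits integer units, here is a minimal sketch
using only integer arithmetic. It relies on two standard facts: any factor
q of 2^p - 1 (p an odd prime) has the form 2kp + 1, and q is congruent to
1 or 7 mod 8. The function name and the max_q bound are hypothetical; this
is not prime95's factoring code.

```python
def trial_factor(p, max_q):
    """Return the smallest factor q <= max_q of 2**p - 1, or None.

    Assumes p is an odd prime. Candidates are q = 2*k*p + 1 with
    q % 8 in (1, 7); q divides 2**p - 1 exactly when 2**p % q == 1.
    """
    k = 1
    while True:
        q = 2 * k * p + 1
        if q > max_q:
            return None
        # pow with three arguments is integer modular exponentiation.
        if q % 8 in (1, 7) and pow(2, p, q) == 1:
            return q
        k += 1


if __name__ == "__main__":
    print(trial_factor(11, 100))    # 2**11 - 1 = 2047 = 23 * 89
    print(trial_factor(13, 100))    # 2**13 - 1 = 8191 is prime
```

Nothing here touches floating point, so in principle such a thread would
compete with an FFT-based LL thread mainly for integer units and cache, not
for the FPU.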

-- George




RE: Mersenne: SMT

2001-11-04 Thread Aaron Blosser

I kind of like the practice that many dual processor folks seem to have
adopted (and one which I'll be switching my group of computers to)...

Namely, on dual CPU systems, have one Prime process doing LL tests, and
have the other one doing trial factoring.  Even on Compaq servers that
have GREAT cache/memory management, running an LL test on each CPU will
slow down both processes.  Running one LL and one factor reduces the hit
on the memory subsystem since the factoring can generally remain in the
CPU cache of its respective processor, leaving the LL process to better
use the memory for itself.
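
For readers unfamiliar with the LL work type, a minimal sketch of the
Lucas-Lehmer test follows. It uses Python's built-in big integers for the
squaring, whereas prime95 does the squaring with FFT-based multiplication,
which is where the memory traffic comes from. The function name is an
assumption for illustration.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for an odd prime p (or p == 2).

    Iterates s -> s*s - 2 (mod 2**p - 1) starting from s = 4, p - 2 times;
    2**p - 1 is prime exactly when the final s is 0.
    """
    if p == 2:
        return True
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```

Each iteration squares a p-bit number, so for exponents in the millions the
working data is far larger than a trial-factoring candidate.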

So perhaps this same approach could be adopted for SMT?

And just a reminder... trial factoring is still a great use of slower
machines... I have an AMD K6-III 400 that can trial factor the current
17M exponents in just about 2 days.  Yeah, P4's can do it a lot faster,
or 1.2GHz Athlons, but I'd rather have those machines concentrate on the
LL tests.

Aaron




Re: Mersenne: SMT

2001-11-04 Thread bjb

On 3 Nov 2001, at 21:40, Kel Utendorf wrote:

> At 21:01 11/03/2001 -0500, George Woltman wrote:
>  >Can prime95 take advantage of SMT?  I'm skeptical.  If the FFT is
>  >broken up to run in two threads, I'm afraid L2 cache pollution will
>  >negate any advantage of SMT.  Of course, I'm just guessing - to test
>  >this theory out we should compare our throughput running 1 vs. 2
>  >copies of prime95 on an SMT machine.

I'm not sure I fully understand the way in which a SMT processor
would utilise cache. But I can't see how the problem could be
worse than running two copies of a program on a SMP system.
This seems to work fairly well in both Windows and Linux regimes
(attaching a thread to a processor and therefore its associated
cache, rigidly in the case of Windows, loosely but intelligently in
the case of Linux).
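
As an illustration of attaching work to one processor, here is a minimal
sketch that pins the current process to a single logical CPU. It assumes a
platform where Python provides os.sched_setaffinity (Linux does; Windows
and macOS do not, in which case the function just returns False). The
helper names are hypothetical.

```python
import os


def allowed_cpus():
    """Return the sorted list of logical CPUs this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def pin_to_cpu(cpu_index):
    """Try to restrict the calling process to one logical CPU.

    Returns True if the affinity was set, False if the platform does not
    support it or the request was refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {cpu_index})   # 0 means "this process"
    except (OSError, ValueError):
        return False
    return True


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("pinned:", pin_to_cpu(cpus[0]), "now allowed:", allowed_cpus())
```

Pinning keeps a process's working set in one processor's cache rather than
letting the scheduler migrate it.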

If an SMT processor has a unified cache, cache pollution should 
surely be not too much of a problem? Running one copy & thereby 
getting benefit of the full cache size would run that one copy faster, 
(just as happens with SMP systems where memory bandwidth can 
be crucial) but the total throughput with two copies running would 
surely be greater. Especially on a busy system, where two threads 
get twice as many timeslices as one!

If there is some way in which the FFT could be broken down into 
roughly equal sized chunks, it _might_ be worth synchronizing two 
streams so that e.g. transform in on one thread was always in 
parallel with transform out on the other, and vice versa. Obviously 
you'd need to be running on two different exponents but using the 
same FFT length to gain from this technique. Whether this would 
be any better than running unsynchronized would probably require 
experimentation.
>
> Could things be set up so that factoring and LL-testing went on
> "simultaneously?"  This would speed up the overall amount of work
> being done.

Because trial factoring, or P-1/ECM on _small_ exponents, has a
very low memory bus loading, running a LL test and factoring in
parallel on a dual-processor SMP system makes a lot of sense. I
suspect the same situation would apply in an SMT environment.

The "problem" of mass deployment (almost everyone in this 
position, instead of only a few of us) is that there is a great deal of 
LL testing effort required in comparison to trial factoring, so running 
two LL tests in parallel but inefficiently would bring us to 
"milestones" faster than the efficient LL/trial factoring split.


Regards
Brian Beesley



Re: Mersenne: SMT

2001-11-04 Thread Robert Deininger

On Sun, Nov 4, 2001 7:59 AM, Dieter Schmitt <[EMAIL PROTECTED]>
wrote:
>Gareth Randall wrote:
>
>> Some SMT news that I know of:
>> Alpha EV8 will have SMT with 4 simultaneous execution paths.
>> Alpha recently got canned by Compaq, so the above may never happen.
>
>... and Intel bought Alpha from Compaq recently.

More precisely, Intel bought non-exclusive rights to all the Alpha
technology, as well as the right to hire a lot of the Compaq Alpha people.
The EV8 designers are going/have gone to Intel.  Compaq is still developing
the Alpha EV7 family.  In a few years, the EV7 people may migrate to Intel
also.

At this point, Compaq still owns all the Alpha intellectual property, and
could continue EV8 development if they wanted to (though they would have to
start over with new people).  Intel could complete the Alpha EV8 and ship
it as an Intel-branded chip if they wanted to.  Or (far more likely) they
will incorporate what features they can in their current CPU families,
mainly Itanium.

Many of Compaq's compiler people will be migrating to Intel also.

---
Robert Deininger
[EMAIL PROTECTED]






Re: Mersenne: SMT

2001-11-04 Thread Dieter Schmitt

Gareth Randall wrote:

> Some SMT news that I know of:
> Alpha EV8 will have SMT with 4 simultaneous execution paths.
> Alpha recently got canned by Compaq, so the above may never happen.

... and Intel bought Alpha from Compaq recently.

Yours,

Dieter Schmitt




Re: Mersenne: SMT

2001-11-03 Thread Gareth Randall

George Woltman wrote:
> SMT for those that don't know makes one P4 CPU look like 2 CPUs
> to the operating system.  Each "virtual CPU" has its own set of registers
> and each runs a different program (actually a different "thread").  The real
> CPU can now execute instructions from either virtual CPU.


SMT on Intel? I didn't know about that.

If SMT is implemented like the planned Alpha EV8 implementation, then it will be up to 
the OS to schedule multiple tasks for the processor. Consequently unless the OS had 
special interfaces to allow one program to consume several SMT slots, the program 
would either be restricted to running as normal, or have to try running as several 
processes, or would have to replicate the necessary OS kernel functionality itself 
(difficult, and not portable).

I think the odds are that prime95 / mprime would not be able to gain much unless
either the OS makes special arrangements for single compute-intensive programs, which
seems unlikely since SMT is intended for CPUs running multiple processes, or the OS is
open source and can be patched at kernel level, which excludes Windows.

Some SMT news that I know of:
Alpha EV8 will have SMT with 4 simultaneous execution paths.
Alpha recently got canned by Compaq, so the above may never happen.

Yours,

=== Gareth Randall ===




Re: Mersenne: SMT

2001-11-03 Thread Kel Utendorf

At 21:01 11/03/2001 -0500, George Woltman wrote:
 >Can prime95 take advantage of SMT?  I'm skeptical.  If the FFT is broken
 >up to run in two threads, I'm afraid L2 cache pollution will negate any
 >advantage of SMT.  Of course, I'm just guessing - to test this theory out we
 >should compare our throughput running 1 vs. 2 copies of prime95 on an
 >SMT machine.

Could things be set up so that factoring and LL-testing went on
"simultaneously?"  This would speed up the overall amount of work being done.

Kel





Re: Mersenne: SMT

2001-11-03 Thread George Woltman

Hi,

At 12:19 PM 11/2/2001 -0800, Stephan T. Lavavej wrote:
>Will Prime95 be optimized to take advantage of
>simultaneous multithreading processors? Perhaps
>some part of the FFT computation can be done
>with multiple threads, so a SMT processor could
>devote more power to one while the other is
>waiting on memory or something.

A good theoretical question!

The details on Intel's SMT implementation are not out yet, but the
information we have now suggests that SMT could be a big winner for
modern CPUs.  SMT will be implemented in some versions of the P4 soon.

SMT, for those that don't know, makes one P4 CPU look like 2 CPUs
to the operating system.  Each "virtual CPU" has its own set of registers
and each runs a different program (actually a different "thread").  The real
CPU can now execute instructions from either virtual CPU.

Why is this good?  Well, the P4 CPU is often stalled waiting for
instruction dependencies or memory accesses or whatever.  With SMT the
CPU now has more instructions to choose from in scheduling to keep
the functional units busy.  Better yet, it is guaranteed that there are no
dependencies on instructions from different virtual CPUs. Intel states they
are seeing up to 30% improvements in CPU throughput.

Can prime95 take advantage of SMT?  I'm skeptical.  If the FFT is broken
up to run in two threads, I'm afraid L2 cache pollution will negate any
advantage of SMT.  Of course, I'm just guessing - to test this theory out we
should compare our throughput running 1 vs. 2 copies of prime95 on an
SMT machine.
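
The comparison proposed above comes down to simple arithmetic once the
per-iteration times are measured. A minimal sketch follows; the function
name and the sample timings are hypothetical, not measurements from any
real machine.

```python
def smt_throughput_gain(secs_per_iter_single, secs_per_iter_each_of_two):
    """Ratio of total throughput with 2 copies to throughput with 1 copy.

    secs_per_iter_single: per-iteration time with one copy running alone.
    secs_per_iter_each_of_two: per-iteration time of each copy when two
    copies run at once on the same SMT processor.
    A result above 1.0 means two copies get more total work done.
    """
    single = 1.0 / secs_per_iter_single
    dual = 2.0 / secs_per_iter_each_of_two
    return dual / single


if __name__ == "__main__":
    # Hypothetical timings: each copy slows from 0.100 s to 0.150 s.
    print(round(smt_throughput_gain(0.100, 0.150), 3))
```

Two copies come out ahead only if each runs at better than half its solo
speed; a 30% throughput improvement would correspond to a ratio of 1.3.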

-- George
