Re: Mersenne: SMT
George Woltman wrote:

> This is a prime95 problem, not a SMT problem. Prime95 is designed to
> run efficiently in 128KB of L2 cache.

George-

Are there any gains to be had if you code it to fit a 256KB L2? If so,
maybe we should have 2 versions? :)

Also, how do you think the new .13 micron Tualatin CPUs will do for
Prime95? I read something about a new pre-fetch mechanism (?) but I
didn't quite grasp what it meant... The newer PIIIs (1.13 & 1.26GHz,
IIRC...) have an option for 512KB...

Thanks!

Xyzzy
[81/117.943/94/9.801/801.750] http://www.teamprimerib.com/
_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: SMT
At 09:35 PM 11/4/2001 +, [EMAIL PROTECTED] wrote:

>I'm not sure I fully understand the way in which a SMT processor
>would utilise cache.

This is a prime95 problem, not a SMT problem. Prime95 is designed to
run efficiently in 128KB of L2 cache. If I split the current FFT into 2
threads, then either each thread is going to want 128KB of L2 cache
space (more cache contention) or I must recode the "passes" of prime95
to run efficiently in just 64KB of cache (this might be a bad idea, as
using less L2 cache may require adding another "pass" over the FFT
data - at least for some FFT sizes).

That is why I fear that SMT may not be helpful to prime95. Another way
of saying this is that prime95 may be more constrained by L2 cache size
than it is constrained by micro-op scheduling.

However, I had not considered the idea of factoring and LL testing.
Even better, if we make the factoring code use mostly integer
instructions then the LL test thread would keep the FPU units busy and
the factoring thread would keep the integer units busy! Then the only
question becomes "Does the user want to slow down his LL tests in order
to increase his total (LL + factoring) throughput?"

-- George
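The point about a smaller cache budget forcing an extra "pass" can be
illustrated with a toy model. This is only a sketch under loud
assumptions, not prime95's actual data layout: it assumes a
power-of-two FFT length, 8-byte values, that the whole cache budget is
usable for FFT data, and that one pass can complete log2(values in
cache) butterfly stages. The function name passes_needed is
hypothetical.

```python
def passes_needed(fft_len, cache_bytes, bytes_per_value=8):
    """Toy estimate of passes over the FFT data for a given cache budget.

    Assumptions (not taken from prime95): fft_len and the number of
    values that fit in cache are powers of two, and each pass performs
    as many butterfly stages as fit in one cache-sized block.
    """
    values_in_cache = cache_bytes // bytes_per_value
    stages_total = fft_len.bit_length() - 1          # log2(fft_len)
    stages_per_pass = values_in_cache.bit_length() - 1  # log2(values_in_cache)
    # Ceiling division without floating point.
    return -(-stages_total // stages_per_pass)


if __name__ == "__main__":
    for log_len in (14, 20, 27):
        n = 2 ** log_len
        print(log_len,
              passes_needed(n, 128 * 1024),
              passes_needed(n, 64 * 1024))
```

In this model some lengths need the same number of passes at 64KB as
at 128KB, while others need one more, which is consistent with the "at
least for some FFT sizes" caveat above.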
RE: Mersenne: SMT
I kind of like the practice that many dual processor folks seem to have
adopted (and one which I'll be switching my group of computers to)...

Namely, on dual CPU systems, have one Prime process doing LL tests, and
have the other one doing trial factoring.

Even on Compaq servers that have GREAT cache/memory management, running
two LL tests, one on each CPU, will slow down both processes. Running
one LL and one factor reduces the hit on the memory subsystem since the
factoring can generally remain in the CPU cache of its respective
processor, leaving the LL process to better use the memory for itself.

So perhaps this same approach could be adopted for SMT?

And just a reminder... trial factoring is still a great use of slower
machines... I have an AMD K6-III 400 that can trial factor the current
17M exponents in just about 2 days. Yeah, P4's can do it a lot faster,
or 1.2GHz Athlons, but I'd rather have those machines concentrate on
the LL tests.

Aaron

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:mersenne-invalid-
> [EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
> Sent: Sunday, November 04, 2001 1:36 PM
> To: Kel Utendorf
> Cc: [EMAIL PROTECTED]
> Subject: Re: Mersenne: SMT
>
> On 3 Nov 2001, at 21:40, Kel Utendorf wrote:
>
> > At 21:01 11/03/2001 -0500, George Woltman wrote:
> > >Can prime95 take advantage of SMT? I'm skeptical. If the FFT is
> > >broken up to run in two threads, I'm afraid L2 cache pollution will
> > >negate any advantage of SMT. Of course, I'm just guessing - to test
> > >this theory out we should compare our throughput running 1 vs. 2
> > >copies of prime95 on an SMT machine.
>
> I'm not sure I fully understand the way in which a SMT processor
> would utilise cache. But I can't see how the problem could be
> worse than running two copies of a program on a SMP system.
> This seems to work fairly well in both Windows and Linux regimes
> (attaching a thread to a processor and therefore its associated
> cache, rigidly in the case of Windows, loosely but intelligently in
> the case of Linux).
>
> If an SMT processor has a unified cache, cache pollution should
> surely be not too much of a problem? Running one copy & thereby
> getting benefit of the full cache size would run that one copy faster,
> (just as happens with SMP systems where memory bandwidth can
> be crucial) but the total throughput with two copies running would
> surely be greater. Especially on a busy system, where two threads
> get twice as many timeslices as one!
>
> If there is some way in which the FFT could be broken down into
> roughly equal-sized chunks, it _might_ be worth synchronizing two
> streams so that e.g. transform in on one thread was always in
> parallel with transform out on the other, and vice versa. Obviously
> you'd need to be running on two different exponents but using the
> same FFT length to gain from this technique. Whether this would
> be any better than running unsynchronized would probably require
> experimentation.
>
> > Could things be set up so that factoring and LL-testing went on
> > "simultaneously?" This would speed up the overall amount of work
> > being done.
>
> Because trial factoring, or P-1/ECM on _small_ exponents, places a
> very low load on the memory bus, running a LL test and factoring in
> parallel on a dual-processor SMP system makes a lot of sense. I
> suspect the same situation would apply in an SMT environment.
>
> The "problem" of mass deployment (almost everyone in this
> position, instead of only a few of us) is that there is a great deal of
> LL testing effort required in comparison to trial factoring, so running
> two LL tests in parallel but inefficiently would bring us to
> "milestones" faster than the efficient LL/trial factoring split.
>
> Regards
> Brian Beesley
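To make concrete why trial factoring is light on memory and suits
integer hardware, here is a minimal sketch. It is not prime95's
factoring code. It relies on two standard facts: for an odd prime p,
any factor q of 2^p - 1 has the form 2kp + 1, and q is congruent to 1
or 7 mod 8. The function name trial_factor and the max_k limit are
hypothetical.

```python
def trial_factor(p, max_k=100_000):
    """Return the smallest factor q = 2*k*p + 1 of 2**p - 1 with
    k <= max_k, or None if no such factor is found.

    Assumes p is an odd prime. Uses only small-integer arithmetic, so
    the working set is a handful of machine words.
    """
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        # Factors of a Mersenne number must be 1 or 7 mod 8.
        if q % 8 not in (1, 7):
            continue
        # q divides 2**p - 1 exactly when 2**p is 1 mod q.
        if pow(2, p, q) == 1:
            return q
    return None


if __name__ == "__main__":
    print(trial_factor(11))             # 2**11 - 1 = 2047 = 23 * 89
    print(trial_factor(13, max_k=10))   # no factor among the first 10 k
```

Each candidate is tested independently of the others, which is why the
work stays in cache and barely touches the memory bus.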
Re: Mersenne: SMT
On 3 Nov 2001, at 21:40, Kel Utendorf wrote:

> At 21:01 11/03/2001 -0500, George Woltman wrote:
> >Can prime95 take advantage of SMT? I'm skeptical. If the FFT is
> >broken up to run in two threads, I'm afraid L2 cache pollution will
> >negate any advantage of SMT. Of course, I'm just guessing - to test
> >this theory out we should compare our throughput running 1 vs. 2
> >copies of prime95 on an SMT machine.

I'm not sure I fully understand the way in which a SMT processor
would utilise cache. But I can't see how the problem could be
worse than running two copies of a program on a SMP system.
This seems to work fairly well in both Windows and Linux regimes
(attaching a thread to a processor and therefore its associated
cache, rigidly in the case of Windows, loosely but intelligently in
the case of Linux).

If an SMT processor has a unified cache, cache pollution should
surely be not too much of a problem? Running one copy & thereby
getting benefit of the full cache size would run that one copy faster,
(just as happens with SMP systems where memory bandwidth can
be crucial) but the total throughput with two copies running would
surely be greater. Especially on a busy system, where two threads
get twice as many timeslices as one!

If there is some way in which the FFT could be broken down into
roughly equal-sized chunks, it _might_ be worth synchronizing two
streams so that e.g. transform in on one thread was always in
parallel with transform out on the other, and vice versa. Obviously
you'd need to be running on two different exponents but using the
same FFT length to gain from this technique. Whether this would
be any better than running unsynchronized would probably require
experimentation.

> Could things be set up so that factoring and LL-testing went on
> "simultaneously?" This would speed up the overall amount of work
> being done.

Because trial factoring, or P-1/ECM on _small_ exponents, places a
very low load on the memory bus, running a LL test and factoring in
parallel on a dual-processor SMP system makes a lot of sense. I
suspect the same situation would apply in an SMT environment.

The "problem" of mass deployment (almost everyone in this
position, instead of only a few of us) is that there is a great deal of
LL testing effort required in comparison to trial factoring, so running
two LL tests in parallel but inefficiently would bring us to
"milestones" faster than the efficient LL/trial factoring split.

Regards
Brian Beesley
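For readers comparing the two kinds of work, the LL (Lucas-Lehmer) test
itself is short to state: for an odd prime p, 2^p - 1 is prime exactly
when s(p-2) is 0 mod 2^p - 1, where s(0) = 4 and s(i+1) = s(i)^2 - 2.
The sketch below uses Python's built-in big integers for the squaring.
That is an assumption made for brevity; prime95 does the squaring with
FFTs, and it is that large FFT data set which loads the cache and
memory bus.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for a prime exponent p."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1
    s = 4
    # p - 2 squarings, each on a number about p bits long.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    print([p for p in (3, 5, 7, 11, 13, 17, 19, 23) if lucas_lehmer(p)])
```

Every iteration touches the whole p-bit residue, in contrast to trial
factoring, where each candidate fits in a few machine words.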
Re: Mersenne: SMT
On Sun, Nov 4, 2001 7:59 AM, Dieter Schmitt <[EMAIL PROTECTED]> wrote:

>Gareth Randell wrote:
>
>> Some SMT news that I know of:
>> Alpha EV8 will have SMT with 4 simultaneous execution paths.
>> Alpha recently got canned by Compaq, so the above may never happen.
>
>... and Intel bought Alpha from Compaq recently.

More precisely, Intel bought non-exclusive rights to all the Alpha
technology, as well as the right to hire a lot of the Compaq Alpha
people. The EV8 designers are going/have gone to Intel. Compaq is still
developing the Alpha EV7 family. In a few years, the EV7 people may
migrate to Intel also.

At this point, Compaq still owns all the Alpha intellectual property,
and could continue EV8 development if they wanted to (though they would
have to start over with new people). Intel could complete the Alpha EV8
and ship it as an Intel-branded chip if they wanted to. Or (far more
likely) they will incorporate what features they can in their current
CPU families, mainly Itanium.

Many of Compaq's compiler people will be migrating to Intel also.

---
Robert Deininger
[EMAIL PROTECTED]
Re: Mersenne: SMT
Gareth Randell wrote:

> Some SMT news that I know of:
> Alpha EV8 will have SMT with 4 simultaneous execution paths.
> Alpha recently got canned by compaq, so the above may never happen.

... and Intel bought Alpha from Compaq recently.

Yours,
Dieter Schmitt
Re: Mersenne: SMT
George Woltman wrote:

> SMT for those that don't know makes one P4 CPU look like 2 CPUs
> to the operating system. Each "virtual CPU" has its own set of registers
> and each runs a different program (actually a different "thread"). The real
> CPU can now execute instructions from either virtual CPU.

SMT on Intel? I didn't know about that.

If SMT is implemented like the planned Alpha EV8 implementation, then
it will be up to the OS to schedule multiple tasks for the processor.
Consequently, unless the OS had special interfaces to allow one program
to consume several SMT slots, the program would either be restricted to
running as normal, or have to try running as several processes, or
would have to replicate the necessary OS kernel functionality itself
(difficult, and not portable).

I think the odds are that prime95 / mprime would not be able to gain
much unless either the OS makes special arrangements for single
compute-intensive programs, which seems unlikely since SMT is intended
for CPUs running multiple processes, or the OS is open source and can
be patched at kernel level, which excludes Windows.

Some SMT news that I know of:
Alpha EV8 will have SMT with 4 simultaneous execution paths.
Alpha recently got canned by Compaq, so the above may never happen.

Yours,
=== Gareth Randall ===
Re: Mersenne: SMT
At 21:01 11/03/2001 -0500, George Woltman wrote:

>Can prime95 take advantage of SMT? I'm skeptical. If the FFT is broken
>up to run in two threads, I'm afraid L2 cache pollution will negate any
>advantage of SMT. Of course, I'm just guessing - to test this theory out we
>should compare our throughput running 1 vs. 2 copies of prime95 on an
>SMT machine.

Could things be set up so that factoring and LL-testing went on
"simultaneously?" This would speed up the overall amount of work being
done.

Kel
Re: Mersenne: SMT
Hi,

At 12:19 PM 11/2/2001 -0800, Stephan T. Lavavej wrote:

>Will Prime95 be optimized to take advantage of
>simultaneous multithreading processors? Perhaps
>some part of the FFT computation can be done
>with multiple threads, so a SMT processor could
>devote more power to one while the other is
>waiting on memory or something.

A good theoretical question! The details on Intel's SMT implementation
are not out yet, but the information we have now suggests that SMT
could be a big winner for modern CPUs. SMT will be implemented in some
versions of the P4 soon.

SMT for those that don't know makes one P4 CPU look like 2 CPUs to the
operating system. Each "virtual CPU" has its own set of registers and
each runs a different program (actually a different "thread"). The real
CPU can now execute instructions from either virtual CPU.

Why is this good? Well, the P4 CPU is often stalled waiting for
instruction dependencies or memory accesses or whatever. With SMT the
CPU now has more instructions to choose from in scheduling to keep the
functional units busy. Better yet, it is guaranteed that there are no
dependencies between instructions from different virtual CPUs. Intel
states they are seeing up to 30% improvements in CPU throughput.

Can prime95 take advantage of SMT? I'm skeptical. If the FFT is broken
up to run in two threads, I'm afraid L2 cache pollution will negate any
advantage of SMT. Of course, I'm just guessing - to test this theory
out we should compare our throughput running 1 vs. 2 copies of prime95
on an SMT machine.

-- George
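The proposed experiment (throughput with 1 copy versus 2 copies) can be
sketched as a small harness. This is a hypothetical stand-in, not a
prime95 benchmark: it launches separate Python interpreter processes
that each run a fixed amount of LL-style squaring, and reports jobs
completed per second. The names WORKLOAD and throughput are made up for
illustration, and the workload is far too small to show L2 effects by
itself; a real test would run actual prime95 instances.

```python
import subprocess
import sys
import time

# Hypothetical stand-in workload: a fixed number of modular squarings.
WORKLOAD = (
    "p = 9689\n"
    "m = (1 << p) - 1\n"
    "s = 4\n"
    "for _ in range(p - 2):\n"
    "    s = (s * s - 2) % m\n"
)


def throughput(copies, workload=WORKLOAD):
    """Run `copies` independent processes concurrently and return the
    number of completed jobs per second of wall-clock time."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", workload])
        for _ in range(copies)
    ]
    for proc in procs:
        proc.wait()
    elapsed = time.perf_counter() - start
    return copies / elapsed
```

Comparing throughput(1) against throughput(2) on the machine in
question gives the ratio of interest: a value for two copies well below
twice the single-copy figure would suggest the copies are contending
for a shared resource such as cache. Process start-up time is included
in the measurement, so the workload should be made long enough for that
overhead to be negligible.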
Mersenne: SMT
I was wondering, will Prime95 be optimized to take advantage of
simultaneous multithreading processors? Perhaps some part of the FFT
computation can be done with multiple threads, so a SMT processor could
devote more power to one while the other is waiting on memory or
something. I don't know the details of FFTs; if they're a completely
serial process I may sound really silly. SMT will provide a performance
boost to Prime95 anyway, as other system activity will impact idle
threads less, of course.

-- Stephan T. Lavavej