Mersenne Digest        Thursday, October 7 1999        Volume 01 : Number 638

----------------------------------------------------------------------

Date: Tue, 5 Oct 1999 20:34:55 +0100
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: New beta mers release: new Lucas-Lehmer program

On 5 Oct 99, at 18:19, Guillermo Ballester Valor wrote:

> Well, I wonder what the reason is for such different performance. On Intel
> machines MacLucasFFTW runs more than twice as fast as MacLucas, and on
> RISC processors MacLucasUNIX is better than MacLucasFFTW. Looking at the
> code, without deep understanding, one can see:
>
> i) MacLucasUNIX makes intensive use of the 'register' keyword in local
> definitions, so a processor with many internal registers can allocate
> most of them. That is a good thing because they can be accessed very fast.
> The bad thing is that on processors with very few registers (like
> Intel's) it can slow things down.

And (according to a local computer scientist, who I think knows what
he's talking about) with modern processors making extensive use of
register renaming, it's not usually sensible to use the "register"
keyword _at all_. The theory is that the instruction scheduler can do
at least as good a job as the programmer - it gets more choice,
anyway, e.g. there are 40 32-bit general-purpose registers in the
Intel PPro, but only a few have "names" at any given time.

> ii) On the other hand, FFTW does not use 'register' at all. All local
> variables are stored on the stack. I don't know much about compilers, but
> perhaps some good compilers can use register storage as a speed
> optimization. Looking at the code generated by GNU gcc on Intel
> processors, some local double variables are stored on the Intel FPU and
> the performance is very good.

Storing data on a stack is not very efficient in most RISC
architectures - you tend to cause problems with cache alignment,
overloading cache lines causing high miss rates, etc.
The small caches on the Alpha 21164 design possibly contribute to this -
the L1 data cache is only 8K bytes & the L2 cache is only 96K bytes
(but there can be an L3 cache which is at least 2M bytes, if fitted).
The Intel FPU is a special case!

> My question is: what happens in the FFTW code if we directly add
> 'register' keywords to its local temporary variable definitions?
> Can this sort of thing be done with a single compiler option?
>
> I did it. I've included register management in all FFTW radix routines
> up to radix-16 (which need no more than 32 stack variables). For Intel
> machines the code is untouched (because I previously defined REG as an
> empty comment). But I don't own a RISC machine, so I have no
> idea about its performance. Any volunteers?

Sure, I'll give it a go. Just mail me the source ... I've access to a
Sun Ultra 10 as well (running Solaris, but with the gcc compiler, not
Sun's own).

> Any improvement on MacLucasXXXX is desired.

Any improvement on _anything_ is desirable !!!

Actually MacLucasUNIX on my Alpha system isn't bad, compiled from pure
C source with gcc it gives Prime95/mprime running on a PII-333 a good
run for its money (a bit faster, or a bit slower, depending on the
exponent). Given the brilliant optimization George has done for the
Intel CPU, I think this is quite good. I'm pretty sure I could at
least double the speed of MacLucasUNIX on the Alpha, by replacing
critical chunks of code with hand-tuned assembler, but the investment
in terms of time & effort is too much for me 8-(

> I think we can squeeze FFTW a lot more. I like its code very much. The
> good performance on Intel (45% with respect to mprime) is good enough to
> work a little more on it.

I agree - in particular, there's an obvious gain in being able to do
FFT with run lengths other than powers of 2, once you have the speed
in the same ballpark.
Nevertheless, I think FFTW will be hard pushed to match mprime on
32-bit Intel architecture systems.

There is an obvious need for something reasonably efficient and
portable, if only to be able to take advantage of new processor
designs (like Merced, and to a lesser extent Athlon) without having
to expend very large amounts of effort in hand-optimization.

Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 5 Oct 1999 21:57:21 +0200
From: "Steinar H. Gunderson" <[EMAIL PROTECTED]>
Subject: Mersenne: Re: New beta mers release: new Lucas-Lehmer program

On Tue, Oct 05, 1999 at 06:19:20PM +0200, Guillermo Ballester Valor wrote:
>ii) On the other hand, FFTW does not use 'register' at all. All local
>variables are stored on the stack. I don't know much about compilers, but
>perhaps some good compilers can use register storage as a speed
>optimization.

gcc should be able to, at least newer versions (post-egcs phase).

>Looking at the code generated by GNU gcc on Intel
>processors, some local double variables are stored on the Intel FPU and
>the performance is very good.

Try Pentium GCC once (http://www.goof.com/pcg/). It has some MMX support
built-in. If you run out of integer registers (and don't do float), it
will, interestingly enough, store data in MMX registers. On the other
hand, I think MacLucasUNIX will use a _lot_ of floats. It will even
compile on non-Intel machines (I think...), but I've got _no_ clue at all
(I think nobody has...) about the performance. Volunteers?

>My question is: what happens in the FFTW code if we directly add
>'register' keywords to its local temporary variable definitions?
>Can this sort of thing be done with a single compiler option?

-Dregister= will do the trick and remove _all_ register keywords. They're
a bad thing in general.
The compiler should be able to decide for itself which variables are to
be put in registers. const, on the other hand, should be used as much as
possible (think `const struct foo * const * const bar' here...). But I
guess they wouldn't have overlooked such a textbook rule...

>> The other line of approach I have on improving MacLucasUNIX is to try
>> Digital's native C compiler - the linux beta is currently available
>> FOC, but unfortunately I will have to upgrade linux to run it as it
>> requires 5.2 or later.

Linux 5.2? Surely, you must be referring to RedHat 5.2? Upgrading your
libc from scratch isn't all that hard, if you can do with a bit of
tweaking.

/* Steinar */
-- 
Homepage: http://members.xoom.com/sneeze/

------------------------------

Date: Wed, 6 Oct 1999 17:28:56 -0700
From: Will Edgington <[EMAIL PROTECTED]>
Subject: Re: Mersenne: New beta mers release: new Lucas-Lehmer program

Simon Burge writes:

   Unless you're doing a timed run, maybe "kill -STOP pid" and "kill
   -CONT pid" on the ecm3 run might give more accurate results - I
   hate to think of what's happening to the cache...  I use this on
   machines that have mersenne1 running when users notice X load
   showing a constant load average of 1.0.

I tried this just before uploading the new beta with tunefftw.c in it
and MacLucasFFTW's speed improved by less than 3% over the run with
the tuning done while ecm3 was running.  So either the cache is not
the bottleneck or Linux's context switching with 128 MB RAM is quite
good, as a guess.

   My early tests on a 200MHz UltraSparc are not that encouraging.
   [...times deleted...]

Sigh.:(  Though it does look like MacLucasFFTW is faster when the FFT
length is enough lower ... by, hm, about 10% ?
But MacLucasFFTW2 using two CPUs isn't as fast as two MacLucasUNIX's
running at the same time, is it?  I would guess not, from those numbers.

   The -C means don't checkpoint ever and -S N means print a speed
   update every N iterations.  MacLucasFFTW2 is hard coded to use 2
   threads.  The case for 4609273 is interesting, with nearly
   identical FFT lengths...

Yeah.:(

   I'm assuming that you're seeing such a speed-up on Intel because
   of the lack of registers that MacLucasUNIX likes, and FFTW is
   doing a better job under these conditions.

Looks that way, or something similar.

   Will - I'll send you a diff that I used for the timing stuff.

Yes, please do when you have time; it'll save me from having to
reimplement it.

                                                        Will

------------------------------

Date: Wed, 06 Oct 1999 20:07:40 -0500
From: Ken Kriesel <[EMAIL PROTECTED]>
Subject: Mersenne: If quitting big exponents (was Re: Decamega Tests)

At 11:42 AM 1999/09/23 +0200, Philippe Trottier <[EMAIL PROTECTED]> wrote:
>HI!
>
>Anyone thought of sending these P and Q once a month to the server.. in the
>case where someone would abandon a quest, it could be continued by someone
>else ...

That capability doesn't exist in the Internet Primenet Server at this time.
Anyone contemplating quitting an exponent after more than a PII-400-month,
please save the files and contact me.  We'll try to make use of them in QA.
Ken

Ken Kriesel, PE <[EMAIL PROTECTED]>

------------------------------

Date: Wed, 6 Oct 1999 23:08:53 EDT
From: [EMAIL PROTECTED]
Subject: Mersenne: Mlucas and MacLucasUNIX on Alpha

Dear all:

Here is the first installment of my head-to-head performance comparison
for Mlucas and MacLucasUNIX, on all three major generations of the Alpha
architecture. I'll send similar for MIPS and SPARC in the next few days.

If the alignments look weird, try using a true-type font.

-Ernst

--------------------

GIMPS SOURCE CODE PERFORMANCE CHART

The first (leftmost) column gives the number of 64-bit floats in the
array being transformed. The second column gives per-iteration timings
of Prime95 v19 running on a 400MHz Pentium II, as reported by George
Woltman (www.mersenne.org/status.htm). The remaining columns list
timings of other fast LL testing codes running on various platforms.
In parentheses to the right of each time in seconds, I also list the
relative performance index (RPI, in %) with respect to Prime95 v19
running on a 400MHz Pentium II.

The RPI is the ratio of the speed of code X running on hardware Y, to
the speed of Prime95 running on a 400MHz PII, for an exponent p, and
adjusted for any difference in the clock rate of the two CPUs being
compared. (For codes that allow a similar variety of FFT lengths and
have similar accuracy, we can use FFT length in place of exponent.)
Since speed is inversely related to per-iteration time, the RPI is
defined as

           (time for Prime95 to test M(p) on 400MHz PII) * 400 MHz
RPI (p) = ----------------------------------------------------------------- x 100%
           (time for code X to test M(p) on CPU Y) * (clock rate of CPU Y)

Example 1: On 400MHz Alpha 21164, Mlucas 2.7y at 384K takes .316 sec.
Prime95 needs .211 sec on a 400MHz PII for the same FFT length.
Since the two CPU clock rates are the same, Mlucas has an RPI of
(.211/.316)x100% or about 67%, meaning that on the 21164 at that
runlength, it performs about two-thirds as well as Prime95 on PII.

Example 2: For Mlucas at the same FFT length on the 250 MHz R10K, we
have a per-iteration time of .292 sec, similar to the 400MHz 21164,
but the clock rate is lower, hence

        .211 * 400
RPI  =  ---------- x 100% = 116%,
        .292 * 250

meaning that on the MIPS, Mlucas performs somewhat better than Prime95
running on a PII with the same clock speed.

Also note that for the above examples, the accuracy of the code is
similar enough to Prime95 (pmax about 1-2% less at a given FFT length)
to allow us to just consider FFT length - if code X is significantly
less accurate or allows just power-of-2 FFT lengths, we may have to
compare different FFT lengths in the above formula (e.g. for codes like
MacLucasUNIX, which allows only powers of 2 and jumps to the next
power-of-2 runlength much earlier than Prime95 and Mlucas.)

Abbreviations:
MLU625 = MacLucasUNIX v6.25
MLF    = MacLucasFFTW
P95    = Prime95 v19
n/a    = Length not available; must use next-higher power of 2.

Program, platform, cache sizes / per-iteration time in seconds

        P95    Mlucas2.7y   MLU625       Mlucas2.7y   MLU625       Mlucas2.7y   MLU625
        PII    Alpha        Alpha        Alpha        Alpha        Alpha        Alpha
        400    21064/200    21064/200    21164/400    21164/400    21264/500    21264/500
   L1:  8KB    ?            ?            8KB          8KB          64KB         64KB
   L2:  512KB  ?            ?            512KB        512KB        4MB          4MB
length  -----  -----------  -----------  -----------  -----------  -----------  -----------
   96K  .045   .127 (70%)   n/a          .057 (79%)   n/a          .025 (138%)  n/a
  112K  .055   .155 (71%)   n/a          .070 (80%)   n/a          .031 (142%)  n/a
  128K  .060   .172 (70%)   .312 (38%)   .077 (78%)   .098 (61%)   .034 (141%)  .036 (133%)
  160K  .083   .223 (73%)   n/a          .099 (84%)   n/a          .044 (148%)  n/a
  192K  .098   .272 (72%)   n/a          .120 (82%)   n/a          .052 (148%)  n/a
  224K  .119   .345 (68%)   n/a          .146 (77%)   n/a          .064 (149%)  n/a
  256K  .132   .370 (64%)   .679 (39%)   .161 (82%)   .220 (60%)   .069 (153%)  .078 (126%)
  320K  .173   .544 (63%)   n/a          .251 (69%)   n/a          .090 (150%)  n/a
  384K  .211   .695 (60%)   n/a          .316 (67%)   n/a          .107 (153%)  n/a
  448K  .252   .880 (57%)   n/a          .417 (60%)   n/a          .132 (153%)  n/a
  512K  .281   1.03 (55%)   1.45 (39%)   .472 (60%)   .459 (61%)   .146 (152%)  .178 (126%)
  640K  .372   1.32 (56%)   n/a          .648 (57%)   n/a          .207 (138%)  n/a
  768K  .453   1.60 (57%)   n/a          .782 (58%)   n/a          .257 (133%)  n/a
  896K  .536   1.93 (56%)   n/a          .920 (58%)   n/a          .326 (128%)  n/a
 1024K  .600   2.14 (56%)   3.01 (40%)   .990 (61%)   1.07 (56%)   .363 (127%)  .461 (104%)
 1280K  .776   3.00 (52%)   n/a          1.35 (57%)   n/a          .480 (122%)  n/a
 1536K  .934   3.66 (51%)   n/a          1.82 (51%)   n/a          .656 (108%)  n/a
 1792K  1.11   4.46 (50%)   n/a          2.15 (52%)   n/a          .789 (110%)  n/a
 2048K  1.23   4.85 (51%)   6.50 (38%)   2.36 (52%)   2.84 (43%)   .880 (111%)  1.36 (72%)
 2560K  1.64   6.42 (51%)   n/a          3.20 (51%)   n/a          1.23 (110%)  n/a
 3072K  1.99   7.73 (51%)   n/a          3.89 (51%)   n/a          1.48 (103%)  n/a
 3584K  2.38   9.23 (52%)   n/a          4.57 (52%)   n/a          1.79 (105%)  n/a
 4096K  2.60   10.2 (51%)   14.0 (37%)   5.02 (52%)   7.42 (35%)   2.01 (103%)  3.70 (56%)

TIMINGS SUMMARY: The only place MacLucasUNIX outperforms Mlucas is on
the 21164 at FFT length 512K, where MLU seems to benefit from a
fortuitous cache alignment - at lengths greater than this, things
deteriorate rapidly.

ACCURACY SUMMARY: here are the FFT length/exponent breakpoints for the
three fastest codes. Prime95 is best, since it is able to take advantage
of the x86 80-bit floating-point register format.
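
As an aside on the timings chart, the RPI figures can be recomputed
from the raw per-iteration times. The following is only a minimal
Python sketch of the RPI definition given earlier; the function name
rpi and its parameter names are made up for illustration and do not
come from any of the GIMPS codes. It reproduces the two worked
examples (about 67% and 116%).

```python
def rpi(t_prime95, t_code, clock_mhz, ref_clock_mhz=400.0):
    """Relative performance index in percent.

    t_prime95     -- per-iteration time of Prime95 on the reference PII (seconds)
    t_code        -- per-iteration time of the code under test (seconds)
    clock_mhz     -- clock rate of the CPU running the code under test
    ref_clock_mhz -- clock rate of the reference PII (400 MHz in the chart)
    """
    return 100.0 * (t_prime95 * ref_clock_mhz) / (t_code * clock_mhz)


# Example 1: Mlucas 2.7y at 384K on a 400 MHz Alpha 21164
example_1 = rpi(0.211, 0.316, 400)

# Example 2: Mlucas at 384K on a 250 MHz R10K
example_2 = rpi(0.211, 0.292, 250)

print(f"{example_1:.0f}% {example_2:.0f}%")  # prints "67% 116%"
```

Because the chart lists times rounded to two or three digits,
recomputed values for other rows can differ from the printed
percentages by a point or more.
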
Mlucas is close behind, with a pmax just 1-2% lower than Prime95 at each
runlength. MacLucasUNIX is the worst of the lot, even when compiled
using (on Alpha Unix) -assume accuracy_sensitive to prevent the compiler
from overaggressive reordering of floating-point operations (note that
if you're compiling using -fast you MUST also use the above -assume
flag, otherwise Mlucas won't run and MacLucasUNIX won't be able to do
round-off checking, i.e. has no way of telling whether the FFT length it
is using is appropriate for the number under test.)

           Maximum exponent (millions)
         Prime95   Mlucas2.7   MLU625
length   -------   ---------   ------
   96K   1.990     1.983       n/a
  112K   2.323     2.310       n/a
  128K   2.656     2.610       ~2.38
  160K   3.290     3.260       n/a
  192K   3.935     3.910       n/a
  224K   4.598     4.550       n/a
  256K   5.250     5.160       ~4.98
  320K   6.515     6.420       n/a
  384K   7.730     7.700       n/a
  448K   9.020     8.950       n/a
  512K   10.32     10.20       ~9.3
  640K   12.83     12.70       n/a
  768K   15.27     15.20       n/a
  896K   17.85     17.60       n/a
 1024K   20.40     20.10       ~18.8
 1280K   25.33     25.00       n/a
 1536K   30.10     29.80       n/a
 1792K   35.10     34.70       n/a
 2048K   40.25     39.40       ~37
 2560K   49.90     49.10       n/a
 3072K   59.40     59.10       n/a
 3584K   69.00     68.50       n/a
 4096K   79.30     78.00       < 75

MEMORY USAGE: Prime95 and Mlucas need little storage besides the LL
residue itself, i.e. (runlength x 8 bytes) + perhaps 10% extra for FFT
sincos and DWT weights tables and bit-reversal index arrays.
MacLucasUNIX, on the other hand, is a memory hog - at 4096K it needs a
whopping 244MB, compared to just 33MB for Prime95 and Mlucas.

------------------------------

Date: Thu, 7 Oct 1999 14:39:22 -0500
From: "Griffith, Shaun" <[EMAIL PROTECTED]>
Subject: Mersenne: SETI & DNRC

Glommed this from the Dilbert Newsletter...Maybe someday GIMPS will have
a similar dubious mention?

<<<<<<<<<<<<<
Pranks On Induhviduals
----------------------

Here's the best DNRC prank ever.

Report:

A co-worker of mine has SETI@home running on his computer.  This is
software, distributed by SETI (Search for Extra-Terrestrial
Intelligence), that will run on PCs as a screen saver and analyze
chunks of data from a radio telescope looking for non-naturally
occurring signals from outer space.  The other day I copied the SETI
analysis screen to Microsoft Paint and then edited it to contain a
large alert message stating that ET signals had been discovered.  I
also drew in a button that he could use to "Notify SETI
Immediately."  I left this image on his screen with a "red alert"
sound running in the background.

When he returned to his desk he was ecstatic to see that he had
found ET life.  He called another co-worker over to witness the
historic moment.  Then he clicked the button and discovered what I'd
done.

He's now looking for an opportunity to slay me so this may be my
last message to you.
>>>>>>>>>>>>>>>>>>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shaun Griffith, Texas Instruments MSP Multimedia, [EMAIL PROTECTED]
work (972)480-2186, fax (972)480-3555, pager (972)598-6823
alpha pager: page spg1
Quantum Mechanics: The dreams stuff is made of

------------------------------

End of Mersenne Digest V1 #638
******************************