Re: Mersenne: M#40 - what went wrong?
Hi,

On Saturday 14 June 2003 12:13, Steinar H. Gunderson wrote:
> The biggest problem with SSE2 is of course that it's only supported on the Pentium 4 so far -- they are becoming increasingly common, but, for instance, no current AMD chip supports it.

Actually, the new AMD64 chips (the current Opteron and the future Athlon64) support SSE2, and they double the number of 128-bit XMM registers (8 on the Pentium 4, 16 on AMD64). George Woltman, with the P4 and SSE2, has made impressive improvements in Prime95. These weeks I'm adding SSE2 code to Glucas to see how it runs on Opterons, but the code is still in beta.

Guillermo.

--
Guillermo Ballester Valor
registered linux user #117181
[EMAIL PROTECTED]
Ogijares, Granada SPAIN
_
Unsubscribe list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: 22.8.1 has increased iteration time
Hi,

On Thursday 29 August 2002 15:30, you wrote:
> I have noticed a small but definite increase in the iteration time of version 22.8.1 as opposed to 21.4. During the night, when my 2.2GHz Pentium IV system was free of all other processing activities, the iteration times were as follows:
>
> 21.4    47 msec
> 22.8.1  50 msec

It could also be better detection of the system clock frequency. I think the new 22.8 detects the frequency automatically, while in older versions one could set it manually in 'local.ini'. So I think v22.8 simply displays a more accurate time per iteration.

Regards.

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Ogijares, Granada SPAIN
Mersenne: Mersenne Prime number search monster
Hi,

here is a link that gives GIMPS some publicity:

http://www.theinquirer.net/?article=4728

Regards

Guillermo

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Ogijares, Granada SPAIN
Re: Mersenne: Prime95 for AIX
On Fri 10 May 2002 18:06, you wrote:
> Hello
> I have several PCs computing prime95. I have several big RS/6000 too, and I would like to use them for calculating primes as well. Does anyone know a program for AIX? I didn't find one. I hope you know of one.

You could try to compile Glucas. It is a Mersenne prime tester ready for most platforms, but unfortunately not as user-friendly as Prime95. See

http://glucas.sourceforge.net

You can also try Mlucas:

ftp://hogranch.com/pub/mayer/README.html

Good luck,

Guillermo

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Re: Mersenne: Two L-L tests at once?
Hi,

On Friday 01 Mar 2002 21:22, Brian J Beesley wrote:
[ snip ]
>> The memory bottleneck was the first thing I thought of, and I was close to discarding the idea when I realized that the trig data would be the same, and the required memory access would be less than double that of the single-stream scheme. If a double-stream version costs less than double the single one, then we can speed up the project a bit.
>
> On Friday 01 March 2002 00:37, George Woltman wrote:
>> Well, that would be true if SSE2 had a multiply vector by scalar instruction. That is, to multiply two values by the same trig value, you must either load two copies of the trig value or add instructions to copy the value into both halves of the SSE2 register.
>
> I can't see that being a major problem. Surely there's only one main memory fetch to load the two halves of the SSE2 register with the same value, and surely the loads can be done in parallel since there's no interaction (M -> X; then X -> R1, X -> R2 in parallel, where X is one of the temporary registers available to the pipeline).

We would have to evaluate the cost of the memory traffic needed to load data with both halves the same, versus loading two different values and then duplicating each of them into its own XMM register. I have no experience with SSE2 and no machine to try it on. This morning I skimmed the Intel PDF manual, and I saw that SSE2 was designed by Intel engineers thinking more of multimedia than of mathematics (or GIMPS). There are some elementary operations they could have implemented to make complex multiplication easy: a vector-by-scalar multiply, or an exchange between halves :-( Perhaps in SSE3 :)

> On Thursday 28 February 2002 21:20, Steinar H Gunderson wrote:
>> Testing a number in parallel with itself is obviously a bad idea if there occurs an undetected error :-)
>
> Sure. But the only way there would be a problem here (given that the data values are independent because of the different random offsets) is if there was a major error like miscounting the number of iterations. This is relatively easy to test out.
>
> I'm sort of marginally uneasy, rather than terrified, about running a double-check in parallel with the first test on the same system at the same time. Also, I think most people would rather complete one assignment in time T than two assignments in time 2T with both results unknown till they both complete. Against this is that Guillermo's suggestion does something to counter the relatively low rate at which DCs are completed.

I was also worried about that idea, but the more I think about it, the less absurd it seems to me. OTOH, I don't know how difficult the carry and normalization code of the DWT would be for two _different_ exponents. As a first approximation, I recall some branch-free code I wrote for Glucas, actually code that processes two streams at once. I mean, perhaps the cost is small.

Regards

Guillermo
Re: Mersenne: Two L-L tests at once?
Hi again:

I received the mail from the Mersenne list twice. Is it because of the subject? :) The first copy is the mail I sent to the list; the second is the same mail mirrored by a system unknown to me, 'ntsys24yucombe' !?

Back to the subject: I'm wondering how fast we could do two L-L tests in parallel using these SSE2 extensions. Basically, I'm thinking of using two nearby exponents with the same FFT length. The memory access in the FFT phase would be the same and the trig data also the same; the most difficult part would be the carry-and-normalization pass.

This difficulty could also disappear by running the second test on the same exponent. We would then get the L-L test and the double check at once. Remember that the scheme Prime95 uses for double checks is to shift the initial value by a random number of bits. The DWT scrambles the data enough to be reasonably sure both tests are independent. A matching result would give a very confident result. A non-matching result would tell us something went wrong. It would also allow us to compare interim results to be sure all is well so far.

Regards

Guillermo
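
The shifted double check described above can be illustrated on toy exponents. What follows is only a sketch under stated assumptions: it uses plain 64-bit integer arithmetic (so p must stay at or below 31) instead of a DWT, and the names `rotl_mod` and `ll_residue_shifted` are made up for this illustration, not taken from Prime95 or Glucas. The point it shows is that squaring a value stored as x*2^s doubles the shift, so the "-2" of the Lucas-Lehmer step has to be shifted as well, and the final residue matches the unshifted run once the shift is undone.

```c
#include <stdint.h>

/* Multiply x by 2^k modulo M = 2^p - 1, which is a rotation of a p-bit word.
   Assumes 3 <= p <= 31 and 0 <= x <= M. */
static uint64_t rotl_mod(uint64_t x, unsigned k, unsigned p)
{
    uint64_t m = ((uint64_t)1 << p) - 1;
    k %= p;
    return ((x << k) | (x >> (p - k))) & m;
}

/* Lucas-Lehmer test of 2^p - 1 with the working value stored shifted left
   by 'shift' bits. Returns the final residue with the shift removed;
   0 means 2^p - 1 is prime. Toy version: p must be an odd prime <= 31. */
uint64_t ll_residue_shifted(unsigned p, unsigned shift)
{
    uint64_t m = ((uint64_t)1 << p) - 1;
    unsigned s = shift % p;
    uint64_t x = rotl_mod(4, s, p);           /* stored value is 4 * 2^s */
    unsigned i;

    for (i = 0; i < p - 2; i++) {
        x = (x * x) % m;                      /* squaring doubles the shift */
        s = (2 * s) % p;
        x = (x + m - rotl_mod(2, s, p)) % m;  /* subtract 2 * 2^s, not 2 */
    }
    return rotl_mod(x, p - s, p) % m;         /* undo the shift */
}
```

With this sketch, `ll_residue_shifted(13, 0)` and `ll_residue_shifted(13, 5)` should both come out as zero (M13 is prime), and two runs on a composite such as M11 with different shifts should agree on the same non-zero residue even though every intermediate stored value differs.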
Re: Mersenne: Two L-L tests at once?
> At 11:03 PM 2/28/2002 +0100, Guillermo Ballester Valor wrote:
>> The memory bottleneck was the first thing I thought of, and I was close to discarding the idea when I realized that the trig data would be the same, and the required memory access would be less than double that of the single-stream scheme.
>
> Well, that would be true if SSE2 had a multiply vector by scalar instruction. That is, to multiply two values by the same trig value, you must either load two copies of the trig value or add instructions to copy the value into both halves of the SSE2 register.

Yes, I was thinking of copying the trig value from one half to the other, although I don't know what the cost would be.
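
For readers who want to see what "copy the value into both halves" looks like in practice, here is a small sketch using the standard SSE2 intrinsics from `<emmintrin.h>`. The interleaved data layout (stream A in even slots, stream B in odd slots) and the function name `scale_two_streams` are assumptions made for the example, not the layout of Prime95 or Glucas, and nothing here measures the actual cost discussed in the thread. A scalar fallback is included so the code also builds on machines without SSE2.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Multiply two interleaved data streams by a shared table of real factors.
   ab[2*i] belongs to stream A, ab[2*i+1] to stream B; w[i] is used by both. */
void scale_two_streams(double *ab, const double *w, size_t n)
{
    size_t i;
#if defined(__SSE2__)
    for (i = 0; i < n; i++) {
        /* One scalar value, duplicated into both halves of the register. */
        __m128d t = _mm_set1_pd(w[i]);
        __m128d v = _mm_loadu_pd(ab + 2 * i);
        _mm_storeu_pd(ab + 2 * i, _mm_mul_pd(v, t));
    }
#else
    /* Portable fallback: same arithmetic, one element at a time. */
    for (i = 0; i < n; i++) {
        ab[2 * i]     *= w[i];
        ab[2 * i + 1] *= w[i];
    }
#endif
}
```

How a compiler turns `_mm_set1_pd` into instructions (a load plus a shuffle or unpack, typically) is its own business, so this only shows the data flow, not a tuned inner loop.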
Re: Mersenne: further code optimisation using a recompile?
Hi,

On Sun 27 Jan 2002 08:23, [EMAIL PROTECTED] wrote:
> On 26 Jan 2002, at 14:45, John R Pierce wrote:
>>> http://www.slashdot.org has a link to http://open-mag.com on a new Intel compiler for Linux and M$ Windows. The new compiler makes use of the new instructions in the Pentium III and IV. Of course, the most important part of the Prime95 code does not get compiled at all, since it has already been handcoded. But it would be nice to know if a recompile with this compiler would improve throughput significantly, anyone?
>>
>> I believe the Intel C 5.0 compiler is based on Kai C++, which is hardly new. It's also $500 per user per system on Linux. The MS Windows version requires you to already have the MS Visual C++ 6.0 package, as this piggybacks on the MS C tools. This isn't going to take the open source world by storm...
>
> [snip]
> Where such a compiler would make a difference is in porting code to new or substantially different architectures; Glucas is already pretty good on IA32/linux, but a compiler upgrade might help get it a bit closer to Prime95. The difficulty with porting Prime95 to non-IA32 architectures is that so much of it is in assembler, which is not easy to port between architectures whilst retaining something approaching optimum efficiency. In fact it would be pretty much a total rewrite job.

I've already used the Intel 5.0 compiler for Glucas on both IA32 and IA64 architectures. For IA32, GNU/gcc does a better job, because Glucas also has assembler macros in its code and the Intel compiler has problems with them. So I had to compile Glucas for the Pentium III (linux) with the assembler code deactivated, and the resulting code was about 20% slower. As a check, I also deactivated the assembler code with the GNU/gcc compiler, and the result was a bit slower (5%) than Intel at very big FFT runlengths.

On IA64/linux the Intel compiler is clearly the winner (about 30% faster than gcc). Actually, the latest binaries for Glucas 2.8c/IA64 are built with it.

Another advantage of the Intel compiler is its OpenMP compatibility. I used it to make Glucas OpenMP compatible, but then I also made it multithreaded using Posix threads and the GNU/gcc compiler.

> As John's point implies, you've got to be pretty committed to shell out ~$500 per system for the privilege of compiling code on your own hardware. It would take a _really_ significant speed boost to make that sort of expenditure worthwhile.

I downloaded a free complete version for non-profit purposes from the Intel site.

Have a nice Sunday

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Re: Mersenne: GIMPS on new iMac ??
Hi,

On Sun 13 Jan 2002 03:14, Russel Brooks wrote:
> Does Glucas run similar to prime95? Does it get exponents from the server, etc? Or would I have to do manual check in/out?

Glucas usually runs fast, but not as fast as prime95. If I recall correctly, Glucas on a G4 processor runs at about the same speed as prime95 on a PIII at the same clock speed. Tom Cage did some benchmark work:

http://www.belchfirecomputing.com/GIMPS/Glucas/BenchMark.html

Glucas only does the Lucas-Lehmer test (at the moment), no factoring work. Unfortunately, you have to do manual check in/out (as with all clients outside the prime95 family).

Have a nice day.

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Re: Mersenne: GIMPS on new iMac ??
Hi all,

On Fri 11 Jan 2002 23:58, Russel Brooks wrote:
> How do you think GIMPS will run on the new iMac? (Or will it run at all?)

Glucas runs well on iMacs, and I think it will run faster on the new iMac. A multithreaded version of Glucas will be released soon; it scales pretty well on most multiprocessor systems (I think there will even be a two-processor version of the iMac).

If you are already a happy owner of this new iMac, you can check whether any of the prebuilt binaries for version 2.8c runs properly:

http://sourceforge.net/projects/glucas

Regards.

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Mersenne: Glucasmp. Help needed.
Hi all,

As a side effect of the discovery of the new Mersenne prime, it has become clear that we need a very fast double-checker. I began to study the upcoming OpenMP standard and tried to make Glucas multithreaded. I did it, and it runs well on the few systems available to me. I called this version of Glucas Glucasmp.

The problem here is to find OpenMP-compliant compilers and multiprocessor systems to test on. So far, I have written an OpenMP v.1.0 compatible version. I also wrote code for Sun WorkShop v.6.0 (using Sun-specific directives, similar to OpenMP).

These are some results:

System               Compiler        OS           %CPU  %Speed
4-Itanium  @800 MHz  Intel-C v.5.1   RedHat 7.1   320%  250%
4-Pentium3 @500 MHz  Intel-C v.5.1   Debian 2.2   400%  335%
4-ev6      @500 MHz  Compaq-C        Tru64 Unix   400%  340%
2-Ultraii  @450 MHz  SunWspro v.6.0  Solaris      200%  170%

Column 5 is the performance relative to the single-processor version using the same compiler and options. The FFT runlength is 768K. As one would expect, small FFT runlengths are less suitable for multithreading than very big ones. Around 2048K FFTs the speed improves by about 5-10%.

There is no OpenMP version for GNU/gcc :(. (Direct Posix threads code is the alternative for gcc, but it is harder to implement.)

It is still an experimental version, and this is where I need help. If anybody has access to more powerful systems with C compilers that understand OpenMP directives (or Sun MP), they could try to build and test Glucasmp. The code is at sourceforge:

http://glucas.sourceforge.net/QAdir/Glucas-2.8d.pre1.tar.gz

To build the binaries for Alphas or Sun/Solaris it is better to use the special Makefiles 'Makefile.alpha_mp' or 'Makefile.sunc_mp' that I used in my tests, but you will also need to edit those files to adapt them to your systems.

Please contact me by private e-mail with any question or suggestion.

Regards.

Guillermo.

P.S. GIMPSers with SMP systems should not get too excited. It is better to run N single-processor clients than one multithreaded N-processor client. We will get more credit with the single-processor solution. Glucasmp (or Prime95mp) should be useful when we want to run a special L-L test quickly (in QA work or after a new prime discovery).

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
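
As an illustration of the kind of directive involved, here is a minimal sketch of an OpenMP work-sharing loop. It is not taken from Glucasmp: the pointwise complex squaring step and the function name `square_pointwise` are chosen for the example only, because each element is independent and the loop needs no synchronization. A compiler without OpenMP support simply ignores the pragma and runs the loop serially.

```c
#include <stddef.h>

/* Square n complex numbers in place. d holds (re, im) pairs, so it has
   2*n doubles. Every iteration touches only its own pair, which is what
   makes the loop safe to split across threads. */
void square_pointwise(double *d, size_t n)
{
    long i;
#pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        double re = d[2 * i];
        double im = d[2 * i + 1];
        d[2 * i]     = re * re - im * im;
        d[2 * i + 1] = 2.0 * re * im;
    }
}
```

The FFT passes and the carry propagation are harder to parallelize than this step, since they have dependencies between elements; the sketch only shows the directive style.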
Re: Mersenne: M13466917 Glucas's save files
Hi again,

As I said, the server is too slow and has problems with the ftp PORT command from some web browsers. I've uploaded the whole directory to a more efficient http server. Sorry to those users who had problems in the last few hours. :(

You can download the files from the directory:

http://glucas.sourceforge.net/M13466917

and save them to disk. The readme file:

http://glucas.sourceforge.net/M13466917/README

the residue file:

http://glucas.sourceforge.net/M13466917/M13466917.res64

and the save files; here is the iter 1346 save file:

http://glucas.sourceforge.net/M13466917/s13466917

Regards

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Mersenne: M13466917 Glucas's save files
Hi all,

As you know, I am one of the lucky men with the honour of having made one of the double checks of the newly known Mersenne prime. I uploaded some save files and residues to a directory on my snail-slow ftp server:

ftp://ftp.oxixares.com/pub/M13466917

There is a small README file in it. If you are a Glucas user and want to see how Glucas found the new prime, you only have to download the save file for iteration 1346 and wait a while (or some hours):

ftp://ftp.oxixares.com/pub/M13466917/s13466917

Sorry if the server is too slow :(.

Regards.

Guillermo.

P.S. It was nice to see some pictures of the Mersenne Party.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Re: Mersenne: SF Bay GIMPS party
Hi all,

Most GIMPSers are far away from the San Francisco Bay Area, far enough to have no chance of being at the party, like me :( It would be nice if some of you took some pictures of the party and put them together on a page with a funny account of the event. Then envious guys like me could see and imagine how the party was :). In addition, we could see pictures of people we have been mailing with for some years.

Regards from Spain, Europe. Greetings to US people.

Guillermo.

[EMAIL PROTECTED] wrote:
> Spike Jones (hey, Spike!) wrote
> Let's have a Bay Area GIMPS party!
> Same place as before? I'll have the prime rib. {8-]
> spike

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Re: Mersenne: Glucas v 2.8c released.
Hi,

> Interesting to compare the performance numbers given for the Itanium running Glucas v2.8c against my Thunderbird running mprime v21.4 :

AFAIK, mprime v21.4 now uses prefetch hints to avoid idle cycles waiting for new data. Glucas/Itanium (plain C code) uses a kind of preload, not prefetch: in each loop cycle it loads what it will need in the next one. I think this kind of preload is less efficient than a direct prefetch if the prefetch hints are well tuned. To tell the truth, prefetch was the first thing I tried, but I ran into difficulties: the gcc compiler is still not good at inline assembler on Itanium, and the Intel compiler simply does not support inline assembler (for IA64). So I had to write the preload scheme I described.

But what if the FFT data are big enough? Prefetch requests then often miss the L2 cache, and the system has to wait many cycles to get the requested data near the processor. So, if we have not prefetched the data many cycles in advance, we still have to wait some clocks before using them.

> - At the smallest FFT length, the Itanium is WAY faster.

At short FFT lengths, both mprime/prefetch and Glucas/preload are efficient. The L2 caches are big enough too. Here the Itanium has the advantage of its superb FPU units. It can do any combination of FADD and FMUL, up to two ops per clock. On the AMD we can only do one FADD and one FMUL in a clock cycle; two FMULs are not permitted.

> this performance difference decreases until
> - At FFTs 640K-2048K, the Itanium is a little bit faster

Here, perhaps, the mprime/prefetch scheme is better.

> then the performance difference increases until
> - At the largest FFT length, the Itanium is noticeably faster
> Is it memory-bandwidth that lets the Itanium pull ahead at the large FFT lengths ?

Now we need a large L2. Some memory requests go to main memory. Here the L2 size and bandwidth are important. And, just a speculation, prefetch hints are less effective.

Guillermo

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
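
The difference between the two schemes mentioned above can be sketched on a trivial loop. This is an illustration only, not code from Glucas or mprime: the summation loop, the function names, and the `AHEAD` distance of 16 elements are all assumptions for the example, and `__builtin_prefetch` is the GCC/Clang builtin (on other compilers the macro below expands to nothing). A prefetch hint asks the cache to fetch a line well ahead of use without tying up a register; a preload is an ordinary load issued one iteration early, so the value is held in a register until it is needed.

```c
#include <stddef.h>

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)  /* read access, keep in cache */
#else
#define PREFETCH(p) ((void)0)                      /* no-op elsewhere */
#endif

#define AHEAD 16  /* prefetch distance in elements; an assumed, tunable value */

/* Prefetch scheme: hint the cache about data needed AHEAD iterations later. */
double sum_with_prefetch(const double *x, size_t n)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (i + AHEAD < n)
            PREFETCH(&x[i + AHEAD]);
        s += x[i];
    }
    return s;
}

/* Preload scheme: load the next element one iteration before it is used. */
double sum_with_preload(const double *x, size_t n)
{
    double s = 0.0, next;
    size_t i;
    if (n == 0)
        return 0.0;
    next = x[0];
    for (i = 0; i + 1 < n; i++) {
        double cur = next;
        next = x[i + 1];   /* issued now, consumed in the next iteration */
        s += cur;
    }
    return s + next;
}
```

Both functions return the same sum; any speed difference depends on the machine, the data size relative to the caches, and the prefetch distance.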
Re: Mersenne: Glucas v 2.8c released.
Hi again,

I forgot to mention an observation I made when writing Glucas for Itanium. The IA64 architecture has a very nice feature: predication. In the DWT used in most GIMPS clients, the normalization and carry phase has a significant cost in terms of performance. It contains some branches that are hard to predict, and here predication replaces those branches with great success.

At small FFT lengths, the relative cost of carry_and_norm is greater than at bigger runlengths. This is one more reason why the Itanium is so good at small Mersenne exponents, and why its advantage decreases as the FFT runlength increases.

Have a nice Sunday.

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
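
On hardware without predication, a common way to get rid of such branches is to do the rounding arithmetically. The sketch below is a simplified illustration, not the Glucas carry_and_norm routine: it uses a single fixed power-of-two base (a real DWT uses variable digit sizes and weights), and the names `ROUND_MAGIC`, `round_nearest` and `carry_normalize` are invented here. It also assumes IEEE double arithmetic in round-to-nearest mode and values well below 2^51 in magnitude.

```c
#include <stddef.h>

/* 3 * 2^51. Adding and then subtracting this constant rounds a double to
   the nearest integer without any comparison or branch. */
#define ROUND_MAGIC 6755399441055744.0

static double round_nearest(double v)
{
    volatile double t = v + ROUND_MAGIC;  /* volatile stops the pair being folded away */
    return t - ROUND_MAGIC;
}

/* Propagate carries through n digits so each ends up balanced around zero,
   roughly in [-base/2, base/2]. Returns the carry out of the top digit.
   'base' is assumed to be a power of two so that 1/base is exact. */
double carry_normalize(double *x, size_t n, double base)
{
    double inv = 1.0 / base;
    double carry = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        double v = x[i] + carry;
        carry = round_nearest(v * inv);   /* nearest multiple of base, no branch */
        x[i] = v - carry * base;
    }
    return carry;
}
```

For example, with base 16 the digits {37, 9, -20} become {5, -5, -3} with a carry out of -1, and both forms represent the same integer, -4939.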
Mersenne: Mersenne list problems?
Hi,

I've been trying to reach the Mersenne list to change my subscription for a week now. There is no way to contact the www.scruznet.com host. Is there a problem? Must we update the links?

Thanks.

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Mersenne: Glucas version 2.8b released!
Hi:

I am glad to announce the release of a new version of Glucas, version 2.8b. This version has some very big performance improvements for some targets. Most of the improvement is due to prefetch hints inserted in the code, another part comes from better tuning of the Glucas parameters, and there is also a small improvement from changes in the code. The biggest improvement is for Alpha ev6 and ev67 (almost 30%). You can see the ChangeLog for this version at the end of this mail.

As a sample of the new Glucas performance, here are some timings:

Secs per iter. Roundoff check ON/OFF.

                        FFT runlength
Machine  256K         512K         1792K        4096K
(1)      0.038/0.036  0.078/0.072  0.336/0.318  0.806/0.763
(2)      0.041/0.039  0.084/0.079  0.355/0.328  0.867/0.835
(3)      0.045/0.042  0.106/0.101  0.459/0.439  1.126/1.167
(4)      0.069/0.067  0.174/0.171  0.809/0.783  2.168/2.161
(5)      0.255/0.246  0.547/0.524  2.261/2.183  5.275/5.094
(6)      0.170/0.162  0.370/0.354  1.554/1.485  3.588/3.496
(7)      0.199/0.192  0.497/0.483  2.214/2.152  5.637/5.519

(1) Alpha ev67, 667 MHz, Linux 2.4. 64 KB L1, 4 MB L2.
(2) AMD Athlon, 1200 MHz, 266 FSB, Linux 2.4.
(3) IA-64 Itanium, 800 MHz.
(4) Sparc UltraIIi, 450 MHz, 16 KB L1, 4 MB L2. Solaris 5.8.
(5) PowerMac G3, 300 MHz, 512 KB L2, Mac OS X.
(6) PowerMac G4, 400 MHz, 1 MB L2, Mac OS X.
(7) RS/6000, ppc604e, 375 MHz, 1 MB L2, Linux 2.2.

Thanks to all the people who helped me improve Glucas. Special thanks to Klaus Kastens, B.J. Beesley and Tom Cage for their work.

You can download the files at

https://sourceforge.net/project/showfiles.php?group_id=24518&release_id=47225

The home page for Glucas is http://glucas.sourceforge.net.

And this is the ChangeLog:

v.2.8b 07/Aug/2001

- Great progress has been made on prefetching for Alphas. B.J. Beesley found a way to insert assembler prefetch hints in Compaq-C code. The improvement is about 30% or more for ev6 and ev67!

- Other prefetch hints have been coded for other platforms. At the moment there are no significant gains on platforms other than x86 and powerpc.

- Good news for Mac OS users, both with classic MacOS and Mac OS X: a big performance improvement for the powerpc family (10%: about 3% from adding prefetch hints and 7% from tuning the parameters). Klaus Kastens and Tom Cage did the job.

- Binaries for Itanium IA-64 have improved a lot, but here the credit goes to the GNU/gcc team. With gcc 3.0 Glucas is now almost twice as fast.

- Long macros have been coded for radices 4 and 8. They can take more advantage of prefetch and help less clever compilers. They can be activated with the -DY_LONG_MACROS compiler flag. The gain ranges from 5% to -1%.

- Some routines have been recoded to hide some dependency stalls and to make it easier to vectorize with instruction sets like AltiVec (G4+) or SSE2. This is activated with -DY_VECTORIZE. We have to set -DY_KILL_BRANCHES as well for it to have any effect. About 1% gain in most cases.

- Radix 4 can be recoded to use other vectorized macros, doing two radix-4 transforms in a single loop pass. It can be useful sometimes. It is equivalent to unrolling those loops. To activate it: -DY_VECTORIZE2

- When the prefetch hints are very costly, we can try the flag -DY_VECTORIZE_EXPENSIVE. Radices 4 and 8 will be unrolled to save a lot of prefetch calls. It is still an experimental feature.

- Selftest now outputs all the active Glucas flags. It is useful in tuning and development tasks.

- If option Alternative_output_flag == 2, the output is now sent both to stdout (console) and to the file set with option Output_file.

- Compile time has been reduced a lot in most cases. The routines are trivial when we do not need them. It will make the developer's work easier.

Regards.

Guillermo.

--
Guillermo Ballester Valor
[EMAIL PROTECTED]
Granada (Spain)
Re: Mersenne: More P4 timings
Hi:

George Woltman wrote:
> I just completed my first 512K FFT using the new SSE2 instructions! The 512K FFT handles exponents up to 10.3 million. Timings are as follows:
>
> 1.4GHz P4, old code:        0.126 sec.
> 1.4GHz P4, new code:        0.048 sec.
> 1.2GHz Athlon, 133MHz DDR:  0.084 sec.
>
> I have a few more optimizations up my sleeve. I think my goal of 0.040 seconds is achievable.

I really think Intel should give you a good prize! I can't imagine better publicity for the P4. The price cuts announced recently will help too.

Good job!

Regards.

Guillermo
Re: Mersenne: Glucas for the Macintosh, version 2.7b
Hi:

Tom Cage wrote:
> On Monday, 26 March 2001, Guillermo Ballester Valor [EMAIL PROTECTED] released version 2.7b of Glucas which is now available for the Macintosh.

There are almost no differences between this and v2.7. Klaus Kastens detected a bug in the Mac clients which made the save files incompatible with other platforms. This is fixed in the new version, but it is important to note that the new save files will not be compatible with v2.7 and older clients.

On the other hand, the coincidence of Glucas v2.7b and the new Mlucas v2.7b is pure chance. Ernst, surely I will have to bump the version soon. :)

Happy hunting.

Guillermo.
Mersenne: More than 10000 athlons!.
Hi all:

Last weekend the number of AMD Athlons registered in PrimeNet passed 10000. In a few weeks they will be the most popular processor among GIMPSers if the current trend continues.

George, is an optimization for Athlons possible? I mean, if it is worthwhile; as far as I know (I may be wrong) the documentation for AMD processors is not as clear as Intel's.

Regards.

Guillermo.
Re: Mersenne: Re: P4 Optimization
[EMAIL PROTECTED] wrote:
> One of the drawbacks of doing it by hand in assembler...too bad high-quality HLL compilers (i.e. ones capable of giving 80-90% of the performance of laboriously coded and hand-tuned ASM, for complex, data-nonlocal algorithms requiring lots of data prefetch) appear to be nigh-impossible to write for CISCs like the x86 family. I don't want to start a RISC-versus-CISC flame war here, but the fact is, no high-level FFT-based large-integer-multiply code has gotten within a factor of 2 of the performance of Prime95 on the Pentium.

Glucas, a program written in C to run L-L tests, reaches about 55% of Prime95's performance when compiled with the GNU/gcc compiler on a Pentium with no assembler macros used. After adding about three hundred lines of assembler, the performance rises to about 65%.

Regards.

Guillermo.
Mersenne: Glucas v.2.0 released
Hi:

After some delays, here is Glucas 2.0. I now think it is stable enough to try complete Lucas-Lehmer tests from PRIMENET (using the manual forms, of course). You can download the package from E.W. Mayer's server (thanks Ernst):

ftp://209.133.33.182/pub/valor/Glucas-2.0.tar.gz

You can read more about Glucas in

ftp://209.133.33.182/pub/valor/README.Glucas.htm

The performance is close to Mlucas when it is well tuned. It can be a good chance to extend GIMPS and the Lucas-Lehmer test to platforms that have good C compilers but no expensive f90 ones.

The current release is, at the moment, for the UNIX/Linux world and all its variants. x86 users should, of course, use mprime (Glucas reaches about 65% of mprime's performance), but Glucas can be used for double-check purposes.

Some notable features of this release are:

- It uses the Interchangeable Mersenne Residue File Format for save files. The save files can be used on most systems (and they are very compact). Nevertheless, it has backward compatibility with Will Edgington's rw() routines used in MacLucasUNIX.

- There is no problem with accuracy. Glucas adjusts the FFT runlength at run time if the roundoff errors are too high.

- It is coded using C macro facilities intensively. It is relatively easy to write small assembler macros to speed it up (as was done for the GNU/gcc compiler on x86).

There are still no precompiled binaries. I think there will soon be binaries for Alpha-OSF (ev56, ev6). So, for now, you have to build the binary yourself :(.

We need some Unix gurus to make the binaries as fast as possible. You can read in the documentation how to test and time Glucas.

Any help, feedback or suggestion is welcome.

Regards

Guillermo.
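
The run-time accuracy check mentioned in the feature list can be pictured roughly as follows. This is a hedged sketch, not Glucas's actual logic: after the inverse transform every element should be close to an integer, so the largest distance to the nearest integer is a natural error measure. The function names, the error limit passed in by the caller, and the choice to simply double the runlength are assumptions for illustration (Glucas also supports lengths such as 768K and 1792K, so its real selection rule is certainly different).

```c
#include <stddef.h>

/* Largest distance between any x[i] and its nearest integer.
   Values are assumed small enough to fit in a long long. */
double max_roundoff_error(const double *x, size_t n)
{
    double worst = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        double nearest = (double)(long long)(x[i] + (x[i] < 0.0 ? -0.5 : 0.5));
        double err = x[i] - nearest;
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}

/* Hypothetical policy: keep the current runlength while the error stays
   under the limit, otherwise move to a runlength twice as long. */
size_t choose_runlength(size_t current, double max_err, double limit)
{
    return max_err > limit ? current * 2 : current;
}
```

An error close to 0.5 means the rounded digit can no longer be trusted, which is why a limit somewhat below that is usually chosen; the exact value is a tuning decision.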
Re: Mersenne: Glucas v.2.0 released
Guillermo Ballester Valor wrote:
> There are still no precompiled binaries. I think soon will be binaries for Alpha-Osf (ev56, ev6). Thus, you have to make the binary :(.

Now there are two binaries in the directory

ftp://209.133.33.182/pub/valor/bin

One is for Alpha ev5-OSF and the other is for Pentium GNU/Linux glib2.0.

You can play with them.

Regards

Guillermo.
Re: Mersenne: P4
John R Pierce wrote:
>> I know this is a little off-topic, but how good is the P4 at integer operations?
>
> not that off topic at all. Integer multiplies can be more efficient than FP for this sort of thing, IF they are as fast. If a processor could pipeline 64 bit integer multiplies at one per clock with parallel adds at the same time it would be as fast or faster than using the 80 bit FP format... It will be interesting seeing what IA64 brings to the table when it finally gets up to speed...
>
> -jrp

Today I read in the manuals that a simple integer add with carry (ADC) has a latency of 8 clocks and a throughput of 3 clocks on a P4. Hmm, that is quite a slowdown for a simple IA32 instruction; the Intel engineers must have their reasons.

Guillermo.
Re: Mersenne: Mac OS X
Tony Gott wrote:
> As the development of MAC Gimps seems to have come to a halt I wondered if any of the Unix gurus out there had considered porting the software to the unix base of the MAC OS X operating system? Is this a big job, or is it possible to use one of the existing unix ports to work on the OS?

Well, I'm not a UNIX guru. I've been writing a C program to run the Lucas-Lehmer test (Glucas). Some months ago I released a beta (or alpha) version to ask for help in extending the program to other platforms. Only a few volunteers built binaries, on a few platforms, and there is still the tuning work to do.

Because time goes fast, I decided not to wait: in a few days I will release a new version of Glucas. It runs fast enough to be compared with Mlucas. There seems to be no problem in the UNIX world, but on other OSes I will need some more help.

So, you can try to build a binary, test it, tune it and tell us how it works. It has a lot of advantages over MacLucasUNIX.

Best regards.

Guillermo.
Mersenne: Interchangeable format for Mersenne number residues
Hi:

As I said in a recent mail to the list, I've been working on the release of Glucas. One of the features it now supports is writing and reading the 'save' files in a platform-independent way. I can use the same residue files on both my old Pentium-MMX and an Alpha 21164.

The format is described on this page:

ftp://209.133.33.182/pub/valor/mformat.htm

Any comments or suggestions?

Regards

Guillermo.
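
The actual layout is the one described on the page above and is not reproduced here. As a general illustration of how a save file can be made independent of the host's byte order, here is a sketch that serializes a 32-bit word one byte at a time in a fixed order. The little-endian choice and the helper names `put_u32_le` and `get_u32_le` are assumptions for the example, not part of the Glucas format.

```c
#include <stdint.h>

/* Store v into buf[0..3], least significant byte first, whatever the
   byte order of the machine running the code. */
void put_u32_le(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v & 0xFF);
    buf[1] = (unsigned char)((v >> 8) & 0xFF);
    buf[2] = (unsigned char)((v >> 16) & 0xFF);
    buf[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Read the value back from the same fixed layout. */
uint32_t get_u32_le(const unsigned char *buf)
{
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}
```

Because only shifts and masks are used, a file written this way on a little-endian Pentium reads back identically on a big-endian machine, provided the residue itself is stored as integers rather than as raw floating-point memory.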
Mersenne: Glucas beta version released.
Hi to all: After a long, hot summer, I've finally written a first beta version of Glucas. It is a C program to perform the L-L test. There are still no good scripts for building the binaries easily. I need the help of testers to build fast binaries and fix possible bugs. Obviously, on Intel platforms its 'big brother' mprime is faster. On Alphas it is about as fast as Mlucas. I don't know how well (or badly) it runs on other machines; I'm waiting for your news ;-).

To see how you can help, read this first: ftp://209.133.33.182/pub/valor/README.htm
For more details about Glucas: ftp://209.133.33.182/pub/valor/README.Glucas.htm
To download the source: ftp://209.133.33.182/pub/valor/Glucas-1.98.tar.gz

Thanks to Ernst Mayer for all his help, especially for the use of his server.

Regards
Guillermo.
Re: Mersenne: Re: Glucas beta version released.
Hi:

Yann Forget wrote:

Hi, I'll test it on RS6000: IBM B50, PPC 604e, 375 MHz, 500 MB RAM

The first bug has already been found by Yann Forget! It affects users of the GNU gcc compiler on platforms other than x86: gcc will give an error. You can now try version 1.99, with this first stupid bug fixed. The new source code: ftp://209.133.33.182/pub/valor/Glucas-1.99.tar.gz

Thanks!

Regards.
Guillermo.
Re: Mersenne: HLLLL and HLL....
Hi:

[EMAIL PROTECTED] wrote:

On IA32 systems, how the code is aligned is also a factor. To compare accurately, you'd really need the separate code fragments to be in their own dedicated segments. This is not the way that un*x or Win32 applications are usually coded.

Not being up on the latest HLL compilers, I would still suspect that some compiler directives/options would be available to handle alignment properly without much headache, especially if it is as important as you suggest.

I have the teensiest fraction of knowledge about C (I'm trying to learn it now), but I know a little about compiling C programs. With DJGPP, you can enable cool options like -malign-double which can really speed up some programs. Is this the "alignment" that you're speaking of?

On IA32 systems, alignment is one of the most important factors in achieving decent performance. Unfortunately, some compilers (GNU gcc) do not offer an easy way to get the alignment right. When writing Glucas, I discovered that -malign-double only aligns doubles when they are global or static variables. Local variables on the stack are not aligned: because the stack is only 4-byte aligned, there is just a 50% chance that a double lands on an 8-byte boundary. In a bad alignment scenario, performance can drop by half (or even more). To achieve good performance I had to use the same trick, based on a built-in malloc, that FFTW uses, i.e. conditionally allocate 4 bytes in the calling routine so that the doubles in the called routine are properly aligned. An ugly solution, as you see.

Regards
Guillermo.
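The stack trick itself depends on the compiler, but the underlying idea (pad by a few bytes so that the doubles start on an 8-byte boundary) can be sketched on the heap in portable C. This is a swapped-in illustration, not the FFTW or Glucas code; the name `alloc_aligned_doubles` and the `raw` out-parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DOUBLE_ALIGN 8u  /* wanted alignment in bytes */

/* Hypothetical helper: returns a pointer to room for n doubles that is
   aligned to DOUBLE_ALIGN bytes. The pointer to pass to free() is stored
   in *raw, because the aligned pointer may differ from it. */
double *alloc_aligned_doubles(size_t n, void **raw)
{
    assert(raw != NULL);
    /* Over-allocate so there is room to move forward to a boundary. */
    *raw = malloc(n * sizeof(double) + DOUBLE_ALIGN);
    if (*raw == NULL)
        return NULL;
    uintptr_t addr = (uintptr_t)*raw;
    /* Round the address up to the next multiple of DOUBLE_ALIGN. */
    addr = (addr + (DOUBLE_ALIGN - 1u)) & ~(uintptr_t)(DOUBLE_ALIGN - 1u);
    return (double *)addr;
}
```

Many current malloc implementations already return memory aligned to at least 8 bytes, in which case the rounding step moves nothing; the sketch only shows how the padding and rounding work when that guarantee is missing.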
Mersenne: New Lucas-Lehmer test program.
Hi to all:

In the last few weeks I've been writing a program to perform the L-L test. I'm the guy who wrote an adaptation of MacLucasUNIX to the FFTW package half a year ago. It seemed to run fast on Intel x86, but on other platforms Mlucas and MacLucasUNIX were better than MacLucasFFTW. I learnt from that that the first thing I had to do was to write my own FFT code. I did it, and the code seems to be fast. I called it YEAFFT (Yet Another FFT), and it is actually a fast convolver. It is a complex-based FFT and makes intensive use of the C macro expansion facilities. The FFT routines are based on a short file of macros which covers all the common tasks. These macros are currently written in a generic form, but I suppose they can be tuned for a particular target, or even replaced by assembler code if the C compiler supports that feature.

My aim when writing the code was that it could run on a wide variety of systems. In fact, my test version (without priority management) runs on an old 486 PC with Borland Turbo C++ under MS-DOS 6.20, and also runs on an Alpha 21264. With the -Dmacro compiler flags one can choose a lot of features of the package to adjust it to the target system and make it as fast as possible.

Ernst W. Mayer has kindly made some timings and has created a telnet account for me on his machines. Without his suggestions and help in testing I would not have been able to write this code. The results are similar to Mlucas on MIPS and Alphas. This may be good news for platforms without an f90 compiler but with good C ones (for example, Apple Mac users). It supports the same FFT lengths as prime95, plus a radix-9 (like Mlucas).

The Lucas-Lehmer test code, which calls the YEAFFT routines, is called Glucas. The code runs fine on my Pentium 166-MMX under SuSE Linux (about 50% of the performance of mprime, but better than MacLucasFFTW). The I/O code is from Will Edgington's MERS package. Will Edgington is now working to include it in his MERS release, and I hope we will soon have results.
For testing purposes there is another program, ylucas2, with the same interface as Dr. Crandall's original lucdwt.c. It has no restart features or priority management, so it is not recommended for a complete big L-L test, but it is good for seeing the possibilities of Glucas. I can send a zip copy of the source to interested people; send me a private e-mail.

Soon there will be another L-L tester in the GIMPS arena. I wonder whether there will be a 'universal' interface to Primenet (perhaps in Java), able to read the outputs from Mlucas, MacLucasUNIX, Glucas..., talk to the Primenet server and manage the work for the clients, all automatically (nothing manual).

Regards.
Guillermo.
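For readers new to the test these programs perform, here is a minimal sketch of the Lucas-Lehmer recurrence in plain C. It is not Glucas code: it uses ordinary 64-bit integer arithmetic, so it only works for small exponents (the limit of 31 is an assumption that keeps the square inside 64 bits), whereas the real programs square multi-million-bit numbers with an FFT-based convolution. The function name `lucas_lehmer` is chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Lucas-Lehmer test for M_p = 2^p - 1, where p is an odd prime, 3 <= p <= 31.
   Returns 1 if M_p is prime, 0 otherwise. */
int lucas_lehmer(unsigned p)
{
    assert(p >= 3 && p <= 31);
    const uint64_t m = ((uint64_t)1 << p) - 1;  /* the Mersenne number */
    uint64_t s = 4;                             /* starting value s_0 */
    for (unsigned i = 0; i < p - 2; i++) {
        /* s < 2^31, so s*s < 2^62 fits in 64 bits. Adding m before
           subtracting 2 keeps the value non-negative. */
        s = (s * s + m - 2) % m;
    }
    return s == 0;  /* M_p is prime exactly when s_(p-2) is 0 mod M_p */
}
```

For example, `lucas_lehmer(7)` reports that 127 is prime, while `lucas_lehmer(11)` reports that 2047 = 23 * 89 is not.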
Re: Mersenne: Re: register usage
[EMAIL PROTECTED] wrote:

And (according to a local computer scientist, who I think knows what he's talking about) with modern processors making extensive use of register renaming, it's not usually sensible to use the "register" keyword _at all_. The theory is that the instruction scheduler can do at least as good a job as the programmer

Before the FFTW thread began, this was what I thought about the 'register' keyword. I had never used it in my programs. When I was learning C, using the register keyword made my programs slower, so I left it to the compiler to make good use of the registers (I remember the compiler was Borland Turbo-C under DOS).

On the other hand, I was under the impression that the C "register" keyword was intended as a suggestion to the compiler, not an absolute, i.e. that compilers should be free to heed or ignore the programmer's advice on this point. Looks like whether that's true depends on the compiler.

Yes, I've compiled MacLucasUNIX on my Pentium 166-MMX (linux/gcc) and it doesn't give any error or warning about 'register'.

My question remains: why does FFTW behave so differently on modern RISC processors? The use of register is probably not the answer. The FFTW code tries to minimize memory accesses, multiply and add operations, and temporary variables on the stack. Perhaps the weight given to each of these goals is right for Intel machines but not for Alphas or MIPS; I don't know.

By the way, the memory access pattern in FFTW seems very good for Intel, but I have not seen anything similar in Mlucas or MacLucasUNIX. FFTW accesses memory in the form X[i*iostride], where iostride is not necessarily a power of two. I don't know a word of assembler for Alphas or any machine other than x86, but perhaps it is better to keep the offset in a register and then use X[offset] when needed. Again, a good compiler ought to do that by itself.
Regards,
Guillermo Ballester Valor
[EMAIL PROTECTED]
c/ cordoba, 19
18151-Ogijares (Spain)
(Linux registered user 1171811)
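The two access styles mentioned in this message can be put side by side in a minimal C sketch. This is not FFTW code; the function names and the simple summing loop are assumptions chosen only to show the difference between recomputing `i*stride` and carrying a running offset.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n elements spaced stride apart, multiplying the index each time,
   in the X[i*iostride] style. */
double sum_strided_mul(const double *x, size_t n, size_t stride)
{
    assert(x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i * stride];
    return s;
}

/* Same sum, but the offset is kept in a variable and advanced by an
   addition, so no multiply is needed inside the loop. */
double sum_strided_offset(const double *x, size_t n, size_t stride)
{
    assert(x != NULL);
    double s = 0.0;
    size_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[offset];
        offset += stride;
    }
    return s;
}
```

Optimizing compilers commonly perform this rewrite themselves (it is known as strength reduction), so whether the hand-written second form is any faster depends on the compiler and the target.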
Re: Mersenne: New beta mers release: new Lucas-Lehmer program
Hi:

Simon Burge wrote:

Will Edgington wrote: (The 49% CPU usage is because my computer was also doing a long term ecm3 run, including during the FFTW tuning.)

I recommend running tunefftw when the system is idle, i.e. when no other idle-time programs such as mprime or ecm3 are running; it will take no more than half an hour. I've observed a performance gain of more than 25% when the tuning was done under these conditions. Using the runtime 'wisdom' might not be the optimal solution.

Feel free to send me any questions, bug reports, and so on. As he notes in comments and his README file, I believe Guillermo welcomes feedback as well.

Of course, any feedback is welcome ;-).

Guillermo Ballester Valor
Re: Mersenne: Re: FFTW for GIMPS?
Hi:

Paul Leyland wrote:

Actually, we at Microsoft Research in Cambridge have seen similar effects when compiling and running FFTW code. Our discovery is that the alignment of FP data values is critical. Get it wrong, and performance can plummet. Unless you set the alignment explicitly, it will be wrong approximately half the time.

You're right: I gained 35% in performance just by using a simple trick to make sure the data were 8-byte aligned. On the other hand, I built the FFTW library with the long double type (with its 'awful' 10-byte size), and the performance was about 65% of that obtained with the double type.

Guillermo Ballester Valor
Mersenne: Re: Mlucas 2.7 for x86?
[EMAIL PROTECTED] wrote:

I'm not familiar enough with the details of FFTW to say for sure (Steven Johnson or Jason Papadopoulos could answer this), but I'm pretty sure Frigo/Johnson have done some machine-specific ASM of critical portions of the code, at least for some popular architectures - if so, the x86 would be one of the first candidates, for the above reason and the fact that it's so numerous. That may explain why FFTW performs so well on the x86.

I've skimmed the code and have not seen a single line of assembler. Perhaps the C code has been tuned with x86 in mind.

Is there any linux or window Mlucas 2.7 executable for intel machines?

No, but anyone with an f90 compiler for x86 is free to download and compile the source. By far the best compiler I know of for x86 is the Digital/Compaq/Microsoft Visual f90 for Windows - I don't have that one to play with, unfortunately, but I'd be interested to hear from someone who does what timings they get.

I've found a Microsoft f90 compiler. I tried to compile Mlucas this afternoon, but it gives me some errors that I think are easy to solve. The compiler requires interfaces for most of the FFT routines because the first parameter 'a' is a target!? :-( I don't have much time, but I'll keep working on it.

Yours sincerely,
Guillermo Ballester Valor
Re: Mersenne: Re: FFTW for GIMPS?
Jason Stratos Papadopoulos wrote:

For really big FFTs you can get major gains by using FFTW as a building block in a bigger program, rather than have it do your entire FFT with a single function call. As Ernst mentioned, the building block approach lets you fold some of the forward and inverse FFT computations together, and this saves loads of time in cache misses avoided. On the UltraSPARC, using FFTW for the pieces of your computation rather than the whole thing is somewhere between 2 and 10 times faster than FFTW alone.

That could be terrific! I'll look into it.

On the Pentium, assembly-coded small FFTs run more than twice as fast as FFTW. Even from C, you can do better on the Pentium (do a web search for djbfft, a free Pentium-optimized FFT package). For a recursive split-radix, you need about 200 lines of assembly; surely this is worth twice the speed!

I would like to write some general-purpose C code. For tuned assembler we already have Woltman's fantastic prime95/mprime code.

Thank you very much for your comments. They will help me a lot. Have a nice weekend.

Guillermo Ballester Valor
Re: Mersenne: Linux mprime and glibc 2.1
Hi:

"Ethan M. O'Connor" wrote:

(In response to all the mprime segfaulting under glibc 2.1 messages) I had some problems a while back with this issue as well, and then it suddenly went away. I'm using the potato release of Debian, and right now they are tracking the prereleases of glibc 2.1.2 pretty closely. I think that the problems started happening with the change from glibc 2.1.1 to an early pre-2.1.2, and went away with another upgrade.

Have you tried running sprime? (It is mprime, but linked with static libraries.) I had some problems on an old machine with an old Linux, and sprime ran well there. On my machine, sprime even runs a bit faster than mprime!

Guillermo Ballester Valor
Mersenne: Cleared exponents
Hi:

There are two databases in the GIMPS/Primenet project. The master is the GIMPS one: it holds all the information about the search, and therefore includes the basic information from the Primenet data. There is a lot of data in Primenet, and the synchronization is performed every few months.

Until today, I thought the synchronization scheme was to take the results received before a given date and then clear those exponents from reports such as the cleared exponents report (cleared.txt) and the individual account reports (the credits are taken into account anyway). But I was wrong :-(. It seems the last synchronization was on Aug-9, and my exponents finished before that day disappeared from my personal account report, as I expected. But the cleared exponents report still contains a lot of results sent to Primenet before Aug-9. My results are not in this list, so... why do results from other accounts remain in it?

Thanks for your time.

Guillermo Ballester Valor