Bill Rea <[EMAIL PROTECTED]> writes:

<< I tried the Mlucas_2.7x compiled with Sun's f90 version 2 and compared this against MacLucasUNIX v6.20. On the two ranges of exponents I tried, Mlucas was about 15% slower than MacLucasUNIX, using 256K and 512K FFTs. Currently I'm double checking some exponents around 4,100,000 and doing LL testing on some around 8,300,000. One advantage Mlucas might have over MacLucasUNIX is that it can do FFT sizes other than 2^n, but even using the 224K and 448K FFTs for these exponent sizes, it was slower than MacLucasUNIX by about 8%. >>

Hi, Bill, and thanks for the timings. Note that there's an updated beta, v2.7y, available now. It has better cache behavior and some minor bugfixes. I get about a 10% speedup over v2.7x on my MIPS R10Ks (more at power-of-2 runlengths, and especially at really big N).

<< The compiler options I used were:

-fast -libmil -xlibmopt -xarch=v9 >>

Alex Krupa says he got the best timings using

-fast -libmil -xlibmopt -fns=yes -xarch=v9

but I don't know if the -fns=yes flag makes a difference. I'm also told that Solaris 6 users should use -xarch=v8plus, as S6 doesn't support -xarch=v9. You might also try using -xinline=all and -xdepend. (I have no SPARC to play with, so I don't know if those will help.)

One final optimization you can try is to do runtime profiling at a desired FFT length to improve performance.
Rob Vassar <[EMAIL PROTECTED]> writes (about some timings he did using v2.7x):

<< Some preliminary results for M110503:

flags: "-fast -xarch=v9 -xchip=ultra2i -xcache=16/32/1:512/64/1 -libmil -xlibmopt -fns=yes"
~0.00443 per iteration

flags: "-fast -xO5 -xarch=v9 -xchip=ultra2i -xcache=16/32/1:512/64/1 -libmil -xlibmopt -fns=yes"
~0.00412 per iteration

[Next Rob added runtime profiling: compile as below, but using -xprofile=collect, then do a 100-iteration timing test at a desired runlength, then recompile using]

flags: "-fast -xO5 -xarch=v9 -xchip=ultra2i -xcache=16/32/1:512/64/1 -libmil -xlibmopt -fns=yes -xprofile=use"
~0.00372 per iteration!

So, it would appear that profiling does offer some gain on M110503. I'm going to try a few larger FFT sizes and see if the profile gain remains, or is limited to one FFT size.

Before I forget... The binary I generated is tuned specifically to the Ultra-IIi CPU in my Ultra 10. Compiling for the Ultra 2 with 300 MHz CPU(s) would change the chip and cache flags as follows:

"-xchip=ultra2 -xcache=16/32/1:2048/64/1"

Since the binary is "-xarch=v9" it will only run on an Ultra class machine, not SS20's, SS5's, or SS2's. >>

Note, however, that profiling as above appears to improve performance ONLY AT THE PARTICULAR FFT LENGTH USED to collect the profile data. It may actually hurt performance at other lengths. If it gives a nice speedup at a particular N, you might consider compiling a separate version for each FFT length you are using. You wouldn't need more than (say) two at any one time; at the moment, for example, you'd need ones for 384K and 448K.

Best regards,
Ernst
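
For reference, the relative gains implied by Rob's three per-iteration timings can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the timings are the ones quoted above, while the short labels and the helper name percent_faster are made up here for illustration and have nothing to do with Mlucas itself.

```python
# Per-iteration timings quoted in Rob's message for M110503.
# The dictionary keys are illustrative labels (abbreviated flag sets),
# not the full command lines.
timings = {
    "-fast": 0.00443,
    "-fast -xO5": 0.00412,
    "-fast -xO5 -xprofile=use": 0.00372,
}


def percent_faster(old, new):
    """Return the percentage reduction in per-iteration time from old to new."""
    return 100.0 * (old - new) / old


labels = list(timings)
baseline = timings[labels[0]]
previous = baseline
for label in labels:
    t = timings[label]
    # Report the gain over the preceding build and over the first build.
    print(f"{label:28s} {t:.5f}  "
          f"vs previous: {percent_faster(previous, t):5.1f}%  "
          f"vs baseline: {percent_faster(baseline, t):5.1f}%")
    previous = t
```

On these numbers, adding -xO5 cuts roughly 7% off the per-iteration time, profiling cuts a little under 10% more, and the two together come to about 16%. The profiling figure was measured at a single FFT length only, so it should not be assumed to carry over to other lengths.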