Bill Rea <[EMAIL PROTECTED]> writes:

<< I tried the Mlucas_2.7x compiled with Sun's f90 version 2 and compared
this against MacLucasUNIX v6.20. On the two ranges of exponents I tried
Mlucas was about 15% slower than MacLucasUNIX, using 256K and 512K
FFTs. Currently I'm double checking some exponents around 4,100,000 and
doing LL testing on some around 8,300,000. One advantage Mlucas might
have over MacLucasUNIX is that it can do FFT sizes other than 2^n,
but even using the 224K and 448K FFTs for these exponent sizes, it 
was slower than MacLucasUNIX by about 8%. >>

Hi, Bill, and thanks for the timings. Note that there's an updated beta,
v2.7y, available now. It has better cache behavior and some minor bugfixes.
I get about a 10% speedup on my MIPS R10Ks (more at power-of-2 runlengths,
and especially at really big N) over v2.7x.

<< The compiler options I used were:-

 -fast -libmil -xlibmopt -xarch=v9 >>

Alex Krupa says he got the best timings using

 -fast -libmil -xlibmopt -fns=yes -xarch=v9,

but I don't know if the -fns=yes flag makes a difference. I'm also
told that Solaris 6 users should use -xarch=v8plus, as S6 doesn't
support -xarch=v9. You might also try using -xinline=all and -xdepend.
(I have no SPARC to play with, so don't know if those will help.)

One final optimization you can try is to do runtime profiling at a
desired FFT length to improve performance. Rob Vassar <[EMAIL PROTECTED]>
writes (about some timings he did using v2.7x):

<< Some preliminary results for M110503:

flags: "-fast -xarch=v9 -xchip=ultra2i -xcache=16/32/1:512/64/1
-libmil -xlibmopt -fns=yes" ~0.00443 per iteration

flags: "-fast -xO5 -xarch=v9 -xchip=ultra2i -xcache=16/32/1:512/64/1
-libmil -xlibmopt -fns=yes" ~0.00412 per iteration

[Next Rob added runtime profiling - compile as below, but using
-xprofile=collect, then do a 100-iteration timing test at a desired
runlength, then recompile using]

flags: "-fast -xO5 -xarch=v9 -xchip=ultra2i -xcache=16/32/1:512/64/1
-libmil -xlibmopt -fns=yes -xprofile=use" ~0.00372 per iteration!

So, it would appear that profiling does offer some gain on M110503.
I'm going to try a few larger FFT sizes and see if the profile gain
remains, or is limited to one FFT size.  Before I forget... The binary
I generated is tuned specifically to the Ultra-IIi CPU in my Ultra 10.
Compiling for the Ultra 2 w/300MHz CPU(s) would change the chip and
cache flags as follows "-xchip=ultra2 -xcache=16/32/1:2048/64/1".
Since the binary is "-xarch=v9" it will only run on an Ultra class
machine, not SS20's, SS5's, or SS2's. >>
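
To put Rob's three timings side by side, here is a small Python sketch of
the arithmetic. It assumes the quoted per-iteration figures are in seconds
(the units aren't stated above), and the helper names are mine, not part
of Mlucas. It uses the fact that a Lucas-Lehmer test of M_p takes p - 2
iterations. On these numbers, -xO5 saves roughly 7% per iteration,
profiling a further 10% or so, for about 16% overall.

```python
# Per-iteration timings quoted above for M110503 (assumed to be seconds).
timings = [
    ("-fast", 0.00443),
    ("-fast -xO5", 0.00412),
    ("-fast -xO5 -xprofile=use", 0.00372),
]

def percent_saved(before, after):
    """Percentage reduction in per-iteration time going from before to after."""
    return 100.0 * (before - after) / before

def ll_test_seconds(p, sec_per_iter):
    """Estimated time for a full Lucas-Lehmer test of M_p (p - 2 iterations)."""
    return (p - 2) * sec_per_iter

baseline = timings[0][1]
for label, t in timings:
    print(f"{label:26s} {t:.5f} s/iter  "
          f"{percent_saved(baseline, t):5.1f}% saved  "
          f"~{ll_test_seconds(110503, t):.0f} s per full test")
```

The same two helpers can be pointed at timings from any other exponent or
FFT length to check whether the profile gain carries over.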

Note, however, that profiling as above appears to improve performance
only AT THE PARTICULAR FFT LENGTH USED to collect the profile data. It
may actually hurt performance at other lengths. If it gives a nice
speedup at the particular N, you might consider compiling a couple of
versions, one per FFT length. You wouldn't need more than (say) two at
any one time, e.g. at the moment you'd need ones for 384K and 448K.
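
As a rough sketch of what building one profiled binary per FFT length
might look like, the Python below just prints the two compile commands
for each length. The flags are Rob's from above; the source file name
(Mlucas.f90), the output names (mlucas_384K etc.) and the use of -o are
my assumptions, so adjust them to your setup. The timing run between the
two compiles is left as a comment, since how you start it depends on
your version of the program.

```python
# Flags from the fastest non-profiled build quoted above (Ultra-IIi settings).
BASE_FLAGS = ("-fast -xO5 -xarch=v9 -xchip=ultra2i "
              "-xcache=16/32/1:512/64/1 -libmil -xlibmopt -fns=yes")

def profile_build_commands(fft_k, source="Mlucas.f90", compiler="f90"):
    """Return the (collect, use) compile commands for one FFT length in K.

    The source and output file names are hypothetical placeholders.
    """
    out = f"mlucas_{fft_k}K"
    collect = f"{compiler} {BASE_FLAGS} -xprofile=collect -o {out} {source}"
    use = f"{compiler} {BASE_FLAGS} -xprofile=use -o {out} {source}"
    return collect, use

for fft_k in (384, 448):
    collect, use = profile_build_commands(fft_k)
    print(f"# {fft_k}K FFT: build with profile collection")
    print(collect)
    print(f"# ... run a 100-iteration timing test at {fft_k}K here ...")
    print("# rebuild using the collected profile")
    print(use)
```

Keeping a separate binary name per length means the profile data from
one run never gets applied to a different FFT length by accident.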

Best regards,
Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
