Mersenne: lucas_mayer V2.6 (Was: Request for Software)

Ernst W. Mayer Mon, 12 Apr 1999 17:43:14 -0700

John Gilmore writes:

>Anyone have an executable that runs under SGI Irix 6.5 I can use for
>double-checking and acquire via FTP or e-mail attachment?

The most recent version of my LL code, V2.6, has (Fortran-90) source code
and binaries for Alpha Unix and SGI Irix (the latter optimized for MIPS R10K).

Readme:       ftp://nigel.mae.cwru.edu/pub/mayer/README
Source:       ftp://nigel.mae.cwru.edu/pub/mayer/
Alpha binary: ftp://nigel.mae.cwru.edu/pub/mayer/bin/ALPHA_OSF/Mlucas_2.6X.exe.gz
         (and ftp://nigel.mae.cwru.edu/pub/mayer/bin/ALPHA_OSF/libshpf.so.gz
           if you lack an F90 compiler - this is the run-time library you need.)
SGI   binary: ftp://nigel.mae.cwru.edu/pub/mayer/bin/SGI/Mlucas_2.6X.exe.gz

There are two major changes from V2.5:

1) Non-power-of-2 runlengths are here. The code supports FFT runlengths
of the form {1,3,5,7}*2^n, i.e. the same lengths as George Woltman's Prime95.

2) More efficient FFT: the code now does a decimation-in-frequency forward
FFT and decimation-in-time inverse FFT, thus avoiding any bit-reversal data
reorderings.
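
To illustrate why pairing the two transforms removes the need for bit
reversal, here is a minimal Python sketch. It is not the Mlucas code: it is a
plain radix-2, power-of-2-only complex FFT, and the function names
(fft_dif, ifft_dit, square_via_fft, to_int) are made up for this example. The
forward DIF pass leaves its output in bit-reversed order, the pointwise
squaring does not care about order, and the inverse DIT pass accepts
bit-reversed input and returns natural order.

```python
import cmath

def fft_dif(a):
    """In-place radix-2 decimation-in-frequency forward FFT.
    Natural-order input, bit-reversed-order output, no reordering pass."""
    n = len(a)
    half = n // 2
    while half >= 1:
        w_step = cmath.exp(-2j * cmath.pi / (2 * half))
        for start in range(0, n, 2 * half):
            w = 1.0 + 0j
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half]
                a[start + j] = u + v
                a[start + j + half] = (u - v) * w
                w *= w_step
        half //= 2

def ifft_dit(a):
    """In-place radix-2 decimation-in-time inverse FFT.
    Bit-reversed-order input, natural-order output, no reordering pass."""
    n = len(a)
    half = 1
    while half < n:
        w_step = cmath.exp(2j * cmath.pi / (2 * half))
        for start in range(0, n, 2 * half):
            w = 1.0 + 0j
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w
                a[start + j] = u + v
                a[start + j + half] = u - v
                w *= w_step
        half *= 2
    for i in range(n):
        a[i] /= n

def square_via_fft(digits):
    """Cyclic self-convolution of a digit vector (length a power of 2)."""
    a = [complex(d) for d in digits]
    fft_dif(a)                  # output is in bit-reversed order
    a = [x * x for x in a]      # pointwise squaring is order-independent
    ifft_dit(a)                 # consumes bit-reversed order directly
    return [round(x.real) for x in a]

def to_int(coeffs, base=10):
    """Evaluate little-endian coefficients; this also absorbs the carries."""
    return sum(c * base**i for i, c in enumerate(coeffs))

# 1234 as little-endian base-10 digits, zero-padded so nothing wraps around.
print(to_int(square_via_fft([4, 3, 2, 1, 0, 0, 0, 0])))  # 1522756
```

Each DIT stage undoes the matching DIF stage up to a factor of 2, which is
why the two can be chained with nothing in between except the squaring.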

The code allows exponents up to 20M, so it can be used both for double-checking
and for current assignments.
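
For anyone new to the list, one "iteration" in the timings further down is one
step of the Lucas-Lehmer recurrence. A minimal Python sketch of the test
follows. It uses Python's built-in big integers rather than FFT-based
squaring, so it only shows the logic and says nothing about how the real code
is organized; the function name lucas_lehmer is made up for this example.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime. Assumes p is an odd prime."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):      # p - 2 squarings modulo 2**p - 1
        s = (s * s - 2) % m
    return s == 0

print([p for p in (3, 5, 7, 11, 13, 17, 19, 23) if lucas_lehmer(p)])
```

The squaring in the loop is where essentially all the time goes for large
exponents, which is why the FFT is the part worth optimizing.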

NOTE: people upgrading from V2.5 will have to finish their current exponent
before switching to V2.6.

Here are some per-iteration timings for two slightly different MIPS R10K setups,
for exponents spanning the current double-checking and new testing ranges:

                FFT length / max. exponent (in millions)

Platform        96K    112K   128K   160K   192K   224K   256K   320K   384K
                1.99M  2.30M  2.62M  3.27M  3.91M  4.56M  5.20M  6.46M  7.71M

195 MHz R10K,   .087s  .104s  .120s  .159s  .200s  .244s  .287s  .399s  .511s
32 KB D-cache
4MB L2 cache  (One processor of a dual-processor Origin, run using runon 0)

250 MHz R10K,   .108s  .129s  .145s  .192s  .233s  .277s  .311s  .398s  .481s
32 KB D-cache
1MB L2 cache  (A single-processor Octane)

Note the salutary effect of having a nice large L2 cache - the 195MHz CPU
timings are better than the 250MHz ones through FFT length 256K, and the two
are essentially tied at 320K (.399s vs. .398s).
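
A quick way to check the table is the short Python sketch below. The numbers
are copied from the table above; the variable and function names are made up
for this example. It splits each FFT length into an odd factor times a power
of 2 and lists the lengths at which the 195 MHz machine is the faster one.

```python
# Per-iteration timings in seconds, copied from the table above.
fft_k      = [96, 112, 128, 160, 192, 224, 256, 320, 384]  # 1K = 1024
origin_195 = [.087, .104, .120, .159, .200, .244, .287, .399, .511]
octane_250 = [.108, .129, .145, .192, .233, .277, .311, .398, .481]

def odd_times_power_of_two(n):
    """Split n into (odd, k) such that n == odd * 2**k."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return n, k

# Odd factor and exponent for each FFT length in the table.
factors = [odd_times_power_of_two(length * 1024) for length in fft_k]

# FFT lengths (in K) at which the 195 MHz / 4MB L2 machine is faster.
faster_on_195 = [length for length, t195, t250
                 in zip(fft_k, origin_195, octane_250) if t195 < t250]

for length, (odd, k), t195, t250 in zip(fft_k, factors, origin_195, octane_250):
    print(f"{length:4d}K = {odd}*2^{k}   195MHz/250MHz time ratio = {t195 / t250:.2f}")
print("195 MHz faster at:", faster_on_195)
```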

NOTE TO SPARC USERS: I finally know why my code sucks on SPARC - a crappy F90
compiler. Jason Papadopoulos was kind enough to look at the executable
produced by the SPARC F90 compiler. Here is his review:

"Your program is slow on the ultra because Sun's
F90 compiler does a miserable job. Even when you tell it to use the 
Sparc V9 instruction set, to use 64-bit loads and stores, and to target
the ultra explicitly it still insists on using 32-bit loads and stores
almost exclusively. It also alternates loads and stores a lot, which
on the Ultra causes nasty bus-switching stalls. Finally, it has no
idea about loading values in advance; all your real*8 values are loaded
(one real*4 at a time) and then arithmetic is immediately performed on
them. At least it mixes integer and floating point nicely."

Sorry, SPARCers, you'll have to wait for the C version.

Happy hunting,
Ernst
________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm