Re: Mersenne: Hyper-threading
There are some performance results for the Intel Fortran OpenMP compiler in the latest Intel Technology Journal:

http://developer.intel.com/technology/itj/
http://developer.intel.com/technology/itj/2002/volume06issue01/art04_fortrancompiler/p08_perf_eval.htm

Regards,
Laurent

_
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: TMS XP-15 DSP card
> This XP-15 ( http://www.superdsp.com/ ) bad boy DSP card looks pretty
> impressive, how useful would it be in searching for Mersenne primes? At
> $14,000 a pop, how would it compare to a farm of P4's? According to the
> presentation PDF, it is 42 times as fast as a 1.4GHz P4 at a 1024K FFT!

The killer is that the vector processing unit only handles 32-bit IEEE FP
numbers...

Laurent
Re: Mersenne: Re: Mlucas 2.7x on SPARC
Laurent Desnogues wrote:
> [wrong list...]
>
> Please also note the SYSVABI_1.3: I guess it means the executable
> can only be run on a SPARC station running Solaris 7!

I hate cut & paste :(

% ldd Mlucas2.7x.exe
        libfui.so.1 =>              (file not found)
        libfai.so.1 =>              (file not found)
        libfai2.so.1 =>             (file not found)
        libfsumai.so.1 =>           (file not found)
        libfprodai.so.1 =>          (file not found)
        libfminlai.so.1 =>          (file not found)
        libfmaxlai.so.1 =>          (file not found)
        libfminvai.so.1 =>          (file not found)
        libfmaxvai.so.1 =>          (file not found)
        libfsu.so.1 =>              (file not found)
        libsunmath.so.1 =>          /opt/SUNWspro/lib/libsunmath.so.1
        libm.so.1 =>                /usr/lib/libm.so.1
        libc.so.1 =>                /usr/lib/libc.so.1
        libc.so.1 (SYSVABI_1.3) =>  (version not found)
        libdl.so.1 =>               /usr/lib/libdl.so.1

Laurent
Re: Mersenne: Re: Mlucas 2.7x on SPARC
[EMAIL PROTECTED] wrote:
>
> The SPARC binary Alex Kruppa sent me of Mlucas 2.7x is at my ftp site:
>
> ftp://209.133.33.182/pub/mayer/README
> ftp://209.133.33.182/pub/mayer/bin/SPARC/Mlucas_2.7x.exe.gz
[...]
> I don't even know if the above runs on a machine that doesn't have an f90
> compiler installed (i.e. whether the code needs any f90-specific RTL files) -
> I don't think it does, but anyone with an f90-less SPARC can easily find out.

No, it indeed does not run without them:

% ldd Mlucas2.7x.exe
        libfui.so.1 =>       (file not found)
        libfai.so.1 =>       (file not found)
        libfai2.so.1 =>      (file not found)
        libfsumai.so.1 =>    (file not found)
        libfprodai.so.1 =>   (file not found)
        libfminlai.so.1 =>   (file not found)
        libfmaxlai.so.1 =>   (file not found)
        libfminvai.so.1 =>   (file not found)
        libfmaxvai.so.1 =>   (file not found)
        libfsu.so.1 =>       (file not found)
        libsunmath.so.1 =>   /opt/SUNWspro/lib/libsunmath.so.1
        libm.so.1 =>         /usr/lib/libm.so.1
        libc.so.1 =>         /usr/lib/libc.so.1
        libc.so.1 =>         (version not found)
        libdl.so.1 =>        /usr/lib/libdl.so.1

Please also note the SYSVABI_1.3: I guess it means the executable can only
be run on a SPARC station running Solaris 7!

Laurent
Re: Mersenne: Mlucas 2.7x on SPARC
[EMAIL PROTECTED] wrote:
>
> << I tried all sorts of compiler flags - unfortunately, the optimization
> flags are not linear, especially -O5 tends to produce much slower code
> than -O4 when combined with other flags. >>
>
> I see similar weird slowdowns using the -O5 compile option on some (not
> all) Alpha CPUs (generally the older ones.) I wonder if both compilers
> are doing similar "optimizations" at -O5.

The Sun C compiler's -O5 flag should only be used with a profile directing
subsequent compilations: compile with -xprofile=collect, run the program,
then recompile with -xprofile=use. Something similar might apply to the
Alpha compiler.

However, some optimizations performed at higher optimization levels can
produce slower code regardless; an example is excessive loop unrolling,
which yields code that no longer fits in the L1 I-cache.

> << I'm using -fast -libmil -xlibmopt -fnsyes now, which seems to give
> close to optimal performance. >>
[...]
> << I don't know whether this is also optimal on other types of UltraSparc, I
> only have Ultra60s for testing. >>

This won't be optimal if you run under Solaris < 7! Under such OSes the
-fast flag must be followed by -xarch=v8plus in order to use all 32 double
FP registers of an UltraSPARC chip. This is not needed on a Solaris 7
system, where -fast implies -xarch=v9.

Flags also worth testing: -xdepend, -xinline=all, and -xsafe=mem (to be
used with -xO5).

Good luck ;)

Laurent
Re: Mersenne: SPARC times
Bill Rea wrote:
>
> This is using MacLucasUNIX compiled with the Sun Workshop compilers.

Which version and which flags did you use? I guess you ran your tests
under Solaris 7, right?

> (1) If I used the -xarch=v9
> option on those systems that support this option, the resulting
> binary runs slower on the tests than using -xarch=v8plusa.

That's very strange! I have benchmarked some code using both flags (with
-fast prepended) under Solaris 7 (which is required to run code compiled
with -xarch=v9), and v9 helped; however, that code was purely 64-bit
integer.

There's also a very interesting flag to test that's not documented in the
Sun cc documentation: -xinline=all. I used it by mistake, but it did a
great job on the code I was working on.

> (2) Differences in speeds on the tests supplied with the software
> don't translate into differences in speed when working on Mersenne
> numbers of the size above. Even speed differences around 10%
> on the tests showed no discernible differences in practice.

I don't know the tests supplied, but the difference might come from the
way time is measured. I think the best way to check the speed of a code
is to use getrusage for the process only (RUSAGE_SELF) and to take into
account only the user time. This way I get very consistent timings for
the aforementioned code (BTW, the code is ecdl by Robert Harley, used to
crack ECC).

Laurent
Mersenne: Regarding: multiply/add
Jason Stratos Papadopoulos wrote:
>
> There was a paper in (I believe) Appl. Math and Comp. from about two
> years ago that reformulated radix-2, 3, 4 and 5 FFT butterflies to use
> multiply-adds wherever possible. The savings can be impressive: a radix-2
> FFT butterfly has 4 multiplies and 6 adds, but this boils down to
> 6 multiply-adds. Can't the RS6000 do two multiply-adds per clock?

I have found a paper titled "Implementation of Efficient FFT Algorithms
on Fused Multiply-Add Architectures" by E. Linzer and E. Feig, IEEE
Trans. on Signal Processing, vol. 41, no. 1, Jan 93. They call their
method the scaled FFT.

For a radix-8 FFT of size 8, the Cooley-Tukey FFT takes 56 m/a ops and
their method takes 52 m/a ops. The number of m/a ops for radix 8 is:

  - Cooley-Tukey: 7/2 n log_2(n) - 57/14 n + 32/7
  - scaled FFT:   11/4 n log_2(n) - 57/28 n + 16/7

Regarding the errors (root-mean-square error per point):

                    DFT            inverse DFT
  - Cooley-Tukey:   5.3500 10^-15  3.3999 10^-15
  - scaled FFT:     5.3441 10^-15  3.2386 10^-15

They say the improvement in error is probably due to the fact that a m/a
op is more precise than a multiply followed by an add. Their experiments
were done on an RS/6000 model 530.

I can't comment on that article, since I'm not mathematically gifted
enough to just read it without a lot of time with a pencil!

Laurent
Re: Mersenne: AMD K7 will
Yves & Lucile Gallot wrote:
> All modern processors use a dynamic execution architecture that blends
> out-of-order and speculative execution with hardware register renaming and
> branch prediction. These processors feature an in-order issue pipeline,
> which breaks processor macroinstructions into simple micro-operations, and
> an out-of-order, superscalar processor core, which executes the micro-ops.
> The out-of-order core of the processor contains several pipelines to which
> integer, branch, floating-point and memory execution units are attached.
> Many instructions decode into few micro-operations, so the processor is
> able to execute more than 1 instruction per cycle. Some instructions are
> complex and need several cycles to execute (div, sqrt, cos, ...) but they
> are not often used.
> Is that processor a RISC or a CISC? Neither a RISC nor a CISC! And it is
> really faster than a RISC or a CISC.

Here are a few rules that I use to clearly sort RISCs from CISCs:

  - orthogonality of the ISA
  - no simultaneous memory access and computation in one instruction
  - regular instruction encoding
  - single use of the MMU per instruction.

I don't think that, for instance, the way instructions are executed or
their complexity matters a lot, though these rules do convey some
architectural information. With these rules (we can add a few others),
one can say in which class a processor falls...

But this thread does not belong on the mailing list!

Laurent