Bill Rea wrote:

> >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter
> >for MLU against 0.78 secs/iter for Mlucas at 512K FFT.

I wrote:

> These timings suggest that it's more than a cache size issue - after
> all, Mlucas has a smaller memory footprint irrespective of the CPU,
> and one would expect some benefit from this even in cases where
> both codes significantly exceed the L2 cache size. I wonder if
> the fact that the code itself (due to all the routines needed for
> non-power-of-2 FFT lengths, which MLU doesn't support) might be
> causing a slowdown here, by competing for space in the L2 cache
> with FFT data?

Bill replied:

>The executable sizes are quite different:-
>
>113216 bytes for MLU
>340048 bytes for Mlucas

That is *exactly* the kind of size difference one might expect to
cause an appreciable performance difference on a 256KB cache machine.
I'll bet the runtime profiling is designed to strip out unused code
sections from the executable image, and thus reduce its size. If you
still can't get -xprofile to compile decently fast on Mlucas (a problem
you noted earlier), there's a manual way to test this hypothesis, namely:

1) Pick an FFT length for testing. Look at the combination of FFT
   radices used for that N by Mlucas in mers_mod_square. (Example:
   for N = 224K = 224*1024 = 229376, mers_mod_square lists a set of
   complex radices (7,8,8,16,16), whose product is 229376/2 = 114688.)

2) Comment out all subroutine calls in mers_mod_square which are not
   to the routines for the radices in (1). E.g. for the 224K example,
   in the select case(radix(1)) blocks, comment out all calls except
   the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and
   radix7_dit_pass1.

3) Recompile and compare the size of the resulting executable to that
   of the full executable. If it's not substantially smaller, you may
   have to physically remove the commented-out subroutines from the
   program file, then recompile.

4) Once your .exe is reasonably small, run some timings.
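As a quick sanity check on step (1), here is a minimal Python sketch that confirms a radix set multiplies out to N/2. It is not part of Mlucas; the function name radices_match is made up for illustration, and the only inputs are the 224K figures from the example above.

```python
from math import prod  # available in Python 3.8 and later


def radices_match(n_kilo, radices):
    """Return True if the product of the complex radices equals N/2.

    n_kilo  : FFT length in units of K (1K = 1024), e.g. 224 for 224K
    radices : the complex radices listed for that length
    (Hypothetical helper for illustration; not an Mlucas routine.)
    """
    n = n_kilo * 1024
    return prod(radices) == n // 2


# The 224K example from step (1): 7*8*8*16*16 = 114688 = 229376/2
print(radices_match(224, (7, 8, 8, 16, 16)))  # prints True
```

Running the check before commenting out any calls helps confirm that the radix set you plan to keep really does correspond to the FFT length you intend to time.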
Note that the code compiled for 224K above would also work for any N
which uses a combination of radices of the form
(7,{any combination of 4,8 or 16},16) (the final radix must always be
a 16), i.e. also for 112K and 448K (assuming you're using radices
(7,8,16,16,16), not (14,4,16,16,16), for the latter).

If this kind of thing does prove helpful (especially on small-cache
and/or bandwidth-limited systems), then once we have a Unix PrimeNet
interface to automate execution, it will be relatively easy to replace
the current single Mlucas executable with a set of smaller ones, each
handling a set of FFT lengths that share the same initial radix, e.g.
radix(1) = 3,5,6,7,8,10,14,16, the ones currently supported by Mlucas.

The other thing these considerations imply is that compiling in 64-bit
mode may actually be counterproductive in many instances (due to the
large .exe that results), unless the code makes heavy use of some
64-bit opcodes which are not supported in 32-bit mode.

Let me know what you find,
-Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers