Bill Rea wrote:

> >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter
> >for MLU against 0.78 secs/iter for Mlucas at 512K FFT.

I wrote:

> These timings suggest that it's more than a cache size issue - after
> all, Mlucas has a smaller memory footprint irrespective of the CPU,
> and one would expect some benefit from this even in cases where
> both codes significantly exceed the L2 cache size. I wonder if
> the size of the code itself (due to all the routines needed for
> non-power-of-2 FFT lengths, which MLU doesn't support) might be
> causing a slowdown here, by competing for space in the L2 cache
> with FFT data?

Bill replied:

>The executable sizes are quite different:-
>
>113216 bytes for MLU
>340048 bytes for Mlucas

That is *exactly* the kind of size difference one might expect to cause
an appreciable performance difference on a 256KB cache machine. I'll
bet the runtime profiling is designed to strip out unused code sections
from the executable image, and thus reduce its size. If you still can't
get -xprofile to compile decently fast on Mlucas (a problem you noted
earlier), there's a manual way to test this hypothesis, namely:

1) Pick an FFT length for testing. Look at the combination of FFT
radices used for that N by Mlucas in mers_mod_square. (Example: for
N = 224K = 224*1024 = 229376, mers_mod_square lists a set of complex
radices (7,8,8,16,16), whose product is 229376/2 = 114688.)

2) Comment out all subroutine calls in mers_mod_square which are not
to the routines for the radices in (1). E.g. for the 224K example,
in the select case(radix(1)) blocks, comment out all calls except
the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and radix7_dit_pass1.

3) Recompile and compare the size of the executable to that of the
full executable. If it's not substantially smaller, you may have to
physically remove the commented-out subroutines from the program file,
then recompile.

4) Once your .exe is reasonably small, run some timings. Note that the
code compiled for 224K above would also work for any N which uses a
combination of radices of the form (7,{any combination of 4,8 or 16},16)
(the final radix must always be a 16), i.e. also for 112K and 448K
(assuming you're using radices (7,8,16,16,16), not (14,4,16,16,16),
for the latter).
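
As a quick sanity check on the radix arithmetic in steps 1 and 4, here is
a minimal Python sketch. It is not part of Mlucas; the function names are
made up for illustration, and the only data used are the radix sets quoted
above. It checks that each set multiplies out to N/2 and whether it has
the (7,{4,8,16},16) form that a radix-7-only build could handle.

```python
from math import prod

# Complex radix sets quoted in this message, keyed by FFT length N in K (1K = 1024).
RADIX_SETS = {
    224: [(7, 8, 8, 16, 16)],
    448: [(7, 8, 16, 16, 16), (14, 4, 16, 16, 16)],
}

def radices_match_length(n_kilo, radices):
    """True if the product of the complex radices equals N/2."""
    return prod(radices) == n_kilo * 1024 // 2

def fits_reduced_build(radices):
    """True if the set has the form (7, {any mix of 4, 8, 16}, 16),
    i.e. leading radix 7 and final radix 16 (hypothetical helper)."""
    return (radices[0] == 7 and radices[-1] == 16
            and all(r in (4, 8, 16) for r in radices[1:-1]))

for n_kilo, sets in RADIX_SETS.items():
    for radices in sets:
        print(f"{n_kilo}K {radices}: "
              f"product ok = {radices_match_length(n_kilo, radices)}, "
              f"radix-7 build ok = {fits_reduced_build(radices)}")
```

Both 448K sets multiply out to N/2, but only the one starting with 7
passes the second check, which is why the choice of radices for 448K
matters in step 4.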

If this kind of thing does prove helpful (especially on small-cache
and/or low-bandwidth systems), once we have a Unix PrimeNet interface
to automate execution, it will be relatively easy to replace the
current single Mlucas executable with a set of smaller ones,
each handling a set of FFT lengths that share the same initial
radix, e.g. radix(1) = 3,5,6,7,8,10,14,16, the ones currently
supported by Mlucas.

The other thing these considerations imply is that compiling in
64-bit mode (due to the larger .exe that results) may actually be
counterproductive in many instances, unless the code makes heavy
use of some 64-bit opcodes which are not supported in 32-bit mode.
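
To put the size argument in numbers, the short Python sketch below
compares the two executable sizes Bill reported against a 256 KB L2
cache. The helper name is made up, and the on-disk file size is only a
rough proxy for how much code actually competes for cache at run time,
so treat this as a back-of-the-envelope estimate rather than a
measurement.

```python
L2_CACHE_BYTES = 256 * 1024  # Ultra-5 L2 cache, taken as 256 KB = 262144 bytes

# Executable sizes in bytes, as reported by Bill.
EXECUTABLE_BYTES = {"MLU": 113216, "Mlucas": 340048}

def cache_headroom(exe_bytes, cache_bytes=L2_CACHE_BYTES):
    """Bytes of cache left over if the whole file were resident
    (negative if the file alone is larger than the cache)."""
    return cache_bytes - exe_bytes

for name, size in EXECUTABLE_BYTES.items():
    print(f"{name}: {size} bytes = {size / L2_CACHE_BYTES:.2f} x cache, "
          f"headroom {cache_headroom(size)} bytes")
```

By this crude measure the MLU image would leave over half the cache
free for FFT data, while the Mlucas image is larger than the cache on
its own.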

Let me know what you find,
-Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
