RE: Mersenne: Re: FFTW for GIMPS?

1999-09-28 Thread Olivier Langlois

Hi Paul,

--- Paul Leyland <[EMAIL PROTECTED]> wrote:
> > From: Olivier Langlois
> [mailto:[EMAIL PROTECTED]]
...
> 
> Actually, we at Microsoft Research in Cambridge have
> seen similar effects
> when compiling and running FFTW code.  Our discovery
> is that the alignment
> of FP data values is critical.  Get it wrong, and
> performance can plummet.
> Unless you set the alignment explicitly, it will be
> wrong approximately half
> the time.
> 
> Jonathan Hardwick investigated this effect as part
> of his research into
> high-performance computing.  He gave an internal
> seminar (which is where I
> learned about it) and wrote it up in detail.  The
> full details are at
>
http://www.research.microsoft.com/users/jch/fftw-performance.html

I agree with you that data alignment is an important
issue for performance. Non-aligned data need two
memory reads to be fetched. I was aware that MSVC++ 6
is superior to some other compilers in this respect
because it ANDs the stack pointer with 0xFFF8 to
make sure that local variables are aligned on an
8-byte boundary.

However, MSVC's FPU register allocation and
assignment algorithm could be greatly improved (that
was my point in my previous e-mail, and the assembler
listings MSVC generates from FFTW show what I mean).
MSVC almost never uses all 8 FPU registers.
It seems that your optimizer is good on a
per-statement basis, but as soon as you have a
sequence of short FP instructions it does a very bad
job.

With these unused registers:

- it could start a new FP operation instead of
storing the result of an unfinished FP operation back
to memory, which wastes a few CPU cycles.

- it could hold temporary variables in them and
reduce memory accesses.

With a better FPU register allocation and assignment
algorithm than the current one, I wouldn't be
surprised to see FFTW run 20-30% faster than it does
now with MSVC.

While we're at it, there is a simple optimization
that MSVC doesn't perform.

When there is an expression like:

x % n

where n is a numeric constant of the form 2^y
with y <= 32, the compiler could replace it by

x & (n-1) (an optimization using a Mersenne number :-)

instead of doing a costly division, as it does right
now.

Greetings
Olivier Langlois
http://www3.sympatico.ca/olanglois -
[EMAIL PROTECTED]
Nortel Networks 514-818-1010 x46659
Montreal, Canada



_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Re: Mersenne: Re: FFTW for GIMPS?

1999-09-28 Thread Pierre Abbat

>When did using 10-byte reals become common? As far as I remember, you
>don't even have a store instruction for that. The 80-bit load is slower
>than the 64-bit load too, I think.

Win32Forth can be compiled either way. Here's some of the code from FLOAT.F:


B/FLOAT 10 =
[IF]synonym FSIZE extended
[ELSE]  synonym FSIZE double
[THEN]


code fpush  ( f: r -- )  ( fs: -- r )   \ push on simulated stack
mov ecx, FSP [edi]
fstpFSIZE FSTACK [ecx] [edi]
fwait
add ecx, # B/FLOAT
mov FSP [edi], ecx
next,
end-code


code fpop   ( f: -- r )   ( fs: r -- )  \ push on real stack
mov ecx, FSP [edi]
sub ecx, # B/FLOAT
js  L$1
fld FSIZE FSTACK [ecx] [edi]
mov FSP [edi], ecx
\   fwait
jmp L$2




Mersenne: Re: FFTW for GIMPS?

1999-09-28 Thread Steinar H. Gunderson

On Tue, Sep 28, 1999 at 07:43:17PM +0100, Brian J. Beesley wrote:
>Alignment on 4-byte boundaries is quite sufficient for C floats. Ten-
>byte reals (direct copies from FPU registers) are a problem

When did using 10-byte reals become common? As far as I remember, you
don't even have a store instruction for that. The 80-bit load is slower
than the 64-bit load too, I think.

/* Steinar */
-- 
Homepage: http://members.xoom.com/sneeze/



RE: Mersenne: Re: FFTW for GIMPS?

1999-09-28 Thread Brian J. Beesley

On 27 Sep 99, at 4:22, Paul Leyland wrote:

> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code.  Our discovery is that the alignment
> of FP data values is critical.  Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.

So, is a future release of MSVC++ going to include an option to 
optimize alignment of FP data values, at the expense of minimizing 
storage by packing data values as tightly as possible?

I think this mainly applies to quadword operands (doubles in C), which 
should be aligned on an 8-byte boundary so that one memory bus cycle 
is sufficient. This strategy also avoids operands spanning cache line 
boundaries, which would likely have a serious effect on performance 
by effectively halving the associativity of the L1 data cache.

Alignment on 4-byte boundaries is quite sufficient for C floats.
Ten-byte reals (direct copies from FPU registers) are a problem: you
are always going to need two memory bus cycles, since you can't fit
an 80-bit operand on a 64-bit bus. However, whether you pack a
ten-byte real array contiguously (with no wasted space) or align
elements on 16-byte boundaries (with lots of wasted space, but no
cache line conflicts) could have a significant effect on performance
- one that might work in different directions in different
applications.




Regards
Brian Beesley



Re: Mersenne: Re: FFTW for GIMPS?

1999-09-27 Thread Guillermo Ballester Valor

Hi:

Paul Leyland wrote:
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code.  Our discovery is that the alignment
> of FP data values is critical.  Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.

You're right. I gained 35% in performance just by doing a simple trick
to make sure there was an 8-byte alignment. On the other hand, I built
the FFTW library using the long double type (an 'awful' 10 bytes long)
and its performance was only about 65% of the double type's
performance.


| Guillermo Ballester Valor   |  
| [EMAIL PROTECTED]  |  
| c/ cordoba, 19  |
| 18151-Ogijares (Spain)  |
| (Linux registered user 1171811) |



Mersenne: Re: FFTW for GIMPS?

1999-09-27 Thread Steinar H. Gunderson

On Mon, Sep 27, 1999 at 04:22:53AM -0700, Paul Leyland wrote:
>Actually, we at Microsoft Research in Cambridge have seen similar effects
>when compiling and running FFTW code.  Our discovery is that the alignment
>of FP data values is critical.

That is generally true for _all_ FP code. Unfortunately, the x86 ABI
`gets it wrong', and uses too little alignment.

For gcc/egcs, try -malign-double if you don't need ABI compatibility
(i.e. you compile all libraries containing structures with doubles
with -malign-double). pgcc (a gcc/egcs derivative) has a way of
aligning the stack so that doubles will be aligned by 8 most of the
time, without breaking the ABI. Check http://www.goof.com/pcg/

Note that Prime95 already has correct alignment; would you expect anything
else from George? ;-)

/* Steinar */
-- 
Homepage: http://members.xoom.com/sneeze/



RE: Mersenne: Re: FFTW for GIMPS?

1999-09-27 Thread Paul Leyland

> From: Olivier Langlois [mailto:[EMAIL PROTECTED]]
> I've played a little bit with FFTW and I've noticed that its
> performance can vary a lot depending on how good your compiler is at
> optimization.
> 
> For instance, compiled FFTW code is far from optimal with MSVC++ 6.
> This compiler doesn't use the FPU stack as fully as an ASM programmer
> could, and I don't know why, since I'm sure that writing a compiler
> that makes good use of the FPU registers is far from impossible.
> 
> So, the compiler you use to compile FFTW is a major factor in the
> performance you'll get from it.
> 
> I don't know if anyone has had similar experiences, or whether there
> is a better compiler than MSVC for intensive FP code.

Actually, we at Microsoft Research in Cambridge have seen similar effects
when compiling and running FFTW code.  Our discovery is that the alignment
of FP data values is critical.  Get it wrong, and performance can plummet.
Unless you set the alignment explicitly, it will be wrong approximately half
the time.

Jonathan Hardwick investigated this effect as part of his research into
high-performance computing.  He gave an internal seminar (which is where I
learned about it) and wrote it up in detail.  The full details are at
http://www.research.microsoft.com/users/jch/fftw-performance.html


Paul




Re: Mersenne: Re: FFTW for GIMPS?

1999-09-25 Thread Jason Stratos Papadopoulos

On Sat, 25 Sep 1999, Olivier Langlois wrote:

> I've played a little bit with FFTW and I've noticed that its
> performance can vary a lot depending on how good your compiler is at
> optimization.

Absolutely. Compile FFTW with gcc on a Sun and you'll get half the
speed of FFTW using Sun cc.

> For instance, compiled FFTW code is far from optimal with MSVC++ 6.
> This compiler doesn't use the FPU stack as fully as an ASM programmer
> could, and I don't know why, since I'm sure that writing a compiler
> that makes good use of the FPU registers is far from impossible.
> 
> So, the compiler you use to compile FFTW is a major factor in the
> performance you'll get from it.
> 
> I don't know if anyone has had similar experiences, or whether there
> is a better compiler than MSVC for intensive FP code.

In my (admittedly limited) experience, MSVC produces pretty suboptimal
FPU code. gcc for DOS (the djgpp compiler) runs some of my FPU-intensive
programs twice as fast as VC++ (for the *same* program). djgpp is nice
because if you write your floating point C a certain way, the compiler
will produce almost identical assembly code. This can be put to good
use if you can write asm yourself.

Another gripe I have about VC++ is that it's pretty good at using
integer registers so that stalls are avoided (e.g. on the Pentium),
but if you try to do anything even remotely complicated it hops to
a function call and destroys performance. For example, when working
with 64-bit integers gcc will recognize a shift by 32 and inline
simple code for it, whereas VC++ will jump to a function that uses
variable shifts (the call and shifts together take about 20x the time).
In fact, gcc's long long primitives are *way* faster than the __int64
support in Microsoft's compiler.

jasonp




Re: Mersenne: Re: FFTW for GIMPS?

1999-09-25 Thread Olivier Langlois

I've played a little bit with FFTW and I've noticed that its
performance can vary a lot depending on how good your compiler is at
optimization.

For instance, compiled FFTW code is far from optimal with MSVC++ 6.
This compiler doesn't use the FPU stack as fully as an ASM programmer
could, and I don't know why, since I'm sure that writing a compiler
that makes good use of the FPU registers is far from impossible.

So, the compiler you use to compile FFTW is a major factor in the
performance you'll get from it.

I don't know if anyone has had similar experiences, or whether there
is a better compiler than MSVC for intensive FP code.

Olivier Langlois - [EMAIL PROTECTED] -
http://www3.sympatico.ca/olanglois
Nortel 515-818-1010 x46659
Montreal, Canada





Re: Mersenne: Re: FFTW for GIMPS?

1999-09-25 Thread Guillermo Ballester Valor

Jason Stratos Papadopoulos wrote:
> For really big FFTs you can get major gains by using FFTW as a building
> block in a bigger program, rather than have it do your entire FFT with
> a single function call. As Ernst mentioned, the building block approach
> lets you fold some of the forward and inverse FFT computations together,
> and this saves loads of time in cache misses avoided. On the UltraSPARC,
> using FFTW for the pieces of your computation rather than the whole thing
> is somewhere between 2 and 10 times faster than FFTW alone.
> 
That could be terrific! I'll look into it.

> On the Pentium, assembly-coded small FFTs run more than twice as fast
> as FFTW. Even from C, you can do better on the Pentium (do a web search
> for djbfft, a free Pentium-optimized FFT package). For a recursive
> split-radix, you need about 200 lines of assembly; surely this is worth
> twice the speed!
> 
I would like to write some C code for general purposes. For tuned
assembler we have Woltman's fantastic prime95/mprime code.

Thank you very much for your comments. They will help me a lot.

Have a nice weekend.


| Guillermo Ballester Valor   |  
| [EMAIL PROTECTED]  |  
| c/ cordoba, 19  |
| 18151-Ogijares (Spain)  |
| (Linux registered user 1171811) |



Re: Mersenne: Re: FFTW for GIMPS?

1999-09-25 Thread Jason Stratos Papadopoulos

On Sat, 25 Sep 1999, Guillermo Ballester Valor wrote:

> Yes, certainly I've been able to adapt lucdwt and McLucasUNIX in four
> days. On the other hand, my goal was only to find out whether working
> with FFTW is a good idea, and the timings obtained make me think it
> could be.


For really big FFTs you can get major gains by using FFTW as a building
block in a bigger program, rather than have it do your entire FFT with
a single function call. As Ernst mentioned, the building block approach 
lets you fold some of the forward and inverse FFT computations together,
and this saves loads of time in cache misses avoided. On the UltraSPARC,
using FFTW for the pieces of your computation rather than the whole thing
is somewhere between 2 and 10 times faster than FFTW alone.

> If your comparison were ported to Intel machines (which isn't really
> valid), your code would run nearly as fast as mprime!! You say your
> code is twice as fast as FFTW, and surely it is, *BUT* on my
> Pentium-166 the short code I wrote does one iteration of my current
> exponent, 3975659, in 0.901 seconds while mprime takes only 0.359.
> That makes an RPI of 40%. Then your code would reach nearly 90%, and
> without lots of assembler code!

On the Pentium, assembly-coded small FFTs run more than twice as fast
as FFTW. Even from C, you can do better on the Pentium (do a web search
for djbfft, a free Pentium-optimized FFT package). For a recursive
split-radix, you need about 200 lines of assembly; surely this is worth
twice the speed!

jasonp





Re: Mersenne: Re: FFTW for GIMPS?

1999-09-25 Thread Guillermo Ballester Valor

Hi:

[EMAIL PROTECTED] wrote:
> 
> Guillermo Ballester Valor writes:
> 
> << Do you know the GNU FFTW package (The Fastest Fourier Transform in
> the West)? Last week I thought it would be interesting to see if it is
> as fast as they say. It is really a fast C-written FFT code. >>
> 

> 
> My comment is this: If you want to create a fast FFT-based program in a
> short time, FFTW is certainly a good way to do it. On the other hand, if
> you're really striving for extreme performance (i.e. trying to write a
> code that possibly many people will use intensively), it is best to have
> a code whose details you understand well. In my case, the latter criterion
> (and the fact that I wanted to learn as much as I could about FFTs) led me
> to write my own code. I started with the Numerical Recipes FFT (slow and
> not very accurate, but easy to play with) about 3 years ago and have been
> working on it ever since - my current code looks nothing like the NR FFT,
> but you have to start somewhere.
> 
Yes, certainly I've been able to adapt lucdwt and McLucasUNIX in four
days. On the other hand, my goal was only to find out whether working
with FFTW is a good idea, and the timings obtained make me think it
could be.

> Looking at the FFTW timings page, for a length-262144 real-vector transform
> they list (http://www.fftw.org/benchfft/results/alpha467.html)
> a performance of around 105 "MFlops" on a 467 MHz Alpha 21164. Using
> their definition of MFlops for real FFTs, this translates to a per-FFT time of
> 
> 0.5*[5*262144*log2(262144) Flop]/[105 MFlop/sec] = 0.112 sec.
> 
> My LL code does 2 FFT's plus other operations per Mersenne-mod squaring,
> so we estimate that about 80% of the per-iteration time goes to the two
> FFTs, i.e. 40% per FFT. At 256K vector length it needs .177 sec per
> iteration on a 400 MHz 21164, which leads to an estimate of
> .40*.177*400/467 = 0.061 sec on a 467 MHz 21164, which is
> significantly faster than FFTW.
> 
If your comparison were ported to Intel machines (which isn't really
valid), your code would run nearly as fast as mprime!! You say your
code is twice as fast as FFTW, and surely it is, *BUT* on my
Pentium-166 the short code I wrote does one iteration of my current
exponent, 3975659, in 0.901 seconds while mprime takes only 0.359.
That makes an RPI of 40%. Then your code would reach nearly 90%, and
without lots of assembler code!

Is there any Linux or Windows Mlucas 2.7 executable for Intel
machines? It would be nice to look at timings.


P.S. The source I sent to E. Mayer was buggy; it was an early version
I sent him :-(. If somebody wants the source, mail me in private.


| Guillermo Ballester Valor   |  
| [EMAIL PROTECTED]  |  
| c/ cordoba, 19  |
| 18151-Ogijares (Spain)  |
| (Linux registered user 1171811) |



Mersenne: Re: FFTW for GIMPS?

1999-09-23 Thread EWMAYER

Guillermo Ballester Valor writes:

<< Do you know the GNU FFTW package (The Fastest Fourier Transform in
the West)? Last week I thought it would be interesting to see if it is
as fast as they say. It is really a fast C-written FFT code. >>

Hi, Guillermo (please, let's use first names - I'm an informal guy) -

While I haven't used it myself, I've certainly heard of FFTW. Note that
a better person to comment on your questions might be Jason Papadopoulos
([EMAIL PROTECTED]) - he adapted FFTW for his very fast SPARC Fermat
number code.

My comment is this: If you want to create a fast FFT-based program in a
short time, FFTW is certainly a good way to do it. On the other hand, if
you're really striving for extreme performance (i.e. trying to write a
code that possibly many people will use intensively), it is best to have
a code whose details you understand well. In my case, the latter criterion
(and the fact that I wanted to learn as much as I could about FFTs) led me
to write my own code. I started with the Numerical Recipes FFT (slow and
not very accurate, but easy to play with) about 3 years ago and have been
working on it ever since - my current code looks nothing like the NR FFT,
but you have to start somewhere.

Looking at the FFTW timings page, for a length-262144 real-vector transform
they list (http://www.fftw.org/benchfft/results/alpha467.html)
a performance of around 105 "MFlops" on a 467 MHz Alpha 21164. Using
their definition of MFlops for real FFTs, this translates to a per-FFT time of

0.5*[5*262144*log2(262144) Flop]/[105 MFlop/sec] = 0.112 sec.

My LL code does 2 FFT's plus other operations per Mersenne-mod squaring,
so we estimate that about 80% of the per-iteration time goes to the two
FFTs, i.e. 40% per FFT. At 256K vector length it needs .177 sec per
iteration on a 400 MHz 21164, which leads to an estimate of
.40*.177*400/467 = 0.061 sec on a 467 MHz 21164, which is significantly
faster than FFTW.

At real-vector length 362880 (the closest I could find to 384K), FFTW needs

0.5*[5*362880*log2(362880) Flop]/[125 MFlop/sec] = .134 sec per FFT,

whereas Mlucas 2.7 (multiplying the 384K = 393216 timing by 362880/393216
to get a comparison with the FFTW timing for the latter length) needs

.40*.316*(400/467)*(362880/393216) = .100 sec per FFT on a 467 MHz 21164.


On a 195 MHz MIPS R1 CPU (SGI Origin 2000 workstation) FFTW needs

0.5*[5*262144*log2(262144) Flop]/[62 MFlop/sec] = .190 sec at n=262144
0.5*[5*362880*log2(362880) Flop]/[67 MFlop/sec] = .250 sec at n=362880

whereas Mlucas 2.7 needs (again extrapolating the 384K timing backward to
362880) .088 and .127 seconds, respectively, per FFT on the same hardware.

The fact that it took me so much work to achieve these speeds is a
testament to the speed of FFTW - the sample timings I sent to Steven Johnson
(one of the authors of FFTW) last year were still generally slower than
FFTW. However, "rolling one's own" (as it were) allows one to do lots of
algorithm-specific optimizations not available to the FFTW folks. For
example, since FFT-based large-integer arithmetic can do a pointwise
(dyadic) squaring on the outputs of the forward FFT without them being
ordered in any special way, one can do a forward decimation-in-frequency
FFT, followed by a pointwise squaring of the (bit-reversal-reordered)
data, followed by a decimation-in-time inverse FFT, thus avoiding
explicit bit-reversal reorderings of the data (or matrix transposes, in
the context of the 4-step FFT used by FFTW). One can also combine the
last pass of the forward FFT, the dyadic squaring and the first pass
of the inverse FFT into a single pass through the data. Similarly, one
can fuse the final pass of the inverse FFT, the
rounding-and-carry-propagation step, and the first pass of the forward
FFT, which both minimizes data movement (memory accesses) and allows
one to propagate carries for several sub-blocks of the full array in
parallel.

I'm sending a copy of this to both the Mersenne list and to Steven
Johnson, the latter so he can correct any gross errors I might have
made in my timing calculations for FFTW.

Cheers,
Ernst
