RE: Mersenne: Re: FFTW for GIMPS?
Hi Paul,

--- Paul Leyland <[EMAIL PROTECTED]> wrote:
> > From: Olivier Langlois [mailto:[EMAIL PROTECTED]]
> ...
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code. Our discovery is that the alignment
> of FP data values is critical. Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.
>
> Jonathan Hardwick investigated this effect as part of his research into
> high-performance computing. He gave an internal seminar (which is where I
> learned about it) and wrote it up in detail. The full details are at
> http://www.research.microsoft.com/users/jch/fftw-performance.html

I agree with you that data alignment is an important issue for performance. Non-aligned data needs two memory reads to be fetched. I was aware that MSVC++ 6 is superior to some other compilers in this respect because it ANDs the stack pointer with 0xFFF8 to make sure that local variables are aligned on an 8-byte boundary.

However, MSVC's FPU register allocation and assignment algorithm could be greatly improved (that was my point in my previous e-mail, and the assembler listings MSVC generates from FFTW are great for seeing what I mean). MSVC almost never uses all 8 FPU registers. It seems your optimizer is good on a per-statement basis, but as soon as there is a sequence of short FP instructions it does a very poor job. With these unused registers:

- it could start a new FP operation instead of storing the result of an unfinished FP operation back to memory, which wastes a few CPU cycles.
- it could hold temporary variables and so reduce memory accesses.

With a better FPU register allocation and assignment algorithm than the current one, I wouldn't be surprised to see FFTW run 20-30% faster than it does now with MSVC. While we're at it, there is a simple optimization that MSVC doesn't perform.
When there is an expression like x % n, where n is a numeric constant of the form 2^y with y <= 32, the compiler could replace it with x & (n-1) (an optimization using a Mersenne number :-) instead of doing a costly division as it does right now.

Greetings,
Olivier Langlois
http://www3.sympatico.ca/olanglois - [EMAIL PROTECTED]
Nortel Networks 514-818-1010 x46659
Montreal, Canada

_ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: Re: FFTW for GIMPS?
> When did using 10-byte reals become common? As far as I remember, you
> don't even have a store instruction for that. The 80-bit load is slower
> than the 64-bit load too, I think.

Win32Forth can be compiled either way. Here's some of the code from FLOAT.F:

    B/FLOAT 10 = [IF]  synonym FSIZE extended  [ELSE]  synonym FSIZE double  [THEN]

    code fpush ( f: r -- ) ( fs: -- r )  \ push on simulated stack
        mov ecx, FSP [edi]
        fstp FSIZE FSTACK [ecx] [edi]
        fwait
        add ecx, # B/FLOAT
        mov FSP [edi], ecx
        next,
    end-code

    code fpop ( f: -- r ) ( fs: r -- )  \ push on real stack
        mov ecx, FSP [edi]
        sub ecx, # B/FLOAT
        js L$1
        fld FSIZE FSTACK [ecx] [edi]
        mov FSP [edi], ecx
        \ fwait
        jmp L$2
Mersenne: Re: FFTW for GIMPS?
On Tue, Sep 28, 1999 at 07:43:17PM +0100, Brian J. Beesley wrote:
> Alignment on 4-byte boundaries is quite sufficient for C floats. Ten-
> byte reals (direct copies from FPU registers) are a problem

When did using 10-byte reals become common? As far as I remember, you don't even have a store instruction for that. The 80-bit load is slower than the 64-bit load too, I think.

/* Steinar */
--
Homepage: http://members.xoom.com/sneeze/
RE: Mersenne: Re: FFTW for GIMPS?
On 27 Sep 99, at 4:22, Paul Leyland wrote:
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code. Our discovery is that the alignment
> of FP data values is critical. Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.

So, is a future release of MSVC++ going to include an option to optimize alignment of FP data values, at the expense of minimizing storage by packing data values as tightly as possible?

I think this mainly applies to quadword operands (doubles in C), which should be aligned on an 8-byte boundary so that one memory bus cycle is sufficient. This strategy also avoids operands spanning cache line boundaries, which would likely have a serious effect on performance by effectively halving the associativity of the L1 data cache.

Alignment on 4-byte boundaries is quite sufficient for C floats. Ten-byte reals (direct copies from FPU registers) are a problem: you are always going to need two memory bus cycles, since you can't fit an 80-bit operand on a 64-bit bus. However, whether you pack a ten-byte real array contiguously (with no wasted space) or align elements on 16-byte boundaries (with lots of wasted space, but no cache line conflicts) could have a significant effect on performance - which might work in different directions in different applications.

Regards
Brian Beesley
Re: Mersenne: Re: FFTW for GIMPS?
Hi:

Paul Leyland wrote:
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code. Our discovery is that the alignment
> of FP data values is critical. Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.

You're right: I gained 35% in performance just by doing a simple trick to make sure there was 8-byte alignment. On the other hand, I built the FFTW library using the long double float type (with an 'awful' 10-byte length) and its performance was only about 65% of the double float type's performance.

| Guillermo Ballester Valor |
| [EMAIL PROTECTED] |
| c/ cordoba, 19 |
| 18151-Ogijares (Spain) |
| (Linux registered user 1171811) |
Mersenne: Re: FFTW for GIMPS?
On Mon, Sep 27, 1999 at 04:22:53AM -0700, Paul Leyland wrote:
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code. Our discovery is that the alignment
> of FP data values is critical.

It is, generally, for _all_ FP code. Unfortunately, the ABI for x86 `gets it wrong' and uses too little alignment. For gcc/egcs, try -malign-double if you don't need ABI compatibility (i.e. you compile all libraries containing structures with floats with -malign-double). pgcc (a gcc/egcs derivative) has a way of aligning the stack, so that floats will be aligned by 8 most of the time without breaking the ABI. Check http://www.goof.com/pcg/

Note that Prime95 already has correct alignment; would you expect anything else from George? ;-)

/* Steinar */
--
Homepage: http://members.xoom.com/sneeze/
RE: Mersenne: Re: FFTW for GIMPS?
> From: Olivier Langlois [mailto:[EMAIL PROTECTED]]
> I've played a little bit with FFTW and I've noticed that its performance can
> vary a lot depending on how good your compiler is at optimization.
>
> For instance, compiled FFTW code is far from optimal with MSVC++ 6. This
> compiler doesn't fully use the FPU stack like an ASM programmer could, and
> I don't know why, since I'm sure that writing a compiler that makes good
> use of the FPU registers is far from impossible.
>
> So, the compiler you use to compile FFTW is a major factor in the
> performance you'll get from it.
>
> I don't know if anyone has had similar experiences, or whether there is a
> better compiler than MSVC for intensive FP code.

Actually, we at Microsoft Research in Cambridge have seen similar effects when compiling and running FFTW code. Our discovery is that the alignment of FP data values is critical. Get it wrong, and performance can plummet. Unless you set the alignment explicitly, it will be wrong approximately half the time.

Jonathan Hardwick investigated this effect as part of his research into high-performance computing. He gave an internal seminar (which is where I learned about it) and wrote it up in detail. The full details are at http://www.research.microsoft.com/users/jch/fftw-performance.html

Paul
Re: Mersenne: Re: FFTW for GIMPS?
On Sat, 25 Sep 1999, Olivier Langlois wrote:
> I've played a little bit with FFTW and I've noticed that its performance can
> vary a lot depending on how good your compiler is at optimization.

Absolutely. Compile FFTW with gcc on a Sun and you'll get half the speed of FFTW using Sun cc.

> For instance, compiled FFTW code is far from optimal with MSVC++ 6. This
> compiler doesn't fully use the FPU stack like an ASM programmer could, and
> I don't know why, since I'm sure that writing a compiler that makes good
> use of the FPU registers is far from impossible.
>
> So, the compiler you use to compile FFTW is a major factor in the
> performance you'll get from it.
>
> I don't know if anyone has had similar experiences, or whether there is a
> better compiler than MSVC for intensive FP code.

In my (admittedly limited) experience, MSVC produces pretty suboptimal FPU code. gcc for DOS (the djgpp compiler) runs some of my FPU-intensive programs twice as fast as VC++ (for the *same* program). djgpp is nice because if you write your floating point C a certain way, the compiler will produce almost identical assembly code. This can be put to good use if you can write asm yourself.

Another gripe I have about VC++ is that it's pretty good at using integer registers so that stalls are avoided (e.g. on the Pentium), but if you try to do anything even remotely complicated it hops to a function call and destroys performance. For example, when working with 64-bit integers gcc will recognize a shift by 32 and inline simple code for it, whereas VC++ will jump to a function that uses variable shifts (the call and shifts together take about 20x the time). Actually, gcc's long long primitives are *way* faster than the __int64 stuff on Microsoft's compiler.

jasonp
Re: Mersenne: Re: FFTW for GIMPS?
I've played a little bit with FFTW and I've noticed that its performance can vary a lot depending on how good your compiler is at optimization.

For instance, compiled FFTW code is far from optimal with MSVC++ 6. This compiler doesn't fully use the FPU stack like an ASM programmer could, and I don't know why, since I'm sure that writing a compiler that makes good use of the FPU registers is far from impossible.

So, the compiler you use to compile FFTW is a major factor in the performance you'll get from it.

I don't know if anyone has had similar experiences, or whether there is a better compiler than MSVC for intensive FP code.

Olivier Langlois - [EMAIL PROTECTED] - http://www3.sympatico.ca/olanglois
Nortel 515-818-1010 x46659
Montreal, Canada
Re: Mersenne: Re: FFTW for GIMPS?
Jason Stratos Papadopoulos wrote:
> For really big FFTs you can get major gains by using FFTW as a building
> block in a bigger program, rather than have it do your entire FFT with
> a single function call. As Ernst mentioned, the building block approach
> lets you fold some of the forward and inverse FFT computations together,
> and this saves loads of time in cache misses avoided. On the UltraSPARC,
> using FFTW for the pieces of your computation rather than the whole thing
> is somewhere between 2 and 10 times faster than FFTW alone.

That could be terrific! I'll look into it.

> On the Pentium, assembly-coded small FFTs run more than twice as fast
> as FFTW. Even from C, you can do better on the Pentium (do a web search
> for djbfft, a free Pentium-optimized FFT package). For a recursive
> split-radix, you need about 200 lines of assembly; surely this is worth
> twice the speed!

I would like to write some C code for general purposes. For tuned assembler we have Woltman's fantastic prime95/mprime code.

Thank you very much for your comments. They will help me a lot. Have a nice weekend.

| Guillermo Ballester Valor |
| [EMAIL PROTECTED] |
| c/ cordoba, 19 |
| 18151-Ogijares (Spain) |
| (Linux registered user 1171811) |
Re: Mersenne: Re: FFTW for GIMPS?
On Sat, 25 Sep 1999, Guillermo Ballester Valor wrote:
> Yes, certainly I've been able to adapt lucdwt and McLucasUNIX in four
> days. On the other hand, my goal was only to find out whether working with
> FFTW is a good idea, and the timings obtained make me think it could be.

For really big FFTs you can get major gains by using FFTW as a building block in a bigger program, rather than have it do your entire FFT with a single function call. As Ernst mentioned, the building block approach lets you fold some of the forward and inverse FFT computations together, and this saves loads of time in cache misses avoided. On the UltraSPARC, using FFTW for the pieces of your computation rather than the whole thing is somewhere between 2 and 10 times faster than FFTW alone.

> If your comparison were ported to Intel machines (which would be wrong),
> your code would run nearly as fast as mprime!! You say your code is twice
> as fast as FFTW, and surely it is, *BUT* on my Pentium-166 the short code
> I wrote does an iteration of my current exponent 3975659 in 0.901 seconds
> while mprime takes only 0.359. That makes an RPI of 40%. Then your code
> would reach nearly 90% - and without lots of assembler code!

On the Pentium, assembly-coded small FFTs run more than twice as fast as FFTW. Even from C, you can do better on the Pentium (do a web search for djbfft, a free Pentium-optimized FFT package). For a recursive split-radix, you need about 200 lines of assembly; surely this is worth twice the speed!

jasonp
Re: Mersenne: Re: FFTW for GIMPS?
Hi:

[EMAIL PROTECTED] wrote:
>
> Guillermo Ballester Valor writes:
>
> << Do you Know the GNU-FFTW package? (The Fastest Fourier Transform in the
> West). Last week I thought it would be interesting to see if it is as
> fast as they say. It is really a fast C-written FFT code. >>
>
> My comment is this: If you want to create a fast FFT-based program in a
> short time, FFTW is certainly a good way to do it. On the other hand, if
> you're really striving for extreme performance (i.e. trying to write a
> code that possibly many people will use intensively), it is best to have
> a code whose details you understand well. In my case, the latter criterion
> (and the fact that I wanted to learn as much as I could about FFTs) led me
> to write my own code. I started with the Numerical Recipes FFT (slow and
> not very accurate, but easy to play with) about 3 years ago and have been
> working on it ever since - my current code looks nothing like the NR FFT,
> but you have to start somewhere.

Yes, certainly I've been able to adapt lucdwt and McLucasUNIX in four days. On the other hand, my goal was only to find out whether working with FFTW is a good idea, and the timings obtained make me think it could be.

> Looking at the FFTW timings page, for a length-262144 real-vector transform
> they list (http://www.fftw.org/benchfft/results/alpha467.html)
> a performance of around 105 "MFlops" on a 467 MHz Alpha 21164. Using
> their definition of MFlops for real FFTs, this translates to a per-FFT
> time of
>
> 0.5*[5*262144*log2(262144) Flop]/[105 MFlop/sec] = 0.112 sec.
>
> My LL code does 2 FFTs plus other operations per Mersenne-mod squaring,
> so we estimate about 80% of the per-iteration time equals one FFT. At 256K
> vector length it needs .177 sec per iteration on a 400 MHz 21164, which
> leads to an estimate of .40*.177*400/467 = 0.061 sec on a 467 MHz 21164,
> which is significantly faster than FFTW.
If your comparison were ported to Intel machines (which would be wrong), your code would run nearly as fast as mprime!! You say your code is twice as fast as FFTW, and surely it is, *BUT* on my Pentium-166 the short code I wrote does an iteration of my current exponent 3975659 in 0.901 seconds while mprime takes only 0.359. That makes an RPI of 40%. Then your code would reach nearly 90% - and without lots of assembler code!

Is there any Linux or Windows Mlucas 2.7 executable for Intel machines? It would be nice to look at timings.

P.S. The source I sent to E. Mayer was buggy; it was an early version I sent him :-(. If somebody wants the source, mail me in private.

| Guillermo Ballester Valor |
| [EMAIL PROTECTED] |
| c/ cordoba, 19 |
| 18151-Ogijares (Spain) |
| (Linux registered user 1171811) |
Mersenne: Re: FFTW for GIMPS?
Guillermo Ballester Valor writes:

<< Do you Know the GNU-FFTW package? (The Fastest Fourier Transform in the West). Last week I thought it would be interesting to see if it is as fast as they say. It is really a fast C-written FFT code. >>

Hi, Guillermo (please, let's use first names - I'm an informal guy) -

While not having used it myself, I've certainly heard of FFTW. Note that a better person to comment on your questions might be Jason Papadopoulos ([EMAIL PROTECTED]) - he adapted FFTW for his very fast SPARC Fermat number code.

My comment is this: If you want to create a fast FFT-based program in a short time, FFTW is certainly a good way to do it. On the other hand, if you're really striving for extreme performance (i.e. trying to write a code that possibly many people will use intensively), it is best to have a code whose details you understand well. In my case, the latter criterion (and the fact that I wanted to learn as much as I could about FFTs) led me to write my own code. I started with the Numerical Recipes FFT (slow and not very accurate, but easy to play with) about 3 years ago and have been working on it ever since - my current code looks nothing like the NR FFT, but you have to start somewhere.

Looking at the FFTW timings page, for a length-262144 real-vector transform they list (http://www.fftw.org/benchfft/results/alpha467.html) a performance of around 105 "MFlops" on a 467 MHz Alpha 21164. Using their definition of MFlops for real FFTs, this translates to a per-FFT time of

0.5*[5*262144*log2(262144) Flop]/[105 MFlop/sec] = 0.112 sec.

My LL code does 2 FFTs plus other operations per Mersenne-mod squaring, so we estimate about 80% of the per-iteration time equals one FFT. At 256K vector length it needs .177 sec per iteration on a 400 MHz 21164, which leads to an estimate of .40*.177*400/467 = 0.061 sec on a 467 MHz 21164, which is significantly faster than FFTW.
At real-vector length 362880 (the closest I could find to 384K), FFTW needs

0.5*[5*362880*log2(362880) Flop]/[125 MFlop/sec] = .134 sec per FFT,

whereas Mlucas 2.7 (multiplying the 384K = 393216 timing by 362880/393216 to get a comparison with the FFTW timing for the latter length) needs

.40*.316*(400/467)*(362880/393216) = .100 sec per FFT

on a 467 MHz 21164.

On a 195 MHz MIPS R1 CPU (SGI Origin 2000 workstation) FFTW needs

0.5*[5*262144*log2(262144) Flop]/[62 MFlop/sec] = .190 sec at n=262144
0.5*[5*362880*log2(362880) Flop]/[67 MFlop/sec] = .250 sec at n=362880

whereas Mlucas 2.7 needs (again extrapolating the 384K timing backward to 362880) .088 and .127 seconds, respectively, per FFT on the same hardware.

The fact that it took me so much work to achieve these speeds is a testament to the speed of FFTW - the sample timings I sent to Steven Johnson (one of the authors of FFTW) last year were still generally slower than FFTW. However, "rolling one's own" (as it were) allows one to do lots of algorithm-specific optimizations not available to the FFTW folks.

Since FFT-based large-integer arithmetic can do a pointwise (dyadic) squaring on the outputs of the forward FFT without them being ordered in any special way, one can do a forward decimation-in-frequency (DIF) FFT, followed by a pointwise squaring of the (bit-reversal-reordered) data, followed by a decimation-in-time (DIT) inverse FFT, thus avoiding the need for explicit bit-reversal reorderings of the data (or matrix transposes, in the context of the 4-step FFT used by FFTW), for example. One can also combine the last pass of the forward FFT, the dyadic squaring and the first pass of the inverse FFT into a single pass through the data. Similarly, one can fuse the final pass of the inverse FFT, the rounding-and-carry-propagation step, and the first pass of the next forward FFT, which both minimizes data movement (memory accesses) and allows one to propagate carries for several sub-blocks of the full array in parallel fashion.
(I'm sending a copy of this to both the Mersenne list and to Steven Johnson, the latter so he can correct any gross errors I might have made in my timing calculations for FFTW.)

Cheers,
Ernst