--- "J.Pietschmann" <[EMAIL PROTECTED]> wrote: > Al Chou wrote: > > Cool reference, thanks. Did you mean "study the source code"? > > Yes, last time I looked it contained some entertaining remarks > in the comments, see > http://www.mersenne.org/source.htm > From memory, the story was like > 1. Implement FFT straightforwardly. It's slow. > 2. Leverage the CPU L1 cache: reshuffle the data so that the > data access is localized despite the FFT access pattern, and > break up loops so that half of the L1 cache can hold the > source and the other half the result of the calculations > in the loop. It's still slow. > 3. Oops, there are the FFT constants which should fit in the L1 > cache as well. Cut down and rearrange loops. It's still slow. > 4. Take L1 cache organisation in account. If certain bits of > the starting address of some data snippets are the same, the > cache will allocate only a few lines for storing them, then > start over with the first allocated line. Naturally, if a large > continous array is loaded this will cause thrashing, even > though there are still free lines. Introduce gaps to offset > starts of cache lines so that the whole L1 cache can be > leveraged. It's *still* slow. > 5. Perhaps doing some tricks with early flushing dirty cache > lines and other stuff, I don't quite remember. > 6. Take L2 cache organization into account. Further reshuffling > and introducing gaps. Write macros to generate the assembler > code, because nobody can keep track of all the permutations > without going insane. > Beyond a certain size, it is *still* slow. > 7. Remember the processor has a TLB, which maps the logical address > space of the process into the physical RAM address space. As > usual, thrashing happened, causing frequent reloads of mapping > data from the main RAM which is both expensive in itself and > swapped valuable data out of the L2 cache. Introduce large gaps > in the data to avoid this. > This was the *only* time I ever read about a non-kernel programmer > having to take this into account. > This was from 1998 or so, and somewhat specific for the Pentium > architecture. More recent processors have larger caches and slightly > different cache organizations, in particular a L2 cache on chip, > there are changes in write back handling, there may be a L3 cache, > RAM access has become more important, and processors aquired SIMD > instructions which are now used. > Note that professional numeric libraries have to deal with similar > problems when it comes to handling large matrices. > > Have fun!
Whew, it sure does sound like a numerical program that went through a lot of optimization for speed. As a scientific user programmer (not a library programmer), I never had to go to such lengths. The most I did was rearrange the nesting of loops and use machine-specific vector-processing routines in a few places instead of standard Fortran, and that was enough. Al ===== Albert Davidson Chou Get answers to Mac questions at http://www.Mac-Mgrs.org/ . __________________________________ Do you Yahoo!? Yahoo! Calendar - Free online calendar with sync to Outlook(TM). http://calendar.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]