--- "J.Pietschmann" <[EMAIL PROTECTED]> wrote:
> Al Chou wrote:
> > Cool reference, thanks.  Did you mean "study the source code"?
> 
> Yes, last time I looked it contained some entertaining remarks
> in the comments, see
>   http://www.mersenne.org/source.htm
> From memory, the story went something like this:
> 1. Implement FFT straightforwardly. It's slow.
> 2. Leverage the CPU L1 cache: reshuffle the data so that the
>    data access is localized despite the FFT access pattern, and
>    break up loops so that half of the L1 cache can hold the
>    source and the other half the result of the calculations
>    in the loop. It's still slow.
> 3. Oops, the FFT constants should fit in the L1 cache as
>    well. Cut down and rearrange loops. It's still slow.
> 4. Take L1 cache organization into account. If certain bits of
>    the starting addresses of some data snippets are the same, the
>    cache will allocate only a few lines for storing them, then
>    start over with the first allocated line. Naturally, if a large
>    contiguous array is loaded this will cause thrashing, even
>    though there are still free lines. Introduce gaps to offset
>    starts of cache lines so that the whole L1 cache can be
>    leveraged. It's *still* slow.
> 5. Perhaps some tricks with flushing dirty cache lines
>    early, and other stuff; I don't quite remember.
> 6. Take L2 cache organization into account. Further reshuffling
>    and introducing gaps. Write macros to generate the assembler
>    code, because nobody can keep track of all the permutations
>    without going insane.
>    Beyond a certain size, it is *still* slow.
> 7. Remember the processor has a TLB, which maps the logical address
>    space of the process into the physical RAM address space. As
>    usual, thrashing happened, causing frequent reloads of mapping
>    data from main RAM, which was both expensive in itself and
>    swapped valuable data out of the L2 cache. Introduce large gaps
>    in the data to avoid this.
>    This was the *only* time I ever read about a non-kernel programmer
>    having to take this into account.
> This was from 1998 or so, and somewhat specific to the Pentium
> architecture. More recent processors have larger caches and slightly
> different cache organizations, in particular an on-chip L2 cache;
> there are changes in write-back handling, there may be an L3 cache,
> RAM access has become more important, and processors acquired SIMD
> instructions, which are now used.
> Note that professional numeric libraries have to deal with similar
> problems when it comes to handling large matrices.
> 
> Have fun!

Whew, it sure does sound like a numerical program that went through a lot of
optimization for speed.  As a scientific user programmer (not a library
programmer), I never had to go to such lengths.  The most I did was rearrange
the nesting of loops and use machine-specific vector-processing routines in a
few places instead of standard Fortran, and that was enough.


Al

=====
Albert Davidson Chou

    Get answers to Mac questions at http://www.Mac-Mgrs.org/ .

