Thanks for the info.
I would like to follow up, though, because the respective cache sizes vary
drastically on IA32 as well.  If I remember correctly (and it has been a few
years, so I probably don't):
The 386 has no L1 cache.
The 486 and Pentium had small L1 caches (the Pentium was 8k data, 8k instruction?).
The P6 line has anything from no L2 to a 1 (2?) MB L2; the L1 was 32k.
The Athlon/Duron has an L2 from 64k-512k and an L1 of 128k.
The Willamette has an 8k data L1 (plus a trace cache for decoded
instructions), a 256k L2, and an L3 (only for the server market?).

Also, all of the processors from the P6 class or newer use
microcode-translation engines (i.e., a RISC-ish core), so the size of an
IA32 instruction is very hard to determine once it enters the machine (i.e.,
how much room it takes up in the L1).

So how do you optimize assembly code to fit into the caches in this case?
It makes a big difference performance-wise whether you hit the L1 or the L2
for performance-critical code, and if you get it wrong, you might end up
missing most of the time (one line displaces the one you need next, which
displaces the one you just read, etc.).  I would be very interested in
knowing what methods you use to design algorithms that fit into a specific
cache when you don't know the size of the cache, or how much space a given
instruction will occupy.  Am I missing something here?  Or was the point
'small code = higher probability of fitting in the cache' (in which case
you also want to avoid any instruction that might end up being translated
into micro-code)?
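
To make the displacement effect I mean concrete, here is a rough C sketch on
the data side (the 16k direct-mapped L1 and all the names are pure
assumptions): two small buffers that sit exactly one cache-size apart keep
evicting each other, so nearly every access misses even though the hot data
is tiny.

#include <stdint.h>

#define ASSUMED_L1_BYTES (16 * 1024)   /* assumption: a 16k direct-mapped L1 */

/* Two buffers exactly one cache-size apart, so they map to the same sets. */
static uint8_t backing[2 * ASSUMED_L1_BYTES];

/* With a direct-mapped cache, each b[i] access evicts the line just loaded
 * for a[i], and the next a[i+1] access misses again; a 2-way (or better)
 * cache hides this particular case, which is part of why it is so hard to
 * predict.  n is assumed to be small (well under ASSUMED_L1_BYTES). */
uint32_t thrashing_sum(int n)
{
    uint8_t *a = backing;
    uint8_t *b = backing + ASSUMED_L1_BYTES;
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] + b[i];
    return sum;
}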

The concept of optimizing for caches has been thrown around in the 'Optimal
hashing' thread, and I've been wondering how to go about doing it.

Thanks,
.Geoff


-----Original Message-----
From: Robert W. Cunningham [mailto:[EMAIL PROTECTED]]
Sent: Saturday, January 06, 2001 4:56 PM
To: [EMAIL PROTECTED]
Subject: Re: [plex86] More on C++ (using assembler can be bad)


"Hausheer, Geoffrey" wrote:

> Since everyone is throwing in their two cents, here's mine:
>   Writing assembly code is almost never the right solution anymore.
>   This is because processors do their own optimization.  Thus, code that
> has been tuned really well for a Pentium (say, using instruction ordering
> to maximize use of the superscalar architecture) may well be incredibly
> inefficient on a Pentium II (which uses out-of-order execution) or on a
> Pentium IV or Athlon (which both have lots of execution engines and
> sophisticated schedulers).  The compiler can be optimized to take
> advantage of these things (pgcc), but it is usually bad form to try to
> take advantage of these optimizations in user-written code.  Writing in
> assembly to reduce the number of instruction calls, fine; trying to use
> 'better' instructions or place them in a certain order, bad for
> portability.  This is not MIPS, DLX, or embedded programming.  It's IA32,
> where there are no guarantees.

I no longer believe the primary goal of hand-tuned assembler is to take
advantage of general instruction execution characteristics.  Today, on the
very few occasions I do hand-tune code, it is for the following reasons:

1. Get it to fit in and run from a faster cache.

This has proven critical for many uses of MMX, SSE, and 3DNOW (and similar
SIMD instruction subsets on other processors), where the x86 CISC
architecture morphs to become SIMD, more like a DSP.  When doing math, code
and operands (data) have to be optimized concurrently.  While many compilers
sometimes do well with code optimizations, they often lack the ability to
perform any cache-use optimizations for code, and never (that I'm aware of)
for data.  In general, performance of SIMD code is extraordinarily sensitive
to the locality of the data, which is why most SIMD instruction sets operate
primarily from registers.  The same considerations can apply to just about
any code that needs to run often and run fast.
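
A minimal sketch of the kind of cache-use optimization I mean, in plain C
rather than SIMD (the 512x512 size and the 16-element tile are assumptions,
picked so the active sub-blocks stay inside a small L1 data cache):

#define N    512   /* assumed problem size */
#define TILE 16    /* assumed tile: 16x16 doubles = 2k, small enough for any L1 */

/* Naive triple loop: the column walk over B touches a new cache line on
 * every k, so B is pulled in from L2 (or RAM) again and again. */
void matmul_naive(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Blocked version: each TILE x TILE sub-block of A, B, and C is loaded
 * into the nearest cache once and reused many times before eviction. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}

The arithmetic is identical in both versions; only the order of the memory
traffic changes.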

If you want to see a good example of data-locality optimization, take a
look at some of the Cray compilers:  Many are Open Source or Public Domain,
and they prove how re-arranging your data in RAM and cache can greatly
improve the performance of any large calculation on a vector (SIMD) machine.
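
Much of that re-arranging amounts to trading an array of structures for a
structure of arrays, so the inner loop streams through dense, unit-stride
data.  A hypothetical sketch in C (the particle layout and names are
invented for illustration):

#define NPART 10000   /* hypothetical particle count */

/* Array of structures: updating just the positions still drags every
 * particle's mass and charge through the cache, one record at a time. */
struct particle_aos { double x, y, z, mass, charge; };
static struct particle_aos p_aos[NPART];

void step_aos(double dt, const double vx[], const double vy[], const double vz[])
{
    for (int i = 0; i < NPART; i++) {
        p_aos[i].x += vx[i] * dt;
        p_aos[i].y += vy[i] * dt;
        p_aos[i].z += vz[i] * dt;
    }
}

/* Structure of arrays: the same update streams through dense, unit-stride
 * arrays, the layout a vectorizing compiler wants to see. */
static struct {
    double x[NPART], y[NPART], z[NPART], mass[NPART], charge[NPART];
} p_soa;

void step_soa(double dt, const double vx[], const double vy[], const double vz[])
{
    for (int i = 0; i < NPART; i++) {
        p_soa.x[i] += vx[i] * dt;
        p_soa.y[i] += vy[i] * dt;
        p_soa.z[i] += vz[i] * dt;
    }
}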

2. Employ instructions or instruction sequences the compiler fails to
generate.

I often take a pass through the performance-critical sections of the
assembly code generated by compilers to identify sequences that are
obviously non-optimal, either in size or the instructions selected for the
operation, especially where data handling is concerned.  In essence, I act
as a human peep-hole optimizer, since I only look at the code from a local
perspective.  When doing this, I typically focus on optimizing the code for
the SLOWEST processor the code needs to perform well on (a very typical
consideration for embedded systems).  More often than I'd like to admit, it
is still a 33 MHz '386!
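
One (hypothetical) example of the kind of substitution involved: a 32-bit
byte swap written as portable shifts and masks tends to be emitted more or
less literally, while a single BSWAP instruction does the whole job on a 486
or later.  The function names are invented, and the inline-asm form assumes
GCC:

#include <stdint.h>

/* Portable byte swap: the shift-and-mask form a compiler will usually
 * emit more or less literally as four shifts, two ANDs, and three ORs. */
static uint32_t swap32_c(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hand substitution of an instruction the compiler did not know to use
 * (BSWAP, 486 and later).  GCC extended-asm syntax; falls back to the
 * portable version elsewhere. */
static uint32_t swap32_asm(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("bswap %0" : "+r"(x));
    return x;
#else
    return swap32_c(x);
#endif
}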

Over the years I have developed a few standard "tricks" I use frequently.
About 10 years ago I made a serious effort to incorporate some of these
tricks into GCC.  While I succeeded in the technical sense, I failed
miserably overall:  My changes were just too special purpose, and slowed
down the optimizers (PH and RTL) beyond belief.  (For some reason, nobody
wanted to commit my changes, not even me.  Not even after I created a
"-Ocomplex" flag to isolate my changes.)

Now and then (but less every year), I still hand-scan the assembly code.  I
haven't written any significant software directly in assembler (from
scratch) in well over a decade.  I find it is always better to start with
the output of a C/C++ compiler, and use a hardware emulator and an execution
profiler (typical embedded tools, along with a logic analyzer) to find
obvious bottlenecks.  First I try to tweak the C/C++ code, and only if that
fails will I hack the assembly code.
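
A typical C-level tweak of that sort, made before touching any assembly
(purely illustrative; whether a given compiler needs the help varies):
accumulate into a local so the running sum cannot alias the input buffer and
can live in a register for the whole loop.

/* Before: because *total and buf[i] might alias, a conservative compiler
 * reloads and re-stores *total on every iteration. */
void sum_slow(const int *buf, int n, int *total)
{
    for (int i = 0; i < n; i++)
        *total += buf[i];
}

/* After: accumulating into a local tells the compiler the running sum
 * cannot alias the input, so it can stay in a register across the loop. */
void sum_fast(const int *buf, int n, int *total)
{
    int acc = *total;
    for (int i = 0; i < n; i++)
        acc += buf[i];
    *total = acc;
}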


That's another aspect of optimization that is seldom addressed:  Learning
the specific strengths and weaknesses of a compiler, then coding to make
best use of the strengths and avoid the weaknesses.  Only by becoming
familiar with assembly is it possible to make useful determinations in this
area:  Relying on compiler benchmarks is never useful in the real world,
since benchmarks consist only of specific code sequences you are likely
never to use in an identical way.  It is far better to use multiple
compilers on your actual code and run it on your actual target platform,
which will then allow you to select the best compiler overall.  I have even
selected different compilers for different modules within a single project,
since that allows me to continue to work at a higher level, and worry less
about hand optimization of the assembly output.

When I started using C++, this was an especially critical optimization
phase.  On a Windows project (the only one I have *ever* done), the GUI
team "had" to use MS C++ (for MFC access), while my team used Borland C++
for its better code generation and far superior adherence to the C++ spec
(at that time, at least).  We also used an early DOS port of gcc/g++ to
check parts of our code (better warnings in some situations, and it
sometimes generated "strange" assembly code that was often a learning
experience to understand).

In general, when performance is critical, I recommend taking a careful look
at the high-level design and code, and then using multiple compilers, before
I recommend hand-tweaking of the assembler code.  That way, even if you do
have to resort to hand tweaking, you have more than one set of generated
assembly code to refer to, which greatly simplifies the overall process.

Of course, that doesn't really much apply to C under Linux, since it is
pretty much wedded to gcc.  While other Linux C compilers are available, and
more of them offer "free" binaries or are going Open Source, gcc is still
the 600 lb gorilla in Linux land.

To compare, I recently counted over 150 (!) compilers available for the 8051
family of processors (and their clones and derivatives).  It seems that
compilers for "simple" CPUs (where hand assembly is far easier) are like
graphics toolkits under Linux:  Everyone wants to write their own.


-BobC




