Hi guys,

This week I decided to take a quick break from my OMAP work and instead 
had a play around with optimising ArcEm to run faster on ARM hardware
(although I suspect most of the optimisations will benefit other platforms 
too).

After grabbing the latest code from CVS and fixing a few GCC 4/32bit 
compatibility issues, initial performance was about what I expected - an
average of 0.8 MIPS when measuring the first 50 million instructions
executed during RISC OS 3's startup sequence (long enough to complete 
the startup and sit idle at the desktop for a few seconds).

After spending the last week refactoring and optimising the code, the 
emulator now runs almost 3 times faster, reaching a speed of 2.26 MIPS 
for the same test. Adding a simple MHz readout revealed that this
corresponds to around 4.2MHz when sitting idle at the desktop, dropping to
3.4MHz when messing around in !Paint. The same build running on a 720MHz
Beagleboard reached 6.3MHz when idle, or 5MHz when under load - which
means it probably isn't far off being usable for running games, or at 
least for running ones which are light on action.

I've uploaded my source + binary here: 
http://www.phlamethrower.co.uk/misc2/arcem-fast.zip

Note that it's an ELF binary, so you'll need !SharedLibs and the 
SharedUnixLibrary. It's been compiled for ARMv5, so it will likely fail 
on a RiscPC. It's also still using the DIRECT_DISPLAY code, so you'll run 
into problems if your MDF/monitor can't cope with RISC OS 3-era screen 
modes (or if you try using a <256 colour mode on an Iyonix without 
Aemulor running).

Here's a brief summary of my changes:
* Fixed it to compile with GCC 4. IIRC the only thing that needed doing 
was gas'ing the assembler files, although I later disabled the assembler 
code anyway (it didn't seem to result in any major performance gains, and 
I wanted to mess around with the layout of the ARMul_State struct)
* Made the ArcEmKey module 32bit compatible. The new code uses MRS & MSR 
unconditionally, so will fail on pre-ARMv3 (not that there's much point in 
running an ARMv2 emulator on an ARMv2 machine)
* Stripped out all the code relating to 32bit CPU modes and got rid of all 
the PC/PSR state shadowing. I'm not sure how much of a performance gain 
this actually gave, but it certainly made the code a lot easier to work 
with.
* Changed the instruction decode/execute to not have 16 instances of each
execution function (one per condition code). Apart from significantly 
reducing compile times & resulting ELF size, this also gave around a 25% 
performance boost. Condition codes are now evaluated with a simple lookup 
table instead (there's a rough sketch of the idea just after this list).
* Stripped out a fair amount of other redundant code/state/variables 
(although I suspect there's still a lot more to go)
* Rearranged the ARMul_State struct to move the most commonly used members 
near the start. This resulted in tighter code, better D-cache utilisation, 
etc.
* Rewrote the memory access code. The old code wasn't particularly great 
(linear pagetable searches!), so I worked quite hard on finding a fairly 
optimal solution for ARM. Unfortunately this means it isn't particularly 
portable, but it shouldn't be too hard to introduce a more portable 
variant for other platforms. The new code is called 'FastMap'; I'll give 
some more detail on it below.
* Changed the instruction fetch code to decode the instruction at fetch 
time. This will result in a performance loss in some situations (a branch 
instruction just before some frequently-changing data), but it makes it a 
lot easier to deal with resetting the cached execute function when memory 
writes occur (something that I'm not entirely sure the old code dealt with 
properly, judging by the state of the code and the comments that were at 
the top of armemu.c)
* Changed the main execute loop to store the instruction pipeline in an 
array instead of shuffling it through different variables all the time.
* Had a quick play with the instruction execution functions. In particular 
I noticed that the C version of GetDPRegRHS was being inlined, but its 
complexity meant the inlining wasn't really doing much except hurt 
performance. So instead I swapped it for a much simpler version that uses
a function pointer table to call one of 8 simple sub-variants (4 for 
constant shifts, 4 for register shifts) - see the second sketch after this 
list.
* Changed all the code to access the ARMul_State via the 'state' pointer 
rather than the 'statestr' global. Before, a lot of the code was simply 
ignoring the state pointer that was being passed to it, but making it 
start using it resulted in a nice speed gain. I'm not sure what would 
happen if the state pointer was removed entirely (i.e. if everything was 
forced to use the global statestr); some experiments I performed on 
changing MEMC to use a local pointer only made performance worse, not 
better.
* Reduced the mouse/keyboard poll frequency from once every 125 cycles to 
once every 12500(ish) cycles (see the #ifdef __riscos__ bits in
DisplayKbd_Poll() in arch/DispKbdShared.c). I guess whoever wrote that 
code didn't realise that it would get called so often! I haven't checked
what the real mouse/keyboard update rate should be, but I suspect the rate
could be slowed further without causing any problems (the current rate 
would update it at 640 times per second, assuming an 8MHz ARM2).
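
Since a couple of the points above are easier to show than to describe, 
here are two rough sketches. First, the condition code lookup table. This 
is just the general shape of the idea rather than the actual ArcEm code, 
and the packing of N/Z/C/V into a 4-bit index is my assumption:

/* Rough sketch of the condition code table - not the actual ArcEm code.
   One row per condition field (bits 31-28 of the instruction), one column
   per possible N/Z/C/V combination.  Built once at startup, so the execute
   loop only needs one table load and compare per instruction. */
#include <stdint.h>

static uint8_t cond_table[16][16];

static int cond_holds(unsigned cond, unsigned nzcv)
{
  int n = (nzcv >> 3) & 1, z = (nzcv >> 2) & 1;
  int c = (nzcv >> 1) & 1, v = nzcv & 1;
  switch (cond) {
    case 0x0: return z;              /* EQ */
    case 0x1: return !z;             /* NE */
    case 0x2: return c;              /* CS */
    case 0x3: return !c;             /* CC */
    case 0x4: return n;              /* MI */
    case 0x5: return !n;             /* PL */
    case 0x6: return v;              /* VS */
    case 0x7: return !v;             /* VC */
    case 0x8: return c && !z;        /* HI */
    case 0x9: return !c || z;        /* LS */
    case 0xA: return n == v;         /* GE */
    case 0xB: return n != v;         /* LT */
    case 0xC: return !z && (n == v); /* GT */
    case 0xD: return z || (n != v);  /* LE */
    case 0xE: return 1;              /* AL */
    default:  return 0;              /* NV - never executes on an ARM2 */
  }
}

static void build_cond_table(void)
{
  unsigned cond, nzcv;
  for (cond = 0; cond < 16; cond++)
    for (nzcv = 0; nzcv < 16; nzcv++)
      cond_table[cond][nzcv] = (uint8_t)cond_holds(cond, nzcv);
}

/* In the execute loop, instead of 16 copies of every execute function:
   if (cond_table[instr >> 28][current NZCV flags as a 4-bit value])
     call the single unconditional execute function;                    */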
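
Second, the GetDPRegRHS replacement. Again this is only a sketch of the 
approach with made-up names - the real code also has to handle the 
carry-out and the LSR/ASR/ROR #0 special cases, which I've left out here:

/* Rough sketch of the function pointer table for the data processing RHS.
   Bits 4-6 of the instruction select one of 8 simple variants:
   bit 4 = shift-by-register, bits 5-6 = shift type. */
#include <stdint.h>

typedef uint32_t ARMword;
typedef ARMword (*RHSFunc)(const ARMword *r, ARMword instr);

static ARMword shift_by(ARMword val, unsigned type, unsigned amt)
{
  switch (type) {
    case 0:  return (amt >= 32) ? 0 : val << amt;                     /* LSL */
    case 1:  return (amt >= 32) ? 0 : val >> amt;                     /* LSR */
    case 2:  return (ARMword)((int32_t)val >> (amt > 31 ? 31 : amt)); /* ASR */
    default: amt &= 31;
             return (val >> amt) | (val << ((32 - amt) & 31));        /* ROR */
  }
}

/* 4 constant-shift variants (amount in bits 7-11) and 4 register-shift
   variants (amount in the bottom byte of the register in bits 8-11) */
#define RHS_IMM(name, type) \
  static ARMword name(const ARMword *r, ARMword i) \
  { return shift_by(r[i & 15], (type), (i >> 7) & 31); }
#define RHS_REG(name, type) \
  static ARMword name(const ARMword *r, ARMword i) \
  { return shift_by(r[i & 15], (type), r[(i >> 8) & 15] & 255); }

RHS_IMM(rhs_imm_lsl, 0) RHS_REG(rhs_reg_lsl, 0)
RHS_IMM(rhs_imm_lsr, 1) RHS_REG(rhs_reg_lsr, 1)
RHS_IMM(rhs_imm_asr, 2) RHS_REG(rhs_reg_asr, 2)
RHS_IMM(rhs_imm_ror, 3) RHS_REG(rhs_reg_ror, 3)

static const RHSFunc rhs_table[8] = {
  rhs_imm_lsl, rhs_reg_lsl, rhs_imm_lsr, rhs_reg_lsr,
  rhs_imm_asr, rhs_reg_asr, rhs_imm_ror, rhs_reg_ror,
};

/* GetDPRegRHS then boils down to:
   return rhs_table[(instr >> 4) & 7](state->Reg, instr);  */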

I think the next step with the optimisations should be to make the 
instruction decoding more aggressive, which will introduce more execution 
functions, but make each function simpler. The way that each 
memory location caches the last execute function means that this should 
result in a nice performance gain, since over 99% of all executed memory 
locations will have been executed in the past, except in extreme cases 
like very tight self-modifying loops.

But introducing more execute functions has a danger of hurting I-cache 
performance (just like the condition code variants did), so it might be 
worth looking into splitting the execute stage into two or three stages. 
E.g. instead of having a function pointer you have 2 or 3 function 
pointers (or indices into function pointer arrays, to save memory). Since 
most instructions are likely to be data processing instructions or 
load/store instructions this will work quite nicely - the first execution 
stage can be to decode the RHS, while the second stage performs the 
computation/memory access. With any luck a rewrite like that will get us 
to ARM2 speeds on an Iyonix (or at least on a Beagleboard), all without 
the hassle of adding a dynamic recompiler or having to maintain assembler 
sources.
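
For illustration, the cached decode entry for that scheme might end up 
looking something like this - purely hypothetical, nothing like it exists 
in the current code:

/* Hypothetical sketch of a two-stage cached decode entry.  Small indices
   into function pointer arrays keep the per-instruction cache entry
   compact while letting each stage be a short, I-cache friendly function. */
#include <stdint.h>

typedef uint32_t ARMword;
typedef struct ARMul_State ARMul_State;

typedef ARMword (*OperandFunc)(ARMul_State *state, ARMword instr);
typedef void    (*ExecuteFunc)(ARMul_State *state, ARMword instr, ARMword rhs);

extern const OperandFunc operand_funcs[]; /* one per RHS/addressing variant */
extern const ExecuteFunc execute_funcs[]; /* one per ALU op / load / store  */

typedef struct {
  uint16_t operand_idx;  /* which RHS/address decoder to run */
  uint16_t execute_idx;  /* which computation/memory access to run */
} CachedDecode;

/* The main loop would then do something like:
   ARMword rhs = operand_funcs[d->operand_idx](state, instr);
   execute_funcs[d->execute_idx](state, instr, rhs);          */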

Now, about the FastMap code:

This operates by building a lookup table (MEMC.FastMap) to map each 4K 
page in the 64MB memory map to a FastMapEntry struct. Each FastMapEntry 
contains the page access flags, memory pointer (if direct access is 
possible), and function pointer (if direct access is not possible), all 
within 8 bytes. The fact that the access flags and memory pointer 
have been squeezed into 4 bytes is why it isn't entirely portable. This
arrangement means that all ROM & RAM reads can be serviced directly via 
the memory pointer, while I/O accesses and writes to the first 512K of RAM 
go via the function pointer. All access is done via a series of inlined
functions, resulting in minimal overhead for situations where the memory 
pointer is used.
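
To make the layout concrete, a FastMapEntry is along these lines 
(simplified - the real definitions live in armarc.h and the exact flag 
encoding differs):

/* Simplified sketch of a FastMapEntry.  64MB of address space / 4KB pages
   = 16384 entries of 8 bytes each, i.e. a 128KB table. */
#include <stdint.h>

typedef uint32_t ARMword;
typedef struct ARMul_State ARMul_State;
typedef ARMword (*FastMapAccessFunc)(ARMul_State *state, ARMword addr,
                                     ARMword data, int is_write);

typedef struct {
  /* Access flags squeezed into the spare bits of a 32-bit pointer value -
     this is the part that assumes a 32-bit host.  The data pointer is
     stored pre-offset by the page's base address, so turning an emulated
     address into a host pointer is just a shift plus an addition. */
  uint32_t          FlagsAndData;
  FastMapAccessFunc AccessFunc;  /* used when direct access isn't possible */
} FastMapEntry;

#define FASTMAP_PAGE_SHIFT 12                            /* 4KB pages */
#define FASTMAP_SIZE       ((64u << 20) >> FASTMAP_PAGE_SHIFT)

static FastMapEntry FastMap[FASTMAP_SIZE];

static FastMapEntry *FastMap_GetEntry(ARMword addr)
{
  return &FastMap[(addr & 0x3ffffffu) >> FASTMAP_PAGE_SHIFT];
}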

To cope with instruction fetches (and RAM writes) requiring access to the 
instruction decode cache, all executable memory has been coalesced into
one big allocation. This means that once you have a pointer to a memory
location you just need to apply one offset to get a pointer to the
corresponding cached ARMEmuFunc.
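
In other words, something like this (illustrative only, not the exact 
code):

/* Because all executable memory sits in one allocation, the decode cache
   can be a parallel block at a fixed byte offset from it, so no second
   table lookup is needed.  This relies on a host function pointer being
   the same size as an emulated word (4 bytes) - another of the 32-bit
   assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ARMword;
typedef void (*ARMEmuFunc)(void *state, ARMword instr);

/* emu_func_offset is assumed to be fixed when the two blocks are
   allocated: (char *)func_block - (char *)data_block */
static ARMEmuFunc *FuncForData(ARMword *dataptr, ptrdiff_t emu_func_offset)
{
  return (ARMEmuFunc *)((char *)dataptr + emu_func_offset);
}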

A few extra points:
* FastMap_SetEntries offsets the data pointer by the address, so 
FastMap_Log2Phy is just a simple shift + addition (although when I looked 
at some code GCC had produced it was still using 3 instructions to do what 
could have been done in one)
* state->NtransSig is mostly redundant, since all memory access checks are 
performed using MEMC.FastMapMode (as computed by FastMap_RebuildMapMode).
* Inlining of the ARMul_Load/Store functions can be disabled by commenting 
out FASTMAP_INLINE in armarc.h
* ARMul_LoadInstrTriplet, and the LDM/STM code, directly access the data 
pointer for speed whenever they can be certain that the access won't cross 
a page boundary. This resulted in a nice gain for LoadInstrTriplet, but 
didn't seem to result in too much of a gain for LDM/STM. Or maybe my test 
case wasn't that good, since during startup RISC OS seems to write to the 
first 512K of RAM quite a bit, which will prevent the STM optimisation 
from being used.
* The MHz readout is provided by the RefreshDisplay() function in 
riscos-single/DispKbd.c
* The MIPS timing is provided by some slightly hacky code in 
ARMul_Emulate26 and ARMul_DoProg (although it's currently disabled, so the 
emulator doesn't exit after the first 50 million instructions)
* There are also likely to be a few remnants of other bits of my debugging 
code around the place (e.g. the code in ARMul_Emulate26 to log the address 
& value of each instruction being executed)

I doubt I'll be looking at ArcEm much more in the near future, so I'll 
leave it to you to decide what (if anything) to do with these changes. I 
haven't tried them on anything other than RISC OS, so it wouldn't surprise 
me if some of the other platforms fail to compile or are fundamentally 
broken.

If ArcEm is to be made usable on RISC OS then at some point we'll also 
want to get rid of the DIRECT_DISPLAY code, since one of the major 
problems with running old games on new machines is that the 
machine/MDF/monitor is incapable of providing the required screen modes. 
An optimised blitter would therefore be required, and could also allow us 
to support palette splitting (especially if blitting to a 16bpp or 32bpp 
buffer, which would also allow use of hardware scaling on OMAP). To cope
with this eventuality I've been careful not to break any of the cycle 
counting that ArcEm does (although I haven't verified that it's still 100% 
correct).

Cheers,

- Jeffrey
