Hi guys,

This week I decided to take a quick break from my OMAP work and instead had a play around with optimising ArcEm to run faster on ARM hardware (although I suspect most of the optimisations will benefit other platforms too).

After grabbing the latest code from CVS and fixing a few GCC 4/32bit compatibility issues, initial performance was about what I expected - an average of 0.8 MIPS when measuring the first 50 million instructions executed during RISC OS 3's startup sequence (long enough to complete the startup and sit idle at the desktop for a few seconds). After spending the last week refactoring and optimising the code, the emulator now runs almost 3 times faster, reaching a speed of 2.26 MIPS for the same test. Adding a simple MHz readout revealed that this corresponds to around 4.2MHz when sitting idle at the desktop, dropping to 3.4MHz when messing around in !Paint. The same build running on a 720MHz Beagleboard reached 6.3MHz when idle, or 5MHz when under load - which means it probably isn't far off being usable for running games, or at least for running ones which are light on action.

I've uploaded my source + binary here:

http://www.phlamethrower.co.uk/misc2/arcem-fast.zip

Note that it's an ELF binary, so you'll need !SharedLibs and the SharedUnixLibrary. It's also been compiled for ARMv5, so will likely fail on a RiscPC. It's also still using the DIRECT_DISPLAY code, so you'll run into problems if your MDF/monitor can't cope with RISC OS 3-era screen modes (or if you try using a <256 colour mode on an Iyonix without Aemulor running).

Here's a brief summary of my changes:

* Fixed it to compile with GCC 4. IIRC the only thing that needed doing was gas'ing the assembler files, although I later disabled the assembler code anyway (it didn't seem to result in any major performance gains, and I wanted to mess around with the layout of the ARMul_State struct)
* Made the ArcEmKey module 32bit compatible. The new code uses MRS & MSR unconditionally, so will fail on pre-ARMv3 (not that there's much point in running an ARMv2 emulator on an ARMv2 machine)
* Stripped out all the code relating to 32bit CPU modes and got rid of all the PC/PSR state shadowing.
  I'm not sure how much of a performance gain this actually gave, but it certainly made the code a lot easier to work with.
* Changed the instruction decode/execute to not have 16 instances of each execution function (one per condition code). Apart from significantly reducing compile times & resulting ELF size, this also gave around a 25% performance boost. Condition codes are now evaluated with a simple lookup table instead.
* Stripped out a fair amount of other redundant code/state/variables (although I suspect there's still a lot more to go)
* Rearranged the ARMul_State struct to move the most commonly used members near the start. This resulted in tighter code, better D-cache utilisation, etc.
* Rewrote the memory access code. The old code wasn't particularly great (linear pagetable searches!), so I worked quite hard on finding a fairly optimal solution for ARM. Unfortunately this means it isn't particularly portable, but it shouldn't be too hard to introduce a more portable variant for other platforms. The new code is called 'FastMap'; I'll give some more detail on it below.
* Changed the instruction fetch code to decode the instruction at fetch time. This will result in a performance loss in some situations (a branch instruction just before some frequently-changing data), but it makes it a lot easier to deal with resetting the cached execute function when memory writes occur (something that I'm not entirely sure the old code dealt with properly, judging by the state of the code and the comments that were at the top of armemu.c)
* Changed the main execute loop to store the instruction pipeline in an array instead of shuffling it through different variables all the time.
* Had a quick play with the instruction execution functions. In particular I noticed that the C version of GetDPRegRHS was being inlined, but its complexity meant the inlining wasn't really doing much except hurt performance.
  So instead I swapped it for a much simpler version that uses a function pointer table to call one of 8 simple sub-variants (4 for constant shifts, 4 for register shifts).
* Changed all the code to access the ARMul_State via the 'state' pointer rather than the 'statestr' global. Previously a lot of the code was simply ignoring the state pointer that was being passed to it, but making it start using it resulted in a nice speed gain. I'm not sure what would happen if the state pointer was removed entirely (i.e. if everything was forced to use the global statestr); some experiments I performed on changing MEMC to use a local pointer only made performance worse, not better.
* Reduced the mouse/keyboard poll frequency from once every 125 cycles to once every 12500(ish) cycles (see the #ifdef __riscos__ bits in DisplayKbd_Poll() in arch/DispKbdShared.c). I guess whoever wrote that code didn't realise that it would get called so often! I haven't checked what the real mouse/keyboard update rate should be, but I suspect the rate could be slowed further without causing any problems (the current rate would update it 640 times per second, assuming an 8MHz ARM2).

I think the next step with the optimisations should be to make the instruction decoding more aggressive, which will introduce more execution functions, but make each function simpler. The way that each memory location caches the last execute function means that this should result in a nice performance gain, since over 99% of all executed memory locations will have been executed in the past, except in extreme cases like very tight self-modifying loops.

But introducing more execute functions has a danger of hurting I-cache performance (just like the condition code variants did), so it might be worth looking into splitting the execute stage into two or three stages. E.g. instead of having a function pointer you have 2 or 3 function pointers (or indices into function pointer arrays, to save memory).
Since most instructions are likely to be data processing instructions or load/store instructions this will work quite nicely - the first execution stage can be to decode the RHS, while the second stage performs the computation/memory access. With any luck a rewrite like that will get us to ARM2 speeds on an Iyonix (or at least on a Beagleboard), all without the hassle of adding a dynamic recompiler or having to maintain assembler sources.

Now, about the FastMap code:

This operates by building a lookup table (MEMC.FastMap) to map each 4K page in the 64MB memory map to a FastMapEntry struct. Each FastMapEntry contains the page access flags, memory pointer (if direct access is possible), and function pointer (if direct access is not possible), all within 8 bytes. The fact that the access flags and memory pointer have been squeezed into 4 bytes is why it isn't entirely portable. This arrangement means that all ROM & RAM reads can go directly via the memory pointer, while I/O access, or writes to the first 512K of RAM, go via the function pointer. All access is done via a series of inlined functions, resulting in minimal overhead for situations where the memory pointer is used.

To cope with instruction fetches (and RAM writes) requiring access to the instruction decode cache, all executable memory has been coalesced into one big allocation. This means that once you have a pointer to a memory location you just need to apply one offset to get a pointer to the corresponding cached ARMEmuFunc.

A few extra points:

* FastMap_SetEntries offsets the data pointer by the address, so FastMap_Log2Phy is just a simple shift + addition (although when I looked at some code GCC had produced it was still using 3 instructions to do what could have been done in one)
* state->NtransSig is mostly redundant, since all memory access checks are performed using MEMC.FastMapMode (as computed by FastMap_RebuildMapMode).
* Inlining of the ARMul_Load/Store functions can be disabled by commenting out FASTMAP_INLINE in armarc.h
* ARMul_LoadInstrTriplet, and the LDM/STM code, directly access the data pointer for speed whenever they can be certain that the access won't cross a page boundary. This resulted in a nice gain for LoadInstrTriplet, but didn't seem to result in too much of a gain for LDM/STM. Or maybe my test case wasn't that good, since during startup RISC OS seems to write to the first 512K of RAM quite a bit, which will prevent the STM optimisation from being used.
* The MHz readout is provided by the RefreshDisplay() function in riscos-single/DispKbd.c
* The MIPS timing is provided by some slightly hacky code in ARMul_Emulate26 and ARMul_DoProg (although it's currently disabled, so the emulator doesn't exit after the first 50 million instructions)
* There are also likely to be a few remnants of other bits of my debugging code around the place (e.g. the code in ARMul_Emulate26 to log the address & value of each instruction being executed)

I doubt I'll be looking at ArcEm much more in the near future, so I'll leave it to you to decide what (if anything) to do with these changes. I haven't tried them on anything other than RISC OS, so it wouldn't surprise me if some of the other platforms fail to compile or are fundamentally broken.

If ArcEm is to be made usable on RISC OS then at some point we'll also want to get rid of the DIRECT_DISPLAY code, since one of the major problems with running old games on new machines is that the machine/MDF/monitor is incapable of providing the required screen modes. An optimised blitter would therefore be required, and could also allow us to support palette splitting (especially if blitting to a 16bpp or 32bpp buffer, which would also allow use of hardware scaling on OMAP). To cope with this eventuality I've been careful not to break any of the cycle counting that ArcEm does (although I haven't verified that it's still 100% correct).

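To make the FastMap description above a bit more concrete, here's a minimal sketch of the general idea in C. This is not the actual ArcEm code: the names (MapEntry, map_set, load_word, store_word, counting_slow), the flag layout (two flag bits in the bottom of the pointer word) and the use of uintptr_t are all assumptions made for illustration, and unlike the real FastMapEntry it makes no attempt to fit into 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                                /* 4K pages */
#define ADDR_BITS  26u                                /* 64MB memory map */
#define NUM_PAGES  (1u << (ADDR_BITS - PAGE_SHIFT))
#define ADDR_MASK  ((1u << ADDR_BITS) - 1u)

/* Hypothetical flag layout: the two low bits of the pointer word */
#define FLAG_DIRECT_READ  1u
#define FLAG_DIRECT_WRITE 2u
#define FLAG_MASK         3u

/* Fallback used when an access can't go straight through the pointer */
typedef uint32_t (*SlowAccessFn)(uint32_t addr, uint32_t value, int is_write);

typedef struct {
    uintptr_t    flags_and_base;  /* (host pointer - guest address) | flags */
    SlowAccessFn slow;
} MapEntry;

static MapEntry map[NUM_PAGES];   /* one entry per 4K page, zero = unmapped */

/* Map 'size' bytes of guest address space at guest_addr onto 'host'.
   Both guest_addr and size must be multiples of the page size. */
void map_set(uint32_t guest_addr, uint32_t size, uint32_t *host,
             unsigned flags, SlowAccessFn slow)
{
    /* Pre-subtract the guest address, so a lookup is just mask + add */
    uintptr_t base  = (uintptr_t)host - guest_addr;
    uint32_t  first = guest_addr >> PAGE_SHIFT;
    uint32_t  count = size >> PAGE_SHIFT;
    uint32_t  i;

    assert(((guest_addr | size) & ((1u << PAGE_SHIFT) - 1u)) == 0);
    assert((base & FLAG_MASK) == 0);      /* low bits must be free for flags */
    assert(first + count <= NUM_PAGES);

    for (i = 0; i < count; i++) {
        map[first + i].flags_and_base = base | (flags & FLAG_MASK);
        map[first + i].slow = slow;
    }
}

/* Strip the flags and add the (word aligned) guest address */
static inline uint32_t *entry_ptr(const MapEntry *e, uint32_t addr)
{
    return (uint32_t *)((e->flags_and_base & ~(uintptr_t)FLAG_MASK) + addr);
}

static inline uint32_t load_word(uint32_t addr)
{
    const MapEntry *e;
    addr &= ADDR_MASK & ~3u;
    e = &map[addr >> PAGE_SHIFT];
    if (e->flags_and_base & FLAG_DIRECT_READ)
        return *entry_ptr(e, addr);                /* fast path */
    return e->slow ? e->slow(addr, 0, 0) : 0;      /* I/O, unmapped, etc. */
}

static inline void store_word(uint32_t addr, uint32_t value)
{
    const MapEntry *e;
    addr &= ADDR_MASK & ~3u;
    e = &map[addr >> PAGE_SHIFT];
    if (e->flags_and_base & FLAG_DIRECT_WRITE)
        *entry_ptr(e, addr) = value;               /* fast path */
    else if (e->slow)
        e->slow(addr, value, 1);                   /* e.g. protected RAM, I/O */
}

/* Example slow handler: just counts how often it gets called */
unsigned slow_calls;

uint32_t counting_slow(uint32_t addr, uint32_t value, int is_write)
{
    (void)addr; (void)value; (void)is_write;
    slow_calls++;
    return 0;
}
```

A page mapped with only FLAG_DIRECT_READ gets direct reads while its writes are routed to the handler, which is roughly the arrangement described for the first 512K of RAM. The sketch uses a mask where the description of FastMap_Log2Phy mentions a shift, so treat the exact bit layout as a guess.
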
Cheers,

- Jeffrey

_______________________________________________
arcem-devel mailing list
arcem-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/arcem-devel