I'm looking at why x86 runs so much slower than Alpha on O3 (4x the ticks), and I think one culprit is the set of dependencies created by the condition code bits of the flags register. Many x86 instructions modify or depend on those bits, and even though the condition codes are separated out from the rest of the flags register (which does a lot of other stuff too), they're still updated with a read-modify-write sort of mechanism. I expect that sets up long chains of serializing dependencies, which kills parallelism and performance.
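To make the serialization concrete, here's a toy Python sketch (not M5 code). The `Inst` record, the `cc_chain_depth` function and the five-instruction program are all made up for illustration, and only dependencies through the condition codes are modeled; ordinary register and memory dependencies are ignored.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    writes_cc: bool  # updates at least one condition code bit
    reads_cc: bool   # architecturally consumes a condition code bit (ADC, Jcc, ...)


def cc_chain_depth(insts, read_modify_write):
    """Longest dependency chain that runs through the condition code register.

    With read_modify_write=True, every instruction that writes the condition
    codes also reads the previous value, so it depends on the previous writer.
    """
    last_writer_depth = 0  # chain depth of the most recent CC producer
    longest = 0
    for inst in insts:
        depends_on_cc = inst.reads_cc or (read_modify_write and inst.writes_cc)
        depth = last_writer_depth + 1 if depends_on_cc else 1
        if inst.writes_cc:
            last_writer_depth = depth
        longest = max(longest, depth)
    return longest


# Hypothetical program: four ADDs on unrelated registers, then a branch on ZF.
prog = [Inst(f"add{i}", writes_cc=True, reads_cc=False) for i in range(4)]
prog.append(Inst("jz", writes_cc=False, reads_cc=True))

rmw_depth = cc_chain_depth(prog, read_modify_write=True)
plain_depth = cc_chain_depth(prog, read_modify_write=False)
print(rmw_depth, plain_depth)
```

In this model the four otherwise independent ADDs and the branch form a single chain of five when the update is read-modify-write, versus a chain of two if a writer didn't need the old value.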
There are 6 condition codes in x86: Z, C, A, S, P and O, or zero, carry, auxiliary carry, sign, parity and overflow. In M5's implementation (and in the patent I patterned it after) there are also artificial "emulation" zero and carry flags that work like the regular ones but are maintained separately. They can be updated independently and checked separately, and are useful behind the scenes when implementing some macroops. Instructions may update all of these flags or only some of them. The PTLSim manual claims that there's a "ZAPS" rule where the zero, auxiliary carry, parity and sign bits are always updated together. That's usually true, but certain instructions change only the zero flag. CMPXCHG8B is an example.

What I'd been thinking of doing to handle this is to further split up the condition code bits into separate registers that can be renamed independently. There are a couple of issues with that, though.

First, it looks like there'd have to be 6 different registers: APS, Z, O, C, EZ and EC. A non-trivial number of instructions would need to update 4 or more of those, putting a perhaps unrealistic burden on any rename mechanism. That would also make the simple CPUs slower because they'd have to read/write all those extra registers. Bread and butter x86 tends to be condition code happy, so that could be a significant slowdown.

Also, that complicates decoding significantly. Conceptually it's easy to imagine reading/writing only the registers with the bits you need, but with the ISA parser, the code needs to either be there or not be there. If you have code that's never used but accesses a register, that register will still get pulled in as a source or dest. That means there would need to be a hard coded version of every microop for each possible combination of condition code registers. Since there are 6 registers, that's 2^6 combinations, times 2 variants for partial or complete register writes, so 2^7 or 128 versions of every microop.
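As a quick sanity check on that count, here's a small Python sketch that enumerates the combinations. The class naming scheme and the "partial"/"complete" labels are hypothetical and are not what the ISA parser would actually generate.

```python
from itertools import product

# The six separately renamed condition code registers from the proposal.
CC_REGS = ("APS", "Z", "O", "C", "EZ", "EC")
# Hypothetical labels for partial vs. complete destination register writes.
WRITE_KINDS = ("partial", "complete")


def microop_variants(base):
    """Return one made-up class name per (register subset, write kind) pair."""
    names = []
    for mask in product((False, True), repeat=len(CC_REGS)):
        used = [reg for reg, on in zip(CC_REGS, mask) if on]
        suffix = "_".join(used) if used else "NoCC"
        for kind in WRITE_KINDS:
            names.append(f"{base}_{suffix}_{kind}")
    return names


variants = microop_variants("Add")
print(len(variants))  # 2**6 subsets * 2 write kinds = 128
```

Every subset of the six registers shows up once per write kind, which is where the 2^6 * 2 = 128 figure comes from.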
There are also register/immediate versions of many microops, so we would likely end up with thousands of microop classes. We'd also need to generate selection functions that would pick which variant to use. This is all possible, but fairly ugly and clunky.

So does anybody have any suggestions on how to unserialize these microops? I found a paper here: http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf that claims IPC for x86 CPUs is significantly worse than for other ISAs specifically because of this sort of thing. Is this just a fact of life with x86? Would fixing it be not only very annoying but also unrealistic? Is that paper's claim actually true?

Gabe
_______________________________________________
m5-dev mailing list
http://m5sim.org/mailman/listinfo/m5-dev
