[m5-dev] Condition code bits in X86 O3

Gabe Black Sun, 06 Feb 2011 02:37:30 -0800

I'm looking at why x86 goes so much slower than Alpha on O3 (4x the
ticks), and I think one culprit is the dependencies set up by the
condition code bits of the flags register. Many x86 instructions modify
or depend on those bits, and even though the condition codes are
separated out from the flags register (which does a lot of other stuff
too), they're being updated with a read-modify-write sort of mechanism.
I expect that's setting up long chains of serializing dependencies,
which is killing parallelism and performance.
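
As a rough illustration of that suspicion (a toy model, not M5 code), the
sketch below counts the longest dependency chain through a single
condition code register. The function name and the (reads, writes) tuple
encoding are made up for this example, and only flag dependencies are
modelled; data dependencies are ignored.

```python
def chain_depth(ops, read_modify_write):
    """Longest dependency chain through one condition-code register.

    ops is a list of (reads_flags, writes_flags) booleans, one per microop.
    This is a hypothetical toy model: it ignores data dependencies.
    """
    depth_of_last_writer = 0
    longest = 0
    for reads, writes in ops:
        # With read-modify-write, every writer also reads the old value,
        # so it depends on the previous writer even if it ignores the flags.
        depends = reads or (writes and read_modify_write)
        depth = depth_of_last_writer + 1 if depends else 1
        if writes:
            depth_of_last_writer = depth
        longest = max(longest, depth)
    return longest


# Eight flag-writing microops that never look at the flags themselves.
writers = [(False, True)] * 8
rmw_depth = chain_depth(writers, read_modify_write=True)
full_write_depth = chain_depth(writers, read_modify_write=False)
```

Under these assumptions the eight writers form one chain of length 8 when
each update is a read-modify-write, and eight independent chains of
length 1 when each update overwrites the whole register.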


Basically, there are 6 condition codes in x86: Z, C, A, S, P, O, or zero,
carry, auxiliary carry, sign, parity and overflow. In M5's
implementation (and in the patent I patterned it after) there are also
artificial "emulation" zero and carry flags that work like the regular
ones but are maintained separately. They can be updated independently
and checked separately, and are useful behind the scenes when
implementing some macroops. Instructions may update all of these flags
or only some of them. The PTLSim manual claims that there's a "ZAPS"
rule where the zero, auxiliary carry, parity and sign bits are always
updated together. That's usually true, but certain instructions change
only the zero flag. CMPXCHG8B is an example.
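
To make the partial-update case concrete, here is a small sketch using
the architectural bit positions of the six status flags in EFLAGS. The
helper name merge_flags is hypothetical, and the emulation flags are
left out because their positions are an M5 implementation detail.

```python
# Architectural positions of the x86 status flags in EFLAGS.
CF = 1 << 0   # carry
PF = 1 << 2   # parity
AF = 1 << 4   # auxiliary carry
ZF = 1 << 6   # zero
SF = 1 << 7   # sign
OF = 1 << 11  # overflow

ZAPS = ZF | AF | PF | SF
ALL_FLAGS = ZAPS | CF | OF


def merge_flags(old, computed, mask):
    """Read-modify-write update: keep old bits outside mask, take new ones inside.

    Hypothetical helper for illustration; note that it needs the old value
    whenever mask does not cover every flag.
    """
    return (old & ~mask) | (computed & mask)


# An instruction like CMPXCHG8B writes only ZF, so every other flag
# has to be carried over from the previous value.
after = merge_flags(old=CF | SF, computed=ZF | PF, mask=ZF)
```

The dependence on the old value is exactly what disappears when the mask
covers the whole register, which is why the width of each update matters
for renaming.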

What I'd been thinking of doing to handle this is to further split up
the condition code bits into separate registers to be managed
independently for any register renaming. There are a couple of issues
with that, though. First, it looks like there'd have to be 6 different
registers, APS, Z, O, C, EZ, and EC. A non-trivial number of
instructions would need to update 4 or more of those, putting a perhaps
unrealistic burden on any rename mechanism. That would also make the
simple CPUs slower because they'd have to read/write all those extra
registers. Bread and butter x86 tends to be condition code happy, so
that could be a significant slowdown.
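
A quick sketch of the rename burden under that split, assuming the six
registers named above. The mapping table and the function
dest_registers are hypothetical names for this example; the flag sets
for ADD and INC are the architectural ones.

```python
# Which of the proposed registers would hold each flag (assumed grouping
# from the text: A, P and S share one register, the rest are separate).
REGISTER_OF_FLAG = {
    "A": "APS", "P": "APS", "S": "APS",
    "Z": "Z", "O": "O", "C": "C",
    "EZ": "EZ", "EC": "EC",
}


def dest_registers(flags_written):
    """Return the sorted list of condition-code registers an op would write."""
    return sorted({REGISTER_OF_FLAG[flag] for flag in flags_written})


# ADD writes all six architectural flags; INC writes all of them except C;
# CMPXCHG8B writes only Z.
add_dests = dest_registers(["O", "S", "Z", "A", "C", "P"])
inc_dests = dest_registers(["O", "S", "Z", "A", "P"])
cmpxchg8b_dests = dest_registers(["Z"])
```

With this grouping an ordinary ADD already has four extra destination
registers on top of its data result, which is the burden described above.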

Also, that complicates decoding significantly. Conceptually it's easy to
imagine reading/writing the registers with the bits you need, but with
the ISA parser, the code needs to either be there or not be there. If
you have code that's never used but accesses a register, it'll still get
pulled in as a source or dest. That means there would need to be a
hard-coded version of every microop for each possible combination of
condition code registers. Since there would be 6 registers, that's 2^6
combinations, times 2 variants for partial or complete register writes,
so 2^7 or 128 versions of every microop. There are also
register/immediate versions of many microops, so we would likely end up
with thousands of microop classes.
We'd also need to generate selection functions that would pick which
variant to use. This is all possible, but fairly ugly and clunky.
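
The variant count can be checked by enumerating the combinations. This
is only a counting sketch: the function name and the tuple layout are
invented here, and the real ISA parser would generate classes rather
than tuples.

```python
from itertools import product

# The six proposed condition-code registers from the text.
CC_REGISTERS = ("APS", "Z", "O", "C", "EZ", "EC")


def microop_variants(operand_forms=("reg",)):
    """Enumerate the hard-coded variants one microop would need.

    Each variant is (written, partial, form): which CC registers it writes,
    whether the register write is partial or complete, and the operand form.
    Hypothetical layout, used only for counting.
    """
    variants = []
    for written in product((False, True), repeat=len(CC_REGISTERS)):
        for partial in (False, True):
            for form in operand_forms:
                variants.append((written, partial, form))
    return variants


per_microop = len(microop_variants())                    # 2**6 * 2
with_immediates = len(microop_variants(("reg", "imm")))  # doubled again
```

That gives 128 variants per microop, or 256 for a microop that also has
an immediate form, so a few dozen microops is enough to reach thousands
of classes.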

So does anybody have any suggestions on how to unserialize these
microops? I found a paper here:
http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf
that claims IPC for x86 CPUs is significantly worse than for other ISAs
specifically because of this sort of thing. Is this just a fact of life
with x86? Would fixing it be not only very annoying but also
unrealistic? Is that paper's claim actually true?

Gabe
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
