Hi Gabe, I just got around to reading this... please fill me in with more design details as you work on this, as I'd like to keep on top of what you're doing and (perhaps) be in a position to offer some suggestions.
Thanks,
Steve

On Fri, Feb 11, 2011 at 4:16 PM, Gabriel Michael Black <[email protected]> wrote:

> Hello again. I've had a chance to talk with an expert, and I have an idea
> of how to approach this. It's going to require more flexibility than the ISA
> parser has currently, though, specifically in how the lists of source and
> destination registers are managed. It would also be nice to have a more
> integrated idea of composite operands, i.e. ones where some bits come from
> here, some from there, and in the end it builds a single uint64_t, double
> precision float, vector of uint32_ts, etc.
>
> Rather than try to shoehorn this into a system that's already suffered
> enough of my abuse, aka the ISA description language, I'm going to attempt
> to build a parallel facility for defining instructions usable from inside
> the python in "let" blocks. Basically it would be python classes, functions,
> etc., (hopefully not that many) exported into the let block context that
> would allow more direct interaction with the parser's guts, and more control
> over how things are put together.
>
> In the future I'd like to see this bud into isa_parser2.py, but that's
> going to be a lot of work and is a somewhat orthogonal issue. Ideally this
> sort of thing will also make it easier to split output into smaller files.
>
> Gabe
>
>
> Quoting Gabe Black <[email protected]>:
>
>> I'm looking at why x86 goes so much slower than Alpha on O3 (4x the
>> ticks), and I think one culprit is the dependencies set up by the condition
>> code bits of the flags register. Many instructions in x86 modify or
>> depend on those bits, and even though the condition codes are separated
>> out from the flags register (which does a lot of other stuff too),
>> they're being updated with a read-modify-write sort of mechanism. I
>> expect that's setting up long chains of serializing dependencies which
>> is killing parallelism and performance.
>>
>> Basically, there are 6 condition codes in x86, Z, C, A, S, P, O or zero,
>> carry, auxiliary carry, sign, parity and overflow. In M5's
>> implementation (and in the patent I patterned it after) there are also
>> artificial "emulation" zero and carry flags that work like the regular
>> ones but are maintained separately. They can be updated independently
>> and checked separately, and are useful behind the scenes when
>> implementing some macroops. Instructions may update all of these flags
>> or only some of them. The PTLSim manual claims that there's a "ZAPS"
>> rule where the zero, auxiliary carry, parity and sign bits are always
>> updated together. That's usually true, but certain instructions change
>> only the zero flag. CMPXCHG8B is an example.
>>
>> What I'd been thinking of doing to handle this is to further split up
>> the condition code bits into separate registers to be managed
>> independently for any register renaming. There are a couple of issues
>> with that, though. First, it looks like there'd have to be 6 different
>> registers, APS, Z, O, C, EZ, and EC. A non-trivial number of
>> instructions would need to update 4 or more of those, putting a perhaps
>> unrealistic burden on any rename mechanism. That would also make the
>> simple CPUs slower because they'd have to read/write all those extra
>> registers. Bread and butter x86 tends to be condition code happy, so
>> that could be a significant slowdown.
>>
>> Also, that complicates decoding significantly. Conceptually it's easy to
>> imagine reading/writing the registers with the bits you need, but with
>> the ISA parser, the code needs to either be there or not be there. If
>> you have code that's never used but accesses a register, it'll still get
>> pulled in as a source or dest. That means there would need to be a hard
>> coded version of every microop that would correspond to each possible
>> combination of condition code bits.
>> Since there are 6 bits, that's 2^6,
>> plus 2 variants for partial or complete register writes, so 2^7 or 128
>> versions of every microop. There are also register/immediate versions of
>> many microops. We would likely end up with thousands of microop classes.
>> We'd also need to generate selection functions that would pick which
>> variant to use. This is all possible, but fairly ugly and clunky.
>>
>> So does anybody have any suggestions on how to unserialize these
>> microops? I found a paper here:
>> http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf
>> that claims IPC for x86 CPUs is significantly worse than other ISAs
>> specifically because of this sort of thing. Is this just a fact of life
>> with x86? Would fixing it be not only very annoying but also
>> unrealistic? Is that paper's claim actually true?
>>
>> Gabe
>> _______________________________________________
>> m5-dev mailing list
>> [email protected]
>> http://m5sim.org/mailman/listinfo/m5-dev
>>
>>
>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
>
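
As a rough check of the arithmetic in the quoted message, here is a minimal Python sketch that enumerates the microop variants it describes. The six register names (APS, Z, O, C, EZ, EC) and the partial/complete and register/immediate splits come from the message. The `flag_subsets` and `variant_names` helpers and the class-naming scheme are hypothetical, made up for illustration, and are not part of the ISA parser.

```python
from itertools import combinations

# Flag groups named in the quoted message; each would become its own register.
FLAG_REGS = ("APS", "Z", "O", "C", "EZ", "EC")


def flag_subsets(regs=FLAG_REGS):
    """Yield every subset of flag registers a microop variant might touch."""
    for size in range(len(regs) + 1):
        yield from combinations(regs, size)


def variant_names(microop, forms=("reg",)):
    """Hypothetical naming scheme: one class per (form, width, flag subset)."""
    names = []
    for form in forms:
        for width in ("partial", "complete"):
            for subset in flag_subsets():
                suffix = "_".join(subset) if subset else "none"
                names.append(f"{microop}_{form}_{width}_{suffix}")
    return names


if __name__ == "__main__":
    print(len(list(flag_subsets())))                          # 2^6 = 64
    print(len(variant_names("add")))                          # 2^7 = 128
    print(len(variant_names("add", forms=("reg", "imm"))))    # 256
```

With both register and immediate forms, a single microop already needs 256 hard-coded classes under this scheme, so a few dozen microops would reach the "thousands of microop classes" the message anticipates.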
