Of course. I haven't really thought about it very much yet beyond what was in my earlier email, but when I do I'll be sure to keep you (and this list) in the loop.
Gabe

On 02/13/11 20:04, Steve Reinhardt wrote:
> Hi Gabe,
>
> I just got around to reading this... please fill me in with more
> design details as you work on this, as I'd like to keep on top of what
> you're doing and (perhaps) be in a position to offer some suggestions.
>
> Thanks,
>
> Steve
>
> On Fri, Feb 11, 2011 at 4:16 PM, Gabriel Michael Black
> <gbl...@eecs.umich.edu <mailto:gbl...@eecs.umich.edu>> wrote:
>
> Hello again. I've had a chance to talk with an expert, and I have an
> idea of how to approach this. It's going to require more flexibility
> than the ISA parser has currently, though, specifically in how the
> lists of source and destination registers are managed. It would also
> be nice to have a more integrated idea of composite operands, i.e.
> ones where some bits come from here, some from there, and in the end
> it builds a single uint64_t, double precision float, vector of
> uint32_ts, etc.
>
> Rather than try to shoehorn this into a system that's already
> suffered enough of my abuse, aka the ISA description language, I'm
> going to attempt to build a parallel facility for defining
> instructions usable from inside the Python in "let" blocks. Basically
> it would be Python classes, functions, etc. (hopefully not that many)
> exported into the let block context that would allow more direct
> interaction with the parser's guts, and more control over how things
> are put together.
>
> In the future I'd like to see this bud into isa_parser2.py, but
> that's going to be a lot of work and is a somewhat orthogonal issue.
> Ideally this sort of thing will also make it easier to split output
> into smaller files.
>
> Gabe
>
> Quoting Gabe Black <gbl...@eecs.umich.edu
> <mailto:gbl...@eecs.umich.edu>>:
>
> I'm looking at why x86 goes so much slower than Alpha on O3 (4x the
> ticks), and I think one culprit is the dependencies set up by the
> condition code bits of the flags register.
> Many instructions in x86 modify or depend on those bits, and even
> though the condition codes are separated out from the flags register
> (which does a lot of other stuff too), they're being updated with a
> read-modify-write sort of mechanism. I expect that's setting up long
> chains of serializing dependencies, which is killing parallelism and
> performance.
>
> Basically, there are 6 condition codes in x86: Z, C, A, S, P, and O,
> or zero, carry, auxiliary carry, sign, parity and overflow. In M5's
> implementation (and in the patent I patterned it after) there are
> also artificial "emulation" zero and carry flags that work like the
> regular ones but are maintained separately. They can be updated
> independently and checked separately, and are useful behind the
> scenes when implementing some macroops. Instructions may update all
> of these flags or only some of them. The PTLSim manual claims that
> there's a "ZAPS" rule where the zero, auxiliary carry, parity and
> sign bits are always updated together. That's usually true, but
> certain instructions change only the zero flag. CMPXCHG8B is an
> example.
>
> What I'd been thinking of doing to handle this is to further split up
> the condition code bits into separate registers to be managed
> independently for any register renaming. There are a couple of issues
> with that, though. First, it looks like there'd have to be 6
> different registers: APS, Z, O, C, EZ, and EC. A non-trivial number
> of instructions would need to update 4 or more of those, putting a
> perhaps unrealistic burden on any rename mechanism. That would also
> make the simple CPUs slower because they'd have to read/write all
> those extra registers. Bread and butter x86 tends to be condition
> code happy, so that could be a significant slowdown.
>
> Also, that complicates decoding significantly.
> Conceptually it's easy to imagine reading/writing the registers with
> the bits you need, but with the ISA parser, the code needs to either
> be there or not be there. If you have code that's never used but
> accesses a register, it'll still get pulled in as a source or dest.
> That means there would need to be a hard coded version of every
> microop that would correspond to each possible combination of
> condition code bits. Since there are 6 bits, that's 2^6 combinations,
> times 2 variants for partial or complete register writes, so 2^7 or
> 128 versions of every microop. There are also register/immediate
> versions of many microops. We would likely end up with thousands of
> microop classes. We'd also need to generate selection functions that
> would pick which variant to use. This is all possible, but fairly
> ugly and clunky.
>
> So does anybody have any suggestions on how to unserialize these
> microops? I found a paper here:
>
> http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf
>
> that claims IPC for x86 CPUs is significantly worse than for other
> ISAs specifically because of this sort of thing. Is this just a fact
> of life with x86? Would fixing it be not only very annoying but also
> unrealistic? Is that paper's claim actually true?
>
> Gabe
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org <mailto:m5-dev@m5sim.org>
> http://m5sim.org/mailman/listinfo/m5-dev
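
For anyone following the arithmetic in the original message, here is a minimal sketch of the variant count: every subset of the six proposed flag registers, times the two register write modes. The names FLAG_REGS, WRITE_MODES and microop_variants are made up for illustration and are not part of the ISA parser.

```python
from itertools import product

# Hypothetical names for the six separately renamed flag registers
# listed in the thread (APS groups auxiliary carry, parity and sign).
FLAG_REGS = ("APS", "Z", "O", "C", "EZ", "EC")

# The two register write modes mentioned in the thread.
WRITE_MODES = ("partial", "complete")


def microop_variants(flag_regs=FLAG_REGS, write_modes=WRITE_MODES):
    """Return every (flags_touched, write_mode) pair for one microop."""
    variants = []
    # Each flag register is either accessed by the variant or not.
    for mask in product((False, True), repeat=len(flag_regs)):
        touched = tuple(r for r, used in zip(flag_regs, mask) if used)
        for mode in write_modes:
            variants.append((touched, mode))
    return variants


variants = microop_variants()
print(len(variants))  # 2**6 * 2 = 128
```

Multiplying again by the register/immediate forms is what pushes the total toward the "thousands of microop classes" estimate.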
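
As a rough illustration of the serialization concern raised at the start of the thread, the toy model below compares the longest flag dependency chain when all condition codes live in one register updated by read-modify-write against the chain when each flag has its own register. It is only a sketch under stated assumptions: critical_path, OPS and the register name "CC" are hypothetical, ordinary data-register dependencies are ignored, and in the split case a write is assumed to overwrite its register completely. It does not model how O3 actually renames registers.

```python
def critical_path(ops, split):
    """Length of the longest dependency chain through the flag state.

    ops   -- list of (reads, writes) pairs, each a set of flag names
    split -- False: one flags register, so every write is a
             read-modify-write that depends on the previous writer;
             True: one register per flag, and a write fully replaces it
    """
    depth = {}      # register name -> chain depth of its last writer
    longest = 0
    for reads, writes in ops:
        if split:
            srcs = set(reads)
            dsts = set(writes)
        else:
            # Single register: any flag access touches "CC", and a write
            # must first read the old value to preserve the other bits.
            srcs = {"CC"} if (reads or writes) else set()
            dsts = {"CC"} if writes else set()
        d = 1 + max((depth.get(r, 0) for r in srcs), default=0)
        for r in dsts:
            depth[r] = d
        longest = max(longest, d)
    return longest


# Three flag-writing microops followed by one that reads Z.
OPS = [
    (set(), {"Z", "C"}),
    (set(), {"Z"}),
    (set(), {"C", "O"}),
    ({"Z"}, set()),
]

print(critical_path(OPS, split=False))  # 4: every op waits on the previous one
print(critical_path(OPS, split=True))   # 2: only the reader waits on a writer
```

The gap between the two numbers is the parallelism that the single read-modify-write register hides in this toy sequence. The cost noted in the thread, more registers to rename and to read/write per instruction, is not captured here.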