Steve Reinhardt wrote:
> (Getting this back on the list again...)
>
> On Tue, Aug 25, 2009 at 10:32 PM, Gabe Black<gbl...@eecs.umich.edu> wrote:
>> Steve Reinhardt wrote:
>>> On Tue, Aug 25, 2009 at 9:46 PM, Gabe Black<gbl...@eecs.umich.edu> wrote:
>>>> Steve Reinhardt wrote:
>>>>> On Tue, Aug 25, 2009 at 6:56 PM, Gabriel Michael Black<gbl...@eecs.umich.edu> wrote:
>>>>>> That actually gives rise to one of the potential optimizations I mentioned before. If some of the work of getting from bytes to StaticInsts can be delayed until after the ExtMachInst conversion, for instance until the ExtMachInst is used to construct the EmulEnv object or even in the microop constructors, it would only happen if the decode cache missed and potentially contribute less to the overall run time. I looked at it again recently and nothing like that jumped out, but it might be there if someone looked hard enough. A tricky option would be figuring out how much immediate and/or displacement to read in with less work, since that's based on a lot of different factors.
>>>>>
>>>>> What about the instruction page cache? I thought our summer intern from a few years back added a shadow-page-like struct that cached the StaticInst objects for a page according to PC. For x86 you'd have to make this byte-oriented rather than word-oriented, but the nice thing is that, assuming you're also keeping the original byte sequence along with the ExtMachInst, all you have to check is that the byte sequence matches what's in the actual instruction page.
>>>>
>>>> I think what happens is that it uses the PC and compares the ExtMachInsts generated this time and the time it was cached. You'd have to do that, since you wouldn't want, for instance, the PAL version of one instruction to be returned when trying to decode the non-PAL version, even if the actual bytes in memory are the same. I think the general rule is that the ExtMachInst must be different if the end StaticInst is different, and since I followed that rule it all just works out, even for x86.
>>>
>>> Right, I recall that now. Your original comment makes more sense: you really want the ExtMachInst to be just the original byte stream plus any necessary mode info (like PAL mode for Alpha), and not the half-decoded thing you have now. I think that's a great goal to keep in mind if we do dive into a more thorough restructuring.
>>>
>>> Steve
>>
>> Yeah. Unfortunately, figuring out how much immediate and/or displacement to read in, something that usually partially determines the length of an instruction, seems to require the partial decode. The instructions that require either, and the sizes they need, seem almost random, which is why I have some lookup tables in there. I think real CPUs approximate and then make instructions fix up the PC if they know the quick answer is wrong. If we find a way to get around that, I think we might actually be able to get it in there without any other changes. I was thinking before that we could do the extra processing after an early cache lookup but before decode, but that's not really necessary since there are already steps inside the decoder that could handle some of it.
>
> As far as the cache lookup, if you've already got a cached instruction then that will tell you how many bytes you need to look at to validate the cached object. The only hiccup I see is that if the cache lookup fails, you may need to iteratively build the ExtMachInst as you decode; I don't know if that's different than what happens now or not.
>
> Steve
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org
> http://m5sim.org/mailman/listinfo/m5-dev
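(To recap the scheme in the quoted thread in code: a rough, purely illustrative sketch of a cache indexed by PC and validated by comparing ExtMachInsts, so that, say, the PAL and non-PAL decodings of the same bytes never alias. The types here are hypothetical stand-ins, not the real m5 classes.)

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for the real ExtMachInst: raw bytes plus mode info.
struct ExtMachInst {
    uint64_t bytes;   // raw instruction bytes (simplified to one word)
    bool palMode;     // mode info folded in, e.g. Alpha PAL mode
    bool operator==(const ExtMachInst &o) const {
        return bytes == o.bytes && palMode == o.palMode;
    }
};

struct StaticInst { /* decoded instruction; details elided */ };

struct CacheEntry {
    ExtMachInst emi;                   // what we decoded last time
    std::shared_ptr<StaticInst> inst;  // the cached result
};

// Cache keyed by PC; a hit requires the ExtMachInst generated this time
// to match the one generated when the entry was cached.
class DecodeCache {
    std::unordered_map<uint64_t, CacheEntry> entries;
  public:
    std::shared_ptr<StaticInst>
    lookup(uint64_t pc, const ExtMachInst &emi) {
        auto it = entries.find(pc);
        if (it != entries.end() && it->second.emi == emi)
            return it->second.inst;                  // hit: same EMI at this PC
        auto inst = std::make_shared<StaticInst>();  // miss: do the full decode
        entries[pc] = {emi, inst};
        return inst;
    }
};
```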
The problem is that you need the displacement/immediate to actually do the cache lookup, since those are part of the ExtMachInst and are factored into a match. They could be ignored for a preliminary lookup, read in if there's a match, and then considered for the second lookup, but that sounds less efficient than just doing it the way it's done now. There could be a more direct simplification of the logic in there, a way to reduce the number of function calls, etc., that would be easier. The code is in arch/x86/predecoder.cc if you want to take a look.

Gabe
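P.S. In case it helps, the two-stage lookup I'm describing would be shaped roughly like this. This is purely illustrative, with made-up types; it is nothing like the actual predecoder code, just the control flow: a preliminary match that ignores the immediate/displacement, then reading those in and confirming the full match.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical, heavily simplified EMI: the opcode-ish bytes that are
// known early, and the immediate/displacement that is expensive to size.
struct Emi {
    uint64_t opBytes;  // prefixes/opcode/modrm, available before sizing imm
    uint64_t immDisp;  // immediate/displacement, read in later
};

struct Cached { Emi emi; int instId; };

class TwoStageCache {
    std::unordered_map<uint64_t, Cached> byPc;
  public:
    // Stage 1: preliminary lookup on PC plus opcode bytes only.
    const Cached *prelim(uint64_t pc, uint64_t opBytes) const {
        auto it = byPc.find(pc);
        if (it != byPc.end() && it->second.emi.opBytes == opBytes)
            return &it->second;
        return nullptr;
    }
    // Stage 2: caller has now read the imm/disp; confirm the full match.
    std::optional<int> confirm(const Cached *c, uint64_t immDisp) const {
        if (c && c->emi.immDisp == immDisp)
            return c->instId;
        return std::nullopt;
    }
    void insert(uint64_t pc, Emi emi, int instId) { byPc[pc] = {emi, instId}; }
};
```

The cost is that every hit now pays two lookups-worth of checking, which is why doing the sizing up front, as now, may still win.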