Steve Reinhardt wrote:
> (Getting this back on the list again...)
>
> On Tue, Aug 25, 2009 at 10:32 PM, Gabe Black<gbl...@eecs.umich.edu> wrote:
>   
>> Steve Reinhardt wrote:
>>     
>>> On Tue, Aug 25, 2009 at 9:46 PM, Gabe Black<gbl...@eecs.umich.edu> wrote:
>>>
>>>       
>>>> Steve Reinhardt wrote:
>>>>
>>>>         
>>>>> On Tue, Aug 25, 2009 at 6:56 PM, Gabriel Michael
>>>>> Black<gbl...@eecs.umich.edu> wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> That actually gives rise to one of the potential optimizations I
>>>>>> mentioned before. If some of the work of getting from bytes to
>>>>>> StaticInsts can be delayed until after the ExtMachInst conversion,
>>>>>> for instance until the ExtMachInst is used to construct the EmulEnv
>>>>>> object or even in the microop constructors, it would only happen if
>>>>>> the decode cache missed and potentially contribute less to the
>>>>>> overall run time. I looked at it again recently and nothing like
>>>>>> that jumped out, but it might be there if someone looked hard
>>>>>> enough. A tricky option would be figuring out how much immediate
>>>>>> and/or displacement to read in with less work since that's based on
>>>>>> a lot of different factors.
>>>>>>
>>>>>>
>>>>>>             
>>>>> What about the instruction page cache?  I thought our summer intern
>>>>> from a few years back added a shadow-page-like struct that cached the
>>>>> StaticInst objects for a page according to PC.  For x86 you'd have to
>>>>> make this byte-oriented rather than word-oriented, but the nice thing
>>>>> is that, assuming you're also keeping the original byte sequence along
>>>>> with the ExtMachInst, all you have to check is that the byte sequence
>>>>> matches what's in the actual instruction page.
>>>>>
>>>>>
>>>>>           
>>>> I think what happens is that it uses the PC and compares the
>>>> ExtMachInsts generated this time and the time it was cached. You'd have
>>>> to do that since you wouldn't want, for instance, the PAL version of one
>>>> instruction to be returned when trying to decode the non PAL version,
>>>> even if the actual bytes in memory are the same. I think the general
>>>> rule is that the ExtMachInst must be different if the end StaticInst is
>>>> different, and since I followed that rule it all just works out even for
>>>> x86.
>>>>
>>>>         
>>> Right, I recall that now.  Your original comment makes more sense: you
>>> really want the ExtMachInst to be just the original byte stream plus
>>> any necessary mode info (like PAL mode for Alpha), and not the
>>> half-decoded thing you have now.  I think that's a great goal to keep
>>> in mind if we do dive in to a more thorough restructuring.
>>>
>>> Steve
>>>
>>>       
>> Yeah. Unfortunately figuring out how much immediate and/or displacement
>> to read in, something that usually partially determines the length of an
>> instruction, seems to require the partial decode. Which instructions
>> require either, and what size they need, seems almost random, which is
>> why I have some lookup tables in there. I think real CPUs approximate and
>> then make instructions fix up the PC if they know the quick answer is
>> wrong. If we find a way to get around that I think we might actually be
>> able to get it in there without any other changes. I was thinking before
>> that we could do the extra processing after an early cache lookup but
>> before decode, but that's not really necessary since there are already
>> steps inside the decoder that could handle some of it.
>>     
>
> As far as the cache lookup, if you've already got a cached instruction
> then that will tell you how many bytes you need to look at to validate
> the cached object.  The only hiccup I see is that if the cache lookup
> fails, you may need to iteratively build the ExtMachInst as you
> decode; I don't know if that's different than what happens now or not.
>
> Steve
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org
> http://m5sim.org/mailman/listinfo/m5-dev
>   

The problem is that you need the displacement and immediate to do the
cache lookup in the first place, since they're part of the ExtMachInst
and are factored into a match. You could ignore them for a preliminary
lookup, read them in only if that matches, and then include them in a
second lookup, but that sounds less efficient than what's done now. A
more direct simplification of the logic in there, such as reducing the
number of function calls, would probably be easier. The code is in
arch/x86/predecoder.cc if you want to take a look.
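
To make the trade-off concrete, here's a rough sketch of the two lookup
schemes. None of this is the actual M5 code: ToyExtMachInst,
ToyStaticInst, OneStageCache and TwoStageCache are made-up names, the
"ExtMachInst" is cut down to four fields, and std::map stands in for
whatever the real decode cache uses. It only counts cache probes to show
where the extra lookup comes from; it says nothing about real timing.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical, simplified stand-in for the x86 ExtMachInst. The real
// type carries more state; these four fields are only for illustration.
struct ToyExtMachInst
{
    uint8_t opcode;
    uint8_t modrm;
    uint64_t immediate;
    uint64_t displacement;

    bool
    operator<(const ToyExtMachInst &other) const
    {
        return std::tie(opcode, modrm, immediate, displacement) <
               std::tie(other.opcode, other.modrm, other.immediate,
                        other.displacement);
    }
};

// Stand-in for a decoded StaticInst; an integer id is enough here.
typedef int ToyStaticInst;

// Scheme A: one lookup on the complete key, so the immediate and
// displacement have to be read before the cache can be probed.
struct OneStageCache
{
    std::map<ToyExtMachInst, ToyStaticInst> insts;
    unsigned probes = 0;

    const ToyStaticInst *
    lookup(const ToyExtMachInst &emi)
    {
        ++probes;
        auto it = insts.find(emi);
        return it == insts.end() ? nullptr : &it->second;
    }
};

// Scheme B: probe on opcode/ModRM first, read the immediate and
// displacement only if that hits, then probe a second time.
struct TwoStageCache
{
    typedef std::pair<uint8_t, uint8_t> Head;    // opcode, modrm
    typedef std::pair<uint64_t, uint64_t> Tail;  // immediate, displacement
    std::map<Head, std::map<Tail, ToyStaticInst>> insts;
    unsigned probes = 0;

    void
    insert(const ToyExtMachInst &emi, ToyStaticInst inst)
    {
        insts[Head(emi.opcode, emi.modrm)]
             [Tail(emi.immediate, emi.displacement)] = inst;
    }

    // readTail is any callable returning a Tail; it is invoked only
    // after the first probe hits.
    template <typename ReadTail>
    const ToyStaticInst *
    lookup(uint8_t opcode, uint8_t modrm, ReadTail readTail)
    {
        ++probes;
        auto head = insts.find(Head(opcode, modrm));
        if (head == insts.end())
            return nullptr;
        // insert() never creates a head entry without a tail entry.
        assert(!head->second.empty());
        ++probes;
        auto tail = head->second.find(readTail());
        return tail == head->second.end() ? nullptr : &tail->second;
    }
};
```

In this toy version a hit costs one probe in scheme A and two in scheme
B, and B only skips reading the immediate/displacement when the first
probe misses, which is the case where the full decode has to run anyway.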

Gabe
