I think your best bet is to make mfpr/mtpr a little bit special.
Instead of IsIprAccess you probably want flags like IsIprRead/IsIprWrite.
Then in src/cpu/o3/rename_impl.hh you can keep track of which (reads or
writes) are in the back end and only allow reads if there are no writes.
That should pretty much solve your issue.
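
Something like the following for the bookkeeping (just an untested
sketch; the iprRead/iprWrite flags and the policy that writes wait for
everything are my assumptions, not existing code):

    // Untested sketch of the rename-stage bookkeeping only.  The real
    // version would live in src/cpu/o3/rename_impl.hh and use whatever
    // flags end up on StaticInst.
    struct IprTracker
    {
        int readsInFlight;
        int writesInFlight;

        IprTracker() : readsInFlight(0), writesInFlight(0) {}

        // May this instruction proceed past rename?
        bool mayProceed(bool iprRead, bool iprWrite) const
        {
            if (iprWrite)  // writes wait for every outstanding IPR access
                return readsInFlight == 0 && writesInFlight == 0;
            if (iprRead)   // reads only wait for outstanding writes
                return writesInFlight == 0;
            return true;
        }

        void issued(bool iprRead, bool iprWrite)
        {
            readsInFlight  += iprRead  ? 1 : 0;
            writesInFlight += iprWrite ? 1 : 0;
        }

        // Call on commit or squash.
        void completed(bool iprRead, bool iprWrite)
        {
            readsInFlight  -= iprRead  ? 1 : 0;
            writesInFlight -= iprWrite ? 1 : 0;
        }
    };

Rename would check mayProceed() before letting an hw_mfpr/hw_mtpr
through, bump issued() when it does, and call completed() when the
instruction commits or gets squashed.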


Ali

On Oct 18, 2009, at 4:13 PM, ef wrote:

> Hello,
>
> Here is a bit of an update:
>
> I went ahead and tracked the number of hw_m*pr instructions using
> Blackscholes, simlarge, 2 cores. 5% of instructions fetched were
> hw_m*pr instructions. I think this is a huge amount considering that
> is 1 in 20 instructions; since I am using an 8-wide machine, that's a
> stall of 3/5 of the pipe stages one out of every 3 cycles (stalling
> until the ROB is empty)... Not to mention the IPC is atrocious, which
> I suspect is caused by these serializing stalls.
>
> I went ahead and objdumped my binaries and looked at the
> decoder.isa. It seems that the only instructions from the parsec
> binaries generating hw_m*pr instructions are:
>
> rdunique (my code has lots of them), wrunique (my code has 2), halt
> (not important), and callsys (quite a few).
> Anyway, the static count of these instructions is not very meaningful
> due to branches, loops, and jumps; I was just trying to get some kind
> of quantification.
>
> Unfortunately I was unable to find much literature on these
> instructions (rdunique, wrunique) using Google; I need to work on
> that.
>
> I have 12 out of 13 PARSEC 2.1 benchmarks compiled and I consider them
> useless due to this issue. I am surprised that this issue has never
> been brought up except by Rick Strong.
>
> My first thought was to try to force the compiler to
> remove/reduce/optimize these instructions (rduniq, wruniq), however I
> don't think that is possible, so a compiler solution seems to be out
> of the question. There is a way to replace these rduniq and wruniq
> instructions; the following is from gcc.org:
> " The following builtins are available on systems that use the OSF/1
> PALcode.  Normally they invoke the `rduniq' and `wruniq' PAL calls,
> but
> when invoked with `-mtls-kernel', they invoke `rdval' and `wrval'."
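>
> (The builtins that passage is describing are __builtin_thread_pointer
> and __builtin_set_thread_pointer; just to illustrate where the
> rduniq/wruniq calls come from, roughly:
>
>     // toy wrappers, names just for illustration
>     void *get_tp(void)
>     {
>         return __builtin_thread_pointer();    // emits rduniq
>     }
>
>     void set_tp(void *p)
>     {
>         __builtin_set_thread_pointer(p);      // emits wruniq
>     }
>
> so -mtls-kernel only swaps one PAL call for another, it doesn't
> remove it.)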
>
> I don't think rdval and wrval are implemented in M5, so that seems
> useless.
>
> My next step, I guess, is to attempt to fix M5. Unfortunately my C++
> skill level is pretty low; I would like to implement a solution based
> on what I can do, but I don't think I have that luxury. I am in the
> process of researching Nathan's solutions and trying to implement
> them. I am assuming the scoreboard is the most efficient and
> realistic implementation (is this correct?). I will try looking into
> that solution first.
>
> Any thoughts, advice and/or suggestions would be greatly appreciated.
>
> Thanks,
> EF
>
> On Mon, Oct 12, 2009 at 10:11 PM, nathan binkert <[email protected]>  
> wrote:
>> My guess is that this is the result of calling rdunique and wrunique.
>> These pal instructions keep track of the currently running thread.
>> They more or less just access a single internal pal temp register.
>> There are a number of things that could potentially be done to fix  
>> the
>> slowness here.  You could create an actual renamed register in the o3
>> model and make those palcalls access that special register.  If that
>> weren't enough, you could add a more generalized facility for  
>> renaming
>> pal temp registers (there are many that are simply treated as
>> registers) and allow mfpr and mtpr to not be serializing.  Another
>> option is to make some sort of "barrier" between pal instructions  
>> that
>> allows them to not necessarily be serializing, but forces them to be
>> executed in order.  You could take that a step further if necessary
>> and implement a scoreboard that indicates which instructions have to
>> wait for others (which is how the ev6 really does it).
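>>
>> A rough sketch of what that last option might look like (register
>> count and names are made up, untested):
>>
>>     // One entry per PAL temp / IPR register.
>>     struct IprScoreboard
>>     {
>>         static const int NumIprRegs = 128;   // placeholder size
>>         int pendingWrites[NumIprRegs];
>>         int pendingReads[NumIprRegs];
>>
>>         IprScoreboard()
>>         {
>>             for (int i = 0; i < NumIprRegs; ++i)
>>                 pendingWrites[i] = pendingReads[i] = 0;
>>         }
>>
>>         // An mfpr waits only for older mtprs of the same register;
>>         // an mtpr waits for any older access to the same register.
>>         bool mustWait(int reg, bool isWrite) const
>>         {
>>             if (isWrite)
>>                 return pendingWrites[reg] + pendingReads[reg] > 0;
>>             return pendingWrites[reg] > 0;
>>         }
>>
>>         void record(int reg, bool isWrite)
>>         {
>>             (isWrite ? pendingWrites[reg] : pendingReads[reg])++;
>>         }
>>
>>         void complete(int reg, bool isWrite)
>>         {
>>             (isWrite ? pendingWrites[reg] : pendingReads[reg])--;
>>         }
>>     };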
>>
>> None of these options are particularly simple, but they aren't overly
>> massive changes either.
>>
>>  Nate
>>
>>> Based on the results I am getting, for PARSEC benchmarks, the OoO
>>> performance is really bad; there are too many hw_mfpr and hw_mtpr
>>> instructions. I am trying to figure out why I am in the PALcode so
>>> often (any ideas on how to figure this out?). I am running
>>> Blackscholes, which is a relatively simple PARSEC benchmark.
>>>
>>> I need to do more research, but I don't think this is caused at all
>>> by itb and dtb misses (I made them really large just in case).
>>> Right now for blackscholes, which isn't close to finishing (about
>>> 30% done), I have it set to execute 1e9 instructions (running two
>>> cores). From my perl scripts it seems to have fetched 13 million
>>> hw_mfpr and hw_mtpr instructions (fetched, not committed). There is
>>> something really wrong with that ratio.
