Hello,

Here is a bit of an update:

I went ahead and tracked the number of hw_m*pr instructions using
Blackscholes, simlarge, 2 cores. 5% of the instructions fetched were
hw_m*pr instructions. I think this is a huge amount: that is 1 in 20
instructions, and since I am using an 8-wide machine, roughly one
fetch cycle out of every 3 brings in an instruction that stalls 3/5 of
the pipe stages (stalling until the ROB is empty). Not to mention the
IPC is atrocious, which I suspect is caused by these serializing
stalls.
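
As a rough sanity check on the figures above, the sketch below works out
how often a serializing instruction would reach the front end. It assumes
fetch delivers a full 8 instructions every cycle and that the 5% is spread
evenly through the instruction stream; both are simplifications, and the
function name is made up for illustration.

```python
def cycles_between_serializing(fraction, fetch_width):
    """Average fetch cycles between serializing instructions.

    Assumes every fetch cycle is full and the serializing
    instructions are evenly spread (both are simplifications).
    """
    instructions_between = 1.0 / fraction
    return instructions_between / fetch_width


# 5% of fetched instructions on an 8-wide machine
print(cycles_between_serializing(0.05, 8))
```

Under those assumptions this comes out to about 2.5 cycles, which is in
line with the "one out of 3 cycles" estimate.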

I went ahead and ran objdump on my binaries and looked at
decoder.isa. It seems that the only instructions in the PARSEC
binaries that generate hw_m*pr instructions are:

rdunique (my code has lots of them), wrunique (my code has 2), halt
(not important) and callsys (my code has quite a few).

Anyway, the static count of these instructions says little on its own,
because loops, branches and jumps determine how often each one actually
executes. I was just trying to get some kind of quantification.
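
For reference, a static count like this can be scripted along the
following lines. This is only a sketch: it assumes the usual
tab-separated `objdump -d` layout (address, raw bytes, mnemonic) and
that the disassembler prints the PAL calls as rduniq, wruniq, callsys
and halt. The sample lines are invented for illustration and are not
real encodings.

```python
from collections import Counter

# Mnemonics assumed to lead to hw_mfpr/hw_mtpr in PALcode (per the list above).
PAL_MNEMONICS = {"rduniq", "wruniq", "callsys", "halt"}


def count_pal_calls(disassembly):
    """Count static occurrences of the PAL-call mnemonics in objdump -d text."""
    counts = Counter()
    for line in disassembly.splitlines():
        # Assumed layout: "address:<TAB>raw bytes<TAB>mnemonic operands"
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # section headers, labels, blank lines
        words = fields[2].split()
        if words and words[0] in PAL_MNEMONICS:
            counts[words[0]] += 1
    return counts


# Made-up sample input; addresses and bytes are placeholders.
sample = (
    "   120001000:\t00 00 00 00 \trduniq\n"
    "   120001004:\t00 00 00 00 \tcallsys\n"
    "   120001008:\t00 00 00 00 \tunop\t\n"
    "   12000100c:\t00 00 00 00 \trduniq\n"
)
print(count_pal_calls(sample))
```

This only gives static counts; the dynamic counts still have to come
from the simulator statistics.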

Unfortunately I was unable to find much literature on these
instructions (rdunique, wrunique) using Google; I need to work on
that.

I have 12 out of 13 PARSEC 2.1 benchmarks compiled, and I consider them
useless due to this issue. I am surprised that this issue has never
been brought up except by Rick Strong.

My first thought was to try to force the compiler to
remove/reduce/optimize these instructions (rduniq, wruniq); however, I
don't think that is possible, so a compiler solution seems to be out of
the question. There is a way to replace the rduniq and wruniq
instructions; the following is from the GCC documentation
(gcc.gnu.org):
"The following builtins are available on systems that use the OSF/1
PALcode.  Normally they invoke the `rduniq' and `wruniq' PAL calls, but
when invoked with `-mtls-kernel', they invoke `rdval' and `wrval'."

I don't think rdval and wrval are implemented in M5, so that seems useless.

My next step, I guess, is to attempt to fix M5. Unfortunately my C++
skill level is pretty low; I would like to implement a solution based
on what I can do, but I don't think I have that luxury. I am in the
process of researching Nathan's solutions and trying to implement
them. I am assuming the scoreboard is the most efficient and most
realistic implementation (is this correct?). I will try looking
into that solution first.
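
For illustration only, the scoreboard idea could be sketched roughly as
below. This is a toy model, not M5 code and not a description of how
the ev6 does it: the class, its methods and the register names are all
hypothetical. It keeps one busy bit per internal register, so an access
waits only when a register it touches has a write in flight, instead of
draining the whole ROB.

```python
class IprScoreboard:
    """Toy scoreboard: one busy bit per internal processor register."""

    def __init__(self):
        self.pending_writes = set()

    def can_issue(self, reads=(), writes=()):
        # An access must wait only if a register it touches
        # still has an outstanding write.
        touched = set(reads) | set(writes)
        return not (touched & self.pending_writes)

    def issue(self, reads=(), writes=()):
        if not self.can_issue(reads, writes):
            raise RuntimeError("dependent register write still in flight")
        self.pending_writes |= set(writes)

    def complete(self, writes=()):
        # Clear the busy bits once the write has finished.
        self.pending_writes -= set(writes)


sb = IprScoreboard()
sb.issue(writes=["unique"])                       # wruniq-like write in flight
blocked = sb.can_issue(reads=["unique"])          # dependent read must wait
independent = sb.can_issue(reads=["other_temp"])  # unrelated register is free
sb.complete(writes=["unique"])
unblocked = sb.can_issue(reads=["unique"])        # read may now proceed
print(blocked, independent, unblocked)
```

The sketch ignores write-after-read hazards and anything to do with
squashing, which a real implementation in the o3 model would have to
handle.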

Any thoughts, advice and/or suggestions would be greatly appreciated.

Thanks,
EF

On Mon, Oct 12, 2009 at 10:11 PM, nathan binkert <[email protected]> wrote:
> My guess is that this is the result of calling rdunique and wrunique.
> These pal instructions keep track of the currently running thread.
> They more or less just access a single internal pal temp register.
> There are a number of things that could potentially be done to fix the
> slowness here.  You could create an actual renamed register in the o3
> model and make those palcalls access that special register.  If that
> weren't enough, you could add a more generalized facility for renaming
> pal temp registers (there are many that are simply treated as
> registers) and allow mfpr and mtpr to not be serializing.  Another
> option is to make some sort of "barrier" between pal instructions that
> allows them to not necessarily be serializing, but forces them to be
> executed in order.  You could take that a step further if necessary
> and implement a scoreboard that indicates which instructions have to
> wait for others (which is how the ev6 really does it).
>
> None of these options are particularly simple, but they aren't overly
> massive changes either.
>
>  Nate
>
>> Based on the results I am getting for PARSEC benchmarks, the OoO
>> performance is really bad; there are too many hw_mfpr and hw_mtpr
>> instructions. I am trying to figure out why I am in the PALcode so
>> often (any ideas on how to figure this out?). I am running
>> Blackscholes, which is a relatively simple PARSEC benchmark.
>>
>> I need to do more research, but I don't think this is caused by
>> itb and dtb misses (I made them really large just in case).
>> Right now for blackscholes, which isn't close to finishing execution
>> (about 30%), I have it set to execute 1e9 instructions (running
>> two cores). From my perl scripts it seems to have fetched 13
>> million hw_mfpr and mtpr instructions (fetched, not committed). There
>> is something really wrong with that ratio.
> _______________________________________________
> m5-users mailing list
> [email protected]
> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
>
