I'd also take a look at how many MSHRs you are giving your caches and check
that the number matches your CPU model. For example, if you only have 2 MSHRs
but your model can issue up to 8 speculative loads, there's a chance your
system is under-provisioned and you'll lose some performance.
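
If you're using the standard config scripts, the knob is the cache's mshrs
parameter. A rough sketch (parameter names assume the classic BaseCache
SimObject; other required cache parameters are omitted, so adjust it to
whatever your script already does):

    # Rough sketch only: give the L1 data cache enough MSHRs to track as
    # many outstanding misses as the core can have speculative loads in
    # flight. Assumes the classic BaseCache parameters (mshrs, tgts_per_mshr);
    # other required cache parameters are left out for brevity.
    from m5.objects import BaseCache

    class L1DCache(BaseCache):
        size = '32kB'
        assoc = 2
        mshrs = 8              # >= max in-flight loads from the core
        tgts_per_mshr = 16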

On Thu, May 19, 2011 at 12:28 AM, Ali Saidi <[email protected]> wrote:

> Hi Marc,
>
> If you haven't updated your code recently, I committed some changes last
> week that fixed some dependency issues with the ARM condition codes in the
> o3 cpu model. Previously, any instruction that wrote a condition code had to
> do a read-modify-write operation on all the condition codes together, which
> meant that a string of instructions setting condition codes were all
> dependent on each other. The committed code fixes this and shows
> improvements of up to 22% on some SPEC benchmarks.
>
> If that doesn't fix the issue, you'll need to see where the o3 model is
> stalling on your workload. Some of the statistics might help narrow it down
> a bit. The model should be able to issue dependent instructions in
> back-to-back cycles, and it executes instructions speculatively (including
> loads).
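
A quick way to skim those stats out of m5out/stats.txt is something like the
sketch below. This is only illustrative; the exact counter names depend on
your config and on the CPU's path (assumed here to be system.cpu).

    # Illustrative only: scan m5out/stats.txt for o3 counters that usually
    # indicate where the pipeline is stalling (blocked/squash/full events).
    # The stat names and the "system.cpu" prefix depend on your config.
    KEYWORDS = ('Blocked', 'Squash', 'Full', 'Idle')

    with open('m5out/stats.txt') as f:
        for line in f:
            if not line.startswith('system.cpu.'):
                continue
            if any(k in line for k in KEYWORDS):
                print(line.rstrip())
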
>
> Any chance you'd share your cpu model? Are you sure you're accounting for
> memory latency correctly in it? The atomic memory mode completes a
> load/store instantly, so if you're not correctly accounting for the real
> time that load/store would take to complete, that could be part of the
> issue.
>
> Ali
>
> On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:
>
> > Hi all,
> >
> > I recently extended the atomic CPU model to simulate a deeply-pipelined
> > two-issue in-order machine.  The code models variable instruction
> > latencies, checks for register dependences, has full bypass/forwarding
> > capability, and so on.  I have reason to believe it is working as it
> > should.
> >
> > Curiously, when I run binaries using this CPU model, it frequently
> > outperforms the O3 CPU model in terms of cycle count.  The O3 model I
> > compare against is also two-issue, has an 8-entry load queue, an 8-entry
> > store queue, a 16-entry IQ, a 32-entry ROB, and extra physical registers,
> > but is otherwise configured identically.  The in-order core models
> > identical branch prediction with a rather generous 13-cycle mispredict
> > penalty for the two-issue core (as in the ARM Cortex-A8), and still
> > achieves better performance in most cases.
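
For reference, a two-issue O3 configuration with roughly those structure
sizes would look something like the sketch below (parameter names are the
DerivO3CPU ones; double-check them against your tree):

    # Sketch of a two-issue O3 config with the structure sizes quoted above.
    # Parameter names follow the DerivO3CPU SimObject; verify in O3CPU.py.
    from m5.objects import DerivO3CPU

    cpu = DerivO3CPU()
    cpu.fetchWidth = 2
    cpu.decodeWidth = 2
    cpu.renameWidth = 2
    cpu.issueWidth = 2
    cpu.commitWidth = 2
    cpu.LQEntries = 8
    cpu.SQEntries = 8
    cpu.numIQEntries = 16
    cpu.numROBEntries = 32
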
> >
> > I'm finding it hard to parse through all the O3 trace logs, so I was
> > wondering if anyone has intuition as to why this might be the case.  Does
> > the O3 CPU not do full bypassing?  Is there speculation going on beyond
> > just branch prediction?  I plan to look into the source code in more
> > detail, but I was wondering if someone could give me a leg up by pointing
> > me in the right direction.
> >
> > I've also noticed that when I set the MemRead and MemWrite latencies in
> > src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance
> > drops quite drastically (~10% per increment).  This doesn't really make
> > sense to me either.  I'm not configuring a massive instruction window, but
> > I wouldn't expect performance to suffer quite so much.  If it helps, all
> > my simulations so far are just using ARM.
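
For anyone following along, the edit being described is roughly the following
in src/cpu/o3/FuncUnitConfig.py (again just a sketch; check the class and
port names in your own tree):

    # Sketch of the change in question: raising the memory-op latencies on
    # the read/write functional-unit port. Names follow the FUDesc/OpDesc
    # SimObjects used by src/cpu/o3/FuncUnitConfig.py; verify against your
    # tree before relying on them.
    from m5.objects import FUDesc, OpDesc

    class RdWrPort(FUDesc):
        opList = [ OpDesc(opClass='MemRead', opLat=2),
                   OpDesc(opClass='MemWrite', opLat=2) ]
        count = 4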



-- 
- Korey
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
