I'd also take a look at how many MSHRs you are giving your caches and see whether that matches your cpu model. For example, if you only have 2 MSHRs but your model can issue up to 8 speculative loads, there's a chance your system is under-provisioned and will lose some performance.
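In the classic memory system the MSHR count is just a cache parameter. A rough sketch, modeled on configs/common/Caches.py (class name, parameters, and defaults vary across gem5 versions, and the numbers below are only placeholders):

    # Hypothetical L1 data cache for a classic-memory-system config.
    from m5.objects import BaseCache

    class L1DCache(BaseCache):
        size = '32kB'
        assoc = 2
        block_size = 64
        latency = '2ns'
        mshrs = 8           # outstanding misses the cache can track; size this
                            # to cover the loads your CPU can have in flight
                            # (e.g. roughly the LQ depth)
        tgts_per_mshr = 16  # accesses that can coalesce onto one outstanding miss

If mshrs is smaller than the number of misses the core can generate, the cache will block and the extra memory-level parallelism of the o3 model is wasted.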
On Thu, May 19, 2011 at 12:28 AM, Ali Saidi <[email protected]> wrote:
> Hi Marc,
>
> If you haven't updated your code recently, I committed some changes last
> week that fixed some dependency issues with the ARM condition codes in the
> o3 cpu model. Previously, any instruction that wrote a condition code had
> to do a read-modify-write operation on all the condition codes together,
> meaning that a string of instructions that set condition codes were all
> dependent on each other. The committed code fixes this issue and sees
> improvements of up to 22% on some SPEC benchmarks.
>
> If that doesn't fix the issue, you'll need to see where the o3 model is
> stalling on your workload. Some of the statistics might help narrow it
> down a bit. The model should be able to issue dependent instructions in
> back-to-back cycles, and it executes instructions speculatively
> (including loads).
>
> Any chance you'd share your cpu model? Are you sure you're accounting for
> memory latency correctly in it? The atomic memory mode completes a
> load/store instantly, so if you're not correctly accounting for the real
> time that load/store would take to complete, that could be part of the
> issue.
>
> Ali
>
> On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:
>
> > Hi all,
> >
> > I recently extended the atomic CPU model to simulate a deeply-pipelined,
> > two-issue in-order machine. The code includes variable instruction
> > latencies, checks for register dependences, has full bypass/forwarding
> > capability, and so on. I have reason to believe it is working as it
> > should.
> >
> > Curiously, when I run binaries using this CPU model, it frequently
> > outperforms the O3 CPU model in terms of cycle count. The O3 model I
> > compare against is also two-issue and has an 8-entry load queue, an
> > 8-entry store queue, a 16-entry IQ, a 32-entry ROB, and extra physical
> > regs, but is otherwise configured identically. The in-order core models
> > identical branch prediction with a rather generous 13-cycle mispredict
> > penalty for the two-issue core (e.g. as in the ARM Cortex-A8), and still
> > achieves better performance in most cases.
> >
> > I'm finding it hard to parse through all the O3 trace logs, so I was
> > wondering if anyone has intuition as to why this might be the case. Does
> > the O3 CPU not do full bypassing? Is there speculation going on beyond
> > just branch prediction? I plan to look into the source code in more
> > detail, but I was wondering if someone could give me a leg up by
> > pointing me in the right direction.
> >
> > I've also noticed that when I set the MemRead and MemWrite latencies in
> > src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance
> > slows down quite drastically (~10% per increment). This doesn't really
> > make sense to me either. I'm not configuring with a massive instruction
> > window, but I wouldn't expect performance to suffer quite so much. If it
> > helps, all my simulations so far are just using ARM.

--
- Korey
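For reference, the MemRead/MemWrite latencies discussed above are the opLat fields of OpDesc entries in src/cpu/o3/FuncUnitConfig.py. A rough sketch of the relevant piece (class and field names as in gem5 of roughly this vintage; the stock file defines several more FU pools and your tree may differ), with opLat raised to 2 the way Marc describes:

    # Sketch of one functional-unit pool from src/cpu/o3/FuncUnitConfig.py.
    from m5.objects import FUDesc, OpDesc

    class RdWrPort(FUDesc):
        # opLat is the per-operation execute latency; it is paid by every
        # MemRead/MemWrite the pipeline issues, cache hits included, so
        # raising it lengthens every load and store, not just misses.
        opList = [ OpDesc(opClass='MemRead', opLat=2),
                   OpDesc(opClass='MemWrite', opLat=2) ]
        count = 4

That would be one plausible reason the slowdown per increment is so visible even with a modest instruction window.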
