Hi Marc,

If you haven't updated your code recently, I committed some changes last week that fixed some dependency issues with the ARM condition codes in the O3 CPU model. Previously, any instruction that wrote a condition code had to do a read-modify-write operation on all the condition codes together, meaning that a string of instructions that set condition codes were all dependent on each other. The committed code fixes this issue and yields improvements of up to 22% on some SPEC benchmarks.
If that doesn't fix the issue, you'll need to see where the O3 model is stalling on your workload. Some of the statistics might help narrow it down a bit. The model should be able to issue dependent instructions in back-to-back cycles, and it executes instructions speculatively (including loads).

Any chance you'd share your CPU model? Are you sure you're accounting for memory latency correctly in it? The atomic memory mode completes a load/store instantly, so if you're not correctly accounting for the real time it would take for that load/store to complete, that could be part of the issue.

Ali

On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:

> Hi all,
>
> I recently extended the atomic CPU model to simulate a deeply-pipelined
> two-issue in-order machine. The code includes variable length instruction
> latencies, checks for register dependences, has full bypass/forwarding
> capability, and so on. I have reason to believe it is working as it should.
>
> Curiously, when I run binaries using this CPU model, it frequently
> outperforms the O3 CPU model in terms of cycle count. The O3 model I compare
> against is also two-issue, has a 8-entry load queue, 8-entry store queue,
> 16-entry IQ, 32-entry ROB, extra physical regs, but is otherwise configured
> identically. The in-order core models identical branch prediction with a
> rather generous 13-cycle mispredict penalty for the two-issue core (e.g. as
> in ARM Cortex-A8), and still achieves better performance in most cases.
>
> I'm finding it hard to parse through all the O3 trace logs, so I was
> wondering if anyone has intuition as to why this might be the case. Does the
> O3 CPU not do full bypassing? Is there speculation going on beyond just
> branch prediction? I plan to look into the source code in more detail, but I
> was wondering if someone could give me a leg up by pointing me in the right
> direction.
>
> I've also noticed when I set the MemRead and MemWrite latencies in
> src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance slows
> down quite drastically (~10% per increment). This doesn't really make sense
> to me either. I'm not configuring with a massive instruction window, but I
> wouldn't expect performance to suffer quite so much. If it helps, all my
> simulations so far are just using ARM.
> _______________________________________________
> gem5-users mailing list
> gem5-users@m5sim.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users