Re: [gem5-users] In-order faster than O3?

Ali Saidi Wed, 18 May 2011 21:28:40 -0700

Hi Marc,

If you haven't updated your code recently, I committed some changes last week 
that fixed some dependency issues with the ARM condition codes in the o3 cpu 
model. Previously, any instruction that wrote a condition code had to do a 
read-modify-write operation on all the condition codes together, meaning that 
a string of instructions that set condition codes were all dependent on each 
other. The committed code fixes this issue and improves performance by up to 
22% on some SPEC benchmarks.
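
To make the serialization effect concrete, here is a toy sketch (plain Python, 
not gem5 code). It compares the longest dependency chain through a run of 
flag-setting instructions when all flags share one register against the case 
where each flag is tracked on its own. The per-flag split, the function name 
and the instruction stream are assumptions made for illustration only; they 
are not taken from the actual commit.

```python
# Toy model, not gem5 code: critical path through flag-setting instructions.

def critical_path(instrs, merged_flags):
    """Return the longest dependency chain, counted in instructions.

    instrs: list of (reads, writes) pairs, each a set of flag names.
    merged_flags: if True, all flags live in one register "CC", so every
    flag write is a read-modify-write of that register.
    """
    ready = {}    # resource name -> chain depth of its last writer
    longest = 0
    for reads, writes in instrs:
        if merged_flags:
            # A write must also read "CC" to preserve the untouched flags.
            srcs = {"CC"} if (reads or writes) else set()
            dsts = {"CC"} if writes else set()
        else:
            srcs, dsts = set(reads), set(writes)
        depth = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        for d in dsts:
            ready[d] = depth
        longest = max(longest, depth)
    return longest

# Hypothetical stream: four instructions that write flags and read none.
stream = [
    (set(), {"N", "Z"}),
    (set(), {"C"}),
    (set(), {"N", "Z"}),
    (set(), {"V"}),
]
merged = critical_path(stream, merged_flags=True)   # fully serialized chain
split = critical_path(stream, merged_flags=False)   # no chain between writers
```

With the merged register, every instruction in the stream waits on the one 
before it; with the flags tracked separately, none of them depend on each 
other.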


If that doesn't fix the issue, you'll need to see where the o3 model is 
stalling on your workload. Some of the statistics might help narrow it down a 
bit. The model should be able to issue dependent instructions in back-to-back 
cycles, and executes instructions speculatively (including loads). 

Any chance you'd share your cpu model? Are you sure you're accounting for 
memory latency correctly in it? The atomic memory mode completes a load/store 
instantly, so if you're not accounting for the real time that load/store would 
take to complete, that could be part of the issue.
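
As a rough illustration of how much this can matter, the toy sketch below 
(plain Python, not gem5 code) counts cycles for a single-issue in-order 
pipeline in which an instruction issues only once its sources are ready. The 
program, the register names and the latency values are hypothetical; the only 
point is that charging loads a single cycle hides every load-use stall.

```python
# Toy model, not gem5 code: effect of the assumed load latency on cycle count.

def in_order_cycles(program, load_latency):
    """Return the cycle count of a single-issue in-order toy pipeline.

    program: list of (kind, dest, srcs) tuples, kind is "load" or "alu".
    load_latency: cycles from a load's issue until its result is usable.
    """
    ready_at = {}   # register -> cycle at which its value is available
    cycle = 0       # earliest cycle at which the next instruction may issue
    for kind, dest, srcs in program:
        # Stall until every source operand has been produced.
        issue = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        latency = load_latency if kind == "load" else 1
        ready_at[dest] = issue + latency
        cycle = issue + 1
    return cycle

# Hypothetical program with two load-use dependences.
program = [
    ("load", "r1", []),
    ("alu", "r2", ["r1"]),
    ("load", "r3", []),
    ("alu", "r4", ["r3", "r2"]),
]
optimistic = in_order_cycles(program, load_latency=1)  # loads look free
realistic = in_order_cycles(program, load_latency=3)   # assumed 3-cycle loads
```

In this example the same four instructions take twice as many cycles once the 
loads are charged three cycles instead of one.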

Ali

On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:

> Hi all,
> 
> I recently extended the atomic CPU model to simulate a deeply-pipelined 
> two-issue in-order machine.  The code includes variable length instruction 
> latencies, checks for register dependences, has full bypass/forwarding 
> capability, and so on.  I have reason to believe it is working as it should.
> 
> Curiously, when I run binaries using this CPU model, it frequently 
> outperforms the O3 CPU model in terms of cycle count.  The O3 model I compare 
> against is also two-issue, has an 8-entry load queue, 8-entry store queue, 
> 16-entry IQ, 32-entry ROB, extra physical regs, but is otherwise configured 
> identically.  The in-order core models identical branch prediction with a 
> rather generous 13-cycle mispredict penalty for the two-issue core (e.g. as 
> in ARM Cortex-A8), and still achieves better performance in most cases.
> 
> I'm finding it hard to parse through all the O3 trace logs, so I was 
> wondering if anyone has intuition as to why this might be the case.  Does the 
> O3 CPU not do full bypassing?  Is there speculation going on beyond just 
> branch prediction?  I plan to look into the source code in more detail, but I 
> was wondering if someone could give me a leg up by pointing me in the right 
> direction.
> 
> I've also noticed when I set the MemRead and MemWrite latencies in 
> src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance slows 
> down quite drastically (~10% per increment).  This doesn't really make sense 
> to me either.  I'm not configuring with a massive instruction window, but I 
> wouldn't expect performance to suffer quite so much.  If it helps, all my 
> simulations so far are just using ARM.
> _______________________________________________
> gem5-users mailing list
> gem5-users@m5sim.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users

