On Mar 10, 2012, at 1:26 PM, Brian Grayson wrote:

> I am running a testcase with decodeWidth=4, fetchWidth=8, and
> fetchToDecodeDelay set to 1. This results in skidBufferMax being computed
> as 1*8+4 = 12.
>
> However, in one workload, the verbose debug log shows that:
>
> - Decode starts processing 8 instructions, and becomes blocked
>   after 4 (since decodeWidth is only 4). The remaining 4 then get inserted
>   into the skidBuffer.
>
> - Fetch had already sent the next 8 instructions, so in the next
>   cycle these next 8 instructions enter the skidBuffer. At this point, the
>   skidBuffer is full.
>
> - In the next cycle, Fetch generates a Translation-fault NoOp, even
>   though Fetch is blocked. This then flows down to Decode, where it tries to
>   put it into the skidBuffer, and asserts start to fire.
>
> A quick workaround for this is to increase skidBufferMax by one more. But I
> do not think this is the right fix: even with that, setting the decodeWidth
> to 3 causes the simulator to assert.
>
> I think the current flow-control between Fetch and Decode is such that with
> a 1-cycle delay, the skidBuffer must be able to hold two full fetches, minus
> what it is _guaranteed_ decode can remove. For the decode-3 example, out of
> the first batch of 8 instructions, 5 need to go into the skidBuffer. Then
> next cycle another 8 arrive. So it seems skidBufferMax should be at least
>
>     (fetchToDecodeDelay*params->fetchWidth) + (fetchWidth - decodeWidth)
>         + 1 /* to handle translation faults? */
>
> However, this is also not sufficient, probably because Decode doesn't
> guarantee that it can always decode 3 instructions. I did a sweep over
> fetchWidths and decodeWidths from 1 to 8 (64 combinations), and even with
> the above, 13 of the 64 combos failed.
>
> Empirically, setting it to ((fetchToDecodeDelay+1) * params->fetchWidth)
> appears to suffice for my simple toy workload across all 64 configs, but I'm
> sure someone else can figure out the bug-free proper value to use that is
> guaranteed to be correct.
>
> Thanks.
>
> Brian Grayson
Hi Brian,

My gut reaction is that ((fetchToDecodeDelay+1) * params->fetchWidth) is
probably right, since decode is not guaranteed to remove any instructions and
the communication delay means that the stall signal won't get there for an
extra cycle.

Actually, it might be ((fetchToDecodeDelay+1) * params->fetchWidth) - 1: if
the buffer was full last cycle, a stall should already have happened, so
there must be one slot available.

Thanks,
Ali

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
