The tricky part here is picking what main() is and how you get to and from it. If main() is just one of the steps in the process, then y() is equivalent to it. If you can get from main() to y() without stalling, then you can get from y() to whatever its subordinate step is, say z(), in the same way. That puts you right back in the bad situation of going around and around forever.
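
To make the hazard concrete, here's a toy sketch (made-up names, nothing to do with the real m5 code) of what happens when every step chains directly into the next whenever it doesn't need to stall:

    // Toy illustration only: each step calls the next one directly when
    // there is no delay, so a sequence that never stalls just keeps
    // recursing instead of ever returning.
    #include <cstdio>

    static int depth = 0;

    void fetch();  // forward declaration so the two functions can call each other

    void execute() { ++depth; fetch(); }   // "y()": chains straight into the next fetch

    void fetch() {                         // "main()"/"z()": begins the next instruction
        if (depth > 10000) {               // stand-in for hitting the real stack limit
            std::printf("call depth reached %d with nothing unwound\n", depth);
            return;
        }
        execute();
    }

    int main() {
        fetch();   // nothing ever unwinds until the artificial cap trips
        return 0;
    }

Each trip around the loop leaves more frames on the stack, so a long-running sequence that never stalls eventually overflows it.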
One solution is to put a cap on how deep you go. Say in this example that's z(). When you get to z(), you never attempt to do the next thing, say a(); you simply return back to main() and let it continue. That works really well if z() is the natural end of what you're doing, in this case the end of the life of the current instruction. The problem here is that you don't know whether you actually came from main() in the first place, or from a callback somewhere in the middle. If you came from a callback, you'd return to it, it would end, and there would be nothing responsible for the next action of the CPU. If you make the callbacks smarter, then all the callbacks start to be mini main()s and the complexity goes way up.

To solve -this- issue, you can simply keep track of whether you've gotten where you are from main() or from a callback after it. If you get to z() and you're not from main(), you run it and record that you are now from main(). If you are, you return to it. This way the stack can be at most twice the deepest call depth of a particular instruction's life cycle, because main() can never appear in it twice and you always have to start an instruction in main(). This gets us back to basically what I had the first time around, except that it's now a flag for the whole CPU rather than for a single step.

The next problem is dealing with calls to what would be y() in your example from code that is not part of the CPU. That happens in an instruction that calls read() or write(). In those bits of code we can't return whether we're expecting a callback, since the value would be lost; instead we have to record what's going on someplace in the CPU and read it back when we get control again. Now that we're recording whether read() or write() are going to call back or not, we could also record the fact that they were called in the first place and defer the work until after initiateAcc entirely. That leads to more global state, though, and adds complexity. I had been thinking that was the way to go for a while, but looking back I changed my mind.

So in the end, it looks like there need to be two global flags: one to say whether the call stack is rooted most recently in main() or in a callback, and one to say whether the CPU should perform "the next step" or wait for a callback to pick things up again. Now we run into another complication, namely that there isn't necessarily a single "the next step" to go to. After a translation we might, for instance, need to actually access memory, or we might need to invoke a fault for whatever reason. The second flag now has to indicate not only whether the CPU should perform "the next step" but also what that next step should be. At this point we've basically wound up with an enum of what, if you squint, look like states for the CPU. Those states describe what to do next, i.e. translate request X, send packet Y to memory, etc. This is what I was talking about before as far as making the CPU work like a state machine, although that may not have been clear.

So is this the way to go, or did I mangle/misinterpret something?
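
For concreteness, here's roughly the shape I'm imagining -- just a sketch with made-up names (NextStep, advance(), recvResponse(), etc.), not anything in the actual code:

    // Rough sketch only. One "what to do next" state plus one "rooted in
    // main()" flag; advance() is the single place that chains steps, and
    // callbacks just record the new state and kick advance() again.
    enum class NextStep {
        WaitForCallback,   // a callback will pick things up later
        Translate,         // translate the current request
        AccessMemory,      // send the packet to memory
        InvokeFault,       // something went wrong, run the fault
        AdvanceInst        // instruction done, start the next one
    };

    struct ToyCPU {
        NextStep next = NextStep::AdvanceInst;
        bool inMain = false;   // is the stack currently rooted in advance()?

        void advance() {
            if (inMain)
                return;        // already being driven from "main()"
            inMain = true;
            // Keep doing "the next step" until something has to wait.
            while (next != NextStep::WaitForCallback) {
                switch (next) {
                  case NextStep::Translate:    doTranslate();   break;
                  case NextStep::AccessMemory: doAccess();      break;
                  case NextStep::InvokeFault:  doFault();       break;
                  case NextStep::AdvanceInst:  startNextInst(); break;
                  default:                                      break;
                }
            }
            inMain = false;
        }

        // Each step only records what should happen next; a step that has
        // to wait for memory or translation records WaitForCallback.
        void doTranslate()   { next = NextStep::AccessMemory; }
        void doAccess()      { next = NextStep::WaitForCallback; }
        void doFault()       { next = NextStep::AdvanceInst; }
        void startNextInst() { next = NextStep::Translate; }

        // A completion callback from the memory system would do this:
        void recvResponse()  { next = NextStep::AdvanceInst; advance(); }
    };

The loop in advance() becomes the only place that chains steps together, so the stack depth stays bounded no matter what an instruction does, and a callback only has to record the new state and kick advance() again.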

Gabe

Steve Reinhardt wrote:
> I actually looked at the code a bit this time; and I have a hypothesis
> that the problem arises from two similar but fundamentally different
> models of "bypassing" potential event-based delays:
>
> main() {
>     x_will_callback = x();
>     if (!x_will_callback) y();
> }
>
> x() {
>     if (...) { sched_callback(&cb); return true; }
>     else { return false; }
> }
>
> cb() { y(); }
>
> as opposed to:
>
> main() { x(); }
>
> x() {
>     if (...) { sched_callback(&cb); }
>     else { y(); /* or cb(); */ }
> }
>
> cb() { y(); }
>
> Both of these have the overall effect of calling x() and then y(),
> sometimes with a delay and sometimes not. However in the latter case
> y() is called from inside the call to x(), which leads to problems
> when that's not expected... basically this is the root of the
> initiateAcc/completeAcc problem. Also if there's a cycle (like there
> is in our pipeline) where you do x,y,z,x,y,z,x,y,z then as Gabe points
> out you can run into stack overflow problems too.
>
> My hypothesis is that the old TimingSimpleCPU code worked because it
> always did the former, and Gabe has introduced two points that do the
> latter: one in timingTranslate(), and one in fetch(). I think the
> right solution is that for each of these we should either change it
> into the first model or eliminate the bypass option altogether and
> always do a separately scheduled callback.
>
> I think the distinction of having main() call y() directly rather than
> x_cb() is potentially important, as this gives you points where you
> can do slightly different things depending on whether you did the
> event or bypassed it. It also (to me) provides some logical
> separation between "what comes next" (the code in y()) and how you got
> there.
>
> Coming at this from a different angle, while the code is getting
> increasingly messy (or maybe just inherently complex), I'd say a
> significant fraction of the complexity is dealing with
> cache/page-crossing memory operations, which I don't think would be
> significantly improved by a global restructuring. (Let me know if
> anyone thinks otherwise.) Thus I'm not too keen on doing a
> significant restructuring since I think the code will still be messy
> afterward.
>
> On Wed, May 6, 2009 at 11:42 AM, Gabriel Michael Black
> <gbl...@eecs.umich.edu <mailto:gbl...@eecs.umich.edu>> wrote:
>
>     The example I mentioned would be if
>     you have a microcode loop that doesn't touch memory to, for instance,
>     stall until you get an interrupt or a countdown expires for a small
>     delay.
>
> Although I agree that it's good to avoid this possibility altogether,
> I'd argue that any microcode loop like you describe is broken. If for
> no other reason than power dissipation I don't think you'd ever want
> to busy-wait in a real system, and certainly even if you did we
> wouldn't want to write it that way in m5 for performance reasons.
>
> Steve

_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev