Re: [m5-dev] Implementing checkpointing for inorder

Korey Sewell Sun, 04 Jul 2010 14:36:25 -0700

> Why are you trying to checkpoint the InOrderCPU?  Wouldn't it be
> better to implement the switchover from SimpleCPU to InOrder?  You
> can't checkpoint caches right now, so it doesn't seem worthwhile to
> checkpoint inorder.
>
If I'm not mistaken, Soumyaroop had this "almost" working in the past and
wants to get all of his M5 work contributed while he can. So I would say
that even if it's not immediately useful, getting checkpointing to work
could be the foundation for a bigger M5 contribution (say, if/when caches
can be checkpointed).


Also, even without the caches being checkpointed (and subsequently warmed
up), wouldn't checkpointing still be useful in that you at least have the
memory and CPU state at some point far along in your simulation? So if
you're interested in something a few billion cycles into the simulation,
there would still be a benefit in not having to simulate those X billion
cycles, and in warming up from that point instead of from the program
origin (assuming that SimPoints or some other workload approximation isn't
applicable).

With that said, I do agree that the switchover is probably the most useful
option, but if the checkpointing is "close" then I would say definitely go
for it in terms of implementation.



>
>  Nate
>
> On Sat, Jul 3, 2010 at 1:56 PM, soumyaroop roy <s...@cse.usf.edu> wrote:
> > Hello there:
> >
> > I am revisiting an earlier suspended effort to implement checkpointing
> > for the inorder cpu and I am currently debugging a problem (for the
> > case of a uniprocessor and no multithreading). Let me describe the
> > problem here.
> >
> > I am using the hello world program. I am taking a checkpoint at
> > instruction 100 (by specifying --take-checkpoint=100 --at-instruction)
> > and then restoring from there and running another 100 instructions. I
> > generated a trace of ONLY the retired instructions from a separate run
> > of the inorder cpu that retires 200 instructions and compared that
> > trace with the traces generated by the checkpointing and checkpoint
> > restoration steps. I see that there is a bug in the simulation of the
> > 76th instruction after restoration of the program (a load instruction
> > loads a 1 instead of a 0) that causes the problem.
> >
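The trace comparison described above can be sketched as follows. This is a minimal illustration, not M5 tooling: the function name first_divergence and the sample trace lines are hypothetical, and it assumes each trace has already been reduced to one comparable string per retired instruction (for example with tick timestamps stripped).

```python
from itertools import zip_longest


def first_divergence(reference, candidate):
    """Return (index, ref_line, cand_line) for the first mismatch, or None."""
    for i, (ref, cand) in enumerate(zip_longest(reference, candidate)):
        if ref != cand:
            return i, ref, cand
    return None


# Hypothetical retired-instruction traces, one string per instruction.
reference = ["addq r1", "ldq r2 = 0", "stq r3"]   # uninterrupted run
before_ckpt = ["addq r1"]                         # run that takes the checkpoint
after_restore = ["ldq r2 = 1", "stq r3"]          # run restored from it

print(first_divergence(reference, before_ckpt + after_restore))
```

Concatenating the pre-checkpoint and post-restore traces before comparing keeps the reported index in terms of the uninterrupted run, which makes it easier to map a mismatch back to a specific instruction.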
> > Now, this is my understanding of how a checkpoint is taken. Please
> > correct me if I am wrong. I noted that when checkpointing is specified
> > with these options: "--take-checkpoint=N --at-instruction", the
> > max_insts_any_thread for the cpu is set to N which sets up a
> > termination event in the committed instructions queue,
> > comInstEventQueue (let's consider a uniprocessor and no
> > multithreading). After each instruction is retired the events from
> > this queue are serviced. So, when N instructions have been committed,
> > the drain() routine is called. The simulation is exited subsequently.
> > Then the writing of the checkpoint is directed by the python script,
> > Simulation.py. The serialize() routine should be called before the
> > simulation is exited, right? Also, the total number of retired
> > instructions can be more than N eventually, right?
> >
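A toy model of the mechanism described above, to show one way the retired count could end up above N. Everything here is hypothetical (ToyCPU, commit_width, the "exit" label); it is not M5 code, and it assumes that an event fired mid-cycle only stops the run at the end of that cycle.

```python
import heapq


class ToyCPU:
    """Toy stand-in for a CPU with a committed-instruction event queue."""

    def __init__(self):
        self.committed = 0
        self.events = []   # min-heap of (instruction_count, label)
        self.fired = []    # labels of events that have been serviced

    def schedule(self, inst_count, label):
        heapq.heappush(self.events, (inst_count, label))

    def commit_one(self):
        self.committed += 1
        # Service every event whose instruction threshold has been reached.
        while self.events and self.events[0][0] <= self.committed:
            self.fired.append(heapq.heappop(self.events)[1])

    def run(self, commit_width=1):
        # Assumption: the run stops only at the end of the cycle in which
        # an event fired, so the rest of that commit group still retires.
        while not self.fired:
            for _ in range(commit_width):
                self.commit_one()
        return self.committed


cpu = ToyCPU()
cpu.schedule(3, "exit")            # analogous to asking for N = 3
print(cpu.run(commit_width=2))     # retires 4 instructions in this toy model
```

With a commit width of 1 the same model stops at exactly N, so any overshoot in this sketch comes purely from the assumed end-of-cycle stop.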
> > Here is another observation which is a bit confusing to me. I traced
> > the routines that are called during O3's checkpointing and the
> > resume() routine is called when the checkpoint is taken (after drain()
> > and serialize() routines). Why is this happening? Shouldn't resume()
> > be called while restoring from a checkpoint after the unserialize()
> > routine is called?
> >
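One plausible reading of the call order observed above, sketched as a toy: the run that takes the checkpoint may keep simulating afterwards, so objects that were drained for the save have to be resumed there as well. The class and function names below are hypothetical and only mirror the routine names mentioned in the thread; this should be checked against the actual M5 checkpoint code.

```python
class ToySimObject:
    """Toy stand-in for a simulated object; not M5's SimObject class."""

    def __init__(self):
        self.state = {"pc": 0}
        self.calls = []   # order in which the lifecycle hooks ran

    def drain(self):
        # Let in-flight work finish so the state is safe to save.
        self.calls.append("drain")

    def serialize(self):
        self.calls.append("serialize")
        return dict(self.state)

    def unserialize(self, saved):
        self.calls.append("unserialize")
        self.state = dict(saved)

    def resume(self):
        # Leave the drained state and start executing again.
        self.calls.append("resume")


def take_checkpoint(obj):
    obj.drain()
    saved = obj.serialize()
    obj.resume()          # the original run may continue after the save
    return saved


def restore_checkpoint(saved):
    obj = ToySimObject()
    obj.unserialize(saved)
    obj.resume()
    return obj


src = ToySimObject()
src.state["pc"] = 100
dst = restore_checkpoint(take_checkpoint(src))
print(src.calls, dst.calls)
```

In this sketch resume() runs on both paths: after serialize() in the run that takes the checkpoint, and after unserialize() in the run that restores from it.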
> > regards,
> > Soumyaroop
> >
> >
> > On Fri, Feb 12, 2010 at 12:05 PM, Korey Sewell <ksew...@umich.edu> wrote:
> >>> But fixing the two items above did not solve the problem. I figured
> >>> (from the takeoverfrom() routines) that commit stage needs to reset
> >>> its flags so that it does not go and squash the first instruction
> >>> where the restoration is supposed to start from. Since I am not very
> >>> familiar with the O3 code, I did not spend much time looking into it.
> >>
> >> I'm assuming O3 doesn't get to commit 1 instruction, because it's
> >> immediately squashed as soon as you restore from checkpoint?
> >>
> >>
> >>>
> >>> So, now I am seeing inorder proceed to about 100 instructions, after
> >>> which the PC is set to 0x0 (following a squash). I have to look into
> >>> it later. Which trace flags should I use to see the actual
> >>> instructions?
> >>
> >> "Exec" if you want just the committed instructions
> >>
> >> --
> >> - Korey
> >>
> >> _______________________________________________
> >> m5-dev mailing list
> >> m5-dev@m5sim.org
> >> http://m5sim.org/mailman/listinfo/m5-dev
> >>
> >>
> >
> >
> >
> > --
> > Soumyaroop Roy
> > Ph.D. Candidate
> > Department of Computer Science and Engineering
> > University of South Florida, Tampa
> > http://www.csee.usf.edu/~sroy



-- 
- Korey

