Re: [m5-dev] Implementing checkpointing for inorder

soumyaroop roy Sat, 06 Feb 2010 13:46:33 -0800

I just stumbled upon a problem while taking a checkpoint in O3 and
then restoring from it. The problem also exists with simple-timing!


Here is what taking a checkpoint at instruction #100 on gcc_integrate
looks like:

command line: build/ALPHA_SE/m5.fast
--outdir=./m5out/o3-timing/100drain configs/example/se.py
--bench=gcc_integrate --detailed --caches --l2cache
--take-checkpoint=100 --at-instruction
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb on port 7000
Creating checkpoint at inst:100
info: Entering event queue @ 0.  Starting simulation...
info: Increasing stack size by one page.
hack: be nice to actually delete the event here
exit cause = a thread reached the max instruction count
info: Entering event queue @ 1193000.  Starting simulation...
Writing checkpoint
Checkpoint written.
Exiting @ cycle 1250000 because a thread reached the max instruction count

Here is the error that results while resuming from the same checkpoint:

command line: build/ALPHA_SE/m5.fast
--outdir=./m5out/o3-timing/100resume configs/example/se.py
--bench=gcc_integrate --detailed --caches --l2cache
--checkpoint-dir=./m5out/o3-timing/100drain --checkpoint-restore=100
--at-instruction --max-inst=100
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb on port 7000
Restoring checkpoint ...
Restoring from checkpoint
fatal: Can't unserialize 'system.cpu:locked'
 @ cycle 1250000
[paramIn:build/ALPHA_SE/sim/serialize.cc, line 203]
Memory Usage: 585384 KBytes
For more information see: http://www.m5sim.org/fatal/60de9f5a

Inspecting the m5.cpt files reveals the problem:
for simple-atomic:
...
[system.cpu]
so_state=2
locked=false
_status=1
...

for simple-timing:
...
[system.cpu]
so_state=2
_status=1

for o3-timing:
...
[system.cpu]
so_state=2
...
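The error says the restore looked up `locked` in `[system.cpu]` and did not find it. A quick way to list every key a restore might expect but not find is to diff the key sets of the same section in two checkpoints. This is a minimal sketch that assumes m5.cpt is plain INI text that Python's `configparser` can read. That is an assumption: the indented excerpts above would need their leading spaces stripped first, because `configparser` treats indented lines as continuations. `load_checkpoint` and `missing_keys` are hypothetical helper names, not part of M5.

```python
import configparser


def load_checkpoint(text):
    """Parse checkpoint text, assumed to be INI-style like m5.cpt."""
    # interpolation=None: treat '%' in values literally.
    # strict=False: tolerate duplicate sections/keys instead of raising.
    cp = configparser.ConfigParser(interpolation=None, strict=False)
    cp.optionxform = str  # preserve key case exactly as written
    cp.read_string(text)
    return cp


def missing_keys(reference, candidate, section):
    """Sorted keys present in reference[section] but absent from candidate[section]."""
    ref = set(reference[section]) if reference.has_section(section) else set()
    cand = set(candidate[section]) if candidate.has_section(section) else set()
    return sorted(ref - cand)


if __name__ == "__main__":
    # The [system.cpu] excerpts quoted above (elided "..." lines omitted).
    atomic = load_checkpoint("[system.cpu]\nso_state=2\nlocked=false\n_status=1\n")
    timing = load_checkpoint("[system.cpu]\nso_state=2\n_status=1\n")
    o3 = load_checkpoint("[system.cpu]\nso_state=2\n")
    print(missing_keys(atomic, timing, "system.cpu"))
    print(missing_keys(atomic, o3, "system.cpu"))
```

On the excerpts shown, this reports `locked` missing from the simple-timing checkpoint, and both `_status` and `locked` missing from the O3 one. That is consistent with the restore path reading keys the checkpointing CPU never wrote.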

How do I fix it?

regards,
Soumyaroop

On Sat, Feb 6, 2010 at 2:39 PM, soumyaroop roy <s...@cse.usf.edu> wrote:
> On Sat, Feb 6, 2010 at 12:22 PM, Steve Reinhardt <ste...@gmail.com> wrote:
>> On Sat, Feb 6, 2010 at 9:08 AM, soumyaroop roy <s...@cse.usf.edu> wrote:
>>> Let me rephrase my last question:
>>> Is there any way that a comparison could be performed between the
>>> checkpoint output by inorder with that output by a simple CPU? Say, if
>>> the number of committed instructions in both are the same when the
>>> checkpoints are dumped, should I expect that the register and memory
>>> state for both CPUs should be identical?
>>
>> I would expect so, as long as the inorder pipeline is drained and only
>> truly architectural state is checkpointed.
> Actually, earlier I was trying to compare the same between O3 and
> simple-atomic and instead of paying attention to
> system.cpu.commit.commitCommittedInsts I was looking at
> system.cpu.commitInsts and that's why I wasn't seeing identical
> architectural register file contents between the two. Does that seem
> reasonable? The physical memory dumps are not identical though.
> Similar observations with inorder.
>
>>
>>>
>>> In any case, what kind of testing strategy should be put in place to
>>> test the correctness of checkpointing apart from just taking
>>> checkpoints for programs/benchmarks, resuming from those checkpoints,
>>> running the programs till completion, and finally verifying the
>>> correctness of the outputs?
>>
>> We've had discussions on this before, though I don't recall all the
>> details of what we've come up with.  Certainly the basic idea of
>> taking a checkpoint, restoring from that checkpoint, and then
>> comparing with an execution that didn't stop is a common one.  You
>> wouldn't have to run all the way to the end; you could dump statistics
>> and perhaps another checkpoint at a common point further on but not
>> all the way at completion.  There is an issue in that the drain
>> operation will perturb the state slightly, so you can't expect the
>> results to be identical unless you either (1) artificially induce a
>> drain without a checkpoint at the same spot in the non-checkpointed
>> execution or (2) do a stats reset at a common point in both
>> simulations after the checkpoint restore.  The latter seems a lot
>> simpler to me.
> Thanks, I will try this. Is there an option in any of the scripts to
> reset stats at an instruction?
>
>>
>> Steve
>> _______________________________________________
>> m5-dev mailing list
>> m5-dev@m5sim.org
>> http://m5sim.org/mailman/listinfo/m5-dev
>>
>
>
>
> --
> Soumyaroop Roy
> Ph.D. Candidate
> Department of Computer Science and Engineering
> University of South Florida, Tampa
> http://www.csee.usf.edu/~sroy
>
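The quoted exchange asks about comparing architectural state between checkpoints taken by different CPU models at the same committed-instruction count. A value-level diff over the sections both checkpoints share is one reasonable first pass. This is a sketch under the same INI assumption as before. The section names used to exercise it (`arch.regs`, `uarch`) are hypothetical placeholders, not real M5 section names. It compares only the text checkpoint, not any separately stored memory image. Following Steve's point that draining perturbs state, passing only the truly architectural sections via `sections` avoids flagging microarchitectural values that are expected to differ.

```python
import configparser


def load_checkpoint(text):
    """Parse checkpoint text, assumed to be INI-style like m5.cpt."""
    cp = configparser.ConfigParser(interpolation=None, strict=False)
    cp.optionxform = str  # preserve key case exactly as written
    cp.read_string(text)
    return cp


def diff_values(a, b, sections=None):
    """Return (section, key, value_a, value_b) for keys defined in both
    checkpoints whose raw string values differ.

    sections: optional iterable restricting the comparison to those section
    names (e.g. only architectural state). Keys present in just one
    checkpoint are ignored here; see a key-set diff for those.
    """
    common = set(a.sections()) & set(b.sections())
    if sections is not None:
        common &= set(sections)
    diffs = []
    for section in sorted(common):
        for key in sorted(set(a[section]) & set(b[section])):
            va, vb = a[section][key], b[section][key]
            if va != vb:
                diffs.append((section, key, va, vb))
    return diffs


if __name__ == "__main__":
    # Hypothetical checkpoints; real section/key names will differ.
    a = load_checkpoint("[arch.regs]\nr1=5\nr2=7\n[uarch]\nfetch_seq=10\n")
    b = load_checkpoint("[arch.regs]\nr1=5\nr2=9\n[uarch]\nfetch_seq=12\n")
    for d in diff_values(a, b, sections=["arch.regs"]):
        print(d)
```

Values are compared as raw strings, so formatting differences (e.g. `0x10` vs `16`) would show up as mismatches. Normalizing them is left to whoever knows the real value formats.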



-- 
Soumyaroop Roy
Ph.D. Candidate
Department of Computer Science and Engineering
University of South Florida, Tampa
http://www.csee.usf.edu/~sroy