Here is some more information about the error:

It looks like, while restoring from a checkpoint, the script always starts
off with the "simple-atomic" CPU (check the setCPUClass() routine) and then
switches to the other CPU (timing, O3, or inorder). Therefore,
AtomicSimpleCPU::unserialize() is called on a checkpoint that was created
by one of the other CPUs, which is causing that error.
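
One way to see the mismatch is to compare the keys stored in the
[system.cpu] section of each checkpoint. Below is a minimal Python sketch,
assuming m5.cpt can be parsed as a plain INI file. The sample strings hold
only the lines shown in the quoted message below (the elided "..." parts
are left out), and the helper name section_keys is made up for illustration.

```python
import configparser

# Only the [system.cpu] lines quoted in this thread; real checkpoints
# contain more sections and keys.
ATOMIC_CPT = "[system.cpu]\nso_state=2\nlocked=false\n_status=1\n"
TIMING_CPT = "[system.cpu]\nso_state=2\n_status=1\n"
O3_CPT = "[system.cpu]\nso_state=2\n"


def section_keys(cpt_text, section="system.cpu"):
    """Return the set of keys stored in one section of a checkpoint."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case as written, e.g. "_status"
    parser.read_string(cpt_text)
    return set(parser.options(section))


# Keys the simple-atomic CPU wrote that the other CPUs did not.
missing_in_timing = section_keys(ATOMIC_CPT) - section_keys(TIMING_CPT)
missing_in_o3 = section_keys(ATOMIC_CPT) - section_keys(O3_CPT)
print(sorted(missing_in_timing))
print(sorted(missing_in_o3))
```

Any key in these difference sets is one that the atomic CPU's unserialize
would look for and not find in a checkpoint written by the other CPU.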

But when I altered the script to change that, the simulation does not
proceed after restoration and, thus, exits:
command line: /home/sroy-local/research/m5-arm/build/ALPHA_SE/m5.debug
--outdir=./m5out/simple-timing/50rest configs/example/
--bench=gcc_integrate --timing
--checkpoint-dir=./m5out/simple-timing/50drain --checkpoint-restore=50
--at-instruction --max-inst=1000
Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
Restoring checkpoint ...
Restoring from checkpoint
warn: optional parameter system.cpu.workload:M5_pid not present
For more information see:
info: Entering event queue @ 2162000.  Starting simulation...
Exiting @ cycle 9223372036854775807 because simulate() limit reached

So, I am not sure whether the problem is in Python or C++.
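
One observation on the log above: the exit tick is exactly the largest
signed 64-bit value. My reading (an assumption, not something I have
confirmed in the code) is that simulate() ran to its upper bound instead
of stopping at --max-inst. A quick check of the arithmetic in Python:

```python
# Tick reported on the "Exiting @ cycle ..." line of the log above.
exit_tick = 9223372036854775807

# Largest value representable in a signed 64-bit integer.
max_int64 = 2**63 - 1

print(exit_tick == max_int64)
```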


On Sat, Feb 6, 2010 at 4:46 PM, soumyaroop roy <> wrote:
> I just stumbled upon a problem while taking a checkpoint in O3 and
> then restoring from it. The problem also exists with simple-timing!
> Here is what taking a checkpoint at instruction #100 on gcc_integrate
> looks like:
> command line: build/ALPHA_SE/
> --outdir=./m5out/o3-timing/100drain configs/example/
> --bench=gcc_integrate --detailed --caches --l2cache
> --take-checkpoint=100 --at-instruction
> Global frequency set at 1000000000000 ticks per second
> 0: system.remote_gdb.listener: listening for remote gdb on port 7000
> Creating checkpoint at inst:100
> info: Entering event queue @ 0.  Starting simulation...
> info: Increasing stack size by one page.
> hack: be nice to actually delete the event here
> exit cause = a thread reached the max instruction count
> info: Entering event queue @ 1193000.  Starting simulation...
> Writing checkpoint
> Checkpoint written.
> Exiting @ cycle 1250000 because a thread reached the max instruction count
> Here is the error that results while resuming from the same checkpoint:
> command line: build/ALPHA_SE/
> --outdir=./m5out/o3-timing/100resume configs/example/
> --bench=gcc_integrate --detailed --caches --l2cache
> --checkpoint-dir=./m5out/o3-timing/100drain --checkpoint-restore=100
> --at-instruction --max-inst=100
> Global frequency set at 1000000000000 ticks per second
> 0: system.remote_gdb.listener: listening for remote gdb on port 7000
> Restoring checkpoint ...
> Restoring from checkpoint
> fatal: Can't unserialize 'system.cpu:locked'
>  @ cycle 1250000
> [paramIn:build/ALPHA_SE/sim/, line 203]
> Memory Usage: 585384 KBytes
> For more information see:
> Inspecting the m5.cpt files reveals the problem:
> for simple-atomic:
> ...
> [system.cpu]
> so_state=2
> locked=false
> _status=1
> ...
> for simple-timing:
> ...
> [system.cpu]
> so_state=2
> _status=1
> for o3-timing:
> ...
> [system.cpu]
> so_state=2
> ...
> How do I fix it?
> regards,
> Soumyaroop
> On Sat, Feb 6, 2010 at 2:39 PM, soumyaroop roy <> wrote:
>> On Sat, Feb 6, 2010 at 12:22 PM, Steve Reinhardt <> wrote:
>>> On Sat, Feb 6, 2010 at 9:08 AM, soumyaroop roy <> wrote:
>>>> Let me rephrase my last question:
>>>> Is there any way that a comparison could be performed between the
>>>> checkpoint output by inorder and that output by a simple CPU? Say, if
>>>> the number of committed instructions in both is the same when the
>>>> checkpoints are dumped, should I expect the register and memory
>>>> state for both CPUs to be identical?
>>> I would expect so, as long as the inorder pipeline is drained and only
>>> truly architectural state is checkpointed.
>> Actually, earlier I was trying to compare the same between O3 and
>> simple-atomic, but instead of paying attention to
>> system.cpu.commit.commitCommittedInsts I was looking at
>> system.cpu.commitInsts, and that's why I wasn't seeing identical
>> architectural register file contents between the two. Does that seem
>> reasonable? The physical memory dumps are not identical though.
>> Similar observations with inorder.
>>>> In any case, what kind of testing strategy should be put in place to
>>>> test the correctness of checkpointing apart from just taking
>>>> checkpoints for programs/benchmarks, resuming from those checkpoints,
>>>> running the programs till completion, and finally verifying the
>>>> correctness of the outputs?
>>> We've had discussions on this before, though I don't recall all the
>>> details of what we've come up with.  Certainly the basic idea of
>>> taking a checkpoint, restoring from that checkpoint, and then
>>> comparing with an execution that didn't stop is a common one.  You
>>> wouldn't have to run all the way to the end; you could dump statistics
>>> and perhaps another checkpoint at a common point further on but not
>>> all the way at completion.  There is an issue in that the drain
>>> operation will perturb the state slightly, so you can't expect the
>>> results to be identical unless you either (1) artificially induce a
>>> drain without a checkpoint at the same spot in the non-checkpointed
>>> execution or (2) do a stats reset at a common point in both
>>> simulations after the checkpoint restore.  The latter seems a lot
>>> simpler to me.
>> Thanks, I will try this. Is there an option in any of the scripts to
>> reset stats at an instruction?
>>> Steve
>> --
>> Soumyaroop Roy
>> Ph.D. Candidate
>> Department of Computer Science and Engineering
>> University of South Florida, Tampa

Soumyaroop Roy
Ph.D. Candidate
Department of Computer Science and Engineering
University of South Florida, Tampa