Hi Soumyaroop,

As you realized, you need to restore a checkpoint with the same CPU
model that created it; if a different CPU model is desired, switch
CPUs after the restoration. Perhaps all the CPU models should
serialize the same checkpoint state, but that isn't how things are
implemented today, and it limits the kinds of checkpointing that can
be done. Thoughts, anyone?
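
If it helps, here is a rough sketch of the ordering I mean, modeled on
what configs/common/Simulation.py already does for CPU switching. The
names (root, testsys, switch_cpus, cpt_dir) and the exact
m5.instantiate()/m5.restoreCheckpoint()/m5.switchCpus() signatures are
assumptions you'd need to check against your tree; this is the idea,
not a tested recipe:

  import m5

  # Sketch only: restore with the same CPU model that wrote the
  # checkpoint, then switch to the model you actually want.
  m5.instantiate(root)
  m5.restoreCheckpoint(root, cpt_dir)   # assumed call; restores into the
                                        # CPU model that created the cpt

  # Pair each restored CPU with the desired detailed CPU, the same way
  # setCPUClass()/Simulation.py builds its switch list.
  switch_cpu_list = [(testsys.cpu[i], switch_cpus[i])
                     for i in xrange(len(testsys.cpu))]
  m5.switchCpus(switch_cpu_list)

  exit_event = m5.simulate()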

As for the second problem you hit once you changed the restoration
code, there isn't a clear answer. You'll need to enable a bunch of
trace flags after the restoration and see what is or isn't happening.
A restored checkpoint should behave exactly the same as a system that
simply continued executing past the point where the checkpoint was
created.
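
For example, something along these lines (the flag names are just
illustrative; pick whichever flags cover the area you suspect and
compare the output against a run that was never checkpointed):

build/ALPHA_SE/m5.debug --trace-flags=Exec,Cache
--outdir=./m5out/o3-timing/100resume configs/example/se.py
--bench=gcc_integrate --detailed --caches --l2cache
--checkpoint-dir=./m5out/o3-timing/100drain --checkpoint-restore=100
--at-instruction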

Thanks,
Ali


On Feb 8, 2010, at 11:41 AM, soumyaroop roy wrote:

> I thought it would be wiser to start a new thread for this issue:
>
> I just stumbled upon a problem while taking a checkpoint with O3 and
> then restoring from it. The problem also exists with simple-timing!
>
> Here is what taking a checkpoint at instruction #100 on gcc_integrate
> looks like:
>
> command line: build/ALPHA_SE/m5.fast
> --outdir=./m5out/o3-timing/100drain configs/example/se.py
> --bench=gcc_integrate --detailed --caches --l2cache
> --take-checkpoint=100 --at-instruction
> Global frequency set at 1000000000000 ticks per second
> 0: system.remote_gdb.listener: listening for remote gdb on port 7000
> Creating checkpoint at inst:100
> info: Entering event queue @ 0.  Starting simulation...
> info: Increasing stack size by one page.
> hack: be nice to actually delete the event here
> exit cause = a thread reached the max instruction count
> info: Entering event queue @ 1193000.  Starting simulation...
> Writing checkpoint
> Checkpoint written.
> Exiting @ cycle 1250000 because a thread reached the max instruction count
>
> Here is the error that results while resuming from the same  
> checkpoint:
>
> command line: build/ALPHA_SE/m5.fast
> --outdir=./m5out/o3-timing/100resume configs/example/se.py
> --bench=gcc_integrate --detailed --caches --l2cache
> --checkpoint-dir=./m5out/o3-timing/100drain --checkpoint-restore=100
> --at-instruction --max-inst=100
> Global frequency set at 1000000000000 ticks per second
> 0: system.remote_gdb.listener: listening for remote gdb on port 7000
> Restoring checkpoint ...
> Restoring from checkpoint
> fatal: Can't unserialize 'system.cpu:locked'
> @ cycle 1250000
> [paramIn:build/ALPHA_SE/sim/serialize.cc, line 203]
> Memory Usage: 585384 KBytes
> For more information see: http://www.m5sim.org/fatal/60de9f5a
>
> Inspecting the m5.cpt files gave me an idea about the problem:
> for simple-atomic:
> ...
> [system.cpu]
> so_state=2
> locked=false
> _status=1
> ...
>
> for simple-timing:
> ...
> [system.cpu]
> so_state=2
> _status=1
>
> for o3-timing:
> ....
> [system.cpu]
> so_state=2
> ...
>
> So, I was able to gather some more information about the error:
>
> It looks like, while restoring from a checkpoint, the script always
> starts off with the "simple-atomic" CPU (see the setCPUClass() routine
> in Simulation.py) and then switches to the other CPU (timing, O3, or
> InOrder). Therefore, AtomicSimpleCPU::unserialize() is called on a
> checkpoint that was created by one of the other CPUs, which is what
> causes that error!
>
> But when I altered the script to change that, the simulation makes no
> progress after restoration and simply exits:
> command line: /home/sroy-local/research/m5-arm/build/ALPHA_SE/m5.debug
> --outdir=./m5out/simple-timing/50rest configs/example/se.py
> --bench=gcc_integrate --timing
> --checkpoint-dir=./m5out/simple-timing/50drain --checkpoint-restore=50
> --at-instruction --max-inst=1000
> Global frequency set at 1000000000000 ticks per second
> 0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
> Restoring checkpoint ...
> Restoring from checkpoint
> warn: optional parameter system.cpu.workload:M5_pid not present
> For more information see: http://www.m5sim.org/warn/aa78cda1
> Done.
> **** REAL SIMULATION ****
> info: Entering event queue @ 2162000.  Starting simulation...
> Exiting @ cycle 9223372036854775807 because simulate() limit reached
>
> So, I am not sure whether the problem is in the Python script, in the
> C++, or in both.
>
> Any ideas?
>
> regards,
> Soumyaroop
>
> -- 
> Soumyaroop Roy
> Ph.D. Candidate
> Department of Computer Science and Engineering
> University of South Florida, Tampa
> http://www.csee.usf.edu/~sroy
>

_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
