Hi Joel,

Thank you for trying to fix this; since you say you have already partially
fixed the issue, I'll wait for the final patches.

-- Marco

On 31/08/12 20:52, Joel Hestness wrote:
> Hi Marco,
>   Thanks for sending this.  Based on what I see here, I'm pretty
> confident that one of my new patches will fix this issue.
>  Unfortunately, sending you that patch would only get you to the next
> bug that currently exists in Ruby's draining functionality.
>
>   It appears as though these deeper bugs in Ruby were introduced when
> the gem5 ports were split to introduce queued ports (changeset
> 8914:8c3bd7bea667).  I'm hoping to have this all sorted out over the
> next couple of days, but if you have an urgent need to run simulations,
> you could try running sims from before the queued port change (%
> hg update -r 8913, recompile, run).
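>
>   For reference, the rollback might look something like the following
> (this assumes the standard scons build flow and the MOESI_hammer
> build target used elsewhere in this thread; adjust for your protocol
> and config):
>
>       $> hg update -r 8913
>       $> scons build/X86_MOESI_hammer/gem5.opt
>       $> build/X86_MOESI_hammer/gem5.opt configs/example/ruby_fs.py ...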
>
>   I'll keep you posted on debugging.  Thanks again,
>   Joel
>
>
> On Fri, Aug 31, 2012 at 2:34 PM, Marco Elver <[email protected]> wrote:
>
>     Hi Joel,
>
>     I ran with 1 CPU and 16 CPUs and get essentially the same result.
>
>     Attachments:
>         - gdb-n1.log: Terminal output of gdb session for the 1 CPU case.
>         - gdb-n16.log: Terminal output of gdb session for 16 CPU case.
>         - gem5-n1.log.bz2: Gem5 output for 1 CPU case.
>         - gem5-n16.log.bz2: Gem5 output for 16 CPU case.
>
>     Both of them crash right after printing information about
>     "[...]Got long mode PDP entry[...]".
>
>     I hope the gdb and gem5 output logs are sufficient for you to
>     replicate this bug; my current hg parent is 9181:42807286d6cb,
>     which includes the patch mentioned below.
>
>     -- Marco
>
>
>     On 30/08/12 22:30, Joel Hestness wrote:
>>     Hi Marco,
>>       I'm currently trying to track down bugs in checkpoint restore
>>     to get x86+Ruby+O3CPU working, and I'm having trouble replicating
>>     your bug.  Could you please
>>     compile build/X86_MOESI_hammer/gem5.debug and run the same tests
>>     you have here to grab this backtrace?  Also, can you collect and
>>     restore from checkpoint with a single CPU core and see what happens?
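>>
>>     For reference, the sequence might look something like this
>>     (assuming the standard scons build flow and your existing
>>     ruby_fs.py command line from below):
>>
>>         $> scons build/X86_MOESI_hammer/gem5.debug
>>         $> gdb --args build/X86_MOESI_hammer/gem5.debug \
>>              configs/example/ruby_fs.py ...
>>         (gdb) run
>>         ... wait for the crash ...
>>         (gdb) bt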
>>
>>       Thanks!
>>       Joel
>>
>>
>>     On Wed, Aug 29, 2012 at 5:11 PM, Marco Elver
>>     <[email protected]> wrote:
>>
>>         Thank you, with the patch I can confirm that the assertion
>>         problem has
>>         been fixed (after recreating the checkpoint).
>>
>>         My problems with the O3CPU persist; I was wondering whether
>>         this is a problem specific to X86 or a general problem.
>>
>>         -- Marco
>>
>>         On 28/08/12 21:28, Nilay Vaish wrote:
>>         > The cause of the assert failure was tracked down recently
>>         by Jason
>>         > Power. The patch is on the review board. Here is the link -
>>         > http://reviews.gem5.org/r/1365
>>         >
>>         > It will be committed to the mainline soon.
>>         >
>>         > --
>>         > Nilay
>>         >
>>         >
>>         > On Tue, 28 Aug 2012, Marco Elver wrote:
>>         >
>>         >> Hi all,
>>         >>
>>         >> I would like to ask if what I am trying to do is even
>>         >> possible (and if so, how?), as I have been running into a
>>         >> few problems, despite following the advice I could find in
>>         >> older mailing-list threads or the wiki. My goal is to run a
>>         >> full system with ruby (with MOESI_CMP_directory), multiple
>>         >> processors of type O3CPU, and the X86 ISA; I create a
>>         >> snapshot after the Linux kernel has loaded and before the
>>         >> benchmark enters the ROI.
>>         >>
>>         >> With revision 9174:2171e04a2ee5 (Mon Aug 27 20:53:20 2012
>>         >> -0400) from the dev repository, I tried the following:
>>         >>    (1) Take a checkpoint with ruby_fs, the *MOESI_hammer*
>>         >> protocol (the only one supporting checkpoints, according to
>>         >> the wiki) and the TimingSimpleCPU (succeeds):
>>         >>           $> build/X86_MOESI_hammer/gem5.opt
>>         >> --outdir=m5out/rawdata/fluidanimate/ckpt configs/example/ruby_fs.py
>>         >> -n 16 --cpu-type=timing --kernel=system/x86_64-vmlinux-2.6.28.smp
>>         >> --checkpoint-dir=m5out/checkpoints/fluidanimate --max-checkpoints=1
>>         >> --script=contrib/initscripts/parsec/fluidanimate.sh
>>         >>
>>         >>    (2) Resume from the checkpoint with the O3CPU, restore
>>         >> with the TimingSimpleCPU (fails):
>>         >>           $> build/X86_MOESI_hammer/gem5.opt
>>         >> --outdir=m5out/rawdata/fluidanimate/detailed configs/example/ruby_fs.py
>>         >> -n 16 --cpu-type=detailed --kernel=system/x86_64-vmlinux-2.6.28.smp
>>         >> --checkpoint-dir=m5out/checkpoints/fluidanimate -r 0
>>         >> --restore-with-cpu=timing
>>         >>           [...]
>>         >>           Switch at curTick count:10000
>>         >>           info: Entering event queue @ 0.  Starting simulation...
>>         >>           Runtime Error at MOESI_hammer-dir.sm:1270, Ruby
>>         >>           Time: 1111185: assert failure, PID: 2742
>>         >>           press return to continue.
>>         >>
>>         >>           Program aborted at cycle 555592500
>>         >>
>>         >>    (3) Resuming from the checkpoint with the TimingSimpleCPU
>>         >> fails in the same way as (2); in (2) the CPU isn't even
>>         >> switched to the O3CPU before the failure occurs.
>>         >>
>>         >>    (4) However, if I take a snapshot right after starting
>>         >> the simulator (after ~10000000000 cycles, kernel still
>>         >> booting) and then restore with the TimingSimpleCPU, it works
>>         >> as expected; only the O3CPU fails, with a segfault and the
>>         >> following backtrace:
>>         >>        #0  0x0000000000cdff56 in MasterPort::sendTimingReq
>>         >>            (this=<optimized out>, pkt=0x6f8a060)
>>         >>            at build/X86/mem/port.cc:136
>>         >>        #1  0x00000000005fbac5 in sendTiming (pkt=0x6f8a060,
>>         >>            sendingState=0x61a7cc0, this=0x49a9e60)
>>         >>            at build/X86/arch/x86/pagetable_walker.cc:173
>>         >>        #2  X86ISA::Walker::WalkerState::sendPackets (this=0x61a7cc0)
>>         >>            at build/X86/arch/x86/pagetable_walker.cc:631
>>         >>        #3  0x00000000005fc8c2 in X86ISA::Walker::WalkerState::recvPacket
>>         >>            (this=this@entry=0x61a7cc0, pkt=pkt@entry=0x1e99920)
>>         >>            at build/X86/arch/x86/pagetable_walker.cc:590
>>         >>        #4  0x00000000005fcb98 in X86ISA::Walker::recvTimingResp
>>         >>            (this=0x43706c0, pkt=0x1e99920)
>>         >>            at build/X86/arch/x86/pagetable_walker.cc:129
>>         >>        #5  0x0000000000ce1f5b in PacketQueue::trySendTiming
>>         >>            (this=0x42ba5e0) at build/X86/mem/packet_queue.cc:152
>>         >>        #6  0x0000000000ce2929 in PacketQueue::sendDeferredPacket
>>         >>            (this=0x42ba5e0) at build/X86/mem/packet_queue.cc:190
>>         >>        #7  0x0000000000c391be in EventQueue::serviceOne
>>         >>            (this=<optimized out>) at build/X86/sim/eventq.cc:204
>>         >>        #8  0x0000000000c7d342 in simulate
>>         >>            (num_cycles=9223372036854785807) at build/X86/sim/simulate.cc:71
>>         >>        #9  0x0000000000b8e17c in _wrap_simulate__SWIG_0
>>         >>            (args=<optimized out>)
>>         >>            at build/X86/python/swig/event_wrap.cc:4755
>>         >>        #10 _wrap_simulate (self=<optimized out>, args=<optimized out>)
>>         >>            at build/X86/python/swig/event_wrap.cc:4804
>>         >>        #11 0x00007fb32a094fc6 in PyEval_EvalFrameEx ()
>>         >>            from /lib/libpython2.7.so.1.0
>>         >>
>>         >> Trying to restore with ruby using MOESI_CMP_directory and
>>         >> the TimingSimpleCPU results in the same error as (2), with
>>         >> the difference that it finishes loading the checkpoint and
>>         >> resumes, but then fails after about a minute ("Runtime Error
>>         >> at MOESI_CMP_directory-dir.sm:485, Ruby Time: 12038425921:
>>         >> assert failure, PID: 19169"). Using the O3CPU still results
>>         >> in the same error as (4).
>>         >>
>>         >> In addition, I have seen workflows of: 1) create a
>>         >> checkpoint without ruby and with the AtomicSimpleCPU; 2)
>>         >> load the checkpoint with ruby and the TimingSimpleCPU. I
>>         >> tried this, and it works if I set --restore-with-cpu=timing.
>>         >> But trying this with the O3CPU doesn't work, resulting in
>>         >> the same backtrace as (4).
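>>         >>
>>         >> Concretely, that two-step workflow looks roughly like the
>>         >> following (exact flags assumed, following the commands
>>         >> above; fs.py is the non-ruby config script):
>>         >>     $> build/X86_MOESI_hammer/gem5.opt configs/example/fs.py
>>         >>        -n 16 --cpu-type=atomic --max-checkpoints=1 ...
>>         >>     $> build/X86_MOESI_hammer/gem5.opt
>>         >>        configs/example/ruby_fs.py -n 16 --cpu-type=timing
>>         >>        --restore-with-cpu=timing -r 0 ...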
>>         >>
>>         >> Is what I'm trying to do possible? If so, are there any
>>         >> workarounds I should know of?
>>         >>
>>         >> Thanks,
>>         >> Marco
>>         >>
>>         >>
>>         >> --
>>         >> The University of Edinburgh is a charitable body,
>>         registered in
>>         >> Scotland, with registration number SC005336.
>>         >>
>>         >> _______________________________________________
>>         >> gem5-users mailing list
>>         >> [email protected] <mailto:[email protected]>
>>         >> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>         >>
>>
>>
>>
>>
>>
>>
>>     -- 
>>       Joel Hestness
>>       PhD Student, Computer Architecture
>>       Dept. of Computer Science, University of Wisconsin - Madison
>>       Dept. of Computer Science, University of Texas - Austin
>>       http://www.cs.utexas.edu/~hestness
>>
>>
>
>
>
>
>     -- 
>       Joel Hestness
>       PhD Student, Computer Architecture
>       Dept. of Computer Science, University of Wisconsin - Madison
>       http://www.cs.utexas.edu/~hestness
>
>
>
> _______________________________________________
> gem5-users mailing list
> [email protected]
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
