Re: [gem5-users] Checkpointing possible with Ruby, X86, TimingSimpleCPU and O3CPU?

Joel Hestness Fri, 31 Aug 2012 12:52:26 -0700

Hi Marco,
  Thanks for sending this.  Based on what I see here, I'm pretty confident
that one of my new patches will fix this issue.  Unfortunately, sending you
that patch would only get you to the next bug that currently exists in
Ruby's draining functionality.


  It appears as though these deeper bugs in Ruby were introduced when the
gem5 ports were split to introduce queued ports (changeset
8914:8c3bd7bea667).  I'm hoping to have this all sorted out over the next
couple days, but if you have an urgent need to run simulations, you could
try running sims from a revision before the queued port change (% hg update
-r 8913, recompile, run).
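
  As a rough sketch of that workaround, the three steps (update, recompile,
run) could be scripted as below. The revision number and build directory
come from this thread. The scons build invocation and the helper names
(rollback_commands, execute) are assumptions for illustration only, and the
run arguments are an abbreviated copy of the restore command quoted further
down. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Assumptions: revision 8913 and the X86_MOESI_hammer build directory are
# taken from this thread; "scons <target>" is assumed as the build command.
def rollback_commands(run_cmd, revision="8913",
                      target="build/X86_MOESI_hammer/gem5.opt"):
    """Return the argv lists for the three steps: update, recompile, run."""
    return [
        ["hg", "update", "-r", revision],  # check out the pre-queued-port tree
        ["scons", target],                 # rebuild the simulator binary
        [target] + list(run_cmd),          # re-run the simulation
    ]

def execute(commands, dry_run=True):
    """Print each command; actually run it only when dry_run is False."""
    for argv in commands:
        print(" ".join(shlex.quote(a) for a in argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failure

if __name__ == "__main__":
    # Hypothetical run arguments, abbreviated from the restore command
    # quoted in this thread.
    run_args = ["configs/example/ruby_fs.py", "-n", "16",
                "--cpu-type=detailed",
                "--checkpoint-dir=m5out/checkpoints/fluidanimate",
                "-r", "0", "--restore-with-cpu=timing"]
    execute(rollback_commands(run_args))
```

  Calling execute(..., dry_run=False) would run the steps in order and stop
at the first one that fails.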

  I'll keep you posted on debugging.  Thanks again,
  Joel


On Fri, Aug 31, 2012 at 2:34 PM, Marco Elver <[email protected]> wrote:

>  Hi Joel,
>
> I ran with 1 CPU and 16 CPUs and get essentially the same result.
>
> Attachments:
>     - gdb-n1.log: Terminal output of gdb session for the 1 CPU case.
>     - gdb-n16.log: Terminal output of gdb session for 16 CPU case.
>     - gem5-n1.log.bz2: Gem5 output for 1 CPU case.
>     - gem5-n16.log.bz2: Gem5 output for 16 CPU case.
>
> Both of them crash right after printing information about "[...]Got long
> mode PDP entry[...]".
>
> I hope the gdb and gem5 output logs are sufficient for you to replicate
> this bug; my current hg parent is 9181:42807286d6cb, when the patch
> mentioned below was applied.
>
> -- Marco
>
>
> On 30/08/12 22:30, Joel Hestness wrote:
>
> Hi Marco,
>   I'm currently trying to track down bugs in checkpoint restore to get
> x86+Ruby+O3CPU working, and I'm having trouble replicating your bug.  Could
> you please compile build/X86_MOESI_hammer/gem5.debug and run the same
> tests you have here to grab this backtrace?  Also, can you collect and
> restore from checkpoint with a single CPU core and see what happens?
>
>    Thanks!
>   Joel
>
>
> On Wed, Aug 29, 2012 at 5:11 PM, Marco Elver <[email protected]> wrote:
>
>> Thank you, with the patch I can confirm that the assertion problem has
>> been fixed (after recreating the checkpoint).
>>
>> My problems with the O3CPU persist. Is this problem specific to X86, or
>> is it a general one?
>>
>> -- Marco
>>
>> On 28/08/12 21:28, Nilay Vaish wrote:
>> > The cause of the assert failure was tracked down recently by Jason
>> > Power. The patch is on the review board. Here is the link -
>> > http://reviews.gem5.org/r/1365
>> >
>> > It will be committed to the mainline soon.
>> >
>> > --
>> > Nilay
>> >
>> >
>> > On Tue, 28 Aug 2012, Marco Elver wrote:
>> >
>> >> Hi all,
>> >>
>> >> I would like to ask if what I am trying to do is even possible (and if
>> >> so, how??), as I have been running into a few problems, despite
>> >> following the advice I could find in older mailing-list threads or the
>> >> wiki. My goal would be to run a full-system with ruby (with
>> >> MOESI_CMP_directory), multiple processors of type O3CPU and the X86
>> >> ISA; I create a snapshot after the Linux kernel loaded and before the
>> >> benchmark enters the ROI.
>> >>
>> >> With revision 9174:2171e04a2ee5 (Mon Aug 27 20:53:20 2012 -0400) from
>> >> the dev repository, I tried the following:
>> >>    (1) Take a checkpoint with ruby_fs, the *MOESI_hammer* protocol
>> >> (only one supporting checkpoints, according to Wiki) and the
>> >> TimingSimpleCPU (succeeds):
>> >>           $> build/X86_MOESI_hammer/gem5.opt
>> >> --outdir=m5out/rawdata/fluidanimate/ckpt configs/example/ruby_fs.py -n
>> >> 16 --cpu-type=timing --kernel=system/x86_64-vmlinux-2.6.28.smp
>> >> --checkpoint-dir=m5out/checkpoints/fluidanimate --max-checkpoints=1
>> >> --script=contrib/initscripts/parsec/fluidanimate.sh
>> >>
>> >>    (2) Resume from the checkpoint with the O3CPU, restore with
>> >> TimingSimpleCPU (fails):
>> >>           $> build/X86_MOESI_hammer/gem5.opt
>> >> --outdir=m5out/rawdata/fluidanimate/detailed configs/example/ruby_fs.py
>> >> -n 16 --cpu-type=detailed --kernel=system/x86_64-vmlinux-2.6.28.smp
>> >> --checkpoint-dir=m5out/checkpoints/fluidanimate -r 0
>> >> --restore-with-cpu=timing
>> >>           [...]
>> >>           Switch at curTick count:10000
>> >>           info: Entering event queue @ 0.  Starting simulation...
>> >>           Runtime Error at MOESI_hammer-dir.sm:1270, Ruby Time:
>> >> 1111185: assert failure, PID: 2742
>> >>           press return to continue.
>> >>
>> >>           Program aborted at cycle 555592500
>> >>
>> >>    (3) Resume from the checkpoint with the TimingSimpleCPU fails in the
>> >> same way as (2), as in (2) the CPU isn't even switched to the O3CPU
>> >> before it fails.
>> >>
>> >>    (4) Though if I try taking a snapshot right after starting the
>> >> simulator (after ~ 10000000000 cycles, kernel still booting) and then
>> >> try to restore with the TimingSimpleCPU, it works as expected; only the
>> >> O3CPU fails with a segfault and the following backtrace:
>> >>        #0  0x0000000000cdff56 in MasterPort::sendTimingReq
>> >> (this=<optimized out>, pkt=0x6f8a060)
>> >>            at build/X86/mem/port.cc:136
>> >>        #1  0x00000000005fbac5 in sendTiming (pkt=0x6f8a060,
>> >> sendingState=0x61a7cc0, this=0x49a9e60)
>> >>            at build/X86/arch/x86/pagetable_walker.cc:173
>> >>        #2  X86ISA::Walker::WalkerState::sendPackets (this=0x61a7cc0)
>> >>            at build/X86/arch/x86/pagetable_walker.cc:631
>> >>        #3  0x00000000005fc8c2 in
>> >> X86ISA::Walker::WalkerState::recvPacket (this=this@entry=0x61a7cc0,
>> >>            pkt=pkt@entry=0x1e99920) at
>> >> build/X86/arch/x86/pagetable_walker.cc:590
>> >>        #4  0x00000000005fcb98 in X86ISA::Walker::recvTimingResp
>> >> (this=0x43706c0, pkt=0x1e99920)
>> >>            at build/X86/arch/x86/pagetable_walker.cc:129
>> >>        #5  0x0000000000ce1f5b in PacketQueue::trySendTiming
>> >> (this=0x42ba5e0)
>> >>            at build/X86/mem/packet_queue.cc:152
>> >>        #6  0x0000000000ce2929 in PacketQueue::sendDeferredPacket
>> >> (this=0x42ba5e0)
>> >>            at build/X86/mem/packet_queue.cc:190
>> >>        #7  0x0000000000c391be in EventQueue::serviceOne
>> >> (this=<optimized out>) at build/X86/sim/eventq.cc:204
>> >>        #8  0x0000000000c7d342 in simulate
>> >> (num_cycles=9223372036854785807) at build/X86/sim/simulate.cc:71
>> >>        #9  0x0000000000b8e17c in _wrap_simulate__SWIG_0
>> >> (args=<optimized out>)
>> >>            at build/X86/python/swig/event_wrap.cc:4755
>> >>        #10 _wrap_simulate (self=<optimized out>, args=<optimized out>)
>> >>            at build/X86/python/swig/event_wrap.cc:4804
>> >>        #11 0x00007fb32a094fc6 in PyEval_EvalFrameEx () from
>> >> /lib/libpython2.7.so.1.0
>> >>
>> >> Trying to restore with ruby using MOESI_CMP_directory and the
>> >> TimingSimpleCPU results in the same error as (2), with the difference
>> >> that it finishes loading the checkpoint, resumes, but then fails after
>> >> about a minute ("Runtime Error at MOESI_CMP_directory-dir.sm:485, Ruby
>> >> Time: 12038425921: assert failure, PID: 19169"). Using the O3CPU still
>> >> results in the same error as (4).
>> >>
>> >> In addition, I have seen workflows of: 1) create checkpoint without
>> >> ruby and with the AtomicSimpleCPU 2) load checkpoint with ruby and the
>> >> TimingSimpleCPU. I tried this, and it works if I set
>> >> --restore-with-cpu=timing. But trying this with the O3CPU doesn't work,
>> >> resulting in the same backtrace as (4).
>> >>
>> >> Is what I'm trying to do possible? If so, any workarounds I should
>> >> know of?
>> >>
>> >> Thanks,
>> >> Marco
>> >>
>> >>
>> >> --
>> >> The University of Edinburgh is a charitable body, registered in
>> >> Scotland, with registration number SC005336.
>> >>
>> >> _______________________________________________
>> >> gem5-users mailing list
>> >> [email protected]
>> >> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>> >>
>> >
>>
>>
>>
>
>
>
>  --
>   Joel Hestness
>   PhD Student, Computer Architecture
>   Dept. of Computer Science, University of Wisconsin - Madison
>   Dept. of Computer Science, University of Texas - Austin
>   http://www.cs.utexas.edu/~hestness
>
>
>
>
>

