Hi Tim,

I have not been completely following this thread, but I can answer your 
question about unserializing cache contents.

The benefit of creating a trace, rather than just inserting data into the 
cache, is two-fold.  First, by creating a trace from a very large cache system, 
one can warm up caches of different sizes, associativities, and even completely 
different cache hierarchies/configurations from a single trace.  Second, and 
probably more important, Ruby protocols rely on timing requests to set cache 
block state to the unique states used by a particular protocol.  Often Ruby is 
used to compare different protocols, and this process allows us to compare 
protocols using the exact same checkpoint.
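
To make the first point concrete, here is a minimal Python sketch (a toy model, not gem5 or Ruby code) that replays one address trace into two caches with different geometries. The ToyCache class, the warm_up helper, the 64-byte block size, and the LRU replacement policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class ToyCache:
    """Toy set-associative cache with LRU replacement (hypothetical model).

    It tracks only which block numbers are resident, not data or state.
    """

    def __init__(self, num_sets, ways, block_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        block = addr // self.block_size
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: refresh LRU position
            return
        if len(lines) >= self.ways:
            lines.popitem(last=False)      # miss in a full set: evict LRU
        lines[block] = True

    def resident_blocks(self):
        return sorted(b for s in self.sets for b in s)


def warm_up(cache, trace):
    """Replay every address in the trace as an ordinary access."""
    for addr in trace:
        cache.access(addr)
    return cache


# One recorded trace touching blocks 0..7.
trace = [i * 64 for i in range(8)]

small = warm_up(ToyCache(num_sets=2, ways=2), trace)
large = warm_up(ToyCache(num_sets=4, ways=4), trace)
```

With these assumed parameters the small cache ends up holding only the four most recently touched blocks, while the large one holds all eight, even though both were warmed from the same trace.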

I hope that helps,

Brad




-----Original Message-----
From: gem5-dev [mailto:gem5-dev-boun...@gem5.org] On Behalf Of Timothy M Jones
Sent: Wednesday, June 17, 2015 3:16 AM
To: gem5 Developer List
Subject: Re: [gem5-dev] Ruby serialize removing event queue head

Thanks Nilay and Joel for the information.

I've been playing around with this over the past few days and I can't work out 
what the point of the flush is.  The CacheRecorder already has a copy of all 
the data blocks in the trace before the flush starts.  Removing the flush event 
and subsequent simulation produces exactly the same system.ruby.cache.gz file 
as with it in, so I guess it's safe to remove them.
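
For reference, the sequence Joel describes in the quoted message below (set aside the pending event, run only the flush events, then put the original event back and roll back the tick) could be modeled roughly as follows. This is a toy Python sketch, not the gem5 EventQueue: ToyEventQueue, swap_pending, and checkpoint_with_flush are hypothetical names, and setting aside the whole pending list is a simplification.

```python
import heapq


class ToyEventQueue:
    """Toy discrete-event queue (hypothetical; not gem5's EventQueue)."""

    def __init__(self):
        self.cur_tick = 0
        self._pending = []   # heap of (tick, sequence number, name)
        self._seq = 0

    def schedule(self, tick, name):
        heapq.heappush(self._pending, (tick, self._seq, name))
        self._seq += 1

    def swap_pending(self, replacement):
        """Install a different set of pending events; return the old set."""
        old, self._pending = self._pending, replacement
        return old

    def run(self):
        """Process events in tick order until none are left."""
        fired = []
        while self._pending:
            tick, _, name = heapq.heappop(self._pending)
            self.cur_tick = tick
            fired.append(name)
        return fired


def checkpoint_with_flush(queue, flush_offsets):
    saved_tick = queue.cur_tick
    saved_events = queue.swap_pending([])   # set aside the simulation's events
    for offset in flush_offsets:
        queue.schedule(saved_tick + offset, "flush")
    fired = queue.run()                     # only flush events run here
    queue.swap_pending(saved_events)        # put the original events back
    queue.cur_tick = saved_tick             # roll back simulated time
    return fired


q = ToyEventQueue()
q.cur_tick = 100
q.schedule(150, "cpu_tick")
fired = checkpoint_with_flush(q, [1, 2, 3])
```

In this model the original event survives the flush and simulated time is unchanged afterwards. The failure mode discussed in this thread corresponds to code that expects a particular event to still be on the queue while it has been set aside.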

So, with that out of the way, I can create checkpoints and exit the simulator 
correctly.  I'm not 100% sure about restoring the checkpoint though, and it 
seems a little hacky.  Is there a reason it has to unserialise by inserting 
memory requests into the event queue - couldn't it just write the data into the 
correct locations in the caches?

There's also a question about whether Ruby should be recording its state at 
all.  Shouldn't it do the same as the classic memory system caches and 
implement memWriteback() to flush all dirty data out before checkpointing 
happens, so that it doesn't need to trace anything?  (Maybe I'm opening a can 
of worms, but I thought I'd just ask!)
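
As a rough illustration of the write-back alternative, the Python sketch below flushes dirty lines to memory and then checkpoints memory alone. It is a toy model under stated assumptions: ToyWritebackCache and writeback_all are hypothetical stand-ins and do not reflect the real memWriteback() interface.

```python
class ToyWritebackCache:
    """Toy write-back cache in front of a dict-based memory (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}   # addr -> (value, dirty flag)

    def write(self, addr, value):
        self.lines[addr] = (value, True)

    def read(self, addr):
        if addr not in self.lines:
            # Miss: fill from memory as a clean line.
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def writeback_all(self):
        """Copy every dirty line to memory and mark it clean."""
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)


memory = {0x40: 1}
cache = ToyWritebackCache(memory)
cache.write(0x40, 7)           # dirty data lives only in the cache

cache.writeback_all()          # flush before checkpointing
checkpoint = dict(memory)      # the checkpoint holds memory contents only

# Restore: memory is correct, but the cache starts empty (cold).
restored = ToyWritebackCache(dict(checkpoint))
```

The restored system sees the right data, but its caches start cold, and that is the gap the trace-based warm-up is meant to fill.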

Cheers
Tim


On 13/06/2015 18:03, Joel Hestness wrote:
> Hey guys,
>    I'm pretty sure Tim is correct that the checkpointing bugs were 
> introduced earlier than the changeset Nilay points to; gem5-gpu is 
> currently using gem5 rev 10645 
> <http://repo.gem5.org/gem5/rev/cd95d4d51659>, and we cannot get 
> reliable checkpoint and restore with it. Note that Tim's bug may not 
> be the only checkpointing bug that exists right now.
>
>    To answer Tim's question: While taking a checkpoint, Ruby 
> commandeers the event queue to inject flushing memory accesses into 
> the caches. This is used to generate a trace of cache contents, which 
> can be used to warm up the caches on checkpoint restore. To take over 
> control of the event queue, Ruby clears the event at the queue head (I 
> think this assumes there is only 1 event in the queue? This should 
> probably be checked), and then adds its own event for the cache 
> flushing operation. After the caches have been flushed (simulate() 
> call in RubySystem::serialize()), Ruby restores the head event that 
> was in the queue and rolls back the current tick.
>
>    One way to check if this cooldown operation is at fault for 
> unreliable checkpointing is to simply comment out the event queue 
> commandeering, and try to take a checkpoint. You may also be able to 
> test checkpoint restore by commenting out the cache warm-up code in 
> RubySystem::unserialize(). If checkpoint and restore work without the 
> event queue commandeering, it is likely that the event queue 
> manipulation is problematic.
>
>    I'd also recommend taking a checkpoint and restoring from it with the 
> gem5 option --debug-flags=RubyCacheTrace, which will show what the cache 
> flushing and warm-up are doing, respectively.
>
>    Joel
>
>
>
> On Sat, Jun 13, 2015 at 9:48 AM, Nilay Vaish <ni...@cs.wisc.edu> wrote:
>
>     Your bisection is not right.  You might want to take a look at the
>     following changeset:
>
>
>     changeset:   10756:f9c0692f73ec
>     user:        Curtis Dunham <curtis.dun...@arm.com>
>     date:        Mon Mar 23 06:57:36 2015 -0400
>     summary:     sim: Reuse the same limit_event in simulate()
>
>
>     I suggest that you revert this changeset in your repo while I think
>     about what needs to be done.
>
>     --
>     Nilay
>
>
>
>     On Sat, 13 Jun 2015, Timothy M Jones wrote:
>
>         Hi again,
>
>         Further to this message, I've used hg bisect to find the
>         revision that breaks checkpointing with ruby.  It's revision
>         10524 that Nilay committed in November that's the first bad
>         changeset.  It fails with the panic() on the missing event that
>         I wrote about previously.
>
>         I've scanned through the diff and can't immediately see any
>         reason why this would break serialisation, although it does
>         remove some of the code to serialise ruby state.
>
>         Could anyone (Nilay?) give me a hint as to why this might break
>         checkpointing with ruby?
>
>         I've compiled with the MOESI_hammer protocol for x86, then run
>         with this command line:
>
>         ./build/X86/gem5.opt --remote-gdb-port=0 -d <outdir>
>         configs/example/fs.py -n 1 --kernel <my-kernel> --script
>         configs/boot/hack_back_ckpt.rcS --max-checkpoints 1
>         --checkpoint-dir <cptdir> --disk-image <my-disk-image>
>         --cpu-type timing --restore-with timing --ruby
>
>         Any help would be appreciated.  I don't know ruby at all, so
>         trying to work out what's going on is slow....
>
>         Cheers
>         Tim
>
>         On 11/06/2015 20:48, Timothy M Jones wrote:
>
>               Hello,
>
>               Could someone tell me why we need to take the head event off
>               the event queue in RubySystem::serialize() in
>               src/mem/ruby/system/System.cc?
>
>               Event* eventq_head = eventq->replaceHead(NULL);
>
>               The problem I'm getting is that when simulate() is called a
>               few lines later, it tries to reschedule the
>               simulate_limit_event, but that causes a panic because it's no
>               longer on the event queue.  This is happening when trying to
>               take a checkpoint with ruby.  I can't work out from the
>               comments why the head event needs to be taken off in the
>               first place.
>
>               This is basically the reason behind the problems in this
>               thread:
>
>               https://www.mail-archive.com/gem5-users@gem5.org/msg11701.html
>
>               Thanks
>               Tim
>
>
>         --
>         Timothy M. Jones
>         http://www.cl.cam.ac.uk/~tmj32/
>         _______________________________________________
>         gem5-dev mailing list
>         gem5-dev@gem5.org <mailto:gem5-dev@gem5.org>
>         http://m5sim.org/mailman/listinfo/gem5-dev
>
>
>
>
>
>
> --
>    Joel Hestness
>    PhD Candidate, Computer Architecture
>    Dept. of Computer Science, University of Wisconsin - Madison 
> http://pages.cs.wisc.edu/~hestness/

--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev