Hi,

dist-gem5 does not currently support delayed checkpoints. Instead, the delay option decides whether a 'collective' or an 'immediate' checkpoint is taken.
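To make the two modes concrete, here is a minimal sketch of how the delay argument maps to a checkpoint type and when each type fires. It is a toy model, not gem5 code: `CkptRequest`, `requestFor`, and `checkpointAtNextSync` are hypothetical names, and storing each rank's pending request as a `std::optional<Tick>` is an assumption made purely for illustration (C++17).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Tick = std::uint64_t;

// Hypothetical stand-in for the two request kinds discussed in this thread.
enum class CkptRequest { Collective, Immediate };

// delay == 0 selects a collective checkpoint; any other value selects an
// immediate one. The delay is never used as a time offset in this model.
constexpr CkptRequest requestFor(Tick delay)
{
    return delay == 0 ? CkptRequest::Collective : CkptRequest::Immediate;
}

// Toy model of one synchronisation point. pending[i] holds the delay argument
// of an outstanding 'm5 checkpoint' call on rank i, or std::nullopt if that
// rank has not issued one. Returns true if a checkpoint would be taken.
bool checkpointAtNextSync(const std::vector<std::optional<Tick>> &pending)
{
    assert(!pending.empty());
    bool everyRankAsked = true;
    for (const auto &p : pending) {
        if (p && requestFor(*p) == CkptRequest::Immediate)
            return true;             // one immediate request is enough
        if (!p)
            everyRankAsked = false;  // a collective checkpoint must wait
    }
    return everyRankAsked;           // all ranks issued 'm5 checkpoint 0'
}
```

In this model a two-node run with `m5 checkpoint 0` on only one node does not checkpoint, while a single `m5 checkpoint 50000000000000` on either node does, regardless of the value passed.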
If delay == 0, the checkpoint is triggered only when all gem5 instances participating in the dist-gem5 run have completed an 'm5 checkpoint 0' command. An example use case is taking a checkpoint at a synchronisation point in the simulated distributed application (e.g. just before an MPI_Barrier() completes in an MPI application). On the other hand, if a checkpoint command with a non-zero delay is hit in any of the gem5 processes, a checkpoint is taken immediately across all participating gem5 instances.

Regards,
Gabor Dozsa

-------------------------
Date: Wed, 7 Mar 2018 10:21:41 -0600
From: David Kim <dkim.t...@gmail.com>
To: gem5 users mailing list <gem5-users@gem5.org>
Subject: Re: [gem5-users] dist-gem5 checkpointing
Message-ID: <caauosrmiikaf9bfe2gudrzrwvtz6he-mtr9bwjtyg5sp393...@mail.gmail.com>

Hello,

I looked at the output again, and it gave the following message:

    info: m5 checkpoint called with non-zero delay => triggering immediate checkpoint (at the next sync)

So I looked at the source code that prints that message. The relevant snippets are:

@ src/dev/net/dist_iface.cc

    bool
    DistIface::readyToCkpt(Tick delay, Tick period)
    {
        bool ret = true;
        DPRINTF(DistEthernet, "DistIface::readyToCkpt() called, delay:%lu "
                "period:%lu\n", delay, period);
        if (master) {
            if (delay == 0) {
                inform("m5 checkpoint called with zero delay => triggering collaborative "
                       "checkpoint\n");
                sync->requestCkpt(ReqType::collective);
            } else {
                inform("m5 checkpoint called with non-zero delay => triggering immediate "
                       "checkpoint (at the next sync)\n");
                sync->requestCkpt(ReqType::immediate);
            }
            if (period != 0)
                inform("Non-zero period for m5_ckpt is ignored in "
                       "distributed gem5 runs\n");
            ret = false;
        }
        return ret;
    }

@ src/sim/pseudo_inst.cc

    void
    m5checkpoint(ThreadContext *tc, Tick delay, Tick period)
    {
        DPRINTF(PseudoInst, "PseudoInst::m5checkpoint(%i, %i)\n", delay, period);
        if (!tc->getCpuPtr()->params()->do_checkpoint_insts)
            return;

        if (DistIface::readyToCkpt(delay, period)) {
            Tick when = curTick() + delay * SimClock::Int::ns;
            Tick repeat = period * SimClock::Int::ns;
            exitSimLoop("checkpoint", 0, when, repeat);
        }
    }

Since the checkpoint delay is non-zero, the code seems to force a checkpoint at the next sync rather than after the requested delay. In this simulation I added 'dist-sync-start=1000000000000t', so I think a sync will happen every 1 s of simulated time, right? FYI, I added the 'echo' command, but it was not printed, so I think the simulation did not reach that point.

Can you explain what exactly is happening in the dist-gem5 checkpoint routine? Any suggestions or ideas will be appreciated.

Thanks.
Dong Wan Kim

On Mon, Mar 5, 2018 at 6:01 PM, Mohammad Alian <m.alian1...@gmail.com> wrote:

> Hi,
>
> What you have should work. Are you sure that you start the application
> after the checkpoint command (i.e. you don't block anywhere)? E.g. what
> would the output be if you add an echo right before starting the MPI app:
>
>     /sbin/m5 checkpoint 50000000000000
>     /sbin/m5 loadsymbol
>     /sbin/m5 resetstats
>     echo "start the app"
>     mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
>
> Do you see immediate progress in your application if you remove
> "/sbin/m5 checkpoint 50000000000000"?
>
> Best,
> Mohammad
>
> On Mon, Mar 5, 2018 at 11:59 AM, David Kim <dkim.t...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to checkpoint dist-gem5 in the middle of the execution of the
>> application.
>> The following is the script file that I use to run dist-gem5 (with 2 nodes)
>> after booting Linux.
>>
>> < for node 1 (node1.rcS) >
>>
>>     #!/bin/sh
>>
>>     # Set up IP address for node 1
>>     /sbin/ifconfig eth0 hw ether 00:90:00:00:00:02
>>     /sbin/ifconfig eth0 192.168.0.2 netmask 255.255.255.0 up
>>
>>     cd /root/NPB3.3.1/NPB3.3-MPI/bin
>>
>>     # checkpoint after delay (in ns, so the below delay represents 50000
>>     # seconds! I have also tested 0.1s, 10s, and 100s delay)
>>     /sbin/m5 checkpoint 50000000000000
>>
>>     /sbin/m5 loadsymbol
>>
>>     /sbin/m5 resetstats
>>     mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
>>     /sbin/m5 exit
>>
>> < for node 2 (node2.rcS) >
>>
>>     #!/bin/sh
>>
>>     # Set up IP address for node 2
>>     /sbin/ifconfig eth0 hw ether 00:90:00:00:00:03
>>     /sbin/ifconfig eth0 192.168.0.3 netmask 255.255.255.0 up
>>
>> And here is my command line to run dist-gem5 (I did not use gem5-dist.sh
>> for some reason, and the following command line works well in general).
>>
>> For the switch node:
>>
>>     ./build/ARM/gem5.opt -d ./m5out.switch ./configs/dist/sw.py --is-switch \
>>         --dist-size=2 --dist-server-name=localhost --dist-server-port=2200
>>
>> For the compute nodes (here is the one for node1):
>>
>>     /build/ARM/gem5.opt -d ./m5out.0 ./configs/example/fs.py \
>>         --machine-type=VExpress_EMM64 \
>>         --disk-image=aarch64-ubuntu-trusty-headless.img \
>>         --kernel=vmlinux.aarch64.20140821 \
>>         --dtb-filename=vexpress.aarch64.20140821.dtb --cpu-type=TimingSimpleCPU \
>>         --num-cpus=1 --caches --l2cache --mem-size=512MB --mem-channels=1 \
>>         --mem-ranks=1 --script=./node1.rcS --dist --dist-rank=0 --dist-size=2 \
>>         --dist-server-name=localhost --dist-server-port=2200 \
>>         --dist-sync-start=1000000000000t
>>
>> I have increased the checkpoint delay to see if there is any change in my
>> checkpoint image, but it seems to show the same behavior: it waits that
>> amount of time (not running the application) and then takes the checkpoint
>> (no progress is displayed on the console until the checkpoint. Then,
>> restoring gem5 prints out all the application output from the beginning).
>>
>> To checkpoint in the middle of a running application, for example
>> 1 billion cycles after the application starts, should I only use the
>> m5_roi_begin() and m5_roi_end() calls in the application's source code
>> (I have not tested this yet, but I guess it will work)? Is it not possible
>> to just add some delay to the checkpoint command as shown above (and thus
>> avoid changing the application source code)?
>>
>> Any comment will be appreciated.
>>
>> Thanks.
>>
>> Regards,
>> Dong Wan Kim
>>
>> _______________________________________________
>> gem5-users mailing list
>> gem5-users@gem5.org
>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
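The time values quoted in this thread are easier to compare once they are in the same unit. The sketch below converts them, following the `delay * SimClock::Int::ns` expression in the quoted m5checkpoint() code. It assumes gem5's default tick resolution of 1 ps (10^12 ticks per simulated second); `ticksPerSecond`, `nsToTicks`, and `ticksToSeconds` are hypothetical helper names, not gem5 APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using Tick = std::uint64_t;

// Assumption: default gem5 tick resolution, 1 tick = 1 ps.
constexpr Tick ticksPerSecond = 1'000'000'000'000ULL;
constexpr Tick ticksPerNs = ticksPerSecond / 1'000'000'000ULL;  // 1000

// The m5 checkpoint delay argument is given in nanoseconds.
inline Tick nsToTicks(Tick ns)
{
    // Guard against overflow of the 64-bit tick counter.
    assert(ns <= std::numeric_limits<Tick>::max() / ticksPerNs);
    return ns * ticksPerNs;
}

// Convert a tick count to simulated seconds.
inline double ticksToSeconds(Tick ticks)
{
    return static_cast<double>(ticks) / static_cast<double>(ticksPerSecond);
}
```

Under that assumption, `ticksToSeconds(nsToTicks(50'000'000'000'000ULL))` gives 50000 simulated seconds for the delay in node1.rcS, and `ticksToSeconds(1'000'000'000'000ULL)` gives 1 simulated second for the value passed to --dist-sync-start.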