I see, Gabor. Sorry for the slow reply; I wanted to check whether the other checkpoint option (e.g. m5_checkpoint()) works, and it appears to.
So it seems that calling a function such as m5_checkpoint() is "currently" the right way to checkpoint dist-gem5 during the execution of an application, right?

Also, I have tested checkpointing the NPB benchmarks that you used in your dist-gem5 evaluation, but some of them do not seem to work correctly. For example, DT terminated with a message like "*BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES*", which implies that the application itself was terminated for some reason rather than because of MPICH (according to the MPICH docs :-) ).

Since you used a subset of the NPB suite in your evaluation, do you have any insight into why some test cases could not finish successfully, and a list of those applications?

Thanks.

Dong Wan Kim

On Thu, Mar 8, 2018 at 3:43 AM, Gabor Dozsa <gabor.do...@arm.com> wrote:

> Hi,
>
> dist-gem5 does not support delayed checkpoint currently. The delay option
> is used to decide whether a 'collective' or an 'immediate' checkpoint is to
> be taken.
>
> If delay == 0 then the checkpoint is triggered only when all gem5
> instances (participating in the dist-gem5 run) have completed an 'm5
> checkpoint 0' command. An example use case is when one wants to take a
> checkpoint at a synchronisation point in the simulated distributed
> application (e.g. take a checkpoint just before an MPI_Barrier() completes
> in an MPI application).
>
> On the other hand, if a checkpoint command with a delay != 0 parameter is
> hit in any of the gem5 processes then a checkpoint is taken immediately
> across all participating gem5 instances.
>
> Regards,
> Gabor Dozsa
>
> -------------------------
>
> > Date: Wed, 7 Mar 2018 10:21:41 -0600
> > From: David Kim <dkim.t...@gmail.com>
> > To: gem5 users mailing list <gem5-users@gem5.org>
> > Subject: Re: [gem5-users] dist-gem5 checkpointing
> > Message-ID:
> >     <CAAuOSRmiiKAf9Bfe2gUDrzrWVtz6hE-MTr9BWjTYG5sP393fTA@mail.gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hello,
> >
> > I have looked at the output message again, and it gave the following
> > message;
> > info: m5 checkpoint called with non-zero delay => triggering immediate
> > checkpoint (at the next sync)
> >
> > So, I look at the source code that print out that message, and the
> > following is the code snippet,
> >
> > @ src/dev/net/dist_iface.cc
> >
> > bool
> > DistIface::readyToCkpt(Tick delay, Tick period)
> > {
> >     bool ret = true;
> >     DPRINTF(DistEthernet, "DistIface::readyToCkpt() called, delay:%lu "
> >             "period:%lu\n", delay, period);
> >     if (master) {
> >         if (delay == 0) {
> >             inform("m5 checkpoint called with zero delay => triggering collaborative "
> >                    "checkpoint\n");
> >             sync->requestCkpt(ReqType::collective);
> >         } else {
> >             inform("m5 checkpoint called with non-zero delay => triggering immediate "
> >                    "checkpoint (at the next sync)\n");
> >             sync->requestCkpt(ReqType::immediate);
> >         }
> >         if (period != 0)
> >             inform("Non-zero period for m5_ckpt is ignored in "
> >                    "distributed gem5 runs\n");
> >         ret = false;
> >     }
> >     return ret;
> > }
> >
> > @ src/sim/pseudo_inst.cc
> >
> > void
> > m5checkpoint(ThreadContext *tc, Tick delay, Tick period)
> > {
> >     DPRINTF(PseudoInst, "PseudoInst::m5checkpoint(%i, %i)\n", delay,
> >             period);
> >     if (!tc->getCpuPtr()->params()->do_checkpoint_insts)
> >         return;
> >
> >     if (DistIface::readyToCkpt(delay, period)) {
> >         Tick when = curTick() + delay * SimClock::Int::ns;
> >         Tick repeat = period * SimClock::Int::ns;
> >         exitSimLoop("checkpoint", 0, when, repeat);
> >     }
> > }
> >
> > Since the checkpoint delay is non-zero value, it seems to force do
> > checkpointing at the next sync time rather than delay value.
> > In this simulation, I added 'dist-sync-start=1000000000000t', so I think
> > sync will be on every 1s in simulation time, right?
> >
> > FYI, I have added 'echo' command, but it was not printed out, so I think
> > simulation did not reach that point.
> >
> > Can you explain what is exactly happening in the dist-gem5 checkpoint
> > routine? Any suggestion or idea will be appreciated.
> >
> > Thanks.
> >
> > Dong Wan Kim
> >
> >
> > On Mon, Mar 5, 2018 at 6:01 PM, Mohammad Alian <m.alian1...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > What you have should work. Are you sure that you start the application
> > > after the checkpoint command (you don't block any where?)? E.g. what would
> > > be the output if you add an echo right before starting the MPI app:
> > >
> > > /sbin/m5 checkpoint 50000000000000
> > >
> > > /sbin/m5 loadsymbol
> > >
> > > /sbin/m5 resetstats
> > > *echo "start the app"*
> > > mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
> > >
> > >
> > > Do you see immediate progress in your application if you remove "/sbin/m5
> > > checkpoint 50000000000000"?
> > >
> > > Best,
> > > Mohammad
> > >
> > >
> > > On Mon, Mar 5, 2018 at 11:59 AM, David Kim <dkim.t...@gmail.com> wrote:
> > >
> > >> Hello,
> > >>
> > >> I am trying to checkpoint dist-gem5 in the middle of the execution of the
> > >> application.
> > >> The following is my script file that used to run dist-gem5 (with 2 nodes)
> > >> after boot up Linux.
> > >>
> > >> < for node 1 (node1.rcS)>
> > >> #!/bin/sh
> > >>
> > >> # Set up IP address for node 1
> > >> /sbin/ifconfig eth0 hw ether 00:90:00:00:00:02
> > >> /sbin/ifconfig eth0 192.168.0.2 netmask 255.255.255.0 up
> > >>
> > >> cd /root/NPB3.3.1/NPB3.3-MPI/bin
> > >>
> > >> # checkpoint after delay (in ns, so the below delay represents 50000
> > >> # seconds! I have also tested 0.1s,10s, and 100s delay)
> > >> /sbin/m5 checkpoint 50000000000000
> > >>
> > >> /sbin/m5 loadsymbol
> > >>
> > >> /sbin/m5 resetstats
> > >> mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
> > >> /sbin/m5 exit
> > >>
> > >> < for node 2 (node2.rcS) >
> > >> #!/bin/sh
> > >>
> > >> # Set up IP address for node 2
> > >> /sbin/ifconfig eth0 hw ether 00:90:00:00:00:03
> > >> /sbin/ifconfig eth0 192.168.0.3 netmask 255.255.255.0 up
> > >>
> > >> And, here is my commandline to run dist-gem5 (I did not use gem5-dist.sh
> > >> for some reason, and the following commandline works well in general)
> > >>
> > >> For switch node,
> > >>
> > >> ./build/ARM/gem5.opt -d ./m5out.switch ./configs/dist/sw.py --is-switch
> > >> --dist-size=2 --dist-server-name=localhost --dist-server-port=2200
> > >>
> > >> For computer nodes (here is one for node1),
> > >> /build/ARM/gem5.opt -d ./m5out.0 ./configs/example/fs.py
> > >> --machine-type=VExpress_EMM64
> > >> --disk-image=aarch64-ubuntu-trusty-headless.img
> > >> --kernel=vmlinux.aarch64.20140821
> > >> --dtb-filename=vexpress.aarch64.20140821.dtb --cpu-type=TimingSimpleCPU
> > >> --num-cpus=1 --caches --l2cache --mem-size=512MB --mem-channels=1
> > >> --mem-ranks=1 --script=./node1.rcS --dist --dist-rank=0 --dist-size=2
> > >> --dist-server-name=localhost --dist-server-port=2200
> > >> --dist-sync-start=1000000000000t
> > >>
> > >> I have increased checkpoint delay to see if there is any change in my
> > >> checkpoint image, but seems to show same behavior; wait that amount of time
> > >> (not running an application) then do checkpoint (no progress is displayed
> > >> on console until checkpoint. Then, restoring gem5 prints out all the
> > >> application output from the beginning).
> > >>
> > >> To checkpoint in the middle of the running of an application, for
> > >> example, after 1 billion cycles after running an application, should I only
> > >> use m5_roi_begin() and m5_roi_end() call in the application's source code
> > >> (I did not test this yet, but guess it will work?), but cannot just add
> > >> some delay to checkpoint as shown above (and thus not change application
> > >> source code)?
> > >>
> > >> Any comment will be appreciated.
> > >>
> > >> Thanks.
> > >>
> > >> Regards,
> > >> Dong Wan Kim
> > >>
> > >> _______________________________________________
> > >> gem5-users mailing list
> > >> gem5-users@gem5.org
> > >> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
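
For anyone reading this thread later, here is a minimal standalone sketch of the delay semantics Gabor describes and of the unit arithmetic used in the scripts. It is not gem5 code: the names CkptKind, classifyCkpt, describeCkpt, nsToSeconds and ticksToSeconds are hypothetical, and the tick conversion assumes gem5's default resolution of 1 ps per tick, which a given configuration may override.

```cpp
#include <cstdint>
#include <string>

// Hypothetical standalone model; none of these names exist in gem5.
using Tick = std::uint64_t;

enum class CkptKind { Collective, Immediate };

// Mirrors the branch in the quoted DistIface::readyToCkpt(): only whether
// the delay is zero matters, its magnitude is never used as a wait time.
constexpr CkptKind classifyCkpt(Tick delay)
{
    return delay == 0 ? CkptKind::Collective : CkptKind::Immediate;
}

// Human-readable label for the checkpoint kind.
inline std::string describeCkpt(Tick delay)
{
    return classifyCkpt(delay) == CkptKind::Collective
        ? "collective (taken once every node has issued 'm5 checkpoint 0')"
        : "immediate (taken at the next sync on all nodes)";
}

// 'm5 checkpoint <delay>' takes its delay in nanoseconds.
constexpr std::uint64_t nsToSeconds(std::uint64_t ns)
{
    return ns / 1000000000ULL;
}

// Assumption: 1 tick == 1 ps, i.e. 10^12 ticks per simulated second.
constexpr std::uint64_t ticksToSeconds(Tick ticks)
{
    return ticks / 1000000000000ULL;
}
```

Under these assumptions, nsToSeconds(50000000000000) gives 50000, matching the "50000 seconds" comment in node1.rcS, and ticksToSeconds(1000000000000) gives 1 for the value passed to --dist-sync-start. In a dist-gem5 run, classifyCkpt() shows why changing the size of a non-zero delay makes no difference: every non-zero value maps to the same 'immediate' request.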