I see, Gabor.

Sorry for the delay in replying; I wanted to check whether the other
checkpoint option (e.g. m5_checkpoint()) works, and it appears to.

So it seems that calling a function such as m5_checkpoint() from the
application is "currently" the right way to checkpoint dist-gem5 while an
application is running, right?
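
To check my understanding of the delay semantics described in your reply
quoted below, here is a minimal Python sketch of how I read it. It is only a
toy model and not gem5 code: the names CkptRequest, request_kind and
checkpoint_fires are made up for illustration, and the model ignores the
switch process and the timing of the sync itself.

```python
from enum import Enum
from typing import Dict


class CkptRequest(Enum):
    """The two request kinds named in the quoted dist_iface.cc snippet."""
    COLLECTIVE = "collective"
    IMMEDIATE = "immediate"


def request_kind(delay_ns: int) -> CkptRequest:
    """Map the delay argument of 'm5 checkpoint <delay>' to a request kind.

    Assumption: only zero vs non-zero matters; the delay value itself
    is not used as a delay in a dist-gem5 run.
    """
    if delay_ns == 0:
        return CkptRequest.COLLECTIVE
    return CkptRequest.IMMEDIATE


def checkpoint_fires(requests: Dict[int, CkptRequest], dist_size: int) -> bool:
    """Decide whether a checkpoint would be taken at the next sync.

    'requests' maps a node rank to the request it has issued so far
    (hypothetical bookkeeping, not the real gem5 data structure).
    """
    kinds = list(requests.values())
    # One immediate request from any node is enough.
    if CkptRequest.IMMEDIATE in kinds:
        return True
    # Otherwise every node must have issued 'm5 checkpoint 0'.
    return len(requests) == dist_size


if __name__ == "__main__":
    # Only node 0 has called 'm5 checkpoint 0' in a 2-node run.
    print(checkpoint_fires({0: request_kind(0)}, dist_size=2))
    # Node 0 called 'm5 checkpoint 50000000000000'.
    print(checkpoint_fires({0: request_kind(50000000000000)}, dist_size=2))
```

If this reading is right, it would explain why the non-zero delay in my
script never acted as a delay in the dist-gem5 run.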

Also, I have tested checkpointing with the NPB benchmarks that you used in
your dist-gem5 evaluation, but some of them do not seem to work correctly.
For example, DT terminated with a message like "*BAD TERMINATION OF ONE OF
YOUR APPLICATION PROCESSES*", which implies that the application itself was
terminated for some reason, not MPICH (according to the MPICH doc. :-) ).
Since you used a subset of the NPB suite in your evaluation, do you have any
insight into why some test cases could not finish successfully, and a list
of those applications?

Thanks.

Dong Wan Kim

On Thu, Mar 8, 2018 at 3:43 AM, Gabor Dozsa <gabor.do...@arm.com> wrote:

> Hi,
>
>
>
> dist-gem5 does not support delayed checkpoint currently. The delay option
> is used to decide whether a 'collective' or an 'immediate' checkpoint is to
> be taken.
>
>
>
> If delay == 0 then the checkpoint is triggered only when all gem5
> instances (participating in the dist-gem5 run) have completed an 'm5
> checkpoint 0' command. An example use case is when one wants to take a
> checkpoint at a synchronisation point in the simulated distributed
> application (e.g. take a checkpoint just before an MPI_Barrier() completes
> in an MPI application).
>
>
>
> On the other hand, if a checkpoint command with a delay != 0 parameter is
> hit in any of the gem5 processes then a checkpoint is taken immediately
> across all participating gem5 instances.
>
>
>
> Regards,
>
> Gabor Dozsa
>
>
>
> -------------------------
>
>
>
>     Date: Wed, 7 Mar 2018 10:21:41 -0600
>
>     From: David Kim <dkim.t...@gmail.com>
>
>     To: gem5 users mailing list <gem5-users@gem5.org>
>
>     Subject: Re: [gem5-users] dist-gem5 checkpointing
>
>     Message-ID: <CAAuOSRmiiKAf9Bfe2gUDrzrWVtz6hE-MTr9BWjTYG5sP393fTA@mail.gmail.com>
>
>     Content-Type: text/plain; charset="utf-8"
>
>
>
>     Hello,
>
>
>
>     I have looked at the output again, and it gave the following message:
>
>     info: m5 checkpoint called with non-zero delay => triggering immediate
>     checkpoint (at the next sync)
>
>     So I looked at the source code that prints that message; the relevant
>     code snippets follow.
>
>
>
>     @ src/dev/net/dist_iface.cc
>
>     bool
>     DistIface::readyToCkpt(Tick delay, Tick period)
>     {
>         bool ret = true;
>         DPRINTF(DistEthernet, "DistIface::readyToCkpt() called, delay:%lu "
>                 "period:%lu\n", delay, period);
>         if (master) {
>             if (delay == 0) {
>                 inform("m5 checkpoint called with zero delay => triggering collaborative "
>                        "checkpoint\n");
>                 sync->requestCkpt(ReqType::collective);
>             } else {
>                 inform("m5 checkpoint called with non-zero delay => triggering immediate "
>                        "checkpoint (at the next sync)\n");
>                 sync->requestCkpt(ReqType::immediate);
>             }
>             if (period != 0)
>                 inform("Non-zero period for m5_ckpt is ignored in "
>                        "distributed gem5 runs\n");
>             ret = false;
>         }
>         return ret;
>     }
>
>
>
>     @ src/sim/pseudo_inst.cc
>
>     void
>     m5checkpoint(ThreadContext *tc, Tick delay, Tick period)
>     {
>         DPRINTF(PseudoInst, "PseudoInst::m5checkpoint(%i, %i)\n", delay, period);
>         if (!tc->getCpuPtr()->params()->do_checkpoint_insts)
>             return;
>
>         if (DistIface::readyToCkpt(delay, period)) {
>             Tick when = curTick() + delay * SimClock::Int::ns;
>             Tick repeat = period * SimClock::Int::ns;
>             exitSimLoop("checkpoint", 0, when, repeat);
>         }
>     }
>
>
>
>     Since the checkpoint delay is non-zero, this seems to force the
>     checkpoint at the next sync rather than after the requested delay.
>     In this simulation I passed 'dist-sync-start=1000000000000t', so I
>     think a sync will happen every 1 s of simulated time, right?
>
>     FYI, I added the 'echo' command, but nothing was printed, so I think
>     the simulation did not reach that point.
>
>     Can you explain what exactly happens in the dist-gem5 checkpoint
>     routine? Any suggestion or idea will be appreciated.
>
>
>
>     Thanks.
>
>
>
>     Dong Wan Kim
>
>
>
>
>
>     On Mon, Mar 5, 2018 at 6:01 PM, Mohammad Alian <m.alian1...@gmail.com>
>
>     wrote:
>
>
>
>     > Hi,
>
>     >
>
>     > What you have should work. Are you sure that you start the application
>     > after the checkpoint command (you don't block anywhere)? E.g. what would
>     > be the output if you add an echo right before starting the MPI app:
>     >
>     > /sbin/m5 checkpoint 50000000000000
>     > /sbin/m5 loadsymbol
>     > /sbin/m5 resetstats
>     > echo "start the app"
>     > mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
>     >
>     > Do you see immediate progress in your application if you remove
>     > "/sbin/m5 checkpoint 50000000000000"?
>
>     >
>
>     > Best,
>
>     > Mohammad
>
>     >
>
>     >
>
>     > On Mon, Mar 5, 2018 at 11:59 AM, David Kim <dkim.t...@gmail.com>
> wrote:
>
>     >
>
>     >> Hello,
>
>     >>
>
>     >> I am trying to checkpoint dist-gem5 in the middle of the execution of
>     >> the application.
>     >> The following are the script files that I use to run dist-gem5 (with 2
>     >> nodes) after Linux boots.
>     >>
>     >> < for node 1 (node1.rcS) >
>     >> #!/bin/sh
>     >>
>     >> # Set up IP address for node 1
>     >> /sbin/ifconfig eth0 hw ether 00:90:00:00:00:02
>     >> /sbin/ifconfig eth0 192.168.0.2 netmask 255.255.255.0 up
>     >>
>     >> cd /root/NPB3.3.1/NPB3.3-MPI/bin
>     >>
>     >> # checkpoint after delay (in ns, so the delay below represents 50000
>     >> # seconds! I have also tested 0.1s, 10s, and 100s delays)
>     >> /sbin/m5 checkpoint 50000000000000
>     >>
>     >> /sbin/m5 loadsymbol
>     >>
>     >> /sbin/m5 resetstats
>     >> mpiexec -hosts=node1,node2 -np 2 ./cg.S.2
>     >> /sbin/m5 exit
>     >>
>     >> < for node 2 (node2.rcS) >
>     >> #!/bin/sh
>     >>
>     >> # Set up IP address for node 2
>     >> /sbin/ifconfig eth0 hw ether 00:90:00:00:00:03
>     >> /sbin/ifconfig eth0 192.168.0.3 netmask 255.255.255.0 up
>
>     >>
>
>     >> And here is my command line to run dist-gem5 (I did not use
>     >> gem5-dist.sh for some reason, and the following command lines work
>     >> well in general).
>     >>
>     >> For the switch node:
>     >>
>     >> ./build/ARM/gem5.opt -d ./m5out.switch ./configs/dist/sw.py --is-switch
>     >> --dist-size=2 --dist-server-name=localhost --dist-server-port=2200
>     >>
>     >> For the compute nodes (here is the one for node1):
>     >>
>     >> ./build/ARM/gem5.opt -d ./m5out.0 ./configs/example/fs.py
>     >> --machine-type=VExpress_EMM64
>     >> --disk-image=aarch64-ubuntu-trusty-headless.img
>     >> --kernel=vmlinux.aarch64.20140821
>     >> --dtb-filename=vexpress.aarch64.20140821.dtb --cpu-type=TimingSimpleCPU
>     >> --num-cpus=1 --caches --l2cache --mem-size=512MB --mem-channels=1
>     >> --mem-ranks=1 --script=./node1.rcS --dist --dist-rank=0 --dist-size=2
>     >> --dist-server-name=localhost --dist-server-port=2200
>     >> --dist-sync-start=1000000000000t
>
>     >>
>
>     >> I have increased the checkpoint delay to see whether my checkpoint
>     >> image changes, but the behavior is the same: gem5 waits for that
>     >> amount of time (without running the application) and then takes the
>     >> checkpoint. No progress is displayed on the console until the
>     >> checkpoint; restoring from it then prints all the application output
>     >> from the beginning.
>     >>
>     >> To checkpoint in the middle of an application run, for example 1
>     >> billion cycles after the application starts, do I have to use
>     >> m5_roi_begin() and m5_roi_end() calls in the application's source code
>     >> (I have not tested this yet, but I guess it will work)? Can I not
>     >> simply add a delay to the checkpoint command as shown above, and thus
>     >> avoid changing the application source code?
>
>     >>
>
>     >> Any comment will be appreciated.
>
>     >>
>
>     >> Thanks.
>
>     >>
>
>     >> Regards,
>
>     >> Dong Wan Kim
>
>     >>
>
>     >> _______________________________________________
>
>     >> gem5-users mailing list
>
>     >> gem5-users@gem5.org
>
>     >> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
>     >>
>
>     >
>
>     >
>
>
>
>
>
