Artem,
    Would you mind putting the two example scripts in the contrib/rm
directory of the DMTCP svn when you're ready?  One possibility is to
put them in a directory:  contrib/rm/example
but you should do what you think is best.

Thanks,
- Gene

On Fri, Nov 01, 2013 at 04:37:09PM -0400, Bryan F Putnam wrote:
> Hi Gene and Artem,
> 
> I suppose what would be useful for me to see would be a sample PBS runscript 
> that starts a multinode session (e.g. nodes=4:ppn=2) which starts up a simple 
> MPI job using dmtcp_coordinator/dmtcp_launch, etc., and which checkpoints the 
> job, for example, every 60 seconds. And then, assuming that job is killed or 
> terminated, a second PBS runscript that will restart that job. For example, 
> Artem has attempted to do that with his two scripts:
> 
> #PBS -N hellompi
> #PBS -l nodes=4:ppn=2
> #PBS -j oe
> cd $PBS_O_WORKDIR
> export PATH=<path-to-dmtcp>:$PATH
> mpiexec dmtcp_checkpoint -h jet ./hellompi
> 
> #PBS -N hellompi
> #PBS -l nodes=4:ppn=2
> #PBS -j oe
> cd $PBS_O_WORKDIR
> export PATH=<path-to-dmtcp>:$PATH
> export DMTCP_HOST=jet
> ./dmtcp_restart_script.sh
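For the checkpoint-every-60-seconds part of the request, one possibility (a
sketch only, based on DMTCP 2.0's documented `--daemon`, `-i`, `-p`, and `-h`
options; please verify against `dmtcp_coordinator --help` -- the port number
7779 is an arbitrary illustrative choice) is to let the initial batch script
start its own coordinator with an automatic checkpoint interval:

```shell
#PBS -N hellompi
#PBS -l nodes=4:ppn=2
#PBS -j oe
cd $PBS_O_WORKDIR
export PATH=<path-to-dmtcp>:$PATH

# Start a coordinator in the background on this (mother superior) node;
# -i 60 asks it to checkpoint all connected processes every 60 seconds.
dmtcp_coordinator --daemon -p 7779 -i 60

# Launch the MPI job under DMTCP, pointing every rank at that coordinator.
mpiexec dmtcp_checkpoint -h $(hostname) -p 7779 ./hellompi
```

This avoids hard-coding a frontend hostname like "jet" in the script, at the
cost of the coordinator dying with the job if the node itself fails.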
> 
> 
> However, I'm confused about what the first script is doing. Is it immediately 
> checkpointing a job? I see for example something similar to
> 
> dmtcp_launch mpiexec -np 8 ./a.out
> dmtcp_command --checkpoint
> ./dmtcp_restart_script.sh
> 
> in the QUICK-START file, which does actually work for a single-node job, 
> but I don't see any description of a command of the form:
> 
> mpiexec dmtcp_command ...
> 
> In the file
> 
> dmtcp-2.0/contrib/rm/README
> 
> I see a third method of doing the same thing, for example,
> 
> dmtcp_launch --rm  (without using mpiexec or mpirun at all)
> 
> and then doing
> 
> dmtcp_coordinator&
> dmtcp_restart_script.sh
> 
> to restart the job.
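If I read contrib/rm/README correctly, the whole third-method cycle might look
something like the following sketch (the hostname "frontend" and port 7779 are
illustrative, and the exact `--rm` behavior should be checked against the
README itself):

```shell
# On the frontend node: start a long-lived coordinator.
dmtcp_coordinator --daemon -p 7779

# Inside the initial batch job: launch under DMTCP with the
# resource-manager plugin, without wrapping mpiexec/mpirun.
export DMTCP_HOST=frontend   # illustrative coordinator hostname
export DMTCP_PORT=7779
dmtcp_launch --rm ./hellompi

# Inside the restart batch job (possibly on different nodes):
export DMTCP_HOST=frontend
export DMTCP_PORT=7779
./dmtcp_restart_script.sh
```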
> 
> 
> Anyway, I'll look some more into this, and the documentation, and let you 
> know if I can give any helpful suggestions.
> 
> Thanks!
> Bryan
> 
> 
> 
> 
> 
> ----- Original Message -----
> > Hi Gene,
> > 
> > Yes, I can take a look at the documentation and try to give some
> > suggestions. I'll get back to you soon on that.
> > 
> > Our mvapich2 builds are configured with
> > --with-device=ch3:mrail \
> > --with-rdma=gen2 \
> > 
> > and they don't run on TCP/IP networks, only IB. However we do have
> > some mpich2 and mpich-3 builds (upon which mvapich2 is based) which
> > use TCP/IP, and I'm able to successfully checkpoint, kill, and restart
> > parallel mpich2 jobs as long as they are using only a single node.
> > 
> > In general, DMTCP appears to be working well for me, as long as the
> > job is running on a single node. I can checkpoint, kill the job, and
> > restart it, and it will restart again, even on a different node. It's
> > just that when more than one node is involved, DMTCP doesn't appear
> > to be retaining information about the remote nodes, and it restarts
> > everything on whatever localhost it is restarted on. Perhaps I'm just
> > missing something simple; I'm having difficulty understanding the use
> > of the "rm" plugin.
> > 
> > I was also able to checkpoint and restart a parallel Gaussian09 job,
> > which doesn't use MPI at all. But again it only worked when the
> > parallel job was a single node job.
> > 
> > Thanks,
> > Bryan
> > 
> > 
> > ----- Original Message -----
> > > Hi Bryan,
> > > Also, we've been thinking about how to improve the documentation
> > > for the resource managers (Torque and SLURM). We always get good
> > > insights on this from people seeing it for the first time.
> > > If you should have the time, could you make some rough notes on
> > > how we can improve our documentation (what to emphasize, extra
> > > pointers to include, etc.)?
> > >
> > > As for InfiniBand, we're now tracking down one more bug (a race
> > > condition). For InfiniBand, please continue updating from our svn:
> > > svn co svn://svn.code.sf.net/p/dmtcp/code/trunk dmtcp-trunk
> > > We're hoping to have the last bugs out of InfiniBand sometime this
> > > next week.
> > >
> > > You also mention mvapich2. Does that work for you with ordinary
> > > Ethernet? If it fails for you even in that case, would you mind
> > > letting us know (either informally, or as a bug report -- whichever
> > > you like).
> > >
> > > Thanks,
> > > - Gene
> > >
> > > On Fri, Nov 01, 2013 at 03:00:08PM -0400, Bryan F Putnam wrote:
> > > > Thanks for the examples, Artem. Let me take a look at these, and
> > > > also your instructions in
> > > >
> > > >
> > > > .../dmtcp-2.0/contrib/rm/README
> > > >
> > > >
> > > > and see if I can come up with something that works with Torque-4.
> > > > If not, I'll contact my supervisor and I'm sure he'd be happy to
> > > > let us set up an account for you on one of our clusters. So far
> > > > I've tried using both openmpi and mpich2 (and mpich-3) but am
> > > > seeing the same problems with not being able to specify a specific
> > > > set of nodes on restarting.
> > > >
> > > >
> > > > I've also tried mvapich2, but that fails for different reasons,
> > > > and I do see that InfiniBand is not fully supported.
> > > >
> > > >
> > > > Please feel free to play around with my Fortran code "matmat2.f".
> > > > It's a simple matrix multiply inside a loop. If it doesn't run
> > > > long enough for you, just modify the variable "niter". The
> > > > iteration is printed as the job proceeds, so it's easy to see that
> > > > the job is picking up where it left off, after being checkpointed
> > > > and restarted.
> > > >
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > >
> > > >
> > > >
> > > > Bryan,
> > > >
> > > >
> > > >
> > > > The resource manager plugin is installed by default. As far as I
> > > > can see, you are executing the application correctly.
> > > > Just in case, I am attaching the initial and restart batch scripts
> > > > to this e-mail for reference.
> > > > What is inside: at this moment (for debugging) I usually start
> > > > dmtcp_coordinator on the frontend and use DMTCP options to point
> > > > to it. We already have a solution for running the coordinator in
> > > > batch mode too, but until you get correct behavior that is not
> > > > reasonable.
> > > > We test DMTCP mostly with Open MPI. A different MPI implementation
> > > > could also be the reason, but we need to check whether that is so.
> > > >
> > > >
> > > > 1. I need to additionally check the Torque plugin myself. This
> > > > will take a few days.
> > > > 2. What application are you running, and is it possible for me to
> > > > get it for testing, with instructions on how to do exactly what
> > > > you do?
> > > > 3. I have access to Torque 2.x installations, and we didn't test
> > > > Torque 4.x. Is it possible for me to have access on your system
> > > > for testing and debugging?
> > > >
> > > >
> > > >
> > > > 2013/10/29 Bryan F Putnam < [email protected] >
> > > >
> > > >
> > > >
> > > >
> > > > Hi Artem, thanks for writing back.
> > > >
> > > >
> > > > We're using DMTCP-2.0 and Torque-4.1.5.1.
> > > >
> > > >
> > > > I'm a bit confused as to how to install a dmtcp plugin, or if in
> > > > fact the Torque plugin is already installed by default. For
> > > > example
> > > > if I start up a nodes=2:ppn=2 PBS session, my $PBS_NODEFILE may
> > > > look
> > > > something like
> > > >
> > > >
> > > > host1
> > > > host1
> > > > host2
> > > > host2
> > > >
> > > >
> > > > I then do
> > > >
> > > >
> > > > dmtcp_launch --rm mpiexec -np 4 ./a.out
> > > >     (the 4-process job successfully runs on 2 processors on each
> > > >     of the 2 nodes)
> > > > dmtcp_command --checkpoint   (in a separate window)
> > > > dmtcp_command --kill         (in a separate window)
> > > > dmtcp_restart ckpt*.dmtcp
> > > >
> > > >
> > > > After the last step, the job successfully restarts, but all 4
> > > > processes are now running on the localhost (host1), nothing is
> > > > running on host2, and the $PBS_NODEFILE appears to be ignored.
> > > >
> > > >
> > > > Thanks for any tips!
> > > >
> > > >
> > > > Bryan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Hello, Bryan.
> > > >
> > > >
> > > > What version of DMTCP and Torque do you use?
> > > >
> > > >
> > > >
> > > > 2013/10/29 gene < [email protected] >
> > > >
> > > >
> > > > > Perhaps this is something that is handled by the Torque plugin?
> > > > Yes, that's correct. You'll need to use the DMTCP plugin for
> > > > Torque. Artem Polyakov is supporting that, and I'm cc'ing him.
> > > > Among other issues, mount points can change and network addresses
> > > > can change on restart. The plugin tries to handle that.
> > > >
> > > > Please let us know if you have any trouble using the Torque
> > > > plugin.
> > > >
> > > > Best,
> > > > - Gene
> > > >
> > > > On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote:
> > > > >
> > > > > Dear DMTCP developers,
> > > > >
> > > > > I've found that when restarting a multi-node job, dmtcp_restart
> > > > > only appears to be aware of the local host. Is it possible to
> > > > > tell
> > > > > dmtcp_restart which hosts are currently available for a job
> > > > > restart, whether it's the same set of multiple hosts, or a
> > > > > completely different set of hosts?
> > > > >
> > > > > Typically our hosts are contained in $PBS_NODEFILE since we use
> > > > > Torque. Perhaps this is something that is handled by the Torque
> > > > > plugin?
> > > > >
> > > > > Thanks,
> > > > > Bryan
> > > > >
> > > > > --
> > > > > Bryan Putnam
> > > > > Senior Scientific Applications Analyst
> > > > > Rosen Center for Advanced Computing, Purdue University
> > > > > Young Hall (Rm. 910)
> > > > > 155 S. Grant St.
> > > > > West Lafayette, IN 47907-2114
> > > > > Ph 765-496-8225 Fax 765-496-2275
> > > > > [email protected]
> > > > > www.purdue.edu/itap
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > С Уважением, Поляков Артем Юрьевич
> > > > Best regards, Artem Y. Polyakov
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > С Уважением, Поляков Артем Юрьевич
> > > > Best regards, Artem Y. Polyakov
> > >
> > > > c************************************************************************
> > > > c matmat.f - matrix-matrix multiply, C = A*B
> > > > c simple self-scheduling version
> > > > c************************************************************************
> > > >       program matmat
> > > >
> > > >       include 'mpif.h'
> > > > c use mpi
> > > >
> > > >       integer MAX_AROWS, MAX_ACOLS, MAX_BCOLS
> > > > c parameter (MAX_AROWS = 20, MAX_ACOLS = 1000, MAX_BCOLS = 20)
> > > > c parameter (MAX_AROWS = 200, MAX_ACOLS = 1000, MAX_BCOLS = 200)
> > > >       parameter (MAX_AROWS = 2000, MAX_ACOLS = 2000, MAX_BCOLS = 2000)
> > > > c parameter (MAX_AROWS = 4000, MAX_ACOLS = 4000, MAX_BCOLS = 4000)
> > > >       double precision a(MAX_AROWS,MAX_ACOLS), b(MAX_ACOLS,MAX_BCOLS)
> > > >       double precision c(MAX_AROWS,MAX_BCOLS)
> > > >       double precision buffer(MAX_ACOLS), ans(MAX_BCOLS)
> > > >       double precision start_time, stop_time
> > > >
> > > >       integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
> > > >       integer i, j, numsent, numrcvd, sender
> > > >       integer anstype, row, arows, acols, brows, bcols, crows, ccols
> > > >       integer errorcode
> > > >       integer niter, iter
> > > >
> > > >       call MPI_INIT(ierr)
> > > >       call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
> > > >       call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
> > > >       if (numprocs .lt. 2) then
> > > >          print *, "Must have at least 2 processes!"
> > > >          errorcode = 1
> > > >          call MPI_ABORT(MPI_COMM_WORLD, errorcode, ierr)
> > > >          stop
> > > >       endif
> > > >       print *, "Process ", myid, " of ", numprocs, " is alive"
> > > >
> > > >       arows = MAX_AROWS
> > > >       acols = MAX_ACOLS
> > > >       brows = MAX_ACOLS
> > > >       bcols = MAX_BCOLS
> > > >       crows = MAX_AROWS
> > > >       ccols = MAX_BCOLS
> > > >
> > > >       master = 0
> > > >
> > > > c
> > > >       niter = 400
> > > > c niter = 100
> > > > c niter = 20
> > > > c niter = 800
> > > > c niter = 4
> > > >       do 900 iter = 1, niter
> > > > c
> > > >       if ( myid .eq. master ) then
> > > > c master initializes and then dispatches
> > > > c initialization of a and b, broadcast of b
> > > > c
> > > > c a(i,j) = i + j
> > > > c
> > > >          do 22 i = 1, arows
> > > >          do 22 j = 1, acols
> > > >             a(i,j) = dble(i+j)
> > > >  22 continue
> > > >
> > > >          do 20 i = 1, brows
> > > >          do 20 j = 1, bcols
> > > >             b(i,j) = dble(i+j)
> > > >  20 continue
> > > >
> > > >          start_time = MPI_WTIME()
> > > > c start_time = mclock()
> > > >          if ( numprocs .lt. 2 ) then
> > > >             do 46 j = 1,ccols
> > > >             do 46 i = 1,crows
> > > >                c(i,j) = 0.0
> > > >             do 46 k = 1,acols
> > > >                c(i,j) = c(i,j) + a(i,k)*b(k,j)
> > > >  46 continue
> > > >             go to 200
> > > >          endif
> > > >
> > > >          do 25 i = 1,bcols
> > > >             call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
> > > >      $ MPI_COMM_WORLD, ierr)
> > > >  25 continue
> > > >
> > > >          numsent = 0
> > > >          numrcvd = 0
> > > >
> > > > c send a row of a to each other process; tag with row number
> > > >          do 40 i = 1,numprocs-1
> > > >             do 30 j = 1,acols
> > > >                buffer(j) = a(i,j)
> > > >  30 continue
> > > >             call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
> > > >      $ i, i, MPI_COMM_WORLD, ierr)
> > > >             numsent = numsent+1
> > > >  40 continue
> > > >
> > > >          do 70 i = 1,crows
> > > >          call MPI_RECV(ans, ccols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE,
> > > >      $ MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
> > > >             sender = status(MPI_SOURCE)
> > > >             anstype = status(MPI_TAG)
> > > >
> > > >          do 45 j = 1,ccols
> > > >             c(anstype,j) = ans(j)
> > > >  45 continue
> > > >             if (numsent .lt. arows) then
> > > >                do 50 j = 1,acols
> > > >                   buffer(j) = a(numsent+1,j)
> > > >  50 continue
> > > >                call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
> > > >      $ sender, numsent+1, MPI_COMM_WORLD, ierr)
> > > >                numsent = numsent+1
> > > >             else
> > > >             call MPI_SEND(1.0, 1, MPI_DOUBLE_PRECISION, sender, 0,
> > > >      $ MPI_COMM_WORLD, ierr)
> > > >             endif
> > > >  70 continue
> > > >
> > > >       else
> > > > c slaves receive b, then compute dot products until done message
> > > >          do 85 i = 1,bcols
> > > >          call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
> > > >      $ MPI_COMM_WORLD, ierr)
> > > >  85 continue
> > > >  90 continue
> > > >          call MPI_RECV(buffer, acols, MPI_DOUBLE_PRECISION, master,
> > > >      $ MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
> > > >          if (status(MPI_TAG) .eq. 0) then
> > > >             go to 200
> > > >          else
> > > >             row = status(MPI_TAG)
> > > >             do 100 i = 1,bcols
> > > >                ans(i) = 0.0
> > > >                do 95 j = 1,acols
> > > >                   ans(i) = ans(i) + buffer(j)*b(j,i)
> > > >   95 continue
> > > >  100 continue
> > > >             call MPI_SEND(ans, bcols, MPI_DOUBLE_PRECISION, master, row,
> > > >      $ MPI_COMM_WORLD, ierr)
> > > >             go to 90
> > > >          endif
> > > >       endif
> > > >
> > > >  200 continue
> > > > c print out the answer
> > > > c do 80 i = 1,crows
> > > > c do 80 j = 1,ccols
> > > > c print *, "c(", i, j, ") = ", c(i,j)
> > > > c80 continue
> > > >
> > > >       if ( myid .eq. master ) then
> > > >          stop_time = MPI_WTIME()
> > > > c stop_time = mclock()
> > > >          print *, 'Time is ', stop_time - start_time,
> > > >      & ' seconds for iteration ', iter
> > > >       endif
> > > > c
> > > >  900 continue
> > > > c
> > > >       call MPI_FINALIZE(ierr)
> > > >       stop
> > > >       end

_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
