If you're running interactively on a local cluster, then
DMTCP should restart a multi-node MPI job over TCP/IP without any problem.
If you're not seeing this, it's a bug in DMTCP.  Could you report it?
(And if you could also try it with the latest DMTCP svn, even better.)

Thanks,
- Gene

On Fri, Nov 01, 2013 at 03:58:41PM -0400, Bryan F Putnam wrote:
> Hi Gene,
> 
> Yes, I can take a look at the documentation and try to give some suggestions. 
> I'll get back to you soon on that.
> 
> Our mvapich2 builds are configured with
>  --with-device=ch3:mrail \
>  --with-rdma=gen2 \
> 
> and they don't run on TCP/IP networks, only IB. However we do have some 
> mpich2 and mpich-3 builds (upon which mvapich2 is based) which use TCP/IP, 
> and I'm able to successfully checkpoint, kill, and restart parallel mpich2 
> jobs as long as they are using only a single node.
> 
> In general, DMTCP appears to be working well for me, as long as the job is 
> running on a single node. I can checkpoint, kill the job, and restart it, and 
> it will restart again, even on a different node. It's just that when more 
> than one node is involved, DMTCP doesn't appear to retain information 
> about the remote nodes, and it restarts everything on whatever localhost it 
> is restarted on. Perhaps I'm just missing something simple; I'm having 
> difficulty understanding how to use the "rm" plugin.
> 
> I was also able to checkpoint and restart a parallel Gaussian09 job, which 
> doesn't use MPI at all. But again it only worked when the parallel job was a 
> single node job.
> 
> Thanks,
> Bryan
> 
> 
> ----- Original Message -----
> > Hi Bryan,
> > Also, we've been thinking about how to improve the documentation
> > for the resource managers (Torque and SLURM). We always get good
> > insights by watching people encounter it for the first time.
> > If you have the time, could you make some rough notes on how we
> > can improve our documentation (what to emphasize, extra pointers
> > to include, etc.)?
> > 
> > As for InfiniBand, we're now tracking down still one more bug (a race
> > condition). For InfiniBand, please continue updating from our svn:
> > svn co svn://svn.code.sf.net/p/dmtcp/code/trunk dmtcp-trunk
> > We're hoping to have the last bugs out of InfiniBand sometime this
> > next week.
> > 
> > You also mention mvapich2. Does that work for you with ordinary
> > Ethernet? If it fails for you even in that case, would you mind letting
> > us know (either informally or as a bug report -- whichever you like)?
> > 
> > Thanks,
> > - Gene
> > 
> > On Fri, Nov 01, 2013 at 03:00:08PM -0400, Bryan F Putnam wrote:
> > > Thanks for the examples, Artem. Let me take a look at these, and also
> > > your instructions in
> > >
> > >
> > > .../dmtcp-2.0/contrib/rm/README
> > >
> > >
> > > and see if I can come up with something that works with Torque-4. If
> > > not, I'll contact my supervisor and I'm sure he'd be happy to let us
> > > set up an account for you on one of our clusters. So far I've tried
> > > using both openmpi and mpich2 (and mpich-3) but am seeing the same
> > > problems with not being able to specify a specific set of nodes on
> > > restarting.
> > >
> > >
> > > I've also tried mvapich2, but that fails for different reasons, and
> > > I do see that InfiniBand is not fully supported.
> > >
> > >
> > > Please feel free to play around with my Fortran code "matmat2.f".
> > > It's a simple matrix multiply inside a loop. If it doesn't run long
> > > enough for you, just modify the variable "niter". The iteration is
> > > printed as the job proceeds, so it's easy to see that the job is
> > > picking up where it left off, after being checkpointed and
> > > restarted.
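
The self-scheduling pattern that matmat2.f implements can be sketched
without MPI: the master hands out one row of A at a time, tagged with its
row index, and gives the next undispatched row to whichever worker reports
back first. A serial Python sketch of that bookkeeping (the function and
variable names here are illustrative, not taken from the Fortran source):

```python
import random

def self_schedule(a, b_cols, nworkers=3):
    """Multiply a (list of rows) by a matrix stored column-wise in
    b_cols, the way a self-scheduling master would: rows are dispatched
    one per worker, and results may come back in any order."""
    nrows = len(a)
    c = [None] * nrows
    pending = list(range(nrows))          # rows not yet dispatched
    in_flight = pending[:nworkers]        # initial round of sends
    pending = pending[nworkers:]
    while in_flight:
        # a random in-flight row "finishes first"
        row = in_flight.pop(random.randrange(len(in_flight)))
        c[row] = [sum(x * y for x, y in zip(a[row], col))
                  for col in b_cols]
        if pending:                       # hand that worker the next row
            in_flight.append(pending.pop(0))
    return c

a = [[1, 2], [3, 4], [5, 6]]
b_cols = [[1, 0], [0, 1]]                 # identity, stored column-wise
print(self_schedule(a, b_cols))           # -> [[1, 2], [3, 4], [5, 6]]
```

The result is identical no matter which worker finishes first, which is
why the row-index tag alone is enough to reassemble C in order.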
> > >
> > >
> > > Thanks,
> > > Bryan
> > >
> > > ----- Original Message -----
> > >
> > > Bryan,
> > >
> > > The resource manager plugin is installed by default. As far as I can
> > > see, you are executing the application correctly.
> > > Just in case, I am attaching initial and restart batch scripts to
> > > this e-mail for reference.
> > > What is inside: at the moment (for debugging) I usually start
> > > dmtcp_coordinator on the frontend and use DMTCP options to point to
> > > it. We already have a solution for running the coordinator in batch
> > > mode too, but until you get correct behavior that is not
> > > reasonable.
> > > We test DMTCP mostly with Open MPI. A different MPI implementation
> > > could also be the reason, but we need to check whether that is so.
> > >
> > >
> > > 1. I need to additionally check the Torque plugin myself. This will
> > > take a few days.
> > > 2. What application are you running, and is it possible for me to get
> > > it for testing, with instructions on how to run it exactly as you do?
> > > 3. I have access to Torque 2.x installations, but we didn't test
> > > Torque 4.x. Is it possible for me to have access to your system for
> > > testing and debugging?
> > >
> > > 2013/10/29 Bryan F Putnam < [email protected] >
> > >
> > > Hi Artem, thanks for writing back.
> > >
> > >
> > > We're using DMTCP-2.0 and Torque-4.1.5.1.
> > >
> > >
> > > I'm a bit confused as to how to install a dmtcp plugin, or if in
> > > fact the Torque plugin is already installed by default. For example
> > > if I start up a nodes=2:ppn=2 PBS session, my $PBS_NODEFILE may look
> > > something like
> > >
> > >
> > > host1
> > > host1
> > > host2
> > > host2
> > >
> > >
> > > I then do
> > >
> > >
> > > dmtcp_launch --rm mpiexec -np 4 ./a.out (4-processor job
> > > successfully runs on 2 processors on each of 2 nodes)
> > > dmtcp_command --checkpoint (in a separate window)
> > > dmtcp_command --kill (in a separate window)
> > > dmtcp_restart ckpt*.dmtcp
> > >
> > >
> > > After the last step, the job successfully restarts, but all 4
> > > processes are now running on the localhost (host1), nothing is
> > > running on host2, and the $PBS_NODEFILE appears to be ignored.
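
What a multi-node restart would have to reconstruct here is the mapping
in the new allocation's $PBS_NODEFILE, which lists one hostname per slot.
A small Python sketch of turning that file's contents into per-host slot
counts (the function name is illustrative; this is not part of DMTCP or
the Torque plugin):

```python
def parse_nodefile(lines):
    """Turn $PBS_NODEFILE contents (one hostname per slot, repeated
    once per core) into an ordered host -> slot-count mapping."""
    counts = {}
    for line in lines:
        host = line.strip()
        if host:                      # skip blank lines
            counts[host] = counts.get(host, 0) + 1
    return counts

# Example matching the nodes=2:ppn=2 session described above.
print(parse_nodefile(["host1", "host1", "host2", "host2"]))
# -> {'host1': 2, 'host2': 2}
```

In the behavior Bryan describes, all four slots effectively collapse onto
the first host instead of being spread per this mapping.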
> > >
> > >
> > > Thanks for any tips!
> > >
> > >
> > > Bryan
> > >
> > > Hello, Bryan.
> > >
> > > What version of DMTCP/Torque are you using?
> > >
> > > 2013/10/29 gene < [email protected] >
> > >
> > >
> > > > Perhaps this is something that is handled by the Torque plugin?
> > > Yes, that's correct. You'll need to use the DMTCP plugin for Torque.
> > > Artem Polyakov is supporting that, and I'm cc'ing him. Among other
> > > issues, mount points and network addresses can change on restart.
> > > The plugin tries to handle that.
> > >
> > > Please let us know if you have any trouble using the Torque plugin.
> > >
> > > Best,
> > > - Gene
> > >
> > > On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote:
> > > >
> > > > Dear DMTCP developers,
> > > >
> > > > I've found that when restarting a multi-node job, dmtcp_restart
> > > > only appears to be aware of the local host. Is it possible to tell
> > > > dmtcp_restart which hosts are currently available for a job
> > > > restart, whether it's the same set of multiple hosts, or a
> > > > completely different set of hosts?
> > > >
> > > > Typically our hosts are contained in $PBS_NODEFILE since we use
> > > > Torque. Perhaps this is something that is handled by the Torque
> > > > plugin?
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > > --
> > > > Bryan Putnam
> > > > Senior Scientific Applications Analyst
> > > > Rosen Center for Advanced Computing, Purdue University
> > > > Young Hall (Rm. 910)
> > > > 155 S. Grant St.
> > > > West Lafayette, IN 47907-2114
> > > > Ph 765-496-8225 Fax 765-496-2275
> > > > [email protected]
> > > > www.purdue.edu/itap
> > >
> > >
> > >
> > >
> > > --
> > > С Уважением, Поляков Артем Юрьевич
> > > Best regards, Artem Y. Polyakov
> > >
> > >
> > >
> > >
> > > --
> > > С Уважением, Поляков Артем Юрьевич
> > > Best regards, Artem Y. Polyakov
> > 
> > > c************************************************************************
> > > c     matmat.f - matrix-matrix multiply, C = A*B
> > > c     simple self-scheduling version
> > > c************************************************************************
> > >       program matmat
> > >
> > >       include 'mpif.h'
> > > c     use mpi
> > >
> > >       integer MAX_AROWS, MAX_ACOLS, MAX_BCOLS
> > > c     parameter (MAX_AROWS = 20, MAX_ACOLS = 1000, MAX_BCOLS = 20)
> > > c     parameter (MAX_AROWS = 200, MAX_ACOLS = 1000, MAX_BCOLS = 200)
> > >       parameter (MAX_AROWS = 2000, MAX_ACOLS = 2000, MAX_BCOLS = 2000)
> > > c     parameter (MAX_AROWS = 4000, MAX_ACOLS = 4000, MAX_BCOLS = 4000)
> > >       double precision a(MAX_AROWS,MAX_ACOLS), b(MAX_ACOLS,MAX_BCOLS)
> > >       double precision c(MAX_AROWS,MAX_BCOLS)
> > >       double precision buffer(MAX_ACOLS), ans(MAX_BCOLS)
> > >       double precision start_time, stop_time
> > >       double precision done
> > >
> > >       integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
> > >       integer i, j, k, numsent, numrcvd, sender
> > >       integer anstype, row, arows, acols, brows, bcols, crows, ccols
> > >       integer errorcode
> > >       integer niter, iter
> > >
> > >       call MPI_INIT(ierr)
> > >       call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
> > >       call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
> > >       if (numprocs .lt. 2) then
> > >          print *, "Must have at least 2 processes!"
> > >          errorcode = 1
> > >          call MPI_ABORT(MPI_COMM_WORLD, errorcode, ierr)
> > >          stop
> > >       endif
> > >       print *, "Process ", myid, " of ", numprocs, " is alive"
> > >
> > >       arows = MAX_AROWS
> > >       acols = MAX_ACOLS
> > >       brows = MAX_ACOLS
> > >       bcols = MAX_BCOLS
> > >       crows = MAX_AROWS
> > >       ccols = MAX_BCOLS
> > >
> > >       master = 0
> > >
> > >       niter = 400
> > > c     niter = 100
> > > c     niter = 20
> > > c     niter = 800
> > > c     niter = 4
> > >       do 900 iter = 1, niter
> > >
> > >       if ( myid .eq. master ) then
> > > c        master initializes and then dispatches:
> > > c        initialization of a and b, broadcast of b
> > > c        a(i,j) = i + j
> > >          do 22 i = 1, arows
> > >             do 22 j = 1, acols
> > >                a(i,j) = dble(i+j)
> > >  22      continue
> > >
> > >          do 20 i = 1, brows
> > >             do 20 j = 1, bcols
> > >                b(i,j) = dble(i+j)
> > >  20      continue
> > >
> > >          start_time = MPI_WTIME()
> > > c        start_time = mclock()
> > >          if ( numprocs .lt. 2 ) then
> > >             do 46 j = 1,ccols
> > >             do 46 i = 1,crows
> > >                c(i,j) = 0.0
> > >                do 46 k = 1,acols
> > >                   c(i,j) = c(i,j) + a(i,k)*b(k,j)
> > >  46         continue
> > >             go to 200
> > >          endif
> > >
> > >          do 25 i = 1,bcols
> > >             call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
> > >      $                     MPI_COMM_WORLD, ierr)
> > >  25      continue
> > >
> > >          numsent = 0
> > >          numrcvd = 0
> > >
> > > c        send a row of a to each other process; tag with row number
> > >          do 40 i = 1,numprocs-1
> > >             do 30 j = 1,acols
> > >                buffer(j) = a(i,j)
> > >  30         continue
> > >             call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
> > >      $                    i, i, MPI_COMM_WORLD, ierr)
> > >             numsent = numsent+1
> > >  40      continue
> > >
> > >          do 70 i = 1,crows
> > >             call MPI_RECV(ans, ccols, MPI_DOUBLE_PRECISION,
> > >      $                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
> > >      $                    status, ierr)
> > >             sender = status(MPI_SOURCE)
> > >             anstype = status(MPI_TAG)
> > >
> > >             do 45 j = 1,ccols
> > >                c(anstype,j) = ans(j)
> > >  45         continue
> > >             if (numsent .lt. arows) then
> > >                do 50 j = 1,acols
> > >                   buffer(j) = a(numsent+1,j)
> > >  50            continue
> > >                call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
> > >      $                       sender, numsent+1, MPI_COMM_WORLD, ierr)
> > >                numsent = numsent+1
> > >             else
> > > c              tag 0 tells the worker it is done
> > >                done = 1.0d0
> > >                call MPI_SEND(done, 1, MPI_DOUBLE_PRECISION, sender, 0,
> > >      $                       MPI_COMM_WORLD, ierr)
> > >             endif
> > >  70      continue
> > >
> > >       else
> > > c        slaves receive b, then compute dot products until done message
> > >          do 85 i = 1,bcols
> > >             call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
> > >      $                     MPI_COMM_WORLD, ierr)
> > >  85      continue
> > >  90      continue
> > >          call MPI_RECV(buffer, acols, MPI_DOUBLE_PRECISION, master,
> > >      $                 MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
> > >          if (status(MPI_TAG) .eq. 0) then
> > >             go to 200
> > >          else
> > >             row = status(MPI_TAG)
> > >             do 100 i = 1,bcols
> > >                ans(i) = 0.0
> > >                do 95 j = 1,acols
> > >                   ans(i) = ans(i) + buffer(j)*b(j,i)
> > >  95            continue
> > > 100         continue
> > >             call MPI_SEND(ans, bcols, MPI_DOUBLE_PRECISION, master,
> > >      $                    row, MPI_COMM_WORLD, ierr)
> > >             go to 90
> > >          endif
> > >       endif
> > >
> > > 200   continue
> > > c     print out the answer
> > > c     do 80 i = 1,crows
> > > c        do 80 j = 1,ccols
> > > c           print *, "c(", i, j, ") = ", c(i,j)
> > > c 80  continue
> > >
> > >       if ( myid .eq. master ) then
> > >          stop_time = MPI_WTIME()
> > > c        stop_time = mclock()
> > >          print *, 'Time is ', stop_time - start_time,
> > >      &           ' seconds for iteration ', iter
> > >       endif
> > >
> > > 900   continue
> > >
> > >       call MPI_FINALIZE(ierr)
> > >       stop
> > >       end

_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
