Thanks for the examples, Artem. Let me take a look at these, and also at your 
instructions in 


.../dmtcp-2.0/contrib/rm/README 


and see if I can come up with something that works with Torque-4. If not, I'll 
contact my supervisor, and I'm sure he'd be happy to let us set up an account 
for you on one of our clusters. So far I've tried using both Open MPI and MPICH2 
(and MPICH 3), but I'm seeing the same problem in each case: I'm unable to 
specify a particular set of nodes on restart. 


I've also tried MVAPICH2, but that fails for different reasons, and I do see 
that InfiniBand is not fully supported. 


Please feel free to play around with my Fortran code "matmat2.f". It's a simple 
matrix multiply inside a loop. If it doesn't run long enough for you, just 
modify the variable "niter". The iteration is printed as the job proceeds, so 
it's easy to see that the job is picking up where it left off, after being 
checkpointed and restarted. 
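
For reference, a typical build-and-launch sequence for a code like this would be 
roughly the following (a sketch only: the wrapper compiler name and binary name 
depend on the local MPI installation):

```shell
# Build the Fortran 77 MPI example (mpif90 or mpifort on some installs)
mpif77 -o matmat2 matmat2.f

# Launch 4 ranks under DMTCP; --rm enables the resource-manager (Torque) plugin
dmtcp_launch --rm mpiexec -np 4 ./matmat2
```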


Thanks, 
Bryan 




----- Original Message -----



Bryan, 



The resource manager plugin is installed by default, and as far as I can see 
you are executing the application correctly. 
Just in case, I am attaching the initial and restart batch scripts to this 
e-mail for reference. 
What is inside them: at the moment (for debugging) I usually start 
dmtcp_coordinator on the frontend and use the DMTCP options to point to it. We 
already have a solution for running the coordinator in batch mode as well, but 
until you get correct behavior that extra step is not worthwhile. 
We mostly test DMTCP with Open MPI. A different MPI implementation could also 
be the cause, but we need to check whether that is so. 
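
A minimal sketch of that setup might look like the following (hostname, port, 
and environment-variable spellings are illustrative; DMTCP 2.0 used 
DMTCP_HOST/DMTCP_PORT, while later versions use DMTCP_COORD_HOST/DMTCP_COORD_PORT):

```shell
# On the cluster frontend: start a coordinator on a fixed, known port
dmtcp_coordinator --port 7779 &

# Inside the PBS batch script: point dmtcp_launch at that coordinator
export DMTCP_HOST=frontend.example.edu   # coordinator host (assumed name)
export DMTCP_PORT=7779
dmtcp_launch --rm mpiexec -np 4 ./a.out
```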


1. I need to check the Torque plugin myself; this will take a few days. 
2. What application are you running, and would it be possible for me to get it 
for testing, along with instructions so I can reproduce exactly what you do? 
3. I have access to Torque 2.x installations, and we haven't tested Torque 4.x. 
Would it be possible for me to have access to your system for testing and 
debugging? 



2013/10/29 Bryan F Putnam < [email protected] > 




Hi Artem, thanks for writing back. 


We're using DMTCP-2.0 and Torque-4.1.5.1. 


I'm a bit confused as to how to install a dmtcp plugin, or if in fact the 
Torque plugin is already installed by default. For example if I start up a 
nodes=2:ppn=2 PBS session, my $PBS_NODEFILE may look something like 


host1 
host1 
host2 
host2 


I then do 


dmtcp_launch --rm mpiexec -np 4 ./a.out   (the 4-process job successfully runs 
with 2 processes on each of the 2 nodes) 
dmtcp_command --checkpoint   (in a separate window) 
dmtcp_command --kill   (in a separate window) 
dmtcp_restart ckpt*.dmtcp 


After the last step, the job successfully restarts, but all 4 processes are now 
running on the localhost (host1), nothing is running on host2, and the 
$PBS_NODEFILE appears to be ignored. 
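
(For reference, at checkpoint time DMTCP also writes a dmtcp_restart_script.sh 
next to the ckpt*.dmtcp images, which is intended to redistribute the restarted 
processes across hosts. A sketch of the two restart paths, with hostname and 
port illustrative and option spellings varying between DMTCP versions:)

```shell
# Option 1: the generated script, which ssh'es images back out to worker nodes
./dmtcp_restart_script.sh

# Option 2: manual restart, pointing at a running coordinator
dmtcp_restart --host frontend.example.edu --port 7779 ckpt*.dmtcp
```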


Thanks for any tips! 


Bryan 







Hello, Bryan. 


What versions of DMTCP and Torque are you using? 



2013/10/29 gene < [email protected] > 


> Perhaps this is something that is handled by the Torque plugin? 
Yes, that's correct. You'll need to use the DMTCP plugin for Torque. 
Artem Polyakov is supporting that, and I'm cc'ing to him. Among other 
issues, mount points can change and network addresses can change on restart. 
The plugin tries to handle that. 

Please let us know if you have any trouble using the Torque plugin. 

Best, 
- Gene 

On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote: 
> 
> Dear DMTCP developers, 
> 
> I've found that when restarting a multi-node job, dmtcp_restart only appears 
> to be aware of the local host. Is it possible to tell dmtcp_restart which 
> hosts are currently available for a job restart, whether it's the same set of 
> multiple hosts, or a completely different set of hosts? 
> 
> Typically our hosts are contained in $PBS_NODEFILE since we use Torque. 
> Perhaps this is something that is handled by the Torque plugin? 
> 
> Thanks, 
> Bryan 
> 
> -- 
> Bryan Putnam 
> Senior Scientific Applications Analyst 
> Rosen Center for Advanced Computing, Purdue University 
> Young Hall (Rm. 910) 
> 155 S. Grant St. 
> West Lafayette, IN 47907-2114 
> Ph 765-496-8225 Fax 765-496-2275 
> [email protected] 
> www.purdue.edu/itap 




-- 
С Уважением, Поляков Артем Юрьевич 
Best regards, Artem Y. Polyakov 




-- 
С Уважением, Поляков Артем Юрьевич 
Best regards, Artem Y. Polyakov 
c************************************************************************
c     matmat.f - matrix-matrix multiply, C = A*B
c                simple self-scheduling version             
c************************************************************************
      program matmat

      include 'mpif.h'
c     use mpi

      integer MAX_AROWS, MAX_ACOLS, MAX_BCOLS
c     parameter (MAX_AROWS = 20, MAX_ACOLS = 1000, MAX_BCOLS = 20)
c     parameter (MAX_AROWS = 200, MAX_ACOLS = 1000, MAX_BCOLS = 200)
      parameter (MAX_AROWS = 2000, MAX_ACOLS = 2000, MAX_BCOLS = 2000)
c     parameter (MAX_AROWS = 4000, MAX_ACOLS = 4000, MAX_BCOLS = 4000)
      double precision a(MAX_AROWS,MAX_ACOLS), b(MAX_ACOLS,MAX_BCOLS)
      double precision c(MAX_AROWS,MAX_BCOLS)
      double precision buffer(MAX_ACOLS), ans(MAX_BCOLS)
      double precision start_time, stop_time

      integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
      integer i, j, k, numsent, numrcvd, sender
      integer anstype, row, arows, acols, brows, bcols, crows, ccols
      integer errorcode
      integer niter, iter

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (numprocs .lt. 2) then
         print *, "Must have at least 2 processes!"
         errorcode = 1
         call MPI_ABORT(MPI_COMM_WORLD, errorcode, ierr)
         stop
      endif
      print *, "Process ", myid, " of ", numprocs, " is alive"

      arows = MAX_AROWS
      acols = MAX_ACOLS
      brows = MAX_ACOLS
      bcols = MAX_BCOLS
      crows = MAX_AROWS
      ccols = MAX_BCOLS

      master   = 0

c
      niter = 400
c     niter = 100
c     niter = 20
c     niter = 800
c     niter = 4
      do 900 iter = 1, niter
c
      if ( myid .eq. master ) then
c        master initializes and then dispatches
c        initialization of a and b, broadcast of b
c
c                          a(i,j) = i + j
c
         do 22 i = 1, arows
         do 22 j = 1, acols
            a(i,j) = dble(i+j)
 22      continue

         do 20 i = 1, brows
         do 20 j = 1, bcols
            b(i,j) = dble(i+j)
 20      continue

         start_time = MPI_WTIME()
c        start_time = mclock()
         if ( numprocs .lt. 2 ) then
            do 46 j = 1,ccols
            do 46 i = 1,crows
               c(i,j) = 0.0
            do 46 k = 1,acols
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
 46         continue
            go to 200
         endif

         do 25 i = 1,bcols
            call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
     $                     MPI_COMM_WORLD, ierr)
 25      continue

         numsent = 0
         numrcvd = 0
         
c        send a row of a to each other process; tag with row number
         do 40 i = 1,numprocs-1
            do 30 j = 1,acols
               buffer(j) = a(i,j)
 30         continue
            call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
     $           i, i, MPI_COMM_WORLD, ierr)
            numsent = numsent+1
 40      continue
         
         do 70 i = 1,crows
         call MPI_RECV(ans, ccols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE,
     $                 MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
            sender = status(MPI_SOURCE)
            anstype = status(MPI_TAG)

         do 45 j = 1,ccols
            c(anstype,j) = ans(j)
 45      continue
            if (numsent .lt. arows) then
               do 50 j = 1,acols
                  buffer(j) = a(numsent+1,j)
 50            continue
               call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION, 
     $              sender, numsent+1, MPI_COMM_WORLD, ierr)
               numsent = numsent+1
            else
c           tag 0 signals the worker that no more rows are coming; use a
c           double precision literal to match MPI_DOUBLE_PRECISION
            call MPI_SEND(1.0d0, 1, MPI_DOUBLE_PRECISION, sender, 0,
     $           MPI_COMM_WORLD, ierr)
            endif
 70      continue
         
      else
c        slaves receive b, then compute dot products until done message
         do 85 i = 1,bcols
         call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
     $        MPI_COMM_WORLD, ierr)
 85      continue
 90      continue
         call MPI_RECV(buffer, acols, MPI_DOUBLE_PRECISION, master,
     $        MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
         if (status(MPI_TAG) .eq. 0) then
            go to 200
         else
            row = status(MPI_TAG)
            do 100 i = 1,bcols
               ans(i) = 0.0
               do 95 j = 1,acols
                  ans(i) = ans(i) + buffer(j)*b(j,i)
  95           continue
 100        continue
            call MPI_SEND(ans, bcols, MPI_DOUBLE_PRECISION, master, row,
     $           MPI_COMM_WORLD, ierr)
            go to 90
         endif
      endif

 200  continue
c     print out the answer
c     do 80 i = 1,crows
c     do 80 j = 1,ccols
c        print *, "c(", i, j, ") = ", c(i,j)
c80   continue

      if ( myid .eq. master ) then
         stop_time = MPI_WTIME()
c        stop_time = mclock()
         print *, 'Time is ', stop_time - start_time, 
     &            ' seconds for iteration ', iter
      endif
c
 900  continue
c
      call MPI_FINALIZE(ierr)
      stop
      end
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
