Thanks for the examples, Artem. Let me take a look at these, and also your
instructions in
.../dmtcp-2.0/contrib/rm/README
and see if I can come up with something that works with Torque-4. If not, I'll
contact my supervisor and I'm sure he'd be happy to let us set up an account
for you on one of our clusters. So far I've tried both openmpi and mpich2
(and mpich-3), but I see the same problem with each: I can't specify a
particular set of nodes on restart.
I've also tried mvapich2, but that fails for different reasons, and I do see
that InfiniBand is not fully supported.
Please feel free to play around with my Fortran code "matmat2.f". It's a simple
matrix multiply inside a loop. If it doesn't run long enough for you, just
modify the variable "niter". The iteration number is printed as the job
proceeds, so it's easy to see that the job is picking up where it left off
after being checkpointed and restarted.
Thanks,
Bryan
----- Original Message -----
Bryan,
The resource manager plugin is installed by default. As far as I can see, you
are launching the application correctly.
Just in case, I am attaching the initial and restart batch scripts to this
e-mail for reference.
What is inside: at the moment (for debugging) I usually start
dmtcp_coordinator on the front end and use DMTCP options to point to it. We
already have a way to run the coordinator in batch mode too, but until you get
correct behavior it does not make sense to use it.
We mostly test DMTCP with Open MPI. A different MPI implementation could also
be the cause, but we need to check whether that is so.
1. I need to check the Torque plugin myself as well. This will take a few
days. We add
2. What application do you run, and is it possible for me to get it for
testing, with instructions on how to run it exactly as you do?
3. I have access to Torque 2.x installations, and we haven't tested Torque
4.x. Is it possible for me to have access to your system for testing and
debugging?
2013/10/29 Bryan F Putnam < [email protected] >
Hi Artem, thanks for writing back.
We're using DMTCP-2.0 and Torque-4.1.5.1.
I'm a bit confused as to how to install a dmtcp plugin, or if in fact the
Torque plugin is already installed by default. For example if I start up a
nodes=2:ppn=2 PBS session, my $PBS_NODEFILE may look something like
host1
host1
host2
host2
I then do
dmtcp_launch --rm mpiexec -np 4 ./a.out (4-processor job successfully runs on 2
processors on each of 2 nodes)
dmtcp_command --checkpoint (in a separate window)
dmtcp_command --kill (in a separate window)
dmtcp_restart ckpt*.dmtcp
After the last step, the job successfully restarts, but all 4 processes are now
running on the localhost (host1), nothing is running on host2, and the
$PBS_NODEFILE appears to be ignored.
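For reference, the node file listed above can be summarized as hosts and slot counts, which makes it easier to compare against where the processes actually land after a restart. This is only a minimal sketch: `count_slots` is a hypothetical helper, not part of DMTCP or Torque, and it assumes one hostname per line, repeated once per slot, as in the listing above.

```shell
# Hypothetical helper (not part of DMTCP or Torque): print "host slots"
# for each host in a node file, in first-seen order.
# Assumes one hostname per line, repeated once per slot.
count_slots() {
    awk 'NF { if (!($1 in n)) order[++k] = $1; n[$1]++ }
         END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }' "$1"
}

# Example using a throwaway copy of the node file shown above.
nodefile=$(mktemp)
printf 'host1\nhost1\nhost2\nhost2\n' > "$nodefile"
count_slots "$nodefile"    # prints "host1 2" then "host2 2"
rm -f "$nodefile"
```

Inside a real PBS session the same helper could be pointed at "$PBS_NODEFILE" instead of the temporary file.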
Thanks for any tips!
Bryan
Hello, Bryan.
What versions of DMTCP and Torque do you use?
2013/10/29 gene < [email protected] >
> Perhaps this is something that is handled by the Torque plugin?
Yes, that's correct. You'll need to use the DMTCP plugin for Torque.
Artem Polyakov is supporting that, and I'm cc'ing to him. Among other
issues, mount points can change and network addresses can change on restart.
The plugin tries to handle that.
Please let us know if you have any trouble using the Torque plugin.
Best,
- Gene
On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote:
>
> Dear DMTCP developers,
>
> I've found that when restarting a multi-node job, dmtcp_restart only appears
> to be aware of the local host. Is it possible to tell dmtcp_restart which
> hosts are currently available for a job restart, whether it's the same set of
> multiple hosts, or a completely different set of hosts?
>
> Typically our hosts are contained in $PBS_NODEFILE since we use Torque.
> Perhaps this is something that is handled by the Torque plugin?
>
> Thanks,
> Bryan
>
> --
> Bryan Putnam
> Senior Scientific Applications Analyst
> Rosen Center for Advanced Computing, Purdue University
> Young Hall (Rm. 910)
> 155 S. Grant St.
> West Lafayette, IN 47907-2114
> Ph 765-496-8225 Fax 765-496-2275
> [email protected]
> www.purdue.edu/itap
--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
c************************************************************************
c  matmat.f - matrix-matrix multiply, C = A*B
c  simple self-scheduling version
c************************************************************************
      program matmat
      include 'mpif.h'
c     use mpi
      integer MAX_AROWS, MAX_ACOLS, MAX_BCOLS
c     parameter (MAX_AROWS = 20, MAX_ACOLS = 1000, MAX_BCOLS = 20)
c     parameter (MAX_AROWS = 200, MAX_ACOLS = 1000, MAX_BCOLS = 200)
      parameter (MAX_AROWS = 2000, MAX_ACOLS = 2000, MAX_BCOLS = 2000)
c     parameter (MAX_AROWS = 4000, MAX_ACOLS = 4000, MAX_BCOLS = 4000)
      double precision a(MAX_AROWS,MAX_ACOLS), b(MAX_ACOLS,MAX_BCOLS)
      double precision c(MAX_AROWS,MAX_BCOLS)
      double precision buffer(MAX_ACOLS), ans(MAX_BCOLS)
      double precision start_time, stop_time
      integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
      integer i, j, k, numsent, numrcvd, sender
      integer anstype, row, arows, acols, brows, bcols, crows, ccols
      integer errorcode
      integer niter, iter
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (numprocs .lt. 2) then
         print *, "Must have at least 2 processes!"
         errorcode = 1
         call MPI_ABORT(MPI_COMM_WORLD, errorcode, ierr)
         stop
      endif
      print *, "Process ", myid, " of ", numprocs, " is alive"
      arows = MAX_AROWS
      acols = MAX_ACOLS
      brows = MAX_ACOLS
      bcols = MAX_BCOLS
      crows = MAX_AROWS
      ccols = MAX_BCOLS
      master = 0
c
c     niter sets how many times the whole multiply is repeated
      niter = 400
c     niter = 100
c     niter = 20
c     niter = 800
c     niter = 4
      do 900 iter = 1, niter
c
      if ( myid .eq. master ) then
c        master initializes and then dispatches
c        initialization of a and b, broadcast of b
c
c        a(i,j) = i + j
c
         do 22 i = 1, arows
            do 22 j = 1, acols
               a(i,j) = dble(i+j)
 22      continue
         do 20 i = 1, brows
            do 20 j = 1, bcols
               b(i,j) = dble(i+j)
 20      continue
         start_time = MPI_WTIME()
c        start_time = mclock()
c        serial fallback; not reached, since numprocs < 2 aborts above
         if ( numprocs .lt. 2 ) then
            do 46 j = 1,ccols
               do 46 i = 1,crows
                  c(i,j) = 0.0
                  do 46 k = 1,acols
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
 46         continue
            go to 200
         endif
         do 25 i = 1,bcols
            call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
     $           MPI_COMM_WORLD, ierr)
 25      continue
         numsent = 0
         numrcvd = 0
c        send a row of a to each other process; tag with row number
         do 40 i = 1,numprocs-1
            do 30 j = 1,acols
               buffer(j) = a(i,j)
 30         continue
            call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
     $           i, i, MPI_COMM_WORLD, ierr)
            numsent = numsent+1
 40      continue
         do 70 i = 1,crows
            call MPI_RECV(ans, ccols, MPI_DOUBLE_PRECISION,
     $           MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
     $           status, ierr)
            sender = status(MPI_SOURCE)
            anstype = status(MPI_TAG)
            do 45 j = 1,ccols
               c(anstype,j) = ans(j)
 45         continue
            if (numsent .lt. arows) then
               do 50 j = 1,acols
                  buffer(j) = a(numsent+1,j)
 50            continue
               call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
     $              sender, numsent+1, MPI_COMM_WORLD, ierr)
               numsent = numsent+1
            else
c              tag 0 is the done message; send a double precision
c              value to match the declared MPI datatype
               call MPI_SEND(1.0d0, 1, MPI_DOUBLE_PRECISION, sender, 0,
     $              MPI_COMM_WORLD, ierr)
            endif
 70      continue
      else
c        slaves receive b, then compute dot products until done message
         do 85 i = 1,bcols
            call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
     $           MPI_COMM_WORLD, ierr)
 85      continue
 90      continue
         call MPI_RECV(buffer, acols, MPI_DOUBLE_PRECISION, master,
     $        MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
         if (status(MPI_TAG) .eq. 0) then
            go to 200
         else
            row = status(MPI_TAG)
            do 100 i = 1,bcols
               ans(i) = 0.0
               do 95 j = 1,acols
                  ans(i) = ans(i) + buffer(j)*b(j,i)
 95            continue
 100        continue
            call MPI_SEND(ans, bcols, MPI_DOUBLE_PRECISION, master,
     $           row, MPI_COMM_WORLD, ierr)
            go to 90
         endif
      endif
 200  continue
c     print out the answer
c     do 80 i = 1,crows
c        do 80 j = 1,ccols
c           print *, "c(", i, j, ") = ", c(i,j)
c80   continue
      if ( myid .eq. master ) then
         stop_time = MPI_WTIME()
c        stop_time = mclock()
         print *, 'Time is ', stop_time - start_time,
     &        ' seconds for iteration ', iter
      endif
c
 900  continue
c
      call MPI_FINALIZE(ierr)
      stop
      end
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum