Hi all,
I am trying to checkpoint an MVAPICH application. It does not behave as
expected, so I would appreciate some help.
I have compiled DMTCP with "--enable-infiniband-support" as the only flag,
and I have MVAPICH installed.
I can execute a test MPI application on two nodes without DMTCP. I can
also execute the application on a single node with DMTCP. However, if I
execute it on two nodes with DMTCP, only the first one will run.
Below is a series of test commands with their (lengthy) output, together
with the versions of everything.
Any ideas?
Thanks for your help,
Manuel
---
---
# mpichversion
MVAPICH2 Version: 2.2a
MVAPICH2 Release date: Mon Aug 17 20:00:00 EDT 2015
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --disable-mcast
MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -L/lib -L/lib -O2
MVAPICH2 FC: gfortran -O2
# dmtcp_coordinator --version
dmtcp_coordinator (DMTCP) 2.4.1
---
---
I can execute a test MPI application on two nodes (acme11 and acme12)
without DMTCP:
---
---
# mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
Process 0 of 2 is on acme11.ciemat.es
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 0 of 2
Hello world from process 1 of 2
Goodbye world from process 0 of 2
Goodbye world from process 1 of 2
---
---
As you can see, it works correctly.
If I try to execute the application with DMTCP, however, it does not.
I run the coordinator on acme11, listening on port 7779.
I can execute the application on a single node. For example,
---
---
# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 1 acme12
./helloWorldMPI
[41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1
MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
/usr/local/bin/mpispawn 0
Process 0 of 1 is on acme12.ciemat.es
Hello world from process 0 of 1
Goodbye world from process 0 of 1
COORDINATOR OUTPUT
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpirun_rsh
msg.from = 1d64b124afe30f29-52000-562310a2
client->identity() = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-52000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpirun_rsh_(forked)
msg.from = 1d64b124afe30f29-53000-562310a2
client->identity() = 1d64b124afe30f29-52000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-54000-562310a2
client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-54000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-53000-562310a2
client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-23945-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 1b69d09fb3238b30-55000-562310a2
client->identity() = 1b69d09fb3238b30-23945-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 1b69d09fb3238b30-56000-562310a2
client->identity() = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1b69d09fb3238b30-57000-562310a2
client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = env
msg.from = 1b69d09fb3238b30-56000-562310a2
client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpispawn
msg.from = 1b69d09fb3238b30-56000-562310a2
client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1b69d09fb3238b30-57000-562310a2
client->identity() = 1b69d09fb3238b30-57000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-57000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-52000-562310a2
---
---
So we see that it is working correctly: the processes connect to the
coordinator, the application runs, and the clients disconnect at the end.
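As a quick sanity check on coordinator output like the one above, the
connect and disconnect events can be tallied. This is only a minimal sketch:
the function name count_events and the regular expression are my own, and it
assumes the NOTE line layout shown in the output above (the handler name
follows " in " on the first line of each entry). The sample is a short
excerpt, not the full log.

```python
import re
from collections import Counter

# Matches the first line of a coordinator NOTE entry, for example:
# [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
# (assumed layout, taken from the pasted output)
NOTE_RE = re.compile(r"^\[\d+\] NOTE at \S+ in (\w+);")

def count_events(log_text):
    """Count NOTE entries by handler name (onConnect, onData, onDisconnect)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NOTE_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Short excerpt of the single-node coordinator output.
sample = """\
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-54000-562310a2
"""

counts = count_events(sample)
print(counts["onConnect"], counts["onDisconnect"])
```

Feeding it the whole coordinator output instead of the excerpt gives the
totals for a run, which makes it easier to compare the one-node and two-node
cases.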
However, if I run the application on more than one node, as in the first
example, it fails: the first node in the node list executes the
application, and the rest do not.
---
---
[root@acme11 tests]# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 2
acme11 acme12 ./helloWorldMPI
[59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
/usr/local/bin/mpispawn 0
[60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1
/usr/local/bin/mpispawn 0
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Goodbye world from process 0 of 2
COORDINATOR OUTPUT
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-4070-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpirun_rsh
msg.from = 1d64b124afe30f29-58000-56231173
client->identity() = 1d64b124afe30f29-4070-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpirun_rsh_(forked)
msg.from = 1d64b124afe30f29-59000-56231173
client->identity() = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpirun_rsh_(forked)
msg.from = 1d64b124afe30f29-60000-56231173
client->identity() = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-61000-56231173
client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-62000-56231173
client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-61000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-62000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-59000-56231173
client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-60000-56231173
client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-24001-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-4094-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 1d64b124afe30f29-64000-56231173
client->identity() = 1d64b124afe30f29-4094-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 1b69d09fb3238b30-63000-56231173
client->identity() = 1b69d09fb3238b30-24001-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 1d64b124afe30f29-65000-56231173
client->identity() = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 1b69d09fb3238b30-66000-56231173
client->identity() = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = env
msg.from = 1d64b124afe30f29-65000-56231173
client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpispawn
msg.from = 1d64b124afe30f29-65000-56231173
client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1d64b124afe30f29-68000-56231173
client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1b69d09fb3238b30-67000-56231173
client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = env
msg.from = 1b69d09fb3238b30-66000-56231173
client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpispawn
msg.from = 1b69d09fb3238b30-66000-56231173
client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1d64b124afe30f29-68000-56231173
client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1b69d09fb3238b30-67000-56231173
client->identity() = 1b69d09fb3238b30-67000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-67000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-58000-56231173
---
---
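To see on which hosts the coordinator reports an exec() of helloWorldMPI,
the output above can be scanned with a small script. This is a rough sketch
under two assumptions: that the first field of each identity (for example
1d64b124afe30f29) identifies the host, which is consistent with the
client->hostname() lines above, and that each NOTE entry lists its fields in
the order shown. The function name hosts_running is hypothetical.

```python
def hosts_running(log_text, program):
    """Return hostnames on which `program` was exec'd, according to the log."""
    host_by_prefix = {}    # identity prefix -> hostname, learned from fork() entries
    exec_prefixes = set()  # identity prefixes that reported exec() of `program`
    hostname = None
    progname = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new NOTE entry starts: forget fields of the previous one.
            hostname = progname = None
        elif line.startswith("client->hostname() ="):
            hostname = line.split("=", 1)[1].strip()
        elif line.startswith("progname ="):
            progname = line.split("=", 1)[1].strip()
        elif line.startswith("client->identity() ="):
            # Assumption: the field before the first '-' identifies the host.
            prefix = line.split("=", 1)[1].strip().split("-")[0]
            if hostname:
                host_by_prefix[prefix] = hostname
            if progname == program:
                exec_prefixes.add(prefix)
    # Fall back to the raw prefix if no hostname was seen for it.
    return {host_by_prefix.get(p, p) for p in exec_prefixes}

# Excerpt of the two-node coordinator output.
sample = """\
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1d64b124afe30f29-68000-56231173
client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1b69d09fb3238b30-67000-56231173
client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1d64b124afe30f29-68000-56231173
client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1b69d09fb3238b30-67000-56231173
client->identity() = 1b69d09fb3238b30-67000-56231173
"""

result = hosts_running(sample, "helloWorldMPI")
print(sorted(result))
```

On this excerpt the coordinator reports an exec() of helloWorldMPI under
both identity prefixes. If the host assumption holds, the process on acme12
may have been started even though none of its output appears, so it may be
worth checking whether it exits early or whether only its output is lost.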
--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
Dmtcp-forum mailing list
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum