Hi Jiajun, all,
I have been performing some more tests.
When running the code with "mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI",
there are problems if I want to use DMTCP (this is the scenario of the
previous mail). Without the --ib flag the problem still persists. I think
the error is the same; the output is below anyway.
In case it helps with the debugging, some info that might be relevant:
- The application can be executed without DMTCP.
- With DMTCP, only the first node on the list executes the code. But there
seem to be exceptions to this:
-- if I execute mpirun_rsh with "-n 2 acme11 acme11" (or any other
node, but the same one twice): it crashes. This does not happen without
DMTCP; in that case it works correctly.
-- if I use "-n 3 acme11 acme12 acme11" (three nodes, repeating one):
it also crashes. It seems that if you list the same node more than once, it
does not work.
-- the first node in the list is the only one that runs. For example,
if I use "-n 2 acme11 acme12", then acme11 executes the code. If I use
"-n 2 acme12 acme11", then acme12 does. With three nodes it is the same:
only the first one on the list runs.
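Since a repeated hostname in the list reliably triggers the crash, a quick
sanity check on the host list before launching can help keep the two failure
modes apart while testing. This is only a hypothetical helper I use locally;
the function name and the host lists are my own, not part of DMTCP or MVAPICH:

```python
#!/usr/bin/env python3
"""Hypothetical pre-launch check: flag host lists that repeat a node,
since e.g. "-n 2 acme11 acme11" crashes under DMTCP in our tests."""

def has_duplicate_hosts(hosts):
    """Return True if any hostname appears more than once in the list."""
    return len(hosts) != len(set(hosts))

if __name__ == "__main__":
    for hostlist in (["acme11", "acme12"],               # distinct nodes
                     ["acme11", "acme11"],               # same node twice
                     ["acme11", "acme12", "acme11"]):    # three, repeating one
        print(hostlist, "-> duplicates:", has_duplicate_hosts(hostlist))
```

Running this over the host lists from the tests above flags exactly the two
cases that crash under DMTCP.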
However, I have seen that if I execute the application with another MPI
library, MPICH, as in
"mpiexec -n 2 acme11 acme12 ./helloWorldMPI"
then everything works as expected. I can use DMTCP with
"dmtcp_launch -h acme11 -p 7779 mpiexec -n 2 acme11 acme12
./helloWorldMPI"
and it succeeds. In this case, it works both with "--ib" and without it.
Just in case it helps, that output is below too.
Thanks for your help,
Manuel
----
----
-bash-4.2$ dmtcp_launch -h acme11 -p 7779 mpirun_rsh -n 2 acme11 acme12
./helloWorldMPI
[126000] NOTE at dmtcpworker.cpp:349 in DmtcpWorker; REASON='
*** InfiniBand library detected. Please use dmtcp_launch --ib ***
'
[127000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=59589 MPISPAWN_MPIRUN_PORT=59589 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=126000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_311_acme11.ciemat.es_126000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
/usr/local/bin/mpispawn 0
[128000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=59589 MPISPAWN_MPIRUN_PORT=59589 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=126000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_311_acme11.ciemat.es_126000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1
/usr/local/bin/mpispawn 0
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Goodbye world from process 0 of 2
COORDINATOR OUTPUT
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-28766-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpirun_rsh
msg.from = 1d64b124afe30f29-126000-56250e08
client->identity() = 1d64b124afe30f29-28766-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-126000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-126000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpirun_rsh_(forked)
msg.from = 1d64b124afe30f29-127000-56250e08
client->identity() = 1d64b124afe30f29-126000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpirun_rsh_(forked)
msg.from = 1d64b124afe30f29-128000-56250e08
client->identity() = 1d64b124afe30f29-126000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-127000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-128000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-129000-56250e08
client->identity() = 1d64b124afe30f29-127000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-130000-56250e08
client->identity() = 1d64b124afe30f29-128000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-129000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-130000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-127000-56250e08
client->identity() = 1d64b124afe30f29-127000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-128000-56250e08
client->identity() = 1d64b124afe30f29-128000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-28786-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-12757-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 1d64b124afe30f29-131000-56250e08
client->identity() = 1d64b124afe30f29-28786-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 1b69d09fb3238b30-132000-56250e08
client->identity() = 1b69d09fb3238b30-12757-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-131000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-132000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 1d64b124afe30f29-133000-56250e08
client->identity() = 1d64b124afe30f29-131000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 1b69d09fb3238b30-134000-56250e08
client->identity() = 1b69d09fb3238b30-132000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-133000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-134000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1d64b124afe30f29-135000-56250e08
client->identity() = 1d64b124afe30f29-133000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = mpispawn_(forked)
msg.from = 1b69d09fb3238b30-136000-56250e08
client->identity() = 1b69d09fb3238b30-134000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = env
msg.from = 1d64b124afe30f29-133000-56250e08
client->identity() = 1d64b124afe30f29-133000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpispawn
msg.from = 1d64b124afe30f29-133000-56250e08
client->identity() = 1d64b124afe30f29-133000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = env
msg.from = 1b69d09fb3238b30-134000-56250e08
client->identity() = 1b69d09fb3238b30-134000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpispawn
msg.from = 1b69d09fb3238b30-134000-56250e08
client->identity() = 1b69d09fb3238b30-134000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1d64b124afe30f29-135000-56250e08
client->identity() = 1d64b124afe30f29-135000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1b69d09fb3238b30-136000-56250e08
client->identity() = 1b69d09fb3238b30-136000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-135000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-136000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-133000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-131000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-134000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-132000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-127000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-128000-56250e08
[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-126000-56250e08
-----
-----
WITH MPIEXEC
[root@acme11 tests]# dmtcp_launch --ib mpiexec -f machinefile -n 3
/home/slurm/tests/helloWorldMPI
[42000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -x acme12
/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.31.157
--coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd
"/home/localsoft/mpich3/bin//hydra_pmi_proxy" --control-port acme11:44279
--rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2
--proxy-id 1
[43000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -x acme13
/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.31.157
--coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd
"/home/localsoft/mpich3/bin//hydra_pmi_proxy" --control-port acme11:44279
--rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2
--proxy-id 2
Process 0 of 3 is on acme11.ciemat.es
Hello world from process 0 of 3
this is iteration 0 on process 0 of host acme11.ciemat.es
Process 2 of 3 is on acme13.ciemat.es
Hello world from process 2 of 3
Process 1 of 3 is on acme12.ciemat.es
Hello world from process 1 of 3
Goodbye world from process 1 of 3
Goodbye world from process 2 of 3
Goodbye world from process 0 of 3
COORDINATOR
[root@acme11 ~]# dmtcp_coordinator
dmtcp_coordinator starting...
Host: acme11.ciemat.es (172.17.31.157)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 0
Type '?' for help.
[21211] NOTE at dmtcp_coordinator.cpp:1661 in updateCheckpointInterval;
REASON='CheckpointInterval updated (for this computation only)'
oldInterval = 0
theCheckpointInterval = 0
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-21212-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpiexec.hydra
msg.from = 1d64b124afe30f29-40000-56252da6
client->identity() = 1d64b124afe30f29-21212-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-40000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpiexec.hydra_(forked)
msg.from = 1d64b124afe30f29-41000-56252da6
client->identity() = 1d64b124afe30f29-40000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-40000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpiexec.hydra_(forked)
msg.from = 1d64b124afe30f29-42000-56252da6
client->identity() = 1d64b124afe30f29-40000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-40000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = mpiexec.hydra_(forked)
msg.from = 1d64b124afe30f29-43000-56252da6
client->identity() = 1d64b124afe30f29-40000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-41000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 1d64b124afe30f29-44000-56252da6
client->identity() = 1d64b124afe30f29-41000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-42000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-45000-56252da6
client->identity() = 1d64b124afe30f29-42000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1d64b124afe30f29-43000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme11.ciemat.es
client->progname() = dmtcp_ssh_(forked)
msg.from = 1d64b124afe30f29-46000-56252da6
client->identity() = 1d64b124afe30f29-43000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-45000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-46000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = hydra_pmi_proxy
msg.from = 1d64b124afe30f29-41000-56252da6
client->identity() = 1d64b124afe30f29-41000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-42000-56252da6
client->identity() = 1d64b124afe30f29-42000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1d64b124afe30f29-43000-56252da6
client->identity() = 1d64b124afe30f29-43000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1d64b124afe30f29-44000-56252da6
client->identity() = 1d64b124afe30f29-44000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-14428-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 54385264162a2589-10066-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 1b69d09fb3238b30-47000-56252da7
client->identity() = 1b69d09fb3238b30-14428-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 54385264162a2589-48000-56252da7
client->identity() = 54385264162a2589-10066-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-47000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 54385264162a2589-48000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme13.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 54385264162a2589-50000-56252da7
client->identity() = 54385264162a2589-48000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = dmtcp_sshd_(forked)
msg.from = 1b69d09fb3238b30-49000-56252da7
client->identity() = 1b69d09fb3238b30-47000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 54385264162a2589-50000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 1b69d09fb3238b30-49000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme13.ciemat.es
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 54385264162a2589-51000-56252da7
client->identity() = 54385264162a2589-50000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = acme12.ciemat.es
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 1b69d09fb3238b30-52000-56252da7
client->identity() = 1b69d09fb3238b30-49000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = hydra_pmi_proxy
msg.from = 1b69d09fb3238b30-49000-56252da7
client->identity() = 1b69d09fb3238b30-49000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = hydra_pmi_proxy
msg.from = 54385264162a2589-50000-56252da7
client->identity() = 54385264162a2589-50000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 54385264162a2589-51000-56252da7
client->identity() = 54385264162a2589-51000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 1b69d09fb3238b30-52000-56252da7
client->identity() = 1b69d09fb3238b30-52000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-52000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 54385264162a2589-51000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-44000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-49000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 54385264162a2589-50000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-41000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1b69d09fb3238b30-47000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 54385264162a2589-48000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-42000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-43000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 1d64b124afe30f29-40000-56252da6
2015-10-19 8:26 GMT-07:00 Jiajun Cao <[email protected]>:
> Hi Manuel,
>
> The infiniband plugin shouldn't affect application launching. Could you
> try removing the "--ib" flag and see if the application still crashes? This
> can help diagnose whether the issue is in the ib plugin or other dmtcp
> modules.
>
> Best,
> Jiajun
>
>
> Best,
> Jiajun
>
> On Sun, Oct 18, 2015 at 10:57 PM, Kapil Arya <[email protected]> wrote:
>
>> Hey Jiajun,
>>
>> Can you take a look at this problem as it is closer to your area of
>> expertise :-).
>>
>> Best,
>> Kapil
>>
>> On Sat, Oct 17, 2015 at 11:31 PM, Manuel Rodríguez Pascual <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to checkpoint an MVAPICH application. It does not behave as
>>> expected, so maybe you can give me some support.
>>>
>>> I have compiled DMTCP with "--enable-infiniband-support " as only flag.
>>> I have MVAPICH installed.
>>>
>>> I can execute a test MPI application in two nodes, without DMTCP. I also
>>> can execute the application in a single node with DMTCP. However, if I
>>> execute it in two nodes with DMTCP, only the first one will run.
>>>
>>> Below there is a series of test commands with a lot of output, together
>>> with the versions of everything.
>>>
>>> Any ideas?
>>>
>>> thanks for your help,
>>>
>>>
>>> Manuel
>>>
>>>
>>> ---
>>> ---
>>>
>>> # mpichversion
>>>
>>> MVAPICH2 Version: 2.2a
>>>
>>> MVAPICH2 Release date: Mon Aug 17 20:00:00 EDT 2015
>>>
>>> MVAPICH2 Device: ch3:mrail
>>>
>>> MVAPICH2 configure: --disable-mcast
>>>
>>> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
>>>
>>> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
>>>
>>> MVAPICH2 F77: gfortran -L/lib -L/lib -O2
>>>
>>> MVAPICH2 FC: gfortran -O2
>>>
>>> # dmtcp_coordinator --version
>>>
>>> dmtcp_coordinator (DMTCP) 2.4.1
>>>
>>> ---
>>>
>>> ---
>>>
>>>
>>> I can execute a test MPI application in two nodes (acme11 and 12), with
>>>
>>> ---
>>> ---
>>> # mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
>>>
>>> Process 0 of 2 is on acme11.ciemat.es
>>>
>>> Process 1 of 2 is on acme12.ciemat.es
>>>
>>> Hello world from process 0 of 2
>>>
>>> Hello world from process 1 of 2
>>>
>>> Goodbye world from process 0 of 2
>>>
>>> Goodbye world from process 1 of 2
>>> ---
>>> ---
>>>
>>> As you can see, it works correctly.
>>>
>>>
>>> If I try to execute the application with DMTCP, however, it does not.
>>>
>>> I run the coordinator on acme11, with port 7779.
>>>
>>>
>>> I can execute the application on a single node. For example,
>>>
>>> ---
>>> ---
>>>
>>> # dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 1 acme12
>>> ./helloWorldMPI
>>>
>>> [41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>> newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1
>>> MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
>>> /usr/local/bin/mpispawn 0
>>>
>>> Process 0 of 1 is on acme12.ciemat.es
>>>
>>> Hello world from process 0 of 1
>>>
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> COORDINATOR OUTPUT
>>>
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-4029-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpirun_rsh
>>>
>>> msg.from = 1d64b124afe30f29-52000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-4029-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-52000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = mpirun_rsh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-52000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = dmtcp_ssh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-54000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-54000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_ssh
>>>
>>> msg.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-23945-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_sshd
>>>
>>> msg.from = 1b69d09fb3238b30-55000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-23945-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme12.ciemat.es
>>>
>>> client->progname() = dmtcp_sshd_(forked)
>>>
>>> msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme12.ciemat.es
>>>
>>> client->progname() = mpispawn_(forked)
>>>
>>> msg.from = 1b69d09fb3238b30-57000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = env
>>>
>>> msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpispawn
>>>
>>> msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = helloWorldMPI
>>>
>>> msg.from = 1b69d09fb3238b30-57000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-57000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-57000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-52000-562310a2
>>>
>>>
>>> ---
>>> ---
>>>
>>> So we see that it is working correctly, connecting and so on.
>>>
>>> However, if I run the application on more than one node, as in the first
>>> example, it crashes. What happens is that the first node on the node list
>>> executes the application, and the rest do not.
>>>
>>> ----
>>> ----
>>>
>>> [root@acme11 tests]# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh
>>> -n 2 acme11 acme12 ./helloWorldMPI
>>>
>>> [59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>> newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
>>> MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
>>> /usr/local/bin/mpispawn 0
>>>
>>> [60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>> newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
>>> MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1
>>> /usr/local/bin/mpispawn 0
>>>
>>> Process 0 of 2 is on acme11.ciemat.es
>>>
>>> Hello world from process 0 of 2
>>>
>>> Goodbye world from process 0 of 2
>>>
>>> COORDINATOR OUTPUT
>>>
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-4070-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpirun_rsh
>>>
>>> msg.from = 1d64b124afe30f29-58000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-4070-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = mpirun_rsh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-59000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = mpirun_rsh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-60000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = dmtcp_ssh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-61000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = dmtcp_ssh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-62000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-61000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-62000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_ssh
>>>
>>> msg.from = 1d64b124afe30f29-59000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_ssh
>>>
>>> msg.from = 1d64b124afe30f29-60000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-24001-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-4094-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_sshd
>>>
>>> msg.from = 1d64b124afe30f29-64000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-4094-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_sshd
>>>
>>> msg.from = 1b69d09fb3238b30-63000-56231173
>>>
>>> client->identity() = 1b69d09fb3238b30-24001-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-64000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-63000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = dmtcp_sshd_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-65000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-64000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme12.ciemat.es
>>>
>>> client->progname() = dmtcp_sshd_(forked)
>>>
>>> msg.from = 1b69d09fb3238b30-66000-56231173
>>>
>>> client->identity() = 1b69d09fb3238b30-63000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = env
>>>
>>> msg.from = 1d64b124afe30f29-65000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpispawn
>>>
>>> msg.from = 1d64b124afe30f29-65000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = mpispawn_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-68000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme12.ciemat.es
>>>
>>> client->progname() = mpispawn_(forked)
>>>
>>> msg.from = 1b69d09fb3238b30-67000-56231173
>>>
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = env
>>>
>>> msg.from = 1b69d09fb3238b30-66000-56231173
>>>
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpispawn
>>>
>>> msg.from = 1b69d09fb3238b30-66000-56231173
>>>
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = helloWorldMPI
>>>
>>> msg.from = 1d64b124afe30f29-68000-56231173
>>>
>>> client->identity() = 1d64b124afe30f29-68000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = helloWorldMPI
>>>
>>> msg.from = 1b69d09fb3238b30-67000-56231173
>>>
>>> client->identity() = 1b69d09fb3238b30-67000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-68000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-67000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-64000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-63000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>>
>>> ----
>>>
>>> ----
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Dr. Manuel Rodríguez-Pascual
>>> skype: manuel.rodriguez.pascual
>>> phone: (+34) 913466173 // (+34) 679925108
>>>
>>> CIEMAT-Moncloa
>>> Edificio 22, desp. 1.25
>>> Avenida Complutense, 40
>>> 28040- MADRID
>>> SPAIN
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Dmtcp-forum mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>
>>>