Hi,
I am trying to use DMTCP with two nodes, each with 4 cores, running
Debian jessie amd64,
OpenMPI 1.6.5, and DMTCP 2.3.1 (the latest trunk from git shows the same problem).
When I launch the program, it hangs, showing this:
hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_launch mpirun -np 8 -hostfile hosts lu.B.8
[45000] WARNING at socketconnection.cpp:187 in TcpConnection; REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[46000] NOTE at ssh.cpp:348 in prepareForExec; REASON='New ssh command'
newCommand = /home/hpcpro/dmtcp-trunk/bin/dmtcp_ssh
/home/hpcpro/dmtcp-trunk/bin/dmtcp_nocheckpoint /usr/bin/ssh -x
10.0.2.21 /home/hpcpro/dmtcp-trunk/bin/dmtcp_launch --ssh-slave --host
m112a --ckptdir /home/hpcpro/NPB3.3/NPB3.3-MPI/bin
/home/hpcpro/dmtcp-trunk/bin/dmtcp_sshd orted --daemonize -mca ess
env -mca orte_ess_jobid 758841344 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 2 --hnp-uri "758841344.0;tcp://10.0.2.22:59106"
-mca plm rsh
and the coordinator shows:
[7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 1310c956110-8088-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = orterun
msg.from = 1310c956110-51000-544eef91
client->identity() = 1310c956110-8088-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 1310c956110-51000-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = m112a
client->progname() = orterun_(forked)
msg.from = 1310c956110-52000-544eef91
client->identity() = 1310c956110-51000-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 1310c956110-52000-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = m112a
client->progname() = dmtcp_ssh_(forked)
msg.from = 1310c956110-53000-544eef91
client->identity() = 1310c956110-52000-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
REASON='client disconnected'
client->identity() = 1310c956110-53000-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 1310c956110-52000-544eef91
client->identity() = 1310c956110-52000-544eef91
[7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
REASON='client disconnected'
client->identity() = 1310c956110-52000-544eef91
l
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
32, orterun[51000:8088]@m112a, 1310c956110-51000-544eef91, RUNNING
Any suggestions?
Thanks in advance!
Marina