Hi Jiajun,

Can you take a look at this one?

Kapil

On Tue, Oct 28, 2014 at 8:50 AM, Marina Moran <[email protected]> wrote:
> Hi,
>
> I am now trying to use DMTCP with two nodes, each with 4 cores, running
> Debian jessie amd64, OpenMPI 1.6.5, and DMTCP 2.3.1 (the latest trunk
> from git shows the same problem).
>
> When I launch the program, it hangs, showing this:
>
> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_launch mpirun -np 8 -hostfile hosts lu.B.8
> [45000] WARNING at socketconnection.cpp:187 in TcpConnection; REASON='JWARNING(false) failed'
> type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
> [46000] NOTE at ssh.cpp:348 in prepareForExec; REASON='New ssh command'
> newCommand = /home/hpcpro/dmtcp-trunk/bin/dmtcp_ssh
> /home/hpcpro/dmtcp-trunk/bin/dmtcp_nocheckpoint /usr/bin/ssh -x
> 10.0.2.21 /home/hpcpro/dmtcp-trunk/bin/dmtcp_launch --ssh-slave --host
> m112a --ckptdir /home/hpcpro/NPB3.3/NPB3.3-MPI/bin
> /home/hpcpro/dmtcp-trunk/bin/dmtcp_sshd orted --daemonize -mca ess
> env -mca orte_ess_jobid 758841344 -mca orte_ess_vpid 1 -mca
> orte_ess_num_procs 2 --hnp-uri "758841344.0;tcp://10.0.2.22:59106"
> -mca plm rsh
>
> and the coordinator shows:
>
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> hello_remote.from = 1310c956110-8088-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating process Information after exec()'
> progname = orterun
> msg.from = 1310c956110-51000-544eef91
> client->identity() = 1310c956110-8088-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> hello_remote.from = 1310c956110-51000-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating process Information after fork()'
> client->hostname() = m112a
> client->progname() = orterun_(forked)
> msg.from = 1310c956110-52000-544eef91
> client->identity() = 1310c956110-51000-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
> hello_remote.from = 1310c956110-52000-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating process Information after fork()'
> client->hostname() = m112a
> client->progname() = dmtcp_ssh_(forked)
> msg.from = 1310c956110-53000-544eef91
> client->identity() = 1310c956110-52000-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> client->identity() = 1310c956110-53000-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating process Information after exec()'
> progname = dmtcp_ssh
> msg.from = 1310c956110-52000-544eef91
> client->identity() = 1310c956110-52000-544eef91
> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
> client->identity() = 1310c956110-52000-544eef91
> l
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 32, orterun[51000:8088]@m112a, 1310c956110-51000-544eef91, RUNNING
>
> Any suggestions?
>
> Thanks in advance!
> Marina
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
