Hi Marina,

  Is it possible to access the nodes from outside? If so, could I get access
to the machines? It would be much easier to diagnose the bug directly on
them.

Best,
Jiajun

On Tue, Oct 28, 2014 at 3:52 PM, Marina Moran <[email protected]>
wrote:

> Hi Jiajun!
>
> It is Ethernet (100 Mbps), and I am *not* using Torque or any other batch
> queue system.
>
> Regards
> Marina
>
> On 10/28/14, Jiajun Cao <[email protected]> wrote:
> > Hi Marina,
> >
> >   What network are you using? Is it Ethernet or InfiniBand?
> >
> > On Tue, Oct 28, 2014 at 12:26 PM, Kapil Arya <[email protected]>
> > wrote:
> >
> >> Hi Jiajun,
> >>
> >> Can you take a look at this one?
> >>
> >> Kapil
> >>
> >> On Tue, Oct 28, 2014 at 8:50 AM, Marina Moran <
> >> [email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am now trying to use DMTCP with two nodes, each with 4 cores, running
> >>> Debian jessie amd64, OpenMPI 1.6.5, and DMTCP 2.3.1 (the latest trunk
> >>> from git has the same problem).
> >>>
> >>> When I launch the program, it hangs after printing this:
> >>>
> >>> hpcpro@m112a:~/NPB3.3/NPB3.3-MPI/bin$ ~/dmtcp-trunk/bin/dmtcp_launch
> >>> mpirun -np 8 -hostfile hosts lu.B.8
> >>> [45000] WARNING at socketconnection.cpp:187 in TcpConnection;
> >>> REASON='JWARNING(false) failed'
> >>>      type = 2
> >>> Message: Datagram Sockets not supported. Hopefully, this is a short
> >>> lived connection!
> >>> [46000] NOTE at ssh.cpp:348 in prepareForExec; REASON='New ssh command'
> >>>      newCommand = /home/hpcpro/dmtcp-trunk/bin/dmtcp_ssh
> >>> /home/hpcpro/dmtcp-trunk/bin/dmtcp_nocheckpoint /usr/bin/ssh -x
> >>> 10.0.2.21 /home/hpcpro/dmtcp-trunk/bin/dmtcp_launch --ssh-slave --host
> >>> m112a --ckptdir /home/hpcpro/NPB3.3/NPB3.3-MPI/bin
> >>> /home/hpcpro/dmtcp-trunk/bin/dmtcp_sshd  orted --daemonize -mca ess
> >>> env -mca orte_ess_jobid 758841344 -mca orte_ess_vpid 1 -mca
> >>> orte_ess_num_procs 2 --hnp-uri "758841344.0;tcp://10.0.2.22:59106"
> >>> -mca plm rsh
> >>>
> >>>
> >>>
> >>> and the coordinator shows:
> >>>
> >>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >>> connected'
> >>>      hello_remote.from = 1310c956110-8088-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
> >>> process Information after exec()'
> >>>      progname = orterun
> >>>      msg.from = 1310c956110-51000-544eef91
> >>>      client->identity() = 1310c956110-8088-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >>> connected'
> >>>      hello_remote.from = 1310c956110-51000-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
> >>> process Information after fork()'
> >>>      client->hostname() = m112a
> >>>      client->progname() = orterun_(forked)
> >>>      msg.from = 1310c956110-52000-544eef91
> >>>      client->identity() = 1310c956110-51000-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> >>> connected'
> >>>      hello_remote.from = 1310c956110-52000-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
> >>> process Information after fork()'
> >>>      client->hostname() = m112a
> >>>      client->progname() = dmtcp_ssh_(forked)
> >>>      msg.from = 1310c956110-53000-544eef91
> >>>      client->identity() = 1310c956110-52000-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >>> REASON='client disconnected'
> >>>      client->identity() = 1310c956110-53000-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
> >>> process Information after exec()'
> >>>      progname = dmtcp_ssh
> >>>      msg.from = 1310c956110-52000-544eef91
> >>>      client->identity() = 1310c956110-52000-544eef91
> >>> [7832] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
> >>> REASON='client disconnected'
> >>>      client->identity() = 1310c956110-52000-544eef91
> >>> l
> >>> Client List:
> >>> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >>> 32, orterun[51000:8088]@m112a, 1310c956110-51000-544eef91, RUNNING
> >>>
> >>>
> >>> Any suggestions?
> >>>
> >>> Thanks in advance!
> >>> Marina
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> _______________________________________________
> >>> Dmtcp-forum mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >>>
> >>
> >>
> >
>