Hi Husen,

A client can disconnect for several reasons. Could you give us access to
your cluster? That would be the fastest way to diagnose the problem. In the
meantime, to help us form an initial guess, could you please provide the
following information:

1. MPI version;
2. What resource management software is used;
3. What interconnect is used in the cluster.
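
If it helps, here is a rough sketch for collecting that information. The
tool names it probes (mpiexec, sbatch, qsub, bsub, ibstat, ibv_devinfo) are
assumptions about a typical cluster, not something taken from your setup,
and the helper functions are only illustrative. It reports which tools are
on the PATH and changes nothing.

```shell
#!/bin/sh
# Report which of a few common cluster tools are on PATH.
# The tool names are assumptions about a typical setup; edit as needed.

# have NAME: succeed if NAME resolves to a command.
have() {
    command -v "$1" >/dev/null 2>&1
}

# report LABEL NAME...: print the first NAME that is found, or "not found".
report() {
    label=$1
    shift
    for tool in "$@"; do
        if have "$tool"; then
            echo "$label: found $tool"
            return 0
        fi
    done
    echo "$label: not found"
    return 1
}

report "MPI launcher" mpiexec mpirun || true
report "Resource manager" sbatch qsub bsub || true
report "InfiniBand tools" ibstat ibv_devinfo || true
```

Running whichever tool is reported as found, together with its version
option, should then give the details for items 1 to 3.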

In principle, when a resource manager is used, we recommend submitting jobs
through job scripts. You can find some example job scripts
in plugin/batch-queue/job_examples. However, running an application
interactively is also supported. In your case, if the configuration is not
the problem, this may be a bug in DMTCP, and we'll help you fix it.

Also, if InfiniBand is used as the interconnect, you'll need to enable
DMTCP's InfiniBand plugin by adding the --ib option to dmtcp_launch.
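
As a sketch, assuming the same coordinator host, port, and mpirun arguments
as in the command you quoted, the launch line would look roughly like the
one built here. The variable names are only illustrative, and the script
prints the command instead of running it, so you can check it first.

```shell
#!/bin/sh
# Values copied from the quoted command; adjust them for your cluster.
COORD_HOST=head-node
COORD_PORT=7779

# --ib is the only addition; the rest is the original mpirun line.
LAUNCH_CMD="dmtcp_launch --coord-host $COORD_HOST --coord-port $COORD_PORT --ib mpirun -np 24 -hostfile machines ./mm.o"

# Dry run: print the command rather than executing it.
echo "$LAUNCH_CMD"
```

The --ib option is only needed if the cluster actually uses InfiniBand.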


Best,
Jiajun

On Sun, Apr 24, 2016 at 12:34 AM, Husen R <[email protected]> wrote:

> Dear all,
>
> I ran dmtcp_coordinator on the head node and then tried to run
> dmtcp_launch on another node (compute-node) using the following command:
>
> dmtcp_launch --coord-host head-node --coord-port 7779 mpirun -np 24 -hostfile machines ./mm.o
>
> However, the MPI application is not executed. When I look at the
> dmtcp_coordinator output log, the last two REASONs say "client
> disconnected". Why is the client disconnected? Any idea how to fix this?
> Thank you in advance.
>
> This is the output of dmtcp_coordinator:
>
>
> [25572] NOTE at dmtcp_coordinator.cpp:1664 in updateCheckpointInterval;
> REASON='CheckpointInterval updated (for this computation only)'
>      oldInterval = 0
>      theCheckpointInterval = 0
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-12706-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mpiexec.hydra
>      msg.from = 3537527e5a992df8-40000-571c48a6
>      client->identity() = 3537527e5a992df8-12706-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-40000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = mpiexec.hydra_(forked)
>      msg.from = 3537527e5a992df8-41000-571c48a6
>      client->identity() = 3537527e5a992df8-40000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-40000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-40000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = mpiexec.hydra_(forked)
>      msg.from = 3537527e5a992df8-42000-571c48a6
>      client->identity() = 3537527e5a992df8-40000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = mpiexec.hydra_(forked)
>      msg.from = 3537527e5a992df8-43000-571c48a6
>      client->identity() = 3537527e5a992df8-40000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-41000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = dmtcp_ssh_(forked)
>      msg.from = 3537527e5a992df8-44000-571c48a6
>      client->identity() = 3537527e5a992df8-41000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-43000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 3537527e5a992df8-44000-571c48a6
>      client->progname() = dmtcp_ssh_(forked)
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = dmtcp_ssh_(forked)
>      msg.from = 3537527e5a992df8-45000-571c48a6
>      client->identity() = 3537527e5a992df8-43000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 3537527e5a992df8-45000-571c48a6
>      client->progname() = dmtcp_ssh_(forked)
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-46000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-47000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-48000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-49000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-50000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-51000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-52000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> connected'
>      hello_remote.from = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
> process Information after fork()'
>      client->hostname() = compute-node
>      client->progname() = hydra_pmi_proxy_(forked)
>      msg.from = 3537527e5a992df8-53000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = dmtcp_ssh
>      msg.from = 3537527e5a992df8-41000-571c48a6
>      client->identity() = 3537527e5a992df8-41000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = hydra_pmi_proxy
>      msg.from = 3537527e5a992df8-42000-571c48a6
>      client->identity() = 3537527e5a992df8-42000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = dmtcp_ssh
>      msg.from = 3537527e5a992df8-43000-571c48a6
>      client->identity() = 3537527e5a992df8-43000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-46000-571c48a6
>      client->identity() = 3537527e5a992df8-46000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-47000-571c48a6
>      client->identity() = 3537527e5a992df8-47000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-48000-571c48a6
>      client->identity() = 3537527e5a992df8-48000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-49000-571c48a6
>      client->identity() = 3537527e5a992df8-49000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-50000-571c48a6
>      client->identity() = 3537527e5a992df8-50000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-51000-571c48a6
>      client->identity() = 3537527e5a992df8-51000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-52000-571c48a6
>      client->identity() = 3537527e5a992df8-52000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
> process Information after exec()'
>      progname = mm.o
>      msg.from = 3537527e5a992df8-53000-571c48a6
>      client->identity() = 3537527e5a992df8-53000-571c48a6
> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 3537527e5a992df8-41000-571c48a6
>      client->progname() = dmtcp_ssh
> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
> disconnected'
>      client->identity() = 3537527e5a992df8-43000-571c48a6
>      client->progname() = dmtcp_ssh
>
>
> Regards,
>
>
> Husen
>
>
>
> ------------------------------------------------------------------------------
> Find and fix application performance issues faster with Applications
> Manager
> Applications Manager provides deep performance insights into multiple
> tiers of
> your business applications. It resolves application problems quickly and
> reduces your MTTR. Get your free trial!
> https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>