Hi Manuel,

  What kind of network is used in the cluster? Ethernet or InfiniBand?
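
If you are not sure, one quick way to get a hint is to look at what the kernel exposes on a compute node. The sketch below is only an assumption about how you might check, not something DMTCP itself provides: it lists the entries under /sys/class/infiniband, where Linux registers InfiniBand (and other RDMA) devices. The function name `list_infiniband_devices` is made up for this example, and an empty result does not prove which transport MPICH actually uses.

```python
import os


def list_infiniband_devices(sysfs_root="/sys/class/infiniband"):
    """Return the device names found under sysfs_root, or [] if it is absent.

    sysfs_root is a parameter only so the function can be tried against a
    different directory; on a real node the default path is the one to use.
    """
    if not os.path.isdir(sysfs_root):
        # No such directory: the kernel has registered no InfiniBand/RDMA device.
        return []
    return sorted(os.listdir(sysfs_root))


if __name__ == "__main__":
    devices = list_infiniband_devices()
    if devices:
        print("InfiniBand/RDMA devices found:", ", ".join(devices))
    else:
        print("No InfiniBand/RDMA devices found; probably Ethernet only")
```

Running it on one of the working nodes (for example through srun) rather than on the master gives the more relevant answer, since the MPI ranks run there.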

On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman <[email protected]> wrote:

> Jiajun,
>     Could you respond to this, since you've been extending our support
> for MPI?
>
> Thanks,
> - Gene
>
> On Mon, Nov 17, 2014 at 06:02:34PM +0100, Manuel Rodríguez Pascual wrote:
> > Good morning list,
> >
> >
> > I am a newbie with DMTCP, so this is probably something obvious. Anyway, I
> > am not able to checkpoint MPI applications. Instead, I receive an error.
> > I have searched the internet but still haven't been able to solve it.
> >
> > -MPI and sequential applications work fine without DMTCP
> > -DMTCP works fine when running a sequential application on the master and
> > restoring it.
> > ...but it fails when checkpointing a parallel application.
> >
> > When I execute my code (a simple loop from 1 to 50, to detect the moment
> > of the checkpoint) with
> >
> > (one tab) dmtcp_coordinator
> > (other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
> >
> > and then checkpoint with "c" in the coordinator tab, it does not work.
> > Instead, the application starts printing the same error message
> > repeatedly while it keeps running in the background. When the execution
> > of the MPI code has finished, all the output is returned and the system
> > more or less hangs until I stop it manually.
> >
> > Below you can find all the information that may be relevant: software
> > stack, output from the application and the coordinator, and output when
> > executed in debug mode.
> > Anyway, I suspect that this is probably due to me not knowing how to
> > install, configure or use the application.
> >
> > Thanks for your help,
> >
> > Manuel
> >
> >
> >
> > My software stack is:
> > CentOS 6 Virtual Machine
> > Slurm: slurm 14.03.10
> > MPI: mpich-3.1.3
> > dmtcp_coordinator (DMTCP) 2.3.1
> > -> 1 master node
> > -> 3 working nodes (the master is not a working node)
> >
> >
> > I have tried running dmtcp_coordinator only on the master, and on both
> > the master and the working nodes, with identical results.
> >
> >
> >
> > Output in application:
> > ----
> > ----
> >
> > [slurm@slurm-master ~]$  dmtcp_launch --rm srun -n 2 mpiLoop 50
> > [42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
> > Process 0 of 2 is on slurm-compute1
> > iteration 0 on process 0
> > Process 1 of 2 is on slurm-compute2
> > iteration 0 on process 1
> >
> > (start checkpoint here)
> >
> > [42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > REASON='JWARNING(false) failed'
> >      _dataSockets[i]->socket().sockfd() = 19
> >      buffer.size() = 196
> >      WARN_INTERVAL_SEC = 10
> > Message: Still draining socket... perhaps remote host is not running under DMTCP?
> > ----
> > ----
> >
> >
> >
> > I keep receiving the same message every 10 seconds until the execution is
> > supposed to have finished. Then the execution *doesn't* finish, and I
> > have to stop it manually with CTRL+C.
> >
> >
> > Output in coordinator:
> > ----
> > ----
> > dmtcp_coordinator starting...
> >     Host: slurm-master (192.168.122.11)
> >     Port: 7779
> >     Checkpoint Interval: disabled (checkpoint manually instead)
> >     Exit on last client: 0
> > Type '?' for help.
> >
> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = srun
> >      msg.from = 6db90f3d5a9dd200-40000-546a25f2
> >      client->identity() = 6db90f3d5a9dd200-8271-546a25f2
> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
> > process Information after fork()'
> >      client->hostname() = slurm-master
> >      client->progname() = srun_(forked)
> >      msg.from = 6db90f3d5a9dd200-41000-546a25f2
> >      client->identity() = 6db90f3d5a9dd200-40000-546a25f2
> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
> > disconnected'
> >      client->identity() = 6db90f3d5a9dd200-41000-546a25f2
> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
> > disconnected'
> >      client->identity() = 6db90f3d5a9dd200-40000-546a25f2
> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = srun
> >      msg.from = 6db90f3d5a9dd200-42000-546a2609
> >      client->identity() = 6db90f3d5a9dd200-8323-546a2609
> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
> > process Information after fork()'
> >      client->hostname() = slurm-master
> >      client->progname() = srun_(forked)
> >      msg.from = 6db90f3d5a9dd200-43000-546a2609
> >      client->identity() = 6db90f3d5a9dd200-42000-546a2609
> >
> > (checkpoint)
> >
> > c
> > [8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
> > REASON='starting checkpoint, suspending all nodes'
> >      s.numPeers = 2
> > [8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
> > REASON='Incremented Generation'
> >      compId.generation() = 1
> > [8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
> > REASON='locking all nodes'
> > [8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
> > REASON='draining all nodes'
> >
> > [8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
> > REASON='checkpointing all nodes'
> > [8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
> > REASON='building name service database'
> > [8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
> > REASON='entertaining queries now'
> > [8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
> > REASON='refilling all nodes'
> > [8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
> > REASON='restarting all nodes'
> > ----
> > ----
> >
> >
> > I have executed it in debug mode too, after compiling with
> >  ./configure --enable-debug && make -j5 clean && make -j5
> >
> > The output is immense but not very helpful to me with my limited
> > knowledge. I have uploaded it to pastebin.
> >
> > -coordinator output : http://pastebin.com/4m5REy28
> > -application output : http://pastebin.com/inxmfvCc
> >
> >
> > --
> > Dr. Manuel Rodríguez-Pascual
> > skype: manuel.rodriguez.pascual
> > phone: (+34) 913466173 // (+34) 679925108
> >
> > CIEMAT-Moncloa
> > Edificio 22, desp. 1.25
> > Avenida Complutense, 40
> > 28040- MADRID
> > SPAIN
>
> >
> ------------------------------------------------------------------------------
> > Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> > from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> > with Interactivity, Sharing, Native Excel Exports, App Integration & more
> > Get technology previously reserved for billion-dollar corporations, FREE
> >
> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>
> > _______________________________________________
> > Dmtcp-forum mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>