Jiajun,
Could you respond to this, since you've been extending our support
for MPI?
Thanks,
- Gene
On Mon, Nov 17, 2014 at 06:02:34PM +0100, Manuel Rodríguez Pascual wrote:
> Good morning list,
>
>
> I am a newbie with DMTCP, so probably this is something obvious. Anyway, I
> am not able to checkpoint MPI applications. Instead, I receive an error.
> I have searched the internet but still haven't been able to solve it.
>
> -MPI and sequential applications work fine without DMTCP
> -DMTCP works fine when running a sequential application on the master and
> restoring it.
> ...but it breaks when checkpointing a parallel app.
>
> When I execute my code (a simple loop from 1 to 50, to detect the moment of
> the checkpoint) with
>
> (one tab) dmtcp_coordinator
> (other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
>
> and then checkpoint with "c" in the coordinator tab, it does not work.
> Instead, the application starts printing the same error message
> repeatedly while it runs in the background. When the MPI code has
> finished executing, all the output is returned and the system
> kind of halts until I stop it manually.
>
> Below you can find all the information that may be relevant: software stack,
> output from app and coordinator, and output when executed in debug mode.
> Anyway, I suspect that this is probably due to me not knowing how to
> install, configure or use the application.
>
> Thanks for your help,
>
> Manuel
>
>
>
> My software stack is:
> CentOS 6 Virtual Machine
> Slurm: slurm 14.03.10
> MPI: mpich-3.1.3
> dmtcp_coordinator (DMTCP) 2.3.1
> -> 1 master node
> -> 3 worker nodes; the master is not a worker node
>
>
> I have tried running dmtcp_coordinator only on the master, and on both the
> master and the worker nodes, with identical results.
>
>
>
> Output in application:
> ----
> ----
>
> [slurm@slurm-master ~]$ dmtcp_launch --rm srun -n 2 mpiLoop 50
> [42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
> Process 0 of 2 is on slurm-compute1
> iteration 0 on process 0
> Process 1 of 2 is on slurm-compute2
> iteration 0 on process 1
>
> (start checkpoint here)
>
> [42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
> _dataSockets[i]->socket().sockfd() = 19
> buffer.size() = 196
> WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under
> DMTCP?
> ----
> ----
>
>
>
> I keep receiving the same error every 10 seconds until the execution is
> supposed to have finished. Then, the execution *doesn't* finish, and I have
> to stop it manually with CTRL+C
>
>
> Output in coordinator:
> ----
> ----
> dmtcp_coordinator starting...
> Host: slurm-master (192.168.122.11)
> Port: 7779
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 0
> Type '?' for help.
>
> [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
> [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
> process Information after exec()'
> progname = srun
> msg.from = 6db90f3d5a9dd200-40000-546a25f2
> client->identity() = 6db90f3d5a9dd200-8271-546a25f2
> [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
> [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
> process Information after fork()'
> client->hostname() = slurm-master
> client->progname() = srun_(forked)
> msg.from = 6db90f3d5a9dd200-41000-546a25f2
> client->identity() = 6db90f3d5a9dd200-40000-546a25f2
> [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
> disconnected'
> client->identity() = 6db90f3d5a9dd200-41000-546a25f2
> [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
> disconnected'
> client->identity() = 6db90f3d5a9dd200-40000-546a25f2
> [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
> [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
> process Information after exec()'
> progname = srun
> msg.from = 6db90f3d5a9dd200-42000-546a2609
> client->identity() = 6db90f3d5a9dd200-8323-546a2609
> [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
> connected'
> hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
> [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
> process Information after fork()'
> client->hostname() = slurm-master
> client->progname() = srun_(forked)
> msg.from = 6db90f3d5a9dd200-43000-546a2609
> client->identity() = 6db90f3d5a9dd200-42000-546a2609
>
> (checkpoint)
>
> c
> [8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
> s.numPeers = 2
> [8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
> REASON='Incremented Generation'
> compId.generation() = 1
> [8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
> REASON='locking all nodes'
> [8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
> REASON='draining all nodes'
>
> [8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
> REASON='checkpointing all nodes'
> [8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
> REASON='building name service database'
> [8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
> REASON='entertaining queries now'
> [8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
> REASON='refilling all nodes'
> [8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
> REASON='restarting all nodes'
> ----
> ----
>
>
> I have also run it in debug mode, after compiling with:
> ./configure --enable-debug && make -j5 clean && make -j5
>
> The output is immense but not very helpful to me, given my limited
> knowledge. I have uploaded it to pastebin.
>
> -coordinator output : http://pastebin.com/4m5REy28
> -application output : http://pastebin.com/inxmfvCc
>
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN
> ------------------------------------------------------------------------------
> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> with Interactivity, Sharing, Native Excel Exports, App Integration & more
> Get technology previously reserved for billion-dollar corporations, FREE
> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum