Good morning list,
I am a newbie with DMTCP, so this is probably something obvious. Anyway, I
am not able to checkpoint MPI applications. Instead, I receive an error.
I have searched the internet but still haven't been able to solve it.
- MPI and sequential applications work fine without DMTCP.
- DMTCP works fine when running a sequential application on the master and
restoring it.
...but it breaks when checkpointing a parallel application.
When I execute my code (a simple loop from 1 to 50, to detect the moment of
the checkpoint) with
(one tab) dmtcp_coordinator
(other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
and then checkpoint with "c" in the coordinator tab, it does not work.
Instead, the application starts printing the same error message repeatedly
while it keeps running in the background. When the execution of the MPI
code has finished, all the output is returned and the system kind of hangs
until I manually stop it.
Below you can find all the information that may be relevant: software stack,
output from app and coordinator, and output when executed in debug mode.
Anyway, I suspect that this is probably due to me not knowing how to
install, configure or use the application.
Thanks for your help,
Manuel
My software stack is:
CentOS 6 Virtual Machine
Slurm: slurm 14.03.10
MPI: mpich-3.1.3
dmtcp_coordinator (DMTCP) 2.3.1
-> 1 master node
-> 3 worker nodes. The master is not a worker node.
I have tried running dmtcp_coordinator only on the master, and on both the
master and the worker nodes, with identical results.
Output in application:
----
----
[slurm@slurm-master ~]$ dmtcp_launch --rm srun -n 2 mpiLoop 50
[42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
Process 0 of 2 is on slurm-compute1
iteration 0 on process 0
Process 1 of 2 is on slurm-compute2
iteration 0 on process 1
(start checkpoint here)
[42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
REASON='JWARNING(false) failed'
_dataSockets[i]->socket().sockfd() = 19
buffer.size() = 196
WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under
DMTCP?
----
----
I keep receiving the same error every 10 seconds until the execution is
supposed to have finished. Then, the execution *doesn't* finish, and I have
to stop it manually with CTRL+C
Output in coordinator:
----
----
dmtcp_coordinator starting...
Host: slurm-master (192.168.122.11)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 0
Type '?' for help.
[8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
[8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = srun
msg.from = 6db90f3d5a9dd200-40000-546a25f2
client->identity() = 6db90f3d5a9dd200-8271-546a25f2
[8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
[8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = srun_(forked)
msg.from = 6db90f3d5a9dd200-41000-546a25f2
client->identity() = 6db90f3d5a9dd200-40000-546a25f2
[8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-41000-546a25f2
[8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-40000-546a25f2
[8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
[8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = srun
msg.from = 6db90f3d5a9dd200-42000-546a2609
client->identity() = 6db90f3d5a9dd200-8323-546a2609
[8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
[8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = srun_(forked)
msg.from = 6db90f3d5a9dd200-43000-546a2609
client->identity() = 6db90f3d5a9dd200-42000-546a2609
(checkpoint)
c
[8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
REASON='starting checkpoint, suspending all nodes'
s.numPeers = 2
[8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
REASON='Incremented Generation'
compId.generation() = 1
[8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
REASON='locking all nodes'
[8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
REASON='draining all nodes'
[8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
REASON='checkpointing all nodes'
[8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
REASON='building name service database'
[8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
REASON='entertaining queries now'
[8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
REASON='refilling all nodes'
[8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
REASON='restarting all nodes'
----
----
I have also run it in debug mode, after compiling with
./configure --enable-debug && make -j5 clean && make -j5
The output is immense but not very helpful to me, given my limited
knowledge. I have uploaded it to pastebin:
-coordinator output : http://pastebin.com/4m5REy28
-application output : http://pastebin.com/inxmfvCc
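Since the debug output is so large, a small filter might help when skimming it. The following is only a sketch in Python, assuming the full logs use the same entry header as the excerpts above ("[pid] LEVEL at file:line in function;"). The names HEADER, summarize and sample are hypothetical helpers, not part of DMTCP.

```python
import re
from collections import Counter

# Entry header shape copied from the excerpts in this message, e.g.
# "[42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;"
# Assumption: every entry in the full logs starts this way.
HEADER = re.compile(r"^\[(\d+)\] (\w+) at ([\w.]+):(\d+) in (\w+);")


def summarize(lines):
    """Count log entries per (level, source file, function)."""
    counts = Counter()
    for line in lines:
        match = HEADER.match(line)
        if match:
            _pid, level, src, _lineno, func = match.groups()
            counts[(level, src, func)] += 1
    return counts


# A few entries taken verbatim from the output above, including the
# wrapped continuation lines, which the filter simply ignores.
sample = """\
[42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
[42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
REASON='JWARNING(false) failed'
[8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
[8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
"""

if __name__ == "__main__":
    for key, n in summarize(sample.splitlines()).most_common():
        print(n, *key)
```

To use it on a saved log, the sample string could be replaced with the lines of that file, for example open("coordinator.log") for a hypothetical local copy of the pastebin text.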
--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN