Well, it is in fact a virtual KVM cluster inside my local PC, so I would
say it's Ethernet.
*ifconfig (master; the computing nodes differ only in IP and MAC):*
eth0 Link encap:Ethernet HWaddr 02:00:C0:A8:7A:01
inet addr:192.168.122.2 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::c0ff:fea8:7a01/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:338 errors:0 dropped:0 overruns:0 frame:0
TX packets:243 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:26352 (25.7 KiB) TX bytes:23475 (22.9 KiB)
Interrupt:10
*iptables (master and computing nodes):*
[root@slurm-master ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
I can SSH passwordless as root from the master to the computing nodes. I
cannot from the computing nodes to the master.
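In case it helps diagnose, here is a minimal sketch of how the reverse direction could be enabled, assuming standard OpenSSH (the hostname is an example from my setup; the exact commands are my assumption, not what I have run):

```shell
# On a compute node, as root (assumes OpenSSH client tools are installed):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa      # generate a key pair if none exists
ssh-copy-id root@slurm-master                  # install the public key on the master
ssh -o BatchMode=yes root@slurm-master true    # should succeed with no password prompt
```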
Regarding *user configuration*, I am running:
- slurmctld on master as user slurm
- slurmd on computing nodes as user root
- dmtcp_coordinator on master as user root
- dmtcp_launch on master both as user slurm and as root (same results)
DMTCP is installed on both the master and the computing nodes, same version.
I compile it either with no flags or with just the debug ones.
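Concretely, the two build variants I mean are (the --enable-debug invocation is the same one quoted further down in this thread):

```shell
# Plain build, no flags:
./configure && make -j5

# Debug build, used to produce the pastebin logs quoted below:
./configure --enable-debug && make -j5 clean && make -j5
```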
2014-11-17 23:01 GMT+01:00 Jiajun Cao <[email protected]>:
> Hi Manuel,
>
> What kind of network is used in the cluster? Ethernet or InfiniBand?
>
> On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman <[email protected]> wrote:
>
>> Jiajun,
>> Could you respond to this, since you've been extending our support
>> for MPI?
>>
>> Thanks,
>> - Gene
>>
>> On Mon, Nov 17, 2014 at 06:02:34PM +0100, Manuel Rodríguez Pascual wrote:
>> > Good morning list,
>> >
>> >
>> > I am a newbie with DMTCP, so probably this is something obvious.
>> > Anyway, I am not able to checkpoint MPI applications. Instead, I
>> > receive an error. I have looked on the internet but still haven't
>> > been able to solve it.
>> >
>> > - MPI and sequential applications work fine without DMTCP.
>> > - DMTCP works fine when running a sequential application on the
>> > master and restoring it.
>> > ...but it breaks when checkpointing a parallel app.
>> >
>> > When I execute my code (a simple loop from 1 to 50, to detect the
>> > moment of the checkpoint) with
>> >
>> > (one tab) dmtcp_coordinator
>> > (other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
>> >
>> > and then checkpoint with "c" in the coordinator tab, it does not work.
>> > Instead, the application starts printing the same error message over
>> > and over while it keeps running in the background. And when the
>> > execution of the MPI code has finished, all the output is returned
>> > and the system kind of halts until I stop it manually.
>> >
>> > Below you can find all the information that may be relevant: software
>> > stack, output from the app and the coordinator, and output when
>> > executed in debug mode. Anyway, I suspect this is probably due to me
>> > not knowing how to install, configure or use the application.
>> >
>> > Thanks for your help,
>> >
>> > Manuel
>> >
>> >
>> >
>> > My software stack is:
>> > CentOS 6 Virtual Machine
>> > Slurm: slurm 14.03.10
>> > MPI: mpich-3.1.3
>> > dmtcp_coordinator (DMTCP) 2.3.1
>> > -> 1 master node
>> > -> 3 working nodes (the master is not a working node)
>> >
>> >
>> > I have tried to run dmtcp_coordinator only on the master, and on both
>> > the master and the working nodes, with identical results.
>> >
>> >
>> >
>> > Output in application:
>> > ----
>> > ----
>> >
>> > [slurm@slurm-master ~]$ dmtcp_launch --rm srun -n 2 mpiLoop 50
>> > [42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
>> > Process 0 of 2 is on slurm-compute1
>> > iteration 0 on process 0
>> > Process 1 of 2 is on slurm-compute2
>> > iteration 0 on process 1
>> >
>> > (start checkpoint here)
>> >
>> > [42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>> > REASON='JWARNING(false) failed'
>> > _dataSockets[i]->socket().sockfd() = 19
>> > buffer.size() = 196
>> > WARN_INTERVAL_SEC = 10
>> > Message: Still draining socket... perhaps remote host is not running
>> > under DMTCP?
>> > ----
>> > ----
>> >
>> >
>> >
>> > I keep receiving the same error every 10 seconds until the execution
>> > is supposed to have finished. Then, the execution *doesn't* finish,
>> > and I have to stop it manually with CTRL+C.
>> >
>> >
>> > Output in coordinator:
>> > ----
>> > ----
>> > dmtcp_coordinator starting...
>> > Host: slurm-master (192.168.122.11)
>> > Port: 7779
>> > Checkpoint Interval: disabled (checkpoint manually instead)
>> > Exit on last client: 0
>> > Type '?' for help.
>> >
>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> > connected'
>> > hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
>> > process Information after exec()'
>> > progname = srun
>> > msg.from = 6db90f3d5a9dd200-40000-546a25f2
>> > client->identity() = 6db90f3d5a9dd200-8271-546a25f2
>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> > connected'
>> > hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
>> > process Information after fork()'
>> > client->hostname() = slurm-master
>> > client->progname() = srun_(forked)
>> > msg.from = 6db90f3d5a9dd200-41000-546a25f2
>> > client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
>> > disconnected'
>> > client->identity() = 6db90f3d5a9dd200-41000-546a25f2
>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
>> > disconnected'
>> > client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> > connected'
>> > hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
>> > process Information after exec()'
>> > progname = srun
>> > msg.from = 6db90f3d5a9dd200-42000-546a2609
>> > client->identity() = 6db90f3d5a9dd200-8323-546a2609
>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>> > connected'
>> > hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
>> > process Information after fork()'
>> > client->hostname() = slurm-master
>> > client->progname() = srun_(forked)
>> > msg.from = 6db90f3d5a9dd200-43000-546a2609
>> > client->identity() = 6db90f3d5a9dd200-42000-546a2609
>> >
>> > (checkpoint)
>> >
>> > c
>> > [8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
>> > REASON='starting checkpoint, suspending all nodes'
>> > s.numPeers = 2
>> > [8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
>> > REASON='Incremented Generation'
>> > compId.generation() = 1
>> > [8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
>> > REASON='locking all nodes'
>> > [8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
>> > REASON='draining all nodes'
>> >
>> > [8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
>> > REASON='checkpointing all nodes'
>> > [8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
>> > REASON='building name service database'
>> > [8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
>> > REASON='entertaining queries now'
>> > [8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
>> > REASON='refilling all nodes'
>> > [8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
>> > REASON='restarting all nodes'
>> > ----
>> > ----
>> >
>> >
>> > I have executed it in debug mode too, after compiling with
>> > ./configure --enable-debug && make -j5 clean && make -j5
>> >
>> > The output is immense but not very helpful for me, with my limited
>> > knowledge. I have uploaded it to pastebin:
>> >
>> > -coordinator output : http://pastebin.com/4m5REy28
>> > -application output : http://pastebin.com/inxmfvCc
>> >
>> > --
>> > Dr. Manuel Rodríguez-Pascual
>> > skype: manuel.rodriguez.pascual
>> > phone: (+34) 913466173 // (+34) 679925108
>> >
>> > CIEMAT-Moncloa
>> > Edificio 22, desp. 1.25
>> > Avenida Complutense, 40
>> > 28040- MADRID
>> > SPAIN
>>
--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum