Hi Manuel,

  Is it possible for you to run the application using only MPI (without
SLURM)? I'm asking because DMTCP has a plugin for SLURM, and I want to
isolate the plugin from DMTCP core. This can help us locate the bug more
precisely.
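
A minimal sketch of such a run, in case it helps. It assumes MPICH's Hydra
mpiexec is on the PATH, that the compute-node host names are the ones in the
output quoted below, and that the mpiLoop binary is in the current directory;
all three are assumptions, so adjust them to your setup. The script only
prints the command; uncomment the last line to actually launch it.

```shell
#!/bin/sh
# Sketch only: run the test under DMTCP with MPICH's mpiexec instead of srun.
# HOSTS and APP are placeholders taken from the quoted output; adjust them.
HOSTS="slurm-compute1,slurm-compute2"
APP="./mpiLoop"

# No --rm flag here, so the resource-manager (SLURM) plugin stays out of the picture.
CMD="dmtcp_launch mpiexec -hosts $HOSTS -n 2 $APP 50"

echo "$CMD"      # inspect the command first
# eval "$CMD"    # uncomment to launch it for real
```

If the checkpoint works this way, the problem is more likely in the SLURM
plugin; if the same warning appears, it is more likely in DMTCP core.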

Best,
Jiajun

On Tue, Nov 18, 2014 at 4:17 AM, Manuel Rodríguez Pascual <
[email protected]> wrote:

> Well, it is in fact a virtual KVM cluster inside my local PC, so I would
> say it's Ethernet.
>
> *ifconfig (master, computing nodes change IP and MAC):*
> eth0      Link encap:Ethernet  HWaddr 02:00:C0:A8:7A:01
>           inet addr:192.168.122.2  Bcast:192.168.122.255  Mask:255.255.255.0
>           inet6 addr: fe80::c0ff:fea8:7a01/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:338 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:243 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:26352 (25.7 KiB)  TX bytes:23475 (22.9 KiB)
>           Interrupt:10
>
> *iptables (master and computing nodes)*
> [root@slurm-master ~]# iptables -L
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
>
>
> I can passwordless SSH as root from master to computing nodes. I cannot
> do so from computing nodes to master.
>
> Regarding *user configuration*, I am running:
> -slurmctld on master as user slurm
> -slurmd on computing nodes as user root
> -dmtcp_coordinator on master as user root
> -dmtcp_launch on master both as user slurm and root (same results)
>
> DMTCP has been installed on both the master and the computing nodes, same
> version. I am compiling it with no flags, or just the debug ones.
>
> 2014-11-17 23:01 GMT+01:00 Jiajun Cao <[email protected]>:
>
>> Hi Manuel,
>>
>>   What kind of network is used in the cluster? Ethernet or InfiniBand?
>>
>> On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman <[email protected]> wrote:
>>
>>> Jiajun,
>>>     Could you respond to this, since you've been extending our support
>>> for MPI?
>>>
>>> Thanks,
>>> - Gene
>>>
>>> On Mon, Nov 17, 2014 at 06:02:34PM +0100, Manuel Rodríguez Pascual wrote:
>>> > Good morning list,
>>> >
>>> >
>>> > I am a newbie with DMTCP, so probably this is something obvious. Anyway,
>>> > I am not able to checkpoint MPI applications. Instead, I receive an error.
>>> > I have looked on the internet but still haven't been able to solve it.
>>> >
>>> > -MPI and sequential applications work fine without DMTCP
>>> > -DMTCP works fine when running a sequential application on the master
>>> > and restoring it.
>>> > ...but it breaks when checkpointing a parallel application.
>>> >
>>> > When I execute my code (a simple loop from 1 to 50, to detect the
>>> > moment of the checkpoint) with
>>> >
>>> > (one tab) dmtcp_coordinator
>>> > (other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
>>> >
>>> > and then checkpoint with "c" in the coordinator tab, it does not work.
>>> > Instead, the application keeps printing the same error message while
>>> > it runs in the background. When the execution of the MPI code has
>>> > finished, all the output is returned and the system more or less hangs
>>> > until I stop it manually.
>>> >
>>> > Below you can find all the information that may be relevant: software
>>> > stack, output from the application and the coordinator, and output when
>>> > executed in debug mode. Anyway, I suspect that this is probably due to
>>> > me not knowing how to install, configure or use the application.
>>> >
>>> > Thanks for your help,
>>> >
>>> > Manuel
>>> >
>>> >
>>> >
>>> > My software stack is:
>>> > CentOS 6 Virtual Machine
>>> > Slurm: slurm 14.03.10
>>> > MPI: mpich-3.1.3
>>> > dmtcp_coordinator (DMTCP) 2.3.1
>>> > -> 1 master node
>>> > -> 3 working nodes (the master is not a working node)
>>> >
>>> >
>>> > I have tried running dmtcp_coordinator only on the master, and on both
>>> > the master and the working nodes, with identical results.
>>> >
>>> >
>>> >
>>> > Output in application:
>>> > ----
>>> > ----
>>> >
>>> > [slurm@slurm-master ~]$  dmtcp_launch --rm srun -n 2 mpiLoop 50
>>> > [42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
>>> > Process 0 of 2 is on slurm-compute1
>>> > iteration 0 on process 0
>>> > Process 1 of 2 is on slurm-compute2
>>> > iteration 0 on process 1
>>> >
>>> > (start checkpoint here)
>>> >
>>> > [42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>> > REASON='JWARNING(false) failed'
>>> >      _dataSockets[i]->socket().sockfd() = 19
>>> >      buffer.size() = 196
>>> >      WARN_INTERVAL_SEC = 10
>>> > Message: Still draining socket... perhaps remote host is not running
>>> > under DMTCP?
>>> > ----
>>> > ----
>>> >
>>> >
>>> >
>>> > I keep receiving the same message every 10 seconds until the execution
>>> > is supposed to have finished. Then the execution *doesn't* finish, and
>>> > I have to stop it manually with CTRL+C.
>>> >
>>> >
>>> > Output in coordinator:
>>> > ----
>>> > ----
>>> > dmtcp_coordinator starting...
>>> >     Host: slurm-master (192.168.122.11)
>>> >     Port: 7779
>>> >     Checkpoint Interval: disabled (checkpoint manually instead)
>>> >     Exit on last client: 0
>>> > Type '?' for help.
>>> >
>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>>> > connected'
>>> >      hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
>>> > process Information after exec()'
>>> >      progname = srun
>>> >      msg.from = 6db90f3d5a9dd200-40000-546a25f2
>>> >      client->identity() = 6db90f3d5a9dd200-8271-546a25f2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>>> > connected'
>>> >      hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
>>> > process Information after fork()'
>>> >      client->hostname() = slurm-master
>>> >      client->progname() = srun_(forked)
>>> >      msg.from = 6db90f3d5a9dd200-41000-546a25f2
>>> >      client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>>> > REASON='client disconnected'
>>> >      client->identity() = 6db90f3d5a9dd200-41000-546a25f2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>>> > REASON='client disconnected'
>>> >      client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>>> > connected'
>>> >      hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
>>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
>>> > process Information after exec()'
>>> >      progname = srun
>>> >      msg.from = 6db90f3d5a9dd200-42000-546a2609
>>> >      client->identity() = 6db90f3d5a9dd200-8323-546a2609
>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
>>> > connected'
>>> >      hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
>>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
>>> > process Information after fork()'
>>> >      client->hostname() = slurm-master
>>> >      client->progname() = srun_(forked)
>>> >      msg.from = 6db90f3d5a9dd200-43000-546a2609
>>> >      client->identity() = 6db90f3d5a9dd200-42000-546a2609
>>> >
>>> > (checkpoint)
>>> >
>>> > c
>>> > [8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
>>> > REASON='starting checkpoint, suspending all nodes'
>>> >      s.numPeers = 2
>>> > [8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
>>> > REASON='Incremented Generation'
>>> >      compId.generation() = 1
>>> > [8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
>>> > REASON='locking all nodes'
>>> > [8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
>>> > REASON='draining all nodes'
>>> >
>>> > [8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
>>> > REASON='checkpointing all nodes'
>>> > [8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
>>> > REASON='building name service database'
>>> > [8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
>>> > REASON='entertaining queries now'
>>> > [8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
>>> > REASON='refilling all nodes'
>>> > [8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
>>> > REASON='restarting all nodes'
>>> > ----
>>> > ----
>>> >
>>> >
>>> > I have executed it in debug mode too, after compiling with
>>> >  ./configure --enable-debug && make -j5 clean && make -j5
>>> >
>>> > The output is immense but not very helpful to me with my limited
>>> > knowledge. I have uploaded it to pastebin.
>>> >
>>> > -coordinator output : http://pastebin.com/4m5REy28
>>> > -application output : http://pastebin.com/inxmfvCc
>>> >
>>> >
>>> >
>>> > --
>>> > Dr. Manuel Rodríguez-Pascual
>>> > skype: manuel.rodriguez.pascual
>>> > phone: (+34) 913466173 // (+34) 679925108
>>> >
>>> > CIEMAT-Moncloa
>>> > Edificio 22, desp. 1.25
>>> > Avenida Complutense, 40
>>> > 28040- MADRID
>>> > SPAIN
>>>
>>> >
>>> ------------------------------------------------------------------------------
>>> > Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>> > from Actuate! Instantly Supercharge Your Business Reports and
>>> Dashboards
>>> > with Interactivity, Sharing, Native Excel Exports, App Integration &
>>> more
>>> > Get technology previously reserved for billion-dollar corporations,
>>> FREE
>>> >
>>> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>>>
>>> > _______________________________________________
>>> > Dmtcp-forum mailing list
>>> > [email protected]
>>> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>
>>>
>>
>
>
>
