I submitted several MPI tests in the previous mail. Please check them, and
if you need anything else, let me know. I am copying the tests below so you
have all the information in this mail.

Regarding my environment, it is the following (I think I posted it
earlier too):

My software stack is:
CentOS 6 Virtual Machine
Slurm: slurm 14.03.10
MPI: mpich-3.1.3
dmtcp_coordinator (DMTCP) 2.3.1
-> 1 master node
-> 3 working nodes (the master is not a working node)
Network: Ethernet, no firewalls or restrictions

Again, this is all performed on virtual machines, so feel free to ask me
for the images if you want an exact replica of my environment on your side.


Thanks for your help,


Manuel



*MPI Tests*
*MPI on a single machine:*
---
---
[root@slurm-master slurm]# mpiexec -n 3 ./mpiLoop 2
Process 2 of 3 is on slurm-master
iteration 0 on process 2
Process 1 of 3 is on slurm-master
iteration 0 on process 1
Process 0 of 3 is on slurm-master
iteration 0 on process 0
iteration 1 on process 1
iteration 1 on process 2
iteration 1 on process 0
Goodbye world from process 0 of 3
Goodbye world from process 1 of 3
Goodbye world from process 2 of 3
---
---
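The mpiLoop source itself is not included in this mail; a minimal sketch consistent with the output above could look like the following. Note this is a hypothetical reconstruction inferred from the transcripts (the loop-count argument, the per-second delay, and the exact message wording are assumptions), not Manuel's actual program:

```c
/* Hypothetical reconstruction of mpiLoop.c, inferred from the test output.
 * Build (assuming MPICH): mpicc -o mpiLoop mpiLoop.c
 * Run:                    mpiexec -n 3 ./mpiLoop 2
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    int iterations = (argc > 1) ? atoi(argv[1]) : 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    printf("Process %d of %d is on %s\n", rank, size, name);
    for (int i = 0; i < iterations; i++) {
        printf("iteration %d on process %d\n", i, rank);
        sleep(1);  /* slow the loop so a checkpoint can land mid-run */
    }
    printf("Goodbye world from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```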


*MPI on multiple nodes*
---
---
[root@slurm-master slurm]# more machinefile
slurm-compute1
slurm-compute2
slurm-compute3

[root@slurm-master slurm]# mpiexec -n 3 -f machinefile ./mpiLoop 2
Process 2 of 3 is on slurm-compute3
Process 1 of 3 is on slurm-compute2
iteration 0 on process 2
iteration 0 on process 1
Process 0 of 3 is on slurm-compute1
iteration 0 on process 0
iteration 1 on process 1
iteration 1 on process 2
iteration 1 on process 0
Goodbye world from process 1 of 3
Goodbye world from process 0 of 3
Goodbye world from process 2 of 3
---
---


*MPI with Slurm through the queue system*
---
---
[root@slurm-master slurm]# more mpiLoop_slurm.sh
#!/bin/sh
sbatch -n 3 mpiLoop_script.sh


[root@slurm-master slurm]# more  mpiLoop_script.sh
#!/bin/sh
srun ./mpiLoop 2

[root@slurm-master slurm]# sbatch mpiLoop_slurm.sh

[root@slurm-master slurm]# more slurm-115.out
Process 1 of 3 is on slurm-compute2
iteration 0 on process 1
Process 0 of 3 is on slurm-compute1
iteration 0 on process 0
Process 2 of 3 is on slurm-compute3
iteration 0 on process 2
iteration 1 on process 0
iteration 1 on process 1
iteration 1 on process 2
Goodbye world from process 1 of 3
Goodbye world from process 0 of 3
Goodbye world from process 2 of 3
---
---


*MPI with Slurm, direct submission with srun*
---
---
[root@slurm-master slurm]# srun -n 3 ./mpiLoop 2
Process 2 of 3 is on slurm-compute3
iteration 0 on process 2
Process 1 of 3 is on slurm-compute2
iteration 0 on process 1
Process 0 of 3 is on slurm-compute1
iteration 0 on process 0
iteration 1 on process 2
iteration 1 on process 1
iteration 1 on process 0
Goodbye world from process 0 of 3
Goodbye world from process 2 of 3
Goodbye world from process 1 of 3
---
---
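For the follow-up test Jiajun requested below (running under DMTCP over plain MPI, with no Slurm involvement), the session would presumably look like this. This is a sketch assuming the same machinefile and binary as above and the default coordinator port 7779 shown in the earlier coordinator output:

```shell
# Terminal 1: start the coordinator on the master (default port 7779).
dmtcp_coordinator

# Terminal 2: launch the MPI job under DMTCP, bypassing Slurm entirely.
# A long iteration count leaves time to trigger a checkpoint mid-run.
dmtcp_launch mpiexec -n 3 -f machinefile ./mpiLoop 50

# Then type 'c' in the coordinator terminal to request a checkpoint.
# If the same "Still draining socket..." warning appears here, the bug
# is in the DMTCP core rather than in the Slurm plugin.
```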


2014-11-20 10:23 GMT+01:00 Jiajun Cao <[email protected]>:

> Hi Manuel,
>
>   Sorry for the confusion. I meant the bug in DMTCP. My point is to run
> the application under DMTCP using MPI only, without the involvement of
> Slurm.
> If it shows the same behavior, we're almost 100% sure it's a bug in DMTCP
> core, not in the Slurm plugin.
>
>   If you can provide me your environment, that'll be great. I can test it
> on my side as well.
>
> Best,
> Jiajun
>
> On Wed, Nov 19, 2014 at 4:28 AM, Manuel Rodríguez Pascual <
> [email protected]> wrote:
>
>> By "bug", do you mean this is something I am doing wrong, or a bug in
>> DMTCP? Either way, I can provide you with my virtual machines where I am
>> running all this, so you can use them in your debugging.
>>
>> Anyway, it is possible to run an MPI application in many different ways.
>> Please find the output below.
>>
>> Thanks for your help. Best regards,
>>
>> Manuel
>>
>>
>>
>> *MPI on a single machine:*
>> ---
>> ---
>> [root@slurm-master slurm]# mpiexec -n 3 ./mpiLoop 2
>> Process 2 of 3 is on slurm-master
>> iteration 0 on process 2
>> Process 1 of 3 is on slurm-master
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-master
>> iteration 0 on process 0
>> iteration 1 on process 1
>> iteration 1 on process 2
>> iteration 1 on process 0
>> Goodbye world from process 0 of 3
>> Goodbye world from process 1 of 3
>> Goodbye world from process 2 of 3
>> ---
>> ---
>>
>>
>> *MPI on multiple nodes*
>> ---
>> ---
>> [root@slurm-master slurm]# more machinefile
>> slurm-compute1
>> slurm-compute2
>> slurm-compute3
>>
>> [root@slurm-master slurm]# mpiexec -n 3 -f machinefile ./mpiLoop 2
>> Process 2 of 3 is on slurm-compute3
>> Process 1 of 3 is on slurm-compute2
>> iteration 0 on process 2
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-compute1
>> iteration 0 on process 0
>> iteration 1 on process 1
>> iteration 1 on process 2
>> iteration 1 on process 0
>> Goodbye world from process 1 of 3
>> Goodbye world from process 0 of 3
>> Goodbye world from process 2 of 3
>> ---
>> ---
>>
>>
>> *MPI with SLURM through queue system.*
>> ---
>> ---
>> [root@slurm-master slurm]# more mpiLoop_slurm.sh
>> #!/bin/sh
>> sbatch -n 3 mpiLoop_script.sh
>>
>>
>> [root@slurm-master slurm]# more  mpiLoop_script.sh
>> #!/bin/sh
>> srun ./mpiLoop 2
>>
>> [root@slurm-master slurm]# sbatch mpiLoop_slurm.sh
>>
>> [root@slurm-master slurm]# more slurm-115.out
>> Process 1 of 3 is on slurm-compute2
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-compute1
>> iteration 0 on process 0
>> Process 2 of 3 is on slurm-compute3
>> iteration 0 on process 2
>> iteration 1 on process 0
>> iteration 1 on process 1
>> iteration 1 on process 2
>> Goodbye world from process 1 of 3
>> Goodbye world from process 0 of 3
>> Goodbye world from process 2 of 3
>> ---
>> ---
>>
>>
>> *MPI with Slurm, direct submission with srun*
>> ---
>> ---
>> [root@slurm-master slurm]# srun -n 3 ./mpiLoop 2
>> Process 2 of 3 is on slurm-compute3
>> iteration 0 on process 2
>> Process 1 of 3 is on slurm-compute2
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-compute1
>> iteration 0 on process 0
>> iteration 1 on process 2
>> iteration 1 on process 1
>> iteration 1 on process 0
>> Goodbye world from process 0 of 3
>> Goodbye world from process 2 of 3
>> Goodbye world from process 1 of 3
>> ---
>> ---
>>
>>
>>
>>
>>
>> 2014-11-18 19:58 GMT+01:00 Jiajun Cao <[email protected]>:
>>
>>> Hi Manuel,
>>>
>>>   Is it possible for you to run the application using only MPI (without
>>> SLURM)? I'm asking because DMTCP has a plugin for SLURM, and I want to
>>> isolate the plugin from DMTCP core. This can help us locate the bug more
>>> precisely.
>>>
>>> Best,
>>> Jiajun
>>>
>>> On Tue, Nov 18, 2014 at 4:17 AM, Manuel Rodríguez Pascual <
>>> [email protected]> wrote:
>>>
>>>> Well, it is in fact a virtual KVM cluster inside my local PC, so I
>>>> would say it's Ethernet.
>>>>
>>>> *ifconfig (master, computing nodes change IP and MAC):*
>>>> eth0      Link encap:Ethernet  HWaddr 02:00:C0:A8:7A:01
>>>>           inet addr:192.168.122.2  Bcast:192.168.122.255
>>>>  Mask:255.255.255.0
>>>>           inet6 addr: fe80::c0ff:fea8:7a01/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:338 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:243 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:26352 (25.7 KiB)  TX bytes:23475 (22.9 KiB)
>>>>           Interrupt:10
>>>>
>>>> *iptables (master and computing nodes)*
>>>> [root@slurm-master ~]# iptables -L
>>>> Chain INPUT (policy ACCEPT)
>>>> target     prot opt source               destination
>>>> Chain FORWARD (policy ACCEPT)
>>>> target     prot opt source               destination
>>>> Chain OUTPUT (policy ACCEPT)
>>>> target     prot opt source               destination
>>>>
>>>>
>>>> I can SSH passwordlessly as root from the master to the computing
>>>> nodes. I cannot from the computing nodes to the master.
>>>>
>>>> Regarding *user configuration*, I am running:
>>>> -slurmctld on master as user slurm
>>>> -slurmd on computing nodes as user root
>>>> -dmtcp_coordinator on master as user root
>>>> -dmtcp_launch on master both as user slurm and root (same results)
>>>>
>>>> DMTCP has been installed on both the master and the computing nodes,
>>>> same version. I am compiling it with no flags, or with just the debug
>>>> ones.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2014-11-17 23:01 GMT+01:00 Jiajun Cao <[email protected]>:
>>>>
>>>>> Hi Manuel,
>>>>>
>>>>>   What kind of network is used in the cluster? Ethernet or InfiniBand?
>>>>>
>>>>> On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Jiajun,
>>>>>>     Could you respond to this, since you've been extending our support
>>>>>> for MPI?
>>>>>>
>>>>>> Thanks,
>>>>>> - Gene
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 06:02:34PM +0100, Manuel Rodríguez Pascual
>>>>>> wrote:
>>>>>> > Good morning list,
>>>>>> >
>>>>>> >
>>>>>> > I am a newbie with DMTCP, so probably this is something obvious.
>>>>>> > Anyway, I am not able to checkpoint MPI applications. Instead, I
>>>>>> > receive an error. I have looked on the internet but still haven't
>>>>>> > been able to solve it.
>>>>>> >
>>>>>> > -MPI and sequential applications work fine without DMTCP
>>>>>> > -DMTCP works fine when running a sequential application on the
>>>>>> > master and restoring it.
>>>>>> > ...but it crashes when checkpointing a parallel app.
>>>>>> >
>>>>>> > When I execute my code (a simple loop from 1 to 50, to detect the
>>>>>> > moment of checkpoint) with
>>>>>> >
>>>>>> > (one tab) dmtcp_coordinator
>>>>>> > (other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
>>>>>> >
>>>>>> > and then checkpoint with "c" in the coordinator tab, it does not
>>>>>> > work. Instead, what happens is that the application starts printing
>>>>>> > the same error message while it is running in the background. And
>>>>>> > when the execution of the MPI code has finished, all the output is
>>>>>> > returned and the system kind of halts until I manually stop it.
>>>>>> >
>>>>>> > Below you can find all the information that may be relevant:
>>>>>> > software stack, output from app and coordinator, and output when
>>>>>> > executed in debug mode.
>>>>>> > Anyway, I suspect that this is probably due to me not knowing how to
>>>>>> > install, configure or use the application.
>>>>>> >
>>>>>> > Thanks for your help,
>>>>>> >
>>>>>> > Manuel
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > My software stack is:
>>>>>> > CentOS 6 Virtual Machine
>>>>>> > Slurm: slurm 14.03.10
>>>>>> > MPI: mpich-3.1.3
>>>>>> > dmtcp_coordinator (DMTCP) 2.3.1
>>>>>> > ->1 master node
>>>>>> > -> 3 working nodes. master is not a working node
>>>>>> >
>>>>>> >
>>>>>> > I have tried running dmtcp_coordinator only on the master, and on
>>>>>> > both the master and the working nodes, with identical results.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Output in application:
>>>>>> > ----
>>>>>> > ----
>>>>>> >
>>>>>> > [slurm@slurm-master ~]$  dmtcp_launch --rm srun -n 2 mpiLoop 50
>>>>>> > [42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook; REASON='Start'
>>>>>> > Process 0 of 2 is on slurm-compute1
>>>>>> > iteration 0 on process 0
>>>>>> > Process 1 of 2 is on slurm-compute2
>>>>>> > iteration 0 on process 1
>>>>>> >
>>>>>> > (start checkpoint here)
>>>>>> >
>>>>>> > [42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>>>>> > REASON='JWARNING(false) failed'
>>>>>> >      _dataSockets[i]->socket().sockfd() = 19
>>>>>> >      buffer.size() = 196
>>>>>> >      WARN_INTERVAL_SEC = 10
>>>>>> > Message: Still draining socket... perhaps remote host is not
>>>>>> running under
>>>>>> > DMTCP?
>>>>>> > ----
>>>>>> > ----
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > I keep receiving the same error every 10 seconds until the
>>>>>> execution is
>>>>>> > supposed to have finished. Then, the execution *doesn't* finish,
>>>>>> and I have
>>>>>> > to stop it manually with CTRL+C
>>>>>> >
>>>>>> >
>>>>>> > Output in coordinator:
>>>>>> > ----
>>>>>> > ----
>>>>>> > dmtcp_coordinator starting...
>>>>>> >     Host: slurm-master (192.168.122.11)
>>>>>> >     Port: 7779
>>>>>> >     Checkpoint Interval: disabled (checkpoint manually instead)
>>>>>> >     Exit on last client: 0
>>>>>> > Type '?' for help.
>>>>>> >
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>> REASON='worker
>>>>>> > connected'
>>>>>> >      hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
>>>>>> > process Information after exec()'
>>>>>> >      progname = srun
>>>>>> >      msg.from = 6db90f3d5a9dd200-40000-546a25f2
>>>>>> >      client->identity() = 6db90f3d5a9dd200-8271-546a25f2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>> REASON='worker
>>>>>> > connected'
>>>>>> >      hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
>>>>>> > process Information after fork()'
>>>>>> >      client->hostname() = slurm-master
>>>>>> >      client->progname() = srun_(forked)
>>>>>> >      msg.from = 6db90f3d5a9dd200-41000-546a25f2
>>>>>> >      client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>>>>>> REASON='client
>>>>>> > disconnected'
>>>>>> >      client->identity() = 6db90f3d5a9dd200-41000-546a25f2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>>>>>> REASON='client
>>>>>> > disconnected'
>>>>>> >      client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>> REASON='worker
>>>>>> > connected'
>>>>>> >      hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
>>>>>> > process Information after exec()'
>>>>>> >      progname = srun
>>>>>> >      msg.from = 6db90f3d5a9dd200-42000-546a2609
>>>>>> >      client->identity() = 6db90f3d5a9dd200-8323-546a2609
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>> REASON='worker
>>>>>> > connected'
>>>>>> >      hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
>>>>>> > process Information after fork()'
>>>>>> >      client->hostname() = slurm-master
>>>>>> >      client->progname() = srun_(forked)
>>>>>> >      msg.from = 6db90f3d5a9dd200-43000-546a2609
>>>>>> >      client->identity() = 6db90f3d5a9dd200-42000-546a2609
>>>>>> >
>>>>>> > (checkpoint)
>>>>>> >
>>>>>> > c
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
>>>>>> > REASON='starting checkpoint, suspending all nodes'
>>>>>> >      s.numPeers = 2
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
>>>>>> > REASON='Incremented Generation'
>>>>>> >      compId.generation() = 1
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
>>>>>> > REASON='locking all nodes'
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
>>>>>> > REASON='draining all nodes'
>>>>>> >
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
>>>>>> > REASON='checkpointing all nodes'
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
>>>>>> > REASON='building name service database'
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
>>>>>> > REASON='entertaining queries now'
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
>>>>>> > REASON='refilling all nodes'
>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
>>>>>> > REASON='restarting all nodes'
>>>>>> > ----
>>>>>> > ----
>>>>>> >
>>>>>> >
>>>>>> > I have executed it in debug mode too, after compiling with
>>>>>> >  ./configure --enable-debug && make -j5 clean && make -j5
>>>>>> >
>>>>>> > The output is immense but not very helpful for me with my limited
>>>>>> > knowledge. I have uploaded it to pastebin.
>>>>>> >
>>>>>> > -coordinator output : http://pastebin.com/4m5REy28
>>>>>> > -application output : http://pastebin.com/inxmfvCc
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Dr. Manuel Rodríguez-Pascual
>>>>>> > skype: manuel.rodriguez.pascual
>>>>>> > phone: (+34) 913466173 // (+34) 679925108
>>>>>> >
>>>>>> > CIEMAT-Moncloa
>>>>>> > Edificio 22, desp. 1.25
>>>>>> > Avenida Complutense, 40
>>>>>> > 28040- MADRID
>>>>>> > SPAIN
>>>>>>
>>>>>> >
>>>>>> ------------------------------------------------------------------------------
>>>>>> > Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>>>>> > from Actuate! Instantly Supercharge Your Business Reports and
>>>>>> Dashboards
>>>>>> > with Interactivity, Sharing, Native Excel Exports, App Integration
>>>>>> & more
>>>>>> > Get technology previously reserved for billion-dollar corporations,
>>>>>> FREE
>>>>>> >
>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
>>>>>>
>>>>>> > _______________________________________________
>>>>>> > Dmtcp-forum mailing list
>>>>>> > [email protected]
>>>>>> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Manuel Rodríguez-Pascual
>>>> skype: manuel.rodriguez.pascual
>>>> phone: (+34) 913466173 // (+34) 679925108
>>>>
>>>> CIEMAT-Moncloa
>>>> Edificio 22, desp. 1.25
>>>> Avenida Complutense, 40
>>>> 28040- MADRID
>>>> SPAIN
>>>>
>>>
>>>
>>
>>
>> --
>> Dr. Manuel Rodríguez-Pascual
>> skype: manuel.rodriguez.pascual
>> phone: (+34) 913466173 // (+34) 679925108
>>
>> CIEMAT-Moncloa
>> Edificio 22, desp. 1.25
>> Avenida Complutense, 40
>> 28040- MADRID
>> SPAIN
>>
>
>


-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN