Hi Jiajun,
It crashes :(
I have run the same tests both with Slurm running on the system and with
Slurm stopped; the output is the same in both cases. I don't know if that
helps.
Anyway, here we go:
---
---
[root@slurm-master slurm]# dmtcp_launch mpiexec -n 3 -f machinefile
./mpiLoop 2
[51000] NOTE at ssh.cpp:348 in prepareForExec; REASON='New ssh command'
newCommand = /usr/local/bin/dmtcp_ssh
/usr/local/bin/dmtcp_nocheckpoint /usr/bin/ssh -x slurm-compute1
/usr/local/bin/dmtcp_launch --ssh-slave --host slurm-master --ckptdir
/home/slurm /usr/local/bin/dmtcp_sshd "/home/mpich3/bin/hydra_pmi_proxy"
--control-port slurm-master:40371 --rmk user --launcher ssh --demux poll
--pgid 0 --retries 10 --usize -2 --proxy-id 0
[52000] NOTE at ssh.cpp:348 in prepareForExec; REASON='New ssh command'
newCommand = /usr/local/bin/dmtcp_ssh
/usr/local/bin/dmtcp_nocheckpoint /usr/bin/ssh -x slurm-compute2
/usr/local/bin/dmtcp_launch --ssh-slave --host slurm-master --ckptdir
/home/slurm /usr/local/bin/dmtcp_sshd "/home/mpich3/bin/hydra_pmi_proxy"
--control-port slurm-master:40371 --rmk user --launcher ssh --demux poll
--pgid 0 --retries 10 --usize -2 --proxy-id 1
[53000] NOTE at ssh.cpp:348 in prepareForExec; REASON='New ssh command'
newCommand = /usr/local/bin/dmtcp_ssh
/usr/local/bin/dmtcp_nocheckpoint /usr/bin/ssh -x slurm-compute3
/usr/local/bin/dmtcp_launch --ssh-slave --host slurm-master --ckptdir
/home/slurm /usr/local/bin/dmtcp_sshd "/home/mpich3/bin/hydra_pmi_proxy"
--control-port slurm-master:40371 --rmk user --launcher ssh --demux poll
--pgid 0 --retries 10 --usize -2 --proxy-id 2
(halts here)
---
---
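While it is halted there, one thing I can try on the master (just generic
commands, nothing DMTCP-specific, so take this as a rough sketch) is to look
at what mpiexec and ssh are actually waiting on:
---
---
[root@slurm-master slurm]# ps -ef --forest | grep -E 'dmtcp|hydra|ssh' | grep -v grep
---
---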
If I execute the command manually,
"
/usr/local/bin/dmtcp_ssh /usr/local/bin/dmtcp_nocheckpoint /usr/bin/ssh -x
slurm-compute2 /usr/local/bin/dmtcp_launch --ssh-slave --host slurm-master
--ckptdir /home/slurm /usr/local/bin/dmtcp_sshd
"/home/mpich3/bin/hydra_pmi_proxy" --control-port slurm-master:40371 --rmk
user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2
--proxy-id 1
"
It finishes really quickly and without any error, although no output is
displayed (I don't know whether this is good or bad).
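In case it helps with the diagnosis, this is a rough sketch of how I would
check whether that manual run left anything alive on the remote side (again,
only generic commands, nothing DMTCP-specific; I have not dug deeper yet):
---
---
[root@slurm-master slurm]# echo $?
[root@slurm-master slurm]# ssh slurm-compute2 'ps -ef | grep -E "dmtcp_sshd|hydra_pmi_proxy" | grep -v grep'
---
---
If hydra_pmi_proxy is not left running on slurm-compute2, I guess that would
explain why there is no output.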
COORDINATOR
---
---
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-1602-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = mpiexec.hydra
msg.from = 6db90f3d5a9dd200-50000-546dbd30
client->identity() = 6db90f3d5a9dd200-1602-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-50000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-50000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-50000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = mpiexec.hydra_(forked)
msg.from = 6db90f3d5a9dd200-51000-546dbd30
client->identity() = 6db90f3d5a9dd200-50000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = mpiexec.hydra_(forked)
msg.from = 6db90f3d5a9dd200-52000-546dbd30
client->identity() = 6db90f3d5a9dd200-50000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = mpiexec.hydra_(forked)
msg.from = 6db90f3d5a9dd200-53000-546dbd30
client->identity() = 6db90f3d5a9dd200-50000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 6db90f3d5a9dd200-51000-546dbd30
client->identity() = 6db90f3d5a9dd200-51000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 6db90f3d5a9dd200-52000-546dbd30
client->identity() = 6db90f3d5a9dd200-52000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 6db90f3d5a9dd200-53000-546dbd30
client->identity() = 6db90f3d5a9dd200-53000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-52000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-51000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = dmtcp_ssh_(forked)
msg.from = 6db90f3d5a9dd200-54000-546dbd30
client->identity() = 6db90f3d5a9dd200-52000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-54000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = dmtcp_ssh_(forked)
msg.from = 6db90f3d5a9dd200-55000-546dbd30
client->identity() = 6db90f3d5a9dd200-51000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-53000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-55000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:816 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = dmtcp_ssh_(forked)
msg.from = 6db90f3d5a9dd200-56000-546dbd30
client->identity() = 6db90f3d5a9dd200-53000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-56000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 5e6cf1fdf038fb25-1449-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 5e6cf1fdf038fb24-1447-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 5e6cf1fdf038fb25-57000-546dbd30
client->identity() = 5e6cf1fdf038fb25-1449-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker
connected'
hello_remote.from = 5e6cf1fdf038fb26-1447-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 5e6cf1fdf038fb24-58000-546dbd30
client->identity() = 5e6cf1fdf038fb24-1447-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_sshd
msg.from = 5e6cf1fdf038fb26-59000-546dbd30
client->identity() = 5e6cf1fdf038fb26-1447-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 5e6cf1fdf038fb25-57000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-52000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 5e6cf1fdf038fb24-58000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-51000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 5e6cf1fdf038fb26-59000-546dbd30
[1557] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-53000-546dbd30
---
---
2014-11-20 10:51 GMT+01:00 Jiajun Cao <[email protected]>:
> Here is what I meant, and you can give a quick try:
>
> dmtcp_coordinator
> dmtcp_launch mpiexec -n 3 -f machinefile ./mpiLoop 2
>
> Regarding the command you used (dmtcp_launch --rm srun -n 2 mpiLoop 50)
> as described in the first email,
> "srun -n 2 mpiLoop 50" is managed by Slurm, while "--rm" triggers the
> Slurm plugin in DMTCP,
> and I want to put Slurm aside first. "mpiexec -n 3 -f machinefile
> ./mpiLoop 2 " does not run under Slurm, so it's safe not to use "--rm".
>
> On Thu, Nov 20, 2014 at 4:28 AM, Manuel Rodríguez Pascual <
> [email protected]> wrote:
>
>> I submitted several MPI tests in the previous mail. Please check them, and
>> if you need anything else please let me know. I am copying the tests below
>> so you have all the info in this mail.
>>
>> Regarding my environment, it is the following (I think I posted it
>> earlier too):
>>
>> My software stack is:
>> CentOS 6 Virtual Machine
>> Slurm: slurm 14.03.10
>> MPI: mpich-3.1.3
>> dmtcp_coordinator (DMTCP) 2.3.1
>> -> 1 master node
>> -> 3 worker nodes (the master is not a worker node)
>> Network: Ethernet, no firewalls or restrictions
>>
>> Again, this is all performed on virtual machines, so feel free to ask me
>> for the images if you want an exact replica of my environment on your side.
>>
>>
>> Thanks for your help,
>>
>>
>> Manuel
>>
>>
>>
>> MPI Tests
>> *MPI on a single machine:*
>> ---
>> ---
>> [root@slurm-master slurm]# mpiexec -n 3 ./mpiLoop 2
>> Process 2 of 3 is on slurm-master
>> iteration 0 on process 2
>> Process 1 of 3 is on slurm-master
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-master
>> iteration 0 on process 0
>> iteration 1 on process 1
>> iteration 1 on process 2
>> iteration 1 on process 0
>> Goodbye world from process 0 of 3
>> Goodbye world from process 1 of 3
>> Goodbye world from process 2 of 3
>> ---
>> ---
>>
>>
>> *MPI on multiple nodes*
>> ---
>> ---
>> [root@slurm-master slurm]# more machinefile
>> slurm-compute1
>> slurm-compute2
>> slurm-compute3
>>
>> [root@slurm-master slurm]# mpiexec -n 3 -f machinefile ./mpiLoop 2
>> Process 2 of 3 is on slurm-compute3
>> Process 1 of 3 is on slurm-compute2
>> iteration 0 on process 2
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-compute1
>> iteration 0 on process 0
>> iteration 1 on process 1
>> iteration 1 on process 2
>> iteration 1 on process 0
>> Goodbye world from process 1 of 3
>> Goodbye world from process 0 of 3
>> Goodbye world from process 2 of 3
>> ---
>> ---
>>
>>
>> *MPI with Slurm through the queue system*
>> ---
>> ---
>> [root@slurm-master slurm]# more mpiLoop_slurm.sh
>> #!/bin/sh
>> sbatch -n 3 mpiLoop_script.sh
>>
>>
>> [root@slurm-master slurm]# more mpiLoop_script.sh
>> #!/bin/sh
>> srun ./mpiLoop 2
>>
>> [root@slurm-master slurm]# sbatch mpiLoop_slurm.sh
>>
>> [root@slurm-master slurm]# more slurm-115.out
>> Process 1 of 3 is on slurm-compute2
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-compute1
>> iteration 0 on process 0
>> Process 2 of 3 is on slurm-compute3
>> iteration 0 on process 2
>> iteration 1 on process 0
>> iteration 1 on process 1
>> iteration 1 on process 2
>> Goodbye world from process 1 of 3
>> Goodbye world from process 0 of 3
>> Goodbye world from process 2 of 3
>> ---
>> ---
>>
>>
>> *MPI with Slurm, direct submission with srun*
>> ---
>> ---
>> [root@slurm-master slurm]# srun -n 3 ./mpiLoop 2
>> Process 2 of 3 is on slurm-compute3
>> iteration 0 on process 2
>> Process 1 of 3 is on slurm-compute2
>> iteration 0 on process 1
>> Process 0 of 3 is on slurm-compute1
>> iteration 0 on process 0
>> iteration 1 on process 2
>> iteration 1 on process 1
>> iteration 1 on process 0
>> Goodbye world from process 0 of 3
>> Goodbye world from process 2 of 3
>> Goodbye world from process 1 of 3
>> ---
>> ---
>>
>>
>> 2014-11-20 10:23 GMT+01:00 Jiajun Cao <[email protected]>:
>>
>>> Hi Manuel,
>>>
>>> Sorry for the confusion. I meant the bug in DMTCP. My point is to run
>>> the application under DMTCP using MPI only, without the involvement of
>>> Slurm.
>>> If it shows the same behavior, we're almost 100% sure it's a bug in
>>> DMTCP core, not in the Slurm plugin.
>>>
>>> If you can provide me your environment, that'll be great. I can test
>>> it on my side as well.
>>>
>>> Best,
>>> Jiajun
>>>
>>> On Wed, Nov 19, 2014 at 4:28 AM, Manuel Rodríguez Pascual <
>>> [email protected]> wrote:
>>>
>>>> By "bug", do you mean this is something I am doing wrong, or a bug in
>>>> DMTCP? If so, I can provide you with the virtual machines where I am
>>>> running all this, so you can use them for debugging.
>>>>
>>>> Anyway, it is possible to run an MPI application in many different
>>>> ways. Please find the output below.
>>>>
>>>> Thanks for your help. Best regards,
>>>>
>>>> Manuel
>>>>
>>>>
>>>>
>>>> *MPI on a single machine:*
>>>> ---
>>>> ---
>>>> [root@slurm-master slurm]# mpiexec -n 3 ./mpiLoop 2
>>>> Process 2 of 3 is on slurm-master
>>>> iteration 0 on process 2
>>>> Process 1 of 3 is on slurm-master
>>>> iteration 0 on process 1
>>>> Process 0 of 3 is on slurm-master
>>>> iteration 0 on process 0
>>>> iteration 1 on process 1
>>>> iteration 1 on process 2
>>>> iteration 1 on process 0
>>>> Goodbye world from process 0 of 3
>>>> Goodbye world from process 1 of 3
>>>> Goodbye world from process 2 of 3
>>>> ---
>>>> ---
>>>>
>>>>
>>>> *MPI on multiple nodes*
>>>> ---
>>>> ---
>>>> [root@slurm-master slurm]# more machinefile
>>>> slurm-compute1
>>>> slurm-compute2
>>>> slurm-compute3
>>>>
>>>> [root@slurm-master slurm]# mpiexec -n 3 -f machinefile ./mpiLoop 2
>>>> Process 2 of 3 is on slurm-compute3
>>>> Process 1 of 3 is on slurm-compute2
>>>> iteration 0 on process 2
>>>> iteration 0 on process 1
>>>> Process 0 of 3 is on slurm-compute1
>>>> iteration 0 on process 0
>>>> iteration 1 on process 1
>>>> iteration 1 on process 2
>>>> iteration 1 on process 0
>>>> Goodbye world from process 1 of 3
>>>> Goodbye world from process 0 of 3
>>>> Goodbye world from process 2 of 3
>>>> ---
>>>> ---
>>>>
>>>>
>>>> *MPI with SLURM through queue system.*
>>>> ---
>>>> ---
>>>> [root@slurm-master slurm]# more mpiLoop_slurm.sh
>>>> #!/bin/sh
>>>> sbatch -n 3 mpiLoop_script.sh
>>>>
>>>>
>>>> [root@slurm-master slurm]# more mpiLoop_script.sh
>>>> #!/bin/sh
>>>> srun ./mpiLoop 2
>>>>
>>>> [root@slurm-master slurm]# sbatch mpiLoop_slurm.sh
>>>>
>>>> [root@slurm-master slurm]# more slurm-115.out
>>>> Process 1 of 3 is on slurm-compute2
>>>> iteration 0 on process 1
>>>> Process 0 of 3 is on slurm-compute1
>>>> iteration 0 on process 0
>>>> Process 2 of 3 is on slurm-compute3
>>>> iteration 0 on process 2
>>>> iteration 1 on process 0
>>>> iteration 1 on process 1
>>>> iteration 1 on process 2
>>>> Goodbye world from process 1 of 3
>>>> Goodbye world from process 0 of 3
>>>> Goodbye world from process 2 of 3
>>>> ---
>>>> ---
>>>>
>>>>
>>>> *MPI with Slurm, direct submission with srun*
>>>> ---
>>>> ---
>>>> [root@slurm-master slurm]# srun -n 3 ./mpiLoop 2
>>>> Process 2 of 3 is on slurm-compute3
>>>> iteration 0 on process 2
>>>> Process 1 of 3 is on slurm-compute2
>>>> iteration 0 on process 1
>>>> Process 0 of 3 is on slurm-compute1
>>>> iteration 0 on process 0
>>>> iteration 1 on process 2
>>>> iteration 1 on process 1
>>>> iteration 1 on process 0
>>>> Goodbye world from process 0 of 3
>>>> Goodbye world from process 2 of 3
>>>> Goodbye world from process 1 of 3
>>>> ---
>>>> ---
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2014-11-18 19:58 GMT+01:00 Jiajun Cao <[email protected]>:
>>>>
>>>>> Hi Manuel,
>>>>>
>>>>> Is it possible for you to run the application using only MPI
>>>>> (without SLURM)? I'm asking because DMTCP has a plugin for SLURM, and I
>>>>> want to isolate the plugin from DMTCP core. This can help us locate the
>>>>> bug
>>>>> more precisely.
>>>>>
>>>>> Best,
>>>>> Jiajun
>>>>>
>>>>> On Tue, Nov 18, 2014 at 4:17 AM, Manuel Rodríguez Pascual <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Well, it is in fact a virtual KVM cluster inside my local PC, so I
>>>>>> would say it's ethernet.
>>>>>>
>>>>>> *ifconfig (master, computing nodes change IP and MAC):*
>>>>>> eth0 Link encap:Ethernet HWaddr 02:00:C0:A8:7A:01
>>>>>> inet addr:192.168.122.2 Bcast:192.168.122.255
>>>>>> Mask:255.255.255.0
>>>>>> inet6 addr: fe80::c0ff:fea8:7a01/64 Scope:Link
>>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>>> RX packets:338 errors:0 dropped:0 overruns:0 frame:0
>>>>>> TX packets:243 errors:0 dropped:0 overruns:0 carrier:0
>>>>>> collisions:0 txqueuelen:1000
>>>>>> RX bytes:26352 (25.7 KiB) TX bytes:23475 (22.9 KiB)
>>>>>> Interrupt:10
>>>>>>
>>>>>> *iptables (master and computing nodes)*
>>>>>> [root@slurm-master ~]# iptables -L
>>>>>> Chain INPUT (policy ACCEPT)
>>>>>> target prot opt source destination
>>>>>> Chain FORWARD (policy ACCEPT)
>>>>>> target prot opt source destination
>>>>>> Chain OUTPUT (policy ACCEPT)
>>>>>> target prot opt source destination
>>>>>>
>>>>>>
>>>>>> I can SSH passwordlessly as root from the master to the computing
>>>>>> nodes. I cannot from the computing nodes to the master.
>>>>>>
>>>>>> Regarding *user configuration*, I am running:
>>>>>> -slurmctld on master as user slurm
>>>>>> -slurmd on computing nodes as user root
>>>>>> -dmtcp_coordinator on master as user root
>>>>>> -dmtcp_launch on master both as user slurm and root (same results)
>>>>>>
>>>>>> DMTCP has been installed on both the master and the computing nodes,
>>>>>> same version. I am compiling it with no flags, or just the debug ones.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014-11-17 23:01 GMT+01:00 Jiajun Cao <[email protected]>:
>>>>>>
>>>>>>> Hi Manuel,
>>>>>>>
>>>>>>> What kind of network is used in the cluster? Ethernet or
>>>>>>> InfiniBand?
>>>>>>>
>>>>>>> On Mon, Nov 17, 2014 at 2:52 PM, Gene Cooperman <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Jiajun,
>>>>>>>> Could you respond to this, since you've been extending our
>>>>>>>> support
>>>>>>>> for MPI?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> - Gene
>>>>>>>>
>>>>>>>> On Mon, Nov 17, 2014 at 06:02:34PM +0100, Manuel Rodríguez Pascual
>>>>>>>> wrote:
>>>>>>>> > Good morning list,
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I am a newbie with DMTCP, so probably this is something obvious.
>>>>>>>> > Anyway, I am not able to checkpoint MPI applications. Instead, I
>>>>>>>> > receive an error. I have looked on the internet but still haven't
>>>>>>>> > been able to solve it.
>>>>>>>> >
>>>>>>>> > -MPI and sequential applications work fine without DMTCP
>>>>>>>> > -DMTCP works fine when running a sequential application on the
>>>>>>>> > master and restoring it
>>>>>>>> > ...but it crashes when checkpointing a parallel app.
>>>>>>>> >
>>>>>>>> > When I execute my code (a simple loop from 1 to 50, to detect the
>>>>>>>> > moment of checkpoint) with
>>>>>>>> >
>>>>>>>> > (one tab) dmtcp_coordinator
>>>>>>>> > (other tab) dmtcp_launch --rm srun -n 2 mpiLoop 50
>>>>>>>> >
>>>>>>>> > and then checkpoint with "c" in the coordinator tab, it does not
>>>>>>>> > work. Instead, what happens is that the application starts printing
>>>>>>>> > the same error message while it is running in the background. And
>>>>>>>> > when the execution of the MPI code has finished, all the output is
>>>>>>>> > returned and the system kind of halts until I manually stop it.
>>>>>>>> >
>>>>>>>> > Below you can find all the information that may be relevant:
>>>>>>>> > software stack, output from the app and the coordinator, and
>>>>>>>> > output when executed in debug mode.
>>>>>>>> > Anyway, I suspect that this is probably due to me not knowing how
>>>>>>>> to
>>>>>>>> > install, configure or use the application.
>>>>>>>> >
>>>>>>>> > Thanks for your help,
>>>>>>>> >
>>>>>>>> > Manuel
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > My software stack is:
>>>>>>>> > CentOS 6 Virtual Machine
>>>>>>>> > Slurm: slurm 14.03.10
>>>>>>>> > MPI: mpich-3.1.3
>>>>>>>> > dmtcp_coordinator (DMTCP) 2.3.1
>>>>>>>> > ->1 master node
>>>>>>>> > -> 3 working nodes. master is not a working node
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I have tried running dmtcp_coordinator only on the master, and
>>>>>>>> > also on both the master and the worker nodes, with identical
>>>>>>>> > results.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Output in application:
>>>>>>>> > ----
>>>>>>>> > ----
>>>>>>>> >
>>>>>>>> > [slurm@slurm-master ~]$ dmtcp_launch --rm srun -n 2 mpiLoop 50
>>>>>>>> > [42000] TRACE at rm_main.cpp:38 in dmtcp_event_hook;
>>>>>>>> REASON='Start'
>>>>>>>> > Process 0 of 2 is on slurm-compute1
>>>>>>>> > iteration 0 on process 0
>>>>>>>> > Process 1 of 2 is on slurm-compute2
>>>>>>>> > iteration 0 on process 1
>>>>>>>> >
>>>>>>>> > (start checkpoint here)
>>>>>>>> >
>>>>>>>> > [42000] WARNING at kernelbufferdrainer.cpp:124 in
>>>>>>>> onTimeoutInterval;
>>>>>>>> > REASON='JWARNING(false) failed'
>>>>>>>> > _dataSockets[i]->socket().sockfd() = 19
>>>>>>>> > buffer.size() = 196
>>>>>>>> > WARN_INTERVAL_SEC = 10
>>>>>>>> > Message: Still draining socket... perhaps remote host is not
>>>>>>>> running under
>>>>>>>> > DMTCP?
>>>>>>>> > ----
>>>>>>>> > ----
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I keep receiving the same error every 10 seconds until the
>>>>>>>> execution is
>>>>>>>> > supposed to have finished. Then, the execution *doesn't* finish,
>>>>>>>> and I have
>>>>>>>> > to stop it manually with CTRL+C
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Output in coordinator:
>>>>>>>> > ----
>>>>>>>> > ----
>>>>>>>> > dmtcp_coordinator starting...
>>>>>>>> > Host: slurm-master (192.168.122.11)
>>>>>>>> > Port: 7779
>>>>>>>> > Checkpoint Interval: disabled (checkpoint manually instead)
>>>>>>>> > Exit on last client: 0
>>>>>>>> > Type '?' for help.
>>>>>>>> >
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>>>> REASON='worker
>>>>>>>> > connected'
>>>>>>>> > hello_remote.from = 6db90f3d5a9dd200-8271-546a25f2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData;
>>>>>>>> REASON='Updating
>>>>>>>> > process Information after exec()'
>>>>>>>> > progname = srun
>>>>>>>> > msg.from = 6db90f3d5a9dd200-40000-546a25f2
>>>>>>>> > client->identity() = 6db90f3d5a9dd200-8271-546a25f2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>>>> REASON='worker
>>>>>>>> > connected'
>>>>>>>> > hello_remote.from = 6db90f3d5a9dd200-40000-546a25f2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData;
>>>>>>>> REASON='Updating
>>>>>>>> > process Information after fork()'
>>>>>>>> > client->hostname() = slurm-master
>>>>>>>> > client->progname() = srun_(forked)
>>>>>>>> > msg.from = 6db90f3d5a9dd200-41000-546a25f2
>>>>>>>> > client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>>>>>>>> REASON='client
>>>>>>>> > disconnected'
>>>>>>>> > client->identity() = 6db90f3d5a9dd200-41000-546a25f2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect;
>>>>>>>> REASON='client
>>>>>>>> > disconnected'
>>>>>>>> > client->identity() = 6db90f3d5a9dd200-40000-546a25f2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>>>> REASON='worker
>>>>>>>> > connected'
>>>>>>>> > hello_remote.from = 6db90f3d5a9dd200-8323-546a2609
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:825 in onData;
>>>>>>>> REASON='Updating
>>>>>>>> > process Information after exec()'
>>>>>>>> > progname = srun
>>>>>>>> > msg.from = 6db90f3d5a9dd200-42000-546a2609
>>>>>>>> > client->identity() = 6db90f3d5a9dd200-8323-546a2609
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1040 in onConnect;
>>>>>>>> REASON='worker
>>>>>>>> > connected'
>>>>>>>> > hello_remote.from = 6db90f3d5a9dd200-42000-546a2609
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:816 in onData;
>>>>>>>> REASON='Updating
>>>>>>>> > process Information after fork()'
>>>>>>>> > client->hostname() = slurm-master
>>>>>>>> > client->progname() = srun_(forked)
>>>>>>>> > msg.from = 6db90f3d5a9dd200-43000-546a2609
>>>>>>>> > client->identity() = 6db90f3d5a9dd200-42000-546a2609
>>>>>>>> >
>>>>>>>> > (checkpoint)
>>>>>>>> >
>>>>>>>> > c
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint;
>>>>>>>> > REASON='starting checkpoint, suspending all nodes'
>>>>>>>> > s.numPeers = 2
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint;
>>>>>>>> > REASON='Incremented Generation'
>>>>>>>> > compId.generation() = 1
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState;
>>>>>>>> > REASON='locking all nodes'
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState;
>>>>>>>> > REASON='draining all nodes'
>>>>>>>> >
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState;
>>>>>>>> > REASON='checkpointing all nodes'
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState;
>>>>>>>> > REASON='building name service database'
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState;
>>>>>>>> > REASON='entertaining queries now'
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState;
>>>>>>>> > REASON='refilling all nodes'
>>>>>>>> > [8270] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState;
>>>>>>>> > REASON='restarting all nodes'
>>>>>>>> > ----
>>>>>>>> > ----
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I have executed it in debug mode too, after compiling with
>>>>>>>> > ./configure --enable-debug && make -j5 clean && make -j5
>>>>>>>> >
>>>>>>>> > The output is immense but not very helpful for me with my limited
>>>>>>>> > knowledge. I have uploaded it to pastebin:
>>>>>>>> >
>>>>>>>> > -coordinator output : http://pastebin.com/4m5REy28
>>>>>>>> > -application output : http://pastebin.com/inxmfvCc
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Dr. Manuel Rodríguez-Pascual
>>>>>>>> > skype: manuel.rodriguez.pascual
>>>>>>>> > phone: (+34) 913466173 // (+34) 679925108
>>>>>>>> >
>>>>>>>> > CIEMAT-Moncloa
>>>>>>>> > Edificio 22, desp. 1.25
>>>>>>>> > Avenida Complutense, 40
>>>>>>>> > 28040- MADRID
>>>>>>>> > SPAIN
>>>>>>>>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dr. Manuel Rodríguez-Pascual
>>>>>> skype: manuel.rodriguez.pascual
>>>>>> phone: (+34) 913466173 // (+34) 679925108
>>>>>>
>>>>>> CIEMAT-Moncloa
>>>>>> Edificio 22, desp. 1.25
>>>>>> Avenida Complutense, 40
>>>>>> 28040- MADRID
>>>>>> SPAIN
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Manuel Rodríguez-Pascual
>>>> skype: manuel.rodriguez.pascual
>>>> phone: (+34) 913466173 // (+34) 679925108
>>>>
>>>> CIEMAT-Moncloa
>>>> Edificio 22, desp. 1.25
>>>> Avenida Complutense, 40
>>>> 28040- MADRID
>>>> SPAIN
>>>>
>>>
>>>
>>
>>
>> --
>> Dr. Manuel Rodríguez-Pascual
>> skype: manuel.rodriguez.pascual
>> phone: (+34) 913466173 // (+34) 679925108
>>
>> CIEMAT-Moncloa
>> Edificio 22, desp. 1.25
>> Avenida Complutense, 40
>> 28040- MADRID
>> SPAIN
>>
>
>
--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN