Hi,
Instead of passing the host and port on the command line, I am using the
-machinefile option when launching the application.
For example:
$ mpirun -np 10 -machinefile hosts ./executable
where hosts is the machine file.
The machine file simply contains the host names, one per line.
Example:
host1
host2
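
In case it helps anyone reproduce the setup, here is a minimal sketch of how
the machine file can be created and sanity-checked before launching. The host
names are just the placeholders from the example above, and the mpirun line is
only printed here, not executed.

```shell
# Write a two-line machine file (host1/host2 are placeholder names)
printf '%s\n' host1 host2 > hosts

# Count the non-empty entries so a stray blank line or typo is caught early
nhosts=$(grep -c . hosts)
echo "hosts file lists $nhosts machine(s)"

# The launch command from this message, printed rather than run
echo "mpirun -np 10 -machinefile hosts ./executable"
```
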
Thanks & Regards
Manisha Chauhan
On Fri, Apr 5, 2013 at 11:13 AM, manisha chauhan <[email protected]> wrote:
> Hi,
>
> I installed DMTCP on my servers and am trying to checkpoint an
> application. I have two machines, say "machine1" and "machine2".
>
> I run dmtcp_coordinator and my application from machine1. To start the
> application I use the command:
>
>
>
> dmtcp_checkpoint ./executablename
>
>
>
> I run the application with 10 processes in total, 5 on machine1 and the
> other 5 on machine2. This works fine. When I type the "l" command (List
> connected nodes) in dmtcp_coordinator after giving the above command, it
> shows:
>
>
>
> Client List:
> #, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
> 1, bash[497]@power1, 195f062c3e8d8-497-515d2153, RUNNING
> 21, tee[525]@power1, 195f062c3e8d8-525-515d2153, RUNNING
> 20, mpiexec[524]@power1, 195f062c3e8d8-524-515d2153, RUNNING
> 24, hydra_pmi_proxy[530]@power1, 195f062c3e8d8-530-515d2153, RUNNING
> 27, xhpl_intel64[535]@power1, 195f062c3e8d8-535-515d2153, RUNNING
> 29, xhpl_intel64[537]@power1, 195f062c3e8d8-537-515d2153, RUNNING
> 31, xhpl_intel64[540]@power1, 195f062c3e8d8-540-515d2153, RUNNING
> 33, xhpl_intel64[543]@power1, 195f062c3e8d8-543-515d2153, RUNNING
> 34, xhpl_intel64[546]@power1, 195f062c3e8d8-546-515d2153, RUNNING
> 35, hydra_pmi_proxy[7546]@power2, 195f062c3e8d9-7546-515d1f85, RUNNING
> 38, xhpl_intel64[7565]@power2, 195f062c3e8d9-7565-515d1f85, RUNNING
> 40, xhpl_intel64[7567]@power2, 195f062c3e8d9-7567-515d1f85, RUNNING
> 42, xhpl_intel64[7571]@power2, 195f062c3e8d9-7571-515d1f85, RUNNING
> 44, xhpl_intel64[7573]@power2, 195f062c3e8d9-7573-515d1f85, RUNNING
> 45, xhpl_intel64[7576]@power2, 195f062c3e8d9-7576-515d1f85, RUNNING
>
> All the processes are running fine.
>
> But as soon as I checkpoint the application with the "c" command, it shows
> the following message:
>
> 643] WARNING at kernelbufferdrainer.cpp:100 in onTimeoutInterval;
> REASON='JWARNING(false) failed'
> _dataSockets[i]->socket().sockfd() = 13
> buffer.size() = 0
> WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under
> DMTCP?
>
> And dmtcp_coordinator shows:
>
> 496] NOTE at dmtcp_coordinator.cpp:1316 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
> s.numPeers = 15
> [496] NOTE at dmtcp_coordinator.cpp:1318 in startCheckpoint;
> REASON='Incremented Generation'
> UniquePid::ComputationId().generation() = 1
> [496] NOTE at dmtcp_coordinator.cpp:643 in onData; REASON='locking all
> nodes'
> [496] NOTE at dmtcp_coordinator.cpp:678 in onData; REASON='draining all
> nodes'
>
>
> And the "l" command then shows:
>
> Client List:
> #, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
> 1, bash[497]@power1, 195f062c3e8d8-497-515d2153, DRAINED
> 21, tee[525]@power1, 195f062c3e8d8-525-515d2153, DRAINED
> 20, mpiexec[524]@power1, 195f062c3e8d8-524-515d2153, FD_LEADER_ELECTION
> 24, hydra_pmi_proxy[530]@power1, 195f062c3e8d8-530-515d2153, DRAINED
> 27, xhpl_intel64[535]@power1, 195f062c3e8d8-535-515d2153, DRAINED
> 29, xhpl_intel64[537]@power1, 195f062c3e8d8-537-515d2153, DRAINED
> 31, xhpl_intel64[540]@power1, 195f062c3e8d8-540-515d2153, DRAINED
> 33, xhpl_intel64[543]@power1, 195f062c3e8d8-543-515d2153, DRAINED
> 34, xhpl_intel64[546]@power1, 195f062c3e8d8-546-515d2153, DRAINED
> 35, hydra_pmi_proxy[7546]@power2, 195f062c3e8d9-7546-515d1f85, DRAINED
> 38, xhpl_intel64[7565]@power2, 195f062c3e8d9-7565-515d1f85, DRAINED
> 40, xhpl_intel64[7567]@power2, 195f062c3e8d9-7567-515d1f85, DRAINED
> 42, xhpl_intel64[7571]@power2, 195f062c3e8d9-7571-515d1f85, DRAINED
> 44, xhpl_intel64[7573]@power2, 195f062c3e8d9-7573-515d1f85, DRAINED
> 45, xhpl_intel64[7576]@power2, 195f062c3e8d9-7576-515d1f85, DRAINED
>
> I am using MPICH2-1.4.1 with the Hydra process manager. This issue only
> occurs when I checkpoint processes running across two machines. On a
> single machine, both checkpoint and restart work fine.
>
> It looks like a socket-related issue. The warning also suggests that the
> remote host is not running under DMTCP, but if that were the case, why does
> the application run fine before checkpointing? I cannot work out where I
> am making a mistake.
>
> Please help.
>
> Thanks & Regards
>
> Manisha Chauhan