Hi,
I have installed DMTCP on my servers and am trying to checkpoint an
application. I have two machines, "machine1" and "machine2". I run
dmtcp_coordinator and my application from machine1. To start the
application I use the command:
dmtcp_checkpoint ./executablename
The application runs 10 processes in total: 5 on machine1 and 5 on
machine2. This works fine. When I type the "l" command (List connected
nodes) in dmtcp_coordinator after starting the application, it shows:
Client List:
#, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
1, bash[497]@power1, 195f062c3e8d8-497-515d2153, RUNNING
21, tee[525]@power1, 195f062c3e8d8-525-515d2153, RUNNING
20, mpiexec[524]@power1, 195f062c3e8d8-524-515d2153, RUNNING
24, hydra_pmi_proxy[530]@power1, 195f062c3e8d8-530-515d2153, RUNNING
27, xhpl_intel64[535]@power1, 195f062c3e8d8-535-515d2153, RUNNING
29, xhpl_intel64[537]@power1, 195f062c3e8d8-537-515d2153, RUNNING
31, xhpl_intel64[540]@power1, 195f062c3e8d8-540-515d2153, RUNNING
33, xhpl_intel64[543]@power1, 195f062c3e8d8-543-515d2153, RUNNING
34, xhpl_intel64[546]@power1, 195f062c3e8d8-546-515d2153, RUNNING
35, hydra_pmi_proxy[7546]@power2, 195f062c3e8d9-7546-515d1f85, RUNNING
38, xhpl_intel64[7565]@power2, 195f062c3e8d9-7565-515d1f85, RUNNING
40, xhpl_intel64[7567]@power2, 195f062c3e8d9-7567-515d1f85, RUNNING
42, xhpl_intel64[7571]@power2, 195f062c3e8d9-7571-515d1f85, RUNNING
44, xhpl_intel64[7573]@power2, 195f062c3e8d9-7573-515d1f85, RUNNING
45, xhpl_intel64[7576]@power2, 195f062c3e8d9-7576-515d1f85, RUNNING
All the processes are running fine.
But as soon as I checkpoint the application with the "c" command, it shows
the following message:
643] WARNING at kernelbufferdrainer.cpp:100 in onTimeoutInterval;
REASON='JWARNING(false) failed'
_dataSockets[i]->socket().sockfd() = 13
buffer.size() = 0
WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under
DMTCP?
And dmtcp_coordinator shows:
[496] NOTE at dmtcp_coordinator.cpp:1316 in startCheckpoint;
REASON='starting checkpoint, suspending all nodes'
s.numPeers = 15
[496] NOTE at dmtcp_coordinator.cpp:1318 in startCheckpoint;
REASON='Incremented Generation'
UniquePid::ComputationId().generation() = 1
[496] NOTE at dmtcp_coordinator.cpp:643 in onData; REASON='locking all
nodes'
[496] NOTE at dmtcp_coordinator.cpp:678 in onData; REASON='draining all
nodes'
And the "l" command then shows:
Client List:
#, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
1, bash[497]@power1, 195f062c3e8d8-497-515d2153, DRAINED
21, tee[525]@power1, 195f062c3e8d8-525-515d2153, DRAINED
20, mpiexec[524]@power1, 195f062c3e8d8-524-515d2153, FD_LEADER_ELECTION
24, hydra_pmi_proxy[530]@power1, 195f062c3e8d8-530-515d2153, DRAINED
27, xhpl_intel64[535]@power1, 195f062c3e8d8-535-515d2153, DRAINED
29, xhpl_intel64[537]@power1, 195f062c3e8d8-537-515d2153, DRAINED
31, xhpl_intel64[540]@power1, 195f062c3e8d8-540-515d2153, DRAINED
33, xhpl_intel64[543]@power1, 195f062c3e8d8-543-515d2153, DRAINED
34, xhpl_intel64[546]@power1, 195f062c3e8d8-546-515d2153, DRAINED
35, hydra_pmi_proxy[7546]@power2, 195f062c3e8d9-7546-515d1f85, DRAINED
38, xhpl_intel64[7565]@power2, 195f062c3e8d9-7565-515d1f85, DRAINED
40, xhpl_intel64[7567]@power2, 195f062c3e8d9-7567-515d1f85, DRAINED
42, xhpl_intel64[7571]@power2, 195f062c3e8d9-7571-515d1f85, DRAINED
44, xhpl_intel64[7573]@power2, 195f062c3e8d9-7573-515d1f85, DRAINED
45, xhpl_intel64[7576]@power2, 195f062c3e8d9-7576-515d1f85, DRAINED
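To make it easier to see which client is holding up the checkpoint, the listing above can be filtered by state. The following is only a minimal sketch, assuming the "l" output has been pasted into a string; the helper names `parse_clients` and `stuck_clients` are hypothetical and not part of DMTCP, and the sample contains just a few lines copied from the listing.

```python
import re
from collections import Counter

# Matches client lines such as:
#   20, mpiexec[524]@power1, 195f062c3e8d8-524-515d2153, FD_LEADER_ELECTION
# The "Client List:" title and the "#, PROG[PID]@HOST, ..." header do not match.
LINE_RE = re.compile(
    r"^\s*(\d+),\s*([^\[\s]+)\[(\d+)\]@(\S+),\s*(\S+),\s*(\S+)\s*$"
)

def parse_clients(listing):
    """Parse pasted "l" output into a list of dicts (hypothetical helper)."""
    clients = []
    for line in listing.splitlines():
        m = LINE_RE.match(line)
        if m:
            _num, prog, pid, host, _uid, state = m.groups()
            clients.append(
                {"prog": prog, "pid": int(pid), "host": host, "state": state}
            )
    return clients

def stuck_clients(clients):
    """Return clients whose state differs from the most common state."""
    if not clients:
        return []
    common, _count = Counter(c["state"] for c in clients).most_common(1)[0]
    return [c for c in clients if c["state"] != common]

# A few lines copied from the listing above.
SAMPLE = """Client List:
#, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
1, bash[497]@power1, 195f062c3e8d8-497-515d2153, DRAINED
20, mpiexec[524]@power1, 195f062c3e8d8-524-515d2153, FD_LEADER_ELECTION
24, hydra_pmi_proxy[530]@power1, 195f062c3e8d8-530-515d2153, DRAINED
35, hydra_pmi_proxy[7546]@power2, 195f062c3e8d9-7546-515d1f85, DRAINED
"""

for c in stuck_clients(parse_clients(SAMPLE)):
    print(f'{c["prog"]}[{c["pid"]}]@{c["host"]} is in {c["state"]}')
```

On the listing above this points at mpiexec[524] on power1, the only client in FD_LEADER_ELECTION while every other client is DRAINED.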
I am using MPICH2-1.4.1 with the Hydra process manager. This issue occurs
only when I checkpoint processes running on two machines. On a single
machine, both checkpoint and restart work fine.
It looks like a socket-related issue. The warning also suggests that the
remote host is not running under DMTCP, but if that were the case, why does
the application run correctly before checkpointing? I am not able to find
out where I am making a mistake.
Please help.
Thanks & Regards
Manisha Chauhan
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum