Hi All! I'm back using DMTCP!
I'm having a problem when restarting a checkpoint.
I have two nodes (PCs) on an Ethernet LAN, with:
-Debian 8 Jessie,
-DMTCP 2.4.2 (configured with --enable-timing),
-OpenMPI 1.10.1.
I do:
$ dmtcp_launch mpirun -np 8 -hostfile hosts app_heat_512
On the console where the coordinator is running, I press 'c' to
checkpoint. After that, once the application has been killed or has
finished, I run the restart script from the same directory where the
checkpoints are stored, with the following output:
$ ./dmtcp_restart_script.sh
[75000] WARNING at socketconnection.cpp:540 in postRestart;
REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
&_bindAddr,_bindAddrlen) == 0) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 1310c955e7a-75000-564d1a58(99506)
Message: Bind failed.
[77000] WARNING at socketconnection.cpp:540 in postRestart;
REASON='JWARNING(_real_bind(_fds[0], (sockaddr*)
&_bindAddr,_bindAddrlen) == 0) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 1310c955e7a-77000-564d1a58(99517)
Message: Bind failed.
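For context on what "Address already in use" means here: it is the EADDRINUSE errno returned by bind(). The sketch below is not DMTCP code and does not claim to identify the cause of the warnings above. It only reproduces the same errno in plain Python, on the assumption that some other socket still holds the address at restart time. The helper name second_bind_errno is hypothetical.

```python
import errno
import socket


def second_bind_errno():
    """Bind two TCP sockets to the same address and return the second bind's errno.

    Hypothetical helper for illustration only; returns 0 if the second bind succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 lets the kernel pick a free port for the first socket.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port, no SO_REUSEADDR/SO_REUSEPORT: expected to fail.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code == errno.EADDRINUSE, errno.errorcode.get(code))
```

If that assumption holds, checking which processes from the original run are still alive on each node before restarting would be a natural first step.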
The coordinator console shows this:
[762] NOTE at dmtcp_coordinator.cpp:1137 in
validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart
connection. Set numPeers. Generate timestamp'
numPeers = 12
curTimeStamp = 23166315138
compId = 1310c955e7a-66000-564d1a57
[762] WARNING at jtimer.h:81 in start; REASON='JWARNING(!_isStarted) failed'
_name = restart
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
hello_remote.from = 1310c955e7a-66000-564d1a57
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
hello_remote.from = 1310c955e7a-67000-564d1a57
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
hello_remote.from = 1310c955e7a-71000-564d1a58
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
hello_remote.from = 1310c955e7a-73000-564d1a58
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
hello_remote.from = 1310c955e7a-77000-564d1a58
[762] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
hello_remote.from = 1310c955e7a-75000-564d1a58
And when I pressed 'l' to list the connected clients:
l
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
64, orterun[66000:1530]@m110a, 1310c955e7a-66000-564d1a57, CHECKPOINTED
65, dmtcp_ssh[67000:1618]@m110a, 1310c955e7a-67000-564d1a57, CHECKPOINTED
66, app_heat_512[71000:1619]@m110a, 1310c955e7a-71000-564d1a58, CHECKPOINTED
67, app_heat_512[73000:1620]@m110a, 1310c955e7a-73000-564d1a58, CHECKPOINTED
68, app_heat_512[77000:1622]@m110a, 1310c955e7a-77000-564d1a58, CHECKPOINTED
69, app_heat_512[75000:1621]@m110a, 1310c955e7a-75000-564d1a58, CHECKPOINTED
It seems to hang... The restart never finishes.
I hope this is just something I forgot...
Thanks all in advance,
Regards
Marina