Good work! I can use DMTCP to checkpoint and restart a regular CPU-bound program.
However, I cannot use it to restart a CUDA program. When I checkpoint it,
I get this message:
NOTE at writeckpt.cpp:513 in preprocess_special_segments; REASON='bottom-most
page of stack (page with highest address) was invisible in /proc/self/maps. It
is made visible again now.'
When I try to restart from the checkpoint file, I get this ERROR message:
[40000] ERROR at fileconnlist.cpp:224 in remapShmMaps; REASON='JASSERT(addr !=
MAP_FAILED) failed'
area->flags = 17
area->prot = 3
(strerror((*__errno_location ()))) = Invalid argument
Message: mmap failed
vectorAdd (40000): Terminating...
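To make the numbers in that error easier to read, here is a small sketch that decodes them. It assumes the usual Linux values for the mmap constants (hard-coded below, not taken from DMTCP), and the helper name decode_bits is made up for illustration.

```python
import errno

# Assumed Linux values from <sys/mman.h>; hard-coded here for illustration only.
MAP_FLAGS = {0x01: "MAP_SHARED", 0x02: "MAP_PRIVATE",
             0x10: "MAP_FIXED", 0x20: "MAP_ANONYMOUS"}
PROT_BITS = {0x1: "PROT_READ", 0x2: "PROT_WRITE", 0x4: "PROT_EXEC"}


def decode_bits(value, table):
    """Return the names of the known bits set in value (hypothetical helper)."""
    names = [name for bit, name in sorted(table.items()) if value & bit]
    return "|".join(names) if names else "0"


flags_text = decode_bits(17, MAP_FLAGS)   # area->flags from the error above
prot_text = decode_bits(3, PROT_BITS)     # area->prot from the error above
errno_name = errno.errorcode[22]          # symbolic name for errno 22

print("flags =", flags_text)
print("prot  =", prot_text)
print("errno =", errno_name)
```

Under those assumptions, the failing call is a shared, fixed-address, read/write mapping, and the errno name matches the "Invalid argument" text in the log.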
The coordinator shows:
[4712] NOTE at dmtcp_coordinator.cpp:1265 in startCheckpoint; REASON='starting
checkpoint, suspending all nodes'
s.numPeers = 1
[4712] NOTE at dmtcp_coordinator.cpp:1267 in startCheckpoint;
REASON='Incremented Generation'
compId.generation() = 1
[4712] NOTE at dmtcp_coordinator.cpp:614 in updateMinimumState; REASON='locking
all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:620 in updateMinimumState;
REASON='draining all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:626 in updateMinimumState;
REASON='checkpointing all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:640 in updateMinimumState;
REASON='building name service database'
[4712] NOTE at dmtcp_coordinator.cpp:656 in updateMinimumState;
REASON='entertaining queries now'
[4712] NOTE at dmtcp_coordinator.cpp:661 in updateMinimumState;
REASON='refilling all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:692 in updateMinimumState;
REASON='restarting all nodes'
k
[4712] NOTE at dmtcp_coordinator.cpp:568 in handleUserCommand; REASON='Killing
all connected Peers...'
[4712] NOTE at dmtcp_coordinator.cpp:871 in onDisconnect; REASON='client
disconnected'
client->identity() = d865de4dcd23d52-40000-535e196b
[4712] NOTE at dmtcp_coordinator.cpp:1092 in validateRestartingWorkerProcess;
REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
numPeers = 1
curTimeStamp = 22378813428
compId = d865de4dcd23d52-40000-535e196b
[4712] NOTE at dmtcp_coordinator.cpp:1036 in onConnect; REASON='worker
connected'
hello_remote.from = d865de4dcd23d52-40000-535e196b
[4712] NOTE at dmtcp_coordinator.cpp:651 in updateMinimumState;
REASON='building name service database (after restart)'
[4712] NOTE at dmtcp_coordinator.cpp:656 in updateMinimumState;
REASON='entertaining queries now'
[4712] NOTE at dmtcp_coordinator.cpp:661 in updateMinimumState;
REASON='refilling all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:692 in updateMinimumState;
REASON='restarting all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:871 in onDisconnect; REASON='client
disconnected'
client->identity() = d865de4dcd23d52-40000-535e196b
I then tried --disable-all-plugins for debugging and went through the same
steps: run, checkpoint, kill, restart. This message shows up right after the
restart command:
[4763] mtcp_restart.c:955 read_shared_memory_area_from_file:
error 22 mapping /dev/nvidia0 offset 383467520 at 0x7f9ff9204000
Segmentation fault (core dumped)
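In case it helps narrow this down, here is a sketch for listing the device-backed mappings of a process before it is checkpointed. It parses text in the /proc/<pid>/maps format. The function name and the sample text are hypothetical: only the start address, the offset, and the /dev/nvidia0 path come from the error above, while the end address, device numbers, and inodes are made up.

```python
def device_mappings(maps_text):
    """Yield (start, offset, perms, path) for mappings backed by files under /dev/."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no path column
        addr_range, perms, offset, _dev, _inode, path = fields
        path = path.strip()
        if path.startswith("/dev/"):
            start = int(addr_range.split("-")[0], 16)
            yield start, int(offset, 16), perms, path


# Hypothetical sample in /proc/<pid>/maps format (see the note above).
SAMPLE = (
    "00400000-0040b000 r-xp 00000000 08:01 131090 /bin/cat\n"
    "7f9ff9204000-7f9ff9205000 rw-s 16db4000 00:05 1234 /dev/nvidia0\n"
    "7f9ff9300000-7f9ff9400000 rw-p 00000000 00:00 0\n"
    "7ffd5c1e0000-7ffd5c201000 rw-p 00000000 00:00 0 [stack]\n"
)

found = list(device_mappings(SAMPLE))
for start, offset, perms, path in found:
    print(hex(start), offset, perms, path)
```

For a live process, the same function can be given the contents of /proc/<pid>/maps. Note that it also matches other paths under /dev/, not only the NVIDIA device files.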
Could you help?
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum