Good work! I can use DMTCP to checkpoint and restart a regular CPU-bound
program. However, I cannot use it to restart a CUDA program. When I checkpoint
it, the following message appears:
NOTE at writeckpt.cpp:513 in preprocess_special_segments; REASON='bottom-most 
page of stack (page with highest address) was invisible in /proc/self/maps. It 
is made visible again now.'
When I try to restart from its checkpoint file, the following ERROR message
appears:
[40000] ERROR at fileconnlist.cpp:224 in remapShmMaps; REASON='JASSERT(addr != 
MAP_FAILED) failed'
     area->flags = 17
     area->prot = 3
     (strerror((*__errno_location ()))) = Invalid argument
Message: mmap failed
vectorAdd (40000): Terminating...
The coordinator prints the following:
[4712] NOTE at dmtcp_coordinator.cpp:1265 in startCheckpoint; REASON='starting 
checkpoint, suspending all nodes'
     s.numPeers = 1
[4712] NOTE at dmtcp_coordinator.cpp:1267 in startCheckpoint; 
REASON='Incremented Generation'
     compId.generation() = 1
[4712] NOTE at dmtcp_coordinator.cpp:614 in updateMinimumState; REASON='locking 
all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:620 in updateMinimumState; 
REASON='draining all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:626 in updateMinimumState; 
REASON='checkpointing all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:640 in updateMinimumState; 
REASON='building name service database'
[4712] NOTE at dmtcp_coordinator.cpp:656 in updateMinimumState; 
REASON='entertaining queries now'
[4712] NOTE at dmtcp_coordinator.cpp:661 in updateMinimumState; 
REASON='refilling all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:692 in updateMinimumState; 
REASON='restarting all nodes'
k
[4712] NOTE at dmtcp_coordinator.cpp:568 in handleUserCommand; REASON='Killing 
all connected Peers...'
[4712] NOTE at dmtcp_coordinator.cpp:871 in onDisconnect; REASON='client 
disconnected'
     client->identity() = d865de4dcd23d52-40000-535e196b
[4712] NOTE at dmtcp_coordinator.cpp:1092 in validateRestartingWorkerProcess; 
REASON='FIRST dmtcp_restart connection.  Set numPeers. Generate timestamp'
     numPeers = 1
     curTimeStamp = 22378813428
     compId = d865de4dcd23d52-40000-535e196b
[4712] NOTE at dmtcp_coordinator.cpp:1036 in onConnect; REASON='worker 
connected'
     hello_remote.from = d865de4dcd23d52-40000-535e196b
[4712] NOTE at dmtcp_coordinator.cpp:651 in updateMinimumState; 
REASON='building name service database (after restart)'
[4712] NOTE at dmtcp_coordinator.cpp:656 in updateMinimumState; 
REASON='entertaining queries now'
[4712] NOTE at dmtcp_coordinator.cpp:661 in updateMinimumState; 
REASON='refilling all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:692 in updateMinimumState; 
REASON='restarting all nodes'
[4712] NOTE at dmtcp_coordinator.cpp:871 in onDisconnect; REASON='client 
disconnected'
     client->identity() = d865de4dcd23d52-40000-535e196b

I also tried --disable-all-plugins for debugging. I went through the same
steps (run, checkpoint, kill, restart), and the following message appears
right after the restart command:
[4763] mtcp_restart.c:955 read_shared_memory_area_from_file:
  error 22 mapping /dev/nvidia0 offset 383467520 at 0x7f9ff9204000
Segmentation fault (core dumped)
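
For reference, the numeric values in the two failures above can be decoded
with a small script. This is only a sketch: the constants are the usual Linux
x86-64 values from <sys/mman.h> and the 4096-byte page size is an assumption
about this machine. The helper name decode_bits is made up for illustration
and is not part of DMTCP.

```python
import errno
import os

# Assumed Linux x86-64 values from <sys/mman.h>; other platforms may differ.
MAP_FLAGS = {
    0x01: "MAP_SHARED",
    0x02: "MAP_PRIVATE",
    0x10: "MAP_FIXED",
    0x20: "MAP_ANONYMOUS",
}
PROT_FLAGS = {
    0x1: "PROT_READ",
    0x2: "PROT_WRITE",
    0x4: "PROT_EXEC",
}
PAGE_SIZE = 4096  # assumed page size


def decode_bits(value, names):
    """Return the names of all bits from `names` that are set in `value`."""
    return [name for bit, name in sorted(names.items()) if value & bit]


# Values copied from the messages above.
area_flags = 17               # area->flags
area_prot = 3                 # area->prot
mmap_errno = 22               # "error 22 mapping /dev/nvidia0"
file_offset = 383467520       # offset reported by mtcp_restart
map_address = 0x7f9ff9204000  # address reported by mtcp_restart

print("flags :", " | ".join(decode_bits(area_flags, MAP_FLAGS)))
print("prot  :", " | ".join(decode_bits(area_prot, PROT_FLAGS)))
print("errno :", errno.errorcode[mmap_errno], "-", os.strerror(mmap_errno))
print("offset page-aligned :", file_offset % PAGE_SIZE == 0)
print("address page-aligned:", map_address % PAGE_SIZE == 0)
```

Under those assumptions, both failures are the same kind of request: a shared,
fixed-address, read/write mapping that mmap rejects with EINVAL, and in the
second case the file being mapped is /dev/nvidia0.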

Could you help?


