I've been playing around with some examples in contrib/python on the latest
master.
This led me to discover the minor issue fixed in 510c026.
However, as I pressed further, I found that any checkpoints I tried to
initiate within python were not finishing (the function call never
returned). Through the wonders of git-bisect, I ended up squarely on
823d096.
Upon inspection, I discovered:
-void dmtcp::userHookTrampoline_postCkpt(bool isRestart)
-{
- //this function runs before other threads are resumed
- if(isRestart){
- numRestarts++;
- if(userHookPostRestart != NULL)
- (*userHookPostRestart)();
- }else{
- numCheckpoints++;
- if(userHookPostCheckpoint != NULL)
- (*userHookPostCheckpoint)();
- }
-}
The removal of userHookTrampoline_postCkpt also removes calls which
increment the static variables numRestarts and numCheckpoints.
These are used later in dmtcpplugin.cpp @ 113:
if(dmtcpRunCommand('c')){ //request checkpoint
//and wait for the checkpoint
while(oldNumRestarts==numRestarts && oldNumCheckpoints==numCheckpoints){
//nanosleep should get interrupted by checkpointing with an EINTR error
//though there is a race to get to nanosleep() before the checkpoint
struct timespec t = {1,0};
nanosleep(&t, NULL);
memfence(); //make sure the loop condition doesn't get optimized
}
rv = (oldNumRestarts==numRestarts ? DMTCP_AFTER_CHECKPOINT :
DMTCP_AFTER_RESTART);
}
The logic here suggests that if either numRestarts or numCheckpoints is not
incremented, you will be stuck in the while loop. Having grep'd the code,
I can't find any other place where the incrementing happens. I also recall
from running strace that my checkpoint was stuck making nanosleep calls.
So, everything seems to confirm my suspicions.
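To make the failure mode concrete, here is a minimal standalone sketch of the same wait logic. It is not DMTCP code: the Counters struct, the postCkptHook parameter, and the maxPolls bound are all hypothetical names introduced for illustration, and the bound exists only so the stuck case can be observed instead of spinning forever.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the static counters in dmtcpplugin.cpp.
struct Counters {
  int numCheckpoints = 0;
  int numRestarts = 0;
};

enum WaitResult { AFTER_CHECKPOINT, AFTER_RESTART, STILL_WAITING };

// Models the wait loop quoted above. postCkptHook stands in for whatever
// runs when a checkpoint or restart completes; maxPolls is an artificial
// bound (the real loop has none).
WaitResult waitForCheckpoint(Counters &c,
                             const std::function<void(Counters &)> &postCkptHook,
                             int maxPolls) {
  assert(maxPolls > 0);
  const int oldNumRestarts = c.numRestarts;
  const int oldNumCheckpoints = c.numCheckpoints;
  for (int poll = 0; poll < maxPolls; ++poll) {
    // The real code sleeps in nanosleep() here while the checkpoint runs.
    postCkptHook(c);
    if (c.numRestarts != oldNumRestarts) return AFTER_RESTART;
    if (c.numCheckpoints != oldNumCheckpoints) return AFTER_CHECKPOINT;
  }
  return STILL_WAITING;  // the real loop would keep polling instead
}
```

With a hook that bumps one of the counters, the function returns on the first poll. With a hook that does nothing (which is how I read the state of the code after 823d096), it exhausts maxPolls, which corresponds to the endless nanosleep calls I saw in strace.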
I'm not sure what the proper solution is. I added a quick snippet to
dmtcpplugin.cpp:
void dmtcp::increment_counters(bool isRestart)
{
if (isRestart)
{
numRestarts++;
}
else
{
numCheckpoints++;
}
}
and called it from mtcpinterface.cpp:
DmtcpWorker::waitForStage4Resume(isRestart);
WorkerState::setCurrentState( WorkerState::RUNNING );
increment_counters(isRestart);
if (dmtcp_is_ptracing == NULL || !dmtcp_is_ptracing()) {
// Inform Coordinator of our RUNNING state;
// If running under ptrace, lets do this in sleep-between-ckpt callback
DmtcpWorker::informCoordinatorOfRUNNINGState();
}
This did the trick, but I'm not sure it is the cleanest solution, or even
what the authors intended.
I would appreciate it if someone could confirm that this is an issue, or
otherwise let me know if I'm in error.
I'm happy to help implement a fix, but I may need to study the code more to
come up with something more appropriate than the above hack.