Bear with me, this example might be a bit long-winded. Running with the
trunk results in a failure during checkpointing. It appears to be caused by
sub-processes not running in the same directory as the launching process.
With 1.2.6, checkpointing works but restore fails. The high-level flow looks
like this:

Works for both checkpoint and restore, with both 1.2.6 and trunk:
DMTCP run.csh --> run_sleep.csh --> sleep_ckpt

Fails during either checkpoint or restore, depending on the DMTCP version:
DMTCP run.csh --> cd run_dir --> run_sleep.csh --> ../sleep_ckpt

Failure signature during restore (1.2.6). It looks like we end up running the 
program fresh:
[28002] WARNING at connection.cpp:1160 in restore; REASON='JWARNING(false) failed'
Message: Size of file smaller than what we expected
[28002] WARNING at connection.cpp:1183 in restore; REASON='JWARNING(false) failed'
     _path = <my_path_removed>/run_sleep.csh
     _offset = 26
     _stat.st_size = 26
     buf.st_size = 25
Message: No lseek done:  offset is larger than min of old and new size.

Failure signature during checkpoint (trunk):
[40000] ERROR at fileconnection.cpp:522 in handleUnlinkedFile; REASON='JASSERT(_type == FILE_DELETED) failed'
     _path = ./run.csh
     currPath = <my_path_removed>/run.csh
Message: File not found on disk and yet the filename doesn't contain the suffix '(deleted)'

How to build sleep_ckpt.c:
setenv DMTCP_INSTALLATION <Your DMTCP here>
gcc sleep_ckpt.c -I$DMTCP_INSTALLATION/include \
-L$DMTCP_INSTALLATION/lib -ldmtcpaware \
-Xlinker -rpath -Xlinker $DMTCP_INSTALLATION/lib -g -o sleep_ckpt

How to checkpoint:
Run the program under DMTCP using run.csh, keeping the line that cds into run_dir.
Request a checkpoint with "kill -s ALRM <pid of sleep_ckpt>".

Sources:
/*** sleep_ckpt.c ***/
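#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>   /* sleep() */
#include <dmtcpaware.h>

#define SLEEP_SEC 10
/* Written from the signal handler, so it must be volatile sig_atomic_t */
static volatile sig_atomic_t ckpt_requested = 0;

/* SIGALRM handler: only records that a checkpoint was requested */
void ckpt_handler(int signum) {
    (void)signum;
    ckpt_requested = 1;
}

int main(int argc, char *argv[]) {
    int i;

    (void)argc;
    (void)argv;

    /* Setup signal handler for doing a checkpoint */
    signal(SIGALRM, ckpt_handler);

    printf("Sleeping for %d seconds\n", SLEEP_SEC);
    for (i = 0; i < SLEEP_SEC; i++) {
        sleep(1);
        /* Checkpoint outside the handler, and only when running under DMTCP */
        if (ckpt_requested && dmtcpIsEnabled()) {
            printf("Checkpointing at %d seconds\n", i);
            dmtcpCheckpoint();
            ckpt_requested = 0;
        }
    }
    return 0;
}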

/*** run.csh ***/
#!/bin/csh

# This cd is what causes the problem; running the script from the local directory is fine
cd ./run_dir
./run_sleep.csh

/*** run_sleep.csh ***/
#!/bin/csh

./sleep_ckpt

/*** run_dir/run_sleep.csh ***/
#!/bin/csh

../sleep_ckpt


Joshua Louie