Hi Joshua,

I know what the problem is (at least with the trunk). When "./run.csh" was open()'d by /bin/csh, DMTCP recorded the path "./run.csh". However, because the working directory changed afterwards, the name "./run.csh" no longer points to a file at the time of checkpoint.

The fix is simple: if the _relative path_ is no longer valid, try the absolute path before declaring the file *unlinked*. However, the question is what should happen at restart. It is also possible that there is a separate run.csh file in the new working directory (i.e. run_dir/run.csh). We had a brief discussion about this earlier too.
One possibility would be to record the abspath if the working directory has changed since opening the file, and subsequently use the abspath to restore the file at restart time. Is this a reasonable approach?

Kapil

On Wed, Mar 13, 2013 at 12:55 PM, Louie, Joshua D <[email protected]> wrote:
> Bear with me, this example might be a bit long winded. I have a situation
> where running with the trunk is resulting in a failure during checkpointing.
> It appears to be due to something with sub-processes not being in the same
> directory as the launching process. When I run with 1.2.6 checkpointing
> works, but it fails during restore. High-level flow looks like this:
>
> Works with both checkpoint and restore both 1.2.6 and trunk:
> DMTCP run.csh --> run_sleep.csh --> sleep_ckpt
>
> Fails either in ckpt or restore depending on DMTCP version:
> DMTCP run.csh --> cd run_dir --> run_sleep.csh --> ../sleep_ckpt
>
> Failure signature during restore (1.2.6). It looks like we end up running
> the program fresh:
> [28002] WARNING at connection.cpp:1160 in restore; REASON='JWARNING(false) failed'
> Message: Size of file smaller than what we expected
> [28002] WARNING at connection.cpp:1183 in restore; REASON='JWARNING(false) failed'
>      _path = <my_path_removed>/run_sleep.csh
>      _offset = 26
>      _stat.st_size = 26
>      buf.st_size = 25
> Message: No lseek done: offset is larger than min of old and new size.
>
> Failure signature during checkpoint (trunk):
> [40000] ERROR at fileconnection.cpp:522 in handleUnlinkedFile; REASON='JASSERT(_type == FILE_DELETED) failed'
>      _path = ./run.csh
>      currPath = <my_path_removed>/run.csh
> Message: File not found on disk and yet the filename doesn't contain the suffix '(deleted)'
>
> How to build sleep_ckpt.c:
> setenv DMTCP_INSTALLATION <Your DMTCP here>
> gcc sleep_ckpt.c -I$DMTCP_INSTALLATION/include \
>     -L$DMTCP_INSTALLATION/lib -ldmtcpaware \
>     -Xlinker -rpath -Xlinker $DMTCP_INSTALLATION/lib -g -o sleep_ckpt
>
> How to checkpoint:
> Run program using run.csh, keeping the cd into run_dir line
> Checkpoint by doing "kill -s ALRM <pid of sleep_ckpt>"
>
> Sources:
> /*** sleep_ckpt.c ***/
> #include <stdio.h>
> #include <stdlib.h>
> #include <signal.h>
> #include <dmtcpaware.h>
>
> #define SLEEP_SEC 10
> static int ckpt_requested = 0;
>
> void ckpt_handler(int signum) {
>     ckpt_requested = 1;
> }
>
> int main(int argc, char *argv[]) {
>     int i;
>
>     /* Setup signal handler for doing a checkpoint */
>     signal(SIGALRM, ckpt_handler);
>
>     printf("Sleeping for %u seconds\n", SLEEP_SEC);
>     for (i = 0; i < SLEEP_SEC; i++) {
>         sleep(1);
>         if (ckpt_requested && dmtcpIsEnabled()) {
>             printf("Checkpointing at %u seconds\n", i);
>             dmtcpCheckpoint();
>             ckpt_requested = 0;
>         }
>     }
> }
>
> /*** run.csh ***/
> #!/bin/csh
>
> # This cd, when there causes problems, using script in local directory is fine
> cd ./run_dir
> ./run_sleep.csh
>
> /*** run_sleep.csh ***/
> #!/bin/csh
>
> ./sleep_ckpt
>
> /*** run_dir/run_sleep.csh ***/
> #!/bin/csh
>
> ../sleep_ckpt
>
>
> Joshua Louie
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
