Hi,
I have a shell script that launches a program with DMTCP. On the first run it
uses dmtcp_launch; on later runs it uses dmtcp_restart. It lets the program run
for about 3 minutes, checkpoints it with dmtcp_command, terminates it with the
dmtcp_command quit option, and then runs itself again. The purpose of the
script is to convert one long program run into a sequence of short runs. The
source code and the script are attached for your reference.
The problem is this: if the program can complete within one or two restarts,
the results are fine. If it needs more time, then the third time
dmtcp_command -c is invoked, the running program crashes with a segmentation
fault, and the checkpoint produces only a file named like the restart image
ckpt_*.dmtcp but with the extension ".temp". As a result, the script cannot
continue. I am puzzled as to why this happens at the third checkpoint and not
the second, since the command used is exactly the same. I also tried the steps
manually with two screens, and it fails in the same way. The error message I
got is the following:
[23043] ERROR at dmtcpmessagetypes.cpp:56 in assertValid;
REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
_magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
dmtcp_command (23043): Terminating...
/var/lib/slurmd/job202408/slurm_script: line 121: 22777 Segmentation fault dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp > num-16.even
We are using the following version on CentOS 7:
$ dmtcp_command --version
dmtcp_command (DMTCP) 2.5.2
License LGPLv3+: GNU LGPL version 3 or later
<http://gnu.org/licenses/lgpl.html>.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
Please let me know if you need any more information.
Thank you in advance for your help.
Best,
Xiaoge
README
#!/bin/bash -login
# The current working directory should contain the source code dmtcp1.c.
# Script name. This script resubmits itself multiple times.
export JOBSCRIPT="manual.sh"
# start dmtcp_coordinator
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file port $@ 1>/dev/null 2>&1 # start coordinator
h=`hostname` # get host name
p=`cat port`
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$p
# print out some information
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "
####################### BODY of the JOB ######################
# prepare the work environment of the job
# build the program if it does not exist
if [ ! -f count.exe ]
then
    cc count.c -o count.exe
fi
# run the program count.exe.
# To run interactively:
# $ ./count.exe n num.odd 1> num.even
# it will count to number n and generate 2 files:
# num.odd contains all the odd numbers;
# num.even contains all the even numbers.
# To run with DMTCP, use dmtcp commands.
# If this is the first launch, use "dmtcp_launch";
# otherwise use "dmtcp_restart".
# Set the checkpoint interval. After launching the job, this script waits
# for the interval (in seconds), then starts the checkpoint.
export CKPT_WAIT_SEC=$(( 3 * 60 ))
# Launch or restart the execution
if [ ! -f ckpt_*.dmtcp ] # no ckpt file exists, use dmtcp_launch
then
    # first run: launch the job with dmtcp_launch
    echo " call dmtcp_launch "
    dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 1200 num.odd 1> num.even &
    # wait for one checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC
    # start checkpointing
    # echo " start dmtcp checkpointing"
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint
    # echo " finish dmtcp checkpointing"
    # kill the running job after checkpointing
    # echo " terminate job after checkpoint "
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
    # resubmit the job
    echo "resubmit $JOBSCRIPT "
    ./$JOBSCRIPT
else
    # restart the job from the checkpoint files
    echo " call dmtcp_restart "
    dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1> num.even &
    # echo " restarted "
    # wait for one checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC
    # clean up the old image before taking a new checkpoint
    rm -r ckpt_*.dmtcp ckpt_*_files
    # if the program is still running, checkpoint it and resubmit
    if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
    then
        # echo " start checkpointing again "
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc
        # echo " finish checkpointing again "
        # kill the running program
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
        # resubmit this script
        echo " resubmit $JOBSCRIPT "
        ./$JOBSCRIPT
    else
        echo "job finished"
    fi
fi
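A side note on the `[ ! -f ckpt_*.dmtcp ]` test in the script: it only behaves as intended when the glob matches zero or one file. If several images match, `[` receives several operands and reports an error instead of a clean true/false. The following is a minimal sketch of a more defensive check, not something taken from the DMTCP documentation. The function names `has_ckpt_image` and `has_partial_ckpt` are hypothetical, and the `.dmtcp.temp` suffix is an assumption based on the leftover file described in the report.

```shell
#!/bin/sh
# Hypothetical helpers; they are not part of the attached script.

# Succeeds if at least one checkpoint image matching ckpt_*.dmtcp exists
# in the directory given as $1 (default: the current directory).
has_ckpt_image() {
    for f in "${1:-.}"/ckpt_*.dmtcp; do
        # With no match the pattern stays literal, so -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Succeeds if a partially written image is present.
# Assumption: the leftover file is named ckpt_*.dmtcp.temp.
has_partial_ckpt() {
    for f in "${1:-.}"/ckpt_*.dmtcp.temp; do
        [ -e "$f" ] && return 0
    done
    return 1
}
```

With helpers like these, the launch-or-restart decision could be written as `if ! has_ckpt_image; then ... fi`, and the script could check `has_partial_ckpt` after a checkpoint and stop before quitting and resubmitting if only a ".temp" file was written.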
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    if (argc <= 1) {
        printf("not enough arguments.\n");
        printf("Usage: ./dmtcp1 n filename \n");
        exit(1);
    }
    int n = atoi(argv[1]);
    /* The output filename is optional; it defaults to odd.out */
    const char *filename = (argc == 2) ? "odd.out" : argv[2];
    /* Open file as writable */
    FILE *ofp = fopen(filename, "w");
    /* fprintf(ofp,"\ncmdline args count=%d", argc); */
    /* First argument is executable name only */
    /* fprintf(ofp, "\nexe name=%s\n", argv[0]); */
    /* Third argument is the output filename */
    /* fprintf(ofp,"\nfilename=%s\n", filename); */
    if (ofp == NULL) {
        /* Report the filename, not argv[1] (which is the count n) */
        printf("Can't open output file %s!\n", filename);
        exit(1);
    }
    /* Odd numbers go to the file and even numbers go to stdout:
       count advances twice per iteration. */
    int count = 1;
    while (count <= n)
    {
        fprintf(ofp, " %2d\n ", count++);
        printf(" %2d\n ", count++);
        sleep(1);
    }
    fclose(ofp);
    return 0;
}
longjob.sb
shortjob.sb