I have a couple more thoughts/details on this.

It just occurred to me that this problem did not show up on the way to the 
first checkpoint.  Seven of my 12 jobs finished before the first checkpoint.  
Those jobs' processes exited and disappeared from the ps/top output before the 
checkpoint time (and hence were never checkpointed).  The 5 jobs that went 
through a restart are the ones that lingered in a defunct state after 
finishing.

I was also looking at my BLCR code and noticed that I obtain the exit status of 
my jobs by calling waitpid on the cr_run/cr_restart process ID.  So my guess is 
that the OS is waiting for me to collect the exit status of the defunct jobs 
(child processes) before removing them from the top/ps output.  I'm not sure, 
though, why that did not happen for the jobs that finished on their way toward 
checkpoint 1.
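
To illustrate the guess above, here is a minimal Python sketch (not my actual 
job-launching code) of how a finished child stays in the process table until 
its parent calls waitpid.  The function name run_and_reap and the exit code 7 
are made up for the example, and it assumes a POSIX system where os.fork is 
available.

```python
import os


def run_and_reap(exit_code=7):
    """Fork a child that exits at once, then collect its exit status.

    Returns (present_before_wait, present_after_wait, exit_code_seen).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)

    # Parent: close our copy of the write end, then block until the child's
    # copy is closed as well, which happens when the child exits.
    os.close(write_fd)
    os.read(read_fd, 1)  # returns b"" (EOF) once the child is gone
    os.close(read_fd)

    def in_process_table(p):
        # Signal 0 delivers nothing; it only checks that the PID exists.
        try:
            os.kill(p, 0)
            return True
        except ProcessLookupError:
            return False

    # The child has exited but nobody has waited for it yet, so its entry
    # (shown as <defunct> by ps/top) is still in the process table.
    present_before_wait = in_process_table(pid)

    # Collecting the exit status is what lets the kernel drop the entry.
    _, status = os.waitpid(pid, 0)
    present_after_wait = in_process_table(pid)

    seen = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return present_before_wait, present_after_wait, seen
```

If whichever process is the parent of a finished job never makes the waitpid 
call, the entry would stay defunct, which would match what I am seeing.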

Here's an example of one of the jobs.

On the trip toward checkpoint 1, I started my job with:

dmtcp_coordinator --port 0 --background \
  --port-file /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1.port \
  --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1 \
  --tmpdir /panasas/scratch/rwleach/tmp &

(I waited for the port file to show up here)
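
The waiting step can be sketched roughly as below.  This is a hypothetical 
helper, not the code I actually run: the name wait_for_port_file, the timeout, 
and the polling interval are all invented, and it assumes the port file 
contains just the port number as text.

```python
import time


def wait_for_port_file(path, timeout=30.0, poll_interval=0.2):
    """Poll until `path` exists and is non-empty, then return the port.

    Assumes the file holds only the port number as decimal text.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as fh:
                text = fh.read().strip()
            if text:
                return int(text)
        except FileNotFoundError:
            # The coordinator has not created the file yet.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("no port written to %s" % path)
        time.sleep(poll_interval)
```

The returned value is what gets passed as --port in the next command.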

dmtcp_checkpoint --no-gzip --join --port 40173 \
  --tmpdir /panasas/scratch/rwleach/tmp \
  --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1 \
  --quiet --host d16n35.ccr.buffalo.edu \
  /util/meme/4.6.0/bin/meme.bin \
  LNCaP_control-LNCaP_input_summits.bed.pad150.formeme \
  -dna -mod zoops -minw 6 -maxw 25 -revcomp -nostatus \
  -o LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memesummits150 \
  -maxsize 30000000 \
  1> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits \
  2> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.err

On the trip toward checkpoint 2, I restarted my job with:

dmtcp_coordinator --port 0 --background \
  --port-file /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt2.port \
  --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt2 \
  --tmpdir /panasas/scratch/rwleach/tmp &

(I waited for the port file to show up here)

dmtcp_restart --join --port 52814 --quiet \
  /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1/ckpt_4b276121bca0190a-30157-51c3517e_00001/*.dmtcp

The jobs are all finishing successfully.  Only 2 are genuinely still running, 
but PBS still thinks all 5 are running because 3 are currently stuck in the 
defunct state.

Rob