Hi Robert,

The --exit-on-last option only tells the coordinator to quit after all
other nodes have exited. The defunct processes have to be waited on by their
parent. Who is the parent in this case? Is it some shell script? If so, you
might want to call wait in there.
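
If the parent does turn out to be a shell script, a minimal sketch of the idea
could look like the following. The function name run_and_reap and the
placeholder commands are made up for illustration; in practice each command
would be one of your dmtcp_checkpoint or dmtcp_restart command lines.

```shell
#!/bin/sh
# Hypothetical helper: start every job in the background, remember its PID,
# then wait on each PID so the exit status is collected and no finished
# child is left unreaped.
run_and_reap() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &            # stand-in for a real job command line
        pids="$pids $!"
    done

    failed=0
    for pid in $pids; do
        wait "$pid"               # blocks until this child exits
        status=$?
        echo "job $pid exited with status $status"
        if [ "$status" -ne 0 ]; then
            failed=$((failed + 1))
        fi
    done
    return "$failed"              # number of jobs that exited non-zero
}
```

Calling, for example, run_and_reap "exit 0" "exit 3" prints one line per job
and returns the number of jobs that exited with a non-zero status.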

Kapil


On Tue, Jun 25, 2013 at 2:07 PM, Robert William Leach
<[email protected]> wrote:

> I have a couple more thoughts/details on this.
>
> It just occurred to me that this situation did not occur during my trip
> toward the first checkpoint.  I had 7 jobs (of 12) that finished during the
> trip toward the first checkpoint.  Those jobs' processes disappeared from
> the ps/top output and exited before the checkpoint time (and hence never
> checkpointed).  The 5 jobs that underwent a restart are the ones that hung
> around in a defunct state after finishing.
>
> I was also looking at my BLCR code and noted that I obtain the exit status
> of my jobs by calling waitpid on the cr_run/cr_restart process ID.  So my
> guess is that the OS is waiting for me to collect the exit status of the
> defunct jobs (child processes) before removing them from the top/ps output.
> I'm not sure why that did not happen, though, for the jobs that finished on
> their way toward checkpoint 1.
>
> Here's an example of one of the jobs
>
> On the trip toward checkpoint 1, I started my job with:
>
> dmtcp_coordinator --port 0 --background --port-file
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1.port
> --ckptdir
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1
> --tmpdir /panasas/scratch/rwleach/tmp &
>
> (I waited for the port file to show up here)
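
The "wait for the port file" step can be scripted. Below is a rough sketch, not
anything taken from your setup: the function name wait_for_port_file and the
timeout default are invented for illustration, and it assumes the coordinator
writes just the port number into the file.

```shell
#!/bin/sh
# Hypothetical helper: poll until the port file exists and is non-empty,
# then print its contents. Assumes the file holds only the port number.
wait_for_port_file() {
    port_file=$1
    timeout=${2:-60}              # seconds to wait; illustrative default
    waited=0
    while [ ! -s "$port_file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $port_file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$port_file"
}

# Example use (paths and the trailing command are placeholders):
#   port=$(wait_for_port_file "$PORT_FILE" 30) || exit 1
#   dmtcp_checkpoint --join --port "$port" ...
```

Checking for a non-empty file rather than mere existence avoids reading the
file in the window between its creation and the port being written.
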
>
> dmtcp_checkpoint --no-gzip --join --port 40173 --tmpdir
> /panasas/scratch/rwleach/tmp --ckptdir
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1
> --quiet --host d16n35.ccr.buffalo.edu /util/meme/4.6.0/bin/meme.bin
> LNCaP_control-LNCaP_input_summits.bed.pad150.formeme -dna -mod zoops -minw
> 6 -maxw 25 -revcomp -nostatus -o
> LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memesummits150
> -maxsize 30000000 1>
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits
> 2>
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.err
>
> On the trip toward checkpoint 2, I restarted my job with:
>
> dmtcp_coordinator --port 0 --background --port-file
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt2.port
> --ckptdir
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt2
> --tmpdir /panasas/scratch/rwleach/tmp &
>
> (I waited for the port file to show up here)
>
> dmtcp_restart --join --port 52814 --quiet
> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1/ckpt_4b276121bca0190a-30157-51c3517e_00001/*.dmtcp
>
> The jobs are all finishing successfully.  Only 2 are actually still
> running, though PBS still thinks all 5 are running because 3 are currently
> stuck in the defunct state.
>
> Rob
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by Windows:
>
> Build for Windows Store.
>
> http://p.sf.net/sfu/windows-dev2dev
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>