It is possible that during checkpoint, the processes are not direct
children of the Perl script (the fork-fork trick creates grandchildren), so
they won't become zombies of your script. During restart, however, the
processes are direct children of the script, so they become zombies until
the script waits on them.
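
To illustrate the difference, here is a minimal sketch in Python rather than
Perl (Perl's fork and waitpid wrap the same system calls). It is not DMTCP
code, and the helper names spawn_direct_child and spawn_grandchild are made
up for this example. A direct child that exits stays defunct until its parent
calls waitpid, while a grandchild created with the fork-fork trick is
reparented once the intermediate process exits, so the original parent never
sees it as a zombie and cannot wait on it.

```python
import os


def spawn_direct_child(exit_code=0):
    """Fork a child that exits at once. Until the caller reaps it with
    waitpid, it lingers as a zombie (shown as <defunct> by ps/top)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


def spawn_grandchild():
    """Fork-fork trick: the intermediate child forks the real worker and
    exits immediately. The worker is reparented (typically to init), so
    it is never a zombie of the calling process."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Intermediate child: fork the worker, report its PID, then exit.
        os.close(read_fd)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(write_fd)
            os._exit(0)  # grandchild: stands in for the real job
        os.write(write_fd, str(grandchild).encode())
        os._exit(0)
    os.close(write_fd)
    grandchild = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the intermediate child right away
    return grandchild


if __name__ == "__main__":
    pid = spawn_direct_child(exit_code=3)
    reaped, status = os.waitpid(pid, 0)  # removes the defunct entry
    print("direct child exit status:", os.WEXITSTATUS(status))

    gpid = spawn_grandchild()
    try:
        os.waitpid(gpid, 0)
    except ChildProcessError:
        # ECHILD: the grandchild does not belong to this process.
        print("grandchild is not our child; nothing to reap")
```

If the restarted jobs really are direct children of your script, calling
waitpid on each job's PID once it finishes should clear the defunct entries.
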
On Wed, Jun 26, 2013 at 11:50 AM, Robert William Leach
<[email protected]> wrote:
> Hi Kapil,
>
> Ah, OK. Thanks. I wasn't sure about what "exit on last" did.
>
> The parent script is a perl script. I guessed yesterday that the reason
> the strategy was working in blcr was because the process belonged to cr_run
> (or cr_restart), so cr_* was doing the waiting and I was calling waitpid on
> it in my script. So, in the case of dmtcp, I implemented a new strategy
> where I detect the job ending before the checkpoint and then call waitpid
> on the process itself instead of on dmtcp. I'm going to test it out once
> my current set of jobs is done.
>
> If this solution is correct though, I'm not sure why I didn't run into
> this issue during the run toward the first checkpoint. There were some
> jobs which finished during that run and they disappeared from the top/ps
> output and the script ended before the checkpoint was reached. I didn't
> call wait for those and they didn't hang around as defunct. I'm not sure
> why that happened. I looked at my logs for those jobs and the processes
> belonged to my script and not a dmtcp process.
>
> However, it appears that only after the restart did I end up with defunct
> processes waiting on me to collect them using wait. I suppose as long as I
> get it to work, it's not an issue, but I would like to have a better
> understanding of what's going on.
>
> Rob
>
> On Jun 26, 2013, at 11:23 AM, Kapil Arya wrote:
>
> > Hi Robert,
> >
> > The --exit-on-last option is only for the coordinator to quit after all
> > other nodes have exited. The defunct processes should be waited upon by the
> > parent. Who is the parent in this case? Is this some shell script? In that
> > case you might want to do wait in there.
> >
> > Kapil
> >
> >
> > On Tue, Jun 25, 2013 at 2:07 PM, Robert William Leach <[email protected]> wrote:
> > I have a couple more thoughts/details on this.
> >
> > It just occurred to me that this situation did not occur during my trip
> > toward the first checkpoint. I had 7 jobs (of 12) that finished during the
> > trip toward the first checkpoint. Those jobs' processes disappeared from
> > the ps/top output and exited before the checkpoint time (and hence never
> > checkpointed). The 5 jobs that underwent a restart are the ones that hung
> > around in a defunct state after finishing.
> >
> > I was also looking at my blcr code and noted that I obtain the exit
> > status of my jobs by calling waitpid on the cr_run/cr_restart process ID.
> > So my guess is that the OS is waiting for me to collect the exit status of
> > the defunct jobs (child processes) before removing them from the top/ps
> > output. I'm not sure why that did not happen though for the jobs that
> > finished on their way toward checkpoint 1.
> >
> > Here's an example of one of the jobs
> >
> > On the trip toward checkpoint 1, I started my job with:
> >
> > dmtcp_coordinator --port 0 --background \
> >   --port-file /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1.port \
> >   --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1 \
> >   --tmpdir /panasas/scratch/rwleach/tmp &
> >
> > (I waited for the port file to show up here)
> >
> > dmtcp_checkpoint --no-gzip --join --port 40173 \
> >   --tmpdir /panasas/scratch/rwleach/tmp \
> >   --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1 \
> >   --quiet --host d16n35.ccr.buffalo.edu \
> >   /util/meme/4.6.0/bin/meme.bin LNCaP_control-LNCaP_input_summits.bed.pad150.formeme \
> >   -dna -mod zoops -minw 6 -maxw 25 -revcomp -nostatus \
> >   -o LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memesummits150 \
> >   -maxsize 30000000 \
> >   1> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits \
> >   2> /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.err
> >
> > On the trip toward checkpoint 2, I restarted my job with:
> >
> > dmtcp_coordinator --port 0 --background \
> >   --port-file /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt2.port \
> >   --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt2 \
> >   --tmpdir /panasas/scratch/rwleach/tmp &
> >
> > (I waited for the port file to show up here)
> >
> > dmtcp_restart --join --port 52814 --quiet \
> >   /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/LNCaP_control-LNCaP_input_summits.bed.pad150.formeme.memeout-summits.ckpt1/ckpt_4b276121bca0190a-30157-51c3517e_00001/*.dmtcp
> >
> > The jobs are all finishing successfully. There are only 2 left that are
> > still actually not finished, though pbs still thinks all 5 are running
> > since 3 are currently stuck in the defunct state.
> >
> > Rob
> >
> >
> ------------------------------------------------------------------------------
> > This SF.net email is sponsored by Windows:
> >
> > Build for Windows Store.
> >
> > http://p.sf.net/sfu/windows-dev2dev
> > _______________________________________________
> > Dmtcp-forum mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >
> >
>
>