Re: [MTT devel] fix zombie commit

2013-02-26 Thread Jeff Squyres (jsquyres)
On Feb 26, 2013, at 2:11 AM, Mike Dubman wrote:

> On Mon, Feb 25, 2013 at 6:24 PM, Jeff Squyres (jsquyres) wrote:
> >Looking at the code, you're checking for zombie status before MTT kills the 
> >proc.  Am I reading that right?
> I don't think the order matters: if the process is not a zombie yet and is 
> about to be killed by MTT later, that is a good flow.
> If the process is already a zombie, MTT will not be able to kill it anyway, so 
> it can stop waiting and switch to the new task.

No, the _kill_proc() routine does both a kill() and a waitpid().  The waitpid() 
should reap the zombie.

I.e., if the process has died, MTT simply hasn't reaped it yet.  Hence, 
it's a zombie.
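
A minimal Perl sketch of that point, assuming nothing about MTT's actual 
_kill_proc() beyond the kill()/waitpid() pair described above.  The variable 
names are illustrative only.  It forks a child that exits at once, shows that 
the PID can still be signalled while it is a zombie, and then reaps it:

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Make sure SIGCHLD has its default disposition; if it were ignored, the
# kernel would reap children automatically and no zombie would appear.
local $SIG{CHLD} = 'DEFAULT';

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    exit(3);    # child: exit right away with a recognizable status
}

sleep(1);       # crude: give the child time to exit and become a zombie

# Signal 0 only tests whether the PID can be signalled.  It still succeeds
# for a zombie, because the process table entry exists until it is reaped.
my $still_listed = kill(0, $pid);

# waitpid() collects the exit status and removes the process table entry.
my $reaped    = waitpid($pid, 0);
my $exit_code = $? >> 8;

# Once reaped, the child is gone, so a second non-blocking wait finds nothing.
my $second = waitpid($pid, WNOHANG);

print "signalable before reap: $still_listed\n";
print "waitpid returned child pid: ", ($reaped == $pid ? "yes" : "no"), "\n";
print "exit code: $exit_code\n";
print "second waitpid: $second\n";
```

The relevant part is that only the waitpid() call makes the zombie go away; 
sending it a signal does not.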

> >If so, then it could well be that the process has exited but not yet been 
> >reaped (because _kill_proc() hasn't been invoked yet).  If this is the case, 
> >is the real cause of the problem that the OUTread and ERRread aren't being 
> >closed when the child process exits, and therefore we keep looping looking 
> >for new output from them?
> yep, sounds like it can be the cause, need to look into this code.

Ok.  It would be interesting to see if the process dies, but:

1) MTT is still blocking in select() (i.e., OUTread and ERRread aren't returning 
0 from sysread upon process death)

2) $done is somehow not getting set to 0, and therefore MTT is still looping 
until the timeout expires
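
For reference, here is a rough sketch of how such a read loop is usually 
expected to terminate.  This is not MTT's DoCommand.pm code: the 
read_child_output() helper and its variables are hypothetical, and it uses 
IO::Select rather than a raw select() call.  It assumes the parent closes its 
copies of the pipe write ends, so that sysread() returns 0 on each read handle 
once the child is gone:

```perl
use strict;
use warnings;
use IO::Select;

# Hypothetical helper: run $child_code in a forked child and collect its
# stdout/stderr until both pipes hit EOF or $timeout seconds have passed.
sub read_child_output {
    my ($child_code, $timeout) = @_;

    pipe(my $out_r, my $out_w) or die "pipe failed: $!";
    pipe(my $err_r, my $err_w) or die "pipe failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        close($out_r);
        close($err_r);
        open(STDOUT, '>&', $out_w) or die "cannot dup stdout: $!";
        open(STDERR, '>&', $err_w) or die "cannot dup stderr: $!";
        close($out_w);
        close($err_w);
        $child_code->();
        exit(0);
    }

    # The parent must close its write ends; otherwise the read ends never
    # report EOF, even after the child has died.
    close($out_w);
    close($err_w);

    my %name     = (fileno($out_r) => 'out', fileno($err_r) => 'err');
    my %result   = (out => '', err => '', timed_out => 0);
    my $sel      = IO::Select->new($out_r, $err_r);
    my $deadline = time() + $timeout;

    # Loop until every handle has been removed (EOF) or the deadline passes.
    while ($sel->count() > 0) {
        my $left = $deadline - time();
        if ($left <= 0) {
            $result{timed_out} = 1;
            last;
        }
        for my $fh ($sel->can_read($left)) {
            my $n = sysread($fh, my $chunk, 4096);
            die "sysread failed: $!" unless defined $n;
            if ($n == 0) {
                $sel->remove($fh);    # EOF: stop watching this handle
            } else {
                $result{ $name{ fileno($fh) } } .= $chunk;
            }
        }
    }

    # Kill only on timeout, then always reap so no zombie is left behind.
    kill('KILL', $pid) if $result{timed_out};
    waitpid($pid, 0);
    $result{status} = $?;
    return \%result;
}

my $r = read_child_output(sub { print "hello\n"; print STDERR "oops\n"; }, 5);
print "stdout: $r->{out}";
print "stderr: $r->{err}";
print "timed out: $r->{timed_out}\n";
```

If a loop like this keeps spinning after the child has died, the two usual 
suspects match the cases above: a write end of a pipe is still open somewhere 
(so EOF never arrives), or the bookkeeping that tracks open handles is not 
updated on EOF.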

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [MTT devel] fix zombie commit

2013-02-25 Thread Jeff Squyres (jsquyres)
On Feb 24, 2013, at 6:59 AM, Mike Dubman wrote:

> What protection do you mean? Check that /proc/pid/status exists? It is done 
> in Grep()

Ah, excellent -- I hadn't noticed that.

> We observe that a process which was launched by MTT and hangs (MTT detects 
> the timeout and starts the do_command procedure) later enters the "defunct" 
> state.

Looking at the code, you're checking for zombie status before MTT kills the 
proc.  Am I reading that right?

If so, then it could well be that the process has exited but not yet been 
reaped (because _kill_proc() hasn't been invoked yet).  If this is the case, is 
the real cause of the problem that the OUTread and ERRread aren't being closed 
when the child process exits, and therefore we keep looping looking for new 
output from them?

-- 
Jeff Squyres
jsquy...@cisco.com




[MTT devel] fix zombie commit

2013-02-24 Thread Jeff Squyres (jsquyres)
Mike --

Please protect this code better; MTT is also run on Solaris and OS X.

Also, can you describe more fully the case where zombies are being left behind 
by MTT?
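
To illustrate the kind of protection meant here, a small sketch follows.  The 
is_zombie() helper is hypothetical and is not part of MTT or MTT::Files; it 
assumes the Linux /proc layout, where /proc/<pid>/status carries a "State:" 
line, and it reports "unknown" (undef) on any platform where that file is not 
available instead of guessing:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MTT.  Returns 1 if $pid is a zombie,
# 0 if it is not, and undef if this platform cannot tell us via /proc.
sub is_zombie {
    my ($pid) = @_;

    # Only Linux is assumed to provide /proc/<pid>/status in this format.
    return undef unless $^O eq 'linux';

    my $status_file = "/proc/$pid/status";
    open(my $fh, '<', $status_file) or return undef;
    while (my $line = <$fh>) {
        # Linux reports e.g. "State:\tZ (zombie)" for a defunct process.
        if ($line =~ /^State:\s+(\S)/) {
            close($fh);
            return $1 eq 'Z' ? 1 : 0;
        }
    }
    close($fh);
    return undef;
}

my $self = is_zombie($$);
print defined($self)
    ? "current process zombie: $self\n"
    : "zombie check unavailable on $^O\n";
```

A caller would then treat undef as "cannot tell" and fall back to the existing 
timeout handling, rather than assuming the process is alive or dead.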


On Feb 24, 2013, at 1:44 AM,  wrote:

> Author: miked (Mike Dubman)
> Date: 2013-02-24 01:44:31 EST (Sun, 24 Feb 2013)
> New Revision: 1589
> URL: https://svn.open-mpi.org/trac/mtt/changeset/1589
> 
> Log:
> * fix: fork leaves zombie processes sometimes. temp fix: detect zombie and 
> proceed with tests.
> 
> Text files modified: 
>   trunk/lib/MTT/DoCommand.pm | 6 ++  
>   1 files changed, 6 insertions(+), 0 deletions(-)
> 
> Modified: trunk/lib/MTT/DoCommand.pm
> ==
> --- trunk/lib/MTT/DoCommand.pm    Wed Feb 20 12:41:12 2013    (r1588)
> +++ trunk/lib/MTT/DoCommand.pm    2013-02-24 01:44:31 EST (Sun, 24 Feb 2013)    (r1589)
> @@ -641,6 +641,12 @@
>      if (!$pid_exists) {
>          Verbose("--> Process completed somehow at " . time() . ", proceeding with tests\n");
>          $resume_tests++;
> +    } else {
> +        my $matches = MTT::Files::Grep("zombie", "/proc/$pid/status");
> +        if (@$matches) {
> +            Verbose("--> Process become Zombie at " . time() . ", proceeding with tests\n");
> +            $resume_tests++;
> +        }
>      }
>      # Remove the timeout sentinel file, if a timeout notify timeout value is set
>      if (defined($end_time) and time() > $end_time) {
> ___
> mtt-svn mailing list
> mtt-...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-svn


-- 
Jeff Squyres
jsquy...@cisco.com