Hi Jeff,

What protection do you mean? A check that /proc/$pid/status exists? That is
already done in Grep().



We observe that a process launched by mtt hangs (mtt detects the timeout and
starts the do_command procedure) and later enters the "defunct" state.



mtt sends an email saying that the process hangs, and when we check the
reason, it turns out that the process has basically finished and mtt is
monitoring a "defunct" process, which is the only one left.



This fix lets mtt detect that it is monitoring such a process and proceed
to the next test.
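
A minimal sketch of a more defensive version of that check, written as a shell function rather than the Perl used in DoCommand.pm. The names state_is_zombie and pid_is_zombie are hypothetical, and the sketch assumes the Linux text layout of /proc/<pid>/status, where a zombie shows a line such as "State: Z (zombie)". Where that file is missing or is not in this text format, the function reports "not a zombie" instead of failing:

```shell
#!/bin/sh
# Hypothetical helpers, not part of MTT.

# Return 0 if the given "State:" line (Linux /proc/<pid>/status format)
# describes a zombie, 1 otherwise.
state_is_zombie() {
    case "$1" in
        State:*"Z (zombie)"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 if process $1 is a zombie, 1 if it is not or if the state
# cannot be determined on this platform.
pid_is_zombie() {
    status_file="/proc/$1/status"
    # Guard: skip the check when the file is absent or unreadable.
    [ -r "$status_file" ] || return 1
    # Match only the State: line, not the word "zombie" anywhere in the file.
    state_line=$(grep '^State:' "$status_file" 2>/dev/null) || return 1
    state_is_zombie "$state_line"
}
```

A caller could then write `pid_is_zombie "$pid" && echo "zombie, proceeding"`, treating a non-zero return as "keep monitoring".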



I don't know yet which part of mtt causes the "defunct" state, but I am
looking into it.

After some googling, I found that fork from Perl (used in mtt) can have such
a side effect.
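
For context, a zombie is a child that has exited but whose parent has not yet collected its exit status. The sketch below illustrates that general mechanism in shell. It is not MTT's code, run_and_reap is a hypothetical name, and whether a missing wait is what actually happens inside mtt is still an assumption:

```shell
#!/bin/sh
# Hypothetical helper, not part of MTT.

# Start a command in the background, then collect its exit status.
# Until the parent waits, an exited child stays in the process table
# as a <defunct> entry; wait removes that entry.
run_and_reap() {
    "$@" &
    child=$!
    # wait returns the child's exit status, which becomes the
    # return status of this function.
    wait "$child"
}
```

For example, `run_and_reap sh -c 'exit 3'` returns 3 and leaves no defunct child behind.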



Here is an example from a real run:



miked     1362  0.0  0.0      0     0 ?        Z    13:36   0:00 [sh] <defunct>



My guess: inside mtt.ini we use MPI details like the ones below, which invoke
"sh" through the shebang line. Somehow, that "sh" sometimes becomes a zombie.



exec      =<<EOF
#!/bin/sh
#SBATCH --job-name=&get_ini_val(mtt, description)_mtt_case
#SBATCH --nodes=&getenv('SLURM_NNODES') --ntasks-per-node=&getenv('SLURM_NTASKS_PER_NODE') --partition=&shell('squeue -h -j $SLURM_JOB_ID -o %P') --time=01:00:00
#
# To run interactive, copy paste command below:
#

export OMPI_HOME=&test_prefix()
export EXE=&test_executable_abspath()

PPN=$SLURM_NTASKS_PER_NODE
NP=$SLURM_NPROCS
NNODES=$SLURM_NNODES
HOSTS=$(hostlist -e -s , $SLURM_NODELIST)

set -x

$OMPI_HOME/bin/mpirun @tag@ -np &test_np() -H $HOSTS -bind-to-core -bynode -display-map &if(&get_ini_val(mtt,pkg) eq codecov,@codecov_params@,'') &get_ini_val("mtt","mpi_args") @mca@ &test_extra_mpi_argv() $EXE &test_argv()
EOF





Regards,
M



> -----Original Message-----
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Sunday, February 24, 2013 13:10
> To: <mtt-de...@open-mpi.org>
> Cc: Mike Dubman
> Subject: fix zombie commit
>
> Mike --
>
> Please protect this code better; MTT is also run on Solaris and OS X.
>
> Also, can you describe more fully the case where zombies are being left
> behind by MTT?
>
>
> On Feb 24, 2013, at 1:44 AM, <svn-commit-mai...@open-mpi.org> wrote:
>
> > Author: miked (Mike Dubman)
> > Date: 2013-02-24 01:44:31 EST (Sun, 24 Feb 2013)
> > New Revision: 1589
> > URL: https://svn.open-mpi.org/trac/mtt/changeset/1589
> >
> > Log:
> > * fix: fork leaves zombie processes sometimes. temp fix: detect zombie and proceed with tests.
> >
> > Text files modified:
> >   trunk/lib/MTT/DoCommand.pm |     6 ++++++
> >   1 files changed, 6 insertions(+), 0 deletions(-)
> >
> > Modified: trunk/lib/MTT/DoCommand.pm
> > ==============================================================================
> > --- trunk/lib/MTT/DoCommand.pm      Wed Feb 20 12:41:12 2013        (r1588)
> > +++ trunk/lib/MTT/DoCommand.pm      2013-02-24 01:44:31 EST (Sun, 24 Feb 2013) (r1589)
> > @@ -641,6 +641,12 @@
> >         if (!$pid_exists) {
> >             Verbose("--> Process completed somehow at " . time() . ", proceeding with tests\n");
> >             $resume_tests++;
> > +        } else {
> > +            my $matches = MTT::Files::Grep("zombie", "/proc/$pid/status");
> > +            if (@$matches) {
> > +                Verbose("--> Process become Zombie at " . time() . ", proceeding with tests\n");
> > +                $resume_tests++;
> > +            }
> >         }
> >         # Remove the timeout sentinel file, if a timeout notify timeout value is set
> >         if (defined($end_time) and time() > $end_time) {
> > _______________________________________________
> > mtt-svn mailing list
> > mtt-...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-svn
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/





