I don't fully understand your fix yet, but I suppose it is the better approach.

I'll check it later, but for now let me explain what I was thinking:

If some nodes are allocated, execution doesn't go through this early return,
because opal_list_get_size(&nodes) > 0 at this location:

1590    if (0 == opal_list_get_size(&nodes)) {
1591        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
1592                             "%s plm:base:setup_vm only HNP in allocation",
1593                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
1594        /* cleanup */
1595        OBJ_DESTRUCT(&nodes);
1596        /* mark that the daemons have reported so we can proceed */
1597        daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
1598        daemons->updated = false;
1599        return ORTE_SUCCESS;
1600    }

After filtering, however, opal_list_get_size(&nodes) becomes zero at this later
location. That's why I think the two lines 1597-1598 should also be added to
the if-clause below (see the sketch after the snippet):

1660    if (0 == opal_list_get_size(&nodes)) {
1661        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
1662                             "%s plm:base:setup_vm only HNP left",
1663                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
1664        OBJ_DESTRUCT(&nodes);
1665        return ORTE_SUCCESS;
1666    }
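
To be concrete, the if-clause with those two lines added would look like this
(a sketch only, mirroring the diff quoted below; I haven't checked it against
the fix in ticket 4408):

    if (0 == opal_list_get_size(&nodes)) {
        OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                             "%s plm:base:setup_vm only HNP left",
                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
        /* cleanup */
        OBJ_DESTRUCT(&nodes);
        /* mark that the daemons have reported so we can proceed,
         * exactly as the "only HNP in allocation" branch above does */
        daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
        daemons->updated = false;
        return ORTE_SUCCESS;
    }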

Tetsuya

> Hmm...no, I don't think that's the correct patch. We want that function
> to remain "clean", as its job is simply to construct the list of nodes
> for the VM. It's the responsibility of the launcher to decide what to do
> with it.
>
> Please see https://svn.open-mpi.org/trac/ompi/ticket/4408 for a fix
>
> Ralph
>
> On Mar 17, 2014, at 5:40 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> > Hi Ralph, I found another corner case hangup in openmpi-1.7.5rc3.
> >
> > Condition:
> > 1. allocate some nodes using a resource manager (RM) such as TORQUE.
> > 2. request only the head node when executing the job with the
> >    -host or -hostfile option.
> >
> > Example:
> > 1. allocate node05 and node06 using TORQUE.
> > 2. request node05 only with the -host option
> >
> > [mishima@manage ~]$ qsub -I -l nodes=node05+node06
> > qsub: waiting for job 8661.manage.cluster to start
> > qsub: job 8661.manage.cluster ready
> >
> > [mishima@node05 ~]$ cat $PBS_NODEFILE
> > node05
> > node06
> > [mishima@node05 ~]$ mpirun -np 1 -host node05 ~/mis/openmpi/demos/myprog
> > << hang here >>
> >
> > And, my fix for plm_base_launch_support.c is as follows:
> > --- plm_base_launch_support.c   2014-03-12 05:51:45.000000000 +0900
> > +++ plm_base_launch_support.try.c       2014-03-18 08:38:03.000000000 +0900
> > @@ -1662,7 +1662,11 @@
> >         OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
> >                              "%s plm:base:setup_vm only HNP left",
> >                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
> > +        /* cleanup */
> >         OBJ_DESTRUCT(&nodes);
> > +        /* mark that the daemons have reported so we can proceed */
> > +        daemons->state = ORTE_JOB_STATE_DAEMONS_REPORTED;
> > +        daemons->updated = false;
> >         return ORTE_SUCCESS;
> >     }
> >
> > Tetsuya
> >
