Yep,

The constructor of pmix2x_threadshift_t (tscon) does not initialize
nondefault to false.
I won't be able to investigate this until Monday, but so far, my guess is
that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
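
To make the suspected failure mode concrete, here is a minimal sketch of the pattern, not the actual pmix2x code. The type threadshift_sketch_t, the functions tscon_buggy and tscon_fixed, and the helper fill_stale are hypothetical stand-ins; only the field name nondefault comes from the real struct. fill_stale simulates recycled memory whose stale bytes happen to read as true.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for pmix2x_threadshift_t; only the field under
 * discussion is modeled, the real struct has more members. */
typedef struct {
    bool nondefault;   /* should be true only when explicitly requested */
    int  status;
} threadshift_sketch_t;

/* Simulate recycled memory: every byte is 0x01, so the bool reads as true. */
void fill_stale(threadshift_sketch_t *p)
{
    memset(p, 1, sizeof(*p));
}

/* Suspected shape of the bug: the constructor never touches nondefault,
 * so the flag keeps whatever value the memory already held. */
void tscon_buggy(threadshift_sketch_t *p)
{
    p->status = 0;
}

/* Proposed shape of the fix: give nondefault a defined starting value. */
void tscon_fixed(threadshift_sketch_t *p)
{
    p->status = 0;
    p->nondefault = false;
}
```

With the buggy constructor the flag depends on whatever was in memory beforehand, which would be consistent with seeing different values on RHEL6 and RHEL7 without any race being involved.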

fwiw, intercomm_create used to fail in Cisco mtt because of too many
tasks and no oversubscription; now it fails because of this bug.

Cheers,

Gilles

On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:

> That would break debugger attach. Sounds to me like it’s just an
> uninitialized variable for in_event_hdlr?
>
> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >
> > Ralph,
> >
> > i noticed MPI_Comm_spawn is broken on master and on RHEL7
> >
> > for some reason i cannot yet explain, it works just fine on RHEL6 (!)
> >
> >
> > mpirun -np 1 ./dynamic/intercomm_create
> >
> > from the ibm test suite can be used to reproduce the issue.
> >
> >
> >
> > i dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in mpirun,
> > then the tasks receive a PMIX_ERR_DEBUGGER_RELEASE notification. it seems
> > no event handler is registered, so the default handler kills the task.
> >
> >
> > for the time being, a trivial workaround is not to fire
> > OPAL_ERR_DEBUGGER_RELEASE in the first place (see patch below)
> >
> >
> > could you please have a look ?
> >
> > i am not sure whether clients should not be notified at all, or whether
> > they should register a dummy handler.
> >
> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6, but false on
> > RHEL7, and that might indicate a race condition
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
> > index b9d571c..0de0e79 100644
> > --- a/orte/orted/orted_submit.c
> > +++ b/orte/orted/orted_submit.c
> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
> >
> > static void _send_notification(void)
> > {
> > +#if 0
> >     opal_buffer_t buf;
> >     int status = OPAL_ERR_DEBUGGER_RELEASE;
> >     orte_grpcomm_signature_t sig;
> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
> >     }
> >     OBJ_DESTRUCT(&sig);
> >     OBJ_DESTRUCT(&buf);
> > +#endif
> > }
> >
> > static void orte_debugger_dump(void)
> >
> >
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19214.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/07/19215.php
