I found the reason for the notification and fixed that as well - should all be done now
> On Jul 16, 2016, at 10:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Kewl - thanks! I will take care of this, but to me the most pressing issue is
> why this event notification is being generated at all. It shouldn’t be.
>
>> On Jul 16, 2016, at 9:11 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> I finally got it :-)
>>
>> In _send_notification() from orted_submit.c, the info key is
>> OPAL_PMIX_EVENT_NON_DEFAULT, but in pmix2x.c and pmix_ext20.c,
>> PMIX_EVENT_NON_DEFAULT is tested.
>> If I use OPAL_PMIX_EVENT_NON_DEFAULT in pmix*, that fixes the issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, July 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>> Okay, I’ll investigate why that is happening - thanks!
>>
>>> On Jul 16, 2016, at 7:45 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> The parent job (i.e. the task that calls MPI_Comm_spawn) receives it.
>>> I cannot tell whether the child (i.e. the spawned task) receives it too
>>> or not.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Saturday, July 16, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>> I can fix the initialization. What puzzles me is that no debugger_release
>>> message should be sent unless a debugger is attached - in which case, the
>>> event should be registered.
>>>
>>> So why is it being sent? Is it the child job that is receiving it? Or is it
>>> the parent?
>>>
>>>> On Jul 16, 2016, at 7:19 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>> I found some time to investigate this.
>>>> tscon should initialize nondefault to false in both pmix2x.c and
>>>> pmix_ext20.c
>>>>
>>>> A better workaround is to update ompi_errhandler_callback, so it does not
>>>> invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE
>>>>
>>>> That still seems counterintuitive to me ...
>>>> Does ERR stand for error? I did not find any error here ...
>>>> Should it be EVT for event instead? Should ERR not be fired in the first
>>>> place?
>>>> Should Open MPI register a handler for this event (so nondefault is true
>>>> and ompi_errhandler_callback is not invoked here)?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Okay, I’ll take a look - thanks!
>>>>
>>>>> On Jul 15, 2016, at 7:08 AM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>> Yep,
>>>>>
>>>>> The constructor of pmix2x_threadshift_t (tscon) does not initialize
>>>>> nondefault to false.
>>>>> I won't be able to investigate this until Monday, but so far, my guess is
>>>>> that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
>>>>>
>>>>> fwiw, intercomm_create used to fail in Cisco mtt because of too many
>>>>> tasks and no oversubscription; now it fails because of this bug.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> That would break debugger attach. Sounds to me like it’s just an
>>>>> uninitialized variable for in_event_hdlr?
>>>>>
>>>>> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp>
>>>>> > wrote:
>>>>> >
>>>>> > Ralph,
>>>>> >
>>>>> > I noticed MPI_Comm_spawn is broken on master and on RHEL7.
>>>>> > For some reason I cannot yet explain, it works just fine on RHEL6 (!)
>>>>> >
>>>>> > mpirun -np 1 ./dynamic/intercomm_create
>>>>> >
>>>>> > from the ibm test suite can be used to reproduce the issue.
>>>>> >
>>>>> > I dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in
>>>>> > mpirun, and the tasks then receive a PMIX_ERR_DEBUGGER_RELEASE
>>>>> > notification. It seems no event handler is registered, so the default
>>>>> > handler kills the task.
>>>>> >
>>>>> > For the time being, a trivial workaround is not to fire
>>>>> > OPAL_ERR_DEBUGGER_RELEASE in the first place (see patch below).
>>>>> >
>>>>> > Could you please have a look?
>>>>> >
>>>>> > I am not sure whether clients should not be notified at all, or whether
>>>>> > they should register a dummy handler.
>>>>> >
>>>>> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6, but false on
>>>>> > RHEL7, and that might indicate a race condition.
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > Gilles
>>>>> >
>>>>> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
>>>>> > index b9d571c..0de0e79 100644
>>>>> > --- a/orte/orted/orted_submit.c
>>>>> > +++ b/orte/orted/orted_submit.c
>>>>> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
>>>>> >
>>>>> >  static void _send_notification(void)
>>>>> >  {
>>>>> > +#if 0
>>>>> >      opal_buffer_t buf;
>>>>> >      int status = OPAL_ERR_DEBUGGER_RELEASE;
>>>>> >      orte_grpcomm_signature_t sig;
>>>>> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
>>>>> >      }
>>>>> >      OBJ_DESTRUCT(&sig);
>>>>> >      OBJ_DESTRUCT(&buf);
>>>>> > +#endif
>>>>> >  }
>>>>> >
>>>>> >  static void orte_debugger_dump(void)
>>>>> >
>>>>> > _______________________________________________
>>>>> > devel mailing list
>>>>> > de...@open-mpi.org
>>>>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> > Link to this post:
>>>>> > http://www.open-mpi.org/community/lists/devel/2016/07/19214.php
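For anyone reading this thread later, the two problems discussed above (tscon not initializing nondefault, and the sender tagging the event with one key while the receiver tests another) can be modeled with a small C sketch. This is only an illustration under stated assumptions, not the actual Open MPI or PMIx code: every demo_* name and both key strings are made up, and the dispatch logic is reduced to a single flag check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key strings. The real values behind
 * OPAL_PMIX_EVENT_NON_DEFAULT and PMIX_EVENT_NON_DEFAULT
 * are not shown in this thread. */
#define DEMO_OPAL_EVENT_NON_DEFAULT "demo.opal.evnondef"
#define DEMO_PMIX_EVENT_NON_DEFAULT "demo.pmix.evnondef"

/* Hypothetical stand-in for one key/value entry attached to an event. */
typedef struct {
    const char *key;
    bool flag;
} demo_info_t;

/* Hypothetical stand-in for the thread-shift object discussed above. */
typedef struct {
    int status;
    bool nondefault;
} demo_threadshift_t;

/* Constructor: set every field explicitly. Leaving nondefault unset
 * would make the later check depend on whatever was in memory. */
static void demo_tscon(demo_threadshift_t *cd)
{
    cd->status = 0;
    cd->nondefault = false;
}

/* Look for tested_key in the info array and copy its flag. */
static void demo_scan_info(demo_threadshift_t *cd, const demo_info_t *info,
                           size_t ninfo, const char *tested_key)
{
    for (size_t i = 0; i < ninfo; i++) {
        if (0 == strcmp(info[i].key, tested_key)) {
            cd->nondefault = info[i].flag;
        }
    }
}

/* Returns true when the default handler (the one that aborts the
 * task in the scenario above) would be invoked for this event. */
static bool demo_default_handler_runs(const demo_info_t *info, size_t ninfo,
                                      const char *tested_key)
{
    demo_threadshift_t cd;
    demo_tscon(&cd);
    demo_scan_info(&cd, info, ninfo, tested_key);
    return !cd.nondefault;
}
```

In this model, an event tagged with DEMO_OPAL_EVENT_NON_DEFAULT but tested against DEMO_PMIX_EVENT_NON_DEFAULT still reaches the default handler, which matches the symptom reported on RHEL7. Testing against the same key the sender used keeps the default handler out of the picture.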