I found the reason for the notification and fixed that as well - should all be 
done now
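
For archive readers, here is a minimal sketch of the two client-side problems identified in the quoted messages below: the caddy constructor (tscon) leaving nondefault uninitialized, and the receiving code testing a different key than the one the sender attached. Every type name, function name, and key string in it is a hypothetical stand-in rather than the actual pmix2x.c code. The only assumption carried over from the thread is that the two key constants expand to different strings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key strings: the real values live in the OPAL and PMIx
 * headers and are not quoted in this thread. The sketch only needs the
 * two constants to be distinct. */
#define SKETCH_OPAL_PMIX_EVENT_NON_DEFAULT "opal.sketch.nondefault"
#define SKETCH_PMIX_EVENT_NON_DEFAULT      "pmix.sketch.nondefault"

/* Stand-in for one info entry attached to a notification. */
typedef struct {
    const char *key;
} sketch_info_t;

/* Stand-in for the thread-shift caddy; only the flag matters here. */
typedef struct {
    bool nondefault;
} sketch_caddy_t;

/* Plays the role of the constructor: set the flag explicitly instead of
 * relying on whatever the allocator happened to return. */
void sketch_caddy_construct(sketch_caddy_t *cd)
{
    cd->nondefault = false;
}

/* Mark the caddy as non-default if any info entry carries wanted_key. */
void sketch_scan_info(sketch_caddy_t *cd, const sketch_info_t *info,
                      size_t ninfo, const char *wanted_key)
{
    for (size_t i = 0; i < ninfo; i++) {
        if (0 == strcmp(info[i].key, wanted_key)) {
            cd->nondefault = true;
        }
    }
}
```

With the sender attaching the OPAL-level key, scanning for the PMIx-level key never sets the flag, so the event falls through to the default handler. Scanning for the same key the sender used sets it as intended.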

> On Jul 16, 2016, at 10:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Kewl - thanks! I will take care of this, but to me the most pressing issue is 
> why this event notification is being generated at all. It shouldn’t be.
> 
>> On Jul 16, 2016, at 9:11 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> 
>> I finally got it :-)
>> 
>> In _send_notification() from orted_submit.c, the info key is 
>> OPAL_PMIX_EVENT_NON_DEFAULT, but pmix2x.c and pmix_ext20.c test for 
>> PMIX_EVENT_NON_DEFAULT.
>> If I use OPAL_PMIX_EVENT_NON_DEFAULT in pmix*, that fixes the issue.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Sunday, July 17, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>> Okay, I’ll investigate why that is happening - thanks!
>> 
>>> On Jul 16, 2016, at 7:45 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> 
>>> The parent job (i.e. the task that calls MPI_Comm_spawn) receives it.
>>> I cannot tell whether the child (i.e. the spawned task) receives it too.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On Saturday, July 16, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>> I can fix the initialization. What puzzles me is that no debugger_release 
>>> message should be sent unless a debugger is attached - in which case, the 
>>> event should be registered.
>>> 
>>> So why is it being sent? Is it the child job that is receiving it? Or is it 
>>> the parent?
>>> 
>>> 
>>>> On Jul 16, 2016, at 7:19 AM, Gilles Gouaillardet 
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>> 
>>>> I found some time to investigate this.
>>>> tscon should initialize nondefault to false in both pmix2x.c and 
>>>> pmix_ext20.c
>>>> 
>>>> A better workaround is to update ompi_errhandler_callback, so it does not 
>>>> invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE
>>>> 
>>>> That still seems counterintuitive to me ...
>>>> Does ERR stand for error? I did not find any error here ...
>>>> Should it be EVT for event instead? Should ERR not be fired in the first 
>>>> place?
>>>> Should Open MPI register a handler for this event (so nondefault is true 
>>>> and ompi_errhandler_callback is not invoked here)?
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Okay, I’ll take a look - thanks!
>>>> 
>>>>> On Jul 15, 2016, at 7:08 AM, Gilles Gouaillardet 
>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> Yep,
>>>>> 
>>>>> The constructor of pmix2x_threadshift_t (tscon) does not initialize 
>>>>> nondefault to false.
>>>>> I won't be able to investigate this until Monday, but so far, my guess is 
>>>>> that if the constructor is fixed, then RHEL6 will fail like RHEL7 ...
>>>>> 
>>>>> fwiw, intercomm_create used to fail in Cisco mtt because of too many 
>>>>> tasks and no oversubscription; now it fails because of this bug.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> On Friday, July 15, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> That would break debugger attach. Sounds to me like it’s just an 
>>>>> uninitialized variable for in_event_hdlr?
>>>>> 
>>>>> > On Jul 15, 2016, at 1:20 AM, Gilles Gouaillardet <gil...@rist.or.jp> 
>>>>> > wrote:
>>>>> >
>>>>> > Ralph,
>>>>> >
>>>>> > I noticed MPI_Comm_spawn is broken on master and on RHEL7.
>>>>> > For some reason I cannot yet explain, it works just fine on RHEL6 (!)
>>>>> >
>>>>> > mpirun -np 1 ./dynamic/intercomm_create
>>>>> >
>>>>> > from the ibm test suite can be used to reproduce the issue.
>>>>> >
>>>>> > I dug a bit and found that OPAL_ERR_DEBUGGER_RELEASE is fired in 
>>>>> > mpirun, and the tasks then receive a PMIX_ERR_DEBUGGER_RELEASE 
>>>>> > notification. It seems no event handler is registered, so the 
>>>>> > default handler kills the task.
>>>>> >
>>>>> > For the time being, a trivial workaround is not to fire 
>>>>> > OPAL_ERR_DEBUGGER_RELEASE in the first place (see patch below).
>>>>> >
>>>>> > Could you please have a look?
>>>>> > I am not sure whether clients should not be notified at all, or whether 
>>>>> > they should register a dummy handler.
>>>>> >
>>>>> > fwiw, in _event_hdlr, cd->nondefault is true on RHEL6 but false on 
>>>>> > RHEL7, and that might indicate a race condition.
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > Gilles
>>>>> >
>>>>> > diff --git a/orte/orted/orted_submit.c b/orte/orted/orted_submit.c
>>>>> > index b9d571c..0de0e79 100644
>>>>> > --- a/orte/orted/orted_submit.c
>>>>> > +++ b/orte/orted/orted_submit.c
>>>>> > @@ -2155,6 +2155,7 @@ static bool mpir_breakpoint_fired = false;
>>>>> >
>>>>> > static void _send_notification(void)
>>>>> > {
>>>>> > +#if 0
>>>>> >     opal_buffer_t buf;
>>>>> >     int status = OPAL_ERR_DEBUGGER_RELEASE;
>>>>> >     orte_grpcomm_signature_t sig;
>>>>> > @@ -2209,6 +2210,7 @@ static void _send_notification(void)
>>>>> >     }
>>>>> >     OBJ_DESTRUCT(&sig);
>>>>> >     OBJ_DESTRUCT(&buf);
>>>>> > +#endif
>>>>> > }
>>>>> >
>>>>> > static void orte_debugger_dump(void)
>>>>> >
>>>>> >
>>>>> >
>>>>> > _______________________________________________
>>>>> > devel mailing list
>>>>> > de...@open-mpi.org <>
>>>>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel 
>>>>> > <https://www.open-mpi.org/mailman/listinfo.cgi/devel>
>>>>> > Link to this post: 
>>>>> > http://www.open-mpi.org/community/lists/devel/2016/07/19214.php 
>>>>> > <http://www.open-mpi.org/community/lists/devel/2016/07/19214.php>
>>>>> 
> 
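
Gilles's suggested workaround for ompi_errhandler_callback (skip the abort when the status is a debugger release) might look roughly like the sketch below. The status values, the abort stub, and the callback signature are all invented for illustration, since the real callback's signature and the numeric value of OPAL_ERR_DEBUGGER_RELEASE are not shown in this thread.

```c
#include <stdbool.h>

/* Hypothetical status codes, chosen only for this sketch. */
enum {
    SKETCH_SUCCESS              = 0,
    SKETCH_ERR_DEBUGGER_RELEASE = -1,
    SKETCH_ERR_PROC_ABORTED     = -2
};

/* Counts how often the stubbed abort path was taken. */
int sketch_abort_calls = 0;

/* Stub standing in for the real abort routine; it only records the call. */
void sketch_abort(int status)
{
    (void)status;
    sketch_abort_calls++;
}

/* A debugger-release notification is informational, so it must not be
 * treated as fatal. Any other non-success status still aborts. */
bool sketch_should_abort(int status)
{
    if (SKETCH_SUCCESS == status) {
        return false;
    }
    if (SKETCH_ERR_DEBUGGER_RELEASE == status) {
        return false;
    }
    return true;
}

/* Stand-in for the default error callback invoked when no handler
 * has been registered for the event. */
void sketch_errhandler_callback(int status)
{
    if (sketch_should_abort(status)) {
        sketch_abort(status);
    }
}
```

Keeping the decision in a separate predicate makes it easy to add further informational statuses later without touching the abort path. As discussed in the thread, this only hides the symptom; the notification should not reach unregistered clients at all.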
