On Feb 6, 2014, at 2:16 PM, Adrian Reber <adr...@lisas.de> wrote:

> Josh explained it to me a few days ago, that after a checkpoint has been
> received TCP should no longer be used, so that no messages are lost. The
> communication happens over named pipes and therefore (I think) OOB
> ft_event() is used to quiet anything besides the pipes. This all seems
> to work but I was just confused as the functions for ft_event()
> in oob/tcp and oob/ud do not seem to contain any functionality.
> 
> So do I try to fix the ft_event() function in oob/base/ to call the
> registered ft_event() function which does nothing or do I just remove
> the call to orte oob ft_event()?

Sounds like you'll need to tell the OOB components to stop processing messages, 
so that will require that you insert an event into the system. You have to 
account for two things:

(a) the OOB base and OOB components are operating on the orte_event_base, but

(b) each OOB component can have multiple active modules (one per NIC) that are 
operating on their own event base/thread.

So you have to start by pushing an event that calls the OOB base, which then 
loops across the components calling their ft_event interface. Each component 
would then have to create an event for each active module, inserting that event 
into the module's event base/thread. When activated, each module would have to 
shut down its message engine, and activate another event to notify its component 
that all is quiet.

Once a component finds out that all its modules are quiet, it would then have 
to activate an event to the OOB base. Once the OOB base sees all components 
report quiet, then it would have to activate an event to take you to the next 
step in your process.
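
The cascade described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ORTE code: every name here (push_event, base_ft_event, module_quiesce, and so on) is hypothetical, and a single FIFO stands in for the orte_event_base and the per-module event bases/threads, which in the real system would be separate.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: none of these names are ORTE APIs. */
#define MAX_EVENTS 64
#define NUM_COMPONENTS 2
#define MODULES_PER_COMPONENT 2

typedef void (*event_cb_t)(int component, int module);
typedef struct { event_cb_t cb; int component; int module; } event_t;

/* One FIFO stands in for all event bases (assumption for brevity). */
static event_t queue[MAX_EVENTS];
static int q_head, q_tail;

static int modules_quiet[NUM_COMPONENTS]; /* quiet modules per component */
static int components_quiet;              /* quiet components seen by base */
static int oob_quiet;                     /* set once the whole OOB is quiet */

static void push_event(event_cb_t cb, int component, int module)
{
    assert(q_tail < MAX_EVENTS);
    queue[q_tail].cb = cb;
    queue[q_tail].component = component;
    queue[q_tail].module = module;
    q_tail++;
}

/* Next step of the checkpoint procedure: runs only after all is quiet. */
static void next_step(int component, int module)
{
    (void)component; (void)module;
    oob_quiet = 1;
}

/* Base: count quiet components, then activate the next step. */
static void base_component_quiet(int component, int module)
{
    (void)component; (void)module;
    if (++components_quiet == NUM_COMPONENTS) {
        push_event(next_step, -1, -1);
    }
}

/* Component: count quiet modules, then notify the base. */
static void component_module_quiet(int component, int module)
{
    (void)module;
    if (++modules_quiet[component] == MODULES_PER_COMPONENT) {
        push_event(base_component_quiet, component, -1);
    }
}

/* Module: shut down its message engine, then notify its component. */
static void module_quiesce(int component, int module)
{
    /* ... the message engine would be stopped here ... */
    push_event(component_module_quiet, component, module);
}

/* Component ft_event: one event per active module. */
static void component_ft_event(int component, int module)
{
    (void)module;
    for (int m = 0; m < MODULES_PER_COMPONENT; m++) {
        push_event(module_quiesce, component, m);
    }
}

/* Base ft_event: loop across the components. */
static void base_ft_event(int component, int module)
{
    (void)component; (void)module;
    for (int c = 0; c < NUM_COMPONENTS; c++) {
        push_event(component_ft_event, c, -1);
    }
}

/* Push the initial event and drain the queue; returns 1 if all went quiet. */
int run_quiesce_demo(void)
{
    q_head = q_tail = 0;
    memset(modules_quiet, 0, sizeof modules_quiet);
    components_quiet = 0;
    oob_quiet = 0;

    push_event(base_ft_event, -1, -1);
    while (q_head < q_tail) {
        event_t ev = queue[q_head++];
        ev.cb(ev.component, ev.module);
    }
    return oob_quiet;
}
```

Calling run_quiesce_demo() from a main() drives the whole chain. The point of the sketch is that no function waits on another: each one only pushes the next event and returns.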

In other words, you need to turn the quieting process into its own set of 
states and run it through the state machine. This is the only way to guarantee 
that you'll keep things orderly, and is the major change needed in the C/R 
procedure as it flows thru ORTE. You can't just progress thru a set of function 
calls as you'll inevitably run into a roadblock requiring that you wait for an 
event-driven process to complete.
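
As a rough illustration of running the quieting process as states, here is a minimal sketch. The state names, cr_activate(), and the handler table are all assumptions made up for this example and are not the ORTE state machine API. The repeated activation of the quiescing state stands in for quiet notifications arriving one at a time from three hypothetical modules.

```c
#include <assert.h>

/* Hypothetical states; not taken from ORTE. */
typedef enum {
    CR_STATE_CHECKPOINT_REQUESTED,
    CR_STATE_OOB_QUIESCING,
    CR_STATE_OOB_QUIET,
    CR_STATE_CHECKPOINT_READY,
    CR_STATE_COUNT
} cr_state_t;

typedef void (*cr_state_fn)(void);

#define CR_MAX_PENDING 16
static cr_state_t pending[CR_MAX_PENDING]; /* activated, not yet handled */
static int pending_head, pending_tail;
static int handled_count;                  /* number of states handled */
static int modules_still_busy;

/* Activating a state only queues it; the handler runs later from the loop. */
static void cr_activate(cr_state_t s)
{
    assert(pending_tail < CR_MAX_PENDING);
    pending[pending_tail++] = s;
}

static void on_requested(void)
{
    modules_still_busy = 3; /* assumed number of active modules */
    cr_activate(CR_STATE_OOB_QUIESCING);
}

/* Each pass models one module reporting quiet; never blocks waiting. */
static void on_quiescing(void)
{
    if (--modules_still_busy > 0) {
        cr_activate(CR_STATE_OOB_QUIESCING);
    } else {
        cr_activate(CR_STATE_OOB_QUIET);
    }
}

static void on_quiet(void)
{
    cr_activate(CR_STATE_CHECKPOINT_READY);
}

static void on_ready(void)
{
    /* Terminal state: the checkpoint itself would start here. */
}

static const cr_state_fn handlers[CR_STATE_COUNT] = {
    on_requested, on_quiescing, on_quiet, on_ready
};

/* Run the machine to completion; returns the last state handled. */
cr_state_t cr_run(void)
{
    cr_state_t last = CR_STATE_CHECKPOINT_REQUESTED;

    pending_head = pending_tail = 0;
    handled_count = 0;
    cr_activate(CR_STATE_CHECKPOINT_REQUESTED);

    while (pending_head < pending_tail) {
        last = pending[pending_head++];
        handled_count++;
        handlers[last]();
    }
    return last;
}
```

Each handler returns immediately after activating its successor, so a step that has to wait on an event-driven process simply becomes another state instead of a blocking call.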

HTH
Ralph

> 
> On Thu, Feb 06, 2014 at 10:49:25AM -0800, Ralph Castain wrote:
>> The only reason I can think of for an OOB ft-event would be to tell the OOB 
>> to stop sending any messages. You would need to push that into the event 
>> library and use a callback event to let you know when it was done.
>> 
>> Of course, once you did that, the OOB would no longer be available to, for 
>> example, tell the local daemon that the app is ready for checkpoint :-)
>> 
>> Afraid I'll have to defer to Josh H for any further guidance.
>> 
>> 
>> On Feb 6, 2014, at 8:15 AM, Adrian Reber <adr...@lisas.de> wrote:
>> 
>>> When I initially made the C/R code compile again I made the following
>>> change:
>>> 
>>> diff --git a/orte/mca/rml/oob/rml_oob_component.c b/orte/mca/rml/oob/rml_oob_component.c
>>> index f0b22fc..90ed086 100644
>>> --- a/orte/mca/rml/oob/rml_oob_component.c
>>> +++ b/orte/mca/rml/oob/rml_oob_component.c
>>> @@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
>>>        ;
>>>    }
>>> 
>>> -    if( ORTE_SUCCESS != 
>>> -        (ret = orte_oob.ft_event(state)) ) {
>>> +    if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
>>>        ORTE_ERROR_LOG(ret);
>>>        exit_status = ret;
>>>        goto cleanup;
>>> 
>>> 
>>> 
>>> This is, of course, wrong. Now the function calls itself recursively until
>>> it crashes. Looking at orte/mca/oob there is still a ft_event()
>>> function, but it is disabled using "#if 0". Looking at other functions
>>> it seems I would need to create something like
>>> 
>>> #define ORTE_OOB_FT_EVENT(m)
>>> 
>>> Looking at the modules in orte/mca/oob/ it seems ft_event is implemented
>>> in some places but it never seems to have any real functionality. Is
>>> ft_event() actually needed there?
>>> 
>>>             Adrian
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel