On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
> On Feb 6, 2014, at 2:16 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Josh explained to me a few days ago that after a checkpoint has been
> > received, TCP should no longer be used, so that no messages are lost. The
> > communication happens over named pipes and therefore (I think) OOB
> > ft_event() is used to quiet anything besides the pipes. This all seems
> > to work, but I was confused because the ft_event() functions
> > in oob/tcp and oob/ud do not seem to contain any functionality.
> > 
> > So do I try to fix the ft_event() function in oob/base/ to call the
> > registered ft_event() functions, which do nothing, or do I just remove
> > the call to orte oob ft_event()?
> 
> Sounds like you'll need to tell the OOB components to stop processing 
> messages, so that will require that you insert an event into the system. You 
> have to account for two things:
> 
> (a) the OOB base and OOB components are operating on the orte_event_base, but
> 
> (b) each OOB component can have multiple active modules (one per NIC) that 
> are operating on their own event base/thread.
> 
> So you have to start by pushing an event that calls the OOB base, which then 
> loops across the components calling their ft_event interface. Each component 
> would then have to create an event for each active module, inserting that 
> event into the module's event base/thread. When activated, each module would 
> have to shut down its message engine, and activate another event to notify its 
> component that all is quiet.
> 
> Once a component finds out that all its modules are quiet, it would then have 
> to activate an event to the OOB base. Once the OOB base sees all components 
> report quiet, then it would have to activate an event to take you to the next 
> step in your process.
> 
> In other words, you need to turn the quieting process into its own set of 
> states and run it through the state machine. This is the only way to 
> guarantee that you'll keep things orderly, and is the major change needed in 
> the C/R procedure as it flows thru ORTE. You can't just progress thru a set 
> of function calls as you'll inevitably run into a roadblock requiring that 
> you wait for an event-driven process to complete.
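
The quieting sequence described above can be sketched as a small standalone
C model. This is only an illustration under loud assumptions: every name in
it (event_queue_t, base_t, component_t, module_t, base_start_quiesce, and so
on) is hypothetical and not an ORTE or OPAL API, and a single FIFO stands in
for the orte_event_base and for the per-module event bases/threads, so no
real threading is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model only: none of these names exist in ORTE/OPAL. */
#define MAX_EVENTS 64

typedef struct event {
    void (*cb)(void *arg);
    void *arg;
} event_t;

/* Trivial FIFO standing in for all event bases (single-threaded). */
typedef struct event_queue {
    event_t ev[MAX_EVENTS];
    size_t head, tail;
} event_queue_t;

static void push_event(event_queue_t *q, void (*cb)(void *), void *arg) {
    assert(q->tail < MAX_EVENTS);
    q->ev[q->tail].cb = cb;
    q->ev[q->tail].arg = arg;
    q->tail++;
}

/* Drain the queue; callbacks may push further events while it runs. */
static void run_events(event_queue_t *q) {
    while (q->head < q->tail) {
        event_t e = q->ev[q->head++];
        e.cb(e.arg);
    }
}

typedef struct component component_t;

typedef struct base {
    event_queue_t *q;
    int pending_components;   /* components that have not reported quiet */
    int all_quiet;            /* set once every component is quiet */
} base_t;

typedef struct module {
    component_t *comp;
    int engine_running;       /* 1 while the message engine is active */
} module_t;

struct component {
    base_t *base;
    module_t *modules;
    int nmodules;
    int pending_modules;      /* modules that have not reported quiet */
};

/* Base level: one component reports quiet. */
static void base_component_quiet(void *arg) {
    base_t *b = arg;
    if (--b->pending_components == 0) {
        b->all_quiet = 1;     /* here the next C/R state would be activated */
    }
}

/* Component level: one module reports quiet. */
static void component_module_quiet(void *arg) {
    component_t *c = arg;
    if (--c->pending_modules == 0) {
        push_event(c->base->q, base_component_quiet, c->base);
    }
}

/* Module level: stop the message engine, then notify the component. */
static void module_quiesce(void *arg) {
    module_t *m = arg;
    m->engine_running = 0;
    push_event(m->comp->base->q, component_module_quiet, m->comp);
}

/* Entry point: fan out one quiesce event per active module. */
static void base_start_quiesce(base_t *b, component_t *comps, int ncomps) {
    b->pending_components = ncomps;
    b->all_quiet = (ncomps == 0);
    for (int i = 0; i < ncomps; i++) {
        component_t *c = &comps[i];
        c->base = b;
        c->pending_modules = c->nmodules;
        if (c->nmodules == 0) {           /* nothing to stop: already quiet */
            push_event(b->q, base_component_quiet, b);
            continue;
        }
        for (int j = 0; j < c->nmodules; j++) {
            c->modules[j].comp = c;
            push_event(b->q, module_quiesce, &c->modules[j]);
        }
    }
}
```

Calling base_start_quiesce() only queues work; nothing is quiet until the
queue has been drained with run_events(). That mirrors the point that the
procedure cannot be a straight chain of function calls.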

I tried to implement something like what you described. It is not yet
event-driven, but before continuing I wanted to get some feedback on
whether it is at least the right start:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706

I looked at the other ORTE_OOB_* macros and tried to model my
functionality on what I saw there. Right now it is still a simple
function that just tries to call ft_event() on all OOB components.
Does this look right so far?
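
For readers without access to the commit, the non-event-driven approach
described here (one function that walks the components and calls ft_event()
on each) might look roughly like the sketch below. It is not the code from
the commit: oob_component_t, oob_base_ft_event, and the demo callbacks are
hypothetical names invented for illustration, and the real ORTE component
structures differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OOB component; not the real ORTE struct. */
typedef struct oob_component {
    const char *name;
    int (*ft_event)(int state);   /* NULL when the component has no handler */
} oob_component_t;

/* Call ft_event(state) on every component that provides one.
 * Returns 0 on success, or the first non-zero code a component returns. */
static int oob_base_ft_event(const oob_component_t *comps, size_t n, int state) {
    for (size_t i = 0; i < n; i++) {
        if (comps[i].ft_event == NULL) {
            continue;             /* component does not implement ft_event */
        }
        int rc = comps[i].ft_event(state);
        if (rc != 0) {
            return rc;            /* stop at the first failure */
        }
    }
    return 0;
}

/* Demo callbacks used only to exercise the loop. */
static int demo_calls = 0;

static int demo_ok_ft_event(int state) {
    (void)state;
    demo_calls++;
    return 0;
}

static int demo_fail_ft_event(int state) {
    (void)state;
    demo_calls++;
    return -1;
}
```

Because this loop returns as soon as the last callback returns, it cannot
wait for modules running on their own event bases to become quiet, which is
the limitation the event-driven design is meant to address.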

                Adrian
