On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
> On Feb 6, 2014, at 2:16 PM, Adrian Reber <adr...@lisas.de> wrote:
>
> > Josh explained to me a few days ago that after a checkpoint has been
> > received, TCP should no longer be used, so that no messages are lost.
> > The communication happens over named pipes and therefore (I think) OOB
> > ft_event() is used to quiet anything besides the pipes. This all seems
> > to work, but I was just confused as the functions for ft_event()
> > in oob/tcp and oob/ud do not seem to contain any functionality.
> >
> > So do I try to fix the ft_event() function in oob/base/ to call the
> > registered ft_event() function, which does nothing, or do I just remove
> > the call to orte oob ft_event()?
>
> Sounds like you'll need to tell the OOB components to stop processing
> messages, so that will require that you insert an event into the system. You
> have to account for two things:
>
> (a) the OOB base and OOB components are operating on the orte_event_base, but
>
> (b) each OOB component can have multiple active modules (one per NIC) that
> are operating on their own event base/thread.
>
> So you have to start by pushing an event that calls the OOB base, which then
> loops across the components calling their ft_event interface. Each component
> would then have to create an event for each active module, inserting that
> event into the module's event base/thread. When activated, each module would
> have to shut down its message engine, and activate another event to notify
> its component that all is quiet.
>
> Once a component finds out that all its modules are quiet, it would then have
> to activate an event to the OOB base. Once the OOB base sees all components
> report quiet, then it would have to activate an event to take you to the next
> step in your process.
>
> In other words, you need to turn the quieting process into its own set of
> states and run it through the state machine. This is the only way to
> guarantee that you'll keep things orderly, and is the major change needed in
> the C/R procedure as it flows thru ORTE. You can't just progress thru a set
> of function calls as you'll inevitably run into a roadblock requiring that
> you wait for an event-driven process to complete.
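
The quieting sequence described above can be sketched as a small model. This is only an illustration of the control flow, not ORTE code: it is written in Python rather than C, every class and function name (EventBase, Module, Component, OobBase, run_quiesce) is hypothetical, and the per-module event bases are driven round-robin in a single thread instead of running in their own threads.

```python
from collections import deque


class EventBase:
    """Toy stand-in for an event base: a FIFO of pending callbacks."""

    def __init__(self):
        self.pending = deque()

    def activate(self, callback, *args):
        self.pending.append((callback, args))

    def run_one(self):
        """Run one queued callback; return False if the queue was empty."""
        if not self.pending:
            return False
        callback, args = self.pending.popleft()
        callback(*args)
        return True


class Module:
    """One active module (e.g. one per NIC) with its own event base."""

    def __init__(self, name, component):
        self.name = name
        self.component = component
        self.event_base = EventBase()
        self.engine_running = True

    def quiesce(self):
        # Runs on the module's own event base: stop the message engine,
        # then notify the component through the shared event base.
        self.engine_running = False
        comp = self.component
        comp.shared_base.activate(comp.module_quiet, self)


class Component:
    def __init__(self, name, nics, oob_base):
        self.name = name
        self.oob_base = oob_base
        self.shared_base = oob_base.shared_base
        self.modules = [Module(f"{name}/{nic}", self) for nic in nics]
        self.waiting_on = set()

    def ft_event(self):
        # One quiesce event per active module, pushed onto that module's base.
        self.waiting_on = {m.name for m in self.modules}
        for m in self.modules:
            m.event_base.activate(m.quiesce)
        if not self.modules:  # nothing to wait for
            self._report_quiet()

    def module_quiet(self, module):
        self.waiting_on.discard(module.name)
        if not self.waiting_on:
            self._report_quiet()

    def _report_quiet(self):
        self.shared_base.activate(self.oob_base.component_quiet, self)


class OobBase:
    def __init__(self, shared_base, layout, next_state):
        self.shared_base = shared_base
        self.next_state = next_state
        self.components = [Component(n, nics, self) for n, nics in layout.items()]
        self.waiting_on = set()

    def ft_event(self):
        # Loop across the components, calling their ft_event interface.
        self.waiting_on = {c.name for c in self.components}
        for c in self.components:
            c.ft_event()
        if not self.components:
            self.shared_base.activate(self.next_state)

    def component_quiet(self, component):
        self.waiting_on.discard(component.name)
        if not self.waiting_on:
            # Every component reported quiet: move on to the next step.
            self.shared_base.activate(self.next_state)


def run_quiesce(layout):
    """Push the initial event, then drive all event bases until idle."""
    trace = []
    shared = EventBase()  # plays the role of the shared orte_event_base
    oob = OobBase(shared, layout, lambda: trace.append("all quiet: next C/R state"))
    shared.activate(oob.ft_event)
    bases = [shared] + [m.event_base for c in oob.components for m in c.modules]
    # The list (not a generator) makes every base take a turn each round.
    while any([base.run_one() for base in bases]):
        pass
    return oob, trace
```

Calling run_quiesce({"tcp": ["eth0", "eth1"], "ud": ["ib0"]}) walks the chain module -> component -> OOB base and records the transition to the next step only after every module has stopped its message engine. The point of the sketch is that no caller ever blocks waiting for a module: each stage finishes by activating an event for the next stage.
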

I tried to implement something like you described. It is not yet event
driven, but before continuing I wanted to get some feedback on whether it is
at least the right start:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706

I looked at the other ORTE_OOB_* macros and tried to model my functionality
a bit after what I have seen there. Right now it is still a simple function
which just tries to call ft_event() on all oob components.

Does this look right so far?

		Adrian