Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Ralph Castain
On Mar 10, 2014, at 1:29 PM, Adrian Reber wrote:
> On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
>>> If you like, I can define the required code in the trunk and let you
>>> fill in the event functionality.
>>
>> That would be great.
>
> Thanks for you

Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Adrian Reber
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
> > If you like, I can define the required code in the trunk and let you
> > fill in the event functionality.
> > That would be great.
> >>>
> >>> Thanks for your changes. When using --with-ft there are a few compil

Re: [OMPI devel] C/R and orte_oob

2014-03-07 Thread Ralph Castain
On Mar 7, 2014, at 3:07 AM, Adrian Reber wrote:
> On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
>>> Sorry for delay - yes, that looks like the right direction. I would
>>> suggest doing it via the current state machine, though, by simply
>>> defining another job or

Re: [OMPI devel] C/R and orte_oob

2014-03-07 Thread Adrian Reber
On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
> > Sorry for delay - yes, that looks like the right direction. I would
> > suggest doing it via the current state machine, though, by simply
> > defining another job or proc state in orte/mca/plm/plm_types.h, and
> >

Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Ralph Castain
On Mar 6, 2014, at 1:02 PM, Adrian Reber wrote:
> On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote:
>> I tried to implement something like you described. It is not yet event
>> driven, but before continuing I wanted to get some feedback if it is at
>> least the right star

Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Adrian Reber
On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote:
> > >>> I tried to implement something like you described. It is not yet event
> > >>> driven, but before continuing I wanted to get some feedback if it is at
> > >>> least the right start:
> > >>>
> > >>> https://lisas.de/git/?p=open-m

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 06:39:12AM -0800, Ralph Castain wrote:
> On Feb 18, 2014, at 6:24 AM, Adrian Reber wrote:
>
> > On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> >> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote:
> >>> I tried to implement something like you described. I

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Ralph Castain
On Feb 18, 2014, at 6:24 AM, Adrian Reber wrote:
> On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
>> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote:
>>> I tried to implement something like you described. It is not yet event
>>> driven, but before continuing I wanted to get som

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote:
> > I tried to implement something like you described. It is not yet event
> > driven, but before continuing I wanted to get some feedback if it is at
> > least the right start:
> >

Re: [OMPI devel] C/R and orte_oob

2014-02-14 Thread Ralph Castain
On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote:
> On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
>> On Feb 6, 2014, at 2:16 PM, Adrian Reber wrote:
>>
>>> Josh explained it to me a few days ago, that after a checkpoint has been
>>> received TCP should no longer be used to not

Re: [OMPI devel] C/R and orte_oob

2014-02-13 Thread Adrian Reber
On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
> On Feb 6, 2014, at 2:16 PM, Adrian Reber wrote:
>
> > Josh explained it to me a few days ago, that after a checkpoint has been
> > received TCP should no longer be used to not lose any messages. The
> > communication happens over na

Re: [OMPI devel] C/R and orte_oob

2014-02-07 Thread Josh Hursey
In the original implementation, the OOB ft_event did not do much of anything on checkpoint preparation and continue. We did not even close the sockets. However, during restart the OOB will need to renegotiate the socket connections - usually by calling the finalization function (close stale sockets

Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Ralph Castain
On Feb 6, 2014, at 2:16 PM, Adrian Reber wrote:
> Josh explained it to me a few days ago, that after a checkpoint has been
> received TCP should no longer be used to not lose any messages. The
> communication happens over named pipes and therefore (I think) OOB
> ft_event() is used to quiet anyt

Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
Josh explained it to me a few days ago, that after a checkpoint has been received TCP should no longer be used to not lose any messages. The communication happens over named pipes and therefore (I think) OOB ft_event() is used to quiet anything besides the pipes. This all seems to work but I was ju

Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Ralph Castain
The only reason I can think of for an OOB ft-event would be to tell the OOB to stop sending any messages. You would need to push that into the event library and use a callback event to let you know when it was done. Of course, once you did that, the OOB would no longer be available to, for exam

[OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
When I initially made the C/R code compile again I made the following change:

diff --git a/orte/mca/rml/oob/rml_oob_component.c b/orte/mca/rml/oob/rml_oob_component.c
index f0b22fc..90ed086 100644
--- a/orte/mca/rml/oob/rml_oob_component.c
+++ b/orte/mca/rml/oob/rml_oob_component.c
@@ -185,8 +185,7 @