Good points.

You know, this checkpoint stuff is all on the agenda to discuss next week at 
the OMPI dev meeting in Chicago.  Ralph correctly points out that since the 
fundamental design of ORTE has changed (which caused all this FT bit rot), a 
bunch of the original FT design isn't right (or necessary) any more, anyway.  
We need to talk through this stuff to figure out where to go.

Adrian: do you want to join us at the meeting via webex?  I think you're in 
Germany; we can do this part of the OMPI dev meeting first thing Friday morning 
US Central time, which would put it mid/late-afternoon for you.  It would 
probably be good for us to be introduced to you, and for you to hear all the 
discussion about how we think the FT design will need to be changed, etc.

    https://svn.open-mpi.org/trac/ompi/wiki/Dec13Meeting



On Dec 6, 2013, at 9:30 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> Since the blocking semantics are important for correctness of the prior code, 
> I would not just replace send_buffer with send_buffer_nb. This makes the 
> semantics incorrect, and will make things confusing later when you try to 
> sort out prior calls to send_buffer_nb with those that you replaced.
> 
> As an alternative I would suggest that you "#ifdef 0" out those sections of 
> code and leave the send_buffer_nb alternative in a comment. Then leave a big 
> TODO comment there for you to go back and fix the semantics - which will 
> likely involve just rewriting large sections of that framework. But at least 
> you will be able to see what was there before when you try to move it to a 
> more nonblocking model.
> 
> The bkmrk component is subtle, maybe more that it should be. So keeping the 
> old blocking interfaces there will probably help quite a bit when you get to 
> it later. In that component the blocking calls are critical to correctness, 
> so we will need to sort out how to make that more asynchronous in our 
> redesign.
> 
> Other than that modification (#ifdef comments instead of nonblocking 
> replacements), I think this patch is fine. As was mentioned previously, we 
> will need to go back (after things compile) and figure out a new model for 
> this behavior.
> 
> Thanks!
> Josh
> 
> 
> 
> On Wed, Dec 4, 2013 at 9:58 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> Err... upon further thought, I might be totally wrong about emulating 
> blocking.  There might be (probably are?) rules/assumptions in the ORTE layer 
> (of which I am *not* an expert) that disallow you from [emulating] blocking.
> 
> If that's the case, then there's architectural issues with converting from 
> blocking to nonblocking on both the sending and the receiving sides that 
> might be a bit thorny to sort out.
> 
> 
> 
> On Dec 4, 2013, at 10:54 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
> wrote:
> 
> > On Nov 25, 2013, at 9:59 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> >> * Send Non-blocking
> >> */
> >> int orte_rml_ftrm_send_nb(orte_process_name_t* peer,
> >>                          struct iovec* msg,
> >>                          int count,
> >>                          orte_rml_tag_t tag,
> >> -                          int flags,
> >>                          orte_rml_callback_fn_t cbfunc,
> >>                          void* cbdata)
> >> {
> >>    int ret;
> >>
> >>    opal_output_verbose(20, rml_ftrm_output_handle,
> >> -                        "orte_rml_ftrm: send_nb(%s, %d, %d, %d )",
> >> -                        ORTE_NAME_PRINT(peer), count, tag, flags);
> >> +                        "orte_rml_ftrm: send_nb(%s, %d, %d )",
> >> +                        ORTE_NAME_PRINT(peer), count, tag);
> >>
> >>    if( NULL != orte_rml_ftrm_wrapped_module.send_nb ) {
> >> -        if( ORTE_SUCCESS != (ret = 
> >> orte_rml_ftrm_wrapped_module.send_nb(peer, msg, count, tag, flags, cbfunc, 
> >> cbdata) ) ) {
> >> -            return ret;
> >> -        }
> >> -    }
> >> -
> >> -    return ORTE_SUCCESS;
> >> -}
> >> -
> >> -/*
> >> - * Send Buffer
> >> - */
> >> -int orte_rml_ftrm_send_buffer(orte_process_name_t* peer,
> >> -                              opal_buffer_t* buffer,
> >> -                              orte_rml_tag_t tag,
> >> -                              int flags)
> >> -{
> >> -    int ret;
> >> -
> >> -    opal_output_verbose(20, rml_ftrm_output_handle,
> >> -                        "orte_rml_ftrm: send_buffer(%s, %d, %d )",
> >> -                        ORTE_NAME_PRINT(peer), tag, flags);
> >> -
> >> -    if( NULL != orte_rml_ftrm_wrapped_module.send_buffer ) {
> >> -        if( ORTE_SUCCESS != (ret = 
> >> orte_rml_ftrm_wrapped_module.send_buffer(peer, buffer, tag, flags) ) ) {
> >> +        if( ORTE_SUCCESS != (ret = 
> >> orte_rml_ftrm_wrapped_module.send_nb(peer, msg, count, tag, cbfunc, 
> >> cbdata) ) ) {
> >>            return ret;
> >>        }
> >>    }
> >
> > Similar to my reply about patch 3, I don't think this hunk is correct.
> >
> > This routine accepts an iovec and sends it in a non-blocking fashion.  I'll 
> > bet that the caller frees the iovec upon return from the function (because 
> > it used to be a blocking send, and freeing it immediately was acceptable).
> >
> > But now the iovec may well still be in use when this function returns, so 
> > the caller should *not* free/reuse the iovec until it knows that the send 
> > has complete.
> >
> > It may be more desirable to keep the blocking send function 
> > orte_rml_ftrm_send_buffer() and emulate blocking by invoking send_nb under 
> > the covers, but then not returning until the send callback has actually 
> > been invoked.
> >
> > Then the blocking semantics expected by the caller may well be 
> > acceptable/safe.
> >
> > This loses some potential optimizations of asynchronicity, but it may be 
> > worth it: a) performance in this part of the code isn't too critical, and 
> > b) blocking semantics are usually simpler and easier to maintain, from the 
> > caller's perspective.
> >
> > This idea may also apply to what I said in reply to patch 3...?  (i.e., 
> > preserve a blocking send by using the _nb variant under the covers, but not 
> > returning until the nonblocking variant has actually completed the receive).
> >
> > Since this is a fairly large change, I didn't look too closely throughout 
> > the rest of this patch.  I assume that there are a few other architectural 
> > cases similar to this one.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Reply via email to