On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
> 
> 
> 
> On 8/13/07 3:52 PM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> 
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > Here are the items we have identified:
> > 
> > All those things sound very promising. Is there a tmp branch where you
> > are going to work on this?
> 
> > 
> >
> 
>  tmp/latency
> 
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.  Do
> you have cycles to help with this?
I am very interested, though I'm not sure about cycles. I'll be back from
my vacation next week and will look over this list one more time to see where
I can help.

> 
> Rich
> 
> > ----------------------------------------------------------------------------
> > 
> > 1) Remove the 0-byte optimization of not initializing the convertor.
> >
> > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if" in
> > mca_pml_ob1_send_request_start_copy.
> > +++
> > Measure the convertor initialization before taking any other action.
> >
> > ----------------------------------------------------------------------------
> > 
> > 
> > ----------------------------------------------------------------------------
> > 
> > 2) Get rid of mca_pml_ob1_send_request_start_prepare and
> > mca_pml_ob1_send_request_start_copy by removing the
> > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > return OMPI_SUCCESS if the fragment can be marked as completed and
> > OMPI_NOT_ON_WIRE if it cannot. This also solves another problem: with IB,
> > if there are a bunch of isends outstanding we end up buffering them all
> > in the BTL and marking them complete, but they never get on the wire
> > because the BTL runs out of credits, and we never get credits back until
> > finalize because we never call progress, since the requests are already
> > complete. There is one issue here: start_prepare calls prepare_src and
> > start_copy calls alloc. I think we can work around this by always using
> > prepare_src; the OpenIB BTL will give a fragment off the free list anyway
> > because the fragment is less than the eager limit.
> > +++
> > Make the BTL return different return codes for the send. If the fragment
> > is gone, then the PML is responsible for marking the MPI request as
> > completed, and so on. Only the updated BTLs will get any benefit from
> > this feature. Add a flag to the descriptor that controls whether the BTL
> > is allowed to free the fragment.
> >
> > Add a 3-level flag:
> > - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after the
> >   send, and the BTL then reports a special return code back to the PML
> > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released by
> >   the BTL once the completion callback has been triggered
> > - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment at
> >   all (the PML is responsible for this)
> >
> > Return codes:
> > - done, and there will be no callback
> > - not done, wait for a callback later
> > - error state
> >
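For concreteness, a small standalone model of the proposed return-code and
ownership scheme: the ownership names follow the text above, while
toy_btl_send and the SEND_* codes are invented stand-ins for btl_send
returning OMPI_SUCCESS versus the proposed OMPI_NOT_ON_WIRE.

    #include <stdio.h>

    /* Proposed descriptor ownership levels (names taken from the item above). */
    enum ownership {
        BTL_HAVE_OWNERSHIP,                /* BTL may free the fragment right after the send */
        BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees it once the completion callback ran   */
        PML_HAVE_OWNERSHIP                 /* only the PML may free the fragment              */
    };

    /* Proposed send return codes (illustrative values). */
    enum send_rc {
        SEND_DONE_NO_CALLBACK =  0,  /* fragment gone, no callback will follow      */
        SEND_NOT_ON_WIRE      =  1,  /* buffered in the BTL, wait for the callback  */
        SEND_ERROR            = -1   /* error state                                 */
    };

    /* Toy stand-in for btl_send(); a real BTL would decide based on credits, size, etc. */
    static enum send_rc toy_btl_send(int have_credits)
    {
        return have_credits ? SEND_DONE_NO_CALLBACK : SEND_NOT_ON_WIRE;
    }

    int main(void)
    {
        switch (toy_btl_send(1)) {
        case SEND_DONE_NO_CALLBACK:
            printf("PML marks the MPI request complete immediately\n");
            break;
        case SEND_NOT_ON_WIRE:
            printf("PML leaves the request pending; the completion callback finishes it\n");
            break;
        default:
            printf("error state\n");
        }
        return 0;
    }

With this split, a fast isend never touches the completion machinery unless
the BTL actually had to buffer the fragment.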
> > ----------------------------------------------------------------------------
> > 
> > 
> > ----------------------------------------------------------------------------
> > 
> > 3) Change the remote callback function (and the tag value, based on what
> > data we are sending); don't use mca_pml_ob1_recv_frag_callback for
> > everything! I think we need:
> >
> >  mca_pml_ob1_recv_frag_match
> >  mca_pml_ob1_recv_frag_rndv
> >  mca_pml_ob1_recv_frag_rget
> >
> >  mca_pml_ob1_recv_match_ack_copy
> >  mca_pml_ob1_recv_match_ack_pipeline
> >
> >  mca_pml_ob1_recv_copy_frag
> >  mca_pml_ob1_recv_put_request
> >  mca_pml_ob1_recv_put_fin
> > +++
> > Passing the callback as a parameter to the match function will save us 2
> > switches. Add more registrations in the BTL in order to jump directly
> > into the correct function (the first 3 require a match while the others
> > don't). Use 4 + 4 bits on the tag so each layer will have 4 bits of tags
> > (i.e. the first 4 bits for the protocol tag, and the lower 4 bits are up
> > to the protocol), and the registration table will still be local to each
> > component.
> >
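A self-contained sketch of the 4 + 4 bit tag split and a per-tag callback
table; all names below are illustrative, not the existing registration API.

    #include <stdint.h>
    #include <stdio.h>

    /* Upper 4 bits select the protocol/component, lower 4 bits are private to it. */
    #define TAG_MAKE(proto, sub)  ((uint8_t)((((proto) & 0x0f) << 4) | ((sub) & 0x0f)))
    #define TAG_PROTO(tag)        (((tag) >> 4) & 0x0f)
    #define TAG_SUB(tag)          ((tag) & 0x0f)

    typedef void (*frag_cb_t)(uint8_t tag);

    /* One callback per tag value, so the receive path does a single table
     * lookup instead of a switch on the header type. */
    static frag_cb_t cb_table[256];

    static void recv_frag_match(uint8_t tag) { printf("match (tag 0x%02x)\n", (unsigned)tag); }
    static void recv_frag_rndv(uint8_t tag)  { printf("rndv  (tag 0x%02x)\n", (unsigned)tag); }

    int main(void)
    {
        enum { PROTO_PML_OB1 = 1 };                     /* illustrative protocol id  */
        cb_table[TAG_MAKE(PROTO_PML_OB1, 0)] = recv_frag_match;
        cb_table[TAG_MAKE(PROTO_PML_OB1, 1)] = recv_frag_rndv;

        uint8_t incoming = TAG_MAKE(PROTO_PML_OB1, 1);  /* tag carried by a fragment */
        cb_table[incoming](incoming);                   /* jump straight to the handler */
        return 0;
    }

The point is that the tag alone is enough to reach the right handler, while
the upper 4 bits still identify the owning component, so each component's
registration table stays local to it.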
> > ----------------------------------------------------------------------------
> > 
> > 
> > ----------------------------------------------------------------------------
> > 
> > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > switch on hdr->hdr_common.hdr_type that mca_pml_ob1_recv_frag_callback
> > does! I think what we can do here is modify mca_pml_ob1_recv_frag_match
> > to take a function pointer for what it should call on a successful match.
> > So, based on the receive callback, we can pass the correct scheduling
> > function to invoke into the generic mca_pml_ob1_recv_frag_match.
> >
> > Recv_request progress is called in a generic way from multiple places,
> > and we do a big switch inside. In the match function we might want to
> > pass a function pointer to the successful-match progress function. This
> > way we will be able to specialize what happens after the match in a more
> > optimized way. Or recv_request_match can return the match and then the
> > caller will have to specialize its action.
> >
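A minimal model of passing the post-match action into the generic match
routine; every identifier below is an invented stand-in for the ob1
functions named above.

    #include <stdio.h>

    struct recv_request { int matched; };

    typedef void (*progress_fn_t)(struct recv_request *req);

    static void progress_eager(struct recv_request *req) { (void)req; printf("deliver eager data\n"); }
    static void progress_rndv(struct recv_request *req)  { (void)req; printf("schedule rendezvous\n"); }

    /* One generic match routine; the caller supplies the specialization
     * instead of the callee switching on the header type afterwards. */
    static void recv_frag_match(struct recv_request *req, progress_fn_t on_match)
    {
        req->matched = 1;   /* real matching logic elided */
        on_match(req);      /* no big switch needed here  */
    }

    int main(void)
    {
        struct recv_request req = { 0 };
        recv_frag_match(&req, progress_eager);  /* what an eager-fragment callback might pass     */
        recv_frag_match(&req, progress_rndv);   /* what a rendezvous-fragment callback might pass */
        return 0;
    }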
> > ----------------------------------------------------------------------------
> > 
> > 
> > ----------------------------------------------------------------------------
> > 
> > 5) Don't initialize the entire request. If we use item 2 above (and get
> > OMPI_SUCCESS back from btl_send), then we don't need to fully initialize
> > the request; we need the convertor set up, but the rest we can pass down
> > the stack in order to set up the match header and the request if we get
> > OMPI_NOT_ON_WIRE back from btl_send.
> >
> > I think we need something like:
> >
> > MCA_PML_BASE_SEND_REQUEST_INIT_CONV
> >
> > and
> >
> > MCA_PML_BASE_SEND_REQUEST_INIT_FULL
> >
> > so the first macro just sets up the convertor, and the second populates
> > all the rest of the request state in case we need it later because the
> > fragment doesn't hit the wire.
> > +++
> > We all agreed.
> >
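A toy illustration of the two-macro split; the macro names mirror the
proposal, but the request structure and its fields are invented for the
example.

    #include <stdio.h>

    /* Hypothetical, simplified request; the real send request carries far
     * more state than this. */
    struct send_request {
        int convertor_ready;    /* stands in for the convertor setup            */
        int tag, dst, seq;      /* stands in for the rest of the request state  */
    };

    /* Light macro: only what the fast path needs to put data on the wire. */
    #define SEND_REQUEST_INIT_CONV(req) \
        do { (req)->convertor_ready = 1; } while (0)

    /* Full macro: the remaining state, paid for only when actually required. */
    #define SEND_REQUEST_INIT_FULL(req, t, d, s) \
        do { (req)->tag = (t); (req)->dst = (d); (req)->seq = (s); } while (0)

    int main(void)
    {
        struct send_request req = { 0, 0, 0, 0 };
        int fragment_hit_the_wire = 0;   /* pretend btl_send said "not on wire" */

        SEND_REQUEST_INIT_CONV(&req);    /* always needed to describe the data  */
        if (!fragment_hit_the_wire) {
            /* Only now set up the match header / bookkeeping state. */
            SEND_REQUEST_INIT_FULL(&req, 42, 1, 7);
        }
        printf("conv ready=%d tag=%d dst=%d seq=%d\n",
               req.convertor_ready, req.tag, req.dst, req.seq);
        return 0;
    }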
> > ----------------------------------------------------------------------------
> > 
> > 
> > 
> > On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
> > 
> > > On Sun, 12 Aug 2007, Gleb Natapov wrote:
> > >
> > >>> Any objections?  We can discuss what approaches we want to take
> > >>> (there's going to be some complications because of the PML driver,
> > >>> etc.); perhaps in the Tuesday Mellanox teleconf...?
> > >>>
> > >> My main objection is that the only reason you propose to do this
> > >> is some bogus benchmark? Is there any other reason to implement
> > >> header caching? I also hope you don't propose to break layering
> > >> and somehow cache PML headers in BTL.
> > >
> > > Gleb is hitting the main points I wanted to bring up.  We had
> > > examined this header caching in the context of PSM a little while
> > > ago.  0.5us is much more than we had observed -- at 3GHz, 0.5us would
> > > be about 1500 cycles of code that has few branches.  For us, with a
> > > much bigger header and more fields to fetch from different
> > > structures, it was more like 350 cycles, which is on the order of
> > > 0.1us and not worth the effort (in code complexity, readability and
> > > frankly motivation for performance).  Maybe there's more to it than
> > > just "code caching" -- like sending from pre-pinned headers or using
> > > RDMA with immediate, etc.  But I'd be surprised to find out that the
> > > openib btl doesn't do the best thing here.
> > >
> > > I have pretty good evidence that for CM, the latency difference comes
> > > from the receive side (in particular opal_progress).  Doesn't the
> > > openib btl receive side do something similar with opal_progress,
> > > i.e. register a callback function?  It probably does something
> > > different, like checking a few RDMA mailboxes (or per-peer landing
> > > pads), but anything that gets called before or after it as part of
> > > opal_progress is cause for slowdown.
> > >
> > >     . . christian
> > >
> > > -- 
> > > christian.b...@qlogic.com
> > > (QLogic Host Solutions Group, formerly Pathscale)
> 
> --
>    Gleb.
> 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
                        Gleb.
