On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
> On 8/13/07 3:52 PM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > > Here are the items we have identified:
> >
> > All those things sound very promising. Is there a tmp branch where
> > you are going to work on this?
>
> tmp/latency
>
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.
> Do you have cycles to help with this?

I am very interested, not sure about cycles though. I'll get back from my
vacation next week and look over this list one more time to see where I
can help.
> Rich
>
> > ----------------------------------------------------------------------
> >
> > 1) Remove the 0-byte optimization of not initializing the convertor.
> >
> > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if"
> > in mca_pml_ob1_send_request_start_copy.
> > +++
> > Measure the cost of convertor initialization before taking any other
> > action.
> >
> > ----------------------------------------------------------------------
> >
> > 2) Get rid of mca_pml_ob1_send_request_start_prepare and
> > mca_pml_ob1_send_request_start_copy by removing the
> > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > return OMPI_SUCCESS if the fragment can be marked as completed and
> > OMPI_NOT_ON_WIRE if it cannot. This solves another problem as well:
> > with IB, if there are a bunch of isends outstanding, we end up
> > buffering them all in the BTL and marking completion, but they never
> > get on the wire because the BTL runs out of credits, and we never get
> > credits back until finalize because we never call progress -- the
> > requests are already complete. There is one issue here: start_prepare
> > calls prepare_src while start_copy calls alloc. I think we can work
> > around this by always using prepare_src; the OpenIB BTL will give out
> > a fragment off the free list anyway because the fragment is smaller
> > than the eager limit.
> > +++
> > Make the BTL return different return codes from the send. If the
> > fragment is gone, then the PML is responsible for marking the MPI
> > request as completed and so on. Only the updated BTLs will get any
> > benefit from this feature. Add a flag to the descriptor that does or
> > does not allow the BTL to free the fragment (see the sketch after
> > this item).
> >
> > Add a 3-level flag:
> > - BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after
> >   the send, which then reports a special return code back to the PML.
> > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released by
> >   the BTL once the completion callback has been triggered.
> > - PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment
> >   at all (the PML is responsible for it).
> >
> > Return codes:
> > - done, and there will be no callback
> > - not done, wait for a callback later
> > - error state
> >
> > ----------------------------------------------------------------------
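To make the proposed contract concrete, here is a toy model of it in plain
C. OMPI_NOT_ON_WIRE and the three ownership levels are just the names
proposed above, and the frag structure and its fields are invented for
illustration; none of this is the real OB1/BTL code.

    #include <stdio.h>

    #define OMPI_SUCCESS      0   /* fragment gone: no callback will fire    */
    #define OMPI_NOT_ON_WIRE  1   /* fragment buffered: callback fires later */

    /* the proposed 3-level ownership flag */
    enum frag_ownership {
        BTL_HAVE_OWNERSHIP,                /* BTL frees right after the send */
        BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees after its callback   */
        PML_HAVE_OWNERSHIP                 /* only the PML may free it       */
    };

    struct frag {
        enum frag_ownership owner;
        int no_credits;  /* toy stand-in for running out of IB send credits */
        int released;
    };

    static int btl_send(struct frag *f)
    {
        if (f->no_credits) {
            return OMPI_NOT_ON_WIRE; /* buffered; PML must wait for callback */
        }
        if (BTL_HAVE_OWNERSHIP == f->owner) {
            f->released = 1;         /* BTL may release the fragment now */
        }
        return OMPI_SUCCESS;         /* PML marks the MPI request complete */
    }

    int main(void)
    {
        struct frag f = { BTL_HAVE_OWNERSHIP, 0, 0 };
        switch (btl_send(&f)) {
        case OMPI_SUCCESS:
            puts("fragment gone: complete the request, no callback expected");
            break;
        case OMPI_NOT_ON_WIRE:
            puts("fragment buffered: completion deferred to the BTL callback");
            break;
        default:
            puts("error state");
        }
        return 0;
    }

The point of the split is that the common case (an eager send that fits on
the wire) pays for no callback bookkeeping at all, while the credit-starved
case gets a progress-driven completion instead of a premature one.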
> > 3) Change the remote callback function (and the tag value, based on
> > what data we are sending); don't use mca_pml_ob1_recv_frag_callback
> > for everything! I think we need:
> >
> > mca_pml_ob1_recv_frag_match
> > mca_pml_ob1_recv_frag_rndv
> > mca_pml_ob1_recv_frag_rget
> >
> > mca_pml_ob1_recv_match_ack_copy
> > mca_pml_ob1_recv_match_ack_pipeline
> >
> > mca_pml_ob1_recv_copy_frag
> > mca_pml_ob1_recv_put_request
> > mca_pml_ob1_recv_put_fin
> > +++
> > Passing the callback as a parameter to the match function will save
> > us two switches. Add more registrations in the BTL in order to jump
> > directly into the correct function (the first three require a match
> > while the others don't). Use a 4 & 4 bit split on the tag so each
> > layer has 4 bits [i.e. the first 4 bits are the protocol tag and the
> > lower 4 bits are up to the protocol], and the registration table will
> > still be local to each component (see the sketch after this list).
> >
> > ----------------------------------------------------------------------
> >
> > 4) Get rid of mca_pml_ob1_recv_request_progress; it does the same
> > switch on hdr->hdr_common.hdr_type that
> > mca_pml_ob1_recv_frag_callback does! I think what we can do here is
> > modify mca_pml_ob1_recv_frag_match to take a function pointer for
> > what it should call on a successful match. So, based on the receive
> > callback, we can pass the correct scheduling function to invoke into
> > the generic mca_pml_ob1_recv_frag_match.
> > +++
> > Recv_request_progress is called in a generic way from multiple
> > places, and we do a big switch inside. In the match function we might
> > want to pass a function pointer to the successful-match progress
> > function. This way we will be able to specialize what happens after
> > the match in a more optimized way. Alternatively, recv_request_match
> > can return the match and the caller will have to specialize its
> > action.
> >
> > ----------------------------------------------------------------------
> >
> > 5) Don't initialize the entire request. If we use item 2 above (and
> > get OMPI_SUCCESS back from btl_send) then we don't need to fully
> > initialize the request: we need the convertor set up, but the rest we
> > can pass down the stack in order to set up the match header, and we
> > only set up the full request if we get OMPI_NOT_ON_WIRE back from
> > btl_send.
> >
> > I think we need something like:
> >
> > MCA_PML_BASE_SEND_REQUEST_INIT_CONV
> > and
> > MCA_PML_BASE_SEND_REQUEST_INIT_FULL
> >
> > so the first macro just sets up the convertor, and the second
> > populates all the rest of the request state in case we need it later
> > because the fragment doesn't hit the wire.
> > +++
> > We all agreed.
> >
> > ----------------------------------------------------------------------
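A toy sketch of the 4/4-bit tag split and the per-component registration
table from item 3 (all macro and function names here are illustrative, not
the actual ob1 ones): the upper 4 bits select the protocol-level callback
directly, so no switch is needed, and the lower 4 bits stay private to that
protocol.

    #include <stdint.h>
    #include <stdio.h>

    #define TAG_PROTO(tag)  (((tag) >> 4) & 0x0f)  /* protocol tag (upper 4) */
    #define TAG_LOCAL(tag)  ((tag) & 0x0f)         /* protocol-private bits  */
    #define TAG_MAKE(p, l)  ((uint8_t)((((p) & 0x0f) << 4) | ((l) & 0x0f)))

    typedef void (*recv_cb_t)(uint8_t local_tag);

    static void recv_frag_match(uint8_t t)
    {
        printf("match path, sub-tag %u\n", (unsigned)t);
    }

    static void recv_frag_rndv(uint8_t t)
    {
        printf("rndv path, sub-tag %u\n", (unsigned)t);
    }

    /* registration table, local to the component, indexed by protocol tag */
    static recv_cb_t cb_table[16] = { recv_frag_match, recv_frag_rndv };

    int main(void)
    {
        uint8_t tag = TAG_MAKE(1, 3);              /* rndv, local tag 3 */
        cb_table[TAG_PROTO(tag)](TAG_LOCAL(tag));  /* direct dispatch   */
        return 0;
    }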
> > On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
> >
> > > On Sun, 12 Aug 2007, Gleb Natapov wrote:
> > >
> > > > > Any objections? We can discuss what approaches we want to take
> > > > > (there's going to be some complications because of the PML
> > > > > driver, etc.); perhaps in the Tuesday Mellanox teleconf...?
> > > >
> > > > My main objection is that the only reason you propose to do this
> > > > is some bogus benchmark. Is there any other reason to implement
> > > > header caching? I also hope you don't propose to break layering
> > > > and somehow cache PML headers in the BTL.
> > >
> > > Gleb is hitting the main points I wanted to bring up. We had
> > > examined this header caching in the context of PSM a little while
> > > ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us
> > > would be about 1500 cycles of code with few branches. For us, with
> > > a much bigger header and more fields to fetch from different
> > > structures, it was more like 350 cycles, which is on the order of
> > > 0.1us and not worth the effort (in code complexity, readability,
> > > and frankly motivation for performance). Maybe there's more to it
> > > than just "code caching" -- like sending from pre-pinned headers or
> > > using RDMA with immediate, etc. But I'd be surprised to find out
> > > that the openib btl doesn't do the best thing here.
> > >
> > > I have pretty good evidence that for CM, the latency difference
> > > comes from the receive side (in particular opal_progress). Doesn't
> > > the openib btl receive side do something similar with
> > > opal_progress, i.e. register a callback function? It probably does
> > > something different, like checking a few RDMA mailboxes (or
> > > per-peer landing pads), but anything that gets called before or
> > > after it as part of opal_progress is cause for slowdown.
> > >
> > > . . christian
> > >
> > > --
> > > christian.b...@qlogic.com
> > > (QLogic Host Solutions Group, formerly PathScale)

--
Gleb.
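A minimal, self-contained model of the progress loop Christian describes,
assuming a simple callback table: the real entry points are opal_progress()
and opal_progress_register(); every other name here is invented for
illustration. The point is that each registered callback is polled on every
progress call, so anything extra in the loop adds directly to receive-side
latency.

    #include <stdio.h>

    typedef int (*progress_cb_t)(void);

    static progress_cb_t callbacks[8];
    static int num_callbacks;

    /* components (e.g. a BTL) hook themselves into the progress loop */
    static void progress_register(progress_cb_t cb)
    {
        callbacks[num_callbacks++] = cb;
    }

    /* toy BTL poll: e.g. check a few RDMA mailboxes, report completions */
    static int btl_poll(void)
    {
        return 0;
    }

    /* every registered callback runs on every pass through the loop */
    static int progress(void)
    {
        int completed = 0;
        for (int i = 0; i < num_callbacks; ++i) {
            completed += callbacks[i]();
        }
        return completed;
    }

    int main(void)
    {
        progress_register(btl_poll);
        printf("completions this pass: %d\n", progress());
        return 0;
    }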