On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> Here are the items we have identified:
All those things sound very promising. Is there a tmp branch where you
are going to work on this?
>
> --------------------------------------------------------------------------
>
> 1) remove the 0-byte optimization of not initializing the convertor.
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy.
> +++
> Measure the convertor initialization before taking any other action.
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> 2) get rid of mca_pml_ob1_send_request_start_prepare and
> mca_pml_ob1_send_request_start_copy by removing the
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> return OMPI_SUCCESS if the fragment can be marked as completed, and
> OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> solves another problem: with IB, if there are a bunch of isends
> outstanding we end up buffering them all in the BTL and marking
> completion, but never get them on the wire because the BTL runs out
> of credits. We never get credits back until finalize, because we
> never call progress since the requests are already complete. There is
> one issue here: start_prepare calls prepare_src while start_copy
> calls alloc. I think we can work around this by always using
> prepare_src; the OpenIB BTL will give a fragment off the free list
> anyway because the fragment is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the
> fragment is gone, then the PML is responsible for marking the MPI
> request as completed and so on. Only the updated BTLs will get any
> benefit from this feature. Add a flag to the descriptor that tells
> the BTL whether it may free the fragment.
>
> Add a 3-level flag:
> - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> the send, and it then reports a special return code back to the PML
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> by the BTL once the completion callback has been triggered.
> - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
> at all (the PML is responsible for this).
>
> Return codes:
> - done and there will be no callbacks
> - not done, wait for a callback later
> - error state
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> 3) Change the remote callback function (and the tag value, based on
> what data we are sending); don't use mca_pml_ob1_recv_frag_callback
> for everything! I think we need:
>
> mca_pml_ob1_recv_frag_match
> mca_pml_ob1_recv_frag_rndv
> mca_pml_ob1_recv_frag_rget
>
> mca_pml_ob1_recv_match_ack_copy
> mca_pml_ob1_recv_match_ack_pipeline
>
> mca_pml_ob1_recv_copy_frag
> mca_pml_ob1_recv_put_request
> mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save
> us 2 switches. Add more registrations in the BTL in order to jump
> directly into the correct function (the first 3 require a match while
> the others don't). Split the tag 4 & 4 bits so each layer gets 4 bits
> [i.e. the upper 4 bits identify the protocol and the lower 4 bits are
> up to the protocol]; the registration table will still be local to
> each component. A sketch of this split follows below.
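
A minimal sketch of the proposed 4/4 tag split. All names here are
illustrative, not the actual OMPI symbols:

    #include <stdint.h>

    /* Upper 4 bits select the protocol; the lower 4 bits are private
     * to that protocol (hypothetical names, not the real tree). */
    #define PROTO_TAG_SHIFT 4
    #define PROTO_TAG_MASK  0x0f

    static inline uint8_t make_tag(uint8_t proto, uint8_t sub)
    {
        /* keep each field to its 4 bits */
        return (uint8_t)(((proto & PROTO_TAG_MASK) << PROTO_TAG_SHIFT) |
                         (sub & PROTO_TAG_MASK));
    }

    static inline uint8_t tag_proto(uint8_t tag)
    {
        return (uint8_t)(tag >> PROTO_TAG_SHIFT);
    }

    static inline uint8_t tag_sub(uint8_t tag)
    {
        return (uint8_t)(tag & PROTO_TAG_MASK);
    }

With this encoding each component registers its callbacks under its
own 4-bit protocol id, so the registration table stays local to the
component while the BTL can still jump directly to the right function.
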
> --------------------------------------------------------------------------
>
> 4) Get rid of mca_pml_ob1_recv_request_progress; it does the same
> switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> take a function pointer for what it should call on a successful
> match. So, based on the receive callback, we can pass the correct
> scheduling function to invoke into the generic
> mca_pml_ob1_recv_frag_match.
>
> recv_request_progress is called in a generic way from multiple
> places, and we do a big switch inside. In the match function we might
> want to pass a function pointer to the successful-match progress
> function. This way we will be able to specialize what happens after
> the match in a more optimized way. Alternatively, recv_request_match
> can return the match and the caller will then have to specialize its
> action. (See the sketch below, after item 5.)
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> 5) Don't initialize the entire request. Using item 2 above (if we get
> back OMPI_SUCCESS from btl_send), we don't need to fully initialize
> the request: we need the convertor set up, but the rest we can pass
> down the stack in order to set up the match header, and set up the
> request only if we get OMPI_NOT_ON_WIRE back from btl_send.
>
> I think we need something like:
> MCA_PML_BASE_SEND_REQUEST_INIT_CONV
>
> and
> MCA_PML_BASE_SEND_REQUEST_INIT_FULL
>
> so the first macro just sets up the convertor, and the second
> populates all the rest of the request state in the case that we will
> need it later because the fragment doesn't hit the wire.
> +++
> We all agreed.
> --------------------------------------------------------------------------
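
A minimal sketch of passing the post-match action into the generic
match routine, per item 4. The structures and helpers here are
hypothetical stand-ins, not the real ob1 code:

    #include <stddef.h>

    /* Hypothetical stand-ins for the real ob1 structures. */
    struct recv_frag    { int tag; };
    struct recv_request { int peer; };

    /* The action to run on a successful match: rndv, rget, ... */
    typedef void (*match_progress_fn_t)(struct recv_request *req,
                                        struct recv_frag *frag);

    /* Hypothetical lookup; the real code walks the posted-receive
     * queues. */
    static struct recv_request *lookup_match(struct recv_frag *frag)
    {
        (void)frag;
        return NULL;
    }

    static void append_to_unexpected(struct recv_frag *frag)
    {
        (void)frag;
    }

    /* One generic match routine: each receive callback passes in its
     * own progress function, so no switch on hdr_type is needed after
     * the match. */
    static void recv_frag_match(struct recv_frag *frag,
                                match_progress_fn_t progress)
    {
        struct recv_request *req = lookup_match(frag);
        if (NULL != req) {
            progress(req, frag);        /* specialized per protocol */
        } else {
            append_to_unexpected(frag); /* no match yet: queue it */
        }
    }
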
>
> On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
>
> > On Sun, 12 Aug 2007, Gleb Natapov wrote:
> >
> >>> Any objections? We can discuss what approaches we want to take
> >>> (there's going to be some complications because of the PML driver,
> >>> etc.); perhaps in the Tuesday Mellanox teleconf...?
> >>>
> >> My main objection is that the only reason you propose to do this is
> >> some bogus benchmark. Is there any other reason to implement header
> >> caching? I also hope you don't propose to break layering and
> >> somehow cache PML headers in the BTL.
> >
> > Gleb is hitting the main points I wanted to bring up. We had
> > examined this header caching in the context of PSM a little while
> > ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us
> > would be about 1500 cycles of code that has few branches. For us,
> > with a much bigger header and more fields to fetch from different
> > structures, it was more like 350 cycles, which is on the order of
> > 0.1us and not worth the effort (in code complexity, readability
> > and, frankly, motivation for performance). Maybe there's more to it
> > than just "code caching" -- like sending from pre-pinned headers or
> > using RDMA with immediate, etc. But I'd be surprised to find out
> > that the openib btl doesn't do the best thing here.
> >
> > I have pretty good evidence that for CM, the latency difference
> > comes from the receive side (in particular opal_progress). Doesn't
> > the openib btl receive side do something similar with
> > opal_progress, i.e. register a callback function? It probably does
> > something different, like checking a few RDMA mailboxes (or
> > per-peer landing pads), but anything that gets called before or
> > after it as part of opal_progress is cause for slowdown.
> >
> > . . christian
> >
> > --
> > christian.b...@qlogic.com
> > (QLogic Host Solutions Group, formerly Pathscale)

--
        Gleb.