On Tue, Aug 16, 2005 at 10:47:52AM -0600, Tim S. Woodall wrote:
> We're working on an approach to this now that I think will address
> what you are looking for. Will let you know when we have something a
> little more concrete.
>
I have the initial code that does what I've described. Do you want me
to send it to you?
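To give you an idea before I send the whole thing, the three cache
functions I proposed end up on mca_mpool_base_module_t roughly like
this (a sketch only; the exact pointer names and return types in my
tree may still change):

  struct mca_mpool_base_module_t {
      /* ... existing members (mpool_alloc, mpool_register, ...) ... */

      /* insert a registration into this mpool's cache */
      int (*mpool_insert)(struct mca_mpool_base_module_t *mpool,
                          mca_mpool_base_registration_t *reg);

      /* find a cached registration covering [addr, addr + size);
         returns NULL if nothing in the cache covers the range */
      mca_mpool_base_registration_t *
          (*mpool_find)(struct mca_mpool_base_module_t *mpool,
                        void *addr, size_t size);

      /* release a registration returned by mpool_find()
         (drops the reference taken by the lookup) */
      void (*mpool_put)(struct mca_mpool_base_module_t *mpool,
                        mca_mpool_base_registration_t *reg);
  };

Each mpool can override these to provide its own cache; the base
implementation supplies a default one with its own MRU list.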
> Tim
>
> Gleb Natapov wrote:
> >Hello Tim,
> >
> >On Thu, Aug 11, 2005 at 10:08:04AM -0600, Tim S. Woodall wrote:
> >
> >>Hello Gleb,
> >>
> >>A couple of general comments:
> >>
> >>We initially started by maintaining the cache at the btl/mpool level.
> >>However, we needed to expose the registrations to the upper level
> >>(pml), to allow the pml to make scheduling decisions (which
> >>btl/protocol to use), so we re-organized this to maintain a global
> >>cache/tree, where a given registration in the tree may reference
> >>multiple btls. This allows the pml to do a single lookup, and
> >>optionally schedule the message on the set of btls that have
> >>registered the memory.
> >>
> >
> >I understand the need to expose the registration cache to the pml,
> >and I think the optimisation you are using it for is a very neat
> >idea, but as far as I can see the current code returns multiple btls
> >for a registration only if the memory was registered using
> >mca_mpool_base_alloc() (the rare case, IMHO).
> >
> >>>The saddest thing is you can't override the interface in your
> >>>module. It is too tightly coupled with the pml (ob1) and the btls.
> >>>If you don't like the way the registration cache works, the only
> >>>way to fix it is to rewrite pml/btl/mpool.
> >>>
> >>
> >>True. We could implement a new framework for the cache, to allow
> >>this to be replaced. However, my preference is still to maintain a
> >>single cache/tree, to minimize latency/overhead in doing lookups.
> >
> >I think a new framework for the cache would be great and would fit
> >the openmpi philosophy nicely (if something can be done differently,
> >make it modular) :). Regarding lookup latency, I think it is not so
> >obvious that latency will suffer (your data structures will be more
> >complex, for instance), and think about the other optimisations you
> >can do if the cache is per btl, like searching the cache only for
> >the btls from endpoint->btl_rdma.
> >
> >>>I have some ideas about the interface that I want to see, but
> >>>perhaps it will not play nicely with the way ob1 works now. And
> >>>remember, my view is IB-centric and may be completely wrong for
> >>>other interconnects. I will be glad to hear your comments.
> >>>
> >>>I think the cache should be implemented for each mpool and not as
> >>>a single global one.
> >>>
> >>>Three functions will be added to mca_mpool_base_module_t:
> >>>
> >>>mpool_insert(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> >>>mca_mpool_base_registration_t mpool_find(mca_mpool_base_module_t,
> >>>                                         void *addr, size_t size)
> >>>mpool_put(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> >>>
> >>>Each mpool can override those functions and provide its own cache
> >>>implementation, but the base implementation will provide a default
> >>>one. The cache will maintain its own MRU list.
> >>>
> >>>mca_mpool_base_find(void *addr, size_t length) will iterate through
> >>>the mpool list, call mpool_find() for each of them, and return a
> >>>list of registrations to the pml. The pml should call mpool_put()
> >>>on a registration it no longer needs (this is needed for proper
> >>>reference counting).
> >>>
> >>>The btl will call mpool_insert() after mpool_register(); it is
> >>>possible to merge these two functions into one.
> >>>
> >>
> >>My only issue with this is the cost of iterating over each of the
> >>mpools and doing a lookup in each.
> >
> >What about the optimisation I mentioned (iterating only over
> >endpoint->btl_rdma)?
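> >In pseudo-code, the lookup path I have in mind is roughly this (for
> >illustration only; the bml array accessors and field names here are
> >my assumptions, not committed interfaces):
> >
> >  size_t i;
> >  mca_mpool_base_registration_t *reg;
> >
> >  /* search only the caches of the btls the pml would actually
> >     consider for rdma to this peer */
> >  for (i = 0; i < mca_bml_base_btl_array_get_size(&endpoint->btl_rdma);
> >       i++) {
> >      mca_bml_base_btl_t *bml_btl =
> >          mca_bml_base_btl_array_get_index(&endpoint->btl_rdma, i);
> >      mca_mpool_base_module_t *mpool = bml_btl->btl->btl_mpool;
> >
> >      reg = mpool->mpool_find(mpool, addr, size);
> >      if (NULL != reg) {
> >          /* schedule the rdma over this btl using reg */
> >      }
> >  }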
> >Also, I looked more closely at all the usages of mca_mpool_base_find()
> >(there are not many of those) and I have a question:
> >
> >Let's say an application wants to send buffer B to 100 other ranks.
> >It does 100 MPI_Isend(B). OpenMPI decides to use rdma to send B and
> >calls mca_pml_ob1_send_request_start_prepare() for each sendreq.
> >The following is called for each sendreq:
> >
> >...
> >sendreq->req_chunk = mca_mpool_base_find(sendreq->req_send.req_addr);
> >...
> >
> >Note that req_chunk is cached in the sendreq until the call to
> >mca_pml_ob1_send_request_put() (let's say B was not yet registered,
> >so sendreq->req_chunk is NULL for all 100 requests).
> >When mca_pml_ob1_send_request_put() is called on the first sendreq
> >some time later, it finds NULL in sendreq->req_chunk and the
> >subsequent call to mca_bml_base_prepare_src() registers B, but the
> >other 99 requests still have NULL in sendreq->req_chunk and they
> >will register the same buffer 99 more times!
> >
> >Is this scenario possible, or did I miss something?
> >
> >If it is possible, then it looks like the right thing to do is to
> >search the cache inside mca_pml_ob1_send_request_put(), and since
> >the btl is available as a parameter to this function we can search
> >only that btl's cache.
> >
> >>>I have code that manages overlapping registrations and I am
> >>>porting it to openmpi now, but without changing the way mpool
> >>>works it will not be very useful.
> >>>
> >>
> >>Could we implement this as a single cache where each entry could
> >>reference multiple mpools/btls?
> >>
> >
> >Yes, this is possible, but with a performance hit.
> >
> >My cache slightly resembles the linux virtual memory area (VMA)
> >list.
> >
> >The registration structure looks like this:
> >
> >struct reg
> >{
> >    int begin; /* start of registration */
> >    int end;   /* end of registration   */
> >};
> >
> >The VMA structure looks like this:
> >
> >struct vma
> >{
> >    int begin;     /* first page in the VMA */
> >    int end;       /* last page in the VMA  */
> >    list reg_list; /* registrations covering this VMA,
> >                      sorted by reg->end */
> >};
> >
> >VMAs never overlap. The reg_list of each VMA contains a sorted list
> >of all the registrations that cover the entire VMA. For instance, if
> >we have two registrations, the first (R1) from 50 to 150 and the
> >second (R2) from 100 to 200, the cache will have three VMAs. The
> >first will cover the area from 50 to 99 and will have R1 in its
> >reg_list, the second will cover the area from 100 to 150 and will
> >have R2 and R1 in its reg_list, and the third will cover the area
> >from 151 to 200 and will have R2 in its reg_list.
> >
> >All VMAs are stored in an R/B tree and in a sorted list. Insertion
> >time is linear, search is logarithmic. The search function gets an
> >address and a size as parameters; it looks up the VMA holding the
> >address and checks whether the first registration on its list is big
> >enough to hold size bytes (note that the reg_list is sorted, so it
> >is enough to check only the first element of the list).
> >
> >I can hold registrations from different mpools in the same database,
> >but in this case the search function will be less efficient, since
> >it will need to scan the whole reg_list to find a registration for
> >each mpool, and in most cases the caller doesn't even need the whole
> >list, only the registration for a specific btl!
> >
> >--
> >        Gleb.

--
        Gleb.
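P.S. To make the search description above concrete, the lookup in my
code is essentially the following (simplified pseudo-C; rb_tree_find()
and list_first() stand in for my actual tree/list helpers):

  /* find a registration covering [addr, addr + size - 1], or NULL */
  struct reg *cache_find(int addr, int size)
  {
      /* O(log n): find the VMA whose [begin, end] contains addr */
      struct vma *vma = rb_tree_find(addr);
      struct reg *reg;

      if (NULL == vma)
          return NULL;

      /* reg_list is sorted so that the registration reaching furthest
         comes first; every registration on the list starts at or
         before vma->begin, so if the first one is not big enough,
         nothing else on the list can be */
      reg = list_first(&vma->reg_list);
      if (reg->end >= addr + size - 1)
          return reg;

      return NULL;
  }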