On Tue, Aug 16, 2005 at 10:47:52AM -0600, Tim S. Woodall wrote:
> We're working on an approach to this now that I think will address
> what you are looking for. Will let you know when we have something a
> little more concrete.
>
I have the initial code that does what I've described. Do you want me
to send it to you?
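To give you an idea before I send the whole thing, the three cache
functions I proposed end up on mca_mpool_base_module_t roughly like
this (a sketch only; the exact pointer names and return types in my
tree may still change):

  struct mca_mpool_base_module_t {
      /* ... existing members (mpool_alloc, mpool_register, ...) ... */

      /* insert a registration into this mpool's cache */
      int (*mpool_insert)(struct mca_mpool_base_module_t *mpool,
                          mca_mpool_base_registration_t *reg);

      /* find a cached registration covering [addr, addr + size);
         returns NULL if nothing in the cache covers the range */
      mca_mpool_base_registration_t *
          (*mpool_find)(struct mca_mpool_base_module_t *mpool,
                        void *addr, size_t size);

      /* release a registration returned by mpool_find()
         (drops the reference taken by the lookup) */
      void (*mpool_put)(struct mca_mpool_base_module_t *mpool,
                        mca_mpool_base_registration_t *reg);
  };

Each mpool can override these to provide its own cache; the base
implementation supplies a default one with its own MRU list.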
> Tim
>
> Gleb Natapov wrote:
> >Hello Tim,
> >
> >On Thu, Aug 11, 2005 at 10:08:04AM -0600, Tim S. Woodall wrote:
> >
> >>Hello Gleb,
> >>
> >>A couple of general comments:
> >>
> >>We initially started by maintaining the cache at the btl/mpool level.
> >>However, we needed to expose the registrations to the upper level
> >>(pml), to allow the pml to make scheduling decisions (which
> >>btl/protocol to use), so we re-organized this to maintain a global
> >>cache/tree, where a given registration in the tree may reference
> >>multiple btls. This allows the pml to do a single lookup, and
> >>optionally schedule the message on the set of btls that have
> >>registered the memory.
> >>
> >
> >I understand the need to expose the registration cache to the pml,
> >and I think the optimisation you are using it for is a very neat
> >idea, but as far as I can see the current code returns multiple btls
> >for a registration only if the memory was registered using
> >mca_mpool_base_alloc() (the rare case, IMHO).
> >
> >>>The saddest thing is you can't override the interface in your
> >>>module. It is too tightly coupled with the pml (ob1) and the btls.
> >>>If you don't like the way the registration cache works, the only
> >>>way to fix it is to rewrite pml/btl/mpool.
> >>>
> >>
> >>True. We could implement a new framework for the cache, to allow
> >>this to be replaced. However, my preference is still to maintain a
> >>single cache/tree, to minimize latency/overhead in doing lookups.
> >
> >I think a new framework for the cache would be great and would fit
> >the openmpi philosophy nicely (if something can be done differently,
> >make it modular) :). Regarding lookup latency, I think it is not so
> >obvious that latency will suffer (your data structures will be more
> >complex, for instance), and think about the other optimisations you
> >can do if the cache is per btl, like searching the cache only for
> >the btls from endpoint->btl_rdma.
> >
> >>>I have some ideas about the interface that I want to see, but
> >>>perhaps it will not play nicely with the way ob1 works now. And
> >>>remember, my view is IB-centric and may be completely wrong for
> >>>other interconnects. I will be glad to hear your comments.
> >>>
> >>>I think the cache should be implemented for each mpool and not as
> >>>a single global one.
> >>>
> >>>Three functions will be added to mca_mpool_base_module_t:
> >>>
> >>>mpool_insert(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> >>>mca_mpool_base_registration_t mpool_find(mca_mpool_base_module_t,
> >>>                                         void *addr, size_t size)
> >>>mpool_put(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> >>>
> >>>Each mpool can override those functions and provide its own cache
> >>>implementation, but the base implementation will provide a default
> >>>one. The cache will maintain its own MRU list.
> >>>
> >>>mca_mpool_base_find(void *addr, size_t length) will iterate through
> >>>the mpool list, call mpool_find() for each of them, and return a
> >>>list of registrations to the pml. The pml should call mpool_put()
> >>>on a registration it no longer needs (this is needed for proper
> >>>reference counting).
> >>>
> >>>The btl will call mpool_insert() after mpool_register(); it is
> >>>possible to merge these two functions into one.
> >>>
> >>
> >>My only issue with this is the cost of iterating over each of the
> >>mpools and doing a lookup in each.
> >
> >What about the optimisation I mentioned (iterating only over
> >endpoint->btl_rdma)?
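> >In pseudo-code, the lookup path I have in mind is roughly this (for
> >illustration only; the bml array accessors and field names here are
> >my assumptions, not committed interfaces):
> >
> >  size_t i;
> >  mca_mpool_base_registration_t *reg;
> >
> >  /* search only the caches of the btls the pml would actually
> >     consider for rdma to this peer */
> >  for (i = 0; i < mca_bml_base_btl_array_get_size(&endpoint->btl_rdma);
> >       i++) {
> >      mca_bml_base_btl_t *bml_btl =
> >          mca_bml_base_btl_array_get_index(&endpoint->btl_rdma, i);
> >      mca_mpool_base_module_t *mpool = bml_btl->btl->btl_mpool;
> >
> >      reg = mpool->mpool_find(mpool, addr, size);
> >      if (NULL != reg) {
> >          /* schedule the rdma over this btl using reg */
> >      }
> >  }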
> >Also, I looked more closely at all the usages of mca_mpool_base_find()
> >(there are not many of those) and I have a question:
> >
> >Let's say an application wants to send buffer B to 100 other ranks.
> >It does 100 MPI_Isend(B). OpenMPI decides to use rdma to send B and
> >calls mca_pml_ob1_send_request_start_prepare() for each sendreq.
> >The following is called for each sendreq:
> >
> >...
> >sendreq->req_chunk = mca_mpool_base_find(sendreq->req_send.req_addr);
> >...
> >
> >Note that req_chunk is cached in the sendreq until the call to
> >mca_pml_ob1_send_request_put() (let's say B was not yet registered,
> >so sendreq->req_chunk is NULL for all 100 requests).
> >When mca_pml_ob1_send_request_put() is called on the first sendreq
> >some time later, it finds NULL in sendreq->req_chunk and the
> >subsequent call to mca_bml_base_prepare_src() registers B, but the
> >other 99 requests still have NULL in sendreq->req_chunk and they
> >will register the same buffer 99 more times!
> >
> >Is this scenario possible, or did I miss something?
> >
> >If it is possible, then it looks like the right thing to do is to
> >search the cache inside mca_pml_ob1_send_request_put(), and since
> >the btl is available as a parameter to this function we can search
> >only that btl's cache.
> >
> >>>I have code that manages overlapping registrations and I am
> >>>porting it to openmpi now, but without changing the way mpool
> >>>works it will not be very useful.
> >>>
> >>
> >>Could we implement this as a single cache where each entry could
> >>reference multiple mpools/btls?
> >>
> >
> >Yes, this is possible, but with a performance hit.
> >
> >My cache slightly resembles the linux virtual memory area (VMA)
> >list.
> >
> >The registration structure looks like this:
> >
> >struct reg
> >{
> >    int begin; /* start of registration */
> >    int end;   /* end of registration   */
> >};
> >
> >The VMA structure looks like this:
> >
> >struct vma
> >{
> >    int begin;     /* first page in the VMA */
> >    int end;       /* last page in the VMA  */
> >    list reg_list; /* registrations covering this VMA,
> >                      sorted by reg->end */
> >};
> >
> >VMAs never overlap. The reg_list of each VMA contains a sorted list
> >of all the registrations that cover the entire VMA. For instance, if
> >we have two registrations, the first (R1) from 50 to 150 and the
> >second (R2) from 100 to 200, the cache will have three VMAs. The
> >first will cover the area from 50 to 99 and will have R1 in its
> >reg_list, the second will cover the area from 100 to 150 and will
> >have R2 and R1 in its reg_list, and the third will cover the area
> >from 151 to 200 and will have R2 in its reg_list.
> >
> >All VMAs are stored in an R/B tree and in a sorted list. Insertion
> >time is linear, search is logarithmic. The search function gets an
> >address and a size as parameters; it looks up the VMA holding the
> >address and checks whether the first registration on its list is big
> >enough to hold size bytes (note that the reg_list is sorted, so it
> >is enough to check only the first element of the list).
> >
> >I can hold registrations from different mpools in the same database,
> >but in this case the search function will be less efficient, since
> >it will need to scan the whole reg_list to find a registration for
> >each mpool, and in most cases the caller doesn't even need the whole
> >list, only the registration for a specific btl!
> >
> >--
> >        Gleb.

--
        Gleb.
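P.S. To make the search description above concrete, the lookup in my
code is essentially the following (simplified pseudo-C; rb_tree_find()
and list_first() stand in for my actual tree/list helpers):

  /* find a registration covering [addr, addr + size - 1], or NULL */
  struct reg *cache_find(int addr, int size)
  {
      /* O(log n): find the VMA whose [begin, end] contains addr */
      struct vma *vma = rb_tree_find(addr);
      struct reg *reg;

      if (NULL == vma)
          return NULL;

      /* reg_list is sorted so that the registration reaching furthest
         comes first; every registration on the list starts at or
         before vma->begin, so if the first one is not big enough,
         nothing else on the list can be */
      reg = list_first(&vma->reg_list);
      if (reg->end >= addr + size - 1)
          return reg;

      return NULL;
  }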