Hello Tim,

On Thu, Aug 11, 2005 at 10:08:04AM -0600, Tim S. Woodall wrote:
> Hello Gleb,
>
> A couple of general comments:
>
> We initially started by maintaining the cache at the btl/mpool level. However,
> we needed to expose the registrations to the upper level (pml), to
> allow the pml to make scheduling decisions (which btl/protocol to use),
> so we re-organized this to maintain a global cache/tree, where a given
> registration in the tree may reference multiple btls. This allows the pml
> to do a single lookup, and optionally schedule the message on the set of
> btls that have registered the memory.
>
I understand the need to expose the registration cache to the pml, and I think
the optimisation you are using it for is a very neat idea. But as far as I can
see, the current code returns multiple btls for a registration only if the
memory was registered using mca_mpool_base_alloc() (the rare case, IMHO).
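Just so we are sure we are talking about the same thing, this is roughly how I
picture the difference between the two schemes. The structs below are only an
illustration (the names and layout are made up, they are not the real code):

    #include <stddef.h>

    /* today: one global cache/tree; a single lookup may return an entry
       that references registrations from several btls/mpools             */
    struct global_entry {
        void  *addr;
        size_t len;
        int    nregs;
        struct { struct mpool *mpool; void *handle; } regs[4]; /* per btl */
    };

    /* what I am after: each mpool caches only its own registrations, and
       the caller asks only the mpools/btls it actually cares about       */
    struct per_mpool_entry {
        void  *addr;
        size_t len;
        void  *handle;  /* the single registration owned by this mpool    */
    };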
> > The saddest thing is you can't override the interface in your module. It is
> > too coupled with pml (ob1) and btls. If you don't like the way the
> > registration cache works, the only way to fix it is to rewrite
> > pml/btl/mpool.
> >
> True. We could implement a new framework for the cache, to allow this to be
> replaced. However, my preference is still to maintain a single cache/tree,
> to minimize latency/overhead in doing lookups.
>
I think a new framework for the cache would be great and would fit the Open MPI
philosophy nicely (if something can be done differently, make it modular) :).
Regarding lookup latency, I think it is not so obvious that latency will suffer
(your data structures will be more complex, for instance), and think about the
other optimisations you can do if the cache is per btl, like searching the
cache only for the btls in endpoint->btl_rdma.

> > I have some ideas about the interface that I want to see, but perhaps it
> > will not play nice with the way ob1 works now. And remember, my view is IB
> > centric and may be completely wrong for other interconnects. I will be glad
> > to hear your comments.
> >
> > I think the cache should be implemented for each mpool and not as a single
> > global one.
> >
> > Three functions will be added to mca_mpool_base_module_t:
> >   mpool_insert(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> >   mca_mpool_base_registration_t mpool_find(mca_mpool_base_module_t,
> >                                            void *addr, size_t size)
> >   mpool_put(mca_mpool_base_module_t, mca_mpool_base_registration_t)
> >
> > Each mpool can override those functions and provide its own cache
> > implementation, but the base implementation will provide a default one.
> > The cache will maintain its own MRU list.
> >
> > mca_mpool_base_find(void *addr, size_t length) will iterate through the
> > mpool list, call mpool_find() for each of them, and return a list of
> > registrations to the pml. The pml should call mpool_put() on registrations
> > it no longer needs (this is needed for proper reference counting).
> >
> > The btl will call mpool_insert() after mpool_register(); it is possible to
> > merge these two functions into one.
> >
> My only issue with this is the cost of iterating over each of the mpools and
> doing a lookup in each.
>
What about the optimisation I mentioned (iterating only over
endpoint->btl_rdma)?

Also, I looked more closely at all the usages of mca_mpool_base_find() (there
are not many of them) and I have a question. Let's say an application wants to
send buffer B to 100 other ranks, so it does 100 MPI_Isend(B) calls. Open MPI
decides to use RDMA to send B and calls
mca_pml_ob1_send_request_start_prepare() for each sendreq. The following is
done for each sendreq:

    ...
    sendreq->req_chunk = mca_mpool_base_find(sendreq->req_send.req_addr);
    ...

Note that req_chunk is cached in the sendreq until the call to
mca_pml_ob1_send_request_put() (let's say B was not yet registered, so
sendreq->req_chunk is NULL for all 100 requests). When
mca_pml_ob1_send_request_put() is called on the first sendreq some time later,
it finds NULL in sendreq->req_chunk, and the subsequent call to
mca_bml_base_prepare_src() registers B; but the other 99 requests still have
NULL in sendreq->req_chunk, so they will register the same buffer 99 more
times! Is this scenario possible, or did I miss something? If it is possible,
then it looks like the right thing to do is to search the cache inside
mca_pml_ob1_send_request_put(), and since the btl is available as a parameter
to this function we can search only that btl's cache.
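To make the interface I have in mind a bit more concrete, here is roughly what
the additions would look like in the header. This is only a sketch: the typedef
names are invented for illustration, it is not a patch against the real
mca_mpool_base_module_t:

    #include <stddef.h>

    /* proposed cache hooks, one set per mpool module */
    typedef int (*mca_mpool_base_module_insert_fn_t)(
        struct mca_mpool_base_module_t *mpool,
        struct mca_mpool_base_registration_t *reg);

    typedef struct mca_mpool_base_registration_t *
        (*mca_mpool_base_module_find_fn_t)(
        struct mca_mpool_base_module_t *mpool,
        void *addr, size_t size);

    typedef void (*mca_mpool_base_module_put_fn_t)(
        struct mca_mpool_base_module_t *mpool,
        struct mca_mpool_base_registration_t *reg);

    struct mca_mpool_base_module_t {
        /* ... existing members (mpool_alloc, mpool_register, ...) ... */
        mca_mpool_base_module_insert_fn_t mpool_insert; /* add to the cache   */
        mca_mpool_base_module_find_fn_t   mpool_find;   /* lookup, takes ref  */
        mca_mpool_base_module_put_fn_t    mpool_put;    /* drop the reference */
    };

mca_mpool_base_find() would then just walk the mpool list (or, with the
optimisation above, only the btls in endpoint->btl_rdma), call mpool_find() on
each module and collect the non-NULL registrations; the pml releases each of
them with mpool_put() once it no longer needs them.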
> > I have code that manages overlapping registrations and I am porting it to
> > Open MPI now, but without changing the way mpool works it will not be very
> > useful.
> >
> Could we implement this as a single cache where each entry could reference
> multiple mpools/btls?
>
Yes, this is possible, but with a performance hit. My cache slightly resembles
the Linux virtual memory area (VMA) list. The registration structure looks like
this:

    struct reg {
        int begin; /* start of registration */
        int end;   /* end of registration   */
    };

The VMA structure looks like this:

    struct vma {
        int begin;      /* first page in the VMA                           */
        int end;        /* last page in the VMA                            */
        list reg_list;  /* list of registrations in this VMA, sorted by
                           reg->end                                        */
    };

VMAs never overlap. The reg_list of each VMA contains a sorted list of all the
registrations that cover the entire VMA. For instance, if we have two
registrations, the first (R1) from 50 to 150 and the second (R2) from 100 to
200, the cache will have three VMAs: the first covers the area from 50 to 99
and has R1 in its reg_list, the second covers the area from 100 to 150 and has
R2 and R1 in its reg_list, and the third covers the area from 151 to 200 and
has R2 in its reg_list.

All VMAs are stored in an R/B tree and in a sorted list. Insertion time is
linear, search is logarithmic. The search function gets an address and a size
as parameters; it looks for the VMA holding the address and checks whether the
first registration on the list is big enough to hold size bytes (note that the
reg_list is sorted, so it is enough to check only the first element of the
list).

I can hold registrations from different mpools in the same database, but in
that case the search function will be less efficient, since it will need to
scan the whole reg_list to find a registration for each mpool, and in most
cases the caller doesn't even need the whole list, but only the registration
for a specific btl!

--
			Gleb.
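P.S. To make the lookup above more concrete, here is a stripped-down sketch of
the idea (this is not the code I am porting, just an illustration; the real
thing uses an R/B tree plus a sorted list instead of the plain array used
here):

    struct reg { int begin, end; };    /* one registration                  */

    struct vma {                       /* non-overlapping area              */
        int begin, end;
        struct reg **regs;             /* regs covering the whole VMA,      */
        int nregs;                     /* sorted by end, largest end first  */
    };

    /* vmas[] is sorted by begin; the binary search stands in for the
       logarithmic R/B tree lookup                                          */
    static struct reg *cache_find(struct vma *vmas, int nvmas,
                                  int addr, int size)
    {
        int lo = 0, hi = nvmas - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (addr < vmas[mid].begin)
                hi = mid - 1;
            else if (addr > vmas[mid].end)
                lo = mid + 1;
            else {
                /* found the VMA holding addr; since regs[] is sorted by
                   end (descending), only the first entry can possibly
                   cover [addr, addr + size)                                */
                struct vma *v = &vmas[mid];
                if (v->nregs > 0 && v->regs[0]->end >= addr + size - 1)
                    return v->regs[0];
                return NULL;
            }
        }
        return NULL;   /* no VMA covers addr at all */
    }

With R1 = [50,150] and R2 = [100,200] from the example above, vmas[] would be
{[50,99] -> {R1}, [100,150] -> {R2,R1}, [151,200] -> {R2}}, and
cache_find(vmas, 3, 120, 50) returns R2, because R2 is the first (longest)
registration in the [100,150] VMA and reaches up to 200 >= 169.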