Hello Gleb,

A couple of general comments:

We initially started by maintaining the cache at the btl/mpool level. However,
we needed to expose the registrations to the upper level (pml), to
allow the pml to make scheduling decisions (which btl/protocol to use),
so we re-organized this to maintain a global cache/tree, where a given
registration in the tree may reference multiple btls. This allows the pml
to do a single lookup, and optionally schedule the message on the set of
btls that have registered the memory.

That said, there are problems with the current approach as you've indicated.
MRU lists are maintained on a per-btl module basis (for fairness), which results
in a good bit of duplicated code across btls. Also, as you've indicated, the
current API/global cache (R/B tree) doesn't support overlapping registrations.

Additional comments inline:

Gleb Natapov wrote:
Hello Tim,

On Tue, Aug 09, 2005 at 10:22:34AM -0600, Timothy B. Prins wrote:

If you have any other ideas of how to do it, please let us know.




I have to confess I don't like the current pindown cache implementation much, or perhaps I don't understand it well enough.

What I managed to understand from the code is this:

There are three functions:
int mca_mpool_base_insert(void * addr, size_t size, mca_mpool_base_module_t* mpool, void* user_data,
                          mca_mpool_base_registration_t* registration);
int mca_mpool_base_remove(void * base);
mca_mpool_base_chunk_t* mca_mpool_base_find(void* base);

When a btl registers memory, it inserts the registration into the global cache by
calling mca_mpool_base_insert(). This insertion may shadow a registration of the same memory from another module, or even from the same module.

mca_mpool_base_remove() removes an address from the cache, but there is no way a module can guarantee that the deleted registration belongs to the module calling remove.

mca_mpool_base_find() returns the first registration it encounters in the cache. That
registration may not be the best (biggest) one, or it may belong to the wrong module (one through which the endpoint is not accessible).


This is true. We have discussed changing the API to accept the base address
and range - and return the entire set of overlapping registrations.
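
As a rough illustration of a lookup that takes a base address and range and
returns every overlapping registration, something along these lines could
work. This is only a sketch: reg_t, ranges_overlap() and find_overlapping()
are hypothetical names, not part of the existing mpool API, and the linear
scan stands in for whatever tree structure the cache would actually use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified registration record; this is NOT the real
 * mca_mpool_base_registration_t. */
typedef struct {
    uintptr_t base;   /* first byte of the registered region */
    size_t    len;    /* length of the region in bytes */
    int       owner;  /* made-up id of the mpool/btl that registered it */
} reg_t;

/* Half-open ranges [a, a+alen) and [b, b+blen) overlap iff each one
 * starts before the other one ends. */
static int ranges_overlap(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Collect every registration that overlaps [base, base+len).
 * Linear scan for clarity only. Returns the number of matches
 * written to out (at most max_out). */
static size_t find_overlapping(const reg_t *regs, size_t nregs,
                               uintptr_t base, size_t len,
                               const reg_t **out, size_t max_out)
{
    size_t n = 0;
    assert(out != NULL || max_out == 0);
    for (size_t i = 0; i < nregs && n < max_out; i++) {
        if (ranges_overlap(regs[i].base, regs[i].len, base, len)) {
            out[n++] = &regs[i];
        }
    }
    return n;
}
```

The caller (the pml, in this discussion) would then pick among the returned
registrations, for example by size or by which btl can reach the endpoint.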

Each btl should maintain its own mru list, but the code is pretty much the
same.


Agreed - this is ugly...

The saddest thing is that you can't override the interface in your module. It is too
tightly coupled with the pml (ob1) and the btls. If you don't like the way the registration
cache works, the only way to fix it is to rewrite pml/btl/mpool.


True. We could implement a new framework for the cache, to allow this to be
replaced. However, my preference is still to maintain a single cache/tree, to
minimize latency/overhead in doing lookups.

I have some ideas about the interface that I want to see, but perhaps it will not
play nicely with the way ob1 works now. And remember, my view is IB centric and may
be completely wrong for other interconnects. I will be glad to hear your
comments.

I think the cache should be implemented per mpool rather than as a single global one.

Three functions will be added to mca_mpool_base_module_t:
mpool_insert(mca_mpool_base_module_t, mca_mpool_base_registration_t)
mca_mpool_base_registration_t mpool_find(mca_mpool_base_module_t, void *addr, size_t size)
mpool_put(mca_mpool_base_module_t, mca_mpool_base_registration_t)

Each mpool can override those functions and provide its own cache
implementation, but the base implementation will provide a default one. The
cache will maintain its own mru list.

mca_mpool_base_find(void *addr, size_t length) will iterate through the mpool list,
call mpool_find() for each of them, and return a list of registrations to the
pml. The pml should call mpool_put() on each registration it no longer needs (this is
needed for proper reference counting).

The btl will call mpool_insert() after mpool_register(). It is possible to merge
these two functions into one.


My only issue with this is the cost of iterating over each of the mpools and
doing a lookup in each.
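
To make that cost concrete, here is a rough sketch of the proposed lookup
path. The types pm_reg_t and pm_mpool_t and the function base_find() are
hypothetical simplifications, and mpool_find()/mpool_put() here take fewer
arguments than the signatures proposed above. The point is only that the
base-level find performs one lookup per mpool and that every hit takes a
reference the pml must later release.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PM_MAX_REGS 8

/* Hypothetical stand-ins for the real registration and mpool types. */
typedef struct {
    uintptr_t base;
    size_t    len;
    int       refcount;  /* references held by callers; 0 = evictable */
} pm_reg_t;

typedef struct {
    pm_reg_t regs[PM_MAX_REGS];
    size_t   nregs;
} pm_mpool_t;

/* Per-mpool find: return a registration that fully covers
 * [addr, addr+size) and take a reference on it, or NULL if none. */
static pm_reg_t *mpool_find(pm_mpool_t *mp, uintptr_t addr, size_t size)
{
    for (size_t i = 0; i < mp->nregs; i++) {
        pm_reg_t *r = &mp->regs[i];
        if (r->base <= addr && addr + size <= r->base + r->len) {
            r->refcount++;
            return r;
        }
    }
    return NULL;
}

/* Drop a reference previously taken by mpool_find(). */
static void mpool_put(pm_reg_t *r)
{
    assert(r->refcount > 0);
    r->refcount--;
}

/* Base-level find: one lookup per mpool, so the cost grows with the
 * number of mpools. out must have room for nmpools pointers.
 * Returns the number of registrations found. */
static size_t base_find(pm_mpool_t **mpools, size_t nmpools,
                        uintptr_t addr, size_t size, pm_reg_t **out)
{
    size_t n = 0;
    for (size_t i = 0; i < nmpools; i++) {
        pm_reg_t *r = mpool_find(mpools[i], addr, size);
        if (r != NULL) {
            out[n++] = r;
        }
    }
    return n;
}
```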


I have code that manages overlapping registrations and I am porting it to
openmpi now, but without changing the way mpool works it will not be very
useful.


Could we implement this as a single cache where each entry could reference multiple
mpools/btls?
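
One way such an entry could look, purely as a sketch: each cache entry
carries a bitmask of the btls that have registered the region, so a single
lookup tells the pml which btls it can schedule on, and a remove only clears
the caller's own bit. All names here (cache_entry_t, reg_cache_t,
cache_insert, cache_lookup, cache_remove) are hypothetical, the flat array
stands in for the R/B tree, and the 32-btl limit is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

typedef struct {
    uintptr_t base;
    size_t    len;
    uint32_t  btl_mask;  /* bit i set => btl i has registered this region */
} cache_entry_t;

typedef struct {
    cache_entry_t entries[MAX_ENTRIES];
    size_t        count;
} reg_cache_t;

/* Record that btl_id registered [base, base+len). An existing entry for
 * the same region just gains a bit. Returns 0 on success, -1 on error. */
static int cache_insert(reg_cache_t *c, uintptr_t base, size_t len,
                        unsigned btl_id)
{
    assert(c != NULL);
    if (btl_id >= 32) return -1;
    for (size_t i = 0; i < c->count; i++) {
        cache_entry_t *e = &c->entries[i];
        if (e->base == base && e->len == len) {
            e->btl_mask |= UINT32_C(1) << btl_id;
            return 0;
        }
    }
    if (c->count == MAX_ENTRIES) return -1;
    c->entries[c->count++] =
        (cache_entry_t){ base, len, UINT32_C(1) << btl_id };
    return 0;
}

/* Single lookup: mask of all btls holding a registration that fully
 * covers [addr, addr+size). */
static uint32_t cache_lookup(const reg_cache_t *c, uintptr_t addr, size_t size)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < c->count; i++) {
        const cache_entry_t *e = &c->entries[i];
        if (e->base <= addr && addr + size <= e->base + e->len) {
            mask |= e->btl_mask;
        }
    }
    return mask;
}

/* Remove only btl_id's claim on the region; the entry itself is dropped
 * once no btl references it any more. */
static void cache_remove(reg_cache_t *c, uintptr_t base, size_t len,
                         unsigned btl_id)
{
    if (btl_id >= 32) return;
    for (size_t i = 0; i < c->count; i++) {
        cache_entry_t *e = &c->entries[i];
        if (e->base == base && e->len == len) {
            e->btl_mask &= ~(UINT32_C(1) << btl_id);
            if (e->btl_mask == 0) {
                *e = c->entries[--c->count];  /* fill hole with last entry */
            }
            return;
        }
    }
}
```

With this shape a btl cannot accidentally delete another module's
registration, since removal is keyed on the caller's own id.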

Any thoughts/opinions regarding a framework for a single cache?


Thanks,
Tim


