MVAPICH is also doing pretty much the same thing.

Matt

On Thu, 7 May 2009, Tang, Changqing wrote:

> HP-MPI is pretty much doing a similar thing.
>
> --CQ
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Jeff Squyres
> > Sent: Thursday, May 07, 2009 8:54 AM
> > To: Roland Dreier
> > Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny Verkhovsky;
> > Håkon Bugge; Donald Kerr; OpenFabrics General; Alexander Supalov
> > Subject: Re: [ofa-general] Memory registration redux
> >
> > On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
> >
> > > By the way, what's the desired behavior of the cache if a process
> > > registers, say, address range 0x1000 ... 0x3fff, and then the same
> > > process registers address range 0x2000 ... 0x2fff (with all the
> > > same permissions, etc)?
> > >
> > > The initial registration creates an MR that is still valid for the
> > > smaller virtual address range, so the second registration is much
> > > cheaper if we used the cached registration; but if we use the cache
> > > for the second registration, and then deregister the first one,
> > > we're stuck with a too-big range pinned in the cache because of the
> > > second registration.
> >
> > I don't know what the other MPIs do in this scenario, but here's
> > what OMPI will do:
> >
> > 1. lookup 0x1000-0x3fff in the cache; not find any of it, and
> >    therefore register
> >    - add each page to our cache with a refcount of 1
> > 2. lookup 0x2000-0x2fff in the cache, find that all the pages are
> >    already registered
> >    - refcount++ on each page in the cache
> > 3. when we go to dereg 0x1000-0x3fff
> >    - refcount-- on each page in the cache
> >    - since some pages in the range still have refcount>0, don't do
> >      anything further
> >
> > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
> > releasing 0x2000-0x2fff.
> >
> > Note that OMPI will only register a max of X bytes at a time (where
> > X defaults to 2MB). So even if a user calls MPI_SEND(...) with an
> > enormous buffer, we'll register it X/page_size pages at a time, not
> > the entire buffer at once. Hence, the "buffer A is blocked from
> > dereg'ing by buffer B" scenario is *somewhat* mitigated -- it's less
> > wasteful than if we had registered/cached the entire huge buffer at
> > once.
> >
> > Finally, note that if 0x2000-0x2fff had not been registered, the
> > 0x1000-0x3fff pages are not actually deregistered when all the
> > pages' refcounts go to 0 -- they are just moved to the "able to be
> > dereg'ed list". We don't actually dereg it until we later try to
> > reg new memory and fail due to lack of resources. Then we take
> > entries off the "able to be dereg'ed list" and dereg them, then try
> > reg'ing the new memory again.
> >
> > MVAPICH: do you guys do similar things?
> >
> > (I don't know if HP/Scali/Intel will comment on their registration
> > cache schemes)
> >
> > --
> > Jeff Squyres
> > Cisco Systems

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
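
For anyone following the thread, the refcount scheme Jeff describes for OMPI can be sketched as a small simulation. This is only a toy model under stated assumptions, not code from Open MPI, MVAPICH, or HP-MPI: the names RegCache, Region, and max_pinned_pages are hypothetical, the page size is assumed to be 4 KiB, "registration failure" is modelled as exceeding a fixed page budget, and partially overlapping ranges are simply treated as misses. No verbs calls are made.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


class Region:
    """One simulated registration (MR) with a refcount per page."""

    def __init__(self, pages):
        self.refs = {page: 1 for page in pages}

    def covers(self, pages):
        return all(page in self.refs for page in pages)

    def idle(self):
        return all(count == 0 for count in self.refs.values())


class RegCache:
    """Toy registration cache (hypothetical; not taken from any MPI)."""

    def __init__(self, max_pinned_pages):
        self.max_pinned_pages = max_pinned_pages  # stand-in for real limits
        self.active = []       # regions with at least one page still in use
        self.reclaimable = []  # idle regions: the "able to be dereg'ed list"

    @staticmethod
    def _pages(start, end):
        # Page numbers covering the inclusive byte range [start, end].
        return range(start // PAGE_SIZE, end // PAGE_SIZE + 1)

    @staticmethod
    def _find(pages, regions):
        for region in regions:
            if region.covers(pages):
                return region
        return None

    def pinned_pages(self):
        return sum(len(r.refs) for r in self.active + self.reclaimable)

    def register(self, start, end):
        pages = self._pages(start, end)
        region = self._find(pages, self.active + self.reclaimable)
        if region is not None:
            # Cache hit: refcount++ on each page, reviving an idle region.
            if region in self.reclaimable:
                self.reclaimable.remove(region)
                self.active.append(region)
            for page in pages:
                region.refs[page] += 1
            return "hit"
        # Cache miss: if the simulated registration would fail, dereg idle
        # regions (oldest first) and retry.
        while self.pinned_pages() + len(pages) > self.max_pinned_pages:
            if not self.reclaimable:
                raise MemoryError("no registration resources left")
            self.reclaimable.pop(0)  # the real dereg would happen here
        self.active.append(Region(pages))
        return "miss"

    def deregister(self, start, end):
        pages = self._pages(start, end)
        region = self._find(pages, self.active)
        if region is None:
            raise KeyError("range is not registered")
        for page in pages:
            region.refs[page] -= 1
        # Only a fully idle region moves to the reclaimable list; it stays
        # pinned until a later registration needs the resources.
        if region.idle():
            self.active.remove(region)
            self.reclaimable.append(region)
```

With a budget of 4 pages, registering 0x1000-0x3fff and then 0x2000-0x2fff gives a miss followed by a hit. Deregistering the larger range first leaves all three pages pinned, because page 0x2000 still has a nonzero refcount. Only after the smaller range is released does the region become reclaimable, and it is actually dropped only when a later registration would otherwise exceed the budget.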
