Here are a few more clarifications:

1) ODP MRs can cover address ranges that do not have a mapping at registration time.
This means that MPI can register in advance, say, the lower GBs of the address space, covering malloc's primary arena. Thus, there is no need to adjust to each increase in sbrk(). Similarly, you can register the stack region up to the maximum size of the stack. The stack can grow and shrink, and ODP will always use the current mapping.

2) Virtual addresses covered by an ODP MR must have a valid mapping when they are accessed (during send/receive WQE processing or as a target of an RDMA/atomic operation).

So, Jeff, the only thing you need to make sure of is that you don't free() a buffer that you have posted but haven't yet got a completion for - but I guess that this is something that you already do... :)

For example, in the following scenario:
a. reg_mr(first GB of the address space)
b. p = malloc()
c. post_send(p)
d. poll for completion
e. free(p)
f. p = malloc()
g. post_send(p)
h. poll for completion
i. free(p)

(c) may incur a page fault (if not pre-fetched or faulted-in by another thread).
(e) happens after the completion, so it is guaranteed that (c), when processed by HW, uses the correct application buffer with the current virt-to-phys mapping (at HW access time).
The reallocation in (f) may or may not change the virtual-to-physical mappings. The message may or may not be paged out (ODP does not hold a reference on the page). In any case, when (g) is processed, it always uses the current mapping.

--Liran

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Saturday, June 08, 2013 2:58 AM
To: Jeff Squyres (jsquyres)
Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
Subject: Re: Status of "ummunot" branch?

On Fri, Jun 07, 2013 at 10:59:43PM +0000, Jeff Squyres (jsquyres) wrote:

> > I don't think this covers other memory regions, like those added via mmap,
> > right?
>
> We talked about this at the MPI Forum this week; it doesn't seem like
> ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the problem, and fixes only one problem.

You do need to cache registrations, and all the tuning parameters (how much do I cache, how long do I hold it for, etc, etc) all still apply.

What goes away (is fixed) is the need for intercepts and the need to purge address space from the cache because the backing registration has become non-coherent/invalid. Registrations are always coherent/valid with ODP.

This cache, and the associated optimization problem, can never go away. With a 'register all of memory' semantic the cache can move into the kernel, but the performance implication and overheads are all still present, just migrated.

> 2. MPI still has to intercept (at least) munmap().

Curious to know what for? If you want to prune registrations (ie to reduce memory footprint), this can be done lazily at any time (eg in a background thread or something). Read /proc/self/maps and purge all the registrations pointing to unmapped memory. Similar to garbage collection.

There is no harm in keeping a registration for a long period, except for the memory footprint in the kernel.

> 3. Having mmap/malloc/etc. return "new" memory that may already be
> registered because of a prior memory registration and subsequent
> munmap/free/etc. is just plain weird. Worse, if we re-register it,
> ref counts could go such that the actual registration will never
> actually expire until the process dies (which could lead to processes
> with abnormally large memory footprints, because they never actually
> let go of memory because it's still registered).

This is entirely on the registration cache implementation to sort out, there are lots of performance/memory trade offs.

It is only weird when you think about it in terms of buffers. Memory registration has to do with address space, not buffers.

> What MPI wants is:
>
> 1.
>    verbs for ummunotify-like functionality
> 2. non-blocking memory registration verbs; poll the cq to know
>    when it has completed

To me, ODP with an additional 'register all address space' semantic, plus an asynchronous prefetch does both of these for you.

1. ummunotify functionality and caching is now in the kernel, under ODP. RDMA access to an 'all of memory' registration always does the right thing.
2. asynchronous prefetch (eg as a work request) triggers ODP and kernel actions to ready a subset of memory for RDMA, including all the work that memory registration does today (get_user_pages, COW break, etc)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
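
For illustration only, the lazy pruning Jason describes (read /proc/self/maps and purge the registrations pointing to unmapped memory) could look roughly like the sketch below. Nothing here comes from the thread itself: the function names parse_maps, overlaps and stale_registrations are hypothetical, the cache is assumed to be a plain list of (start, end) address pairs, and the actual deregistration call is left out. A registration counts as stale here only if none of its range is currently mapped, which is one possible policy among several.

```python
def parse_maps(text):
    """Return sorted (start, end) address ranges from /proc/<pid>/maps text."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # First field is "start-end" in hex, e.g. "00400000-00452000".
        start_s, _, end_s = fields[0].partition("-")
        ranges.append((int(start_s, 16), int(end_s, 16)))
    ranges.sort()
    return ranges


def overlaps(ranges, start, end):
    """True if any byte of [start, end) lies inside a mapped range."""
    return any(m_start < end and m_end > start for m_start, m_end in ranges)


def stale_registrations(registrations, ranges):
    """Return the cached registrations that cover no mapped memory at all."""
    return [(s, e) for (s, e) in registrations if not overlaps(ranges, s, e)]


def read_self_maps():
    """Snapshot the current process's mappings (Linux only)."""
    with open("/proc/self/maps") as f:
        return parse_maps(f.read())
```

A background thread could call stale_registrations(cache, read_self_maps()) from time to time and deregister whatever comes back. The snapshot can be out of date by the time it is used, which fits the garbage-collection framing: a registration that is kept too long only costs kernel memory.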