On Wed, Jun 05, 2013 at 06:45:13PM +0000, Jeff Squyres (jsquyres) wrote:
> Hum. I was under the impression that with today's code (i.e., not ODP),
> if you
>
>     a = malloc(N);
>     ibv_reg_mr(..., a, N, ...);
>     free(a);
>
> (assuming that the memory actually left the process at free)
>
> then the relevant kernel verbs driver was notified, and would
> unregister that memory. ...but I'm an MPI guy, not a kernel guy --
> it seems like you're saying that my impression was wrong (which
> doesn't currently matter, because we intercept free/sbrk and
> unregister such memory anyway).
Sadly no. What happens is that once you do ibv_reg_mr, that 'HCA virtual
address' is tied, forever, to the physical memory that sat under the
'process virtual address' *at that moment*. So in the case above, RDMA can
continue after the free, and it continues to hit the same *physical* memory
it always hit, but due to the free the process has lost access to that
memory (the kernel keeps the physical memory reserved for RDMA purposes
until unreg, though).

This is fundamentally why you need to intercept mmap/munmap/sbrk: if the
process's VM mapping is changed through those syscalls, then the HCA's view
of the VM and the process's VM become desynchronized.

> > 'magically be registered' is the wrong way to think about it - the
> > registration of VA=0x100 is simply kept, and any change to the
> > underlying physical mapping of the VA is synchronized with the HCA.
>
> What happens if you:
>
>     a = malloc(N * page_size);
>     ibv_reg_mr(..., a, N * page_size, ...);
>     free(a);
>     // incoming RDMA arrives targeted at buffer a

Haggai should comment on this, but my impression/expectation is that you'll
get a remote protection fault.

> Or if you:
>
>     a = malloc(N * page_size);
>     ibv_reg_mr(..., a, N * page_size, ...);
>     free(a);
>     a = malloc(N / 2 * page_size);
>     // incoming RDMA arrives targeted at buffer a that is of length
>     // (N * page_size)

Again, I expect a remote protection fault.

Note, of course, that both of these cases hold only if the underlying VM is
manipulated in a way that actually unmaps the pages (e.g. mmap/munmap, not
free). I would also assume that attempts to RDMA-write read-only pages
protection fault as well.

> It does seem quite odd, abstractly speaking, that a registration
> would survive a free/re-malloc (which is arguably a "different"
> buffer).

Not at all: the purpose of the registration is to allow access via RDMA to
a portion of the process's address space. The address space doesn't change,
but what it is mapped to can vary.
So - the ODP semantics make much more sense; so much so that I'm not sure
we need an ODP flag at all, but that can be discussed when the patches are
proposed...

> That being said, it still seems like MPI needs a registration cache.
> It is several good steps forward if we don't need to intercept
> free/sbrk/whatever, but when MPI_Send(buf, ...) is invoked, we still
> have to check that the entire buf is registered. If ibv_reg_mr(...,
> 0, 2^64, ...) was supported, that would obviate the entire need for
> registration caches. That would be wonderful.

Yes, except that this shifts around where the registration overhead ends
up. Basically the HCA driver now has the registration cache you had in MPI,
and all the same overheads still exist. No free lunch here :(

Haggai: a verb to resize a registration would probably be a helpful step.
MPI could maintain one registration that covers the sbrk region and one
registration that covers the heap - much easier than searching tables and
things.

Also bear in mind that all RDMA access protections are disabled if you
register the entire process VM; the remote(s) can scribble on/read
everything..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html