On Wed, Jun 05, 2013 at 06:45:13PM +0000, Jeff Squyres (jsquyres) wrote:

> Hum.  I was under the impression that with today's code (i.e., not ODP), if 
> you
> 
> a = malloc(N);
> ibv_reg_mr(..., a, N, ...);
> free(a);
> 
> (assuming that the memory actually left the process at free)
> 
> Then the relevant kernel verbs driver was notified, and would
> unregister that memory region.  ...but I'm an MPI guy, not a kernel guy --
> it seems like you're saying that my impression was wrong (which
> doesn't currently matter because we intercept free/sbrk and
> unregister such memory, anyway).

Sadly no: what happens is that once you do ibv_reg_mr, that 'HCA
virtual address' is tied to the physical memory under the
'process virtual address' *at that moment*, forever.

So in the case above, RDMA can continue after the free, and it
continues to hit the same *physical* memory that it always hit, but
due to the free the process has lost access to that memory (the kernel
keeps the physical memory reserved for RDMA purposes until unreg
though).

This is fundamentally why you need to intercept mmap/munmap/sbrk: if
the process's VM mapping is changed through those syscalls, then the
HCA's view of the VM and the process's VM become desynchronized.
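The interception described above can be sketched roughly as follows. This is a toy model, not Open MPI's actual hook code: `struct reg_entry`, `memory_hook`, and the `dereg` stub are all hypothetical stand-ins (a real hook would call `ibv_dereg_mr` on the cached `struct ibv_mr`):

```c
/* Sketch, with verbs stubbed out so it is self-contained: a hook the
   MPI layer runs from intercepted munmap/free/sbrk, dropping every
   cached registration that overlaps the range being unmapped so the
   HCA's view never goes stale. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one registration kept by the MPI layer. */
struct reg_entry {
    uintptr_t addr;
    size_t    len;
    int       registered;   /* stands in for a struct ibv_mr * */
};

static struct reg_entry cache[16];
static int ncache;

/* Stand-in for ibv_dereg_mr(): just mark the entry dead. */
static void dereg(struct reg_entry *e) { e->registered = 0; }

/* Called from the munmap/sbrk interception: deregister anything
   overlapping [addr, addr+len) before the mapping disappears. */
static void memory_hook(void *addr, size_t len)
{
    uintptr_t lo = (uintptr_t)addr, hi = lo + len;
    for (int i = 0; i < ncache; i++)
        if (cache[i].addr < hi && lo < cache[i].addr + cache[i].len)
            dereg(&cache[i]);
}
```

A real implementation has to wire this hook into munmap/sbrk (via malloc hooks or linker interposition), which is exactly the fragility ODP is meant to remove.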

> > 'magically be registered' is the wrong way to think about it - the
> > registration of VA=0x100 is simply kept, and any change to the
> > underlying physical mapping of the VA is synchronized with the HCA.
> 
> What happens if you:
> 
> a = malloc(N * page_size);
> ibv_reg_mr(..., a, N * page_size, ...);
> free(a);
> // incoming RDMA arrives targeted at buffer a

Haggai should comment on this, but my impression/expectation is that
you'll get a remote protection fault.

> Or if you:
> 
> a = malloc(N * page_size);
> ibv_reg_mr(..., a, N * page_size, ...);
> free(a);
> a = malloc(N / 2 * page_size);
> // incoming RDMA arrives targeted at buffer a that is of length (N*page_size)

Again, I expect a remote protection fault.

Note, of course, that both of these cases hold only if the underlying
VM is manipulated in a way that actually unmaps the pages (e.g.
mmap/munmap, not free).

I would also assume that attempts to RDMA-write to read-only pages
protection-fault as well.

> It does seem quite odd, abstractly speaking, that a registration
> would survive a free/re-malloc (which is arguably a "different"
> buffer).

Not at all: the purpose of the registration is to allow access via
RDMA to a portion of the process's address space. The address space
doesn't change, but what it is mapped to can vary.

So the ODP semantics make much more sense, so much so that I'm not
sure we need an ODP flag at all; but that can be discussed when the
patches are proposed...

> That being said, it still seems like MPI needs a registration cache.
> It is several good steps forward if we don't need to intercept
> free/sbrk/whatever, but when MPI_Send(buf, ...) is invoked, we still
> have to check that the entire buf is registered.  If ibv_reg_mr(...,
> 0, 2^64, ...) was supported, that would obviate the entire need for
> registration caches.  That would be wonderful.

Yes, except that this shifts around where the registration overhead
ends up. Basically the HCA driver now has the registration cache you
had in MPI, and all the same overheads still exist. No free lunch
here :(

Haggai: A verb to resize a registration would probably be a helpful
step. MPI could maintain one registration that covers the sbrk
region and one registration that covers the heap, much easier than
searching tables and things.
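Concretely, the per-MPI_Send check being discussed amounts to an interval lookup. A minimal sketch of that lookup (the cache structure and `cache_find` are hypothetical illustrations, not verbs API; a real entry would also carry the `struct ibv_mr *`):

```c
/* Sketch of the lookup an MPI registration cache must do on every
   send: is [buf, buf+len) fully inside one cached registration? */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mr_cache_entry {
    uintptr_t addr;
    size_t    len;
    /* in real code: struct ibv_mr *mr; */
};

/* Return the index of a cached registration covering the whole
   buffer, or -1 if the buffer must be (re)registered first. */
static int cache_find(const struct mr_cache_entry *tab, int n,
                      const void *buf, size_t len)
{
    uintptr_t lo = (uintptr_t)buf, hi = lo + len;
    for (int i = 0; i < n; i++)
        if (tab[i].addr <= lo && hi <= tab[i].addr + tab[i].len)
            return i;
    return -1;
}
```

With one resizable registration per region (sbrk/heap), as suggested above, this degenerates to checking a couple of fixed entries instead of searching a table.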

Also bear in mind that all RDMA access protection is effectively
disabled if you register the entire process VM: the remote(s) can
scribble over/read everything.

Jason