On 01/02/2018 14:57, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>>> On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
>>>>>> On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
>>>>>>> Currently only file backed memory backend can
>>>>>>> be created with a "share" flag in order to allow
>>>>>>> sharing guest RAM with other processes in the host.
>>>>>>>
>>>>>>> Add the "share" flag also to RAM Memory Backend
>>>>>>> in order to allow remapping parts of the guest RAM
>>>>>>> to different host virtual addresses. This is needed
>>>>>>> by the RDMA devices in order to remap non-contiguous
>>>>>>> QEMU virtual addresses to a contiguous virtual address range.
>>>>>>>
>>>>>>
>>>>>> Why do we need to make this configurable? Would anything break
>>>>>> if MAP_SHARED was always used if possible?
>>>>>
>>>>> See Documentation/vm/numa_memory_policy.txt for a list
>>>>> of complications.
>>>>
>>>> Ew.
>>>>
>>>>>
>>>>> Maybe we should make more of an effort to detect and report these
>>>>> issues.
>>>>
>>>> Probably. Having other features breaking silently when using
>>>> pvrdma doesn't sound good. We must at least document those
>>>> problems in the documentation for memory-backend-ram.
>>>>
>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>
>>> It's a side effect of the kernel/userspace API which always wants
>>> a single HVA/len pair to map memory for the application.
>>>
>>>
>>
>> Hi Eduardo and Michael,
>>
>>>> Can
>>>> this be fixed?
>>>
>>> I think yes. It'd need to be a kernel patch for the RDMA subsystem
>>> mapping an s/g list with actual memory. The HVA/len pair would then just
>>> be used to refer to the region, without creating the two mappings.
>>>
>>> Something like splitting the register mr into
>>>
>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>
>>> addmemory(mr, offset, hva, len) - pin memory
>>>
>>> register mr - pass it to HW
>>>
>>> As a nice side effect we won't burn so much virtual address space.
>>>
>>
>> We would still need a contiguous virtual address space range (for post-send)
>> which we don't have since guest contiguous virtual address space
>> will always end up as non-contiguous host virtual address space.
>
> It just needs to be contiguous in the HCA virtual address space.
> Software never accesses through this pointer.
> In other words - basically expose register physical mr to userspace.
>
>
>>
>> I am not sure the RDMA HW can handle a large VA with holes.
>>
>> An alternative would be 0-based MR, QEMU intercepts the post-send
>> operations and can subtract the guest VA base address.
>> However I didn't see the implementation in kernel for 0 based MRs
>> and also the RDMA maintainer said it would work for local keys
>> and not for remote keys.
>>
>>> This will fix rdma with hugetlbfs as well which is currently broken.
>>>
>>>
>>
>> There is already a discussion on the linux-rdma list:
>> https://www.spinics.net/lists/linux-rdma/msg60079.html
>> But it will take some (actually a lot of) time, we are currently talking
>> about
>> a possible API.
>
> You probably need to pass the s/g piece by piece since it might exceed
> any reasonable array size.
Right. They say the new API is ioctl based, so this is not a limitation.
We also proposed a bitmap representation of a large range, but what we
really need is what you mentioned: to pass the Guest VA directly to reg_mr.

Thanks,
Marcel

>
>> And it does not solve the re-mapping...
>>
>> Thanks,
>> Marcel
>
> Haven't read through that discussion. But at least what I posted solves
> it since you do not need it contiguous in HVA any longer.
>
>>>> --
>>>> Eduardo
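
For readers following the thread, here is a toy Python model of the three-step split quoted above (create mr, addmemory, register mr). It is only a sketch of the idea, not an existing kernel or verbs API: the names create_mr, add_memory, register_mr and translate, the PAGE constant, and the rule that registration fails when the range has holes are all assumptions made for illustration. It shows that the region can be contiguous in the address space the HCA sees while the backing HVAs are scattered, and that the pieces can be added one at a time.

```python
from dataclasses import dataclass, field

PAGE = 4096  # assumed page size, for illustration only


@dataclass
class Chunk:
    offset: int  # offset of this piece inside the MR
    hva: int     # host virtual address backing this piece
    length: int


@dataclass
class MemoryRegion:
    va: int      # address the HCA uses for the region (e.g. the guest VA)
    length: int
    chunks: list = field(default_factory=list)
    registered: bool = False


def create_mr(va, length):
    """Step 1: allocate a handle and record va/len. Nothing is pinned yet."""
    return MemoryRegion(va, length)


def add_memory(mr, offset, hva, length):
    """Step 2: attach one piece of backing memory (a real kernel would pin it)."""
    if mr.registered:
        raise RuntimeError("MR already registered")
    if offset < 0 or length <= 0 or offset + length > mr.length:
        raise ValueError("chunk lies outside the MR")
    for c in mr.chunks:
        if offset < c.offset + c.length and c.offset < offset + length:
            raise ValueError("chunk overlaps an existing one")
    mr.chunks.append(Chunk(offset, hva, length))


def register_mr(mr):
    """Step 3: hand the MR to the hardware. This model rejects holes."""
    mr.chunks.sort(key=lambda c: c.offset)
    covered = 0
    for c in mr.chunks:
        if c.offset != covered:
            raise ValueError("hole in MR at offset %d" % covered)
        covered += c.length
    if covered != mr.length:
        raise ValueError("MR not fully backed")
    mr.registered = True


def translate(mr, va):
    """Resolve an HCA-side address to the HVA that backs it."""
    off = va - mr.va
    for c in mr.chunks:
        if c.offset <= off < c.offset + c.length:
            return c.hva + (off - c.offset)
    raise ValueError("address outside the MR")


# Two guest-contiguous pages backed by scattered host addresses,
# added piece by piece and in arbitrary order.
GUEST_VA = 0x10000
mr = create_mr(GUEST_VA, 2 * PAGE)
add_memory(mr, PAGE, 0x7F0000100000, PAGE)
add_memory(mr, 0, 0x7F0000200000, PAGE)
register_mr(mr)
```

Because each add_memory call carries a single piece, the caller never has to build one huge s/g array, which matches the "piece by piece" remark in the thread. Whether real RDMA hardware could accept such a region is exactly the open question discussed above.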