On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>> On Wed, Jan 31, 2018 at 11:10:07PM +0200, Michael S. Tsirkin wrote:
>>> On Wed, Jan 31, 2018 at 06:40:59PM -0200, Eduardo Habkost wrote:
>>>> On Wed, Jan 17, 2018 at 11:54:18AM +0200, Marcel Apfelbaum wrote:
>>>>> Currently only file backed memory backend can
>>>>> be created with a "share" flag in order to allow
>>>>> sharing guest RAM with other processes in the host.
>>>>>
>>>>> Add the "share" flag also to RAM Memory Backend
>>>>> in order to allow remapping parts of the guest RAM
>>>>> to different host virtual addresses. This is needed
>>>>> by the RDMA devices in order to remap non-contiguous
>>>>> QEMU virtual addresses to a contiguous virtual address range.
>>>>>
>>>>
>>>> Why do we need to make this configurable? Would anything break
>>>> if MAP_SHARED was always used if possible?
>>>
>>> See Documentation/vm/numa_memory_policy.txt for a list
>>> of complications.
>>
>> Ew.
>>
>>>
>>> Maybe we should more of an effort to detect and report these
>>> issues.
>>
>> Probably. Having other features breaking silently when using
>> pvrdma doesn't sound good. We must at least document those
>> problems in the documentation for memory-backend-ram.
>>
>> BTW, what's the root cause for requiring HVAs in the buffer?
>
> It's a side effect of the kernel/userspace API which always wants
> a single HVA/len pair to map memory for the application.
>
>

Hi Eduardo and Michael,

>> Can
>> this be fixed?
>
> I think yes. It'd need to be a kernel patch for the RDMA subsystem
> mapping an s/g list with actual memory. The HVA/len pair would then just
> be used to refer to the region, without creating the two mappings.
>
> Something like splitting the register mr into
>
> mr = create mr (va/len) - allocate a handle and record the va/len
>
> addmemory(mr, offset, hva, len) - pin memory
>
> register mr - pass it to HW
>
> As a nice side effect we won't burn so much virtual address space.
>

We would still need a contiguous virtual address space range (for post-send),
which we don't have, since guest contiguous virtual address space
will always end up as non-contiguous host virtual address space.

I am not sure the RDMA HW can handle a large VA with holes.

An alternative would be a 0-based MR: QEMU intercepts the post-send
operations and can subtract the guest VA base address.
However, I didn't see an implementation of 0-based MRs in the kernel,
and the RDMA maintainer also said it would work for local keys
and not for remote keys.

> This will fix rdma with hugetlbfs as well which is currently broken.
>
>

There is already a discussion on the linux-rdma list:
    https://www.spinics.net/lists/linux-rdma/msg60079.html
But it will take some (actually a lot of) time; we are currently talking about
a possible API. And it does not solve the re-mapping...

Thanks,
Marcel

>> --
>> Eduardo
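
A minimal sketch of the split registration proposed in the thread above (create mr, addmemory, register mr), written as a toy Python model rather than kernel or verbs code. Everything here is an assumption made for illustration: the class name ToyMR, its method names, and the example addresses are hypothetical, "pinning" is reduced to recording a chunk, and the rule that register refuses an MR with holes is a modelling choice, not a statement about what RDMA hardware supports.

```python
from bisect import bisect_right


class ToyMR:
    """Toy model of one memory region backed by several host chunks.

    Hypothetical illustration only; no real RDMA API is being described.
    """

    def __init__(self, va, length):
        # "create mr (va/len)": allocate a handle and record the va/len.
        self.va = va
        self.length = length
        self.chunks = []  # sorted list of (offset, hva, length)
        self.registered = False

    def add_memory(self, offset, hva, length):
        # "addmemory(mr, offset, hva, len)": a real implementation would pin
        # pages here; this model only records where the chunk lives.
        if self.registered:
            raise RuntimeError("MR already registered")
        if offset < 0 or length <= 0 or offset + length > self.length:
            raise ValueError("chunk outside the MR")
        for o, _, l in self.chunks:
            if offset < o + l and o < offset + length:
                raise ValueError("chunk overlaps an existing one")
        self.chunks.append((offset, hva, length))
        self.chunks.sort()

    def register(self):
        # "register mr": in this model, only allowed once the whole
        # [0, length) range is backed, i.e. the VA range has no holes.
        covered = 0
        for o, _, l in self.chunks:
            if o != covered:
                raise ValueError("hole at offset %#x" % covered)
            covered = o + l
        if covered != self.length:
            raise ValueError("hole at offset %#x" % covered)
        self.registered = True

    def translate(self, addr):
        # Map an address in [va, va + length) to the host address backing it.
        if not self.registered:
            raise RuntimeError("MR not registered")
        offset = addr - self.va
        if not 0 <= offset < self.length:
            raise ValueError("address outside the MR")
        starts = [o for o, _, _ in self.chunks]
        o, hva, _ = self.chunks[bisect_right(starts, offset) - 1]
        return hva + (offset - o)


# One contiguous MR range backed by two non-contiguous host chunks.
mr = ToyMR(va=0x10000, length=0x3000)
mr.add_memory(0x0000, hva=0x7F0000200000, length=0x1000)
mr.add_memory(0x1000, hva=0x7F0000900000, length=0x2000)
mr.register()
```

In this model the single va/len pair only names the region, while the backing memory is a list of separate host chunks, so no second contiguous host mapping is needed. It does not address the post-send concern raised in the reply, where the address range used by the guest still has to be contiguous.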