Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On 04/10/2013 11:05 AM, Michael S. Tsirkin wrote: On Wed, Apr 10, 2013 at 11:48:24AM -0400, Michael R. Hines wrote: There's a very nice, simple client/server RDMA application on the internet you can use to test your patch. http://thegeekinthecorner.wordpress.com/2010/09/28/ rdma-read-and-write-with-ib-verbs/ This guy provides the source code which dumps several gigabytes over RDMA to the other side. There's no need to run QEMU to test your patch, assuming you have access to infiniband hardware. - Michael Does this app have any COW pages? It can easily be made to be. It's just a couple of C source files. I've only used it for basic testing. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On 04/09/2013 12:39 PM, Michael S. Tsirkin wrote: On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote: To be more specific, here's what I did: 1. apply kernel module patch - re-insert module 1. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ) 2. Start the RDMA migration 3. Migration completes without any errors This test does *not* work with a cgroup swap limit, however. The process gets killed. (Both with and without GIFT) - Michael Try to attach a debugger and see where it is when it gets killed? It's killed by cgroups - not a CPU exception. The same test works fine using TCP migration with cgroups - everything is fine there. The memory that RDMA attempted to register hits some kind of cgroups policy which results in a kernel message saying that the cgroup swap limit was hit and then it goes ahead and kills the process altogether. It's not a QEMU problem - it seems to be a kernel bug. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
With respect, I'm going to offload testing this patch back to the author =) because I'm trying to address all of Paolo's other minor issues with the RDMA patch before we can merge. Since dynamic page registration (as you requested) is now fully implemented, this patch is less urgent since we now have a mechanism in place to avoid page pinning on both sides of the migration. - Michael On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote: presumably is_dup_page reads the page, so should not break COW ... I'm not sure about the cgroups swap limit - you might have too many non COW pages so attempting to fault them all in makes you exceed the limit. You really should look at what is going on in the pagemap, to see if there's measureable gain from the patch. On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote: Well, I have the is_dup_page() commented out...when RDMA is activated. Is there something else in QEMU that could be touching the page that I don't know about? - Michael On 04/05/2013 05:03 PM, Roland Dreier wrote: On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote: Sorry, I was wrong. ignore the comments about cgroups. That's still broken. (i.e. trying to register RDMA memory while using a cgroup swap limit cause the process get killed). But the GIFT flag patch works (my understanding is that GIFT flag allows the adapter to transmit stale memory information, it does not have anything to do with cgroups specifically). The point of the GIFT patch is to avoid triggering copy-on-write so that memory doesn't blow up during migration. If that doesn't work then there's no point to the patch. - R. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
The userland part of the patch was missing (IBV_ACCESS_GIFT). I added flag that to /usr/include in addition to this patch and did a test RDMA migrate and it seems to work without any problems. I also removed the IBV_*_WRITE flags on the sender-side and activated cgroups with the memory.memsw.limit_in_bytes activated and the migration with RDMA also succeeded without any problems (both with *and* without GIFT also worked). Any additional tests you would like? - Michael On 03/24/2013 11:51 AM, Michael S. Tsirkin wrote: At the moment registering an MR breaks COW. This breaks memory overcommit for users such as KVM: we have a lot of COW pages, e.g. instances of the zero page or pages shared using KSM. If the application does not care that adapter sees stale data (for example, it tracks writes reregisters and resends), it can use a new IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The semantics are similar to that of SPLICE_F_GIFT thus the name. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Please review and consider for 3.10. Changes from v1: rename APP_READONLY to _GIFT: similar to vmsplice's F_GIFT. drivers/infiniband/core/umem.c | 21 - include/rdma/ib_verbs.h| 9 - 2 files changed, 20 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index a841123..5dee86d 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -89,6 +89,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, int ret; int off; int i; + bool gift, writable; DEFINE_DMA_ATTRS(attrs); if (dmasync) @@ -96,6 +97,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (!can_do_mlock()) return ERR_PTR(-EPERM); + /* +* We ask for writable memory if any access flags other than +* remote read or gift are set. Local write and remote write +* obviously require write access. Remote atomic can do +* things like fetch and add, which will modify memory, and +* MW bind can change permissions by binding a window. +*/ + gift = access IB_ACCESS_GIFT; + writable = access ~(IB_ACCESS_REMOTE_READ | IB_ACCESS_GIFT); umem = kmalloc(sizeof *umem, GFP_KERNEL); if (!umem) @@ -105,14 +115,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem-length= size; umem-offset= addr ~PAGE_MASK; umem-page_size = PAGE_SIZE; - /* -* We ask for writable memory if any access flags other than -* remote read are set. Local write and remote write -* obviously require write access. Remote atomic can do -* things like fetch and add, which will modify memory, and -* MW bind can change permissions by binding a window. -*/ - umem-writable = !!(access ~IB_ACCESS_REMOTE_READ); + umem-writable = writable; /* We assume the memory is from hugetlb until proved otherwise */ umem-hugetlb = 1; @@ -152,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, ret = get_user_pages(current, current-mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), -1, !umem-writable, page_list, vma_list); +!gift, !umem-writable, page_list, vma_list); if (ret 0) goto out; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 98cc4b2..2e6e13c 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -871,7 +871,14 @@ enum ib_access_flags { IB_ACCESS_REMOTE_READ = (12), IB_ACCESS_REMOTE_ATOMIC = (13), IB_ACCESS_MW_BIND = (14), - IB_ZERO_BASED = (15) + IB_ZERO_BASED = (15), + /* +* IB_ACCESS_GIFT: This memory is a gift to the adapter. If memory is +* modified after registration, the local version and data seen by the +* adapter through this region rkey may differ. +* Only legal with IB_ACCESS_REMOTE_READ or no permissions. +*/ + IB_ACCESS_GIFT = (16) }; struct ib_phys_buf { -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
Sorry, I was wrong. ignore the comments about cgroups. That's still broken. (i.e. trying to register RDMA memory while using a cgroup swap limit cause the process get killed). But the GIFT flag patch works (my understanding is that GIFT flag allows the adapter to transmit stale memory information, it does not have anything to do with cgroups specifically). Am I missing something? I was only testing the GIFT flag patch. Note: I only turned it on - I did not verify the (non) consitency of the memory that was transmitted. - Michael On 04/05/2013 04:43 PM, Roland Dreier wrote: On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote: I also removed the IBV_*_WRITE flags on the sender-side and activated cgroups with the memory.memsw.limit_in_bytes activated and the migration with RDMA also succeeded without any problems (both with *and* without GIFT also worked). Not sure I'm interpreting this correctly. Are you saying that things worked without actually setting the GIFT flag? In which case why are we adding this flag? - R. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
Well, I have the is_dup_page() commented out...when RDMA is activated. Is there something else in QEMU that could be touching the page that I don't know about? - Michael On 04/05/2013 05:03 PM, Roland Dreier wrote: On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote: Sorry, I was wrong. ignore the comments about cgroups. That's still broken. (i.e. trying to register RDMA memory while using a cgroup swap limit cause the process get killed). But the GIFT flag patch works (my understanding is that GIFT flag allows the adapter to transmit stale memory information, it does not have anything to do with cgroups specifically). The point of the GIFT patch is to avoid triggering copy-on-write so that memory doesn't blow up during migration. If that doesn't work then there's no point to the patch. - R. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
I'm getting around to it, Michael, I promise =). Just came back from vacation. I have to re-build the ib_ucm kernel module from the original SUSE kernel that I'm using along before I can test it.. The machines I'm using are slightly tied up with other things, so its taking me a little time to prepare to apply the patch and test the new kernel module... - Michael On 04/02/2013 01:05 PM, Michael S. Tsirkin wrote: On Tue, Apr 02, 2013 at 09:57:38AM -0700, Roland Dreier wrote: On Tue, Apr 2, 2013 at 8:51 AM, Michael S. Tsirkin m...@redhat.com wrote: At the moment registering an MR breaks COW. This breaks memory overcommit for users such as KVM: we have a lot of COW pages, e.g. instances of the zero page or pages shared using KSM. If the application does not care that adapter sees stale data (for example, it tracks writes reregisters and resends), it can use a new IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The semantics are similar to that of SPLICE_F_GIFT thus the name. Signed-off-by: Michael S. Tsirkin m...@redhat.com Roland, Michael is yet to test this but could you please confirm whether this looks acceptable to you? The patch itself is reasonable I guess, given the needs of this particular app. I'm not particularly happy with the name of the flag. The analogy with SPLICE_F_GIFT doesn't seem particularly strong and I'm not convinced even the splice flag name is very understandable. But in the RDMA case there's not really any sense in which we're gifting memory to the adapter -- we're just telling the library please don't trigger copy-on-write and it doesn't seem particularly easy for users to understand that from the flag name. - R. The point really is that any writes by application won't be seen until re-registration, right? OK, what's a better name? IBV_ACCESS_NON_COHERENT? Please tell me what is preferable and we'll go ahead with it. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rdma: don't make pages writeable if not requiested
Yes, I'd be happy to try the patch. Got meetings all day.. but will dive in soon. On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote: core/umem.c seems to get the arguments to get_user_pages in the reverse order: it sets writeable flag and breaks COW for MAP_SHARED if and only if hardware needs to write the page. This breaks memory overcommit for users such as KVM: each time we try to register a page to send it to remote, this breaks COW. It seems that for applications that only have REMOTE_READ permission, there is no reason to break COW at all. If the page that is COW has lots of copies, this makes the user process quickly exceed the cgroups memory limit. This makes RDMA mostly useless for virtualization, thus the stable tag. Reported-by: Michael R. Hines mrhi...@linux.vnet.ibm.com Cc: sta...@vger.kernel.org Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Note: compile-tested only, I don't have RDMA hardware at the moment. Michael, could you please try this patch (also fixing your usespace code not to request write access) and report? Note2: grep for get_user_pages in infiniband drivers turns up lots of users who set write to 1 unconditionally. These might be bugs too, should be checked. drivers/infiniband/core/umem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index a841123..5929598 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, ret = get_user_pages(current, current-mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), -1, !umem-writable, page_list, vma_list); +!umem-writable, 1, page_list, vma_list); if (ret 0) goto out; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html