Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
On Wed, Jul 15, 2015 at 02:48:00PM -0400, Matthew Wilcox wrote:
> On Wed, Jul 15, 2015 at 11:25:55AM -0600, Jens Axboe wrote:
> > On 07/15/2015 11:19 AM, Keith Busch wrote:
> > > On Wed, 15 Jul 2015, Bart Van Assche wrote:
> > > > * With blk-mq and scsi-mq optimal performance can only be achieved if
> > > > the relationship between MSI-X vector and NUMA node does not change
> > > > over time. This is necessary to allow a blk-mq/scsi-mq driver to
> > > > ensure that interrupts are processed on the same NUMA node as the node
> > > > on which the data structures for a communication channel have been
> > > > allocated. However, today there is no API that allows blk-mq/scsi-mq
> > > > drivers and irqbalanced to exchange information about the relationship
> > > > between MSI-X vector ranges and NUMA nodes.
> > >
> > > We could have low-level drivers provide blk-mq the controller's irq
> > > associated with a particular h/w context, and the block layer can
> > > provide the context's cpumask to irqbalance with the smp affinity hint.
> > >
> > > The nvme driver already uses the hwctx cpumask to set hints, but this
> > > doesn't seem like it should be a driver responsibility. It currently
> > > doesn't work correctly anyway with hot-cpu, since blk-mq could rebalance
> > > the h/w contexts without syncing with the low-level driver.
> > >
> > > If we can add this to blk-mq, one additional case to consider is if the
> > > same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
> > > assignment needs to be aware of this to prevent sharing a vector across
> > > NUMA nodes.
> >
> > Exactly. I may have promised to do just that at the last LSF/MM
> > conference, I just haven't done it yet. The point is to share the mask;
> > I'd ideally like to take it all the way, where the driver just asks for a
> > number of vecs through a nice API that takes care of all this. There's a
> > lot of duplicated code in drivers for this these days, and it's a mess.
>
> Yes. I think the fundamental problem is that our MSI-X API is so funky. We
> have this incredibly flexible scheme where each MSI-X vector could have its
> own interrupt handler, but that's not what drivers want. They want to say
> "Give me eight MSI-X vectors spread across the CPUs, and use this interrupt
> handler for all of them." That is, instead of the current scheme where each
> MSI-X vector gets its own Linux interrupt, we should have one interrupt
> handler (of the per-cpu interrupt type), which shows up with N bits set in
> its CPU mask.

It would definitely be nice to have a way to express that. But it's also
pretty common for drivers to have e.g. RX and TX use separate vectors, and
these need separate handlers.

___
Ksummit-discuss mailing list
ksummit-disc...@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
On Wed, Jul 15, 2015 at 05:07:08AM -0700, Christoph Hellwig wrote:
> Many years ago we decided to move setting of IRQ core affinities to
> userspace with the irqbalance daemon. These days we have systems with lots
> of MSI-X vectors, and we have hardware and subsystem support for per-CPU
> I/O queues in the block layer, the RDMA subsystem and probably the network
> stack (I'm not too familiar with the recent developments there). It would
> really help the out-of-the-box performance and experience if we could
> allow such subsystems to bind interrupt vectors to the node that the queue
> is configured on.

I think you are right, it's true for networking. Whenever someone tries to
benchmark networking, the first thing done is always disabling irqbalance
and pinning IRQs manually away from wherever the benchmark is running, but
on the same NUMA node. Without that, interrupts don't let the benchmark
make progress. Alternatively, people give up on interrupts completely and
start polling hardware aggressively. Nice for a benchmark, not nice for the
environment.

> I'd like to discuss if the rationale for moving the IRQ affinity setting
> fully to userspace is still correct in today's world, and any pitfalls
> we'll have to learn from in irqbalanced and the old in-kernel affinity
> code.

IMHO there could be a benefit from a better integration with the scheduler.
Maybe an interrupt handler can be viewed as a kind of thread, so the
scheduler can make decisions about where to run it next?
Re: decent performance drop for SCSI LLD / SAN initiator when iommu is turned on
On Mon, May 06, 2013 at 03:35:58PM -0700, Alexander Duyck wrote:
> On 05/06/2013 02:39 PM, Or Gerlitz wrote:
> > On Thu, May 2, 2013 at 4:56 AM, Michael S. Tsirkin m...@redhat.com wrote:
> > > On Thu, May 02, 2013 at 02:11:15AM +0300, Or Gerlitz wrote:
> > > > So we've noted that when configuring the kernel to boot with intel
> > > > iommu set to on, on a physical node (non VM, and without enabling
> > > > SRIOV by the HW device driver), raw performance of the iSER (iSCSI
> > > > RDMA) SAN initiator is reduced notably. E.g. in the testbed we looked
> > > > at today we had ~260K 1KB random IOPS and 5.5GB/s BW for 128KB IOs
> > > > with iommu turned off for a single LUN, and ~150K IOPS and 4GB/s BW
> > > > with iommu turned on. No change on the target node between runs.
> > >
> > > That's why we have iommu=pt. See the definition of iommu_pass_through
> > > in arch/x86/kernel/pci-dma.c.
> >
> > Hi Michael (hope you feel better),
> >
> > We did some runs with the pt approach you suggested and still didn't get
> > the promised gain -- in parallel we came across this 2012 commit
> > f800326dc "ixgbe: Replace standard receive path with a page based
> > receive", where they say "[...] we are able to see a considerable
> > performance gain when an IOMMU is enabled because we are no longer
> > unmapping every buffer on receive [...] instead we can simply call
> > sync_single_range [...]". Looking at the commit you can see that they
> > allocate a page/skb, dma_map it initially, and later over the life cycle
> > of that buffer use dma_sync_for_device/cpu and avoid dma_map/unmap on
> > the fast path.
> >
> > Well, a few questions on which I'd love to hear people's opinions.
> > First, this approach seems cool for the network device RX path, but what
> > about the TX path -- any idea how to avoid dma_map for it? Or why
> > doesn't calling dma_map/unmap for every buffer on the TX path involve a
> > notable perf hit? Second, I don't see how to apply the method to a block
> > device, since these devices don't allocate buffers, but rather get a
> > scatter-gather list of pages from upper layers, issue dma_map_sg on them
> > and submit the IO, and later when done call dma_unmap_sg.
> >
> > Or.
>
> The Tx path ends up taking a performance hit if the IOMMU is enabled. It
> just isn't as severe due to things like TSO. One way to work around the
> performance penalty is to allocate bounce buffers and just leave them
> statically mapped. Then you can simply memcpy the data to the buffers and
> avoid the locking overhead of allocating/freeing IOMMU resources. It
> consumes more memory but works around the IOMMU limitations.
>
> Thanks, Alex

But why isn't iommu=pt effective? AFAIK the whole point of it was to give
up on security for host-controlled devices, but still get a measure of
security for assigned devices.

-- MST
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Wed, Apr 10, 2013 at 12:32:31AM -0400, Michael R. Hines wrote: On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote: Which mechanism do you refer to? You patches still seem to pin each page in guest memory at some point, which will break all COW. In particular any pagemap tricks to detect duplicates on source that I suggested won't work. Sorry, I mispoke. I'm reffering to dynamic server page registration. Of course it does not eliminate pinning - but it does mitigate the foot print of the VM as a feature that was requested. I have implemented it and documented it. - Michael Okay, but GIFT is supposed to be used on send side: it's only allowed with local/remote read access, and serves to reduce memory usage on send side. For example, disable zero page detection and look at memory usage on send side before and after migration. Dynamic registration on the receive side is nice but seems completely unrelated ... -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Wed, Apr 10, 2013 at 11:48:24AM -0400, Michael R. Hines wrote:
> There's a very nice, simple client/server RDMA application on the internet
> you can use to test your patch:
>
> http://thegeekinthecorner.wordpress.com/2010/09/28/rdma-read-and-write-with-ib-verbs/
>
> This guy provides the source code, which dumps several gigabytes over RDMA
> to the other side. There's no need to run QEMU to test your patch,
> assuming you have access to infiniband hardware.
>
> - Michael

Does this app have any COW pages?

> On 04/10/2013 01:32 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 10, 2013 at 12:32:31AM -0400, Michael R. Hines wrote:
> > > On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote:
> > > > Which mechanism do you refer to? Your patches still seem to pin each
> > > > page in guest memory at some point, which will break all COW. In
> > > > particular any pagemap tricks to detect duplicates on source that I
> > > > suggested won't work.
> > >
> > > Sorry, I misspoke. I'm referring to dynamic server page registration.
> > > Of course it does not eliminate pinning - but it does mitigate the
> > > footprint of the VM, as a feature that was requested. I have
> > > implemented it and documented it.
> > >
> > > - Michael
> >
> > Okay, but GIFT is supposed to be used on the send side: it's only
> > allowed with local/remote read access, and serves to reduce memory usage
> > on the send side. For example, disable zero page detection and look at
> > memory usage on the send side before and after migration. Dynamic
> > registration on the receive side is nice but seems completely
> > unrelated ...
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
> To be more specific, here's what I did:
>
> 1. Apply kernel module patch and re-insert module
> 2. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
> 3. Start the RDMA migration
> 4. Migration completes without any errors
>
> This test does *not* work with a cgroup swap limit, however. The process
> gets killed. (Both with and without GIFT.)
>
> - Michael

Try to attach a debugger and see where it is when it gets killed?

> On 04/05/2013 04:43 PM, Roland Dreier wrote:
> > On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
> > mrhi...@linux.vnet.ibm.com wrote:
> > > I also removed the IBV_*_WRITE flags on the sender-side and activated
> > > cgroups with memory.memsw.limit_in_bytes activated, and the migration
> > > with RDMA also succeeded without any problems (both with *and* without
> > > GIFT also worked).
> >
> > Not sure I'm interpreting this correctly. Are you saying that things
> > worked without actually setting the GIFT flag? In which case why are we
> > adding this flag?
> >
> > - R.
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
presumably is_dup_page reads the page, so should not break COW ... I'm not
sure about the cgroups swap limit - you might have too many non-COW pages,
so attempting to fault them all in makes you exceed the limit. You really
should look at what is going on in the pagemap, to see if there's a
measurable gain from the patch.

On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
> Well, I have the is_dup_page() commented out... when RDMA is activated.
> Is there something else in QEMU that could be touching the page that I
> don't know about?
>
> - Michael
>
> On 04/05/2013 05:03 PM, Roland Dreier wrote:
> > On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
> > mrhi...@linux.vnet.ibm.com wrote:
> > > Sorry, I was wrong; ignore the comments about cgroups. That's still
> > > broken (i.e. trying to register RDMA memory while using a cgroup swap
> > > limit causes the process to get killed). But the GIFT flag patch works
> > > (my understanding is that the GIFT flag allows the adapter to transmit
> > > stale memory information; it does not have anything to do with cgroups
> > > specifically).
> >
> > The point of the GIFT patch is to avoid triggering copy-on-write so that
> > memory doesn't blow up during migration. If that doesn't work then
> > there's no point to the patch.
> >
> > - R.
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Fri, Apr 05, 2013 at 01:43:49PM -0700, Roland Dreier wrote:
> On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
> mrhi...@linux.vnet.ibm.com wrote:
> > I also removed the IBV_*_WRITE flags on the sender-side and activated
> > cgroups with memory.memsw.limit_in_bytes activated, and the migration
> > with RDMA also succeeded without any problems (both with *and* without
> > GIFT also worked).
>
> Not sure I'm interpreting this correctly. Are you saying that things
> worked without actually setting the GIFT flag? In which case why are we
> adding this flag?
>
> - R.

We are adding the flag to reduce memory when there are lots of COW pages.
There's no guarantee there will be COW pages, so I expect things to work
both with and without breaking COW, just using much more memory when we
break COW.

-- MST
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Fri, Apr 05, 2013 at 04:17:36PM -0400, Michael R. Hines wrote:
> The userland part of the patch was missing (IBV_ACCESS_GIFT). I added that
> flag to /usr/include in addition to this patch, did a test RDMA migrate,
> and it seems to work without any problems.
>
> I also removed the IBV_*_WRITE flags on the sender-side and activated
> cgroups with memory.memsw.limit_in_bytes activated, and the migration with
> RDMA also succeeded without any problems (both with *and* without GIFT
> also worked).
>
> Any additional tests you would like?
>
> - Michael

RDMA can't really work with swap, so I'm not sure how that's relevant.
Please check memory.usage_in_bytes - is it lower with the GIFT flag? I
think this is what we really care about.

-- MST
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Tue, Apr 09, 2013 at 09:26:59PM -0400, Michael R. Hines wrote:
> With respect, I'm going to offload testing this patch back to the author
> =) because I'm trying to address all of Paolo's other minor issues with
> the RDMA patch before we can merge.

Fair enough; this likely means it won't happen anytime soon, though.

> Since dynamic page registration (as you requested) is now fully
> implemented, this patch is less urgent, since we now have a mechanism in
> place to avoid page pinning on both sides of the migration.
>
> - Michael

Which mechanism do you refer to? Your patches still seem to pin each page
in guest memory at some point, which will break all COW. In particular any
pagemap tricks to detect duplicates on source that I suggested won't work.

> On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
> > presumably is_dup_page reads the page, so should not break COW ... I'm
> > not sure about the cgroups swap limit - you might have too many non-COW
> > pages, so attempting to fault them all in makes you exceed the limit.
> > You really should look at what is going on in the pagemap, to see if
> > there's a measurable gain from the patch.
> >
> > On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
> > > Well, I have the is_dup_page() commented out... when RDMA is
> > > activated. Is there something else in QEMU that could be touching the
> > > page that I don't know about?
> > >
> > > - Michael
> > >
> > > On 04/05/2013 05:03 PM, Roland Dreier wrote:
> > > > On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
> > > > mrhi...@linux.vnet.ibm.com wrote:
> > > > > Sorry, I was wrong; ignore the comments about cgroups. That's
> > > > > still broken (i.e. trying to register RDMA memory while using a
> > > > > cgroup swap limit causes the process to get killed). But the GIFT
> > > > > flag patch works (my understanding is that the GIFT flag allows
> > > > > the adapter to transmit stale memory information; it does not have
> > > > > anything to do with cgroups specifically).
> > > >
> > > > The point of the GIFT patch is to avoid triggering copy-on-write so
> > > > that memory doesn't blow up during migration. If that doesn't work
> > > > then there's no point to the patch.
> > > >
> > > > - R.
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Tue, Apr 02, 2013 at 08:05:21PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 02, 2013 at 09:57:38AM -0700, Roland Dreier wrote:
> > On Tue, Apr 2, 2013 at 8:51 AM, Michael S. Tsirkin m...@redhat.com wrote:
> > > At the moment registering an MR breaks COW. This breaks memory
> > > overcommit for users such as KVM: we have a lot of COW pages, e.g.
> > > instances of the zero page or pages shared using KSM.
> > >
> > > If the application does not care that the adapter sees stale data (for
> > > example, it tracks writes, reregisters and resends), it can use a new
> > > IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The
> > > semantics are similar to that of SPLICE_F_GIFT, thus the name.
> > >
> > > Signed-off-by: Michael S. Tsirkin m...@redhat.com
> > >
> > > Roland, Michael is yet to test this but could you please confirm
> > > whether this looks acceptable to you?
> >
> > The patch itself is reasonable I guess, given the needs of this
> > particular app.
> >
> > I'm not particularly happy with the name of the flag. The analogy with
> > SPLICE_F_GIFT doesn't seem particularly strong, and I'm not convinced
> > even the splice flag name is very understandable. But in the RDMA case
> > there's not really any sense in which we're "gifting" memory to the
> > adapter -- we're just telling the library "please don't trigger
> > copy-on-write", and it doesn't seem particularly easy for users to
> > understand that from the flag name.
> >
> > - R.
>
> The point really is that any writes by the application won't be seen
> until re-registration, right? OK, what's a better name?
> IBV_ACCESS_NON_COHERENT? Please tell me what is preferable and we'll go
> ahead with it.

Um. ping? We are at -rc5 and things need to fall into place if we are to
have it in 3.10 ...

-- MST
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Sun, Mar 24, 2013 at 05:51:53PM +0200, Michael S. Tsirkin wrote:

At the moment registering an MR breaks COW. This breaks memory overcommit
for users such as KVM: we have a lot of COW pages, e.g. instances of the
zero page or pages shared using KSM.

If the application does not care that the adapter sees stale data (for
example, it tracks writes, reregisters and resends), it can use a new
IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The
semantics are similar to that of SPLICE_F_GIFT, thus the name.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

Roland, Michael is yet to test this but could you please confirm whether
this looks acceptable to you?

---

Please review and consider for 3.10.

Changes from v1: rename APP_READONLY to _GIFT: similar to vmsplice's F_GIFT.

 drivers/infiniband/core/umem.c | 21 +++++++++++----------
 include/rdma/ib_verbs.h        |  9 ++++++++-
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..5dee86d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -89,6 +89,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	int ret;
 	int off;
 	int i;
+	bool gift, writable;
 	DEFINE_DMA_ATTRS(attrs);
 
 	if (dmasync)
@@ -96,6 +97,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	if (!can_do_mlock())
 		return ERR_PTR(-EPERM);
 
+	/*
+	 * We ask for writable memory if any access flags other than
+	 * "remote read" or "gift" are set.  "Local write" and "remote write"
+	 * obviously require write access.  "Remote atomic" can do
+	 * things like fetch and add, which will modify memory, and
+	 * "MW bind" can change permissions by binding a window.
+	 */
+	gift = access & IB_ACCESS_GIFT;
+	writable = access & ~(IB_ACCESS_REMOTE_READ | IB_ACCESS_GIFT);
 
 	umem = kmalloc(sizeof *umem, GFP_KERNEL);
 	if (!umem)
@@ -105,14 +115,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->length    = size;
 	umem->offset    = addr & ~PAGE_MASK;
 	umem->page_size = PAGE_SIZE;
-	/*
-	 * We ask for writable memory if any access flags other than
-	 * "remote read" are set.  "Local write" and "remote write"
-	 * obviously require write access.  "Remote atomic" can do
-	 * things like fetch and add, which will modify memory, and
-	 * "MW bind" can change permissions by binding a window.
-	 */
-	umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+	umem->writable  = writable;
 
 	/* We assume the memory is from hugetlb until proved otherwise */
 	umem->hugetlb   = 1;
@@ -152,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !umem->writable, page_list, vma_list);
+				     !gift, !umem->writable, page_list, vma_list);
 
 		if (ret < 0)
 			goto out;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..2e6e13c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -871,7 +871,14 @@ enum ib_access_flags {
 	IB_ACCESS_REMOTE_READ	= (1<<2),
 	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
 	IB_ACCESS_MW_BIND	= (1<<4),
-	IB_ZERO_BASED		= (1<<5)
+	IB_ZERO_BASED		= (1<<5),
+	/*
+	 * IB_ACCESS_GIFT: This memory is a "gift" to the adapter. If memory is
+	 * modified after registration, the local version and data seen by the
+	 * adapter through this region rkey may differ.
+	 * Only legal with IB_ACCESS_REMOTE_READ or no permissions.
+	 */
+	IB_ACCESS_GIFT		= (1<<6)
 };
 
 struct ib_phys_buf {

-- MST
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Tue, Apr 02, 2013 at 09:57:38AM -0700, Roland Dreier wrote:
> On Tue, Apr 2, 2013 at 8:51 AM, Michael S. Tsirkin m...@redhat.com wrote:
> > At the moment registering an MR breaks COW. This breaks memory
> > overcommit for users such as KVM: we have a lot of COW pages, e.g.
> > instances of the zero page or pages shared using KSM.
> >
> > If the application does not care that the adapter sees stale data (for
> > example, it tracks writes, reregisters and resends), it can use a new
> > IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The
> > semantics are similar to that of SPLICE_F_GIFT, thus the name.
> >
> > Signed-off-by: Michael S. Tsirkin m...@redhat.com
> >
> > Roland, Michael is yet to test this but could you please confirm whether
> > this looks acceptable to you?
>
> The patch itself is reasonable I guess, given the needs of this particular
> app.
>
> I'm not particularly happy with the name of the flag. The analogy with
> SPLICE_F_GIFT doesn't seem particularly strong, and I'm not convinced even
> the splice flag name is very understandable. But in the RDMA case there's
> not really any sense in which we're "gifting" memory to the adapter --
> we're just telling the library "please don't trigger copy-on-write", and
> it doesn't seem particularly easy for users to understand that from the
> flag name.
>
> - R.

The point really is that any writes by the application won't be seen until
re-registration, right? OK, what's a better name? IBV_ACCESS_NON_COHERENT?
Please tell me what is preferable and we'll go ahead with it.

-- MST
[PATCHv2] ibverbs: add a new IBV_ACCESS_GIFT option
At the moment registering an MR breaks COW.

If the application does not care that the adapter sees stale data (for
example, it tracks writes, reregisters and resends), it can use a new
IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The
semantics are similar to that of SPLICE_F_GIFT, thus the name.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This is compiled but untested. Michael, could you please try this patch
(together with the kernel patch I'm sending separately) and report whether
setting this flag unbreaks overcommit for you?

 include/infiniband/verbs.h | 3 ++-
 man/ibv_reg_mr.3           | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 6acfc81..3290ec9 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -290,7 +290,8 @@ enum ibv_access_flags {
 	IBV_ACCESS_REMOTE_WRITE	= (1<<1),
 	IBV_ACCESS_REMOTE_READ	= (1<<2),
 	IBV_ACCESS_REMOTE_ATOMIC	= (1<<3),
-	IBV_ACCESS_MW_BIND	= (1<<4)
+	IBV_ACCESS_MW_BIND	= (1<<4),
+	IBV_ACCESS_GIFT		= (1<<6)
 };
 
 struct ibv_pd {
diff --git a/man/ibv_reg_mr.3 b/man/ibv_reg_mr.3
index 7723771..3c302f0 100644
--- a/man/ibv_reg_mr.3
+++ b/man/ibv_reg_mr.3
@@ -34,6 +34,8 @@ describes the desired memory protection attributes; it is either 0 or the bitwis
 .B IBV_ACCESS_REMOTE_ATOMIC\fR Enable Remote Atomic Operation Access (if supported)
 .TP
 .B IBV_ACCESS_MW_BIND\fR Enable Memory Window Binding
+.TP
+.B IBV_ACCESS_GIFT\fR This memory is a "gift" to the adapter: if memory is modified after registration, the local version and data seen by the adapter through this region rkey may differ
 .PP
 If
 .B IBV_ACCESS_REMOTE_WRITE
@@ -43,6 +45,9 @@ is set, then
 .B IBV_ACCESS_LOCAL_WRITE
 must be set too.
 .PP
+.B IBV_ACCESS_GIFT
+is only legal with remote or local read access.
+.PP
 Local read access is always enabled for the MR.
 .PP
 .B ibv_dereg_mr()

-- MST
[PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
At the moment registering an MR breaks COW. This breaks memory overcommit
for users such as KVM: we have a lot of COW pages, e.g. instances of the
zero page or pages shared using KSM.

If the application does not care that the adapter sees stale data (for
example, it tracks writes, reregisters and resends), it can use a new
IBV_ACCESS_GIFT flag to prevent registration from breaking COW. The
semantics are similar to that of SPLICE_F_GIFT, thus the name.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Please review and consider for 3.10.

Changes from v1: rename APP_READONLY to _GIFT: similar to vmsplice's F_GIFT.

 drivers/infiniband/core/umem.c | 21 +++++++++++----------
 include/rdma/ib_verbs.h        |  9 ++++++++-
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..5dee86d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -89,6 +89,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	int ret;
 	int off;
 	int i;
+	bool gift, writable;
 	DEFINE_DMA_ATTRS(attrs);
 
 	if (dmasync)
@@ -96,6 +97,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	if (!can_do_mlock())
 		return ERR_PTR(-EPERM);
 
+	/*
+	 * We ask for writable memory if any access flags other than
+	 * "remote read" or "gift" are set.  "Local write" and "remote write"
+	 * obviously require write access.  "Remote atomic" can do
+	 * things like fetch and add, which will modify memory, and
+	 * "MW bind" can change permissions by binding a window.
+	 */
+	gift = access & IB_ACCESS_GIFT;
+	writable = access & ~(IB_ACCESS_REMOTE_READ | IB_ACCESS_GIFT);
 
 	umem = kmalloc(sizeof *umem, GFP_KERNEL);
 	if (!umem)
@@ -105,14 +115,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->length    = size;
 	umem->offset    = addr & ~PAGE_MASK;
 	umem->page_size = PAGE_SIZE;
-	/*
-	 * We ask for writable memory if any access flags other than
-	 * "remote read" are set.  "Local write" and "remote write"
-	 * obviously require write access.  "Remote atomic" can do
-	 * things like fetch and add, which will modify memory, and
-	 * "MW bind" can change permissions by binding a window.
-	 */
-	umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+	umem->writable  = writable;
 
 	/* We assume the memory is from hugetlb until proved otherwise */
 	umem->hugetlb   = 1;
@@ -152,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !umem->writable, page_list, vma_list);
+				     !gift, !umem->writable, page_list, vma_list);
 
 		if (ret < 0)
 			goto out;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..2e6e13c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -871,7 +871,14 @@ enum ib_access_flags {
 	IB_ACCESS_REMOTE_READ	= (1<<2),
 	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
 	IB_ACCESS_MW_BIND	= (1<<4),
-	IB_ZERO_BASED		= (1<<5)
+	IB_ZERO_BASED		= (1<<5),
+	/*
+	 * IB_ACCESS_GIFT: This memory is a "gift" to the adapter. If memory is
+	 * modified after registration, the local version and data seen by the
+	 * adapter through this region rkey may differ.
+	 * Only legal with IB_ACCESS_REMOTE_READ or no permissions.
+	 */
+	IB_ACCESS_GIFT		= (1<<6)
 };
 
 struct ib_phys_buf {

-- MST
Re: [PATCH] rdma: don't make pages writeable if not requested
On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote: On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin m...@redhat.com wrote: core/umem.c seems to get the arguments to get_user_pages in the reverse order: it sets writeable flag and breaks COW for MAP_SHARED if and only if hardware needs to write the page. This breaks memory overcommit for users such as KVM: each time we try to register a page to send it to remote, this breaks COW. It seems that for applications that only have REMOTE_READ permission, there is no reason to break COW at all. I proposed a similar (but not exactly the same, see below) patch a while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread, especially https://lkml.org/lkml/2012/2/6/265 I think this change will break the case where userspace tries to register an MR with read-only permission, but intends locally through the CPU to write to the memory. If the memory registration is done while the memory is mapped read-only but has VM_MAYWRITE, then userspace gets into trouble when COW happens. In the case you're describing (although I'm not sure where in KVM we're talking about using RDMA), what happens if you register memory with only REMOTE_READ and then COW is triggered because of a local write? (I'm assuming you don't want remote access to continue to get the old contents of the page) I read that, and the above. It looks like everyone was doing tricks like register page, then modify it, then let remote read it and for some reason assumed it's ok to write into page locally from CPU even if LOCAL_WRITE is not set. I don't see why don't applications set LOCAL_WRITE if they are going to write to memory locally, but assuming they don't, we can't just break them. So what we need is a new no I really do not intend to write into this memory flag that avoids doing tricks in the kernel and treats the page normally, just pins it so hardware can read it. 
I have to confess that I still haven't had a chance to implement the proposed FOLL_FOLLOW solution to all of this.

See a much easier to implement proposal at the bottom.

If the page that is COW has lots of copies, this makes the user process quickly exceed the cgroups memory limit. This makes RDMA mostly useless for virtualization, thus the stable tag.

The actual problem description here is a bit too terse for me to understand. How do we end up with lots of copies of a COW page?

Reading the links above, rdma breaks COW intentionally. Imagine a page with lots of instances in the process page map. For example a zero page, but not only that: we rely on KSM heavily to deduplicate pages for multiple VMs. There are gigabytes of these in each of the multiple VMs running on a host. What we are using RDMA for is VM migration, so we are careful not to change this memory: when we do allow memory to change, we are careful to track what was changed, reregister and resend the data. But at the moment, each time we register a virtual address referencing this page, infiniband assumes we might want to change the page, so it does get_user_pages with the writeable flag, forcing a copy. The amount of used RAM explodes.

Why is RDMA registering the memory any more special than having everyone who maps this page actually writing to it and triggering COW?

		ret = get_user_pages(current, current->mm, cur_base,
				     min_t(unsigned long, npages,
					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !umem->writable, page_list, vma_list);
+				     !umem->writable, 1, page_list, vma_list);

The first two parameters in this line being changed are write and force. I think if we do change this, then we need to pass umem->writable (as opposed to !umem->writable) for the write parameter.

Ugh. Sure enough. Let's agree on the direction before I respin the patch though. Not sure whether force makes sense or not.

 - R.

If you don't force write on read-only mappings you don't, but it seems harmless for read-only gup.
Still, no need to change what's not broken. Please comment on the below (completely untested, and needs a userspace patch too, but just to give you the idea).

---
rdma: add IB_ACCESS_APP_READONLY

At the moment any attempt to register memory for RDMA breaks COW, which hurts hosts overcommitted for memory. But if the application knows it won't write into the MR after registration, we can save (sometimes a lot of) memory by telling the kernel not to bother breaking COW for us. If the application does change memory registered with this flag, it can re-register afterwards, and resend the data on the wire.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 5929598..635b57a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core
Re: [PATCH] rdma: don't make pages writeable if not requested
On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:

On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin m...@redhat.com wrote:

In that case, no, I don't see any reason for LOCAL_WRITE, since the only RDMA operations that will access this memory are remote reads.

What is the meaning of LOCAL_WRITE then? There are no local RDMA writes as far as I can see.

Umm, it means you're giving the local adapter permission to write to that memory. So you can use it as a receive buffer or as the target for remote data from an RDMA read operation.

Well, RDMA read has its own flag, IB_ACCESS_REMOTE_READ. I don't see why you need to give the adapter permission to use memory as a receive buffer, given that you must pass the address in the post receive verb; but maybe that's what the IB spec says, so that's what the applications assumed? OK then, what we need is a new flag saying "I really do not intend to write into this memory, please do not break COW or do anything else just in case I do."

Isn't that a shared read-only mapping?

 - R.

Nothing to do with how the page is mapped. We can and do write the page before registration. BTW, umem.c passes in force, so it breaks COW even for read-only mappings, right?

--
MST
Re: [PATCH] rdma: don't make pages writeable if not requested
On Thu, Mar 21, 2013 at 11:32:30AM +0200, Michael S. Tsirkin wrote:

[full quote of the preceding message snipped]
Still, no need to change what's not broken. Please comment on the below (completely untested, and needs a userspace patch too, but just to give you the idea).

---
rdma: add IB_ACCESS_APP_READONLY

Or we can call it IB_ACCESS_GIFT - this is a bit like SPLICE_F_GIFT semantics.

At the moment any attempt to register memory for RDMA breaks COW, which hurts hosts overcommitted for memory. But if the application knows it won't write into the MR after registration, we can save (sometimes a lot of) memory by telling the kernel not to bother breaking COW for us. If the application does change memory registered with this flag, it can re-register afterwards
Re: [PATCH] rdma: don't make pages writeable if not requested
On Thu, Mar 21, 2013 at 11:57:32AM -0600, Jason Gunthorpe wrote:

On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote:

It doesn't actually, and our app would sometimes write to these pages. It simply does not care which version the remote gets in this case, since we track writes and resend later.

Heh, somehow I thought you might say that :) A new flag seems like the only way then - maybe:

IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share a coherent view of registered memory. Memory writes from the CPU after ibv_reg_mr completes may not be reflected in the memory viewed by the adaptor. Can only be combined with read-only access permissions.

Jason

I kind of like _GIFT for the name, gifts are nice :) But yes, that's exactly the semantics we need.

--
MST
Re: [PATCH] rdma: don't make pages writeable if not requested
On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote:

On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:

On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:

On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin m...@redhat.com wrote:

In that case, no, I don't see any reason for LOCAL_WRITE, since the only RDMA operations that will access this memory are remote reads.

What is the meaning of LOCAL_WRITE then? There are no local RDMA writes as far as I can see.

Umm, it means you're giving the local adapter permission to write to that memory. So you can use it as a receive buffer or as the target for remote data from an RDMA read operation.

Well, RDMA read has its own flag, IB_ACCESS_REMOTE_READ. I don't see why you need to give the adapter permission

The access flags have to do with what independent access remote nodes get. There are four major cases:

- access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes read the memory.

- access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under the direct control of the application, can read this memory. Remote nodes are barred.

- access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor will let remote nodes write the memory.

- access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that memory. Only the adaptor, under the direct control of the application, can write the memory.

This is the one I find redundant. Since the write will be done by the adaptor under direct control of the application, why does it make sense to declare this beforehand? If you don't want to allow local write access to memory, just do not post any receive WRs with this address. If you posted and regret it, reset the QP to cancel. But the IB spec specified LOCAL_WRITE in this redundant way, so I guess applications expect it to have the semantics defined there; I just didn't remember what they are. No way around it then, need another flag.
Re: [PATCH] rdma: don't make pages writeable if not requested
On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote:

On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:

This is the one I find redundant. Since the write will be done by the adaptor under direct control of the application, why does it make sense to declare this beforehand? If you don't want to allow local write access to memory, just do not post any receive WRs with this address. If you posted and regret it, reset the QP to cancel.

This is to support your COW scenario - the app declares beforehand to the kernel that it will write to the memory, and the kernel ensures pages are dedicated to the app at registration time. Or the app says it will only read, and the kernel could leave them shared.

Someone here is confused. LOCAL_WRITE, or the absence of it, does not address COW; it breaks COW anyway. Are you now saying we should change rdma so that without LOCAL_WRITE it will not break COW?

The adaptor enforces the access control to prevent a naughty app from writing to shared memory - think about mmap'ing libc.so and then using RDMA to write to the shared pages. It is necessary to ensure that is impossible.

Jason

That's why it's redundant: we can't trust an application to tell us "this page is writeable", we must get this info from the kernel. And so there's apparently no need for the application to tell the adaptor about LOCAL_WRITE.

--
MST
Re: Mellanox target workaround in SRP
On Mon, Jan 10, 2011 at 10:51:13AM -0800, Roland Dreier wrote:

Maybe we can use MST's current email to ask him... Michael, do you have any memory of the issue we worked around here?

I have a question regarding the workaround introduced in commit 559ce8f1 of the mainline tree:

    IB/srp: Work around data corruption bug on Mellanox targets

    Data corruption has been seen with Mellanox SRP targets when FMRs create a memory region with I/O virtual address != 0. Add a workaround that disables FMR merging for Mellanox targets (OUI 0002c9).

I don't see how this can make a difference to the target - it sees an address and length, and there should be no visible difference to it when it gets an FMR versus a direct-mapped region of the same space, right? And how is it different from getting a direct or indirect descriptor with a similar offset? I could see there being a bug on the initiator HCA not liking such FMR mappings, but then it should be keyed off of the vendor of our HCA and not the target. I'm sure this was tested and shown to fix the problem; I'm just confused as to what the problem really was and whether this is still relevant. Can someone please enlighten me?

I don't recall, unfortunately. Sorry.

--
MST