Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity

2015-07-16 Thread Michael S. Tsirkin
On Wed, Jul 15, 2015 at 02:48:00PM -0400, Matthew Wilcox wrote:
 On Wed, Jul 15, 2015 at 11:25:55AM -0600, Jens Axboe wrote:
  On 07/15/2015 11:19 AM, Keith Busch wrote:
  On Wed, 15 Jul 2015, Bart Van Assche wrote:
  * With blk-mq and scsi-mq optimal performance can only be achieved if
   the relationship between MSI-X vector and NUMA node does not change
   over time. This is necessary to allow a blk-mq/scsi-mq driver to
   ensure that interrupts are processed on the same NUMA node as the
   node on which the data structures for a communication channel have
   been allocated. However, today there is no API that allows
   blk-mq/scsi-mq drivers and irqbalanced to exchange information
   about the relationship between MSI-X vector ranges and NUMA nodes.
  
  We could have low-level drivers provide blk-mq the controller's irq
  associated with a particular h/w context, and the block layer can provide
  the context's cpumask to irqbalance with the smp affinity hint.
  
  The nvme driver already uses the hwctx cpumask to set hints, but this
  doesn't seem like it should be a driver responsibility. It currently
  doesn't work correctly anyway with hot-cpu, since blk-mq could rebalance
  the h/w contexts without syncing with the low-level driver.
  
  If we can add this to blk-mq, one additional case to consider is if the
  same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
  assignment needs to be aware of this to prevent sharing a vector across
  NUMA nodes.
  
  Exactly. I may have promised to do just that at the last LSF/MM conference,
  just haven't done it yet. The point is to share the mask; I'd ideally like
  to take it all the way, where the driver just asks for a number of vectors
  through a nice API that takes care of all this. There's a lot of duplicated
  code in drivers for this these days, and it's a mess.
 
 Yes.  I think the fundamental problem is that our MSI-X API is so funky.
 We have this incredibly flexible scheme where each MSI-X vector could
 have its own interrupt handler, but that's not what drivers want.
 They want to say "Give me eight MSI-X vectors spread across the CPUs,
 and use this interrupt handler for all of them."  That is, instead of
 the current scheme where each MSI-X vector gets its own Linux interrupt,
 we should have one interrupt handler (of the per-cpu interrupt type),
 which shows up with N bits set in its CPU mask.

It would definitely be nice to have a way to express that.  But it's
also pretty common for drivers to have e.g. RX and TX use separate
vectors, and these need separate handlers.
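The "give me N vectors spread across the CPUs" allocation discussed above can be sketched outside the kernel. The helper below is an illustration only, not a proposed kernel interface (the name `vec_first_cpu` is made up for this example): it carves the CPU space into contiguous per-vector chunks, which is the kind of bookkeeping such an allocator would do before setting affinity hints.

```c
/*
 * Illustrative sketch (NOT a kernel API): spread "nvec" interrupt
 * vectors evenly across "ncpus" CPUs.  Each vector owns a contiguous
 * chunk of CPUs; the remainder is distributed to the first vectors.
 * Returns the first CPU assigned to vector "vec" (0 <= vec < nvec).
 */
static int vec_first_cpu(int nvec, int ncpus, int vec)
{
	int base  = ncpus / nvec;   /* CPUs every vector gets */
	int extra = ncpus % nvec;   /* first "extra" vectors get one more */

	if (vec < extra)
		return vec * (base + 1);
	return extra * (base + 1) + (vec - extra) * base;
}
```

With 8 CPUs and 4 vectors this yields chunks starting at CPUs 0, 2, 4 and 6; with 3 vectors the remainder goes to the first two vectors (chunks of 3, 3 and 2 CPUs).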

 ___
 Ksummit-discuss mailing list
 ksummit-disc...@lists.linuxfoundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity

2015-07-15 Thread Michael S. Tsirkin
On Wed, Jul 15, 2015 at 05:07:08AM -0700, Christoph Hellwig wrote:
 Many years ago we decided to move setting of IRQ-to-core affinities to
 userspace with the irqbalance daemon.
 
 These days we have systems with lots of MSI-X vectors, and we have
 hardware and subsystem support for per-CPU I/O queues in the block
 layer, the RDMA subsystem and probably the network stack (I'm not too
 familiar with the recent developments there).  It would really help the
 out-of-the-box performance and experience if we could allow such
 subsystems to bind interrupt vectors to the node that the queue is
 configured on.

I think you are right, it's true for networking.

Whenever someone tries to benchmark networking, the first thing done is
always to disable irqbalance and pin IRQs manually away from
wherever the benchmark is running, but on the same NUMA node.

Without that, interrupts keep the benchmark from making progress.

Alternatively, people give up on interrupts completely and
start polling hardware aggressively. Nice for a benchmark,
not nice for the environment.

 
 I'd like to discuss whether the rationale for moving the IRQ affinity
 setting fully to userspace is still correct in today's world, and any
 pitfalls we'll have to learn from in irqbalanced and the old in-kernel
 affinity code.

IMHO there could be a benefit from better integration with the scheduler.
Maybe an interrupt handler can be viewed as a kind of thread,
so the scheduler can make decisions about where to run it next?



Re: decent performance drop for SCSI LLD / SAN initiator when iommu is turned on

2013-05-07 Thread Michael S. Tsirkin
On Mon, May 06, 2013 at 03:35:58PM -0700, Alexander Duyck wrote:
 On 05/06/2013 02:39 PM, Or Gerlitz wrote:
  On Thu, May 2, 2013 at 4:56 AM, Michael S. Tsirkin m...@redhat.com wrote:
  On Thu, May 02, 2013 at 02:11:15AM +0300, Or Gerlitz wrote:
  So we've noted that when configuring the kernel and booting with the Intel
  IOMMU set to "on" on a physical node (non-VM, and without enabling SR-IOV
  by the HW device driver), raw performance of the iSER (iSCSI over RDMA) SAN
  initiator is reduced notably. E.g. in the testbed we looked at today we
  had ~260K 1KB random IOPS and 5.5GB/s BW for 128KB IOs with the IOMMU
  turned off for a single LUN, and ~150K IOPS and 4GB/s BW with the IOMMU
  turned on. No change on the target node between runs.
  That's why we have iommu=pt.
  See definition of iommu_pass_through in arch/x86/kernel/pci-dma.c.
 
 
  Hi Michael (hope you feel better),
 
  We did some runs with the pt approach you suggested and still didn't
  get the promised gain -- in parallel we came across the 2012 commit
  f800326dc ("ixgbe: Replace standard receive path with a page based
  receive") where they say "[...] we are able to see a considerable
  performance gain when an IOMMU is enabled because we are no longer
  unmapping every buffer on receive [...] instead we can simply call
  sync_single_range [...]". Looking at the commit you can see that they
  allocate a page/skb, dma_map it initially, and later in the life cycle
  of that buffer use dma_sync_for_device/cpu and avoid dma_map/unmap on
  the fast path.
 
  Well, a few questions on which I'd love to hear people's opinion -- first,
  this approach seems cool for the network device RX path, but what about the
  TX path? Any idea how to avoid dma_map for it? Or why doesn't calling
  dma_map/unmap for every buffer on the TX path involve a notable perf
  hit? Second, I don't see how to apply the method to block devices, since
  these devices don't allocate buffers, but rather get a scatter-gather
  list of pages from upper layers, issue dma_map_sg on them, submit
  the IO, and later when done call dma_unmap_sg.
 
  Or.
 
 The Tx path ends up taking a performance hit if IOMMU is enabled.  It
 just isn't as severe due to things like TSO.
 
 One way to work around the performance penalty is to allocate bounce
 buffers and just leave them static mapped.  Then you can simply memcpy
 the data to the buffers and avoid the locking overhead of
 allocating/freeing IOMMU resources.  It consumes more memory but works
 around the IOMMU limitations.
 
 Thanks,
 
 Alex
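A userspace sketch of the static bounce-buffer workaround Alex describes. All names here are hypothetical, and plain malloc() stands in for buffers a real driver would dma_map once at init time and never unmap on the fast path; the point is that the per-packet cost becomes a memcpy instead of an IOMMU map/unmap:

```c
#include <stdlib.h>
#include <string.h>

#define BOUNCE_BUFS  8
#define BOUNCE_SIZE  2048

/* Pool of statically "mapped" bounce buffers, recycled round-robin. */
struct bounce_pool {
	char *buf[BOUNCE_BUFS];
	int   next;
};

/* Allocate the pool once up front (stand-in for dma_alloc + dma_map). */
static int bounce_pool_init(struct bounce_pool *p)
{
	int i;

	for (i = 0; i < BOUNCE_BUFS; i++) {
		p->buf[i] = malloc(BOUNCE_SIZE);
		if (!p->buf[i])
			return -1;
	}
	p->next = 0;
	return 0;
}

/*
 * Fast path: copy the payload into the next bounce buffer and hand it
 * to the "hardware".  The memcpy replaces per-packet dma_map/unmap and
 * the IOMMU resource allocation/locking that goes with it.
 */
static char *bounce_tx(struct bounce_pool *p, const void *data, size_t len)
{
	char *b = p->buf[p->next];

	if (len > BOUNCE_SIZE)
		return NULL;	/* payload too big for a bounce buffer */
	memcpy(b, data, len);
	p->next = (p->next + 1) % BOUNCE_BUFS;
	return b;
}
```

The trade-off is exactly as stated in the thread: more memory held permanently (the pool) in exchange for keeping IOMMU operations off the per-packet path.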

But why isn't iommu=pt effective?
AFAIK the whole point of it was to give up on security
for host-controlled devices, but still get a
measure of security for assigned devices.

-- 
MST
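For reference, the passthrough mode referred to above is selected on the kernel command line at boot. A typical configuration (file path and regeneration command vary by distro; this is an assumed example, not from the thread) looks like:

```
# /etc/default/grub: keep DMA remapping available for assigned devices,
# but identity-map ("pass through") host-controlled devices:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# then regenerate the grub config and reboot, e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```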


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-10 Thread Michael S. Tsirkin
On Wed, Apr 10, 2013 at 12:32:31AM -0400, Michael R. Hines wrote:
 On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote:
 Which mechanism do you refer to? Your patches still seem to pin
 each page in guest memory at some point, which will break all COW.
 In particular any pagemap tricks to detect duplicates on the source
 that I suggested won't work.
 
 Sorry, I misspoke. I'm referring to dynamic server page registration.
 
 Of course it does not eliminate pinning - but it does mitigate the
 footprint of the VM, a feature that was requested.
 
 I have implemented it and documented it.
 
 - Michael

Okay, but GIFT is supposed to be used on send side: it's only allowed
with local/remote read access, and serves to reduce memory usage
on send side.
For example, disable zero page detection and look at memory usage
on send side before and after migration.
Dynamic registration on the receive side is nice but seems
completely unrelated ...



Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-10 Thread Michael S. Tsirkin
On Wed, Apr 10, 2013 at 11:48:24AM -0400, Michael R. Hines wrote:
 
 There's a very nice, simple client/server RDMA application on the
 internet you can use to test your patch.
 
 http://thegeekinthecorner.wordpress.com/2010/09/28/rdma-read-and-write-with-ib-verbs/
 
 This guy provides the source code which dumps several gigabytes over RDMA
 to the other side.
 
 There's no need to run QEMU to test your patch,
 assuming you have access to infiniband hardware.
 
 - Michael


Does this app have any COW pages?

 On 04/10/2013 01:32 AM, Michael S. Tsirkin wrote:
 
 On Wed, Apr 10, 2013 at 12:32:31AM -0400, Michael R. Hines wrote:
 
 On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote:
 
 Which mechanism do you refer to? Your patches still seem to pin
 each page in guest memory at some point, which will break all COW.
 In particular any pagemap tricks to detect duplicates on the source
 that I suggested won't work.
 
 Sorry, I misspoke. I'm referring to dynamic server page registration.
 
 Of course it does not eliminate pinning - but it does mitigate the
 footprint of the VM, a feature that was requested.
 
 I have implemented it and documented it.
 
 - Michael
 
 Okay, but GIFT is supposed to be used on send side: it's only allowed
 with local/remote read access, and serves to reduce memory usage
 on send side.
 For example, disable zero page detection and look at memory usage
 on send side before and after migration.
 Dynamic registration on the receive side is nice but seems
 completely unrelated ...
 
 
 


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
 To be more specific, here's what I did:
 
 1. Apply the kernel module patch and re-insert the module
 2. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
 3. Start the RDMA migration
 4. Migration completes without any errors
 
 This test does *not* work with a cgroup swap limit, however. The
 process gets killed. (Both with and without GIFT)
 
 - Michael

Try to attach a debugger and see where it is when it gets killed?

 On 04/05/2013 04:43 PM, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
 I also removed the IBV_*_WRITE flags on the sender-side and activated
 cgroups with the memory.memsw.limit_in_bytes activated and the migration
 with RDMA also succeeded without any problems (both with *and* without GIFT
 also worked).
 Not sure I'm interpreting this correctly.  Are you saying that things
 worked without actually setting the GIFT flag?   In which case why are
 we adding this flag?
 
   - R.
 
 


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
presumably is_dup_page reads the page, so should not break COW ...

I'm not sure about the cgroups swap limit - you might have
too many non COW pages so attempting to fault them all in
makes you exceed the limit. You really should look at
what is going on in the pagemap, to see if there's
measureable gain from the patch.
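The pagemap check suggested here can be done with a few lines of userspace C. The sketch below is Linux-specific and the function name is made up for illustration: it reads the 64-bit /proc/self/pagemap entry for a virtual address and returns its "page present" bit, which is enough to tell a faulted-in page from an untouched (or still-COW-shared) one. Note that on newer kernels the PFN bits are zeroed for unprivileged readers, but the present bit remains visible.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Return 1 if the page backing "addr" is present in this process,
 * 0 if not, -1 on error.  /proc/self/pagemap holds one 64-bit entry
 * per virtual page; bit 63 is the "page present" flag.
 */
static int page_present(const void *addr)
{
	long psize = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	FILE *f = fopen("/proc/self/pagemap", "rb");
	int ret = -1;

	if (!f)
		return -1;
	/* seek to the 8-byte entry for this virtual page */
	if (fseek(f, (long)((uintptr_t)addr / psize) * 8, SEEK_SET) == 0 &&
	    fread(&entry, sizeof(entry), 1, f) == 1)
		ret = (int)((entry >> 63) & 1);
	fclose(f);
	return ret;
}
```

Comparing such entries before and after MR registration would show directly whether registration faulted in (and un-shared) pages that were previously COW.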


On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
 Well, I have the is_dup_page() commented out...when RDMA is
 activated.
 
 Is there something else in QEMU that could be touching the page that
 I don't know about?
 
 - Michael
 
 
 On 04/05/2013 05:03 PM, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
  Sorry, I was wrong. Ignore the comments about cgroups. That's still broken
  (i.e. trying to register RDMA memory while using a cgroup swap limit causes
  the process to get killed).
  
  But the GIFT flag patch works (my understanding is that the GIFT flag allows the
  adapter to transmit stale memory information; it does not have anything to
  do with cgroups specifically).
 The point of the GIFT patch is to avoid triggering copy-on-write so
 that memory doesn't blow up during migration.  If that doesn't work
 then there's no point to the patch.
 
   - R.
 


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 01:43:49PM -0700, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
  I also removed the IBV_*_WRITE flags on the sender-side and activated
  cgroups with the memory.memsw.limit_in_bytes activated and the migration
  with RDMA also succeeded without any problems (both with *and* without GIFT
  also worked).
 
 Not sure I'm interpreting this correctly.  Are you saying that things
 worked without actually setting the GIFT flag?   In which case why are
 we adding this flag?
 
  - R.

We are adding the flag to reduce memory usage when there are lots of COW pages.
There's no guarantee there will be COW pages, so I expect things to work
both with and without breaking COW, just using much more memory when we
break COW.

-- 
MST


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 04:17:36PM -0400, Michael R. Hines wrote:
 The userland part of the patch was missing (IBV_ACCESS_GIFT).
 
 I added that flag to /usr/include in addition to this patch and did
 a test RDMA migration and it seems to work without any problems.
 
 I also removed the IBV_*_WRITE flags on the sender-side and
 activated cgroups with the memory.memsw.limit_in_bytes activated
 and the migration with RDMA also succeeded without any problems
 (both with *and* without GIFT also worked).
 
 Any additional tests you would like?
 
 
 - Michael

RDMA can't really work with swap, so I'm not sure how that's relevant.

Please check memory.usage_in_bytes - is it lower with
the GIFT flag?  I think this is what we really care about.

-- 
MST


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin

On Tue, Apr 09, 2013 at 09:26:59PM -0400, Michael R. Hines wrote:
 With respect, I'm going to offload testing this patch back to the author =)
 because I'm trying to address all of Paolo's other minor issues
 with the RDMA patch before we can merge.

Fair enough, this likely means it won't happen anytime soon though.

 Since dynamic page registration (as you requested) is now fully
 implemented, this patch is less urgent since we now have a
 mechanism in place to avoid page pinning on both sides of the migration.
 
 - Michael
 

Which mechanism do you refer to? Your patches still seem to pin
each page in guest memory at some point, which will break all
COW. In particular any pagemap tricks to detect duplicates
on the source that I suggested won't work.

 On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
 presumably is_dup_page reads the page, so should not break COW ...
 
  I'm not sure about the cgroups swap limit - you might have
  too many non-COW pages, so attempting to fault them all in
  makes you exceed the limit. You really should look at
  what is going on in the pagemap, to see if there's a
  measurable gain from the patch.
 
 
 On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
 Well, I have the is_dup_page() commented out...when RDMA is
 activated.
 
 Is there something else in QEMU that could be touching the page that
 I don't know about?
 
 - Michael
 
 
 On 04/05/2013 05:03 PM, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
  Sorry, I was wrong. Ignore the comments about cgroups. That's still broken
  (i.e. trying to register RDMA memory while using a cgroup swap limit causes
  the process to get killed).
  
  But the GIFT flag patch works (my understanding is that the GIFT flag allows the
  adapter to transmit stale memory information; it does not have anything to
  do with cgroups specifically).
 The point of the GIFT patch is to avoid triggering copy-on-write so
 that memory doesn't blow up during migration.  If that doesn't work
 then there's no point to the patch.
 
   - R.
 


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-03 Thread Michael S. Tsirkin
On Tue, Apr 02, 2013 at 08:05:21PM +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 02, 2013 at 09:57:38AM -0700, Roland Dreier wrote:
  On Tue, Apr 2, 2013 at 8:51 AM, Michael S. Tsirkin m...@redhat.com wrote:
   At the moment registering an MR breaks COW.  This breaks memory
   overcommit for users such as KVM: we have a lot of COW pages, e.g.
   instances of the zero page or pages shared using KSM.
  
   If the application does not care that the adapter sees stale data (for
   example, it tracks writes, re-registers and resends), it can use a new
   IBV_ACCESS_GIFT flag to prevent registration from breaking COW.
  
   The semantics are similar to those of SPLICE_F_GIFT, thus the name.
  
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
   Roland, Michael is yet to test this but could you please
   confirm whether this looks acceptable to you?
  
  The patch itself is reasonable I guess, given the needs of this particular app.
  
  I'm not particularly happy with the name of the flag.  The analogy
  with SPLICE_F_GIFT doesn't seem particularly strong and I'm not
  convinced even the splice flag name is very understandable.  But in
  the RDMA case there's not really any sense in which we're "gifting"
  memory to the adapter -- we're just telling the library "please don't
  trigger copy-on-write", and it doesn't seem particularly easy for users
  to understand that from the flag name.
  
   - R.
 
 The point really is that any writes by application
 won't be seen until re-registration, right?
 OK, what's a better name?  IBV_ACCESS_NON_COHERENT?
 Please tell me what is preferable and we'll go ahead with it.

Um. ping? We are at -rc5 and things need to fall into place
if we are to have it in 3.10 ...

 -- 
 MST


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-02 Thread Michael S. Tsirkin
On Sun, Mar 24, 2013 at 05:51:53PM +0200, Michael S. Tsirkin wrote:
 At the moment registering an MR breaks COW.  This breaks memory
 overcommit for users such as KVM: we have a lot of COW pages, e.g.
 instances of the zero page or pages shared using KSM.
 
  If the application does not care that the adapter sees stale data (for
  example, it tracks writes, re-registers and resends), it can use a new
  IBV_ACCESS_GIFT flag to prevent registration from breaking COW.
  
  The semantics are similar to those of SPLICE_F_GIFT, thus the name.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Roland, Michael is yet to test this but could you please
confirm whether this looks acceptable to you?

 ---
 
 Please review and consider for 3.10.
 
 Changes from v1:
   rename APP_READONLY to _GIFT: similar to vmsplice's F_GIFT.
 
  drivers/infiniband/core/umem.c | 21 ++++++++++++---------
  include/rdma/ib_verbs.h        |  9 ++++++++-
  2 files changed, 20 insertions(+), 10 deletions(-)
 
 diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
 index a841123..5dee86d 100644
 --- a/drivers/infiniband/core/umem.c
 +++ b/drivers/infiniband/core/umem.c
 @@ -89,6 +89,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
  	int ret;
  	int off;
  	int i;
 +	bool gift, writable;
  	DEFINE_DMA_ATTRS(attrs);
 
  	if (dmasync)
 @@ -96,6 +97,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 
  	if (!can_do_mlock())
  		return ERR_PTR(-EPERM);
 +	/*
 +	 * We ask for writable memory if any access flags other than
 +	 * remote read or gift are set.  Local write and remote write
 +	 * obviously require write access.  Remote atomic can do
 +	 * things like fetch and add, which will modify memory, and
 +	 * MW bind can change permissions by binding a window.
 +	 */
 +	gift     = access & IB_ACCESS_GIFT;
 +	writable = access & ~(IB_ACCESS_REMOTE_READ | IB_ACCESS_GIFT);
 
  	umem = kmalloc(sizeof *umem, GFP_KERNEL);
  	if (!umem)
 @@ -105,14 +115,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
  	umem->length    = size;
  	umem->offset    = addr & ~PAGE_MASK;
  	umem->page_size = PAGE_SIZE;
 -	/*
 -	 * We ask for writable memory if any access flags other than
 -	 * remote read are set.  Local write and remote write
 -	 * obviously require write access.  Remote atomic can do
 -	 * things like fetch and add, which will modify memory, and
 -	 * MW bind can change permissions by binding a window.
 -	 */
 -	umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
 +	umem->writable  = writable;
 
  	/* We assume the memory is from hugetlb until proved otherwise */
  	umem->hugetlb   = 1;
 @@ -152,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
  		ret = get_user_pages(current, current->mm, cur_base,
  				     min_t(unsigned long, npages,
  					   PAGE_SIZE / sizeof (struct page *)),
 -				     1, !umem->writable, page_list, vma_list);
 +				     !gift, !umem->writable, page_list, vma_list);
 
  		if (ret < 0)
  			goto out;
 diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
 index 98cc4b2..2e6e13c 100644
 --- a/include/rdma/ib_verbs.h
 +++ b/include/rdma/ib_verbs.h
 @@ -871,7 +871,14 @@ enum ib_access_flags {
  	IB_ACCESS_REMOTE_READ	= (1<<2),
  	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
  	IB_ACCESS_MW_BIND	= (1<<4),
 -	IB_ZERO_BASED		= (1<<5)
 +	IB_ZERO_BASED		= (1<<5),
 +	/*
 +	 * IB_ACCESS_GIFT: This memory is a gift to the adapter.  If memory is
 +	 * modified after registration, the local version and data seen by the
 +	 * adapter through this region rkey may differ.
 +	 * Only legal with IB_ACCESS_REMOTE_READ or no permissions.
 +	 */
 +	IB_ACCESS_GIFT		= (1<<6)
  };
 
  struct ib_phys_buf {
 -- 
 MST


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-02 Thread Michael S. Tsirkin
On Tue, Apr 02, 2013 at 09:57:38AM -0700, Roland Dreier wrote:
 On Tue, Apr 2, 2013 at 8:51 AM, Michael S. Tsirkin m...@redhat.com wrote:
  At the moment registering an MR breaks COW.  This breaks memory
  overcommit for users such as KVM: we have a lot of COW pages, e.g.
  instances of the zero page or pages shared using KSM.
 
  If the application does not care that the adapter sees stale data (for
  example, it tracks writes, re-registers and resends), it can use a new
  IBV_ACCESS_GIFT flag to prevent registration from breaking COW.
 
  The semantics are similar to those of SPLICE_F_GIFT, thus the name.
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
  Roland, Michael is yet to test this but could you please
  confirm whether this looks acceptable to you?
 
  The patch itself is reasonable I guess, given the needs of this particular app.
 
 I'm not particularly happy with the name of the flag.  The analogy
 with SPLICE_F_GIFT doesn't seem particularly strong and I'm not
 convinced even the splice flag name is very understandable.  But in
  the RDMA case there's not really any sense in which we're "gifting"
  memory to the adapter -- we're just telling the library "please don't
  trigger copy-on-write", and it doesn't seem particularly easy for users
  to understand that from the flag name.
 
  - R.

The point really is that any writes by application
won't be seen until re-registration, right?
OK, what's a better name?  IBV_ACCESS_NON_COHERENT?
Please tell me what is preferable and we'll go ahead with it.

-- 
MST


[PATCHv2] ibverbs: add a new IBV_ACCESS_GIFT option

2013-03-24 Thread Michael S. Tsirkin
At the moment registering an MR breaks COW.  If the application does not
care that the adapter sees stale data (for example, it tracks writes,
re-registers and resends), it can use a new IBV_ACCESS_GIFT flag to
prevent registration from breaking COW.

The semantics are similar to those of SPLICE_F_GIFT, thus the name.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This compiles but is untested. Michael, could you please try
this patch (together with the kernel patch I'm sending separately)
and report whether setting this flag unbreaks overcommit for you?

 include/infiniband/verbs.h | 3 ++-
 man/ibv_reg_mr.3           | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 6acfc81..3290ec9 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -290,7 +290,8 @@ enum ibv_access_flags {
 	IBV_ACCESS_REMOTE_WRITE	= (1<<1),
 	IBV_ACCESS_REMOTE_READ	= (1<<2),
 	IBV_ACCESS_REMOTE_ATOMIC	= (1<<3),
-	IBV_ACCESS_MW_BIND	= (1<<4)
+	IBV_ACCESS_MW_BIND	= (1<<4),
+	IBV_ACCESS_GIFT		= (1<<6)
 };
 
 struct ibv_pd {
diff --git a/man/ibv_reg_mr.3 b/man/ibv_reg_mr.3
index 7723771..3c302f0 100644
--- a/man/ibv_reg_mr.3
+++ b/man/ibv_reg_mr.3
@@ -34,6 +34,8 @@ describes the desired memory protection attributes; it is either 0 or the bitwis
 .B IBV_ACCESS_REMOTE_ATOMIC\fR	Enable Remote Atomic Operation Access (if supported)
 .TP
 .B IBV_ACCESS_MW_BIND\fR	Enable Memory Window Binding
+.TP
+.B IBV_ACCESS_GIFT\fR	This memory is a gift to the adapter: if memory is modified after registration, the local version and data seen by the adapter through this region rkey may differ
 .PP
 If
 .B IBV_ACCESS_REMOTE_WRITE
@@ -43,6 +45,9 @@ is set, then
 .B IBV_ACCESS_LOCAL_WRITE
 must be set too.
 .PP
+.B IBV_ACCESS_GIFT
+is only legal with remote or local read access.
+.PP
 Local read access is always enabled for the MR.
 .PP
 .B ibv_dereg_mr()
-- 
MST


[PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-03-24 Thread Michael S. Tsirkin
At the moment registering an MR breaks COW.  This breaks memory
overcommit for users such as KVM: we have a lot of COW pages, e.g.
instances of the zero page or pages shared using KSM.

If the application does not care that the adapter sees stale data (for
example, it tracks writes, re-registers and resends), it can use a new
IBV_ACCESS_GIFT flag to prevent registration from breaking COW.

The semantics are similar to those of SPLICE_F_GIFT, thus the name.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Please review and consider for 3.10.

Changes from v1:
rename APP_READONLY to _GIFT: similar to vmsplice's F_GIFT.

 drivers/infiniband/core/umem.c | 21 ++++++++++++---------
 include/rdma/ib_verbs.h        |  9 ++++++++-
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..5dee86d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -89,6 +89,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	int ret;
 	int off;
 	int i;
+	bool gift, writable;
 	DEFINE_DMA_ATTRS(attrs);
 
 	if (dmasync)
@@ -96,6 +97,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 
 	if (!can_do_mlock())
 		return ERR_PTR(-EPERM);
+	/*
+	 * We ask for writable memory if any access flags other than
+	 * remote read or gift are set.  Local write and remote write
+	 * obviously require write access.  Remote atomic can do
+	 * things like fetch and add, which will modify memory, and
+	 * MW bind can change permissions by binding a window.
+	 */
+	gift     = access & IB_ACCESS_GIFT;
+	writable = access & ~(IB_ACCESS_REMOTE_READ | IB_ACCESS_GIFT);
 
 	umem = kmalloc(sizeof *umem, GFP_KERNEL);
 	if (!umem)
@@ -105,14 +115,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->length    = size;
 	umem->offset    = addr & ~PAGE_MASK;
 	umem->page_size = PAGE_SIZE;
-	/*
-	 * We ask for writable memory if any access flags other than
-	 * remote read are set.  Local write and remote write
-	 * obviously require write access.  Remote atomic can do
-	 * things like fetch and add, which will modify memory, and
-	 * MW bind can change permissions by binding a window.
-	 */
-	umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+	umem->writable  = writable;
 
 	/* We assume the memory is from hugetlb until proved otherwise */
 	umem->hugetlb   = 1;
@@ -152,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !umem->writable, page_list, vma_list);
+				     !gift, !umem->writable, page_list, vma_list);
 
 		if (ret < 0)
 			goto out;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..2e6e13c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -871,7 +871,14 @@ enum ib_access_flags {
 	IB_ACCESS_REMOTE_READ	= (1<<2),
 	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
 	IB_ACCESS_MW_BIND	= (1<<4),
-	IB_ZERO_BASED		= (1<<5)
+	IB_ZERO_BASED		= (1<<5),
+	/*
+	 * IB_ACCESS_GIFT: This memory is a gift to the adapter.  If memory is
+	 * modified after registration, the local version and data seen by the
+	 * adapter through this region rkey may differ.
+	 * Only legal with IB_ACCESS_REMOTE_READ or no permissions.
+	 */
+	IB_ACCESS_GIFT		= (1<<6)
 };
 
 struct ib_phys_buf {
-- 
MST


Re: [PATCH] rdma: don't make pages writeable if not requested

2013-03-21 Thread Michael S. Tsirkin
On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
 On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin m...@redhat.com wrote:
  core/umem.c seems to get the arguments to get_user_pages
  in the reverse order: it sets writeable flag and
  breaks COW for MAP_SHARED if and only if hardware needs to
  write the page.
 
  This breaks memory overcommit for users such as KVM:
  each time we try to register a page to send it to remote, this
  breaks COW.  It seems that for applications that only have
  REMOTE_READ permission, there is no reason to break COW at all.
 
 I proposed a similar (but not exactly the same, see below) patch a
 while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
 especially https://lkml.org/lkml/2012/2/6/265
 
 I think this change will break the case where userspace tries to
 register an MR with read-only permission, but intends locally through
 the CPU to write to the memory.  If the memory registration is done
 while the memory is mapped read-only but has VM_MAYWRITE, then
 userspace gets into trouble when COW happens.  In the case you're
 describing (although I'm not sure where in KVM we're talking about
 using RDMA), what happens if you register memory with only REMOTE_READ
 and then COW is triggered because of a local write?  (I'm assuming you
 don't want remote access to continue to get the old contents of the
 page)

I read that, and the above. It looks like everyone was doing tricks
like "register page, then modify it, then let remote read it"
and for some reason assumed it's ok to write into the page locally from the CPU
even if LOCAL_WRITE is not set.  I don't see why applications don't set
LOCAL_WRITE if they are going to write to memory locally, but assuming
they don't, we can't just break them.

So what we need is a new "no, I really do not intend to write into this
memory" flag that avoids doing tricks in the kernel and treats the
page normally, just pins it so hardware can read it.


 I have to confess that I still haven't had a chance to implement the
 proposed FOLL_FOLLOW solution to all of this.

See a much easier to implement proposal at the bottom.

  If the page that is COW has lots of copies, this makes the user process
  quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
  for virtualization, thus the stable tag.
 
 The actual problem description here is a bit too terse for me to
 understand.  How do we end up with lots of copies of a COW page?

Reading the links above, rdma breaks COW intentionally.

Imagine a page with lots of instances in the process page map.
For example a zero page, but not only that: we rely on KSM heavily
to deduplicate pages for multiple VMs.
There are gigabytes of these in each of the multiple VMs
running on a host.

What we are using RDMA for is VM migration, so we are careful not to change
this memory: when we do allow memory to change, we are careful
to track what was changed, reregister and resend the data.

But at the moment, each time we register a virtual address referencing
this page, infiniband assumes we might want to change the page, so it
does get_user_pages with the writeable flag, forcing a copy.
The amount of used RAM explodes.

  Why
 is RDMA registering the memory any more  special than having everyone
 who maps this page actually writing to it and triggering COW?
 
  ret = get_user_pages(current, current->mm, cur_base,
        min_t(unsigned long, npages,
              PAGE_SIZE / sizeof (struct page *)),
  -     1, !umem->writable, page_list, vma_list);
  +     !umem->writable, 1, page_list, vma_list);
 
 The first two parameters in this line being changed are write and force.
 
 I think if we do change this, then we need to pass umem-writable (as
 opposed to !umem-writable) for the write parameter.

Ugh. Sure enough. Let's agree on the direction before I respin the
patch though.

 Not sure
 whether force makes sense or not.
 
  - R.

If you don't force write on read-only mappings you don't need it, but
it seems harmless for read-only gup. Still, no need to change
what's not broken.

Please comment on the below (completely untested, and needs userspace
patch too, but just to give you the idea)

---

rdma: add IB_ACCESS_APP_READONLY 

At the moment any attempt to register memory for RDMA breaks
COW, which hurts hosts overcommitted for memory.
But if the application knows it won't write into the MR after
registration, we can save (sometimes a lot of) memory
by telling the kernel not to bother breaking COW for us.

If the application does change memory registered with this flag, it can
re-register afterwards, and resend the data on the wire.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 5929598..635b57a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core

Re: [PATCH] rdma: don't make pages writeable if not requiested

2013-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
 On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin m...@redhat.com wrote:
  In that case, no, I don't see any reason for LOCAL_WRITE, since the
  only RDMA operations that will access this memory are remote reads.
 
  What is the meaning of LOCAL_WRITE then? There are no local
  RDMA writes as far as I can see.
 
 Umm, it means you're giving the local adapter permission to write to
 that memory.  So you can use it as a receive buffer or as the target
 for remote data from an RDMA read operation.

Well RDMA read has its own flag, IB_ACCESS_REMOTE_READ.
I don't see why you need to give the adapter permission
to use memory as a receive buffer, given that
you must pass the address in the post receive verb,
but maybe that's what the IB spec says, so that's what
the applications assumed?

  OK then what we need is a new flag saying "I really do not
  intend to write into this memory, please do not break
  COW or do anything else just in case I do."
 
 Isn't that a shared read-only mapping?
 
  - R.

Nothing to do with how the page is mapped.  We can and do write the page
before registration. BTW umem.c passes in force, so it breaks COW
even for read-only mappings, right?

-- 
MST


Re: [PATCH] rdma: don't make pages writeable if not requiested

2013-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2013 at 11:32:30AM +0200, Michael S. Tsirkin wrote:
 rdma: add IB_ACCESS_APP_READONLY 

Or we can call it IB_ACCESS_GIFT - this is a bit like SPLICE_F_GIFT
semantics.



 At the moment any attempt to register memory for RDMA breaks
 COW, which hurts hosts overcommitted for memory.
 But if the application knows it won't write into the MR after
 registration, we can save (sometimes a lot of) memory
 by telling the kernel not to bother breaking COW for us.
 
 If the application does change memory registered with this flag, it can
 re-register afterwards, and resend the data on the wire.

Re: [PATCH] rdma: don't make pages writeable if not requiested

2013-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2013 at 11:57:32AM -0600, Jason Gunthorpe wrote:
 On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote:
 
  It doesn't actually, and our app would sometimes write to these pages.
  It simply does not care which version does the remote get in this case
  since we track writes and resend later.
 
 Heh, somehow I thought you might say that :)
 
 A new flag seems like the only way then - maybe:
   IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share
  a coherent view of registered memory. Memory writes from the CPU
  after ibv_reg_mr completes may not be reflected in the memory
  viewed by the adaptor.
 
  Can only be combined with read only access permissions.
 
 Jason

I kind of like _GIFT for the name, gifts are nice :) But yes, that's exactly
the semantics we need.

-- 
MST


Re: [PATCH] rdma: don't make pages writeable if not requiested

2013-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote:
 On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
  On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
   On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin m...@redhat.com 
   wrote:
In that case, no, I don't see any reason for LOCAL_WRITE, since the
only RDMA operations that will access this memory are remote reads.
   
What is the meaning of LOCAL_WRITE then? There are no local
RDMA writes as far as I can see.
   
   Umm, it means you're giving the local adapter permission to write to
   that memory.  So you can use it as a receive buffer or as the target
   for remote data from an RDMA read operation.
  
  Well RDMA read has its own flag, IB_ACCESS_REMOTE_READ.
  I don't see why you need to give the adapter permission
 
 The access flags have to do with what independent access remote nodes
 get. There are four major cases:
 
 access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
 read the memory.
 
 access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
 the direct control of the application, can read this memory. Remote
 nodes are barred.
 
 access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
 will let remote nodes write the memory
 
 access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
 memory. Only the adaptor, under the direct control of the application,
 can write the memory.

This is the one I find redundant. Since the write will be done
by the adaptor under direct control by the application,
why does it make sense to declare this beforehand?
If you don't want to allow local write access to memory,
just do not post any receive WRs with this address.
If you posted and regret it, reset the QP to cancel.

But the IB spec specified LOCAL_WRITE in this redundant way, so I guess
applications expect it to have the semantics defined there; I just
didn't remember what they are.

No way around it then, need another flag.


Re: [PATCH] rdma: don't make pages writeable if not requiested

2013-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote:
 On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:
 
  This is the one I find redundant. Since the write will be done by
  the adaptor under direct control by the application, why does it
  make sense to declare this beforehand?  If you don't want to allow
  local write access to memory, just do not post any receive WRs with
  this address.  If you posted and regret it, reset the QP to cancel.
 
 This is to support your COW scenario - the app declares before hand to
 the kernel that it will write to the memory and the kernel ensures
 pages are dedicated to the app at registration time. Or the app says
 it will only read and the kernel could leave them shared.

Someone here is confused. LOCAL_WRITE, or the absence of it, does not address
COW; it breaks COW anyway.  Are you now saying we should change rdma so that
without LOCAL_WRITE it will not break COW?

 The adaptor enforces the access control to prevent a naughty app from
 writing to shared memory - think about mmap'ing libc.so and then using
 RDMA to write to the shared pages. It is necessary to ensure that is
 impossible.
 
 Jason

That's why it's redundant: we can't trust an application to tell us
"this page is writeable", we must get this info from the kernel.  And so
there's apparently no need for the application to tell the adaptor about
LOCAL_WRITE.

-- 
MST


Re: Mellanox target workaround in SRP

2011-01-12 Thread Michael S. Tsirkin
On Mon, Jan 10, 2011 at 10:51:13AM -0800, Roland Dreier wrote:
 Maybe we can use MST's current email to ask him... Michael, do you have
 any memory of the issue we worked around here?
 
   I have question regarding workaround introduced in commit 559ce8f1 of
   the mainline tree:
   
   IB/srp: Work around data corruption bug on Mellanox targets
   
   Data corruption has been seen with Mellanox SRP targets when FMRs
   create a memory region with I/O virtual address != 0.  Add a
   workaround that disables FMR merging for Mellanox targets (OUI 0002c9).
   
   I don't see how this can make a difference to the target -- it sees an
   address and length, and there should be no visible difference to it when
   it gets an FMR versus a direct-mapped region of the same space, right?
   And how is it different than getting a direct or indirect descriptor
   with a similar offset?
   
   I could see there being a bug on the initiator HCA not liking such FMR
   mappings, but then it should be keyed off of the vendor of our HCA and
   not the target.
   
   I'm sure this was tested and shown to fix the problem; I'm just confused
   as to what the problem really was and if this is still relevant. Can
   someone please enlighten me?


I don't recall unfortunately. Sorry.

-- 
MST