Re: [PATCH] IB/core: Get rid of redundant verb ib_destroy_mr

2015-06-09 Thread Sagi Grimberg

On 6/9/2015 1:55 AM, Hefty, Sean wrote:

--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1174,6 +1174,18 @@ static int clean_mr(struct mlx5_ib_mr *mr)
int umred = mr->umred;
int err;

+   if (mr->sig) {
+   if (mlx5_core_destroy_psv(dev->mdev,
+ mr->sig->psv_memory.psv_idx))
+   mlx5_ib_warn(dev, "failed to destroy mem psv %d\n",
+mr->sig->psv_memory.psv_idx);
+   if (mlx5_core_destroy_psv(dev->mdev,
+ mr->sig->psv_wire.psv_idx))
+   mlx5_ib_warn(dev, "failed to destroy wire psv %d\n",
+mr->sig->psv_wire.psv_idx);
+   kfree(mr->sig);
+   }
+
+
if (!umred) {
err = destroy_mkey(dev, mr);
if (err) {


This patch doesn't introduce this problem, but the err check suggests that 
deregistration can fail for some reason.
If so, should mr->sig be cleared above, so that a second call to dereg doesn't 
attempt to free the memory twice?



Hi Sean,

clean_mr() failure is not propagated back at the moment, but you are
correct, it's not a good idea to rely on that. I'll clear mr->sig in
v2.
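
A minimal sketch of what that v2 change could look like (hypothetical,
based only on the hunk quoted above):

	if (mr->sig) {
		/* destroy the memory and wire PSVs as above ... */
		kfree(mr->sig);
		/* make a repeated dereg call harmless: no double free */
		mr->sig = NULL;
	}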

Thanks!


Re: [PATCH 0/5] Indirect memory registration feature

2015-06-09 Thread Sagi Grimberg

On 6/9/2015 9:20 AM, Christoph Hellwig wrote:

On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:

I wouldn't say this is about offloading bounce buffering to silicon.
The RDMA stack has always imposed the alignment limitation, as we can only
give page lists to the devices. Other drivers (the qlogic/emulex FC
drivers, for example) use _arbitrary_ SG lists, where each element can
point to any {addr, len}.


Those are drivers for protocols that support real SG lists.   It seems
only Infiniband and NVMe expose this silly limit.


I agree this is indeed a limitation and that's why SG_GAPS was added
in the first place. I think the next gen of nvme devices will support 
real SG lists. This feature enables existing Infiniband devices that can 
handle SG lists to receive them via the RDMA stack (ib_core).


If the memory registration process weren't such a big fiasco in the
first place, wouldn't this way make the most sense?




So please fix it in the proper layers
first,


I agree that we can take care of bounce buffering in the block layer
(or scsi for SG_IO) if the driver doesn't want to see any type of
unaligned SG lists.

But do you think that it should come before the stack can support this?


Yes, absolutely.  The other thing that needs to come first is a proper
abstraction for MRs instead of hacking another type into all drivers.



I'm very much open to the idea of consolidating the memory registration
code instead of doing it in every ULP (srp, iser, xprtrdma, svcrdma,
rds, more to come...) using a general memory registration API. The main
challenge is to abstract the different methods (and considerations) of
memory registration behind an API. Do we completely mask out the way we
are doing it? I'm worried that we might end up either compromising on
performance or trying to understand too much what the caller is trying
to achieve.

For example:
- frwr requires a queue-pair for the post (and it must be the ULP
  queue-pair to ensure the registration is done before the data-transfer
  begins). While fmrs does not need the queue-pair.

- the ULP would probably always initiate data transfer after the
  registration (send a request or do the rdma r/w). It is useful to
  link the frwr post with the next wr in a single post_send call
  (see the sketch after this list). I wonder how an API would allow
  such a thing (while other registration methods don't use the work
  request interface).

- There is the fmr_pool API which tries to tackle the disadvantages of
  fmrs (very slow unmap) by delaying the fmr unmap until some dirty
  watermark of remapping is met. I'm not sure how this can be done.

- How would the API choose the method to register memory?

- If there is an alignment issue, do we fail? do we bounce?

- There is the whole T10-DIF support...

...
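
As a rough illustration of the linking point above, a minimal sketch
against the kernel verbs API of this era; the fast-register parameters,
SG list, and error handling are omitted, and remote_rkey/remote_addr
are illustrative variables:

	struct ib_send_wr frwr = { 0 }, rdma = { 0 }, *bad_wr;
	int ret;

	frwr.opcode = IB_WR_FAST_REG_MR;	/* registration first... */
	frwr.next = &rdma;			/* ...chained to the transfer */

	rdma.opcode = IB_WR_RDMA_READ;
	rdma.wr.rdma.rkey = remote_rkey;	/* illustrative */
	rdma.wr.rdma.remote_addr = remote_addr;	/* illustrative */
	rdma.send_flags = IB_SEND_SIGNALED;

	/* One doorbell: the QP executes the registration before the
	 * transfer that depends on the newly registered memory. */
	ret = ib_post_send(qp, &frwr, &bad_wr);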

CC'ing Bart & Chuck who share the suffering of memory registration.


Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected

2015-06-09 Thread Erez Shitrit
On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote:
 On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote:
 On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote:
 
  When switching between modes (datagram / connected) change the MTU
  accordingly.
  datagram mode up to 4K, connected mode up to (64K - 0x10).

 Is this a bug fix (describe the user visible impact)? Should it go to
 -stable? Add a Fixes: line?

 I'm not sure I would call it a bug.  Setting the MTU to max
 automatically is a policy decision more than anything else.  Currently,
 you have to enable connected mode, then set the MTU.  Both initscripts
 and NetworkManager do this for you on Red Hat, so the user sees no
 difference here.  I can't speak to other OSes.  I'm OK with setting
 connected mode MTU to max by default once we get the scatter/gather
 support for IPoIB added.

Hi Doug,
There is no connection to the scatter/gather patch in that fix.


 --
 Doug Ledford dledf...@redhat.com
   GPG KeyID: 0E572FDD



Re: [PATCH 0/5] Indirect memory registration feature

2015-06-09 Thread Christoph Hellwig
On Tue, Jun 09, 2015 at 11:44:16AM +0300, Sagi Grimberg wrote:
 For example:
 - frwr requires a queue-pair for the post (and it must be the ULP
   queue-pair to ensure the registration is done before the data-transfer
   begins). While fmrs does not need the queue-pair.

How do you transfer data without a queue pair?

 - the ULP would probably always initiate data transfer after the
   registration (send a request or do the rdma r/w). It is useful to
   link the frwr post with the next wr in a single post_send call.
   I wonder how an API would allow such a thing (while other registration
   methods don't use the work request interface).
 
 - There is the fmr_pool API which tries to tackle the disadvantages of
   fmrs (very slow unmap) by delaying the fmr unmap until some dirty
   watermark of remapping is met. I'm not sure how this can be done.
 
 - How would the API choose the method to register memory?

Seems like the general preference in existing users is FRWR -> FMR -> FR,
with an exception to use physical for single segment I/O requests.

 - If there is an alignment issue, do we fail? do we bounce?

Leave it to the high level driver for now.  NFS can simply split
requests when the offset mismatches; for block drivers the enhanced
SG_GAPS can take care of it.  If indirect is supported the driver
does not need to set SG_GAPS.

 - There is the whole T10-DIF support...

Just tell the driver if the registration model supports it.  Do the
indirect registrations support it, btw?

As far as I can tell the NFS abstraction fits SRP fairly well.  iSER
is a bit of a mess with its bounce buffering so I'm not too sure about
that one.


Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected

2015-06-09 Thread Erez Shitrit
On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote:
 On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote:
 On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote:
 
  When switching between modes (datagram / connected) change the MTU
  accordingly.
  datagram mode up to 4K, connected mode up to (64K - 0x10).

 Is this a bug fix (describe the user visible impact)? Should it go to
 -stable? Add a Fixes: line?

 I'm not sure I would call it a bug.  Setting the MTU to max
 automatically is a policy decision more than anything else.  Currently,
 you have to enable connected mode, then set the MTU.  Both initscripts
 and NetworkManager do this for you on Red Hat, so the user sees no
 difference here.  I can't speak to other OSes.  I'm OK with setting

The only real reason for a user to switch to CM mode is the
performance he can get, and that comes only from the huge MTU that CM
allows, so it somehow should be part of the mode setting.
As you said, RH does it (in the new versions of its OSes only) via
scripting outside of the driver code; other vendors don't.

 connected mode MTU to max by default once we get the scatter/gather
 support for IPoIB added.

OK, great. Thanks.

 --
 Doug Ledford dledf...@redhat.com
   GPG KeyID: 0E572FDD



Re: [PATCH 0/5] Indirect memory registration feature

2015-06-09 Thread Christoph Hellwig
On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
 I wouldn't say this is about offloading bounce buffering to silicon.
 The RDMA stack has always imposed the alignment limitation, as we can only
 give page lists to the devices. Other drivers (the qlogic/emulex FC
 drivers, for example) use _arbitrary_ SG lists, where each element can
 point to any {addr, len}.

Those are drivers for protocols that support real SG lists.   It seems
only Infiniband and NVMe expose this silly limit.

 So please fix it in the proper layers
 first,
 
 I agree that we can take care of bounce buffering in the block layer
 (or scsi for SG_IO) if the driver doesn't want to see any type of
 unaligned SG lists.
 
 But do you think that it should come before the stack can support this?

Yes, absolutely.  The other thing that needs to come first is a proper
abstraction for MRs instead of hacking another type into all drivers.


[PATCH for-4.1 1/2] cxgb4: Support for user mode bar2 mappings with T4

2015-06-09 Thread Hariprasad Shenai
Enhance cxgb4_t4_bar2_sge_qregs() and cxgb4_bar2_sge_qregs() to support T4
user mode mappings.  Update all the current users as well.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  | 1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 4 +++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  | 1 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c| 4 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  | 7 ---
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 524d110..e052f05 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1185,6 +1185,7 @@ enum t4_bar2_qtype { T4_BAR2_QTYPE_EGRESS, 
T4_BAR2_QTYPE_INGRESS };
 int cxgb4_t4_bar2_sge_qregs(struct adapter *adapter,
  unsigned int qid,
  enum t4_bar2_qtype qtype,
+ int user,
  u64 *pbar2_qoffset,
  unsigned int *pbar2_qid);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 803d91b..a935559 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2145,6 +2145,7 @@ EXPORT_SYMBOL(cxgb4_read_sge_timestamp);
 int cxgb4_bar2_sge_qregs(struct net_device *dev,
 unsigned int qid,
 enum cxgb4_bar2_qtype qtype,
+int user,
 u64 *pbar2_qoffset,
 unsigned int *pbar2_qid)
 {
@@ -2153,6 +2154,7 @@ int cxgb4_bar2_sge_qregs(struct net_device *dev,
 (qtype == CXGB4_BAR2_QTYPE_EGRESS
  ? T4_BAR2_QTYPE_EGRESS
  : T4_BAR2_QTYPE_INGRESS),
+user,
 pbar2_qoffset,
 pbar2_qid);
 }
@@ -2351,7 +2353,7 @@ static void process_db_drop(struct work_struct *work)
int ret;
 
ret = cxgb4_t4_bar2_sge_qregs(adap, qid, T4_BAR2_QTYPE_EGRESS,
-   &bar2_qoffset, &bar2_qid);
+ 0, &bar2_qoffset, &bar2_qid);
if (ret)
dev_err(adap->pdev_dev, "doorbell drop recovery: "
"qid=%d, pidx_inc=%d\n", qid, pidx_inc);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index 78ab4d4..e33934a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -306,6 +306,7 @@ enum cxgb4_bar2_qtype { CXGB4_BAR2_QTYPE_EGRESS, 
CXGB4_BAR2_QTYPE_INGRESS };
 int cxgb4_bar2_sge_qregs(struct net_device *dev,
 unsigned int qid,
 enum cxgb4_bar2_qtype qtype,
+int user,
 u64 *pbar2_qoffset,
 unsigned int *pbar2_qid);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 0d2edda..1b99aec 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2429,8 +2429,8 @@ static void __iomem *bar2_address(struct adapter *adapter,
u64 bar2_qoffset;
int ret;
 
-   ret = cxgb4_t4_bar2_sge_qregs(adapter, qid, qtype,
-   &bar2_qoffset, pbar2_qid);
+   ret = cxgb4_t4_bar2_sge_qregs(adapter, qid, qtype, 0,
+ &bar2_qoffset, pbar2_qid);
if (ret)
return NULL;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 5959e3a..2d76a07 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -5102,6 +5102,7 @@ int t4_prep_adapter(struct adapter *adapter)
  * @adapter: the adapter
  * @qid: the Queue ID
  * @qtype: the Ingress or Egress type for @qid
+ * @user: true if this request is for a user mode queue
  * @pbar2_qoffset: BAR2 Queue Offset
  * @pbar2_qid: BAR2 Queue ID or 0 for Queue ID inferred SGE Queues
  *
@@ -5125,6 +5126,7 @@ int t4_prep_adapter(struct adapter *adapter)
 int cxgb4_t4_bar2_sge_qregs(struct adapter *adapter,
  unsigned int qid,
  enum t4_bar2_qtype qtype,
+ int user,
  u64 *pbar2_qoffset,
  unsigned int *pbar2_qid)
 {
@@ -5132,9 +5134,8 @@ int cxgb4_t4_bar2_sge_qregs(struct adapter *adapter,
u64 bar2_page_offset, bar2_qoffset;
unsigned int bar2_qid, bar2_qid_offset, 

[PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size

2015-06-09 Thread Hariprasad Shenai
Handle this configuration:

Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size

Use cxgb4_bar2_sge_qregs() to obtain the proper location within the
bar2 region for a given qid.

Rework the DB and GTS write functions to make use of this bar2 info.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
---
 drivers/infiniband/hw/cxgb4/cq.c   | 22 ++--
 drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
 drivers/infiniband/hw/cxgb4/qp.c   | 64 ++
 drivers/infiniband/hw/cxgb4/t4.h   | 60 ---
 5 files changed, 98 insertions(+), 69 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 68ddb37..8e5bbcb 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq 
*cq,
goto err4;
 
cq->gen = 1;
+   cq->gts = rdev->lldi.gts_reg;
cq->rdev = rdev;
-   if (user) {
-   u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;

-   cq->ugts = (u64)rdev->bar2_pa + off;
-   } else if (is_t4(rdev->lldi.adapter_type)) {
-   cq->gts = rdev->lldi.gts_reg;
-   cq->qid_mask = -1U;
-   } else {
-   u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
-
-   cq->gts = rdev->bar2_kva + off;
-   cq->qid_mask = rdev->qpmask;
+   cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
+ &cq->bar2_qid,
+ user ? &cq->bar2_pa : NULL);
+   if (user && !cq->bar2_va) {
+   pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
+   pci_name(rdev->lldi.pdev), cq->cqid);
+   ret = -EINVAL;
+   goto err4;
}
return 0;
 err4:
@@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int 
entries,
insert_mmap(ucontext, mm);
 
mm2->key = uresp.gts_key;
-   mm2->addr = chp->cq.ugts;
+   mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;
mm2->len = PAGE_SIZE;
insert_mmap(ucontext, mm2);
}
diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 1ffbd03..ab36804 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -788,13 +788,7 @@ static int c4iw_rdev_open(struct c4iw_rdev *rdev)
goto err1;
}
 
-   /*
-* qpshift is the number of bits to shift the qpid left in order
-* to get the correct address of the doorbell for that qp.
-*/
-   rdev->qpshift = PAGE_SHIFT - ilog2(rdev->lldi.udb_density);
rdev->qpmask = rdev->lldi.udb_density - 1;
-   rdev->cqshift = PAGE_SHIFT - ilog2(rdev->lldi.ucq_density);
rdev->cqmask = rdev->lldi.ucq_density - 1;
PDBG("%s dev %s stag start 0x%0x size 0x%0x num stags %d "
 "pbl start 0x%0x size 0x%0x rq start 0x%0x size 0x%0x "
@@ -808,14 +802,12 @@ static int c4iw_rdev_open(struct c4iw_rdev *rdev)
 rdev->lldi.vr->qp.size,
 rdev->lldi.vr->cq.start,
 rdev->lldi.vr->cq.size);
-   PDBG("udb len 0x%x udb base %p db_reg %p gts_reg %p qpshift %lu "
-"qpmask 0x%x cqshift %lu cqmask 0x%x\n",
+   PDBG("udb len 0x%x udb base %p db_reg %p gts_reg %p "
+"qpmask 0x%x cqmask 0x%x\n",
 (unsigned)pci_resource_len(rdev->lldi.pdev, 2),
 (void *)pci_resource_start(rdev->lldi.pdev, 2),
-rdev->lldi.db_reg,
-rdev->lldi.gts_reg,
-rdev->qpshift, rdev->qpmask,
-rdev->cqshift, rdev->cqmask);
+rdev->lldi.db_reg, rdev->lldi.gts_reg,
+rdev->qpmask, rdev->cqmask);
 
if (c4iw_num_stags(rdev) == 0) {
err = -EINVAL;
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index d87e165..7a1b675 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -164,9 +164,7 @@ struct wr_log_entry {
 
 struct c4iw_rdev {
struct c4iw_resource resource;
-   unsigned long qpshift;
u32 qpmask;
-   unsigned long cqshift;
u32 cqmask;
struct c4iw_dev_ucontext uctx;
struct gen_pool *pbl_pool;
@@ -1025,6 +1023,9 @@ void c4iw_ev_dispatch(struct c4iw_dev *dev, struct t4_cqe 
*err_cqe);
 
 extern struct cxgb4_client t4c_client;
 extern c4iw_handler_func c4iw_handlers[NUM_CPL_CMDS];
+void __iomem *c4iw_bar2_addrs(struct c4iw_rdev *rdev, unsigned int qid,
+ enum cxgb4_bar2_qtype qtype,
+ unsigned int *pbar2_qid, u64 *pbar2_pa);
 extern void c4iw_log_wr_stats(struct t4_wq *wq, struct 

[PATCH for-4.1 0/2] Adds support for user mode bar2 mapping and bar2 qid density

2015-06-09 Thread Hariprasad Shenai
Hi,

This patch series adds support for user mode bar2 mappings for the T4
adapter and also adds support for bar2 qid densities exceeding the page
size.

This patch series has been created against Doug's github tree 'for-4.1'
branch and includes patches for the cxgb4 and iw_cxgb4 drivers.

We have included all the maintainers of the respective drivers. Kindly
review the changes and let us know in case of any review comments.

Thanks

Hariprasad Shenai (2):
  cxgb4: Support for user mode bar2 mappings with T4
  iw_cxgb4: support for bar2 qid densities exceeding the page size

 drivers/infiniband/hw/cxgb4/cq.c| 22 -
 drivers/infiniband/hw/cxgb4/device.c| 16 ++-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |  5 +-
 drivers/infiniband/hw/cxgb4/qp.c| 64 -
 drivers/infiniband/hw/cxgb4/t4.h| 60 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  4 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c|  4 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |  7 +--
 10 files changed, 109 insertions(+), 75 deletions(-)

-- 
2.3.4



Re: [PATCH 0/5] Indirect memory registration feature

2015-06-09 Thread Christoph Hellwig
On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
 I think O_DIRECT allows user-space to generate vectored SG lists that
 are not aligned as well.

Indeed if you do sub-page size direct I/O.  Although SG_GAPS helps with
that.


Re: [PATCH v6 1/3] ivtv: use arch_phys_wc_add() and require PAT disabled

2015-06-09 Thread Hans Verkuil
On 06/09/2015 02:56 AM, Mauro Carvalho Chehab wrote:
 On Mon, 08 Jun 2015 17:20:20 -0700,
 Luis R. Rodriguez mcg...@do-not-panic.com wrote:
 
 From: Luis R. Rodriguez mcg...@suse.com

 We are burying direct access to MTRR code support on
 x86 in order to take advantage of PAT. In the future we
 also want to make the default behaviour of ioremap_nocache()
 use strong UC; use of mtrr_add() on those systems
 would make write-combining void.

 In order to help both enable us to later make strong
 UC default and in order to phase out direct MTRR access
 code port the driver over to arch_phys_wc_add() and
 annotate that the device driver requires systems to
 boot with PAT disabled, with the nopat kernel parameter.

 This is a worthy compromise given that the hardware is
 really rare these days, and perhaps only some lost souls
 in some third world country are expected to be using this
 feature of the device driver.

 Acked-by: Andy Walls awa...@md.metrocast.net
 Cc: Andy Walls awa...@md.metrocast.net
 Cc: Doug Ledford dledf...@redhat.com
 Cc: Mauro Carvalho Chehab mche...@osg.samsung.com
 
 Provided that you fix the issues below:
 Acked-by: Mauro Carvalho Chehab mche...@osg.samsung.com
 
 Cc: Andy Lutomirski l...@amacapital.net
 Cc: Suresh Siddha sbsid...@gmail.com
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Juergen Gross jgr...@suse.com
 Cc: Daniel Vetter daniel.vet...@ffwll.ch
 Cc: Dave Airlie airl...@redhat.com
 Cc: Bjorn Helgaas bhelg...@google.com
 Cc: Antonino Daplas adap...@gmail.com
 Cc: Jean-Christophe Plagniol-Villard plagn...@jcrosoft.com
 Cc: Tomi Valkeinen tomi.valkei...@ti.com
 Cc: Dave Hansen dave.han...@linux.intel.com
 Cc: Arnd Bergmann a...@arndb.de
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Stefan Bader stefan.ba...@canonical.com
 Cc: Ville Syrjälä syrj...@sci.fi
 Cc: Mel Gorman mgor...@suse.de
 Cc: Vlastimil Babka vba...@suse.cz
 Cc: Borislav Petkov b...@suse.de
 Cc: Davidlohr Bueso dbu...@suse.de
 Cc: konrad.w...@oracle.com
 Cc: ville.syrj...@linux.intel.com
 Cc: david.vra...@citrix.com
 Cc: jbeul...@suse.com
 Cc: toshi.k...@hp.com
 Cc: Roger Pau Monné roger@citrix.com
 Cc: linux-fb...@vger.kernel.org
 Cc: linux-ker...@vger.kernel.org
 Cc: ivtv-de...@ivtvdriver.org
 Cc: linux-me...@vger.kernel.org
 Cc: xen-de...@lists.xensource.com
 Signed-off-by: Luis R. Rodriguez mcg...@suse.com
 ---
  drivers/media/pci/ivtv/Kconfig  |  3 +++
  drivers/media/pci/ivtv/ivtvfb.c | 58 
 -
  2 files changed, 26 insertions(+), 35 deletions(-)

 diff --git a/drivers/media/pci/ivtv/Kconfig b/drivers/media/pci/ivtv/Kconfig
 index dd6ee57e..b2a7f88 100644
 --- a/drivers/media/pci/ivtv/Kconfig
 +++ b/drivers/media/pci/ivtv/Kconfig
 @@ -57,5 +57,8 @@ config VIDEO_FB_IVTV
This is used in the Hauppauge PVR-350 card. There is a driver
homepage at http://www.ivtvdriver.org.
  
 +  If you have this hardware you will need to boot with PAT disabled
 +  on your x86 systems, use the nopat kernel parameter.
 +
 
 Hmm... FB_IVTV is not hardware... it is framebuffer support for IVTV.
 It is optional to use FB API for the video output port of this board,
 instead of using V4L2 API.

That's not true. It is hardware: it drives the video output OSD which overlays
the video output. The reason it is optional is that it is hard to unload a
framebuffer module and most people don't need it.

So V4L2 drives the video output and ivtvfb drives the OSD overlay. So it is
not a case of 'instead of'.

 
 I would say, instead, something like: 
 
   In order to use this module, you will need to boot with PAT disabled
 on x86 systems, using the nopat kernel parameter.

I do agree with this change, but that's because this module is optional and
not for the reasons you mentioned above.

Regards,

Hans

 
To compile this driver as a module, choose M here: the
module will be called ivtvfb.
 diff --git a/drivers/media/pci/ivtv/ivtvfb.c 
 b/drivers/media/pci/ivtv/ivtvfb.c
 index 9ff1230..7685ae3 100644
 --- a/drivers/media/pci/ivtv/ivtvfb.c
 +++ b/drivers/media/pci/ivtv/ivtvfb.c
 @@ -44,8 +44,8 @@
  #include <linux/ivtvfb.h>
  #include <linux/slab.h>
  
 -#ifdef CONFIG_MTRR
 -#include <asm/mtrr.h>
 +#ifdef CONFIG_X86_64
 +#include <asm/pat.h>
  #endif
  
  #include "ivtv-driver.h"
 @@ -155,12 +155,11 @@ struct osd_info {
  /* Buffer size */
  u32 video_buffer_size;
  
 -#ifdef CONFIG_MTRR
  /* video_base rounded down as required by hardware MTRRs */
  unsigned long fb_start_aligned_physaddr;
  /* video_base rounded up as required by hardware MTRRs */
  unsigned long fb_end_aligned_physaddr;
 -#endif
 +int wc_cookie;
  
  /* Store the buffer offset */
  int set_osd_coords_x;
 @@ -1099,6 +1098,8 @@ static int ivtvfb_init_vidmode(struct ivtv *itv)
  static int ivtvfb_init_io(struct ivtv *itv)
  {
struct osd_info *oi = itv->osd_info;
 +/* Find the largest power of two that maps the 

Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core

2015-06-09 Thread Matan Barak



On 6/9/2015 12:37 AM, Hefty, Sean wrote:

Previously, every vendor implemented its net device notifiers in its own
driver. This introduces a huge code duplication as figuring




  28 files changed, 2253 insertions(+), 860 deletions(-)


How does adding 1400 lines of code help reduce code duplication?

Can you please explain and justify why this change is actually needed?



Let's look at this change from several perspectives:

(1) Each vendor lost ~250 lines of GID management code just by this 
change. In the future it's very probable that more vendor drivers will 
implement RoCE. This removes the burden and code duplication required 
for them to implement full RoCE support, and is a lot more scalable than 
the current approach.


(2) All vendors are now aligned. For example, the mlx4 driver had bonding 
support but ocrdma didn't. The user expects the same behavior regardless 
of the vendor's driver.


(3) Making something more general usually requires more lines of 
code, as it introduces an API and doesn't cut corners by assuming 
anything about the vendor's driver.


(4) This is a prerequisite to the RoCE V2 series. I'm sure you remember 
we first submitted this patch-set as a part of the RoCE V2 series. 
Adding more features to the RoCE GID management will make the code 
duplication a lot worse than just ~250 lines. I don't think it's fair 
to play "let's divide the RoCE V2 patch-set into several patch-sets and 
then ask why we need this first part at all". Let alone, the other 
three reasons are more than enough IMHO.


Matan


[PATCH v4 0/4] Sending kernel pathrecord query to user cache server

2015-06-09 Thread kaike . wan
From: Kaike Wan kaike@intel.com

An SA cache is undeniably critical for fabric scalability and performance.
In user space, the ibacm application provides a good example of a pathrecord
cache for address and route resolution. With the recent implementation of
the provider architecture, ibacm offers more extensibility as an SA cache.
In kernel, ipoib implements its own small cache for pathrecords, which is
however not available for general use. Furthermore, the implementation of
an SA cache in user space offers better flexibility, larger capacity, and
more robustness for the system.

In this patch series, a mechanism is implemented to allow ib_sa to
send pathrecord queries to a user application (e.g. ibacm) through netlink.
Potentially, this mechanism could be easily extended to other SA queries.

With a customized test implemented in rdma_cm module (not included in this
series), it was shown that the time to retrieve 1 million pathrecords
dropped from 46660 jiffies (46.66 seconds) to 16119 jiffies (or 16.119
seconds) on a two-node system, a reduction of more than 60%.

Changes since v3:
- Patch 1: added basic RESOLVE attribute types.
- Patch 4: changed the encoding of the RESOLVE request message based on
  the new attribute types and the input comp_mask. Changed the response
  handling by iterating all attributes.

Changes since v2:
- Redesigned the communication protocol between the kernel and user space
  application. Instead of the MAD packet format, the new protocol uses 
  netlink message header and attributes to exchange request and 
  response between the kernel and user space.The design was described
  here:
  http://www.spinics.net/lists/linux-rdma/msg25621.html

Changes since v1:
- Move kzalloc changes into a separate patch (Patch 3).
- Remove redundant include line (Patch 4). 
- Rename struct rdma_nl_resp_msg as structure ib_nl_resp_msg (Patch 4).

Kaike Wan (4):
  IB/netlink: Add defines for local service requests through netlink
  IB/core: Check the presence of netlink multicast group listeners
  IB/sa: Allocate SA query with kzalloc
  IB/sa: Route SA pathrecord query through netlink

 drivers/infiniband/core/netlink.c  |8 +
 drivers/infiniband/core/sa_query.c |  523 +++-
 include/rdma/rdma_netlink.h|7 +
 include/uapi/rdma/rdma_netlink.h   |   82 ++
 4 files changed, 615 insertions(+), 5 deletions(-)



RE: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size

2015-06-09 Thread Steve Wise


 -Original Message-
 From: Doug Ledford [mailto:dledf...@redhat.com]
 Sent: Tuesday, June 09, 2015 9:03 AM
 To: Hariprasad Shenai
 Cc: linux-rdma@vger.kernel.org; sw...@opengridcomputing.com; 
 lee...@chelsio.com; nirran...@chelsio.com
 Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities 
 exceeding the page size
 
 On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote:
  Handle this configuration:
 
 Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
 
  Use cxgb4_bar2_sge_qregs() to obtain the proper location within the
  bar2 region for a given qid.
 
  Rework the DB and GTS write functions to make use of this bar2 info.
 
  Signed-off-by: Steve Wise sw...@opengridcomputing.com
  Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
  ---
   drivers/infiniband/hw/cxgb4/cq.c   | 22 ++--
   drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
   drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
   drivers/infiniband/hw/cxgb4/qp.c   | 64 
  ++
   drivers/infiniband/hw/cxgb4/t4.h   | 60 ---
   5 files changed, 98 insertions(+), 69 deletions(-)
 
  diff --git a/drivers/infiniband/hw/cxgb4/cq.c 
  b/drivers/infiniband/hw/cxgb4/cq.c
  index 68ddb37..8e5bbcb 100644
  --- a/drivers/infiniband/hw/cxgb4/cq.c
  +++ b/drivers/infiniband/hw/cxgb4/cq.c
  @@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct 
  t4_cq *cq,
  goto err4;
 
 cq->gen = 1;
 +   cq->gts = rdev->lldi.gts_reg;
 cq->rdev = rdev;
 -   if (user) {
 -   u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;

 -   cq->ugts = (u64)rdev->bar2_pa + off;
 -   } else if (is_t4(rdev->lldi.adapter_type)) {
 -   cq->gts = rdev->lldi.gts_reg;
 -   cq->qid_mask = -1U;
 -   } else {
 -   u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
 -
 -   cq->gts = rdev->bar2_kva + off;
 -   cq->qid_mask = rdev->qpmask;
 +   cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
 + &cq->bar2_qid,
 + user ? &cq->bar2_pa : NULL);
 +   if (user && !cq->bar2_va) {
 +   pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
 +   pci_name(rdev->lldi.pdev), cq->cqid);
 +   ret = -EINVAL;
 +   goto err4;
 }
  return 0;
   err4:
  @@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, 
  int entries,
  insert_mmap(ucontext, mm);
 
 mm2->key = uresp.gts_key;
 -   mm2->addr = chp->cq.ugts;
 +   mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;
 
 Why are you using a cast here at all?  bar2_pa is already u64...
 

So it should just have the (uintptr_t) cast?

Steve.



[PATCH v4 2/4] IB/core: Check the presence of netlink multicast group listeners

2015-06-09 Thread kaike . wan
From: Kaike Wan kaike@intel.com

This patch adds a function to check if listeners for a netlink multicast
group are present.

Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: John Fleck john.fl...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Reviewed-by: Sean Hefty sean.he...@intel.com
---
 drivers/infiniband/core/netlink.c |8 
 include/rdma/rdma_netlink.h   |7 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/netlink.c 
b/drivers/infiniband/core/netlink.c
index 23dd5a5..e0fc913 100644
--- a/drivers/infiniband/core/netlink.c
+++ b/drivers/infiniband/core/netlink.c
@@ -49,6 +49,14 @@ static DEFINE_MUTEX(ibnl_mutex);
 static struct sock *nls;
 static LIST_HEAD(client_list);
 
+int ibnl_chk_listeners(unsigned int group)
+{
+   if (netlink_has_listeners(nls, group) == 0)
+   return -1;
+   return 0;
+}
+EXPORT_SYMBOL(ibnl_chk_listeners);
+
 int ibnl_add_client(int index, int nops,
const struct ibnl_client_cbs cb_table[])
 {
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 0790882..5852661 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -77,4 +77,11 @@ int ibnl_unicast(struct sk_buff *skb, struct nlmsghdr *nlh,
 int ibnl_multicast(struct sk_buff *skb, struct nlmsghdr *nlh,
unsigned int group, gfp_t flags);
 
+/**
+ * Check if there are any listeners to the netlink group
+ * @group: the netlink group ID
+ * Returns 0 on success or a negative for no listeners.
+ */
+int ibnl_chk_listeners(unsigned int group);
+
 #endif /* _RDMA_NETLINK_H */
-- 
1.7.1
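
A minimal usage sketch (hypothetical caller; ib_sa_mad_query() is an
illustrative name, not an existing function):

	/* Prefer the user-space cache only if somebody is listening. */
	if (ibnl_chk_listeners(RDMA_NL_GROUP_LS)) {
		/* no netlink listeners: fall back to a plain SA MAD query */
		return ib_sa_mad_query(query);
	}
	/* otherwise route the query through netlink first */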



[PATCH v4 3/4] IB/sa: Allocate SA query with kzalloc

2015-06-09 Thread kaike . wan
From: Kaike Wan kaike@intel.com

Replace kmalloc with kzalloc so that all uninitialized fields in the SA query
will be zeroed out to avoid unintentional consequences. This prepares the
SA query structure to accept new fields in the future.

Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: John Fleck john.fl...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Reviewed-by: Sean Hefty sean.he...@intel.com
---
 drivers/infiniband/core/sa_query.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c 
b/drivers/infiniband/core/sa_query.c
index c38f030..17e1cf7 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -739,7 +739,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
port  = &sa_dev->port[port_num - sa_dev->start_port];
agent = port->agent;
 
-   query = kmalloc(sizeof *query, gfp_mask);
+   query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
 
@@ -861,7 +861,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
method != IB_SA_METHOD_DELETE)
return -EINVAL;
 
-   query = kmalloc(sizeof *query, gfp_mask);
+   query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
 
@@ -953,7 +953,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
port  = &sa_dev->port[port_num - sa_dev->start_port];
agent = port->agent;
 
-   query = kmalloc(sizeof *query, gfp_mask);
+   query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
 
@@ -1050,7 +1050,7 @@ int ib_sa_guid_info_rec_query(struct ib_sa_client *client,
port  = &sa_dev->port[port_num - sa_dev->start_port];
agent = port->agent;
 
-   query = kmalloc(sizeof *query, gfp_mask);
+   query = kzalloc(sizeof(*query), gfp_mask);
if (!query)
return -ENOMEM;
 
-- 
1.7.1



Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size

2015-06-09 Thread Doug Ledford
On Tue, 2015-06-09 at 09:07 -0500, Steve Wise wrote:
 
  -Original Message-
  From: Doug Ledford [mailto:dledf...@redhat.com]
  Sent: Tuesday, June 09, 2015 9:03 AM
  To: Hariprasad Shenai
  Cc: linux-rdma@vger.kernel.org; sw...@opengridcomputing.com; 
  lee...@chelsio.com; nirran...@chelsio.com
  Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities 
  exceeding the page size
  
  On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote:
   Handle this configuration:
  
   Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
  
   Use cxgb4_bar2_sge_qregs() to obtain the proper location within the
   bar2 region for a given qid.
  
   Rework the DB and GTS write functions to make use of this bar2 info.
  
   Signed-off-by: Steve Wise sw...@opengridcomputing.com
   Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
   ---
drivers/infiniband/hw/cxgb4/cq.c   | 22 ++--
drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
drivers/infiniband/hw/cxgb4/qp.c   | 64 
   ++
drivers/infiniband/hw/cxgb4/t4.h   | 60 
   ---
5 files changed, 98 insertions(+), 69 deletions(-)
  
   diff --git a/drivers/infiniband/hw/cxgb4/cq.c 
   b/drivers/infiniband/hw/cxgb4/cq.c
   index 68ddb37..8e5bbcb 100644
   --- a/drivers/infiniband/hw/cxgb4/cq.c
   +++ b/drivers/infiniband/hw/cxgb4/cq.c
   @@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct 
   t4_cq *cq,
 goto err4;
  
  cq->gen = 1;
 + cq->gts = rdev->lldi.gts_reg;
  cq->rdev = rdev;
 - if (user) {
 - u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;

 - cq->ugts = (u64)rdev->bar2_pa + off;
 - } else if (is_t4(rdev->lldi.adapter_type)) {
 - cq->gts = rdev->lldi.gts_reg;
 - cq->qid_mask = -1U;
 - } else {
 - u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
 -
 - cq->gts = rdev->bar2_kva + off;
 - cq->qid_mask = rdev->qpmask;
 + cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
 +   &cq->bar2_qid,
 +   user ? &cq->bar2_pa : NULL);
 + if (user && !cq->bar2_va) {
 + pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
 + pci_name(rdev->lldi.pdev), cq->cqid);
 + ret = -EINVAL;
 + goto err4;
 }
 return 0;
err4:
   @@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, 
   int entries,
 insert_mmap(ucontext, mm);
  
  mm2->key = uresp.gts_key;
 - mm2->addr = chp->cq.ugts;
 + mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;
  
  Why are you using a cast here at all?  bar2_pa is already u64...
  
 
 So it should just have the (uintptr_t) cast?

No, it should be no cast at all.  The uintptr_t cast is only for casting
an int->ptr or ptr->int.  In those cases, if the size of an int != size
of ptr, you can lose data, and uintptr_t tells the compiler "I know I'm
casting between possibly lossy data sizes and either A) I've checked and
it's OK or B) I'm OK with ptr truncation and the loss won't hurt us."
It basically turns off size checks when sticking a ptr into an int.  You
should therefore use it only in those circumstances.  For example, when
storing a cookie that doesn't have a strict uniqueness requirement, the
loss due to truncation is probably OK.  Or if you know you are only
doing something like initially storing an int into a pointer, and then
later storing that pointer back into an int, so there can never be any
truncation because the source of the ptr was always int sized.  Those
are the times to use uintptr_t.  In this case, you have a real u64 going
into a real u64, so there should be no casts.
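
A small sketch of the two cases (illustrative variables only):

	/* int stored in a ptr slot, later pulled back out: uintptr_t fits */
	void *cookie = (void *)(uintptr_t)qid;	/* int -> ptr */
	unsigned int qid2 = (uintptr_t)cookie;	/* ptr -> int, lossless here */

	/* u64 into u64, as in the hunk above: no cast needed */
	mm2->addr = chp->cq.bar2_pa;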


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD





Re: [PATCH 0/5] Indirect memory registration feature

2015-06-09 Thread Chuck Lever

On Jun 9, 2015, at 4:44 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:

 On 6/9/2015 9:20 AM, Christoph Hellwig wrote:
 On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
 I wouldn't say this is about offloading bounce buffering to silicon.
 The RDMA stack has always imposed the alignment limitation, as we can only
 give page lists to the devices. Other drivers (the qlogic/emulex FC
 drivers, for example) use _arbitrary_ SG lists, where each element can
 point to any {addr, len}.
 
 Those are drivers for protocols that support real SG lists.   It seems
 only Infiniband and NVMe expose this silly limit.
 
 I agree this is indeed a limitation and that's why SG_GAPS was added
 in the first place. I think the next gen of nvme devices will support real SG 
 lists. This feature enables existing Infiniband devices that can handle SG 
 lists to receive them via the RDMA stack (ib_core).
 
 If the memory registration process weren't such a big fiasco in the
 first place, wouldn't this way make the most sense?
 
 
 So please fix it in the proper layers
 first,
 
 I agree that we can take care of bounce buffering in the block layer
 (or scsi for SG_IO) if the driver doesn't want to see any type of
 unaligned SG lists.
 
 But do you think that it should come before the stack can support this?
 
 Yes, absolutely.  The other thing that needs to come first is a proper
 abstraction for MRs instead of hacking another type into all drivers.
 
 
 I'm very much open to the idea of consolidating the memory registration
 code instead of doing it in every ULP (srp, iser, xprtrdma, svcrdma,
 rds, more to come...) using a general memory registration API. The main
 challenge is to abstract the different methods (and considerations) of
 memory registration behind an API. Do we completely mask out the way we
 are doing it? I'm worried that we might end up either compromising on
 performance or trying to understand too much what the caller is trying
 to achieve.

The point of an API like this is to flatten the developer’s learning
curve at the cost of adding another layer of abstraction. For
in-kernel storage ULPs, I’m not convinced that’s a good trade-off.

The other major issue is dealing with multiple registration methods.
Having new ULPs stick with just one or two seems like it could take
care of that without fuss. HCA vendors seem to be settling on FRMR.

But FRMR has some limitations, some of which I discuss below. IMO it
would be better to think about improving existing limitations with
FRMR before building a shim over it. That could help both the
learning curve and the complexity issues.

On with the pre-caffeine musings:

 For example:
 - frwr requires a queue-pair for the post (and it must be the ULP
  queue-pair to ensure the registration is done before the data-transfer
  begins).

The QP does guarantee ordering between registration and use in an RDMA
transfer, but it comes at a cost.

And not just a QP is required, but a QP in RTS (ie, connected). Without
a connected QP, AIUI you can’t invalidate registered FRMRs, you have to
destroy them.

So, for RPC, if the server is down and there are pending RPCs, any FRMRs
associated with that transport are pinned (can’t be re-used or invalidated)
until the server comes back. If an RPC is killed or times out while the
server is down, the associated FRMRs are in limbo.

Registration via WR also means the ULP has to handle the registration and
invalidation completions (or decide to leave them unsignaled, but then
it has to worry about send queue wrapping).

All transport connect operations must be serialized with posting to ensure
that ib_post_send() has a valid (not NULL) QP and ID handle. And, if a QP
or ID handle is used during completion handling, you have to be careful
there too.

(Not to say any of this is impossible to deal with. Obviously RPC/RDMA
works today.)

 While fmrs does not need the queue-pair.

 - the ULP would probably always initiate data transfer after the
  registration (send a request or do the rdma r/w). It is useful to
  link the frwr post with the next wr in a single post_send call.
  I wonder how an API would allow such a thing (while other registration
  methods don't use the work request interface).

rkey management is also important. Registration is done before use so
the ULP can send the rkey for that FRMR to the remote to initiate the
remote DMA operation.

However, after a transport disconnect, the rkey in the FRMR may not be
the same one the hardware knows about. Recovery in this case means the
FRMR has to be destroyed. Invalidating the FRMR also requires the ULP
to know the hardware rkey, so it doesn’t work in this case.

What all this means is that significant extra complexity is required to
deal with transport disconnection, which is very rare compared to normal
data transfer operations.

FMR, for instance, has much less of an issue here because the map and
unmap verbs are synchronous, do not rely on having a QP, and do not
generate send 

Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size

2015-06-09 Thread Doug Ledford
On Tue, 2015-06-09 at 09:18 -0500, Steve Wise wrote:
 

Why are you using a cast here at all?  bar2_pa is already u64...
   
  
   So it should just have the (uintptr_t) cast?
  
  No, it should be no cast at all.  The uintptr_t cast is only for casting
  an int->ptr or ptr->int.  In those cases, if the size of an int != size
  of ptr, you can lose data, and uintptr_t tells the compiler "I know I'm
  casting between possibly lossy data sizes and either A) I've checked and
  it's OK or B) I'm OK with ptr truncation and the loss won't hurt us."
  It basically turns off size checks when sticking a ptr into an int.  You
  should therefore use it only in those circumstances.  For example, when
  storing a cookie that doesn't have a strict uniqueness requirement, the
  loss due to truncation is probably OK.  Or if you know you are only
  doing something like initially storing an int into a pointer, and then
  later storing that pointer back into an int, so there can never be any
  truncation because the source of the ptr was always int sized.  Those
  are the times to use uintptr_t.  In this case, you have a real u64 going
  into a real u64, so there should be no casts.
 
 
 My bad.  I thought bar2_pa was a ptr...

I didn't look up the actual structure definition, but:

+   cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
+ &cq->bar2_qid,
+ user ? &cq->bar2_pa : NULL);

+void __iomem *c4iw_bar2_addrs(struct c4iw_rdev *rdev, unsigned int qid,
+ enum cxgb4_bar2_qtype qtype,
+ unsigned int *pbar2_qid, u64 *pbar2_pa)

Looks like either it's a u64 or else there should be compiler warnings
about passing cq->bar2_pa to c4iw_bar2_addrs.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD





[PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink

2015-06-09 Thread kaike . wan
From: Kaike Wan kaike@intel.com

This patch adds netlink defines for SA client, local service group, local
service operations, and related attributes.

Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: John Fleck john.fl...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Reviewed-by: Sean Hefty sean.he...@intel.com
---
 include/uapi/rdma/rdma_netlink.h |   82 ++
 1 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 6e4bb42..341e9be 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -7,12 +7,14 @@ enum {
RDMA_NL_RDMA_CM = 1,
RDMA_NL_NES,
RDMA_NL_C4IW,
+   RDMA_NL_SA,
RDMA_NL_NUM_CLIENTS
 };
 
 enum {
RDMA_NL_GROUP_CM = 1,
RDMA_NL_GROUP_IWPM,
+   RDMA_NL_GROUP_LS,
RDMA_NL_NUM_GROUPS
 };
 
@@ -128,5 +130,85 @@ enum {
IWPM_NLA_ERR_MAX
 };
 
+/* Local service group opcodes */
+enum {
+   RDMA_NL_LS_OP_RESOLVE = 0,
+   RDMA_NL_LS_OP_SET_TIMEOUT,
+   RDMA_NL_LS_NUM_OPS
+};
+
+/* Local service netlink message flags */
+#define RDMA_NL_LS_F_OK		0x0100	/* Success response */
+#define RDMA_NL_LS_F_ERR	0x0200	/* Failed response */
+
+/* Local service attribute type */
+enum {
+   LS_NLA_TYPE_STATUS = 0,
+   LS_NLA_TYPE_PATH_RECORD,
+   LS_NLA_TYPE_TIMEOUT,
+   LS_NLA_TYPE_SERVICE_ID,
+   LS_NLA_TYPE_DGID,
+   LS_NLA_TYPE_SGID,
+   LS_NLA_TYPE_TCLASS,
+   LS_NLA_TYPE_REVERSIBLE,
+   LS_NLA_TYPE_NUM_PATH,
+   LS_NLA_TYPE_PKEY,
+   LS_NLA_TYPE_QOS_CLASS,
+   LS_NLA_TYPE_MAX
+};
+
+/* Local service status attribute */
+enum {
+   LS_NLA_STATUS_SUCCESS = 0,
+   LS_NLA_STATUS_EINVAL,
+   LS_NLA_STATUS_ENODATA,
+   LS_NLA_STATUS_MAX
+};
+
+struct rdma_nla_ls_status {
+   __u32   status;
+};
+
+/* Local service pathrecord attribute: struct ib_path_rec_data */
+
+/* Local service timeout attribute */
+struct rdma_nla_ls_timeout {
+   __u32   timeout;
+};
+
+/* Local Service ServiceID attribute */
+struct rdma_nla_ls_service_id {
+   __be64  service_id;
+};
+
+/* Local Service DGID/SGID attribute: big endian */
+struct rdma_nla_ls_gid {
+   __u8gid[16];
+};
+
+/* Local Service Traffic Class attribute */
+struct rdma_nla_ls_tclass {
+   __u8tclass;
+};
+
+/* Local Service Reversible attribute */
+struct rdma_nla_ls_reversible {
+   __u32   reversible;
+};
+
+/* Local Service numb_path attribute */
+struct rdma_nla_ls_numb_path {
+   __u8numb_path;
+};
+
+/* Local Service Pkey attribute*/
+struct rdma_nla_ls_pkey {
+   __be16  pkey;
+};
+
+/* Local Service Qos Class attribute */
+struct rdma_nla_ls_qos_class {
+   __be16  qos_class;
+};
 
 #endif /* _UAPI_RDMA_NETLINK_H */
-- 
1.7.1



[PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink

2015-06-09 Thread kaike . wan
From: Kaike Wan kaike@intel.com

This patch routes an SA pathrecord query to netlink first and processes the
response appropriately. If a failure is returned, the request will be sent
through IB. The decision whether to route the request to netlink first is
determined by the presence of a listener for the local service netlink
multicast group. If the user-space local service netlink multicast group
listener is not present, the request will be sent through IB, just as is
currently done.

Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: John Fleck john.fl...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Reviewed-by: Sean Hefty sean.he...@intel.com
---
 drivers/infiniband/core/sa_query.c |  515 +++-
 include/uapi/rdma/rdma_netlink.h   |2 +-
 2 files changed, 515 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c 
b/drivers/infiniband/core/sa_query.c
index 17e1cf7..e063593 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -45,12 +45,21 @@
 #include uapi/linux/if_ether.h
 #include rdma/ib_pack.h
 #include rdma/ib_cache.h
+#include rdma/rdma_netlink.h
+#include net/netlink.h
+#include uapi/rdma/ib_user_sa.h
+#include rdma/ib_marshall.h
 #include sa.h
 
 MODULE_AUTHOR(Roland Dreier);
 MODULE_DESCRIPTION(InfiniBand subnet administration query support);
 MODULE_LICENSE(Dual BSD/GPL);
 
+#define IB_SA_LOCAL_SVC_TIMEOUT_MIN100
+#define IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT2000
+#define IB_SA_LOCAL_SVC_TIMEOUT_MAX20
+static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
+
 struct ib_sa_sm_ah {
struct ib_ah*ah;
struct kref  ref;
@@ -80,8 +89,24 @@ struct ib_sa_query {
struct ib_mad_send_buf *mad_buf;
struct ib_sa_sm_ah *sm_ah;
int id;
+   u32 flags;
 };
 
+#define IB_SA_ENABLE_LOCAL_SERVICE 0x0001
+#define IB_SA_CANCEL   0x0002
+
+#define IB_SA_LOCAL_SVC_ENABLED(query) \
+   ((query)->flags & IB_SA_ENABLE_LOCAL_SERVICE)
+#define IB_SA_ENABLE_LOCAL_SVC(query) \
+   ((query)->flags |= IB_SA_ENABLE_LOCAL_SERVICE)
+#define IB_SA_DISABLE_LOCAL_SVC(query) \
+   ((query)->flags &= ~IB_SA_ENABLE_LOCAL_SERVICE)
+
+#define IB_SA_QUERY_CANCELLED(query) \
+   ((query)->flags & IB_SA_CANCEL)
+#define IB_SA_CANCEL_QUERY(query) \
+   ((query)->flags |= IB_SA_CANCEL)
+
 struct ib_sa_service_query {
void (*callback)(int, struct ib_sa_service_rec *, void *);
void *context;
@@ -106,6 +131,26 @@ struct ib_sa_mcmember_query {
struct ib_sa_query sa_query;
 };
 
+struct ib_nl_request_info {
+   struct list_head list;
+   u32 seq;
+   unsigned long timeout;
+   struct ib_sa_query *query;
+};
+
+struct ib_nl_attr_info {
+   u16 len;/* Total attr len: header + payload + padding */
+   ib_sa_comp_mask comp_mask;
+   void *input;
+   void (*set_attrs)(struct sk_buff *skb, struct ib_nl_attr_info *info);
+};
+
+static LIST_HEAD(ib_nl_request_list);
+static DEFINE_SPINLOCK(ib_nl_request_lock);
+static atomic_t ib_nl_sa_request_seq;
+static struct workqueue_struct *ib_nl_wq;
+static struct delayed_work ib_nl_timed_work;
+
 static void ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device);
 
@@ -381,6 +426,433 @@ static const struct ib_field guidinfo_rec_table[] = {
  .size_bits= 512 },
 };
 
+static int ib_nl_send_msg(int opcode, struct ib_nl_attr_info *attrs, u32 seq)
+{
+   struct sk_buff *skb = NULL;
+   struct nlmsghdr *nlh;
+   void *data;
+   int ret = 0;
+
+   if (attrs->len <= 0)
+   return -EMSGSIZE;
+
+   skb = nlmsg_new(attrs->len, GFP_KERNEL);
+   if (!skb) {
+   pr_err("alloc failed ret=%d\n", ret);
+   return -ENOMEM;
+   }
+
+   /* Put nlmsg header only for now */
+   data = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_SA,
+   opcode, GFP_KERNEL);
+   if (!data) {
+   kfree_skb(skb);
+   return -EMSGSIZE;
+   }
+
+   /* Add attributes */
+   attrs->set_attrs(skb, attrs);
+
+   /* Repair the nlmsg header length */
+   nlmsg_end(skb, nlh);
+
+   ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+   if (!ret) {
+   ret = attrs->len;
+   } else {
+   if (ret != -ESRCH)
+   pr_err("ibnl_multicast failed l=%d, r=%d\n",
+  attrs->len, ret);
+   ret = 0;
+   }
+   return ret;
+}
+
+static struct ib_nl_request_info *
+ib_nl_alloc_request(struct ib_sa_query *query)
+{
+   struct ib_nl_request_info *rinfo;
+
+   rinfo = kzalloc(sizeof(*rinfo), GFP_ATOMIC);
+   if (rinfo == NULL)
+   return ERR_PTR(-ENOMEM);
+
+   

[PATCH v4 1/1] ibacm: Add support for pathrecord query through netlink

2015-06-09 Thread kaike . wan
From: Kaike Wan kaike@intel.com

This patch enables ibacm to process pathrecord queries through netlink.
Since ibacm can cache pathrecords, this implementation provides an easy
pathrecord cache for kernel components and therefore offers a great
performance advantage on large fabric systems.

Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: John Fleck john.fl...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Reviewed-by: Sean Hefty sean.he...@intel.com
---
 src/acm.c |  360 -
 1 files changed, 357 insertions(+), 3 deletions(-)

diff --git a/src/acm.c b/src/acm.c
index 7649725..d180bd2 100644
--- a/src/acm.c
+++ b/src/acm.c
@@ -46,6 +46,8 @@
 #include <infiniband/acm_prov.h>
 #include <infiniband/umad.h>
 #include <infiniband/verbs.h>
+#include <infiniband/umad_types.h>
+#include <infiniband/umad_sa.h>
 #include "dlist.h"
 #include <dlfcn.h>
 #include <search.h>
@@ -55,6 +57,8 @@
 #include <netinet/in.h>
 #include <linux/netlink.h>
 #include <linux/rtnetlink.h>
+#include <rdma/rdma_netlink.h>
+#include <rdma/ib_user_sa.h>
 #include <poll.h>
 #include "acm_mad.h"
 #include "acm_util.h"
@@ -66,6 +70,7 @@
 #define MAX_EP_ADDR 4
 #define NL_MSG_BUF_SIZE 4096
 #define ACM_PROV_NAME_SIZE 64
+#define NL_CLIENT_INDEX 0
 
 struct acmc_subnet {
DLIST_ENTRYentry;
@@ -151,6 +156,26 @@ struct acmc_sa_req {
struct acm_sa_mad   mad;
 };
 
+struct acm_nl_status {
+   struct nlattr   attr_hdr;
+   struct rdma_nla_ls_status   status;
+};
+
+struct acm_nl_path {
+   struct nlattr   attr_hdr;
+   struct ib_path_rec_data rec;
+};
+
+struct acm_nl_msg {
+   struct nlmsghdr nlmsg_header;
+   union {
+   uint8_t data[ACM_MSG_DATA_LENGTH];
+   struct nlattr   attr[0];
+   struct acm_nl_statusstatus[0];
+   struct acm_nl_path  path[0];
+   };
+};
+
static char def_prov_name[ACM_PROV_NAME_SIZE] = "ibacmp";
 static DLIST_ENTRY provider_list;
 static struct acmc_prov *def_provider = NULL;
static struct acmc_ep *acm_find_ep(struct acmc_port *port, uint16_t pkey);
static int acm_ep_insert_addr(struct acmc_ep *ep, const char *name,
			      uint8_t *addr, size_t addr_len, uint8_t addr_type);
 static void acm_event_handler(struct acmc_device *dev);
+static int acm_nl_send(SOCKET sock, struct acm_msg *msg);
 
 static struct sa_data {
int timeout;
@@ -466,7 +492,11 @@ int acm_resolve_response(uint64_t id, struct acm_msg *msg)
goto release;
}
 
-   ret = send(client->sock, (char *) msg, msg->hdr.length, 0);
+   if (id == NL_CLIENT_INDEX)
+   ret = acm_nl_send(client->sock, msg);
+   else
+   ret = send(client->sock, (char *) msg, msg->hdr.length, 0);
+
	if (ret != msg->hdr.length)
		acm_log(0, "ERROR - failed to send response\n");
else
@@ -597,6 +627,8 @@ static void acm_svr_accept(void)
}
 
	for (i = 0; i < FD_SETSIZE - 1; i++) {
+   if (i == NL_CLIENT_INDEX)
+   continue;
if (!atomic_get(client_array[i].refcnt))
break;
}
@@ -1346,6 +1378,323 @@ static void acm_ipnl_handler(void)
}
 }
 
+static int acm_nl_send(SOCKET sock, struct acm_msg *msg)
+{
+   struct sockaddr_nl dst_addr;
+   struct acm_nl_msg acmnlmsg;
+   struct acm_nl_msg *orig;
+   int ret;
+   int datalen;
+
+   orig = (struct acm_nl_msg *) msg->hdr.tid;
+
+   memset(&dst_addr, 0, sizeof(dst_addr));
+   dst_addr.nl_family = AF_NETLINK;
+   dst_addr.nl_groups = (1 << (RDMA_NL_GROUP_LS - 1));
+
+   memset(&acmnlmsg, 0, sizeof(acmnlmsg));
+   acmnlmsg.nlmsg_header.nlmsg_len = NLMSG_HDRLEN;
+   acmnlmsg.nlmsg_header.nlmsg_pid = getpid();
+   acmnlmsg.nlmsg_header.nlmsg_type = orig->nlmsg_header.nlmsg_type;
+   acmnlmsg.nlmsg_header.nlmsg_flags = NLM_F_REQUEST;
+   acmnlmsg.nlmsg_header.nlmsg_seq = orig->nlmsg_header.nlmsg_seq;
+
+   if (msg->hdr.status != ACM_STATUS_SUCCESS) {
+   acm_log(2, "acm status no success = %d\n", msg->hdr.status);
+   acmnlmsg.nlmsg_header.nlmsg_flags |= RDMA_NL_LS_F_ERR;
+   acmnlmsg.nlmsg_header.nlmsg_len +=
+   NLA_ALIGN(sizeof(struct acm_nl_status));
+   acmnlmsg.status[0].attr_hdr.nla_type = LS_NLA_TYPE_STATUS;
+   acmnlmsg.status[0].attr_hdr.nla_len = NLA_HDRLEN +
+   sizeof(struct rdma_nla_ls_status);
+   if (msg->hdr.status == ACM_STATUS_EINVAL)
+   acmnlmsg.status[0].status.status = LS_NLA_STATUS_EINVAL;
+   else
+   acmnlmsg.status[0].status.status =
+   LS_NLA_STATUS_ENODATA;
+ 
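The message is cut off at this point. Presumably the success branch packs one
struct acm_nl_path attribute per resolved path record, and the function
finishes by pushing the reply onto the netlink socket. A sketch of that tail,
with the details inferred rather than quoted (datalen and ret are declared at
the top of the function):

	}
	/* else: pack one struct acm_nl_path attribute per resolved
	 * path record (layout per the kernel-side series) */

	datalen = NLMSG_ALIGN(acmnlmsg.nlmsg_header.nlmsg_len);
	ret = sendto(sock, &acmnlmsg, datalen, 0,
		     (struct sockaddr *) &dst_addr, sizeof(dst_addr));
	if (ret == datalen)
		ret = msg->hdr.length;	/* acm_resolve_response compares this */

	return ret;
}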

RE: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size

2015-06-09 Thread Steve Wise


 -Original Message-
 From: Doug Ledford [mailto:dledf...@redhat.com]
 Sent: Tuesday, June 09, 2015 9:16 AM
 To: Steve Wise
 Cc: 'Hariprasad Shenai'; linux-rdma@vger.kernel.org; lee...@chelsio.com; 
 nirran...@chelsio.com
 Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities 
 exceeding the page size
 
 On Tue, 2015-06-09 at 09:07 -0500, Steve Wise wrote:
 
   -Original Message-
   From: Doug Ledford [mailto:dledf...@redhat.com]
   Sent: Tuesday, June 09, 2015 9:03 AM
   To: Hariprasad Shenai
   Cc: linux-rdma@vger.kernel.org; sw...@opengridcomputing.com; 
   lee...@chelsio.com; nirran...@chelsio.com
   Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities 
   exceeding the page size
  
   On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote:
Handle this configuration:
   
 Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
   
Use cxgb4_bar2_sge_qregs() to obtain the proper location within the
bar2 region for a given qid.
   
Rework the DB and GTS write functions to make use of this bar2 info.
   
Signed-off-by: Steve Wise sw...@opengridcomputing.com
Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
---
 drivers/infiniband/hw/cxgb4/cq.c   | 22 ++--
 drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
 drivers/infiniband/hw/cxgb4/qp.c   | 64 ++
 drivers/infiniband/hw/cxgb4/t4.h   | 60 ---
 5 files changed, 98 insertions(+), 69 deletions(-)
   
diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 68ddb37..8e5bbcb 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq *cq,
		goto err4;

	cq->gen = 1;
+	cq->gts = rdev->lldi.gts_reg;
	cq->rdev = rdev;
-	if (user) {
-		u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;

-		cq->ugts = (u64)rdev->bar2_pa + off;
-	} else if (is_t4(rdev->lldi.adapter_type)) {
-		cq->gts = rdev->lldi.gts_reg;
-		cq->qid_mask = -1U;
-	} else {
-		u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
-
-		cq->gts = rdev->bar2_kva + off;
-		cq->qid_mask = rdev->qpmask;
+	cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
+				      &cq->bar2_qid,
+				      user ? &cq->bar2_pa : NULL);
+	if (user && !cq->bar2_va) {
+		pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
+			pci_name(rdev->lldi.pdev), cq->cqid);
+		ret = -EINVAL;
+		goto err4;
	}
	return 0;
 err4:
@@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
	insert_mmap(ucontext, mm);

	mm2->key = uresp.gts_key;
-	mm2->addr = chp->cq.ugts;
+	mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;
  
   Why are you using a cast here at all?  bar2_pa is already u64...
  
 
  So it should just have the (uintptr_t) cast?
 
 No, it should be no cast at all.  The uintptr_t cast is only for casting
 an int->ptr or ptr->int.  In those cases, if the size of an int != size
 of ptr, you can lose data, and uintptr_t tells the compiler "I know I'm
 casting between possibly lossy data sizes and either A) I've checked and
 it's OK or B) I'm OK with ptr truncation and the loss won't hurt us."
 It basically turns off size checks when sticking a ptr into an int.  You
 should therefore use it only in those circumstances.  For example, when
 storing a cookie that doesn't have a strict uniqueness requirement, the
 loss due to truncation is probably OK.  Or if you know you are only
 doing something like initially storing an int into a pointer, and then
 later storing that pointer back into an int, so there can never be any
 truncation because the source of the ptr was always int sized.  Those
 are the times to use uintptr_t.  In this case, you have a real u64 going
 into a real u64, so there should be no casts.
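
To make that concrete, a minimal sketch of the two situations (invented
names, plain C, with uint64_t standing in for the kernel's u64):

#include <stdint.h>

void cookie_example(void *ptr)
{
	/* ptr -> int: the sizes may differ, so (uintptr_t) marks the
	 * intentional, possibly lossy conversion for the compiler. */
	uint64_t cookie = (uint64_t)(uintptr_t)ptr;

	/* u64 -> u64, same width on both sides: no cast at all,
	 * which is the bar2_pa case above. */
	uint64_t pa = cookie;
	(void)pa;
}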


My bad.  I thought bar2_pa was a ptr...





Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.

2015-06-09 Thread Jerome Glisse
On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
 On Mon, 8 Jun 2015, Jerome Glisse wrote:
  On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
   On Thu, 21 May 2015, j.gli...@gmail.com wrote:
From: Jérôme Glisse jgli...@redhat.com
   
This patch only introduce core HMM functions for registering a new
mirror and stopping a mirror as well as HMM device registering and
unregistering.
   
[...]
   
+/* struct hmm_device_operations - HMM device operation callback
+ */
+struct hmm_device_ops {
+   /* release() - mirror must stop using the address space.
+*
+* @mirror: The mirror that links a process address space with the device.
+*
+* When this is called, the device driver must kill all device threads
+* using this mirror. Also, this callback is the last thing called by
+* HMM, and HMM will not access the mirror struct after this call (ie no
+* more dereferences of it, so it is safe for the device driver to free
+* it). It is called either from:
+*   - mm dying (all processes using this mm exiting).
+*   - hmm_mirror_unregister() (if no other thread holds a reference)
+*   - outcome of some device error reported by any of the device
+* callbacks against that mirror.
+*/
+   void (*release)(struct hmm_mirror *mirror);
+};
  
 The comment that ->release is called when the mm dies doesn't match the
 implementation. ->release is only called when the mirror is destroyed, and
   that can only happen after the mirror has been unregistered. This may not
   happen until after the mm dies.
  
   Is the intent for the driver to get the callback when the mm goes down?
   That seems beneficial so the driver can kill whatever's happening on the
   device. Otherwise the device may continue operating in a dead address
   space until the driver's file gets closed and it unregisters the mirror.
 
  This was the intent before merging free & release. I guess I need to
  reinstate the free versus release callback. Sadly the lifetime for HMM
  is more complex than mmu_notifier as we intend the mirror struct to
  be embedded into a driver private struct.
 
 Can you clarify how that's different from mmu_notifiers? Those are also
 embedded into a driver-owned struct.

For HMM you want to be able to kill a mirror from HMM, and you might have
kernel threads calling into HMM with a mirror (outside any device file
lifetime) ... The mirror is not only used at register & unregister; there are
a lot more things you can call using the HMM mirror struct.

So the HMM mirror lifetime as a result is more complex: it cannot simply be
freed from the mmu_notifier_release callback or at some random point. It
needs to be refcounted. The mmu_notifier code assumes that the mmu_notifier
struct is embedded inside a struct whose lifetime is properly synchronized
with the mm. For the HMM mirror this does not sound like a good idea, as
there are too many ways to get it wrong.

So the idea of the HMM mirror is that it can outlast the mm lifetime but the
HMM struct can not. So you have hmm_mirror ~ hmm -> mm, and the mirror can be
unlinked and have a different lifetime from the hmm, which itself has the
same lifetime as the mm.

 Is the goal to allow calling hmm_mirror_unregister from within the "mm is
 dying" HMM callback? I don't know whether that's really necessary as long
 as there's some association between the driver files and the mirrors.

No, this is not a goal, and I actually forbid that.

 
   If so, I think there's a race here in the case of mm teardown happening
   concurrently with hmm_mirror_unregister.
  
   [...]
  
   Do you agree that this sequence can happen, or am I missing something
   which prevents it?
 
  Can't happen because the child has mm->hmm == NULL, ie only one hmm per mm
  and an hmm is tied to only one mm. It is the responsibility of the device
  driver to make sure the same applies to private references to the hmm
  mirror struct, ie an hmm_mirror should never be tied to a private file
  struct.
 
 It's useful for the driver to have some association between files and
 mirrors. If the file is closed prior to process exit we would like to
 unregister the mirror, otherwise it will persist until process teardown.
 The association doesn't have to be 1:1 but having the files ref count the
 mirror or something would be useful.

This is allowed; I might have put it in strong words here, but you can
associate a mirror with a file struct. What you can not do is use the mirror
from a different process, ie one with a different mm struct, as a mirror is
linked to a single mm. So on fork there is no callback to update the private
file struct when the device file is duplicated (well, just refcount inc)
against a different process. This is something you need to be careful about
in your driver. Inside the dummy driver I abuse that to actually test proper
behavior of HMM, but it should not be used as an example.

rdmacm issue

2015-06-09 Thread Bob Ciotti

We have an issue where lustre servers and clients cannot talk to each other.
There are about 11,000 clients all trying to connect to a server that has
just been rebooted (nbp6-oss3 in this example).
(nbp6-oss3 in this example)

pfe21 is a lustre client that's trying to remount the filesystem from nbp6-oss3.

Running the rping server on pfe21 hangs and waits until the client tries to
connect; it then prints debug information up to "cq_thread started." and
hangs there for a minute or so before issuing the two UNREACHABLE errors:

pfe21 ~ # rping -v -s -d -P -p2 -a 10.151.27.19
port 2
created cm_id 0x60e350
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x60b620 (child)
child cma 0x60b620
created pd 0x60bd80
created channel 0x60bda0
created cq 0x60bdc0
created qp 0x60bf00
rping_setup_buffers called on cb 0x60b8c0
allocated & registered buffers...
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_UNREACHABLE cma_id 0x60b620 (child)
cma event RDMA_CM_EVENT_UNREACHABLE, error -110


The rping client is started below. As soon as it starts, it runs up to the
point of "cq_thread started.", hangs there, and eventually times out as well,
issuing 4 error messages:

nbp6-oss3 ~ # rping -c -vp-d -S 30 -p 2 -a 10.151.27.19
size 30
port 2
created cm_id 0x60f640
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x60f640 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x60f640 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x60ab10
created channel 0x60ab30
created cq 0x60ab50
created qp 0x60ac60
rping_setup_buffers called on cb 0x6072e0
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_UNREACHABLE cma_id 0x60f640 (parent)
cma event RDMA_CM_EVENT_UNREACHABLE, error -110
wait for CONNECTED state 4
connect error -1


Any ideas? The neighbor entries on both sides were in a reachable state
before the test, and the two systems did manage to find one another. Keep in
mind that while this is going on, 11,000+ clients are trying to connect to
nbp6-oss3.

Normally lustre mounts fine and rping has no issues. We have noticed some
neighbor resolution issues and are considering ucast_solicit, mcast_solicit,
and unres_qlen changes, because we also occasionally experience issues with
icmp ping. It's typically been our experience that changing these configs has
little effect. Yes, it's probably the case that unicast arp refresh has long
since failed and the 11,000+ clients may be multicasting for arp.
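
For reference, those knobs live in the per-interface neighbor tables and
would be adjusted along these lines; the interface name and values here are
hypothetical placeholders, since the message does not say what was tried:

# assuming the IPoIB interface is ib0
sysctl -w net.ipv4.neigh.ib0.ucast_solicit=6
sysctl -w net.ipv4.neigh.ib0.mcast_solicit=6
sysctl -w net.ipv4.neigh.ib0.unres_qlen=1024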

Any help or insight greatly appreciated.

thx, bob





RE: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-09 Thread Christoph Lameter
On Mon, 8 Jun 2015, Hefty, Sean wrote:

 You're assuming that the only start time of interest is when a send
 operation has been posted.  Jason asked what I would do with libfabric.
 That interface supports triggered operations.  It has also been designed
 such that a rendezvous (that has to be one of the most difficult words in
 the English language to spell correctly, even with spell check) protocol
 could be implemented by the provider.  On the receive side, it may be of
 interest to report the start and ending time for larger transfers,
 primarily for debugging purposes.

There are multiple problems with libfabric related to the use cases in my
area, most of all the lack of multicast support. Then there is the build-up
of software bloat on top. The interest here is in low latency operations.
Rendezvous and other new features are really not wanted if they increase the
latency.

 I have no idea how the time stamps are expected to be used, so why limit
 it?  An app could just as easily create their own time stamp when reading
 a work completion, especially when the data is going into an anonymous
 receive buffer.  That would seem to work for your use case.

No it cannot, as described earlier. The work can be completed much earlier
than when the polling thread gets around to checking for it. We do that today
since there is nothing better, but this means that there is a gap there.
On the send side you have no easy way of telling when the operation was
complete without the timestamp.

 I have no problem with a bare metal interface exposing this.  But
 pretending that it's generic and that this is the one and only way that
 this could be implemented doesn't make it so.

This is a way it was implemented and it is usable. Shooting for pie in the
sky does not bring us anything. Nor do ideas of requirements from a new
experimental API that does not support the basic features that we need and
seems to be on its way to messing up the latencies of access to RDMA
operations.




Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size

2015-06-09 Thread Doug Ledford
On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote:
 Handle this configuration:
 
 Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
 
 Use cxgb4_bar2_sge_qregs() to obtain the proper location within the
 bar2 region for a given qid.
 
 Rework the DB and GTS write functions to make use of this bar2 info.
 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
 ---
  drivers/infiniband/hw/cxgb4/cq.c   | 22 ++--
  drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
  drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
  drivers/infiniband/hw/cxgb4/qp.c   | 64 ++
  drivers/infiniband/hw/cxgb4/t4.h   | 60 ---
  5 files changed, 98 insertions(+), 69 deletions(-)
 
 diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
 index 68ddb37..8e5bbcb 100644
 --- a/drivers/infiniband/hw/cxgb4/cq.c
 +++ b/drivers/infiniband/hw/cxgb4/cq.c
 @@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq *cq,
 	goto err4;
 
 	cq->gen = 1;
 +	cq->gts = rdev->lldi.gts_reg;
 	cq->rdev = rdev;
 -	if (user) {
 -		u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;
 
 -		cq->ugts = (u64)rdev->bar2_pa + off;
 -	} else if (is_t4(rdev->lldi.adapter_type)) {
 -		cq->gts = rdev->lldi.gts_reg;
 -		cq->qid_mask = -1U;
 -	} else {
 -		u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
 -
 -		cq->gts = rdev->bar2_kva + off;
 -		cq->qid_mask = rdev->qpmask;
 +	cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
 +				      &cq->bar2_qid,
 +				      user ? &cq->bar2_pa : NULL);
 +	if (user && !cq->bar2_va) {
 +		pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
 +			pci_name(rdev->lldi.pdev), cq->cqid);
 +		ret = -EINVAL;
 +		goto err4;
 	}
 	return 0;
  err4:
 @@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
 	insert_mmap(ucontext, mm);
 
 	mm2->key = uresp.gts_key;
 -	mm2->addr = chp->cq.ugts;
 +	mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;

Why are you using a cast here at all?  bar2_pa is already u64...


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD





[PATCH] IB/core: Don't advertise SA in RoCE port capabilities

2015-06-09 Thread Moni Shoua
The Subnet Administrator (SA) is not a component of the RoCE spec.
Therefore, it should not be a capability of a RoCE port.

Change-Id: Iadfaa56bdc9f6e28f46d009064c2d15969293cf7
Signed-off-by: Moni Shoua mo...@mellanox.com
---
 include/rdma/ib_verbs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7d78794..7725cee 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -381,7 +381,6 @@ union rdma_protocol_stats {
 #define RDMA_CORE_PORT_IBA_ROCE(RDMA_CORE_CAP_PROT_ROCE \
| RDMA_CORE_CAP_IB_MAD  \
| RDMA_CORE_CAP_IB_CM   \
-   | RDMA_CORE_CAP_IB_SA   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
-- 
2.1.0
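
The practical effect is that rdma_cap_ib_sa() stops returning true for RoCE
ports, so the SA client skips them. As a sketch of how the bit is consumed
(paraphrased from the ib_verbs.h helpers of this era; the exact body may
differ):

static inline bool rdma_cap_ib_sa(const struct ib_device *device, u8 port_num)
{
	/* per-port capability flags built from the RDMA_CORE_PORT_* defines */
	return device->port_immutable[port_num].core_cap_flags &
	       RDMA_CORE_CAP_IB_SA;
}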



Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected

2015-06-09 Thread Doug Ledford
On Tue, 2015-06-09 at 14:17 +0300, Erez Shitrit wrote:
 On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote:
  On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote:
  On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote:
  
   When switching between modes (datagram / connected) change the MTU
   accordingly.
   datagram mode up to 4K, connected mode up to (64K - 0x10).
 
  Is this a bug fix (describe the user visible impact)? Should it go to
  -stable? Add a Fixes: line?
 
  I'm not sure I would call it a bug.  Setting the MTU to max
  automatically is a policy decision more than anything else.  Currently,
  you have to enable connected mode, then set the MTU.  Both initscripts
  and NetworkManager do this for you on Red Hat, so the user sees no
  difference here.  I can't speak to other OSes.  I'm OK with setting
  connected mode MTU to max by default once we get the scatter/gather
  support for IPoIB added.
 
 Hi Doug,
 There is no connection to the scatter/gather patch in that fix.

Yes, I'm well aware of that.  However, 64k MTU in connected mode requires
order-5 page allocations (with 4K pages, that is 32 contiguous pages, 128K;
a 64K skb plus its overhead rounds up to roughly that).  That has been the
source of a number of bugs in our bug tracking system in the past.  I
wouldn't make it default to turning on 64k MTU without user intervention for
that reason.  Once the SG patch for IPoIB is included, that problem goes away
and I'm OK making IPoIB turn on the large MTU by default.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD





Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected

2015-06-09 Thread Doug Ledford
On Tue, 2015-06-09 at 09:06 +0300, Erez Shitrit wrote:
 On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote:
  On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote:
  On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote:
  
   When switching between modes (datagram / connected) change the MTU
   accordingly.
   datagram mode up to 4K, connected mode up to (64K - 0x10).
 
  Is this a bug fix (describe the user visible impact)? Should it go to
  -stable? Add a Fixes: line?
 
  I'm not sure I would call it a bug.  Setting the MTU to max
  automatically is a policy decision more than anything else.  Currently,
  you have to enable connected mode, then set the MTU.  Both initscripts
  and NetworkManager do this for you on Red Hat, so the user sees no
  difference here.  I can't speak to other OSes.  I'm OK with setting
 
 The only real reason for a user to switch to CM mode is the performance he
 can get, and that comes only from the huge MTU that CM gives, so it somehow
 should be part of the mode setting.
 As you said, RH does it (in the new versions of its OS's only)

No, we do it in all versions of our OS, but you have to set the MTU
parameter, we don't default it to anything for you (but this is no
different than setting CM).  This is normal device setup.  Just like we
don't default Ethernet interfaces to jumbo frames, the user needs to set
the MTU on that interface.  And the user gets benefit at MTUs less than
jumbo size.

  via
 scripting outside of the driver code, other vendor doesn't.
 
  connected mode MTU to max by default once we get the scatter/gather
  support for IPoIB added.
 
 OK, great. Thanks.
 
  --
  Doug Ledford dledf...@redhat.com
GPG KeyID: 0E572FDD
 


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD





Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.

2015-06-09 Thread Mark Hairgrove


On Tue, 9 Jun 2015, Jerome Glisse wrote:

 On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
  Can you clarify how that's different from mmu_notifiers? Those are also
  embedded into a driver-owned struct.
 
 For HMM you want to be able to kill a mirror from HMM, and you might have
 kernel threads calling into HMM with a mirror (outside any device file
 lifetime) ... The mirror is not only used at register & unregister; there
 are a lot more things you can call using the HMM mirror struct.
 
 So the HMM mirror lifetime as a result is more complex: it cannot simply be
 freed from the mmu_notifier_release callback or at some random point. It
 needs to be refcounted.

Sure, there are driver -> HMM calls like hmm_mirror_fault that 
mmu_notifiers don't have, but I don't understand why that fundamentally 
makes HMM mirror lifetimes more complex. Decoupling hmm_mirror lifetime 
from mm lifetime adds complexity too.

 The mmu_notifier code assumes that the mmu_notifier struct is embedded
 inside a struct whose lifetime is properly synchronized with the mm. For
 the HMM mirror this does not sound like a good idea, as there are too many
 ways to get it wrong.

What kind of synchronization with the mm are you referring to here? 
Clients of mmu_notifiers don't have to do anything as far as I know. 
They're guaranteed that the mm won't go away because each registered 
notifier bumps mm_count.

 So the idea of the HMM mirror is that it can outlast the mm lifetime but
 the HMM struct can not. So you have hmm_mirror ~ hmm -> mm, and the mirror
 can be unlinked and have a different lifetime from the hmm, which itself
 has the same lifetime as the mm.

Per the earlier discussion hmm_mirror_destroy is missing a call to 
hmm_unref. If that's added back I don't understand how the mirror can 
persist past the hmm struct. The mirror can be unlinked from hmm's list, 
yes, but that doesn't mean that hmm/mm can be torn down. The hmm/mm 
structs will stick around until hmm_destroy since that does the 
mmu_notifier_unregister. hmm_destroy can't be called until the last 
hmm_mirror_destroy.

Doesn't that mean that hmm/mm are guaranteed to be allocated until the 
last hmm_mirror_unregister? That sounds like a good guarantee to make.
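
A minimal sketch of the lifetime chain being discussed, assuming a kref-style
refcount (names and layout invented for illustration, not the actual HMM
patch):

struct hmm {
	struct kref	ref;		/* one ref per mirror, plus the mm's */
	struct mm_struct *mm;
};

struct hmm_mirror {
	struct hmm	*hmm;
	struct list_head list;
};

static void hmm_release_last(struct kref *ref)
{
	struct hmm *hmm = container_of(ref, struct hmm, ref);

	/* last reference gone: only now is it safe to unregister the
	 * mmu_notifier and free, so hmm cannot outlive its mirrors */
	kfree(hmm);
}

static void hmm_mirror_destroy(struct hmm_mirror *mirror)
{
	list_del(&mirror->list);
	/* the hmm_unref that is said to be missing from the patch */
	kref_put(&mirror->hmm->ref, hmm_release_last);
}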


 
   Is the goal to allow calling hmm_mirror_unregister from within the "mm is
   dying" HMM callback? I don't know whether that's really necessary as long
   as there's some association between the driver files and the mirrors.
 
 No, this is not a goal, and I actually forbid that.
 
  
If so, I think there's a race here in the case of mm teardown happening
concurrently with hmm_mirror_unregister.
   
[...]
   
Do you agree that this sequence can happen, or am I missing something
which prevents it?
  
   Can't happen because the child has mm->hmm == NULL, ie only one hmm per
   mm and an hmm is tied to only one mm. It is the responsibility of the
   device driver to make sure the same applies to private references to the
   hmm mirror struct, ie an hmm_mirror should never be tied to a private
   file struct.
  
  It's useful for the driver to have some association between files and
  mirrors. If the file is closed prior to process exit we would like to
  unregister the mirror, otherwise it will persist until process teardown.
  The association doesn't have to be 1:1 but having the files ref count the
  mirror or something would be useful.
 
 This is allowed; I might have put it in strong words here, but you can
 associate a mirror with a file struct. What you can not do is use the
 mirror from a different process, ie one with a different mm struct, as a
 mirror is linked to a single mm. So on fork there is no callback to update
 the private file struct when the device file is duplicated (well, just
 refcount inc) against a different process. This is something you need to
 be careful about in your driver. Inside the dummy driver I abuse that to
 actually test proper behavior of HMM, but it should not be used as an
 example.

So to confirm, on all file operations from user space the driver is 
expected to check that current->mm matches the mm associated with the 
struct file's hmm_mirror?
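
A rough sketch of such a guard on the driver side (all names invented; this
illustrates the check being discussed, not code from the patch):

static long my_dev_ioctl(struct file *filp, unsigned int cmd,
			 unsigned long arg)
{
	struct my_dev_ctx *ctx = filp->private_data;

	/* a mirror is tied to exactly one mm: reject calls from a
	 * forked child that inherited the fd but has a different mm */
	if (ctx->mirror_mm != current->mm)
		return -EINVAL;

	/* ... normal ioctl handling ... */
	return 0;
}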

On file->release the driver still ought to call hmm_mirror_unregister 
regardless of whether the mms match, otherwise we'll never tear down the 
mirror. That means we're not saved from the race condition because 
hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
might be happening in another thread.


  
  But even if we assume no association at all between files and mirrors, are
  you sure that prevents the race? The driver may choose to unregister the
  hmm_device at any point once its files are closed. In the case of module
  unload the device unregister can't be prevented. If mm teardown hasn't
  happened yet mirrors may still be active and registered on that
  hmm_device. The driver thus has to first call hmm_mirror_unregister on all
  active mirrors, then call hmm_device_unregister. mm teardown of those
  mirrors may trigger at any point in this