Re: [PATCH] IB/core: Get rid of redundant verb ib_destroy_mr
On 6/9/2015 1:55 AM, Hefty, Sean wrote:
>> --- a/drivers/infiniband/hw/mlx5/mr.c
>> +++ b/drivers/infiniband/hw/mlx5/mr.c
>> @@ -1174,6 +1174,18 @@ static int clean_mr(struct mlx5_ib_mr *mr)
>>  	int umred = mr->umred;
>>  	int err;
>>
>> +	if (mr->sig) {
>> +		if (mlx5_core_destroy_psv(dev->mdev,
>> +					  mr->sig->psv_memory.psv_idx))
>> +			mlx5_ib_warn(dev, "failed to destroy mem psv %d\n",
>> +				     mr->sig->psv_memory.psv_idx);
>> +		if (mlx5_core_destroy_psv(dev->mdev,
>> +					  mr->sig->psv_wire.psv_idx))
>> +			mlx5_ib_warn(dev, "failed to destroy wire psv %d\n",
>> +				     mr->sig->psv_wire.psv_idx);
>> +		kfree(mr->sig);
>> +	}
>> +
>>  	if (!umred) {
>>  		err = destroy_mkey(dev, mr);
>>  		if (err) {
>
> This patch doesn't introduce this problem, but the err check suggests that deregistration can fail for some reason. If so, should mr->sig be cleared above, so that a second call to dereg doesn't attempt to free the memory twice?

Hi Sean,

clean_mr() failure is not propagated back at the moment, but you are correct, it's not a good idea to rely on that. I'll clear mr->sig in v2.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Indirect memory registration feature
On 6/9/2015 9:20 AM, Christoph Hellwig wrote:
> On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
>> I wouldn't say this is about offloading bounce buffering to silicon. The RDMA stack always imposed the alignment limitation as we can only give page lists to the devices. Other drivers (qlogic/emulex FC drivers for example) use _arbitrary_ SG lists where each element can point to any {addr, len}.
>
> Those are drivers for protocols that support real SG lists. It seems only Infiniband and NVMe expose this silly limit.

I agree this is indeed a limitation and that's why SG_GAPS was added in the first place. I think the next gen of nvme devices will support real SG lists. This feature enables existing Infiniband devices that can handle SG lists to receive them via the RDMA stack (ib_core). If the memory registration process wasn't such a big fiasco in the first place, wouldn't this way make the most sense?

>>> So please fix it in the proper layers first,
>>
>> I agree that we can take care of bounce buffering in the block layer (or scsi for SG_IO) if the driver doesn't want to see any type of unaligned SG lists. But do you think that it should come before the stack can support this?
>
> Yes, absolutely. The other thing that needs to come first is a proper abstraction for MRs instead of hacking another type into all drivers.

I'm very much open to the idea of consolidating the memory registration code instead of doing it in every ULP (srp, iser, xprtrdma, svcrdma, rds, more to come...) using a general memory registration API. The main challenge is to abstract the different methods (and considerations) of memory registration behind an API. Do we completely mask out the way we are doing it? I'm worried that we might end up either compromising on performance or trying to understand too much what the caller is trying to achieve.
For example:
- frwr requires a queue-pair for the post (and it must be the ULP queue-pair to ensure the registration is done before the data-transfer begins), while fmrs do not need the queue-pair.
- the ULP would probably always initiate data transfer after the registration (send a request or do the rdma r/w). It is useful to link the frwr post with the next wr in a single post_send call. I wonder how an API would allow such a thing (while other registration methods don't use the work request interface).
- There is the fmr_pool API, which tries to tackle the disadvantages of fmrs (very slow unmap) by delaying the fmr unmap until some dirty watermark of remapping is met. I'm not sure how this can be done.
- How would the API choose the method to register memory?
- If there is an alignment issue, do we fail? do we bounce?
- There is the whole T10-DIF support...
...

CC'ing Bart and Chuck, who share the suffering of memory registration.
Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected
On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote:
> On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote:
>> On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote:
>>> When switching between modes (datagram / connected) change the MTU accordingly. datagram mode up to 4K, connected mode up to (64K - 0x10).
>>
>> Is this a bug fix (describe the user visible impact)? Should it go to -stable? Add a Fixes: line?
>
> I'm not sure I would call it a bug. Setting the MTU to max automatically is a policy decision more than anything else. Currently, you have to enable connected mode, then set the MTU. Both initscripts and NetworkManager do this for you on Red Hat, so the user sees no difference here. I can't speak to other OSes. I'm OK with setting connected mode MTU to max by default once we get the scatter/gather support for IPoIB added.

Hi Doug,

There is no connection to the scatter/gather patch in that fix.

> --
> Doug Ledford dledf...@redhat.com
> GPG KeyID: 0E572FDD
Re: [PATCH 0/5] Indirect memory registration feature
On Tue, Jun 09, 2015 at 11:44:16AM +0300, Sagi Grimberg wrote:
> For example:
> - frwr requires a queue-pair for the post (and it must be the ULP queue-pair to ensure the registration is done before the data-transfer begins), while fmrs do not need the queue-pair.

How do you transfer data without a queue pair?

> - the ULP would probably always initiate data transfer after the registration (send a request or do the rdma r/w). It is useful to link the frwr post with the next wr in a single post_send call. I wonder how an API would allow such a thing (while other registration methods don't use the work request interface).
> - There is the fmr_pool API, which tries to tackle the disadvantages of fmrs (very slow unmap) by delaying the fmr unmap until some dirty watermark of remapping is met. I'm not sure how this can be done.
> - How would the API choose the method to register memory?

Seems like the general preference in existing users is FRWR -> FMR -> FR with an exception to use physical for single segment I/O requests.

> - If there is an alignment issue, do we fail? do we bounce?

Leave it to the high level driver for now. NFS can simply split requests when the offset mismatches; for block drivers the enhanced SG_GAPS can take care of it. If indirect is supported the driver does not need to set SG_GAPS.

> - There is the whole T10-DIF support...

Just tell the driver if the registration model supports it. Do the indirect registrations support it, btw?

As far as I can tell the NFS abstraction fits SRP fairly well. iSER is a bit of a mess with its bounce buffering so I'm not too sure about that one.
Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected
On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote:
> On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote:
>> On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote:
>>> When switching between modes (datagram / connected) change the MTU accordingly. datagram mode up to 4K, connected mode up to (64K - 0x10).
>>
>> Is this a bug fix (describe the user visible impact)? Should it go to -stable? Add a Fixes: line?
>
> I'm not sure I would call it a bug. Setting the MTU to max automatically is a policy decision more than anything else. Currently, you have to enable connected mode, then set the MTU. Both initscripts and NetworkManager do this for you on Red Hat, so the user sees no difference here. I can't speak to other OSes. I'm OK with setting

The only real reason for a user to switch to CM mode is the performance he can get, and this comes only from the huge MTU the CM gives, so it somehow should be part of the mode setting. As you said, RH does it (in the new versions of its OS's only) via scripting outside of the driver code; other vendors don't.

> connected mode MTU to max by default once we get the scatter/gather support for IPoIB added.

OK, great. Thanks.

> --
> Doug Ledford dledf...@redhat.com
> GPG KeyID: 0E572FDD
Re: [PATCH 0/5] Indirect memory registration feature
On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
> I wouldn't say this is about offloading bounce buffering to silicon. The RDMA stack always imposed the alignment limitation as we can only give page lists to the devices. Other drivers (qlogic/emulex FC drivers for example) use _arbitrary_ SG lists where each element can point to any {addr, len}.

Those are drivers for protocols that support real SG lists. It seems only Infiniband and NVMe expose this silly limit. So please fix it in the proper layers first,

> I agree that we can take care of bounce buffering in the block layer (or scsi for SG_IO) if the driver doesn't want to see any type of unaligned SG lists. But do you think that it should come before the stack can support this?

Yes, absolutely. The other thing that needs to come first is a proper abstraction for MRs instead of hacking another type into all drivers.
[PATCH for-4.1 1/2] cxgb4: Support for user mode bar2 mappings with T4
Enhance cxgb4_t4_bar2_sge_qregs() and cxgb4_bar2_sge_qregs() to support T4 user mode mappings. Update all the current users as well.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      | 1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 4 +++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  | 1 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c        | 4 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      | 7 ---
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 524d110..e052f05 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1185,6 +1185,7 @@ enum t4_bar2_qtype { T4_BAR2_QTYPE_EGRESS, T4_BAR2_QTYPE_INGRESS };
 int cxgb4_t4_bar2_sge_qregs(struct adapter *adapter,
 			    unsigned int qid,
 			    enum t4_bar2_qtype qtype,
+			    int user,
 			    u64 *pbar2_qoffset,
 			    unsigned int *pbar2_qid);

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 803d91b..a935559 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2145,6 +2145,7 @@ EXPORT_SYMBOL(cxgb4_read_sge_timestamp);
 int cxgb4_bar2_sge_qregs(struct net_device *dev,
 			 unsigned int qid,
 			 enum cxgb4_bar2_qtype qtype,
+			 int user,
 			 u64 *pbar2_qoffset,
 			 unsigned int *pbar2_qid)
 {
@@ -2153,6 +2154,7 @@ int cxgb4_bar2_sge_qregs(struct net_device *dev,
 				 (qtype == CXGB4_BAR2_QTYPE_EGRESS
 				  ? T4_BAR2_QTYPE_EGRESS
 				  : T4_BAR2_QTYPE_INGRESS),
+				 user,
 				 pbar2_qoffset,
 				 pbar2_qid);
 }
@@ -2351,7 +2353,7 @@ static void process_db_drop(struct work_struct *work)
 		int ret;

 		ret = cxgb4_t4_bar2_sge_qregs(adap, qid, T4_BAR2_QTYPE_EGRESS,
-					      &bar2_qoffset, &bar2_qid);
+					      0, &bar2_qoffset, &bar2_qid);
 		if (ret)
 			dev_err(adap->pdev_dev, "doorbell drop recovery: qid=%d, pidx_inc=%d\n",
 				qid, pidx_inc);

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index 78ab4d4..e33934a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -306,6 +306,7 @@ enum cxgb4_bar2_qtype { CXGB4_BAR2_QTYPE_EGRESS, CXGB4_BAR2_QTYPE_INGRESS };
 int cxgb4_bar2_sge_qregs(struct net_device *dev,
 			 unsigned int qid,
 			 enum cxgb4_bar2_qtype qtype,
+			 int user,
 			 u64 *pbar2_qoffset,
 			 unsigned int *pbar2_qid);

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 0d2edda..1b99aec 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2429,8 +2429,8 @@ static void __iomem *bar2_address(struct adapter *adapter,
 	u64 bar2_qoffset;
 	int ret;

-	ret = cxgb4_t4_bar2_sge_qregs(adapter, qid, qtype,
-				      &bar2_qoffset, &pbar2_qid);
+	ret = cxgb4_t4_bar2_sge_qregs(adapter, qid, qtype, 0,
+				      &bar2_qoffset, &pbar2_qid);
 	if (ret)
 		return NULL;

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 5959e3a..2d76a07 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -5102,6 +5102,7 @@ int t4_prep_adapter(struct adapter *adapter)
  * @adapter: the adapter
  * @qid: the Queue ID
  * @qtype: the Ingress or Egress type for @qid
+ * @user: true if this request is for a user mode queue
  * @pbar2_qoffset: BAR2 Queue Offset
  * @pbar2_qid: BAR2 Queue ID or 0 for Queue ID inferred SGE Queues
  *
@@ -5125,6 +5126,7 @@ int t4_prep_adapter(struct adapter *adapter)
 int cxgb4_t4_bar2_sge_qregs(struct adapter *adapter,
 			    unsigned int qid,
 			    enum t4_bar2_qtype qtype,
+			    int user,
 			    u64 *pbar2_qoffset,
 			    unsigned int *pbar2_qid)
 {
@@ -5132,9 +5134,8 @@ int cxgb4_t4_bar2_sge_qregs(struct adapter *adapter,
 	u64 bar2_page_offset, bar2_qoffset;
 	unsigned int bar2_qid, bar2_qid_offset,
[PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
Handle this configuration: Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size

Use cxgb4_bar2_sge_qregs() to obtain the proper location within the bar2 region for a given qid.

Rework the DB and GTS write functions to make use of this bar2 info.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
---
 drivers/infiniband/hw/cxgb4/cq.c       | 22 ++--
 drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
 drivers/infiniband/hw/cxgb4/qp.c       | 64 ++
 drivers/infiniband/hw/cxgb4/t4.h       | 60 ---
 5 files changed, 98 insertions(+), 69 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 68ddb37..8e5bbcb 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq *cq,
 		goto err4;

 	cq->gen = 1;
+	cq->gts = rdev->lldi.gts_reg;
 	cq->rdev = rdev;
-	if (user) {
-		u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;
-		cq->ugts = (u64)rdev->bar2_pa + off;
-	} else if (is_t4(rdev->lldi.adapter_type)) {
-		cq->gts = rdev->lldi.gts_reg;
-		cq->qid_mask = -1U;
-	} else {
-		u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
-
-		cq->gts = rdev->bar2_kva + off;
-		cq->qid_mask = rdev->qpmask;
+	cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
+				      &cq->bar2_qid,
+				      user ? &cq->bar2_pa : NULL);
+	if (user && !cq->bar2_va) {
+		pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
+			pci_name(rdev->lldi.pdev), cq->cqid);
+		ret = -EINVAL;
+		goto err4;
 	}
 	return 0;
 err4:
@@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
 		insert_mmap(ucontext, mm);
 		mm2->key = uresp.gts_key;
-		mm2->addr = chp->cq.ugts;
+		mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;
 		mm2->len = PAGE_SIZE;
 		insert_mmap(ucontext, mm2);
 	}

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 1ffbd03..ab36804 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -788,13 +788,7 @@ static int c4iw_rdev_open(struct c4iw_rdev *rdev)
 		goto err1;
 	}

-	/*
-	 * qpshift is the number of bits to shift the qpid left in order
-	 * to get the correct address of the doorbell for that qp.
-	 */
-	rdev->qpshift = PAGE_SHIFT - ilog2(rdev->lldi.udb_density);
 	rdev->qpmask = rdev->lldi.udb_density - 1;
-	rdev->cqshift = PAGE_SHIFT - ilog2(rdev->lldi.ucq_density);
 	rdev->cqmask = rdev->lldi.ucq_density - 1;
 	PDBG("%s dev %s stag start 0x%0x size 0x%0x num stags %d pbl start 0x%0x size 0x%0x rq start 0x%0x size 0x%0x"
@@ -808,14 +802,12 @@
 		rdev->lldi.vr->qp.size, rdev->lldi.vr->cq.start,
 		rdev->lldi.vr->cq.size);
-	PDBG("udb len 0x%x udb base %p db_reg %p gts_reg %p qpshift %lu "
-	     "qpmask 0x%x cqshift %lu cqmask 0x%x\n",
+	PDBG("udb len 0x%x udb base %p db_reg %p gts_reg %p "
+	     "qpmask 0x%x cqmask 0x%x\n",
 		(unsigned)pci_resource_len(rdev->lldi.pdev, 2),
 		(void *)pci_resource_start(rdev->lldi.pdev, 2),
-		rdev->lldi.db_reg,
-		rdev->lldi.gts_reg,
-		rdev->qpshift, rdev->qpmask,
-		rdev->cqshift, rdev->cqmask);
+		rdev->lldi.db_reg, rdev->lldi.gts_reg,
+		rdev->qpmask, rdev->cqmask);

 	if (c4iw_num_stags(rdev) == 0) {
 		err = -EINVAL;

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index d87e165..7a1b675 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -164,9 +164,7 @@ struct wr_log_entry {
 struct c4iw_rdev {
 	struct c4iw_resource resource;
-	unsigned long qpshift;
 	u32 qpmask;
-	unsigned long cqshift;
 	u32 cqmask;
 	struct c4iw_dev_ucontext uctx;
 	struct gen_pool *pbl_pool;
@@ -1025,6 +1023,9 @@ void c4iw_ev_dispatch(struct c4iw_dev *dev, struct t4_cqe *err_cqe);
 extern struct cxgb4_client t4c_client;
 extern c4iw_handler_func c4iw_handlers[NUM_CPL_CMDS];
+void __iomem *c4iw_bar2_addrs(struct c4iw_rdev *rdev, unsigned int qid,
+			      enum cxgb4_bar2_qtype qtype,
+			      unsigned int *pbar2_qid, u64 *pbar2_pa);
 extern void c4iw_log_wr_stats(struct t4_wq *wq, struct
[PATCH for-4.1 0/2] Adds support for user mode bar2 mapping and bar2 qid density
Hi,

This patch series adds support for user mode bar2 mappings for the T4 adapter and also adds support for bar2 qid densities exceeding the page size.

This patch series has been created against Doug's github tree 'for-4.1' branch and includes patches on the cxgb4 and iw_cxgb4 drivers. We have included all the maintainers of the respective drivers.

Kindly review the changes and let us know in case of any review comments.

Thanks

Hariprasad Shenai (2):
  cxgb4: Support for user mode bar2 mappings with T4
  iw_cxgb4: support for bar2 qid densities exceeding the page size

 drivers/infiniband/hw/cxgb4/cq.c                | 22 -
 drivers/infiniband/hw/cxgb4/device.c            | 16 ++-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |  5 +-
 drivers/infiniband/hw/cxgb4/qp.c                | 64 -
 drivers/infiniband/hw/cxgb4/t4.h                | 60 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  4 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |  4 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |  7 +--
 10 files changed, 109 insertions(+), 75 deletions(-)

--
2.3.4
Re: [PATCH 0/5] Indirect memory registration feature
On Mon, Jun 08, 2015 at 05:42:15PM +0300, Sagi Grimberg wrote:
> I think O_DIRECT allows user-space to generate vectored SG lists that are not aligned as well.

Indeed if you do sub-page size direct I/O. Although SG_GAPS helps with that.
Re: [PATCH v6 1/3] ivtv: use arch_phys_wc_add() and require PAT disabled
On 06/09/2015 02:56 AM, Mauro Carvalho Chehab wrote:
> Em Mon, 08 Jun 2015 17:20:20 -0700
> Luis R. Rodriguez mcg...@do-not-panic.com escreveu:
>
>> From: Luis R. Rodriguez mcg...@suse.com
>>
>> We are burying direct access to MTRR code support on x86 in order to take advantage of PAT. In the future we also want to make the default behaviour of ioremap_nocache() use strong UC; use of mtrr_add() on those systems would make write-combining void. In order both to enable us to later make strong UC the default and to phase out direct MTRR access code, port the driver over to arch_phys_wc_add() and annotate that the device driver requires systems to boot with PAT disabled, with the nopat kernel parameter. This is a worthy compromise given that the hardware is really rare these days, and perhaps only some lost souls in some third world country are expected to be using this feature of the device driver.
>>
>> Acked-by: Andy Walls awa...@md.metrocast.net
>> Cc: Andy Walls awa...@md.metrocast.net
>> Cc: Doug Ledford dledf...@redhat.com
>> Cc: Mauro Carvalho Chehab mche...@osg.samsung.com
>
> Provided that you fix the issues below:
>
> Acked-by: Mauro Carvalho Chehab mche...@osg.samsung.com
>
>> Cc: Andy Lutomirski l...@amacapital.net Cc: Suresh Siddha sbsid...@gmail.com Cc: Ingo Molnar mi...@elte.hu Cc: Thomas Gleixner t...@linutronix.de Cc: Juergen Gross jgr...@suse.com Cc: Daniel Vetter daniel.vet...@ffwll.ch Cc: Dave Airlie airl...@redhat.com Cc: Bjorn Helgaas bhelg...@google.com Cc: Antonino Daplas adap...@gmail.com Cc: Jean-Christophe Plagniol-Villard plagn...@jcrosoft.com Cc: Tomi Valkeinen tomi.valkei...@ti.com Cc: Dave Hansen dave.han...@linux.intel.com Cc: Arnd Bergmann a...@arndb.de Cc: Michael S. Tsirkin m...@redhat.com Cc: Stefan Bader stefan.ba...@canonical.com Cc: Ville Syrjälä syrj...@sci.fi Cc: Mel Gorman mgor...@suse.de Cc: Vlastimil Babka vba...@suse.cz Cc: Borislav Petkov b...@suse.de Cc: Davidlohr Bueso dbu...@suse.de Cc: konrad.w...@oracle.com Cc: ville.syrj...@linux.intel.com Cc: david.vra...@citrix.com Cc: jbeul...@suse.com Cc: toshi.k...@hp.com Cc: Roger Pau Monné roger@citrix.com Cc: linux-fb...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: ivtv-de...@ivtvdriver.org Cc: linux-me...@vger.kernel.org Cc: xen-de...@lists.xensource.com
>> Signed-off-by: Luis R. Rodriguez mcg...@suse.com
>> ---
>>  drivers/media/pci/ivtv/Kconfig  |  3 +++
>>  drivers/media/pci/ivtv/ivtvfb.c | 58 -
>>  2 files changed, 26 insertions(+), 35 deletions(-)
>>
>> diff --git a/drivers/media/pci/ivtv/Kconfig b/drivers/media/pci/ivtv/Kconfig
>> index dd6ee57e..b2a7f88 100644
>> --- a/drivers/media/pci/ivtv/Kconfig
>> +++ b/drivers/media/pci/ivtv/Kconfig
>> @@ -57,5 +57,8 @@ config VIDEO_FB_IVTV
>>  	  This is used in the Hauppauge PVR-350 card. There is a driver homepage at http://www.ivtvdriver.org.
>>
>> +	  If you have this hardware you will need to boot with PAT disabled
>> +	  on your x86 systems, use the nopat kernel parameter.
>> +
>
> Hmm... FB_IVTV is not hardware... it is framebuffer support for IVTV. It is optional to use the FB API for the video output port of this board, instead of using the V4L2 API.

That's not true. It is hardware: it drives the video output OSD which overlays the video output. The reason it is optional is that it is hard to unload a framebuffer module and most people don't need it. So V4L2 drives the video output and ivtvfb drives the OSD overlay. So it is not a case of 'instead of'.

> I would say, instead, something like:
>
> 	In order to use this module, you will need to boot with PAT disabled on x86 systems, using the nopat kernel parameter.

I do agree with this change, but that's because this module is optional and not for the reasons you mentioned above.

Regards,

	Hans

>>  	  To compile this driver as a module, choose M here: the module will be called ivtvfb.
>>
>> diff --git a/drivers/media/pci/ivtv/ivtvfb.c b/drivers/media/pci/ivtv/ivtvfb.c
>> index 9ff1230..7685ae3 100644
>> --- a/drivers/media/pci/ivtv/ivtvfb.c
>> +++ b/drivers/media/pci/ivtv/ivtvfb.c
>> @@ -44,8 +44,8 @@
>>  #include <linux/ivtvfb.h>
>>  #include <linux/slab.h>
>>
>> -#ifdef CONFIG_MTRR
>> -#include <asm/mtrr.h>
>> +#ifdef CONFIG_X86_64
>> +#include <asm/pat.h>
>>  #endif
>>
>>  #include "ivtv-driver.h"
>> @@ -155,12 +155,11 @@ struct osd_info {
>>  	/* Buffer size */
>>  	u32 video_buffer_size;
>>
>> -#ifdef CONFIG_MTRR
>>  	/* video_base rounded down as required by hardware MTRRs */
>>  	unsigned long fb_start_aligned_physaddr;
>>  	/* video_base rounded up as required by hardware MTRRs */
>>  	unsigned long fb_end_aligned_physaddr;
>> -#endif
>> +	int wc_cookie;
>>
>>  	/* Store the buffer offset */
>>  	int set_osd_coords_x;
>> @@ -1099,6 +1098,8 @@ static int ivtvfb_init_vidmode(struct ivtv *itv)
>>  static int ivtvfb_init_io(struct ivtv *itv)
>>  {
>>  	struct osd_info *oi = itv->osd_info;
>> +	/* Find the largest power of two that maps the
Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
On 6/9/2015 12:37 AM, Hefty, Sean wrote:
>> Previously, every vendor implemented its net device notifiers in its own driver. This introduces a huge code duplication as figuring
>
>>  28 files changed, 2253 insertions(+), 860 deletions(-)
>
> How does adding 1400 lines of code help reduce code duplication? Can you please explain and justify why this change is actually needed?

Let's look at this change from several perspectives:

(1) Each vendor lost ~250 lines of GID management code just by this change. In the future it's very probable that more vendor drivers will implement RoCE. This removes the burden and code duplication required by them to implement full RoCE support and is a lot more scalable than the current approach.

(2) All vendors are now aligned. For example, the mlx4 driver had bonding support but ocrdma didn't have such support. The user expects the same behavior regardless of the vendor's driver.

(3) Making something more general usually requires more lines of code, as it introduces an API and doesn't cut corners by assuming anything about the vendor's driver.

(4) This is a pre-requisite to the RoCE V2 series. I'm sure you remember we first submitted this patch-set as a part of the RoCE V2 series. Adding more features to the RoCE GID management will make the code duplication a lot worse than just ~250 lines.

I don't think it's fair playing "let's divide the RoCE V2 patch-set into several patch-sets and then ask why we need this first part at all". Let alone, the other three reasons are more than enough IMHO.

Matan
[PATCH v4 0/4] Sending kernel pathrecord query to user cache server
From: Kaike Wan kaike@intel.com

An SA cache is undeniably critical for fabric scalability and performance. In user space, the ibacm application provides a good example of a pathrecord cache for address and route resolution. With the recent implementation of the provider architecture, ibacm offers more extensibility as an SA cache. In kernel, ipoib implements its own small cache for pathrecords, which is however not available for general use. Furthermore, the implementation of an SA cache in user space offers better flexibility, larger capacity, and more robustness for the system.

In this patch series, a mechanism is implemented to allow ib_sa to send a pathrecord query to a user application (e.g. ibacm) through netlink. Potentially, this mechanism could be easily extended to other SA queries. With a customized test implemented in the rdma_cm module (not included in this series), it was shown that the time to retrieve 1 million pathrecords dropped from 46660 jiffies (46.66 seconds) to 16119 jiffies (16.119 seconds) on a two-node system, a reduction of more than 60%.

Changes since v3:
- Patch 1: added basic RESOLVE attribute types.
- Patch 4: changed the encoding of the RESOLVE request message based on the new attribute types and the input comp_mask. Changed the response handling by iterating all attributes.

Changes since v2:
- Redesigned the communication protocol between the kernel and user space application. Instead of the MAD packet format, the new protocol uses a netlink message header and attributes to exchange request and response between the kernel and user space. The design was described here: http://www.spinics.net/lists/linux-rdma/msg25621.html

Changes since v1:
- Move kzalloc changes into a separate patch (Patch 3).
- Remove redundant include line (Patch 4).
- Rename struct rdma_nl_resp_msg as struct ib_nl_resp_msg (Patch 4).

Kaike Wan (4):
  IB/netlink: Add defines for local service requests through netlink
  IB/core: Check the presence of netlink multicast group listeners
  IB/sa: Allocate SA query with kzalloc
  IB/sa: Route SA pathrecord query through netlink

 drivers/infiniband/core/netlink.c  |   8 +
 drivers/infiniband/core/sa_query.c | 523 +++-
 include/rdma/rdma_netlink.h        |   7 +
 include/uapi/rdma/rdma_netlink.h   |  82 ++
 4 files changed, 615 insertions(+), 5 deletions(-)
RE: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
> -----Original Message-----
> From: Doug Ledford [mailto:dledf...@redhat.com]
> Sent: Tuesday, June 09, 2015 9:03 AM
> To: Hariprasad Shenai
> Cc: linux-rdma@vger.kernel.org; sw...@opengridcomputing.com; lee...@chelsio.com; nirran...@chelsio.com
> Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
>
> On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote:
>> Handle this configuration: Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
>>
>> Use cxgb4_bar2_sge_qregs() to obtain the proper location within the bar2 region for a given qid. Rework the DB and GTS write functions to make use of this bar2 info.
>>
>> Signed-off-by: Steve Wise sw...@opengridcomputing.com
>> Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
>> ---
>>  drivers/infiniband/hw/cxgb4/cq.c       | 22 ++--
>>  drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
>>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
>>  drivers/infiniband/hw/cxgb4/qp.c       | 64 ++
>>  drivers/infiniband/hw/cxgb4/t4.h       | 60 ---
>>  5 files changed, 98 insertions(+), 69 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
>> index 68ddb37..8e5bbcb 100644
>> --- a/drivers/infiniband/hw/cxgb4/cq.c
>> +++ b/drivers/infiniband/hw/cxgb4/cq.c
>> @@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq *cq,
>>  		goto err4;
>>
>>  	cq->gen = 1;
>> +	cq->gts = rdev->lldi.gts_reg;
>>  	cq->rdev = rdev;
>> -	if (user) {
>> -		u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;
>> -		cq->ugts = (u64)rdev->bar2_pa + off;
>> -	} else if (is_t4(rdev->lldi.adapter_type)) {
>> -		cq->gts = rdev->lldi.gts_reg;
>> -		cq->qid_mask = -1U;
>> -	} else {
>> -		u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
>> -
>> -		cq->gts = rdev->bar2_kva + off;
>> -		cq->qid_mask = rdev->qpmask;
>> +	cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
>> +				      &cq->bar2_qid,
>> +				      user ? &cq->bar2_pa : NULL);
>> +	if (user && !cq->bar2_va) {
>> +		pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
>> +			pci_name(rdev->lldi.pdev), cq->cqid);
>> +		ret = -EINVAL;
>> +		goto err4;
>>  	}
>>  	return 0;
>>  err4:
>> @@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
>>  		insert_mmap(ucontext, mm);
>>  		mm2->key = uresp.gts_key;
>> -		mm2->addr = chp->cq.ugts;
>> +		mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;
>
> Why are you using a cast here at all? bar2_pa is already u64...

So it should just have the (uintptr_t) cast?

Steve.
[PATCH v4 2/4] IB/core: Check the presence of netlink multicast group listeners
From: Kaike Wan kaike@intel.com

This patch adds a function to check if listeners for a netlink multicast group are present.

Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: John Fleck john.fl...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Reviewed-by: Sean Hefty sean.he...@intel.com
---
 drivers/infiniband/core/netlink.c |    8 ++++++++
 include/rdma/rdma_netlink.h       |    7 +++++++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/netlink.c b/drivers/infiniband/core/netlink.c
index 23dd5a5..e0fc913 100644
--- a/drivers/infiniband/core/netlink.c
+++ b/drivers/infiniband/core/netlink.c
@@ -49,6 +49,14 @@ static DEFINE_MUTEX(ibnl_mutex);
 static struct sock *nls;
 static LIST_HEAD(client_list);

+int ibnl_chk_listeners(unsigned int group)
+{
+	if (netlink_has_listeners(nls, group) == 0)
+		return -1;
+	return 0;
+}
+EXPORT_SYMBOL(ibnl_chk_listeners);
+
 int ibnl_add_client(int index, int nops,
 		    const struct ibnl_client_cbs cb_table[])
 {

diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 0790882..5852661 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -77,4 +77,11 @@ int ibnl_unicast(struct sk_buff *skb, struct nlmsghdr *nlh,
 int ibnl_multicast(struct sk_buff *skb, struct nlmsghdr *nlh,
 		   unsigned int group, gfp_t flags);

+/**
+ * Check if there are any listeners to the netlink group
+ * @group: the netlink group ID
+ * Returns 0 on success or a negative for no listeners.
+ */
+int ibnl_chk_listeners(unsigned int group);
+
 #endif /* _RDMA_NETLINK_H */
--
1.7.1
[PATCH v4 3/4] IB/sa: Allocate SA query with kzalloc
From: Kaike Wan kaike@intel.com Replace kmalloc with kzalloc so that all uninitialized fields in the SA query will be zeroed out to avoid unintended consequences. This prepares the SA query structure to accept new fields in the future. Signed-off-by: Kaike Wan kaike@intel.com Signed-off-by: John Fleck john.fl...@intel.com Signed-off-by: Ira Weiny ira.we...@intel.com Reviewed-by: Sean Hefty sean.he...@intel.com --- drivers/infiniband/core/sa_query.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index c38f030..17e1cf7 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -739,7 +739,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, port = sa_dev->port[port_num - sa_dev->start_port]; agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); + query = kzalloc(sizeof(*query), gfp_mask); if (!query) return -ENOMEM; @@ -861,7 +861,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, method != IB_SA_METHOD_DELETE) return -EINVAL; - query = kmalloc(sizeof *query, gfp_mask); + query = kzalloc(sizeof(*query), gfp_mask); if (!query) return -ENOMEM; @@ -953,7 +953,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, port = sa_dev->port[port_num - sa_dev->start_port]; agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); + query = kzalloc(sizeof(*query), gfp_mask); if (!query) return -ENOMEM; @@ -1050,7 +1050,7 @@ int ib_sa_guid_info_rec_query(struct ib_sa_client *client, port = sa_dev->port[port_num - sa_dev->start_port]; agent = port->agent; - query = kmalloc(sizeof *query, gfp_mask); + query = kzalloc(sizeof(*query), gfp_mask); if (!query) return -ENOMEM; -- 1.7.1
Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
On Tue, 2015-06-09 at 09:07 -0500, Steve Wise wrote: -Original Message- From: Doug Ledford [mailto:dledf...@redhat.com] Sent: Tuesday, June 09, 2015 9:03 AM To: Hariprasad Shenai Cc: linux-rdma@vger.kernel.org; sw...@opengridcomputing.com; lee...@chelsio.com; nirran...@chelsio.com Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote: Handle this configuration: Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size Use cxgb4_bar2_sge_qregs() to obtain the proper location within the bar2 region for a given qid. Rework the DB and GTS write functions to make use of this bar2 info. Signed-off-by: Steve Wise sw...@opengridcomputing.com Signed-off-by: Hariprasad Shenai haripra...@chelsio.com --- drivers/infiniband/hw/cxgb4/cq.c | 22 ++-- drivers/infiniband/hw/cxgb4/device.c | 16 +++-- drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 5 +-- drivers/infiniband/hw/cxgb4/qp.c | 64 ++ drivers/infiniband/hw/cxgb4/t4.h | 60 --- 5 files changed, 98 insertions(+), 69 deletions(-) diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c index 68ddb37..8e5bbcb 100644 --- a/drivers/infiniband/hw/cxgb4/cq.c +++ b/drivers/infiniband/hw/cxgb4/cq.c @@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq *cq, goto err4; cq->gen = 1; + cq->gts = rdev->lldi.gts_reg; cq->rdev = rdev; - if (user) { - u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK; - cq->ugts = (u64)rdev->bar2_pa + off; - } else if (is_t4(rdev->lldi.adapter_type)) { - cq->gts = rdev->lldi.gts_reg; - cq->qid_mask = -1U; - } else { - u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12; - - cq->gts = rdev->bar2_kva + off; - cq->qid_mask = rdev->qpmask; + cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS, + &cq->bar2_qid, + user ? &
cq->bar2_pa : NULL); + if (user && !cq->bar2_va) { + pr_warn(MOD "%s: cqid %u not in BAR2 range.\n", + pci_name(rdev->lldi.pdev), cq->cqid); + ret = -EINVAL; + goto err4; } return 0; err4: @@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries, insert_mmap(ucontext, mm); mm2->key = uresp.gts_key; - mm2->addr = chp->cq.ugts; + mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa; Why are you using a cast here at all? bar2_pa is already u64... So it should just have the (uintptr_t) cast? No, it should be no cast at all. The uintptr_t cast is only for casting an int->ptr or ptr->int. In those cases, if the size of an int != size of ptr, you can lose data, and uintptr_t tells the compiler "I know I'm casting between possibly lossy data sizes and either A) I've checked and it's OK or B) I'm OK with ptr truncation and the loss won't hurt us." It basically turns off size checks when sticking a ptr into an int. You should therefore use it only in those circumstances. For example, when storing a cookie that doesn't have a strict uniqueness requirement, the loss due to truncation is probably OK. Or if you know you are only doing something like initially storing an int into a pointer, and then later storing that pointer back into an int, so there can never be any truncation because the source of the ptr was always int sized. Those are the times to use uintptr_t. In this case, you have a real u64 going into a real u64, there should be no casts. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD
Re: [PATCH 0/5] Indirect memory registration feature
On Jun 9, 2015, at 4:44 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote: [snip: Sagi's exchange with Christoph Hellwig, quoted in full above] I'm very much open to the idea of consolidating the memory registration code instead of doing it in every ULP (srp, iser, xprtrdma, svcrdma, rds, more to come...) using a general memory registration API. The main challenge is to abstract the different methods (and considerations) of memory registration behind an API. Do we completely mask out the way we are doing it? I'm worried that we might end up either compromising on performance or trying to understand too much what the caller is trying to achieve.
The point of an API like this is to flatten the developer’s learning curve at the cost of adding another layer of abstraction. For in-kernel storage ULPs, I’m not convinced that’s a good trade-off. The other major issue is dealing with multiple registration methods. Having new ULPs stick with just one or two seems like it could take care of that without fuss. HCA vendors seem to be settling on FRMR. But FRMR has some limitations, some of which I discuss below. IMO it would be better to think about improving existing limitations with FRMR before building a shim over it. That could help both the learning curve and the complexity issues. On with the pre-caffeine musings: For example: - frwr requires a queue-pair for the post (and it must be the ULP queue-pair to ensure the registration is done before the data-transfer begins). The QP does guarantee ordering between registration and use in an RDMA transfer, but it comes at a cost. And not just a QP is required, but a QP in RTS (ie, connected). Without a connected QP, AIUI you can’t invalidate registered FRMRs, you have to destroy them. So, for RPC, if the server is down and there are pending RPCs, any FRMRs associated with that transport are pinned (can’t be re-used or invalidated) until the server comes back. If an RPC is killed or times out while the server is down, the associated FRMRs are in limbo. Registration via WR also means the ULP has to handle the registration and invalidation completions (or decide to leave them unsignaled, but then it has to worry about send queue wrapping). All transport connect operations must be serialized with posting to ensure that ib_post_send() has a valid (not NULL) QP and ID handle. And, if a QP or ID handle is used during completion handling, you have to be careful there too. (Not to say any of this is impossible to deal with. Obviously RPC/RDMA works today.) While fmrs does not need the queue-pair. 
- the ULP would probably always initiate data transfer after the registration (send a request or do the rdma r/w). It is useful to link the frwr post with the next wr in a single post_send call. I wonder how an API would allow such a thing (while other registration methods don't use the work request interface). rkey management is also important. Registration is done before use so the ULP can send the rkey for that FRMR to the remote to initiate the remote DMA operation. However, after a transport disconnect, the rkey in the FRMR may not be the same one the hardware knows about. Recovery in this case means the FRMR has to be destroyed. Invalidating the FRMR also requires the ULP to know the hardware rkey, so it doesn’t work in this case. What all this means is that significant extra complexity is required to deal with transport disconnection, which is very rare compared to normal data transfer operations. FMR, for instance, has much less of an issue here because the map and unmap verbs are synchronous, do not rely on having a QP, and do not generate send completions.
Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
On Tue, 2015-06-09 at 09:18 -0500, Steve Wise wrote: Why are you using a cast here at all? bar2_pa is already u64... So it should just have the (uintptr_t) cast? No, it should be no cast at all. [snip: the uintptr_t explanation quoted in full above] In this case, you have a real u64 going into a real u64, there should be no casts. My bad. I thought bar2_pa was a ptr... I didn't look up the actual structure definition, but: + cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS, + &cq->bar2_qid, + user ? &cq->bar2_pa : NULL); +void __iomem *c4iw_bar2_addrs(struct c4iw_rdev *rdev, unsigned int qid, + enum cxgb4_bar2_qtype qtype, + unsigned int *pbar2_qid, u64 *pbar2_pa) Looks like either it's a u64 or else there should be compiler warnings about passing &cq->bar2_pa to c4iw_bar2_addrs. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD
[PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
From: Kaike Wan kaike@intel.com This patch adds netlink defines for SA client, local service group, local service operations, and related attributes. Signed-off-by: Kaike Wan kaike@intel.com Signed-off-by: John Fleck john.fl...@intel.com Signed-off-by: Ira Weiny ira.we...@intel.com Reviewed-by: Sean Hefty sean.he...@intel.com --- include/uapi/rdma/rdma_netlink.h | 82 ++ 1 files changed, 82 insertions(+), 0 deletions(-) diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index 6e4bb42..341e9be 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -7,12 +7,14 @@ enum { RDMA_NL_RDMA_CM = 1, RDMA_NL_NES, RDMA_NL_C4IW, + RDMA_NL_SA, RDMA_NL_NUM_CLIENTS }; enum { RDMA_NL_GROUP_CM = 1, RDMA_NL_GROUP_IWPM, + RDMA_NL_GROUP_LS, RDMA_NL_NUM_GROUPS }; @@ -128,5 +130,85 @@ enum { IWPM_NLA_ERR_MAX }; +/* Local service group opcodes */ +enum { + RDMA_NL_LS_OP_RESOLVE = 0, + RDMA_NL_LS_OP_SET_TIMEOUT, + RDMA_NL_LS_NUM_OPS +}; + +/* Local service netlink message flags */ +#define RDMA_NL_LS_F_OK 0x0100 /* Success response */ +#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */ + +/* Local service attribute type */ +enum { + LS_NLA_TYPE_STATUS = 0, + LS_NLA_TYPE_PATH_RECORD, + LS_NLA_TYPE_TIMEOUT, + LS_NLA_TYPE_SERVICE_ID, + LS_NLA_TYPE_DGID, + LS_NLA_TYPE_SGID, + LS_NLA_TYPE_TCLASS, + LS_NLA_TYPE_REVERSIBLE, + LS_NLA_TYPE_NUM_PATH, + LS_NLA_TYPE_PKEY, + LS_NLA_TYPE_QOS_CLASS, + LS_NLA_TYPE_MAX +}; + +/* Local service status attribute */ +enum { + LS_NLA_STATUS_SUCCESS = 0, + LS_NLA_STATUS_EINVAL, + LS_NLA_STATUS_ENODATA, + LS_NLA_STATUS_MAX +}; + +struct rdma_nla_ls_status { + __u32 status; +}; + +/* Local service pathrecord attribute: struct ib_path_rec_data */ + +/* Local service timeout attribute */ +struct rdma_nla_ls_timeout { + __u32 timeout; +}; + +/* Local Service ServiceID attribute */ +struct rdma_nla_ls_service_id { + __be64 service_id; +}; + +/* Local Service DGID/SGID attribute: big endian */ +struct 
rdma_nla_ls_gid { + __u8 gid[16]; +}; + +/* Local Service Traffic Class attribute */ +struct rdma_nla_ls_tclass { + __u8 tclass; +}; + +/* Local Service Reversible attribute */ +struct rdma_nla_ls_reversible { + __u32 reversible; +}; + +/* Local Service numb_path attribute */ +struct rdma_nla_ls_numb_path { + __u8 numb_path; +}; + +/* Local Service Pkey attribute */ +struct rdma_nla_ls_pkey { + __be16 pkey; +}; + +/* Local Service Qos Class attribute */ +struct rdma_nla_ls_qos_class { + __be16 qos_class; +}; #endif /* _UAPI_RDMA_NETLINK_H */ -- 1.7.1
[PATCH v4 4/4] IB/sa: Route SA pathrecord query through netlink
From: Kaike Wan kaike@intel.com This patch routes a SA pathrecord query to netlink first and processes the response appropriately. If a failure is returned, the request will be sent through IB. The decision whether to route the request to netlink first is determined by the presence of a listener for the local service netlink multicast group. If the user-space local service netlink multicast group listener is not present, the request will be sent through IB, just like what is currently being done. Signed-off-by: Kaike Wan kaike@intel.com Signed-off-by: John Fleck john.fl...@intel.com Signed-off-by: Ira Weiny ira.we...@intel.com Reviewed-by: Sean Hefty sean.he...@intel.com --- drivers/infiniband/core/sa_query.c | 515 +++- include/uapi/rdma/rdma_netlink.h |2 +- 2 files changed, 515 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 17e1cf7..e063593 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -45,12 +45,21 @@ #include uapi/linux/if_ether.h #include rdma/ib_pack.h #include rdma/ib_cache.h +#include rdma/rdma_netlink.h +#include net/netlink.h +#include uapi/rdma/ib_user_sa.h +#include rdma/ib_marshall.h #include sa.h MODULE_AUTHOR(Roland Dreier); MODULE_DESCRIPTION(InfiniBand subnet administration query support); MODULE_LICENSE(Dual BSD/GPL); +#define IB_SA_LOCAL_SVC_TIMEOUT_MIN100 +#define IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT2000 +#define IB_SA_LOCAL_SVC_TIMEOUT_MAX20 +static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT; + struct ib_sa_sm_ah { struct ib_ah*ah; struct kref ref; @@ -80,8 +89,24 @@ struct ib_sa_query { struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; int id; + u32 flags; }; +#define IB_SA_ENABLE_LOCAL_SERVICE 0x0001 +#define IB_SA_CANCEL 0x0002 + +#define IB_SA_LOCAL_SVC_ENABLED(query) \ + ((query)-flags IB_SA_ENABLE_LOCAL_SERVICE) +#define IB_SA_ENABLE_LOCAL_SVC(query) \ + ((query)-flags |= 
IB_SA_ENABLE_LOCAL_SERVICE) +#define IB_SA_DISABLE_LOCAL_SVC(query) \ + ((query)->flags &= ~IB_SA_ENABLE_LOCAL_SERVICE) + +#define IB_SA_QUERY_CANCELLED(query) \ + ((query)->flags & IB_SA_CANCEL) +#define IB_SA_CANCEL_QUERY(query) \ + ((query)->flags |= IB_SA_CANCEL) + struct ib_sa_service_query { void (*callback)(int, struct ib_sa_service_rec *, void *); void *context; @@ -106,6 +131,26 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; +struct ib_nl_request_info { + struct list_head list; + u32 seq; + unsigned long timeout; + struct ib_sa_query *query; +}; + +struct ib_nl_attr_info { + u16 len; /* Total attr len: header + payload + padding */ + ib_sa_comp_mask comp_mask; + void *input; + void (*set_attrs)(struct sk_buff *skb, struct ib_nl_attr_info *info); +}; + +static LIST_HEAD(ib_nl_request_list); +static DEFINE_SPINLOCK(ib_nl_request_lock); +static atomic_t ib_nl_sa_request_seq; +static struct workqueue_struct *ib_nl_wq; +static struct delayed_work ib_nl_timed_work; + static void ib_sa_add_one(struct ib_device *device); static void ib_sa_remove_one(struct ib_device *device); @@ -381,6 +426,433 @@ static const struct ib_field guidinfo_rec_table[] = { .size_bits = 512 }, }; +static int ib_nl_send_msg(int opcode, struct ib_nl_attr_info *attrs, u32 seq) +{ + struct sk_buff *skb = NULL; + struct nlmsghdr *nlh; + void *data; + int ret = 0; + + if (attrs->len <= 0) + return -EMSGSIZE; + + skb = nlmsg_new(attrs->len, GFP_KERNEL); + if (!skb) { + pr_err("alloc failed ret=%d\n", ret); + return -ENOMEM; + } + + /* Put nlmsg header only for now */ + data = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_SA, + opcode, GFP_KERNEL); + if (!data) { + kfree_skb(skb); + return -EMSGSIZE; + } + + /* Add attributes */ + attrs->set_attrs(skb, attrs); + + /* Repair the nlmsg header length */ + nlmsg_end(skb, nlh); + + ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL); + if (!ret) { + ret = attrs->len; + } else { + if (ret != -ESRCH) + pr_err("ibnl_multicast failed l=%d, r=%d\n", + 
attrs->len, ret); + ret = 0; + } + return ret; +} + +static struct ib_nl_request_info * +ib_nl_alloc_request(struct ib_sa_query *query) +{ + struct ib_nl_request_info *rinfo; + + rinfo = kzalloc(sizeof(*rinfo), GFP_ATOMIC); + if (rinfo == NULL) + return ERR_PTR(-ENOMEM); + +
[PATCH v4 1/1] ibacm: Add support for pathrecord query through netlink
From: Kaike Wan kaike@intel.com This patch enables ibacm to process pathrecord queries through netlink. Since ibacm can cache pathrecords, this implementation provides an easy pathrecord cache for kernel components and therefore offers great performance advantage on large fabric systems. Signed-off-by: Kaike Wan kaike@intel.com Signed-off-by: John Fleck john.fl...@intel.com Signed-off-by: Ira Weiny ira.we...@intel.com Reviewed-by: Sean Hefty sean.he...@intel.com --- src/acm.c | 360 - 1 files changed, 357 insertions(+), 3 deletions(-) diff --git a/src/acm.c b/src/acm.c index 7649725..d180bd2 100644 --- a/src/acm.c +++ b/src/acm.c @@ -46,6 +46,8 @@ #include infiniband/acm_prov.h #include infiniband/umad.h #include infiniband/verbs.h +#include infiniband/umad_types.h +#include infiniband/umad_sa.h #include dlist.h #include dlfcn.h #include search.h @@ -55,6 +57,8 @@ #include netinet/in.h #include linux/netlink.h #include linux/rtnetlink.h +#include rdma/rdma_netlink.h +#include rdma/ib_user_sa.h #include poll.h #include acm_mad.h #include acm_util.h @@ -66,6 +70,7 @@ #define MAX_EP_ADDR 4 #define NL_MSG_BUF_SIZE 4096 #define ACM_PROV_NAME_SIZE 64 +#define NL_CLIENT_INDEX 0 struct acmc_subnet { DLIST_ENTRYentry; @@ -151,6 +156,26 @@ struct acmc_sa_req { struct acm_sa_mad mad; }; +struct acm_nl_status { + struct nlattr attr_hdr; + struct rdma_nla_ls_status status; +}; + +struct acm_nl_path { + struct nlattr attr_hdr; + struct ib_path_rec_data rec; +}; + +struct acm_nl_msg { + struct nlmsghdr nlmsg_header; + union { + uint8_t data[ACM_MSG_DATA_LENGTH]; + struct nlattr attr[0]; + struct acm_nl_statusstatus[0]; + struct acm_nl_path path[0]; + }; +}; + static char def_prov_name[ACM_PROV_NAME_SIZE] = ibacmp; static DLIST_ENTRY provider_list; static struct acmc_prov *def_provider = NULL; @@ -172,6 +197,7 @@ static struct acmc_ep *acm_find_ep(struct acmc_port *port, uint16_t pkey); static int acm_ep_insert_addr(struct acmc_ep *ep, const char *name, uint8_t *addr, size_t 
addr_len, uint8_t addr_type); static void acm_event_handler(struct acmc_device *dev); +static int acm_nl_send(SOCKET sock, struct acm_msg *msg); static struct sa_data { int timeout; @@ -466,7 +492,11 @@ int acm_resolve_response(uint64_t id, struct acm_msg *msg) goto release; } - ret = send(client->sock, (char *) msg, msg->hdr.length, 0); + if (id == NL_CLIENT_INDEX) + ret = acm_nl_send(client->sock, msg); + else + ret = send(client->sock, (char *) msg, msg->hdr.length, 0); + + if (ret != msg->hdr.length) acm_log(0, "ERROR - failed to send response\n"); else @@ -597,6 +627,8 @@ static void acm_svr_accept(void) } for (i = 0; i < FD_SETSIZE - 1; i++) { + if (i == NL_CLIENT_INDEX) + continue; if (!atomic_get(client_array[i].refcnt)) break; } @@ -1346,6 +1378,323 @@ static void acm_ipnl_handler(void) } } +static int acm_nl_send(SOCKET sock, struct acm_msg *msg) +{ + struct sockaddr_nl dst_addr; + struct acm_nl_msg acmnlmsg; + struct acm_nl_msg *orig; + int ret; + int datalen; + + orig = (struct acm_nl_msg *) msg->hdr.tid; + + memset(&dst_addr, 0, sizeof(dst_addr)); + dst_addr.nl_family = AF_NETLINK; + dst_addr.nl_groups = (1 << (RDMA_NL_GROUP_LS - 1)); + + memset(&acmnlmsg, 0, sizeof(acmnlmsg)); + acmnlmsg.nlmsg_header.nlmsg_len = NLMSG_HDRLEN; + acmnlmsg.nlmsg_header.nlmsg_pid = getpid(); + acmnlmsg.nlmsg_header.nlmsg_type = orig->nlmsg_header.nlmsg_type; + acmnlmsg.nlmsg_header.nlmsg_flags = NLM_F_REQUEST; + acmnlmsg.nlmsg_header.nlmsg_seq = orig->nlmsg_header.nlmsg_seq; + + if (msg->hdr.status != ACM_STATUS_SUCCESS) { + acm_log(2, "acm status no success = %d\n", msg->hdr.status); + acmnlmsg.nlmsg_header.nlmsg_flags |= RDMA_NL_LS_F_ERR; + acmnlmsg.nlmsg_header.nlmsg_len += + NLA_ALIGN(sizeof(struct acm_nl_status)); + acmnlmsg.status[0].attr_hdr.nla_type = LS_NLA_TYPE_STATUS; + acmnlmsg.status[0].attr_hdr.nla_len = NLA_HDRLEN + + sizeof(struct rdma_nla_ls_status); + if (msg->hdr.status == ACM_STATUS_EINVAL) + acmnlmsg.status[0].status.status = LS_NLA_STATUS_EINVAL; + else + 
acmnlmsg.status[0].status.status = + LS_NLA_STATUS_ENODATA; +
RE: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
-Original Message- From: Doug Ledford [mailto:dledf...@redhat.com] Sent: Tuesday, June 09, 2015 9:16 AM To: Steve Wise Cc: 'Hariprasad Shenai'; linux-rdma@vger.kernel.org; lee...@chelsio.com; nirran...@chelsio.com Subject: Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size [snip: the patch hunk and the uintptr_t explanation, quoted in full above] In this case, you have a real u64 going into a real u64, there should be no casts. My bad. I thought bar2_pa was a ptr...
Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote: On Mon, 8 Jun 2015, Jerome Glisse wrote: On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote: On Thu, 21 May 2015, j.gli...@gmail.com wrote: From: Jérôme Glisse jgli...@redhat.com This patch only introduces core HMM functions for registering a new mirror and stopping a mirror, as well as HMM device registering and unregistering. [...] +/* struct hmm_device_operations - HMM device operation callback + */ +struct hmm_device_ops { + /* release() - mirror must stop using the address space. +* +* @mirror: The mirror that links the process address space with the device. +* +* When this is called, the device driver must kill all device threads using +* this mirror. Also, this callback is the last thing called by HMM and +* HMM will not access the mirror struct after this call (ie no more +* dereference of it so it is safe for the device driver to free it). +* It is called either from: +* - mm dying (all processes using this mm exiting). +* - hmm_mirror_unregister() (if no other thread holds a reference) +* - outcome of some device error reported by any of the device +* callbacks against that mirror. +*/ + void (*release)(struct hmm_mirror *mirror); +}; The comment that ->release is called when the mm dies doesn't match the implementation. ->release is only called when the mirror is destroyed, and that can only happen after the mirror has been unregistered. This may not happen until after the mm dies. Is the intent for the driver to get the callback when the mm goes down? That seems beneficial so the driver can kill whatever's happening on the device. Otherwise the device may continue operating in a dead address space until the driver's file gets closed and it unregisters the mirror. This was the intent before merging free & release. I guess I need to reinstate the free versus release callback.
Sadly the lifetime for HMM is more complex than mmu_notifier, as we intend the mirror struct to be embedded into a driver private struct. Can you clarify how that's different from mmu_notifiers? Those are also embedded into a driver-owned struct. For HMM you want to be able to kill a mirror from HMM; you might have a kernel thread calling into HMM with a mirror (outside any device file lifetime)... The mirror is not only used at register/unregister; there are many more things you can call using the HMM mirror struct. So the HMM mirror lifetime is, as a result, more complex: it cannot simply be freed from the mmu_notifier_release callback or at random. It needs to be refcounted. The mmu_notifier code assumes that the mmu_notifier struct is embedded inside a struct whose lifetime is properly synchronized with the mm. For an HMM mirror this does not sound like a good idea, as there are too many ways to get it wrong. So the idea of the HMM mirror is that it can outlast the mm lifetime but the HMM struct cannot. So you have hmm_mirror -> hmm -> mm, and the mirror can be unlinked and have a different lifetime from the hmm, which itself has the same lifetime as the mm. Is the goal to allow calling hmm_mirror_unregister from within the "mm is dying" HMM callback? I don't know whether that's really necessary as long as there's some association between the driver files and the mirrors. No, this is not a goal, and I actually forbid that. If so, I think there's a race here in the case of mm teardown happening concurrently with hmm_mirror_unregister. [...] Do you agree that this sequence can happen, or am I missing something which prevents it? Can't happen, because the child has mm->hmm = NULL, i.e. there is only one hmm per mm and an hmm is tied to only one mm. It is the responsibility of the device driver to make sure the same applies to private references to the hmm mirror struct, i.e. an hmm_mirror should never be tied to a private file struct. It's useful for the driver to have some association between files and mirrors.
If the file is closed prior to process exit we would like to unregister the mirror, otherwise it will persist until process teardown. The association doesn't have to be 1:1, but having the files refcount the mirror or something would be useful. This is allowed (I might have put it too strongly above), but you can associate with a file struct. What you cannot do is use the mirror from a different process, i.e. one with a different mm struct, as a mirror is linked to a single mm. So on fork there is no callback to update the private file struct when the device file is duplicated (well, just a refcount increment) against a different process. This is something you need to be careful about in your driver. Inside the dummy driver I abuse that to actually test proper behavior of HMM but it
rdmacm issue
We have an issue where lustre servers and clients cannot talk to each other. There are about 11,000 clients all trying to connect to a server that has just been rebooted (nbp6-oss3 in this example). pfe21 is a lustre client that's trying to remount the filesystem from nbp6-oss3. Running an rping server on pfe21 hangs and waits until the client tries to connect, then it prints out debug information up to "cq_thread started." and hangs there for a minute or so until issuing the two UNREACHABLE errors:

pfe21 ~ # rping -v -s -d -P -p2 -a 10.151.27.19
port 2
created cm_id 0x60e350
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x60b620 (child)
child cma 0x60b620
created pd 0x60bd80
created channel 0x60bda0
created cq 0x60bdc0
created qp 0x60bf00
rping_setup_buffers called on cb 0x60b8c0
allocated & registered buffers...
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_UNREACHABLE cma_id 0x60b620 (child)
cma event RDMA_CM_EVENT_UNREACHABLE, error -110

The rping client is started below. As soon as it starts, it runs up to the point of "cq_thread started.", hangs there, and eventually times out as well, issuing 4 error messages:

nbp6-oss3 ~ # rping -c -vp-d -S 30 -p 2 -a 10.151.27.19
size 30
port 2
created cm_id 0x60f640
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x60f640 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x60f640 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x60ab10
created channel 0x60ab30
created cq 0x60ab50
created qp 0x60ac60
rping_setup_buffers called on cb 0x6072e0
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_UNREACHABLE cma_id 0x60f640 (parent)
cma event RDMA_CM_EVENT_UNREACHABLE, error -110
wait for CONNECTED state 4
connect error -1

Any ideas? The neighbor entries on both sides were in a reachable state before the test. And the two systems did manage to find one another. 
Keep in mind that when this is going on, 11,000+ clients are trying to connect to nbp6-oss3. Normally lustre mounts fine and rping has no issues. We have noticed some neighbor resolution issues and are considering ucast_solicit, mcast_solicit, and unres_qlen changes because we also occasionally experience issues with icmp ping. It's typically been the experience that changing these configs has little effect. Yes, it's probably the case that unicast arp refresh has long failed and 11,000+ clients may be multicasting for arp. Any help or insight greatly appreciated. thx, bob -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
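For reference, the knobs mentioned above are per-interface neighbor-table sysctls under net.ipv4.neigh; a hedged sketch of a tuning fragment (the values are purely illustrative, not recommendations for this fabric):

```text
# /etc/sysctl.d/90-neigh.conf -- illustrative values only
# Unicast probes sent to a known lladdr before falling back to multicast ARP:
net.ipv4.neigh.default.ucast_solicit = 6
# Multicast/broadcast ARP probes sent before declaring resolution failed:
net.ipv4.neigh.default.mcast_solicit = 6
# Maximum packets queued per unresolved neighbour entry:
net.ipv4.neigh.default.unres_qlen = 101
```

The "default" component can be replaced with a specific interface name to scope the change to the IPoIB interface alone.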
RE: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, 8 Jun 2015, Hefty, Sean wrote: You're assuming that the only start time of interest is when a send operation has been posted. Jason asked what I would do with libfabric. That interface supports triggered operations. It has also been designed such that a rendezvous (that has to be one of the most difficult words in the English language to spell correctly, even with spell check) protocol could be implemented by the provider. On the receive side, it may be of interest to report the start and ending time for larger transfers, primarily for debugging purposes. There are multiple problems with libfabric related to the use cases in my area. Most of all the lack of multicast support. Then there is the build-up of software bloat on top. The interest here is in low-latency operations. Rendezvous and other new features are really not wanted if they increase the latency. I have no idea how the time stamps are expected to be used, so why limit it? An app could just as easily create its own time stamp when reading a work completion, especially when the data is going into an anonymous receive buffer. That would seem to work for your use case. No it cannot, as described earlier. The work can be completed much earlier than when the polling thread gets around to checking for it. We do that today since there is nothing better, but this means that there is a gap there. On the send side you have no easy way of telling when the operation was complete without the timestamp. I have no problem with a bare-metal interface exposing this. But pretending that it's generic and that this is the one and only way that this could be implemented doesn't make it so. This is a way it was implemented and it's usable. Shooting for pie in the sky does not bring us anything. Nor ideas of requirements from a new experimental API that does not support the basic features that we need and seems to be on its way to messing up the latencies of access to RDMA operations. 
Re: [PATCH for-4.1 2/2] iw_cxgb4: support for bar2 qid densities exceeding the page size
On Tue, 2015-06-09 at 18:23 +0530, Hariprasad Shenai wrote:
> Handle this configuration:
>     Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
>
> Use cxgb4_bar2_sge_qregs() to obtain the proper location within the bar2
> region for a given qid. Rework the DB and GTS write functions to make use
> of this bar2 info.
>
> Signed-off-by: Steve Wise sw...@opengridcomputing.com
> Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
> ---
>  drivers/infiniband/hw/cxgb4/cq.c       | 22 ++--
>  drivers/infiniband/hw/cxgb4/device.c   | 16 +++--
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  5 +--
>  drivers/infiniband/hw/cxgb4/qp.c       | 64 ++
>  drivers/infiniband/hw/cxgb4/t4.h       | 60 ---
>  5 files changed, 98 insertions(+), 69 deletions(-)
>
> diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
> index 68ddb37..8e5bbcb 100644
> --- a/drivers/infiniband/hw/cxgb4/cq.c
> +++ b/drivers/infiniband/hw/cxgb4/cq.c
> @@ -156,19 +156,17 @@ static int create_cq(struct c4iw_rdev *rdev, struct t4_cq *cq,
>  		goto err4;
>
>  	cq->gen = 1;
> +	cq->gts = rdev->lldi.gts_reg;
>  	cq->rdev = rdev;
> -	if (user) {
> -		u32 off = (cq->cqid << rdev->cqshift) & PAGE_MASK;
> -
> -		cq->ugts = (u64)rdev->bar2_pa + off;
> -	} else if (is_t4(rdev->lldi.adapter_type)) {
> -		cq->gts = rdev->lldi.gts_reg;
> -		cq->qid_mask = -1U;
> -	} else {
> -		u32 off = ((cq->cqid << rdev->cqshift) & PAGE_MASK) + 12;
> -
> -		cq->gts = rdev->bar2_kva + off;
> -		cq->qid_mask = rdev->qpmask;
> -	}
> +	cq->bar2_va = c4iw_bar2_addrs(rdev, cq->cqid, T4_BAR2_QTYPE_INGRESS,
> +				      &cq->bar2_qid,
> +				      user ? &cq->bar2_pa : NULL);
> +	if (user && !cq->bar2_va) {
> +		pr_warn(MOD "%s: cqid %u not in BAR2 range.\n",
> +			pci_name(rdev->lldi.pdev), cq->cqid);
> +		ret = -EINVAL;
> +		goto err4;
> +	}
>  	return 0;
> err4:
> @@ -971,7 +969,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
>  		insert_mmap(ucontext, mm);
>  		mm2->key = uresp.gts_key;
> -		mm2->addr = chp->cq.ugts;
> +		mm2->addr = (u64)(uintptr_t)chp->cq.bar2_pa;

Why are you using a cast here at all? bar2_pa is already u64... 
-- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD
[PATCH] IB/core: Don't advertise SA in RoCE port capabilities
The Subnet Administrator (SA) is not a component of the RoCE spec. Therefore, it should not be a capability of a RoCE port.

Change-Id: Iadfaa56bdc9f6e28f46d009064c2d15969293cf7
Signed-off-by: Moni Shoua mo...@mellanox.com
---
 include/rdma/ib_verbs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7d78794..7725cee 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -381,7 +381,6 @@ union rdma_protocol_stats {
 #define RDMA_CORE_PORT_IBA_ROCE	(RDMA_CORE_CAP_PROT_ROCE \
 					| RDMA_CORE_CAP_IB_MAD \
 					| RDMA_CORE_CAP_IB_CM \
-					| RDMA_CORE_CAP_IB_SA \
 					| RDMA_CORE_CAP_AF_IB \
 					| RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP	(RDMA_CORE_CAP_PROT_IWARP \
--
2.1.0
Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected
On Tue, 2015-06-09 at 14:17 +0300, Erez Shitrit wrote: On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote: On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote: On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote: When switching between modes (datagram / connected) change the MTU accordingly: datagram mode up to 4K, connected mode up to (64K - 0x10). Is this a bug fix (describe the user visible impact)? Should it go to -stable? Add a Fixes: line? I'm not sure I would call it a bug. Setting the MTU to max automatically is a policy decision more than anything else. Currently, you have to enable connected mode, then set the MTU. Both initscripts and NetworkManager do this for you on Red Hat, so the user sees no difference here. I can't speak to other OSes. I'm OK with setting connected mode MTU to max by default once we get the scatter/gather support for IPoIB added. Hi Doug, There is no connection to the scatter/gather patch in that fix. Yes, I'm well aware of that. However, 64k MTU in connected mode does order-5 page allocations. That has been the source of a number of bugs in our bug tracking system in the past. I wouldn't make it default to turning on 64k MTU without user intervention for that reason. Once the SG patch for IPoIB is included, that problem goes away and I'm OK making IPoIB turn on the large MTU by default. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD
Re: [PATCH] IB/ipoib: Change To max_cm_mtu when changing mode to connected
On Tue, 2015-06-09 at 09:06 +0300, Erez Shitrit wrote: On Tue, Jun 9, 2015 at 12:04 AM, Doug Ledford dledf...@redhat.com wrote: On Mon, 2015-06-08 at 10:42 -0600, Jason Gunthorpe wrote: On Sun, Jun 07, 2015 at 01:36:11PM +0300, Erez Shitrit wrote: When switching between modes (datagram / connected) change the MTU accordingly: datagram mode up to 4K, connected mode up to (64K - 0x10). Is this a bug fix (describe the user visible impact)? Should it go to -stable? Add a Fixes: line? I'm not sure I would call it a bug. Setting the MTU to max automatically is a policy decision more than anything else. Currently, you have to enable connected mode, then set the MTU. Both initscripts and NetworkManager do this for you on Red Hat, so the user sees no difference here. I can't speak to other OSes. I'm OK with setting The only real reason for a user to switch to CM mode is for the performance he can get, and this is only by the huge MTU the CM gives, so it somehow should be part of the mode setting. As you said, RH does it (in the new versions of its OSes only) No, we do it in all versions of our OS, but you have to set the MTU parameter; we don't default it to anything for you (but this is no different than setting CM). This is normal device setup. Just like we don't default Ethernet interfaces to jumbo frames, the user needs to set the MTU on that interface. And the user gets benefit at MTUs less than jumbo size. via scripting outside of the driver code, other vendor doesn't. connected mode MTU to max by default once we get the scatter/gather support for IPoIB added. OK, great. Thanks. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD
Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
On Tue, 9 Jun 2015, Jerome Glisse wrote: On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote: Can you clarify how that's different from mmu_notifiers? Those are also embedded into a driver-owned struct. For HMM you want to be able to kill a mirror from HMM, you might have a kernel thread call inside HMM with a mirror (outside any device file lifetime) ... The mirror is not only used at register/unregister, there are a lot more things you can call using the HMM mirror struct. So the HMM mirror lifetime as a result is more complex; it can not simply be freed from the mmu_notifier_release callback or randomly. It needs to be refcounted. Sure, there are driver -> HMM calls like hmm_mirror_fault that mmu_notifiers don't have, but I don't understand why that fundamentally makes HMM mirror lifetimes more complex. Decoupling hmm_mirror lifetime from mm lifetime adds complexity too. The mmu_notifier code assumes that the mmu_notifier struct is embedded inside a struct that has a lifetime properly synchronized with the mm. For the HMM mirror this is not something that sounds like a good idea, as there are too many ways to get it wrong. What kind of synchronization with the mm are you referring to here? Clients of mmu_notifiers don't have to do anything as far as I know. They're guaranteed that the mm won't go away because each registered notifier bumps mm_count. So the idea of the HMM mirror is that it can outlast the mm lifetime but the HMM struct can not. So you have hmm_mirror -> hmm -> mm and the mirror can be unlinked and have a different lifetime from the hmm, which itself has the same lifetime as the mm. Per the earlier discussion hmm_mirror_destroy is missing a call to hmm_unref. If that's added back I don't understand how the mirror can persist past the hmm struct. The mirror can be unlinked from hmm's list, yes, but that doesn't mean that hmm/mm can be torn down. The hmm/mm structs will stick around until hmm_destroy since that does the mmu_notifier_unregister. 
hmm_destroy can't be called until the last hmm_mirror_destroy. Doesn't that mean that hmm/mm are guaranteed to be allocated until the last hmm_mirror_unregister? That sounds like a good guarantee to make. Is the goal to allow calling hmm_mirror_unregister from within the "mm is dying" HMM callback? I don't know whether that's really necessary as long as there's some association between the driver files and the mirrors. No, this is not a goal and I actually forbid that. If so, I think there's a race here in the case of mm teardown happening concurrently with hmm_mirror_unregister. [...] Do you agree that this sequence can happen, or am I missing something which prevents it? Can't happen because the child has mm->hmm == NULL, ie only one hmm per mm and hmm is tied to only one mm. It is the responsibility of the device driver to make sure the same applies to private references to the hmm mirror struct, ie hmm_mirror should never be tied to a private file struct. It's useful for the driver to have some association between files and mirrors. If the file is closed prior to process exit we would like to unregister the mirror, otherwise it will persist until process teardown. The association doesn't have to be 1:1 but having the files refcount the mirror or something would be useful. This is allowed, I might have put it in strong words here, but you can associate with a file struct. What you can not do is use the mirror from a different process, ie one with a different mm struct, as the mirror is linked to a single mm. So on fork there is no callback to update the private file struct when the device file is duplicated (well, just a refcount inc) against a different process. This is something you need to be careful about in your driver. Inside the dummy driver I abuse that to actually test proper behavior of HMM but it should not be used as an example. So to confirm, on all file operations from user space the driver is expected to check that current->mm matches the mm associated with the struct file's hmm_mirror? 
On file release the driver still ought to call hmm_mirror_unregister regardless of whether the mms match, otherwise we'll never tear down the mirror. That means we're not saved from the race condition, because hmm_mirror_unregister can happen in one thread while hmm_notifier_release might be happening in another thread. But even if we assume no association at all between files and mirrors, are you sure that prevents the race? The driver may choose to unregister the hmm_device at any point once its files are closed. In the case of module unload the device unregister can't be prevented. If mm teardown hasn't happened yet, mirrors may still be active and registered on that hmm_device. The driver thus has to first call hmm_mirror_unregister on all active mirrors, then call hmm_device_unregister. mm teardown of those mirrors may trigger at any point in this