Re: [PATCH] RDMA/cxgb3: fail get_dma_mr if the memory footprint can exceed 32b

2015-07-23 Thread Doug Ledford
On 07/22/2015 03:14 PM, Steve Wise wrote:
 T3 HW only supports MRs of length  4GB.  If the system can have more
 than that we need to fail dma mr allocation so we con't create a MR that
 cannot span the entire possible memory space.
 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
  drivers/infiniband/hw/cxgb3/iwch_provider.c |4 
  1 files changed, 4 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
 b/drivers/infiniband/hw/cxgb3/iwch_provider.c
 index b1b7323..bbbe018 100644
 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
 +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
 @@ -736,6 +736,10 @@ static struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, 
 int acc)
   /*
* T3 only supports 32 bits of size.
*/
 + if (sizeof(phys_addr_t)  4) {
 + pr_warn_once(MOD Cannot support dma_mrs on this platform.\n);
 + return ERR_PTR(-ENOTSUPP);
 + }
   bl.size = 0x;
   bl.addr = 0;
   kva = 0;

Should this be a static check of the pointer size versus installed
memory?  Would it be possible to have this work for machines with less
than 4GB of physical memory even if they have 64bit pointers, or are you
concerned that hotplug memory could take us over the limit after
registration and cause problems?


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


RE: [PATCH] RDMA/cxgb3: fail get_dma_mr if the memory footprint can exceed 32b

2015-07-23 Thread Steve Wise


 -Original Message-
 From: Doug Ledford [mailto:dledf...@redhat.com]
 Sent: Thursday, July 23, 2015 4:33 PM
 To: Steve Wise
 Cc: linux-rdma@vger.kernel.org
 Subject: Re: [PATCH] RDMA/cxgb3: fail get_dma_mr if the memory footprint can 
 exceed 32b
 
 On 07/22/2015 03:14 PM, Steve Wise wrote:
  T3 HW only supports MRs of length  4GB.  If the system can have more
  than that we need to fail dma mr allocation so we con't create a MR that
  cannot span the entire possible memory space.
 
  Signed-off-by: Steve Wise sw...@opengridcomputing.com
  ---
 
   drivers/infiniband/hw/cxgb3/iwch_provider.c |4 
   1 files changed, 4 insertions(+), 0 deletions(-)
 
  diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
  b/drivers/infiniband/hw/cxgb3/iwch_provider.c
  index b1b7323..bbbe018 100644
  --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
  +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
  @@ -736,6 +736,10 @@ static struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, 
  int acc)
  /*
   * T3 only supports 32 bits of size.
   */
  +   if (sizeof(phys_addr_t)  4) {
  +   pr_warn_once(MOD Cannot support dma_mrs on this platform.\n);
  +   return ERR_PTR(-ENOTSUPP);
  +   }
  bl.size = 0x;
  bl.addr = 0;
  kva = 0;
 
 Should this be a static check of the pointer size versus installed
 memory?  Would it be possible to have this work for machines with less
 than 4GB of physical memory even if they have 64bit pointers, or are you
 concerned that hotplug memory could take us over the limit after
 registration and cause problems?

NFSRDMA doesn't need dma-mrs for T3 since it has FRMR + local dma lkey support. 
 And since the deficiency really can cause problems on 64b systems if the 
memory grows  4GB after dma-mr allocation, I decided to just not allow them 
for potential large memory systems.

Steve

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Hefty, Sean
 The lkey is possibly useful if someone wants to do single op transfers
 larger than the S/G limit of the SQE. I haven't noticed any ULPs doing
 that..

That changes how the buffer is identified, which gets back to my question of 
are we identifying local buffers by address or through some sort of 
iova/tag/descriptor/mr/whatever.  Do we have this right in the API?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mlx5: Expose correct page_size_cap in device attributes

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 05:41:38PM -0400, Doug Ledford wrote:

 I assume this prevents the driver from working at all on certain arches
 (like ppc with 64k page size)?

Nothing uses page_size_cap correctly, so it has no impact.

Sagi, that is a good point, your generic code for the cleanup series
really should check that PAGE_SIZE is in page_size_cap and at least
fail the mr allocation if it isn't...

Maybe that would be enough to drop SRP's bogus usage of it...

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 11:30:08PM +, Hefty, Sean wrote:

 I don't see where usnic, ipath, qib, or opa technically need an lkey
 at all.

The lkey is possibly useful if someone wants to do single op transfers
larger than the S/G limit of the SQE. I haven't noticed any ULPs doing
that..

qib family may not technically need a lkey, but those drivers are the
only modern drivers not to support IB_DEVICE_LOCAL_DMA_LKEY.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 0/2] ocrdma license change

2015-07-23 Thread Doug Ledford
On 07/23/2015 07:33 PM, Devesh Sharma wrote:
 Doug- Thanks for your help!  Resubmitting, per your request
 
 Devesh Sharma (2):
   RDMA/ocrdma: update ocrdma license to dual-license
   RDMA/ocrdma: update ocrdma module license srting
 
  drivers/infiniband/hw/ocrdma/ocrdma.h   |   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_abi.h   |   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_ah.c|   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_ah.h|   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_hw.c|   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_hw.h|   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   55 
 +--
  drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_stats.c |   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_stats.h |   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |   53 -
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   53 -
  12 files changed, 409 insertions(+), 229 deletions(-)

Thanks Devesh.  I've picked these up and have them slated for 4.2-rc so
that we won't need to track down any additional contributors.


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Hefty, Sean
  This may be a nit, but IMO, the use of the term 'rkey' versus 'stag'
 matters.  They convey different ways of finding a data
 buffer.  For
  example, do you locate a buffer using the stag, then verify that the
 offset + length fits into the target buffer?
 
 Yes.  HW always uses the stag to locate a record that contains the stag
 state (valid or invalid), the access flags, the 8b key, the
 va_base, length, PBL describing the host pages, etc.  HW validates all
 that before using the buffer.  NOTE: An stag of 0 is the
 special local-dma-lkey which HW treats differently: If the stag is 0, then
 the address in the SGE is the bus/dma address itself and
 no lookup of a MR/PBL/etc is needed.   Stag 0 can ONLY be used by kernel
 users and MUST never be accepted/used from an ingress
 packet and MUST never be emitted on the wire in a READ or WRITE.
 
  Or do you locate the buffer
  by address, then verify that the key matches?
 
 
 This is never done.

These were more rhetorical questions to highlight that stag is a better choice 
for the name than rkey.  Lkey seems better than stag, though I'd rather see 
lkey removed completely.  I don't see where usnic, ipath, qib, or opa 
technically need an lkey at all.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-23 Thread Chuck Lever

On Jul 23, 2015, at 6:27 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:

 On 7/22/2015 8:57 PM, Jason Gunthorpe wrote:
 On Wed, Jul 22, 2015 at 08:33:16PM +0300, Sagi Grimberg wrote:
   memset(fr_wr, 0, sizeof(fr_wr));
 + ib_set_fastreg_wr(mr, mr-lkey, ISER_FASTREG_LI_WRID,
 +   false, fr_wr);
 
 Shouldn't ib_set_fastreg_wr take care of this memset?  Also it seems
 instead of the singalled flag to it we might just set that or
 other flags later if we really want to.
 
 Seems reasonable.
 
 If you want to micro optimize then just zero the few items that are
 defined to be accessed for fastreg, no need to zero the whole
 structure. Infact, you may have already done that, so just drop the
 memset entirely.
 
 I will.
 
 
 The reason I didn't put it in was that ib_send_wr is not a small struct
 (92 bytes IIRC). So I'm a bit reluctant to add an unconditional memset.
 Maybe it's better that the callers can carefully set it to save some
 cycles?
 
 If you want to optimize this path, then Sean is right, move the post
 into the driver and stop pretending that ib_post_send is a performance
 API.
 
 ib_post_fastreg_wr would be a function that needs 3 register passed
 arguments and does a simple copy to the driver's actual sendq
 
 That will require to take the SQ lock and write a doorbell for each
 registration and post you want to do. I'm confident that constructing
 a post chain with a single sq lock acquire and a single doorbell will
 be much much better even with conditional jumps and memsets.

I agree. xprtrdma uses several MRs per RPC. It would be more efficient
to chain together several WRs and post once to deal with these,
especially for HCAs/providers that have a shallow page_list depth.


 svcrdma, isert (and iser - not upstream yet) are doing it. I think that
 others should do it too. My tests shows that this makes a difference in
 small IO workloads.


--
Chuck Lever



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rdma_getaddrinfo and GUID

2015-07-23 Thread Hefty, Sean
 I'm use
 cat /sys/class/infiniband/mlx4_0/ports/1/gids/0
 
 fe80::::0002:c903:00ef:6601
 
 In node I'm use gid provided below service null, hints contains af_ib and
 rai_family|rai_numerichost

Ah, yes, the rdma_cm supports GIDs.  It does not support GUIDs.  Though it's 
usually trivial to convert a GUID into a GID.


Re: [PATCH v8 4/4] IB/sa: Route SA pathrecord query through netlink

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 07:26:32PM +, Wan, Kaike wrote:
 The nl sequence increased sequentially. What's the difference?

Hmm, looks like in my version the DGID was getting mangled and this
causes ipoib to enter a tight query loop..

Not your problem.
  Actually, looks like nothing removes nl queries from the timeout loop when
  they successfully complete. (ie this series doesn't even have a hope of
  working properly)
 
 That is incorrect. The first thing in ib_nl_handle_resolve_resp is
 to find the request and remove it from the ib_nl_rquest_list. the
 timeout routine will remove the entries that have timed out from the
 nl request list.

Yep, okay, I see it now.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-23 Thread Steve Wise


 -Original Message-
 From: Sagi Grimberg [mailto:sa...@dev.mellanox.co.il]
 Sent: Thursday, July 23, 2015 5:21 AM
 To: Steve Wise; Sagi Grimberg; linux-rdma@vger.kernel.org
 Cc: Liran Liss; Oren Duer
 Subject: Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API
 
 On 7/22/2015 10:21 PM, Steve Wise wrote:
 
  On 7/22/2015 1:55 AM, Sagi Grimberg wrote:
  Signed-off-by: Sagi Grimberg sa...@mellanox.com
  ---
net/sunrpc/xprtrdma/frwr_ops.c  | 80
  ++---
net/sunrpc/xprtrdma/xprt_rdma.h |  4 ++-
2 files changed, 47 insertions(+), 37 deletions(-)
 
  Did you intend to change svcrdma as well?
 
 All the ULPs need to convert. I didn't have a chance to convert
 svcrdma yet. Want to take it?

Not right now.  My focus is still on enabling iSER.


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd-local_dma_lkey

2015-07-23 Thread Bart Van Assche

On 07/22/15 16:34, Jason Gunthorpe wrote:

The remaining users of ib_get_dma_mr are all unsafe:

   [ ... ]

  drivers/infiniband/ulp/srp/ib_srp.c:
srp_dev-mr = ib_get_dma_mr(srp_dev-pd,
IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_READ |
IB_ACCESS_REMOTE_WRITE);


Hello Jason,

This statement might need some clarification. Are you aware that this 
memory region is only used if the kernel module parameter 
register_always is zero ?


I will try to find some time to test the SRP changes in this series but 
I'm not sure yet when I will be able to do that.


Bart.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 40/43] mlx5: Allocate private context for arbitrary scatterlist registration

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 02:25:32AM -0700, Christoph Hellwig wrote:
 On Wed, Jul 22, 2015 at 11:30:48AM -0600, Jason Gunthorpe wrote:
  On Wed, Jul 22, 2015 at 09:55:40AM +0300, Sagi Grimberg wrote:
   + size += max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
   + mr-klms = kzalloc(size, GFP_KERNEL);
   + if (!mr-klms)
   + return -ENOMEM;
   +
   + mr-pl_map = dma_map_single(device-dma_device, mr-klms,
   + size, DMA_TO_DEVICE);
  
  This is a misuse of the DMA API, you must call dma_map_single after
  the memory is set by the CPU, not before.
 
  The fast reg varient is using coherent allocations, which is OK..
 
 It's fine as long as you dma_sync_*_for_{cpu,device} in the right
 places, which is what a lot of drivers do for longer held allocations.

Right, that is the other better option.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 01:19:16PM +0300, Sagi Grimberg wrote:
 Again, related to my prior comments, please have two of these:
 
 ib_map_mr_sg_rkey()
 ib_map_mr_sg_lkey()
 
 So we force ULPs to think about what they are doing properly, and we
 get a chance to actually force lkey to be local use only for IB.
 
 The lkey/rkey decision is passed in the fastreg post_send().

That is too late to check the access flags.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 02:19:55AM -0700, Christoph Hellwig wrote:
 Although I wonder if we really need to differentiate between rkey and
 leky in this ib_map_mr_sg function, or if we should do it when
 allocating the mr, i.e. in ib_alloc_mr.

The allocation is agnostic to the usage, the map is what solidifies
things into a certain use, effectively based on the access flags..

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v8 1/4] IB/netlink: Add defines for local service requests through netlink

2015-07-23 Thread Wan, Kaike
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
 Sent: Wednesday, July 22, 2015 4:56 PM
 To: Wan, Kaike
 Cc: linux-rdma@vger.kernel.org; Fleck, John; Weiny, Ira
 Subject: Re: [PATCH v8 1/4] IB/netlink: Add defines for local service requests
 through netlink
 
 On Thu, Jul 09, 2015 at 01:34:25PM -0400, kaike@intel.com wrote:
  +   LS_NLA_TYPE_STATUS = 0,
 
 This is never used, drop it.

Done.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Steve Wise


 -Original Message-
 From: Hefty, Sean [mailto:sean.he...@intel.com]
 Sent: Thursday, July 23, 2015 1:53 PM
 To: Steve Wise; 'Christoph Hellwig'
 Cc: 'Sagi Grimberg'; 'Steve Wise'; 'Jason Gunthorpe'; 'Tom Talpey'; 'Doug 
 Ledford'; sa...@mellanox.com; ogerl...@mellanox.com;
 r...@mellanox.com; linux-rdma@vger.kernel.org; e...@mellanox.com; 
 target-de...@vger.kernel.org; linux-...@vger.kernel.org;
 trond.mykleb...@primarydata.com; bfie...@fieldses.org; 'Oren Duer'
 Subject: RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
 
  There is confusion about lkeys and rkeys with regard to iWARP.  In the
  iWARP verbs, there is no distinction between an lkey and
  rkey: they are the same key, called a Steering Tag or STAG.  When you
  create a MR, the lkey == rkey == STAG for iwarp transports.
  Somewhat related, but really a different issue, is that SGEs that are the
  target of a read need REMOTE_WRITE access flags on their
  STAG for iWARP.
 
  Clear as mud? :)
 
 This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters. 
  They convey different ways of finding a data
buffer.  For
 example, do you locate a buffer using the stag, then verify that the offset + 
 length fits into the target buffer?  

Yes.  HW always uses the stag to locate a record that contains the stag state 
(valid or invalid), the access flags, the 8b key, the
va_base, length, PBL describing the host pages, etc.  HW validates all that 
before using the buffer.  NOTE: An stag of 0 is the
special local-dma-lkey which HW treats differently: If the stag is 0, then the 
address in the SGE is the bus/dma address itself and
no lookup of a MR/PBL/etc is needed.   Stag 0 can ONLY be used by kernel users 
and MUST never be accepted/used from an ingress
packet and MUST never be emitted on the wire in a READ or WRITE.

 Or do you locate the buffer
 by address, then verify that the key matches?
 

This is never done.

 Consider if we allow an app to specify the rkey/stag, or reference the buffer 
 using an offset, rather than a virtual address.
 
 This seems to be part of the difference between an lkey and an rkey.



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V6 05/10] IB/core: Add RoCE GID table management

2015-07-23 Thread Jason Gunthorpe
On Tue, Jul 21, 2015 at 11:42:29AM +0300, Matan Barak wrote:
 On Fri, Jul 17, 2015 at 10:02 PM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
  On Wed, Jun 24, 2015 at 03:59:21PM +0300, Matan Barak wrote:
  +
  + /* in rdma_cap_roce_gid_table, this funciton should be protected by a
  +  * sleep-able lock.
  +  */
  + write_lock(table-data_vec[ix].lock);
 
  I'm having a hard time understanding this comment
 
 
 The same function is used for both RoCE and IB. Since RoCE unlocks the
 rwlock, calls the vendor's callback and
 locks it again. If two write_gid(s) are executed simultaneously,  you
 need to protect them from writing to the
 same entry. The vendor's callback might sleep so we require a sleep-able lock.

You might want to retouch that comment a bit..

  +int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
  +  union ib_gid *gid, struct ib_gid_attr *attr)
  +{
  + struct ib_gid_table **ports_table =
  + READ_ONCE(ib_dev-cache.gid_cache);
  + /* all table reads depend on ports_table, no need for smp_rmb() */
  + if (!ports_table)
  + return -EOPNOTSUPP;
 
  This common pattern does look genuinely odd...
 
  The gid_cache is part of the common API, it really shouldn't be kfree'd
  while held struct ib_devices are around. The kfree for all the cache.c
  stuff should probably be called from ib_device_release, not from the
  client release.
 
  That is actually something the current code does that is possibly
  wrong. It is trivially fixed by moving all the kfrees to
  ib_device_release.
 
 
 But cache as a whole is implemented as a client (cache_client).
 Isn't it a bit odd to free a client in ib_device_release?

The cache is part of the core code, it exports core APIs that need to
continue working as long as a ib_device object exists.

It abuses the client stuff to hook registration, that doesn't make it
'wholly implemented as a client'

It is not odd for core code to free its memory in ib_device_release.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 01:07:56PM +0300, Sagi Grimberg wrote:
 On 7/22/2015 10:05 PM, Jason Gunthorpe wrote:
 The reason I named max_entries is because might might not be pages but
 real SG elements. It stands for maximum registration entries.
 
 Do you have a better name?

I wouldn't try and be both..

Use 'max_num_sg' and document that no aggregate scatterlist with
length larger than 'max_num_sg*PAGE_SIZE' or with more entries than
max_num_sg can be submitted?

Maybe document with ARB_SG that it is not length limited?

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v8 4/4] IB/sa: Route SA pathrecord query through netlink

2015-07-23 Thread Wan, Kaike
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
 Sent: Wednesday, July 22, 2015 5:09 PM
 To: Wan, Kaike
 Cc: linux-rdma@vger.kernel.org; Fleck, John; Weiny, Ira
 Subject: Re: [PATCH v8 4/4] IB/sa: Route SA pathrecord query through netlink
 
 On Thu, Jul 09, 2015 at 01:34:28PM -0400, kaike@intel.com wrote:
  From: Kaike Wan kaike@intel.com
 
  This patch routes a SA pathrecord query to netlink first and processes
  the response appropriately. If a failure is returned, the request will
  be sent through IB. The decision whether to route the request to
  netlink first is determined by the presence of a listener for the
  local service netlink multicast group. If the user-space local service
  netlink multicast group listener is not present, the request will be
  sent through IB, just like what is currently being done.
 
 So, I tested it again, and the UAPI looks pretty reasonable now.
 
 The netlink validation still needs to be fixed.
 
  There is where we might have a problem: nla_parse() allows only one
  entry for each attribute type in the returned array tb[]. If we have
  multiple (say 6) pathrecords returned, each having different flags
  (for different path use), a later one will replace an earlier one in
  tb[].
 
 The parse is OK, continue to use the for loop, nla_parse does more than just
 extract into tb, it validates all the sizes are correct, ignore tb.
 
 My testing shows it seems to get into some kind of fast repeating query loop:
 
 [ 4904.969188] Return answer 0
 [ 4904.969483] Return answer 0
 [ 4904.969703] Return answer 0
 [ 4904.969824] Return answer 0
 [ 4904.969943] Return answer 0
 [ 4904.970058] Return answer 0
 [ 4904.970175] Return answer 0
 [ 4904.970289] Return answer 0
 
 diff --git a/drivers/infiniband/core/sa_query.c
 b/drivers/infiniband/core/sa_query.c
 index 63ea36d05072..e6b0181e7076 100644
 --- a/drivers/infiniband/core/sa_query.c
 +++ b/drivers/infiniband/core/sa_query.c
 @@ -640,6 +640,8 @@ static void ib_nl_process_good_resolve_rsp(struct
 ib_sa_query *query,
 }
 }
 }
 +
 +   printk(Return answer %u\n,status);
 query-callback(query, status, mad);
 }
 
 You'll have to figure that out. I'm just doing ping X while running a 
 responder
 on the netlink socket, v7 didn't do this, IIRC.

I did not see such a repeated query loop. With a modified version of your 
debugging code:

--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -653,6 +653,7 @@ static void ib_nl_process_good_resolve_rsp(struct 
ib_sa_query *query,
}
}
}
+   printk(Request %u returns answer %u\n, nlh-nlmsg_seq);
query-callback(query, status, mad);
}


I ran rping a few times and never saw such a behavior:

[root@phifs011 ~]# rping -c -C 1 -a 10.228.216.26
Request 2 returns answer 0
client DISCONNECT EVENT...
[root@phifs011 ~]# rping -c -C 1 -a 10.228.216.26
Request 3 returns answer 0
client DISCONNECT EVENT...
[root@phifs011 ~]# rping -c -C 1 -a 10.228.216.26
Request 4 returns answer 0
client DISCONNECT EVENT...
[root@phifs011 ~]# rping -c -C 1 -a 10.228.216.26
Request 5 returns answer 0
client DISCONNECT EVENT...
[root@phifs011 ~]#  

The nl sequence increased sequentially. What's the difference?

In my debugging patch, I have also other print statements and I have never seen 
such a loop at all.


 
 I think it has something to do with the timer as the first request works fine,
 then a pause, then a storm.
 
 Actually, looks like nothing removes nl queries from the timeout loop when
 they successfully complete. (ie this series doesn't even have a hope of
 working properly)

That is incorrect. The first thing in ib_nl_handle_resolve_resp is to find the 
request and remove it from the ib_nl_rquest_list. the timeout routine will 
remove the entries that have timed out from the nl request list.

 
 Please actually test this patch completely and thoroughly before sending
 another version. 

Generally, I tested the patches (along with a debugging patch) with rping, 
ping, SRP, and another performance path (no debugging patch). I also tested 
another ibacm patch for setting timeout. When making major changes,  I also ran 
tests for timeout and error response cases.


I'm broadly happy with this, so get your whole team to look
 it over agin.
 
  +   if (query-callback) {
  +   head = (const struct nlattr *) nlmsg_data(nlh);
  +   len = nlmsg_len(nlh);
  +   switch (query-path_use) {
  +   case LS_RESOLVE_PATH_USE_UNIDIRECTIONAL:
  +   mask = IB_PATH_PRIMARY | IB_PATH_OUTBOUND;
  +   break;
  +
  +   case LS_RESOLVE_PATH_USE_ALL:
  +   case LS_RESOLVE_PATH_USE_GMP:
  +   default:
  +   mask = 

RE: rdma_getaddrinfo and GUID

2015-07-23 Thread Hefty, Sean
  Hello, does it possible to use rdma_getaddrinfo and specify in node
 port
  GUID?
 
  The current code does not support this.
 
 
 Why rdma_getaddrinfo does not return error? Why it returns success and
 result contains port guid?

I don't think I'm following your questions on how rdma_getaddrinfo is being 
called.  Are you referring to a GUID (64-bit value) or a GID (128-bit value)?  
How are you calling rdma_getaddrinfo with this value?
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

Re: [PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd-local_dma_lkey

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 06:47:26AM -0700, Bart Van Assche wrote:
 On 07/22/15 16:34, Jason Gunthorpe wrote:
 The remaining users of ib_get_dma_mr are all unsafe:
[ ... ]
   drivers/infiniband/ulp/srp/ib_srp.c:
  srp_dev-mr = ib_get_dma_mr(srp_dev-pd,
  IB_ACCESS_LOCAL_WRITE |
  IB_ACCESS_REMOTE_READ |
  IB_ACCESS_REMOTE_WRITE);
 
 Hello Jason,
 
 This statement might need some clarification. Are you aware that this memory
 region is only used if the kernel module parameter register_always is zero ?

It doesn't matter, just making the above call is enough to create the
security problem, even if the rkey is not used.

I looked at the register_always stuff, it isn't obvious to me that it
interacts with this path:

static int srp_map_sg(struct srp_map_state *state, struct srp_rdma_ch *ch,
if (dev-use_fast_reg) {
} else {
use_mr = !!ch-fmr_pool;
[..]
if (srp_map_sg_entry(state, ch, sg, i, use_mr)) {

static int srp_map_sg_entry(struct srp_map_state *state,
if (!use_mr) {
srp_map_desc(state, dma_addr, dma_len, target-rkey);

Perhaps the above is dead code, all our IB drivers either support FMR
or FRWR, so maybe use_mr is always true?

It looks to me like register_always is similar to iSER, it is trying
to avoid a MR if there is only 1 S/G entry. That performance behavior
needs to default to disabled. The kernel must default to secure out of
the box.

 I will try to find some time to test the SRP changes in this series but I'm
 not sure yet when I will be able to do that.

This probably also takes care of the security issue for SRP, what do you
think?

From d1ae6713fc011b6ff392e42cfb342899da8561ff Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe jguntho...@obsidianresearch.com
Date: Thu, 23 Jul 2015 12:19:53 -0600
Subject: [PATCH] IB/srp: Do not create an all physical insecure rkey by
 default

The ULP only needs this if the insecure register_always performance
optimization is enabled, or if FRWR/FMR is not supported in the driver.

Do not create an all physical MR unless it is needed to support either
of those modes. Default register_always to true so the out of the box
configuration does not create an insecure all physical MR.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/ulp/srp/ib_srp.c | 31 +--
 drivers/infiniband/ulp/srp/ib_srp.h |  2 +-
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index fb9fed0fac28..a1e3818d0791 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -69,7 +69,7 @@ static unsigned int cmd_sg_entries;
 static unsigned int indirect_sg_entries;
 static bool allow_ext_sg;
 static bool prefer_fr;
-static bool register_always;
+static bool register_always = true;
 static int topspin_workarounds = 1;
 
 module_param(srp_sg_tablesize, uint, 0444);
@@ -3150,7 +3150,8 @@ static ssize_t srp_create_target(struct device *dev,
target-scsi_host   = target_host;
target-srp_host= host;
target-lkey= host-srp_dev-pd-local_dma_lkey;
-   target-rkey= host-srp_dev-mr-rkey;
+   if (host-srp_dev-rkey_mr)
+   target-rkey= host-srp_dev-rkey_mr-rkey;
target-cmd_sg_cnt  = cmd_sg_entries;
target-sg_tablesize= indirect_sg_entries ? : cmd_sg_entries;
target-allow_ext_sg= allow_ext_sg;
@@ -3381,6 +3382,7 @@ static void srp_add_one(struct ib_device *device)
struct srp_host *host;
int mr_page_shift, s, e, p;
u64 max_pages_per_mr;
+   unsigned int mr_flags = 0;
 
dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
if (!dev_attr)
@@ -3399,8 +3401,11 @@ static void srp_add_one(struct ib_device *device)
device-map_phys_fmr  device-unmap_fmr);
srp_dev-has_fr = (dev_attr-device_cap_flags 
   IB_DEVICE_MEM_MGT_EXTENSIONS);
-   if (!srp_dev-has_fmr  !srp_dev-has_fr)
+   if (!srp_dev-has_fmr  !srp_dev-has_fr) {
dev_warn(device-dev, neither FMR nor FR is supported\n);
+   /* Fall back to using an insecure all physical rkey */
+   mr_flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
+   }
 
srp_dev-use_fast_reg = (srp_dev-has_fr 
 (!srp_dev-has_fmr || prefer_fr));
@@ -3436,12 +3441,17 @@ static void srp_add_one(struct ib_device *device)
if (IS_ERR(srp_dev-pd))
goto free_dev;
 
-   srp_dev-mr = ib_get_dma_mr(srp_dev-pd,
-   IB_ACCESS_LOCAL_WRITE |
-   IB_ACCESS_REMOTE_READ |
-   IB_ACCESS_REMOTE_WRITE);
-   if 

Re: [PATCH v1 2/2] RDMA/ocrdma: update ocrdma module license srting

2015-07-23 Thread Tejun Heo
On Fri, Jul 24, 2015 at 05:04:00AM +0530, Devesh Sharma wrote:
 replace module_license from GPL to Dual BSD/GPL
 
 Cc: Tejun Heo t...@kernel.org
 Cc: Duan Jiong duanj.f...@cn.fujitsu.com
 Cc: Roland Dreier rol...@purestorage.com
 Cc: Jes Sorensen jes.soren...@redhat.com
 Cc: Sasha Levin levinsasha...@gmail.com
 Cc: Dan Carpenter dan.carpen...@oracle.com
 Cc: Prarit Bhargava pra...@redhat.com
 Cc: Colin Ian King colin.k...@canonical.com
 Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
 Cc: Moni Shoua mo...@mellanox.com
 Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
 Cc: Li RongQing roy.qing...@gmail.com
 Cc: Devendra Naga devendra.a...@gmail.com
 Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com

Ditto,

 Acked-by: Tejun Heo t...@kernel.org

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 1/2] RDMA/ocrdma: update ocrdma license to dual-license

2015-07-23 Thread Tejun Heo
On Fri, Jul 24, 2015 at 05:03:59AM +0530, Devesh Sharma wrote:
 Change of license from GPLv2 to dual-license (GPLv2 and BSD 2-Clause)
 
 Cc: Tejun Heo t...@kernel.org
 Cc: Duan Jiong duanj.f...@cn.fujitsu.com
 Cc: Roland Dreier rol...@purestorage.com
 Cc: Jes Sorensen jes.soren...@redhat.com
 Cc: Sasha Levin levinsasha...@gmail.com
 Cc: Dan Carpenter dan.carpen...@oracle.com
 Cc: Prarit Bhargava pra...@redhat.com
 Cc: Colin Ian King colin.k...@canonical.com
 Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
 Cc: Moni Shoua mo...@mellanox.com
 Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
 Cc: Li RongQing roy.qing...@gmail.com
 Cc: Devendra Naga devendra.a...@gmail.com
 Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com

I didn't do any significant work on it but FWIW

 Acked-by: Tejun Heo t...@kernel.org

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/10] IB/core: Guarantee that a local_dma_lkey is available

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 01:47:02PM +0300, Sagi Grimberg wrote:
 +/* Return a pd for in-kernel use that has a local_dma_lkey which provides
 +   local access to all physical memory. */
 
 Why not kdoc style? we need to move the ib_verbs.h kdocs here anyway.
 Might be a good chance to do that for ib_alloc_pd().

Right, took care of these.

 diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
 index 986fddb08579..cfda95d7b067 100644
 +++ b/include/rdma/ib_verbs.h
 @@ -1255,6 +1255,8 @@ struct ib_pd {
  struct ib_device   *device;
  struct ib_uobject  *uobject;
  atomic_tusecnt; /* count all resources */
 +struct ib_mr   *local_mr;
 +u32 local_dma_lkey;
 
 Maybe its better to place the local_dma_lkey in the first cacheline as
 it is normally accessed in the hot path?

Sure, but it doesn't matter too much as it is probably the only hot
item in the pd... At the very least is avoids computing an EA...

I've posted an updated series on my github, none of this is a
functional change, so I'm going to continue to wait for
testing/reviewing of the various pieces before reposting such a big
series.

https://github.com/jgunthorpe/linux/commits/remove-ib_get_dma_mr

Thanks,
Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
 On 7/22/2015 8:44 PM, Jason Gunthorpe wrote:
 On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
 +/**
 + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
 + * @mr:memory region
 + * @sg:dma mapped scatterlist
 + * @sg_nents:  number of entries in sg
 + * @access:access permissions
 
 I know moving the access flags here was my idea originally, but I seem
 convinced by your argument that it might fit in better with the posting
 helper.  Or did someone else come up with a better argument that mine
 for moving it here?
 
 I was hoping we'd move the DMA flush and translate into here and make
 it mandatory. Is there any reason not to do that?
 
 The reason I didn't added it in was so the ULPs can make sure they meet
 the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
 SG list set partials and iSER to detect gaps (they need to dma map
 for that).

I would like to see the kdoc for ib_map_mr_sg explain exactly what is
required of the caller, maybe just hoist this bit from the
ib_sg_to_pages

Not entirely required if we are going to have an API to do the test..

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd-local_dma_lkey

2015-07-23 Thread Bart Van Assche

On 07/23/2015 11:31 AM, Jason Gunthorpe wrote:

It looks to me like register_always is similar to iSER, it is trying
to avoid a MR if there is only 1 S/G entry. That performance behavior
needs to default to disabled. The kernel must default to secure out of
the box.

[ ... ]

This probably also takes care of the security issue for SRP, what do you
think?
[ ... ]
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index fb9fed0fac28..a1e3818d0791 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -69,7 +69,7 @@ static unsigned int cmd_sg_entries;
  static unsigned int indirect_sg_entries;
  static bool allow_ext_sg;
  static bool prefer_fr;
-static bool register_always;
+static bool register_always = true;


If we want security by default then I propose not only to change the 
default value of register_always from false into true but also to change 
the default value of prefer_fr from false into true such that fast 
registration becomes the default instead of FMR.



+   if (register_always)
+   mr_flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;


Hmm ... did you perhaps intend if (!register_always) instead ?

Thanks,

Bart.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd-local_dma_lkey

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 11:42:11AM -0700, Bart Van Assche wrote:
 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
 b/drivers/infiniband/ulp/srp/ib_srp.c
 index fb9fed0fac28..a1e3818d0791 100644
 +++ b/drivers/infiniband/ulp/srp/ib_srp.c
 @@ -69,7 +69,7 @@ static unsigned int cmd_sg_entries;
   static unsigned int indirect_sg_entries;
   static bool allow_ext_sg;
   static bool prefer_fr;
 -static bool register_always;
 +static bool register_always = true;
 
 If we want security by default then I propose not only to change the default
 value of register_always from false into true but also to change the default
 value of prefer_fr from false into true such that fast registration becomes
 the default instead of FMR.

Yes, I was frowning at that stuff too.. We are trying to get rid of
FMR, so nothing should prefer it over FRWR...

Sagi, perhaps that belongs in your MR unification series?

 +if (register_always)
 +mr_flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
 
 Hmm ... did you perhaps intend if (!register_always) instead ?

I did, thank you. For your testing convenience, the updated version is
on github now:

https://github.com/jgunthorpe/linux/commits/remove-ib_get_dma_mr

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:

 So we force ULPs to think about what they are doing properly, and we
 get a chance to actually force lkey to be local use only for IB.
 
 The lkey/rkey decision is passed in the fastreg post_send().
 
 That is too late to check the access flags.
 
 Why? the access permissions are kept in the mr context?

Sure, one could do if (key == mr-lkey) .. check lkey flags in the
post, but that seems silly considering we want the post inlined..

 I can move it to the post interface if it makes more sense.
 the access is kind of out of place in the mapping routine anyway...

All the dma routines have an access equivalent during map, I don't
think it is out of place..

To my mind, the map is the point where the MR should crystallize into
an rkey or lkey MR, not at the post.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Hefty, Sean
 There is confusion about lkeys and rkeys with regard to iWARP.  In the
 iWARP verbs, there is no distinction between an lkey and
 rkey: they are the same key, called a Steering Tag or STAG.  When you
 create a MR, the lkey == rkey == STAG for iwarp transports.
 Somewhat related, but really a different issue, is that SGEs that are the
 target of a read need REMOTE_WRITE access flags on their
 STAG for iWARP.
 
 Clear as mud? :)

This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters.  
They convey different ways of finding a data buffer.  For example, do you 
locate a buffer using the stag, then verify that the offset + length fits into 
the target buffer?  Or do you locate the buffer by address, then verify that 
the key matches?

Consider if we allow an app to specify the rkey/stag, or reference the buffer 
using an offset, rather than a virtual address.

This seems to be part of the difference between an lkey and an rkey. 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 07:59:48PM +0300, Sagi Grimberg wrote:
 I don't mean to be negative about your ideas, I just don't think that
 doing all the work in the drivers is going to get us to a better place.

No worries, I'm hoping someone can put the peices together and figure
out how to code share all the duplication we seem to have in the ULPs.

The more I've look at them, the more it seems like they get basic
things wrong, like SQE accouting in NFS, dma flush ordering in NFS,
rkey security in SRP/iSER..

Sharing code means we can fix those problems for good.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
 I was hoping we'd move the DMA flush and translate into here and make
 it mandatory. Is there any reason not to do that?
 
 The reason I didn't added it in was so the ULPs can make sure they meet
 the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
 SG list set partials and iSER to detect gaps (they need to dma map
 for that).

The ULP can always get the sg list's virtual address to check for
gaps. Page aligned gaps are always OK.

BTW, the logic in ib_sg_to_pages should be checking that directly, as
coded, it won't work with swiotlb:

// Only the first SG entry can start unaligned
if (i  page_addr != dma_addr)
return EINVAL;
// Only the last SG entry can end unaligned
if ((page_addr + dma_len)  PAGE_MASK != end_dma_addr)
 if (!is_last)
 return EINVAL;

Don't use sg-offset after dma mapping.

The biggest problem with checking the virtual address is
swiotlb. However, if swiotlb is used this API is basically broken as
swiotlb downgrades everything to a 2k alignment, which means we only
ever get 1 s/g entry.

To efficiently support swiotlb we'd need to see the driver be able to
work with a page size of IO_TLB_SEGSIZE (2k) so it can handle the
de-alignment that happens during bouncing.

My biggest problem with pushing the dma address up to the ULP is
basically that: The ULP has no idea what the driver can handle, maybe
the driver can handle the 2k pages.

So, that leaves a flow where the ULP does a basic sanity check on the
virtual side, then asks the IB core to map it. The mapping could still
fail because of swiotlb.

If the mapping fails, then the ULP has to bounce buffer, or MR split,
or totally fail.

For bounce buffer, all solutions have a DMA map/unmap cost, so it
doesn't matter if ib_map_mr_sg does that internally.

For MR fragment, the DMA mapping is still usable. Maybe we do need a
slightly different core API to help MR fragmentation? Sounds like NFS
uses this too?

 num_mrs = ib_mr_fragment_sg(scatterlist);
 while (..)
   ib_map_fragment_sg(mr[i++],scatterlist,offset);

Perhaps?

Maybe that is even better because something like iser could do
the parallel:
 ib_mr_needs_fragment_sg(reference_mr,scatterlist)

Which hides all the various restrictions in driver code.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1 1/2] RDMA/ocrdma: update ocrdma license to dual-license

2015-07-23 Thread Devesh Sharma
Change of license from GPLv2 to dual-license (GPLv2 and BSD 2-Clause)

Cc: Tejun Heo t...@kernel.org
Cc: Duan Jiong duanj.f...@cn.fujitsu.com
Cc: Roland Dreier rol...@purestorage.com
Cc: Jes Sorensen jes.soren...@redhat.com
Cc: Sasha Levin levinsasha...@gmail.com
Cc: Dan Carpenter dan.carpen...@oracle.com
Cc: Prarit Bhargava pra...@redhat.com
Cc: Colin Ian King colin.k...@canonical.com
Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
Cc: Moni Shoua mo...@mellanox.com
Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
Cc: Li RongQing roy.qing...@gmail.com
Cc: Devendra Naga devendra.a...@gmail.com
Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com
---
 drivers/infiniband/hw/ocrdma/ocrdma.h   |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_abi.h   |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c|   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_ah.h|   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c|   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h|   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_stats.c |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_stats.h |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |   53 +--
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   53 +--
 12 files changed, 408 insertions(+), 228 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h 
b/drivers/infiniband/hw/ocrdma/ocrdma.h
index b396344..6a36338 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -1,21 +1,36 @@
-/***
- * This file is part of the Emulex RoCE Device Driver for  *
- * RoCE (RDMA over Converged Ethernet) adapters.   *
- * Copyright (C) 2008-2012 Emulex. All rights reserved.*
- * EMULEX and SLI are trademarks of Emulex.*
- * www.emulex.com  *
- * *
- * This program is free software; you can redistribute it and/or   *
- * modify it under the terms of version 2 of the GNU General   *
- * Public License as published by the Free Software Foundation.*
- * This program is distributed in the hope that it will be useful. *
- * ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND  *
- * WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY,  *
- * FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE  *
- * DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD *
- * TO BE LEGALLY INVALID.  See the GNU General Public License for  *
- * more details, a copy of which can be found in the file COPYING  *
- * included with this package. *
+/* This file is part of the Emulex RoCE Device Driver for
+ * RoCE (RDMA over Converged Ethernet) adapters.
+ * Copyright (C) 2012-2015 Emulex. All rights reserved.
+ * EMULEX and SLI are trademarks of Emulex.
+ * www.emulex.com
+ *
+ * This software is available to you under a choice of one of two licenses.
+ * You may choose to be licensed under the terms of the GNU General Public
+ * License (GPL) Version 2, available from the file COPYING in the main
+ * directory of this source tree, or the BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * - Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions and the following disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH 

[PATCH v1 0/2] ocrdma license change

2015-07-23 Thread Devesh Sharma
Doug- Thanks for your help!  Resubmitting, per your request

Devesh Sharma (2):
  RDMA/ocrdma: update ocrdma license to dual-license
  RDMA/ocrdma: update ocrdma module license srting

 drivers/infiniband/hw/ocrdma/ocrdma.h   |   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_abi.h   |   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c|   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_ah.h|   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c|   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h|   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   55 +--
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_stats.c |   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_stats.h |   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |   53 -
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   53 -
 12 files changed, 409 insertions(+), 229 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1 2/2] RDMA/ocrdma: update ocrdma module license srting

2015-07-23 Thread Devesh Sharma
replace module_license from GPL to Dual BSD/GPL

Cc: Tejun Heo t...@kernel.org
Cc: Duan Jiong duanj.f...@cn.fujitsu.com
Cc: Roland Dreier rol...@purestorage.com
Cc: Jes Sorensen jes.soren...@redhat.com
Cc: Sasha Levin levinsasha...@gmail.com
Cc: Dan Carpenter dan.carpen...@oracle.com
Cc: Prarit Bhargava pra...@redhat.com
Cc: Colin Ian King colin.k...@canonical.com
Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
Cc: Moni Shoua mo...@mellanox.com
Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
Cc: Li RongQing roy.qing...@gmail.com
Cc: Devendra Naga devendra.a...@gmail.com
Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 6ded95a..b119a34 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -61,7 +61,7 @@
 MODULE_VERSION(OCRDMA_ROCE_DRV_VERSION);
 MODULE_DESCRIPTION(OCRDMA_ROCE_DRV_DESC   OCRDMA_ROCE_DRV_VERSION);
 MODULE_AUTHOR(Emulex Corporation);
-MODULE_LICENSE(GPL);
+MODULE_LICENSE(Dual BSD/GPL);
 
 static LIST_HEAD(ocrdma_dev_list);
 static DEFINE_SPINLOCK(ocrdma_devlist_lock);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-23 Thread Jason Gunthorpe
On Thu, Jul 23, 2015 at 01:27:23PM +0300, Sagi Grimberg wrote:
 ib_post_fastreg_wr would be a function that needs 3 register passed
 arguments and does a simple copy to the driver's actual sendq
 
 That will require to take the SQ lock and write a doorbell for each
 registration and post you want to do. I'm confident that constructing
 a post chain with a single sq lock acquire and a single doorbell will
 be much much better even with conditional jumps and memsets.

You are still thinking at a micro level, the ULP should be working at
a higher level and requesting the MR(s) and the actual work together
so the driver can run the the whole chain of posts without extra stack
traffic, locking or doorbells.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 7:14 PM, Jason Gunthorpe wrote:

On Thu, Jul 23, 2015 at 01:19:16PM +0300, Sagi Grimberg wrote:

Again, related to my prior comments, please have two of these:

ib_map_mr_sg_rkey()
ib_map_mr_sg_lkey()

So we force ULPs to think about what they are doing properly, and we
get a chance to actually force lkey to be local use only for IB.


The lkey/rkey decision is passed in the fastreg post_send().


That is too late to check the access flags.


Why? the access permissions are kept in the mr context?
I can move it to the post interface if it makes more sense.
the access is kind of out of place in the mapping routine anyway...

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-23 Thread Sagi Grimberg

On 7/22/2015 10:05 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 07:58:23PM +0300, Sagi Grimberg wrote:

On 7/22/2015 7:44 PM, Christoph Hellwig wrote:

On Wed, Jul 22, 2015 at 10:34:05AM -0600, Jason Gunthorpe wrote:

+/**
+ * ib_alloc_mr() - Allocates a memory region
+ * @pd:protection domain associated with the region
+ * @mr_type:   memory region type
+ * @max_entries:   maximum registration entries available
+ * @flags: create flags
+ */


Can you update this comment to elaborate some more on what the
parameters are? 'max_entries' is the number of s/g elements or
something?


+enum ib_mr_type {
+   IB_MR_TYPE_FAST_REG,
+   IB_MR_TYPE_SIGNATURE,
  };


Sure would be nice to have some documentation for what these things
do..


Agreed on both counts.  Otherwise this looks pretty good to me.


I can add some more documentation here...


So, I was wrong, 'max_entries' is the number of page entires, not
really the s/g element limit?


The max_entries stands for the maximum number of sg entries. Other than
that, the SG list must meet the requirements documented in ib_map_mr_sg.

The reason I named max_entries is because might might not be pages but
real SG elements. It stands for maximum registration entries.

Do you have a better name?



In other words the ULP can submit at most max_entires*PAGE_SIZE bytes
for the non ARB_SG case

For the ARB_SG case.. It is some other more difficult computation?


Not really. The ULP needs to submit sg_nents  max_entries. The SG
list needs to meed the alignment requirements.

For ARB_SG, the condition is the same, but the SG is free from the
alignment constraints.



It is somewhat ugly to ask for this upfront as a hard limit..

Is there any reason we can't use a hint_prealloc_pages as the argument
here, and then realloc in the map routine if the hint turns out to be
too small for a particular s/g list?


The reason is that it is not possible. The memory key allocation
reserves resources in the device translation tables. realloc means
reallocating the memory key. In any event, this is not possible in
the IO path.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 12:30 PM, Christoph Hellwig wrote:

On Thu, Jul 23, 2015 at 12:57:34AM +, Hefty, Sean wrote:

+enum ib_mr_type {
+   IB_MR_TYPE_FAST_REG,
+   IB_MR_TYPE_SIGNATURE,


If we're going to go through the trouble of changing everything, I vote
for dropping the word 'fast'. It's a marketing term.  It's goofy.  And
the IB spec is goofy for using it.


So IB_MR_TYPE_MEM_REG?



Yes.  Especially as the infrastructure will be usable to support FMR
on legacy adapters as well except that instead of the ib_post_send it'll
need a call to the FMR code at the very end.

While we're at it  wonder if we should consolidate the type and the
flags field as well, as the split between the two is a little confusing.


I can do that.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-23 Thread Sagi Grimberg

On 7/22/2015 8:44 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:

+/**
+ * ib_map_mr_sg() - Populates MR with a dma mapped SG list
+ * @mr:memory region
+ * @sg:dma mapped scatterlist
+ * @sg_nents:  number of entries in sg
+ * @access:access permissions


I know moving the access flags here was my idea originally, but I seem
convinced by your argument that it might fit in better with the posting
helper.  Or did someone else come up with a better argument that mine
for moving it here?


I was hoping we'd move the DMA flush and translate into here and make
it mandatory. Is there any reason not to do that?


The reason I didn't added it in was so the ULPs can make sure they meet
the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
SG list set partials and iSER to detect gaps (they need to dma map
for that).




+int ib_map_mr_sg(struct ib_mr *mr,
+struct scatterlist *sg,
+unsigned short sg_nents,
+unsigned int access)
+{
+   int rc;
+
+   if (!mr-device-map_mr_sg)
+   return -ENOSYS;
+
+   rc = mr-device-map_mr_sg(mr, sg, sg_nents);


Do we really need a driver callout here?  It seems like we should


The call out makes sense to me..

The driver will convert the scatter list directly into whatever HW
representation it needs and prepare everything for posting. Every
driver has a different HW format, so it must be a callout.


Also it seems like this returns 0/-error.  How do callers like SRP
see that it only did a partial mapping and it needs another MR?


I would think it is an error to pass in more sg_nents than the MR was
created with, so SRP should never get a partial mapping as it should
never ask for more than max_entries.

(? Sagi, did I get the intent of this right?)


Error is returned when:
- sg_nents  max_entries
- sg has gaps
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-23 Thread Christoph Hellwig
 If you want to micro optimize then just zero the few items that are
 defined to be accessed for fastreg, no need to zero the whole
 structure. Infact, you may have already done that, so just drop the
 memset entirely.

Oh, indeed.

 If you want to optimize this path, then Sean is right, move the post
 into the driver and stop pretending that ib_post_send is a performance
 API.
 
 ib_post_fastreg_wr would be a function that needs 3 register passed
 arguments and does a simple copy to the driver's actual sendq

Now that sounds even better.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 40/43] mlx5: Allocate private context for arbitrary scatterlist registration

2015-07-23 Thread Christoph Hellwig
On Wed, Jul 22, 2015 at 11:30:48AM -0600, Jason Gunthorpe wrote:
 On Wed, Jul 22, 2015 at 09:55:40AM +0300, Sagi Grimberg wrote:
  +   size += max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
  +   mr-klms = kzalloc(size, GFP_KERNEL);
  +   if (!mr-klms)
  +   return -ENOMEM;
  +
  +   mr-pl_map = dma_map_single(device-dma_device, mr-klms,
  +   size, DMA_TO_DEVICE);
 
 This is a misuse of the DMA API, you must call dma_map_single after
 the memory is set by the CPU, not before.

 The fast reg varient is using coherent allocations, which is OK..

It's fine as long as you dma_sync_*_for_{cpu,device} in the right
places, which is what a lot of drivers do for longer held allocations.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 00/43] New fast registration API

2015-07-23 Thread Christoph Hellwig
On Wed, Jul 22, 2015 at 11:27:02AM -0600, Jason Gunthorpe wrote:
 What is SRP trying to accomplish with that?
 
 The only reason that springs to mind is to emulate IB_MR_MAP_ARB_SG ?

It's not emulating IB_MR_MAP_ARB_SG, it simply allows muliple memory
registrations per I/O request.  Be that to support gappy SGLs in a
generic way, or to allow larger I/O sizes than the HCA MR size.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 00/43] New fast registration API

2015-07-23 Thread Christoph Hellwig
On Wed, Jul 22, 2015 at 08:42:32PM +0300, Sagi Grimberg wrote:
 We can do that, but I'd prefer not to pollute the API just for this
 single use case. What we can do, is add a pool API that would take care
 of that. But even then we might end up with different strategies as not
 all ULPs can use it the same way (protocol constraints)...
 
 Today SRP has this logic that registers multiple SG aligned partials.
 We can just have it pass a partial SG list to what we have today instead
 of building the page vectors...
 
 Or if we can come up with something that will keep the API trivial, we
 can take care of that too.


Supporting an array or list of MRs seems pretty easy.  If you ignore the
weird fallback to physical DMA case when a MR fails case the SRP memory
registration code isn't significanly more complex than that in iSER for
example.  And I think NFS needs the same support as well, as it allows
using additional MRs when detecting a gap.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-23 Thread Christoph Hellwig
On Thu, Jul 23, 2015 at 12:57:34AM +, Hefty, Sean wrote:
  +enum ib_mr_type {
  +   IB_MR_TYPE_FAST_REG,
  +   IB_MR_TYPE_SIGNATURE,
 
 If we're going to go through the trouble of changing everything, I vote
 for dropping the word 'fast'. It's a marketing term.  It's goofy.  And
 the IB spec is goofy for using it.

Yes.  Especially as the infrastructure will be usable to support FMR
on legacy adapters as well except that instead of the ib_post_send it'll
need a call to the FMR code at the very end.

While we're at it  wonder if we should consolidate the type and the
flags field as well, as the split between the two is a little confusing.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-23 Thread Sagi Grimberg

On 7/22/2015 8:57 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 08:33:16PM +0300, Sagi Grimberg wrote:

memset(fr_wr, 0, sizeof(fr_wr));
+   ib_set_fastreg_wr(mr, mr-lkey, ISER_FASTREG_LI_WRID,
+ false, fr_wr);


Shouldn't ib_set_fastreg_wr take care of this memset?  Also it seems
instead of the singalled flag to it we might just set that or
other flags later if we really want to.


Seems reasonable.

If you want to micro optimize then just zero the few items that are
defined to be accessed for fastreg, no need to zero the whole
structure. Infact, you may have already done that, so just drop the
memset entirely.


I will.




The reason I didn't put it in was that ib_send_wr is not a small struct
(92 bytes IIRC). So I'm a bit reluctant to add an unconditional memset.
Maybe it's better that the callers can carefully set it to save some
cycles?


If you want to optimize this path, then Sean is right, move the post
into the driver and stop pretending that ib_post_send is a performance
API.

ib_post_fastreg_wr would be a function that needs 3 register passed
arguments and does a simple copy to the driver's actual sendq


That will require to take the SQ lock and write a doorbell for each
registration and post you want to do. I'm confident that constructing
a post chain with a single sq lock acquire and a single doorbell will
be much much better even with conditional jumps and memsets.

svcrdma, isert (and iser - not upstream yet) are doing it. I think that
others should do it too. My tests shows that this makes a difference in
small IO workloads.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 40/43] mlx5: Allocate private context for arbitrary scatterlist registration

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 12:25 PM, Christoph Hellwig wrote:

On Wed, Jul 22, 2015 at 11:30:48AM -0600, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 09:55:40AM +0300, Sagi Grimberg wrote:

+   size += max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+   mr-klms = kzalloc(size, GFP_KERNEL);
+   if (!mr-klms)
+   return -ENOMEM;
+
+   mr-pl_map = dma_map_single(device-dma_device, mr-klms,
+   size, DMA_TO_DEVICE);


This is a misuse of the DMA API, you must call dma_map_single after
the memory is set by the CPU, not before.

The fast reg varient is using coherent allocations, which is OK..


It's fine as long as you dma_sync_*_for_{cpu,device} in the right
places, which is what a lot of drivers do for longer held allocations.


OK. I'll fix that.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 00/43] New fast registration API

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 12:28 PM, Christoph Hellwig wrote:

On Wed, Jul 22, 2015 at 08:42:32PM +0300, Sagi Grimberg wrote:

We can do that, but I'd prefer not to pollute the API just for this
single use case. What we can do, is add a pool API that would take care
of that. But even then we might end up with different strategies as not
all ULPs can use it the same way (protocol constraints)...

Today SRP has this logic that registers multiple SG aligned partials.
We can just have it pass a partial SG list to what we have today instead
of building the page vectors...

Or if we can come up with something that will keep the API trivial, we
can take care of that too.



Supporting an array or list of MRs seems pretty easy.


I'm missing the simplicity here...


If you ignore the
weird fallback to physical DMA case when a MR fails case the SRP memory
registration code isn't significanly more complex than that in iSER for
example.  And I think NFS needs the same support as well, as it allows
using additional MRs when detecting a gap.



This kinda changing the semantics a bit. With this we need to return a
value of how many MRs used to register. It will also make it a bit
sloppy as the actual mapping is driven from the drivers (which use their
internal buffers).

Don't you think that a separate pool API is better for addressing this?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-23 Thread Sagi Grimberg

On 7/22/2015 7:04 PM, Chuck Lever wrote:


On Jul 22, 2015, at 11:41 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:




+   for (i = 0; i  nsegs;) {
+   sg_set_page(frmr-sg[i], seg-mr_page,
+   seg-mr_len, offset_in_page(seg-mr_offset));


Cautionary note: here we’re dealing with both the “contiguous
set of pages” case and the “small region of bytes in a single page”
case. See rpcrdma_convert_iovs(): sometimes RPC send or receive
buffers can be registered (RDMA_NOMSG).


I noticed that (I think). I think this is handled correctly.
What exactly is the caution note here?


Well the sg is turned into a page list below your API. Just
want to make sure that we have tested your xprtrdma alterations
with all the ULP possibilities. When you are further along I
can pull this and run my functional tests.



mr = frmr-fr_mr;
+   access = writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+  IB_ACCESS_REMOTE_READ;
+   rc = ib_map_mr_sg(mr, frmr-sg, frmr-sg_nents, access);


I like this (and the matching ib_dma_unmap_sg). But why wouldn’t
this function be called ib_dma_map_sg() ? The name ib_map_mr_sg()
had me thinking for a moment that this API actually posted the
FASTREG WR, but I see that it doesn’t.


Umm, ib_dma_map_sg is already taken :)

This is what I came up with, it maps the SG elements to the MR
private context.

I'd like to keep the post API for now. It will be possible to
to add a wrapper function that would do:
- dma_map_sg
- ib_map_mr_sg
- init fastreg send_wr
- post_send (maybe)


Where xprtrdma might improve is by setting up all the FASTREG
WRs for one RPC with a single chain and post_send. We could do
that with your INDIR_MR concept, for example.


BTW, it would be great if you can play with it a little bit. I'm more
confident with the iSER part... I added two small fixes when I tested
with mlx4. It seems to work...





-   while (seg1-mr_nsegs--)
-   rpcrdma_unmap_one(ia-ri_device, seg++);
+   ib_dma_unmap_sg(ia-ri_device, frmr-sg, frmr-sg_nents, seg1-mr_dir);


-mr_dir was previously set by rpcrdma_map_one(), which you’ve replaced
with ib_map_mr_sg(). So maybe frwr_op_map() needs to save “direction”
in the rpcrdma_frmr.


Yep, that's correct, if I had turned on dma mapping debug it would shout
at me here...

Note, I added in the git repo a patch to allow arbitrary sg lists in
frwr_op_map() which would allow you to skip the holes check... seems to
work with mlx5...

I did noticed the mlx4 gives a protection error with after the conversion... 
I'll look into that...


Should also get Steve and Devesh to try this with their adapters.


Ah, yes please. I've only compiled tested drivers other than mlx4, mlx5
which means there is a 99.9% (probably 100%) that it doesn't work.

It would be great to get help on porting the rest of the ULPs as well,
but that can wait until we converge on the API...
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/10] IB/iser: Use pd-local_dma_lkey

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 2:34 AM, Jason Gunthorpe wrote:

Replace all leys with  pd-local_dma_lkey. This driver does not support
iWarp, so this is safe.

The insecure use of ib_get_dma_mr is thus isolated to an rkey, and this
looks trivially fixed by forcing the use of registration in a future
patch.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com


Looks good.

Reviewed-by: Sagi Grimberg sa...@mellanox.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/10] IB/srp: Use pd-local_dma_lkey

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 2:34 AM, Jason Gunthorpe wrote:

Replace all leys with  pd-local_dma_lkey. This driver does not support
iWarp, so this is safe.

The insecure use of ib_get_dma_mr is thus isolated to an rkey, and will
have to be fixed separately.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com


Looks good.

Reviewed-by: Sagi Grimberg sa...@mellanox.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/10] iser-target: Remove ib_get_dma_mr calls

2015-07-23 Thread Sagi Grimberg

On 7/23/2015 2:34 AM, Jason Gunthorpe wrote:

The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com


Looks good.

Reviewed-by: Sagi Grimberg sa...@mellanox.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd-local_dma_lkey

2015-07-23 Thread Sagi Grimberg



Sagi, IB/iser should have special attention paid, as it is less clear to me if
it got everything.


It looks fine. I'll pull those as well.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html