date:20150113

[PATCH 2/2] RDMA/libocrdma: update libocrdma version string

2015-01-13 Thread Devesh Sharma

version string updated from 1.0.4 to 1.0.5
Signed-off-by: Devesh Sharma 
---
 configure.in |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/configure.in b/configure.in
index 3b0cf21..140c07b 100644
--- a/configure.in
+++ b/configure.in
@@ -1,11 +1,11 @@
 dnl Process this file with autoconf to produce a configure script.
 
 AC_PREREQ(2.57)
-AC_INIT(libocrdma, 1.0.4, linux-rdma@vger.kernel.org)
+AC_INIT(libocrdma, 1.0.5, linux-rdma@vger.kernel.org)
 AC_CONFIG_SRCDIR([src/ocrdma_main.h])
 AC_CONFIG_AUX_DIR(config)
 AM_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE(libocrdma, 1.0.4)
+AM_INIT_AUTOMAKE(libocrdma, 1.0.5)
 AM_PROG_LIBTOOL
 
 AC_ARG_ENABLE(libcheck, [ --disable-libcheck  do not test for the presence of ib libraries],
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/2] libocrdma bug fixes

2015-01-13 Thread Devesh Sharma

This patch series has a major bug fix related to
returning the correct error codes whenever an
immediate error is encountered.

Devesh Sharma (1):
  RDMA/libocrdma: update libocrdma version string

Padmanabh Ratnakar (1):
  RDMA/libocrdma: return positive error codes

 configure.in   |4 ++--
 src/ocrdma_verbs.c |   32 
 2 files changed, 18 insertions(+), 18 deletions(-)


[PATCH 1/2] RDMA/libocrdma: return positive error codes

2015-01-13 Thread Devesh Sharma

From: Padmanabh Ratnakar 

As per standard practice, if any failure is encountered in the
library code, the library should return a positive error code
to the user.

A bug has also been reported in a KVM migration use case.

This patch fixes the return code problem.

Signed-off-by: Padmanabh Ratnakar 
Signed-off-by: Devesh Sharma 
---
 src/ocrdma_verbs.c |   32 
 1 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/src/ocrdma_verbs.c b/src/ocrdma_verbs.c
index edff8b6..c089a5f 100644
--- a/src/ocrdma_verbs.c
+++ b/src/ocrdma_verbs.c
@@ -778,7 +778,7 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
ocrdma_del_flush_qp(qp);
break;
default:
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
@@ -794,7 +794,7 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
break;
default:
/* invalid state change. */
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
@@ -808,7 +808,7 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
break;
default:
/* invalid state change. */
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
@@ -823,7 +823,7 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
break;
default:
/* invalid state change. */
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
@@ -836,7 +836,7 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
break;
default:
/* invalid state change. */
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
@@ -847,7 +847,7 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
break;
default:
/* invalid state change. */
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
@@ -857,12 +857,12 @@ static int ocrdma_qp_state_machine(struct ocrdma_qp *qp,
case OCRDMA_QPS_RST:
break;
default:
-   status = -EINVAL;
+   status = EINVAL;
break;
};
break;
default:
-   status = -EINVAL;
+   status = EINVAL;
break;
};
if (!status)
@@ -1226,7 +1226,7 @@ static inline int ocrdma_build_inline_sges(struct ocrdma_qp *qp,
ocrdma_err
("%s() supported_len=0x%x, unspported len req=0x%x\n",
__func__, qp->max_inline_data, hdr->total_len);
-   return -EINVAL;
+   return EINVAL;
}
 
dpp_addr = (char *)sge;
@@ -1391,7 +1391,7 @@ int ocrdma_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
if (qp->state != OCRDMA_QPS_RTS && qp->state != OCRDMA_QPS_SQD) {
pthread_spin_unlock(&qp->q_lock);
*bad_wr = wr;
-   return -EINVAL;
+   return EINVAL;
}
 
while (wr) {
@@ -1399,14 +1399,14 @@ int ocrdma_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
if (qp->qp_type == IBV_QPT_UD && (wr->opcode != IBV_WR_SEND &&
wr->opcode != IBV_WR_SEND_WITH_IMM)) {
*bad_wr = wr;
-   status = -EINVAL;
+   status = EINVAL;
break;
}
 
if (ocrdma_hwq_free_cnt(&qp->sq) == 0 ||
wr->num_sge > qp->sq.max_sges) {
*bad_wr = wr;
-   status = -ENOMEM;
+   status = ENOMEM;
break;
}
hdr = ocrdma_hwq_head(&qp->sq);
@@ -1441,7 +1441,7 @@ int ocrdma_post_send(struct ibv_qp *ib_qp, struct ibv_send_wr *wr,
ocrdma_build_read(qp, hdr, wr);
break;
default:
-   status = -EINVAL;
+   status = EINVAL;
break;
}
if (status) {
@@ -1509,13 +1509,13 @@ int ocrdma_post_recv
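
The convention this patch adopts can be illustrated with a minimal user-space sketch. The names demo_post_send and demo_describe are hypothetical stand-ins, not part of libocrdma or libibverbs; the only assumption carried over from the patch description is that a verbs call returns 0 on success and a positive errno value on failure, so the caller can hand the result straight to strerror().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a provider's post_send path. */
static int demo_post_send(int qp_ready)
{
    if (!qp_ready)
        return EINVAL;   /* positive errno, as the patch now returns */
    return 0;
}

/* Typical caller pattern: nonzero means failure, and the value is an errno. */
static const char *demo_describe(int ret)
{
    return ret ? strerror(ret) : "success";
}
```

A caller would write something like `ret = demo_post_send(ready); if (ret) fprintf(stderr, "%s\n", demo_describe(ret));`. With a negative return value the same call to strerror() would not yield the intended message, which is the practical reason for the sign change.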

[PATCH FIX for-3.19] IB/ipoib: Fix failed multicast joins/sends

2015-01-13 Thread Doug Ledford

The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
it is used to mean "our device is administratively allowed to send
multicast joins/leaves/packets" and in other places it means "our
multicast join task thread is currently running and will process your
request if you put it on the queue".  However, this latter meaning is in
fact flawed as there is a race condition between the join task testing
the mcast list and finding it empty of remaining work, dropping the
mcast mutex and also the priv->lock spinlock, and clearing the
IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
the flag in the former fashion, and when all tasks complete and the task
thread clears the RUN flag, all of those other locations will fail to
ever again queue any work.  This results in the interface coming up fine
initially, but having problems adding new multicast groups after the
first round of groups have all been added and the RUN flag is cleared by
the join task thread when it thinks it is done.  To resolve this issue,
convert all locations in the code to treat the RUN flag as an indicator
that the multicast portion of this interface is in fact administratively
up and joins/leaves/sends can be performed.  There is no harm (other
than a slight performance penalty) to never clearing this flag and using
it in this fashion as it simply means that a few places that used to
micro-optimize how often this task was queued on a work queue will now
queue the task a few extra times.  We can address that suboptimal
behavior in future patches.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bc50dd0d0e4..91b8fe118ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
}
 
ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-   clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
mutex_lock(&mcast_mutex);
-   if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+   set_bit(IPOIB_MCAST_RUN, &priv->flags);
+   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
mutex_unlock(&mcast_mutex);
 
return 0;
@@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
__ipoib_mcast_add(dev, mcast);
list_add_tail(&mcast->list, &priv->multicast_list);
-   if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
queue_delayed_work(priv->wq, &priv->mcast_task, 0);
}
 
@@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
/*
 * Restart our join task if needed
 */
-   ipoib_mcast_start_thread(dev);
+   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
rtnl_unlock();
 }
 
-- 
2.1.0
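
The failure mode described in the changelog above can be sketched with a toy user-space model. Everything here is hypothetical (struct demo_priv and the demo_* functions are not kernel code); it only models the call sites that treat the flag as "multicast is administratively allowed" and shows what happens to them when the join task clears the flag on completion.

```c
#include <stdbool.h>

/* Toy user-space model; every name below is hypothetical. */
struct demo_priv {
    bool mcast_run;   /* stands in for the IPOIB_MCAST_RUN bit */
    int  queued;      /* how many times the join task was queued */
};

/* Interface brought up: mark multicast as allowed and kick the task. */
static void demo_start(struct demo_priv *p)
{
    p->mcast_run = true;
    p->queued++;
}

/* Join task finished its list. The old code cleared the flag here. */
static void demo_join_task_done(struct demo_priv *p, bool old_behaviour)
{
    if (old_behaviour)
        p->mcast_run = false;
}

/* A later request for a new group: queue work only if multicast is allowed. */
static bool demo_send_new_group(struct demo_priv *p)
{
    if (!p->mcast_run)
        return false;
    p->queued++;
    return true;
}

/* Run the up / task-done / new-group sequence and report whether
 * the new group's work was queued. */
static bool demo_scenario(bool old_behaviour)
{
    struct demo_priv p = { false, 0 };

    demo_start(&p);
    demo_join_task_done(&p, old_behaviour);
    return demo_send_new_group(&p);
}
```

In this model demo_scenario(true) leaves the new group unqueued, while demo_scenario(false) queues it, at the cost of possibly queueing the task more often than strictly needed.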


Re: [PATCH 14/20] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag

2015-01-13 Thread Or Gerlitz


On 1/12/2015 7:11 PM, ira.we...@intel.com wrote:

Add a device capability flag to flag OPA MAD support on devices.


You should add a few words here explaining what OPA and OPA MADs are and why
supporting them fits the IB core.

See for example the IB core patch [1] that added the signature verbs API; it is
stated in a manner which is generic both across vendors and across different
Mellanox HCAs. Also see [2], the IB core patch that added support for the BMME
API, as another example.


[1] 1b01d33 IB/core: Introduce signature verbs API
[2] 00f7ec3 RDMA/core: Add memory management extensions support

Re: [PATCH 02/20] IB/core: Cache device attributes for use by upper level drivers

2015-01-13 Thread Or Gerlitz


On 1/12/2015 7:10 PM, ira.we...@intel.com wrote:

From: Ira Weiny 


Please avoid empty change-logs; a one-liner would do perfectly well here.

Signed-off-by: Ira Weiny 
---
  drivers/infiniband/core/device.c |2 ++
  include/rdma/ib_verbs.h  |1 +
  2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..3a6afde 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -294,6 +294,8 @@ int ib_register_device(struct ib_device *device,
spin_lock_init(&device->event_handler_lock);
spin_lock_init(&device->client_data_lock);
  
+	device->query_device(device, &device->attributes);

+
ret = read_port_table_lengths(device);
if (ret) {
printk(KERN_WARNING "Couldn't create table lengths cache for device %s\n",
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0d74f1d..86fc90f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1675,6 +1675,7 @@ struct ib_device {
u32  local_dma_lkey;
u8   node_type;
u8   phys_port_cnt;
+   struct ib_device_attr attributes;

I think cached_dev_attrs would be a better name.


Re: [PATCH 15/20] IB/mad: Create jumbo_mad data structures

2015-01-13 Thread Or Gerlitz


On 1/12/2015 7:11 PM, ira.we...@intel.com wrote:

Define jumbo_mad and jumbo_rmpp_mad


For the sake of review and maintenance, please add a few more words here
on what these creatures are...


Create an RMPP Base header to share between ib_rmpp_mad and jumbo_rmpp_mad



Re: [PATCH 14/20] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag

2015-01-13 Thread Or Gerlitz


On 1/12/2015 7:11 PM, ira.we...@intel.com wrote:

--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
 enum ib_signature_prot_cap {
IB_PROT_T10DIF_TYPE_1 = 1,
IB_PROT_T10DIF_TYPE_2 = 1 << 1,
@@ -210,6 +214,7 @@ struct ib_device_attr {
int sig_prot_cap;
int sig_guard_cap;
struct ib_odp_caps  odp_caps;
+   u64 device_cap_flags2;


Just make the existing kernel-side device_cap_flags field a u64. Note that
it's not blindly copied to user space in uverbs as part of a chunk,
so just go there and copy the lower 32 bits.
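
A minimal sketch of that suggestion, using simplified hypothetical structs (demo_device_attr and demo_uverbs_resp are stand-ins, not the real kernel or uverbs definitions): the kernel-side field is widened to 64 bits, and the copy to the user-space response explicitly keeps only the low 32 bits.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel and uverbs structs. */
struct demo_device_attr { uint64_t device_cap_flags; };   /* widened field */
struct demo_uverbs_resp { uint32_t device_cap_flags; };   /* unchanged ABI */

static void demo_copy_caps(struct demo_uverbs_resp *resp,
                           const struct demo_device_attr *attr)
{
    /* Only the low 32 bits fit the existing user-space field. */
    resp->device_cap_flags = (uint32_t)(attr->device_cap_flags & 0xffffffffu);
}
```

Flags placed in the upper 32 bits would then stay kernel-internal unless a new user-space field is added for them.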


Re: [PATCH 06/20] IB/core: Add mad_size to ib_device_attr

2015-01-13 Thread Or Gerlitz


On 1/12/2015 7:10 PM, ira.we...@intel.com wrote:

Change all IB drivers to report the IB management size.


Do you mean the maximal size of IB MADs they support? If this is the
case, please reflect it in the field name.



Add check to verify that all devices support at least IB_MGMT_MAD_SIZE



Re: [PATCH 05/20] IB/mad: Change cast in rcv_has_same_class

2015-01-13 Thread Or Gerlitz


On 1/12/2015 7:10 PM, ira.we...@intel.com wrote:

From: Ira Weiny 

Saves a dereference and clarifies that rcv_has_same_class can process both IB and
OPA MADs.


I don't see any clarification below... something missing here?



Signed-off-by: Ira Weiny 
---
  drivers/infiniband/core/mad.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 66b3940..819b794 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1750,7 +1750,7 @@ static int is_rmpp_data_mad(struct ib_mad_agent_private *mad_agent_priv,
  static inline int rcv_has_same_class(struct ib_mad_send_wr_private *wr,
 struct ib_mad_recv_wc *rwc)
  {
-   return ((struct ib_mad *)(wr->send_buf.mad))->mad_hdr.mgmt_class ==
+   return ((struct ib_mad_hdr *)(wr->send_buf.mad))->mgmt_class ==
rwc->recv_buf.mad->mad_hdr.mgmt_class;
  }
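
Why the narrower cast in the quoted hunk is equivalent can be sketched in user space. The structs below are hypothetical, heavily simplified stand-ins for ib_mad_hdr and ib_mad; the one property assumed is that the header is the first member of the full MAD, so a pointer to the buffer is also a valid pointer to the header regardless of how large the payload behind it is.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for ib_mad_hdr and ib_mad. */
struct demo_mad_hdr {
    uint8_t base_version;
    uint8_t mgmt_class;
};

struct demo_mad {
    struct demo_mad_hdr mad_hdr;   /* first member, offset 0 */
    uint8_t data[232];
};

/* Compare management classes by looking only at the common header. */
static int demo_same_class(const void *send_buf, const struct demo_mad *recv)
{
    return ((const struct demo_mad_hdr *)send_buf)->mgmt_class ==
           recv->mad_hdr.mgmt_class;
}
```

Because only the header layout is consulted, the same comparison would work for any buffer type that starts with that header.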
  



Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

2015-01-13 Thread Or Gerlitz


On 1/14/2015 8:01 AM, Doug Ledford wrote:
> I've tested my patch on a plain upstream kernel and it resolves the
> problem there too.


Good, please make a proper submission on the list so we can review it 
inline and send questions/feedback.


Or.

> So, I feel confident that the issue is resolved properly with this
> patch. Roland, do you need a different submission, or can you take the
> patch from my last email?



Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

2015-01-13 Thread Doug Ledford

On Tue, 2015-01-13 at 16:01 -0500, Doug Ledford wrote:
> On Tue, 2015-01-13 at 22:13 +0200, Or Gerlitz wrote:
> > On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford  wrote:
> > > On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> > >> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz  wrote:
> > >> > From: Erez Shitrit 
> > >> >
> > >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> > >> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> > >> > aren't working.
> > >>
> > >> Doug, can you ack the breakage introduced by your commit and the fix?
> > >
> > > I haven't double checked the breakage, I'll take your word for it
> > 
> > just try ping6 or iperf multicast and see it for yourself, please.
> 
> I have.  I have them working now.
> 
> > 
> > > (at the time I did my work, I had multicast debugging on and I verified the
> > > join/leave process, but I had assumed that the process would work the
> > > same for optional multicast groups as it does for the IPoIB broadcast
> > > group and other default IPoIB groups, so I didn't specifically test
> > > additional multicast groups above and beyond the broadcast/etc groups).
> > >
> > > However, the fix is not workable.  In particular, as soon as this patch
> > > is added to the kernel, you will start getting messages like this:
> > >
> > > mlx4_ib0: ipoib_mcast_leave on an in-flight join
> > 
> > 
> > I don't see it on my systems, is that upstream you're running? what entity does
> > mlx4_ib0: prefixed prints and under what settings, is that the IPoIB driver?
> 
> No, that's my internal rhel7 tree, but it's so close to bare upstream
> when it comes to IPoIB that there really shouldn't be any difference.

I've tested my patch on a plain upstream kernel and it resolves the
problem there too.  So, I feel confident that the issue is resolved
properly with this patch.

Roland, do you need a different submission, or can you take the patch
from my last email?

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





Re: [PATCH] infiniband: mlx5: avoid a compile-time warning

2015-01-13 Thread David Miller

From: Arnd Bergmann 
Date: Tue, 13 Jan 2015 17:09:43 +0100

> The return type of find_first_bit() is architecture specific,
> on ARM it is 'unsigned int', while the asm-generic code used
> on x86 and a lot of other architectures returns 'unsigned long'.
> 
> When building the mlx5 driver on ARM, we get a warning about
> this:
> 
> infiniband/hw/mlx5/mem.c: In function 'mlx5_ib_cont_pages':
> infiniband/hw/mlx5/mem.c:84:143: warning: comparison of distinct pointer types lacks a cast
>  m = min(m, find_first_bit(&tmp, sizeof(tmp)));
> 
> This patch changes the driver to use min_t to make it behave
> the same way on all architectures.
> 
> Signed-off-by: Arnd Bergmann 

Applied.
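
The fix described in the quoted changelog can be illustrated with a simplified user-space sketch. demo_min_t and demo_first_bit are hypothetical: the macro only mimics the idea of the kernel's min_t (cast both operands to one named type before comparing) and, unlike the kernel macro, evaluates its arguments twice; the helper stands in for a function whose return type differs from the other operand.

```c
/* Simplified min_t: casts both operands to one type before comparing.
 * Unlike the kernel macro, this version evaluates its arguments twice. */
#define demo_min_t(type, a, b) \
    ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Stand-in for a helper that returns 'unsigned int' on some builds. */
static unsigned int demo_first_bit(unsigned long word)
{
    unsigned int i;

    for (i = 0; i < sizeof(word) * 8; i++)
        if (word & (1UL << i))
            return i;
    return (unsigned int)(sizeof(word) * 8);
}
```

With an 'unsigned long m', a call such as `m = demo_min_t(unsigned long, m, demo_first_bit(tmp));` compares both values as 'unsigned long' whatever the helper's return type is, which is the behaviour the changelog asks for on every architecture.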

Re: [PATCH] mlx5: avoid build warnings on 32-bit

2015-01-13 Thread David Miller

From: Arnd Bergmann 
Date: Tue, 13 Jan 2015 17:08:06 +0100

> The mlx5 driver passes a string pointer in through a 'u64' variable,
> which on 32-bit machines causes a build warning:
> 
> drivers/net/ethernet/mellanox/mlx5/core/debugfs.c: In function 'qp_read_field':
> drivers/net/ethernet/mellanox/mlx5/core/debugfs.c:303:11: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
> 
> The code is in fact safe, so we can shut up the warning by adding
> extra type casts.
> 
> Signed-off-by: Arnd Bergmann 

Applied.
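
The kind of extra cast the changelog refers to can be sketched in portable C. The helpers below are hypothetical and use uintptr_t, the pointer-sized integer type available in user space; the thread elsewhere notes that the patch itself casts through 'unsigned long', which has the same width on the architectures Linux runs on.

```c
#include <stdint.h>

/* Store a pointer in a 64-bit integer without a size-mismatch warning:
 * go through a pointer-sized integer first, then widen. */
static uint64_t demo_ptr_to_u64(const char *s)
{
    return (uint64_t)(uintptr_t)s;
}

/* Recover the pointer: narrow to pointer size, then convert. */
static const char *demo_u64_to_ptr(uint64_t v)
{
    return (const char *)(uintptr_t)v;
}
```

The intermediate cast is what silences -Wpointer-to-int-cast on 32-bit builds, since the pointer-to-integer conversion then happens between types of equal size.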

Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

2015-01-13 Thread Doug Ledford

On Tue, 2015-01-13 at 22:13 +0200, Or Gerlitz wrote:
> On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford  wrote:
> > On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> >> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz  wrote:
> >> > From: Erez Shitrit 
> >> >
> >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> >> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> >> > aren't working.
> >>
> >> Doug, can you ack the breakage introduced by your commit and the fix?
> >
> > I haven't double checked the breakage, I'll take your word for it
> 
> just try ping6 or iperf multicast and see it for yourself, please.

I have.  I have them working now.

> 
> > (at the time I did my work, I had multicast debugging on and I verified the
> > join/leave process, but I had assumed that the process would work the
> > same for optional multicast groups as it does for the IPoIB broadcast
> > group and other default IPoIB groups, so I didn't specifically test
> > additional multicast groups above and beyond the broadcast/etc groups).
> >
> > However, the fix is not workable.  In particular, as soon as this patch
> > is added to the kernel, you will start getting messages like this:
> >
> > mlx4_ib0: ipoib_mcast_leave on an in-flight join
> 
> 
> I don't see it on my systems, is that upstream you're running? what entity does
> mlx4_ib0: prefixed prints and under what settings, is that the IPoIB driver?

No, that's my internal rhel7 tree, but it's so close to bare upstream
when it comes to IPoIB that there really shouldn't be any difference.
And mlx4_ib0 is just the name that I renamed ib0 to (An internal
standard in my test cluster is that all ib interfaces are named based
upon the hardware they are tied to, so mlx4_ib0, mlx5_ib0, qib_ib0, etc.
Makes my life in verifying code coverage easier)  I can test again with
an upstream kernel later today.  But, for now, suffice it to say the
problem is not resolved with the patch in this thread, but with the much
simpler patch I've attached to this email (note, this is made against
rhel7, I'm only attaching it so you can try it yourself, assuming it
even applies, and once I've tested it on an upstream kernel myself and
verified that it works properly, I'll submit the final upstream version
under a new thread).

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD

commit dc2c71896edc3b549e79a51cf83db813d6012a34
Author: Doug Ledford 
Date:   Tue Jan 13 13:35:40 2015 -0500

IB/ipoib: Fix failed multicast joins/sends

The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
it is used to mean "our device is administratively allowed to send
multicast joins/leaves/packets" and in other places it means "our
multicast join task thread is currently running and will process your
request if you put it on the queue".  However, this latter meaning is in
fact flawed as there is a race condition between the join task testing
the mcast list and finding it empty of remaining work, dropping the
mcast mutex and also the priv->lock spinlock, and clearing the
IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
the flag in the former fashion, and when all tasks complete and the task
thread clears the RUN flag, all of those other locations will fail to
ever again queue any work.  This results in the interface coming up fine
initially, but having problems adding new multicast groups after the
first round of groups have all been added and the RUN flag is cleared by
the join task thread when it thinks it is done.  To resolve this issue,
convert all locations in the code to treat the RUN flag as an indicator
that the multicast portion of this interface is in fact administratively
up and joins/leaves/sends can be performed.  There is no harm (other
than a slight performance penalty) to never clearing this flag and using
it in this fashion as it simply means that a few places that used to
micro-optimize how often this task was queued on a work queue will now
queue the task a few extra times.  We can address that suboptimal
behavior in future patches.

Signed-off-by: Doug Ledford 

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 8a538c010b9..cba6e160df2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -638,8 +638,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}

 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }

 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -649,8 +647,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");

 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_t

Re: [PATCH] mlx5: avoid build warnings on 32-bit

2015-01-13 Thread Arnd Bergmann

On Tuesday 13 January 2015 18:28:03 Eli Cohen wrote:
> On Tue, Jan 13, 2015 at 05:08:06PM +0100, Arnd Bergmann wrote:
> 
> Hi Arnd,
> wouldn't it work by casting to uintptr_t instead of unsigned long?
> 

These are the same on all architectures that Linux can run on,
but if you have a strong preference, I can send an updated
patch, the effect is exactly the same.

Arnd

Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

2015-01-13 Thread Or Gerlitz

On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford  wrote:
> On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
>> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz  wrote:
>> > From: Erez Shitrit 
>> >
>> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
>> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
>> > aren't working.
>>
>> Doug, can you ack the breakage introduced by your commit and the fix?
>
> I haven't double checked the breakage, I'll take your word for it

just try ping6 or iperf multicast and see it for yourself, please.


> (at the time I did my work, I had multicast debugging on and I verified the
> join/leave process, but I had assumed that the process would work the
> same for optional multicast groups as it does for the IPoIB broadcast
> group and other default IPoIB groups, so I didn't specifically test
> additional multicast groups above and beyond the broadcast/etc groups).
>
> However, the fix is not workable.  In particular, as soon as this patch
> is added to the kernel, you will start getting messages like this:
>
> mlx4_ib0: ipoib_mcast_leave on an in-flight join


I don't see it on my systems, is that upstream you're running? what entity does
mlx4_ib0: prefixed prints and under what settings, is that the IPoIB driver?

Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

2015-01-13 Thread Doug Ledford

On Wed, 2015-01-07 at 17:04 +0200, Or Gerlitz wrote:
> From: Erez Shitrit 
> 
> Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> both IPv6 traffic and for the most cases all IPv4 multicast traffic
> aren't working.
> 
> After this change there is no mechanism to handle the work that does the
> join process for the rest of the mcg's. For example, if in the list of
> all the mcg's there is a send-only request, after its processing, the
> code in ipoib_mcast_sendonly_join_complete() will not requeue the
> mcast task, but leaves the bit that signals this task is running,
> and hence the task will never run.
> 
> Also, whenever the kernel sends a multicast packet (without joining this
> group), we don't call ipoib_send_only_join(); the code tries to start
> the mcast task but fails because the IPOIB_MCAST_RUN bit is always
> set. As a result, the multicast packet will never be sent.
> 
> The fix handles all the join requests via the same logic, and call
> explicitly to sendonly join whenever there is a packet from sendonly type.
> 
> Since ipoib_mcast_sendonly_join() is now called from the driver TX flow,
> we can't take mutex there. Locking isn't required there since the multicast
> join callback will be called only after the SA agent initialized the relevant
> multicast object.
> 
> Fixes: 016d9fb25cd9 ('IPoIB: fix MCAST_FLAG_BUSY usage')
> Reported-by: Eyal Perry 
> Signed-off-by: Erez Shitrit 
> Signed-off-by: Or Gerlitz 
> ---
> V0 --> V1 changes: Added credits (...) and furnished the change-log a bit.
> 
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   15 ++-
>  1 files changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index bc50dd0..0ea4b08 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -301,9 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status,
>   dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
>   }
>   netif_tx_unlock_bh(dev);
> +
> + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>   }
>  out:
> - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>   if (status)
>   mcast->mc = NULL;
>   complete(&mcast->done);

This chunk is wrong.  We are in our complete routine, which means
ib_sa_join_multicast is calling us for this mcast group, and we will
never see another return for this group.  We must clear the BUSY flag no
matter what as the BUSY flag now indicates that our mcast join is still
outstanding in the lower layer ib_sa_ area, not that we have joined the
group.  Please re-read my patches that re-worked the BUSY flag usage.
The BUSY flag was poorly named/used in the past, which is why a previous
patch introduced the JOINING or whatever flag it was called.  My
patchset reworks the flag usage to be more sane.  BUSY now means
*exactly* that: this mcast group is in the process of joining, aka it's
BUSY.  It doesn't mean we've joined the group and there are no more
outstanding join requests.  That's signified by
!IS_ERR_OR_NULL(mcast->mc).
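
The flag semantics argued for above can be modelled in a few lines of user-space C. All names are hypothetical (struct demo_mcast and the demo_* functions are not the IPoIB code), and the model uses a plain NULL check where the driver would use IS_ERR_OR_NULL(); it is only meant to show that the completion handler clears BUSY unconditionally, while "joined" is derived from the handle.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model; names are hypothetical. */
struct demo_mcast {
    bool  busy;   /* a join request is outstanding in the lower layer */
    void *mc;     /* non-NULL only while the join handle is valid */
};

/* Completion callback: status != 0 means the join failed. */
static int demo_join_complete(struct demo_mcast *m, int status)
{
    m->busy = false;      /* always: no further callback will arrive */
    if (status)
        m->mc = NULL;     /* the lower layer drops the handle on error */
    return status;
}

/* "Joined" is a property of the handle, not of the BUSY flag. */
static bool demo_joined(const struct demo_mcast *m)
{
    return !m->busy && m->mc != NULL;
}
```

Moving the clearing of busy into the success branch only, as the quoted hunk does, would leave a failed join permanently marked as in flight in this model.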

> @@ -342,7 +343,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>   rec.port_gid = priv->local_gid;
>   rec.pkey = cpu_to_be16(priv->pkey);
>  
> - mutex_lock(&mcast_mutex);
>   init_completion(&mcast->done);
>   set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>   mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
> @@ -364,7 +364,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>   ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
>   "sendonly join\n", mcast->mcmember.mgid.raw);
>   }
> - mutex_unlock(&mcast_mutex);
>  
>   return ret;
>  }

No!  You can not, under any circumstances, remove this locking!  One of
the things that frustrated me for a bit until I tracked it down was how
ib_sa_join_multicast returns errors to the ipoib layer.  When you call
ib_sa_join_multicast, the return value is either a valid mcast->mc
pointer or IS_ERR(err).  If it's a valid pointer, that does not mean we
have successfully joined, it means that we might join, but it isn't
until we have completed the callback that we know.  The callback will
clear out mcast->mc if we encounter an error during the callback and
know that by returning an error from the callback, the lower layer is
going to delete the mcast->mc context out from underneath us.  As it
turns out, we often get our callbacks called even before we get the
initial return from ib_sa_join_multicast.  If we don't have this
locking, and we get any error in the callback, the callback will clear
mcast->mc to indicate that we have no valid group, then we will return
from ib_sa_join_multicast and set mcast->mc to an invalid group.  To
prevent that, the callback grabs this mutex

Re: [PATCH v2 00/20] NFS/RDMA client for 3.20

2015-01-13 Thread Steve Wise


Reviewed-by: Steve Wise 


Re: [PATCH v2 00/10] NFS/RDMA server for 3.20

2015-01-13 Thread Steve Wise



Reviewed-by: Steve Wise 


Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

2015-01-13 Thread Doug Ledford

On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz  wrote:
> > From: Erez Shitrit 
> >
> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> > aren't working.
> 
> Doug, can you ack the breakage introduced by your commit and the fix?

I haven't double checked the breakage, I'll take your word for it (at
the time I did my work, I had multicast debugging on and I verified the
join/leave process, but I had assumed that the process would work the
same for optional multicast groups as it does for the IPoIB broadcast
group and other default IPoIB groups, so I didn't specifically test
additional multicast groups above and beyond the broadcast/etc groups).

However, the fix is not workable.  In particular, as soon as this patch
is added to the kernel, you will start getting messages like this:

mlx4_ib0: ipoib_mcast_leave on an in-flight join

Every time you get this message, you've run into a "shouldn't ever
happen" situation.  If this happens, then we've lost track of the mcast
flags settings or we've genuinely tried to remove a mcast group where
the lower layer is still working on our join.  Either way, it means
we've screwed up.  Further, with this patch in place, I'm seeing random
acts of badness now with non-default IPoIB pkey joins.  Sometimes they
work, sometimes they don't.

So, no, this patch doesn't work.  I'll do some more investigating and
report back.

> Or.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD


Re: [PATCH] infiniband: mlx5: avoid a compile-time warning

2015-01-13 Thread Eli Cohen

Acked-by: Eli Cohen 

On Tue, Jan 13, 2015 at 05:09:43PM +0100, Arnd Bergmann wrote:
> The return type of find_first_bit() is architecture specific,
> on ARM it is 'unsigned int', while the asm-generic code used
> on x86 and a lot of other architectures returns 'unsigned long'.
> 
> When building the mlx5 driver on ARM, we get a warning about
> this:
> 
> infiniband/hw/mlx5/mem.c: In function 'mlx5_ib_cont_pages':
> infiniband/hw/mlx5/mem.c:84:143: warning: comparison of distinct pointer types lacks a cast
>  m = min(m, find_first_bit(&tmp, sizeof(tmp)));
> 
> This patch changes the driver to use min_t to make it behave
> the same way on all architectures.
> 
> Signed-off-by: Arnd Bergmann 
> 
> diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
> index b56e4c5593ee..611a9fdf2f38 100644
> --- a/drivers/infiniband/hw/mlx5/mem.c
> +++ b/drivers/infiniband/hw/mlx5/mem.c
> @@ -81,7 +81,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
>   for (k = 0; k < len; k++) {
>   if (!(i & mask)) {
>   tmp = (unsigned long)pfn;
> - m = min(m, find_first_bit(&tmp, sizeof(tmp)));
> + m = min_t(unsigned long, m, find_first_bit(&tmp, sizeof(tmp)));
>   skip = 1 << m;
>   mask = skip - 1;
>   base = pfn;
> 
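
The warning quoted above comes from the strict min() macro, which compares the addresses of its two temporaries so that the compiler complains when the operand types differ. min_t() converts both operands to one named type first. A minimal user-space sketch follows, assuming GNU C statement expressions; the macro is a simplified copy modeled on the kernel one, and first_set_bit() and smallest_shift() are hypothetical stand-ins, not the mlx5 code.

```c
#include <assert.h>

/* Simplified copy modeled on the kernel's min_t(); requires GNU C. */
#define min_t(type, x, y) ({			\
	type __a = (x);				\
	type __b = (y);				\
	__a < __b ? __a : __b; })

/*
 * Hypothetical stand-in for an architecture whose find_first_bit()
 * returns unsigned int. Returns nbits when no bit is set.
 */
static unsigned int first_set_bit(unsigned long word, unsigned int nbits)
{
	unsigned int i;

	for (i = 0; i < nbits; i++)
		if (word & (1UL << i))
			return i;
	return nbits;
}

static unsigned long smallest_shift(unsigned long m, unsigned long pfn)
{
	/* Both operands become unsigned long before the comparison. */
	return min_t(unsigned long, m,
		     first_set_bit(pfn, 8 * sizeof(pfn)));
}
```

With a strict min(), the call in smallest_shift() would mix unsigned long and unsigned int and trigger the "distinct pointer types" warning; naming the type makes the result the same on every architecture.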

Re: [PATCH] mlx5: avoid build warnings on 32-bit

2015-01-13 Thread Eli Cohen

On Tue, Jan 13, 2015 at 05:08:06PM +0100, Arnd Bergmann wrote:

Hi Arnd,
Wouldn't casting to uintptr_t instead of unsigned long work?

> The mlx5 driver passes a string pointer in through a 'u64' variable,
> which on 32-bit machines causes a build warning:
> 
> drivers/net/ethernet/mellanox/mlx5/core/debugfs.c: In function 'qp_read_field':
> drivers/net/ethernet/mellanox/mlx5/core/debugfs.c:303:11: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
> 
> The code is in fact safe, so we can shut up the warning by adding
> extra type casts.
> 
> Signed-off-by: Arnd Bergmann 
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
> index 10e1f1a18255..4878025e231c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
> @@ -300,11 +300,11 @@ static u64 qp_read_field(struct mlx5_core_dev *dev, struct mlx5_core_qp *qp,
>   param = qp->pid;
>   break;
>   case QP_STATE:
> - param = (u64)mlx5_qp_state_str(be32_to_cpu(ctx->flags) >> 28);
> + param = (unsigned long)mlx5_qp_state_str(be32_to_cpu(ctx->flags) >> 28);
>   *is_str = 1;
>   break;
>   case QP_XPORT:
> - param = (u64)mlx5_qp_type_str((be32_to_cpu(ctx->flags) >> 16) & 0xff);
> + param = (unsigned long)mlx5_qp_type_str((be32_to_cpu(ctx->flags) >> 16) & 0xff);
>   *is_str = 1;
>   break;
>   case QP_MTU:
> @@ -464,7 +464,7 @@ static ssize_t dbg_read(struct file *filp, char __user *buf, size_t count,
>  
>  
>   if (is_str)
> - ret = snprintf(tbuf, sizeof(tbuf), "%s\n", (const char *)field);
> + ret = snprintf(tbuf, sizeof(tbuf), "%s\n", (const char *)(unsigned long)field);
>   else
>   ret = snprintf(tbuf, sizeof(tbuf), "0x%llx\n", field);
>  
> 
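
The cast question raised in this thread can be illustrated in user space. This is a sketch, not the mlx5 code: pack_str() and unpack_str() are hypothetical helpers, and it uses uintptr_t from <stdint.h> as the pointer-sized integer. Going through a pointer-sized integer first is what avoids the "different size" warning on 32-bit builds; on Linux, unsigned long has the same width as a pointer, so either spelling has that effect there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Carry a string pointer in a 64-bit field without a size-mismatch cast. */
static uint64_t pack_str(const char *s)
{
	assert(s != NULL);
	/* pointer -> pointer-sized integer -> 64-bit integer */
	return (uint64_t)(uintptr_t)s;
}

/* Reverse the conversion: 64-bit integer -> pointer-sized -> pointer. */
static const char *unpack_str(uint64_t field)
{
	return (const char *)(uintptr_t)field;
}
```

A value produced by pack_str() converts back to the same pointer with unpack_str(), which is the round trip the debugfs read path relies on when is_str is set.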

[PATCH v2 14/20] xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()

2015-01-13 Thread Chuck Lever

Move the details of how to create and destroy rpcrdma_req and
rpcrdma_rep structures into helper functions.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |  148 ---
 1 file changed, 95 insertions(+), 53 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index fd71501..24ea6dd 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1075,6 +1075,69 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
}
 }
 
+static struct rpcrdma_req *
+rpcrdma_create_req(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
+   size_t wlen = 1 << fls(cdata->inline_wsize +
+  sizeof(struct rpcrdma_req));
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_req *req;
+   int rc;
+
+   rc = -ENOMEM;
+   req = kmalloc(wlen, GFP_KERNEL);
+   if (req == NULL)
+   goto out;
+   memset(req, 0, sizeof(struct rpcrdma_req));
+
+   rc = rpcrdma_register_internal(ia, req->rl_base, wlen -
+  offsetof(struct rpcrdma_req, rl_base),
+  &req->rl_handle, &req->rl_iov);
+   if (rc)
+   goto out_free;
+
+   req->rl_size = wlen - sizeof(struct rpcrdma_req);
+   req->rl_buffer = &r_xprt->rx_buf;
+   return req;
+
+out_free:
+   kfree(req);
+out:
+   return ERR_PTR(rc);
+}
+
+static struct rpcrdma_rep *
+rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
+   size_t rlen = 1 << fls(cdata->inline_rsize +
+  sizeof(struct rpcrdma_rep));
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_rep *rep;
+   int rc;
+
+   rc = -ENOMEM;
+   rep = kmalloc(rlen, GFP_KERNEL);
+   if (rep == NULL)
+   goto out;
+   memset(rep, 0, sizeof(struct rpcrdma_rep));
+
+   rc = rpcrdma_register_internal(ia, rep->rr_base, rlen -
+  offsetof(struct rpcrdma_rep, rr_base),
+  &rep->rr_handle, &rep->rr_iov);
+   if (rc)
+   goto out_free;
+
+   rep->rr_buffer = &r_xprt->rx_buf;
+   return rep;
+
+out_free:
+   kfree(rep);
+out:
+   return ERR_PTR(rc);
+}
+
 static int
 rpcrdma_init_fmrs(struct rpcrdma_ia *ia, struct rpcrdma_buffer *buf)
 {
@@ -1167,7 +1230,7 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
char *p;
-   size_t len, rlen, wlen;
+   size_t len;
int i, rc;
 
buf->rb_max_requests = cdata->max_requests;
@@ -1227,62 +1290,29 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
break;
}
 
-   /*
-* Allocate/init the request/reply buffers. Doing this
-* using kmalloc for now -- one for each buf.
-*/
-   wlen = 1 << fls(cdata->inline_wsize + sizeof(struct rpcrdma_req));
-   rlen = 1 << fls(cdata->inline_rsize + sizeof(struct rpcrdma_rep));
-   dprintk("RPC:   %s: wlen = %zu, rlen = %zu\n",
-   __func__, wlen, rlen);
-
for (i = 0; i < buf->rb_max_requests; i++) {
struct rpcrdma_req *req;
struct rpcrdma_rep *rep;
 
-   req = kmalloc(wlen, GFP_KERNEL);
-   if (req == NULL) {
+   req = rpcrdma_create_req(r_xprt);
+   if (IS_ERR(req)) {
dprintk("RPC:   %s: request buffer %d alloc"
" failed\n", __func__, i);
-   rc = -ENOMEM;
+   rc = PTR_ERR(req);
goto out;
}
-   memset(req, 0, sizeof(struct rpcrdma_req));
buf->rb_send_bufs[i] = req;
-   buf->rb_send_bufs[i]->rl_buffer = buf;
-
-   rc = rpcrdma_register_internal(ia, req->rl_base,
-   wlen - offsetof(struct rpcrdma_req, rl_base),
-   &buf->rb_send_bufs[i]->rl_handle,
-   &buf->rb_send_bufs[i]->rl_iov);
-   if (rc)
-   goto out;
 
-   buf->rb_send_bufs[i]->rl_size = wlen -
-   sizeof(struct rpcrdma_req);
-
-   rep = kmalloc(rlen, GFP_KERNEL);
-   if (rep == NULL) {
+   rep = rpcrdma_create_rep(r_xprt);
+   if (IS_ERR(rep)) {
dprintk("RPC:   %s: reply buffer %d alloc failed\n",
__func__, i);
-   rc = -ENOMEM;
+   rc = PTR_ERR(rep);
goto out;
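
The helpers in this patch return either a valid pointer or an errno value encoded in the pointer, which the caller unpacks with IS_ERR() and PTR_ERR(). A minimal user-space sketch of that convention follows. The three helpers are simplified copies modeled on the kernel's, and struct req and create_req() are hypothetical; the fail_register flag merely stands in for a registration step that can fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified user-space copies of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct req {
	size_t size;
};

/* Hypothetical helper in the style of rpcrdma_create_req(). */
static struct req *create_req(size_t size, int fail_register)
{
	struct req *req;
	int rc;

	rc = -ENOMEM;
	req = calloc(1, sizeof(*req));
	if (req == NULL)
		goto out;

	rc = fail_register ? -EIO : 0;	/* stand-in for registration */
	if (rc)
		goto out_free;

	req->size = size;
	return req;

out_free:
	free(req);
out:
	return ERR_PTR(rc);
}
```

The caller tests the result with IS_ERR() and recovers the code with PTR_ERR(), so the specific failure reason travels back through the pointer instead of being replaced by a blanket -ENOMEM.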

[PATCH v2 12/20] xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack

2015-01-13 Thread Chuck Lever

Reduce stack footprint of the connection upcall handler function.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   15 ---
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 123bb04..958b372 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -425,8 +425,8 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
struct sockaddr_in *addr = (struct sockaddr_in *) &ep->rep_remote_addr;
 #endif
-   struct ib_qp_attr attr;
-   struct ib_qp_init_attr iattr;
+   struct ib_qp_attr *attr = &ia->ri_qp_attr;
+   struct ib_qp_init_attr *iattr = &ia->ri_qp_init_attr;
int connstate = 0;
 
switch (event->event) {
@@ -449,12 +449,13 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
break;
case RDMA_CM_EVENT_ESTABLISHED:
connstate = 1;
-   ib_query_qp(ia->ri_id->qp, &attr,
-   IB_QP_MAX_QP_RD_ATOMIC | IB_QP_MAX_DEST_RD_ATOMIC,
-   &iattr);
+   ib_query_qp(ia->ri_id->qp, attr,
+   IB_QP_MAX_QP_RD_ATOMIC | IB_QP_MAX_DEST_RD_ATOMIC,
+   iattr);
dprintk("RPC:   %s: %d responder resources"
" (%d initiator)\n",
-   __func__, attr.max_dest_rd_atomic, attr.max_rd_atomic);
+   __func__, attr->max_dest_rd_atomic,
+   attr->max_rd_atomic);
goto connected;
case RDMA_CM_EVENT_CONNECT_ERROR:
connstate = -ENOTCONN;
@@ -487,7 +488,7 @@ connected:
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
if (connstate == 1) {
-   int ird = attr.max_dest_rd_atomic;
+   int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
printk(KERN_INFO "rpcrdma: connection to %pI4:%u "
"on %s, memreg %d slots %d ird %d%s\n",
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ec596ce..2b4e778 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -71,6 +71,8 @@ struct rpcrdma_ia {
enum rpcrdma_memreg ri_memreg_strategy;
unsigned int            ri_max_frmr_depth;
struct ib_device_attr   ri_devattr;
+   struct ib_qp_attr   ri_qp_attr;
+   struct ib_qp_init_attr  ri_qp_init_attr;
 };
 
 /*


[PATCH v2 17/20] xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req

2015-01-13 Thread Chuck Lever

The rl_base field is the buffer where each RPC/RDMA send header is
built.

Pre-posted send buffers are supposed to be the same size on the
client and server. For Solaris and Linux, that size is currently
1024 bytes, the inline threshold.

The size of the rl_base buffer is currently dependent on
RPCRDMA_MAX_DATA_SEGS. When the client constructs a chunk list in
the RPC/RDMA header, each segment in the list takes up a little
room in the buffer.

If we want a large r/wsize maximum, MAX_SEGS will grow
significantly, but notice that the inline threshold size is not
supposed to change (since it should match on the client and server).

Therefore the inline size is the real limit on the size of the
RPC/RDMA send header.

No longer use RPCRDMA_MAX_DATA_SEGS to determine the size or
placement of the RPC/RDMA header send buffer. The buffer size should
always be the same as the inline threshold size.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   11 +--
 net/sunrpc/xprtrdma/transport.c |9 +
 net/sunrpc/xprtrdma/verbs.c |   22 +++---
 net/sunrpc/xprtrdma/xprt_rdma.h |6 ++
 4 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 8a6bdbd..c1d4a09 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -294,7 +294,7 @@ ssize_t
 rpcrdma_marshal_chunks(struct rpc_rqst *rqst, ssize_t result)
 {
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
-   struct rpcrdma_msg *headerp = (struct rpcrdma_msg *)req->rl_base;
+   struct rpcrdma_msg *headerp = rdmab_to_msg(req->rl_rdmabuf);
 
if (req->rl_rtype != rpcrdma_noch)
result = rpcrdma_create_chunks(rqst, &rqst->rq_snd_buf,
@@ -406,8 +406,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
base = rqst->rq_svec[0].iov_base;
rpclen = rqst->rq_svec[0].iov_len;
 
-   /* build RDMA header in private area at front */
-   headerp = (struct rpcrdma_msg *) req->rl_base;
+   headerp = rdmab_to_msg(req->rl_rdmabuf);
/* don't byte-swap XID, it's already done in request */
headerp->rm_xid = rqst->rq_xid;
headerp->rm_vers = rpcrdma_version;
@@ -528,7 +527,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
dprintk("RPC:   %s: %s: hdrlen %zd rpclen %zd padlen %zd"
" headerp 0x%p base 0x%p lkey 0x%x\n",
__func__, transfertypes[req->rl_wtype], hdrlen, rpclen, padlen,
-   headerp, base, req->rl_iov.lkey);
+   headerp, base, rdmab_lkey(req->rl_rdmabuf));
 
/*
 * initialize send_iov's - normally only two: rdma chunk header and
@@ -537,9 +536,9 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * header and any write data. In all non-rdma cases, any following
 * data has been copied into the RPC header buffer.
 */
-   req->rl_send_iov[0].addr = req->rl_iov.addr;
+   req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
req->rl_send_iov[0].length = hdrlen;
-   req->rl_send_iov[0].lkey = req->rl_iov.lkey;
+   req->rl_send_iov[0].lkey = rdmab_lkey(req->rl_rdmabuf);
 
req->rl_send_iov[1].addr = rdmab_addr(req->rl_sendbuf);
req->rl_send_iov[1].length = rpclen;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a9d5662..2c2fabe 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -470,6 +470,8 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
if (req == NULL)
return NULL;
 
+   if (req->rl_rdmabuf == NULL)
+   goto out_rdmabuf;
if (req->rl_sendbuf == NULL)
goto out_sendbuf;
if (size > req->rl_sendbuf->rg_size)
@@ -480,6 +482,13 @@ out:
req->rl_connect_cookie = 0; /* our reserved value */
return req->rl_sendbuf->rg_base;
 
+out_rdmabuf:
+   min_size = RPCRDMA_INLINE_WRITE_THRESHOLD(task->tk_rqstp);
+   rb = rpcrdma_alloc_regbuf(&r_xprt->rx_ia, min_size, flags);
+   if (IS_ERR(rb))
+   goto out_fail;
+   req->rl_rdmabuf = rb;
+
 out_sendbuf:
/* XDR encoding and RPC/RDMA marshaling of this request has not
 * yet occurred. Thus a lower bound is needed to prevent buffer
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4089440..c81749b 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1078,30 +1078,14 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 static struct rpcrdma_req *
 rpcrdma_create_req(struct rpcrdma_xprt *r_xprt)
 {
-   struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
-   size_t wlen = cdata->inline_wsize;
-   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_req *req;
-   int rc;
 
-   rc = -ENOMEM;
-   req = kmalloc(sizeof(*req) + wlen, GFP_KERNEL);
+   req = kzal
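
The allocate-on-first-use flow in the transport.c hunk above can be sketched in user space. Everything here is illustrative: struct regbuf, alloc_regbuf(), free_regbuf() and request_prepare() are hypothetical names, the lkey value is invented, and the sketch returns NULL on failure where the patch uses IS_ERR(). It only shows the shape of the idea: the header buffer is sized by the inline threshold and allocated once, separately from the request structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical registered-buffer wrapper; field names are illustrative. */
struct regbuf {
	size_t size;	/* usable bytes in base[] */
	uint32_t lkey;	/* stand-in for the key a registration would return */
	char base[];	/* payload follows the bookkeeping */
};

static struct regbuf *alloc_regbuf(size_t size)
{
	struct regbuf *rb = calloc(1, sizeof(*rb) + size);

	if (rb == NULL)
		return NULL;
	rb->size = size;
	rb->lkey = 0x1234;	/* a real transport would register memory here */
	return rb;
}

static void free_regbuf(struct regbuf *rb)
{
	free(rb);	/* NULL is tolerated */
}

struct request {
	struct regbuf *hdrbuf;	/* header buffer, sized by the inline threshold */
	struct regbuf *sendbuf;	/* message buffer, grown on demand */
};

/* Allocate each buffer on first use; grow the send buffer when too small. */
static void *request_prepare(struct request *req, size_t inline_size,
			     size_t size)
{
	if (req->hdrbuf == NULL) {
		req->hdrbuf = alloc_regbuf(inline_size);
		if (req->hdrbuf == NULL)
			return NULL;
	}
	if (req->sendbuf == NULL || size > req->sendbuf->size) {
		struct regbuf *rb = alloc_regbuf(size);

		if (rb == NULL)
			return NULL;
		free_regbuf(req->sendbuf);
		req->sendbuf = rb;
	}
	return req->sendbuf->base;
}
```

The header buffer keeps its size no matter how large later requests get, which mirrors the point of the commit message: the inline threshold, not the segment count, bounds the header.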

[PATCH v2 20/20] xprtrdma: Clean up after adding regbuf management

2015-01-13 Thread Chuck Lever

rpcrdma_{de}register_internal() are used only in verbs.c now.

MAX_RPCRDMAHDR is no longer used and can be removed.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |4 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h |9 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 669f68a..f7c3d10 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1729,7 +1729,7 @@ rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
  * Wrappers for internal-use kmalloc memory registration, used by buffer code.
  */
 
-int
+static int
 rpcrdma_register_internal(struct rpcrdma_ia *ia, void *va, int len,
struct ib_mr **mrp, struct ib_sge *iov)
 {
@@ -1780,7 +1780,7 @@ rpcrdma_register_internal(struct rpcrdma_ia *ia, void *va, int len,
return rc;
 }
 
-int
+static int
 rpcrdma_deregister_internal(struct rpcrdma_ia *ia,
struct ib_mr *mr, struct ib_sge *iov)
 {
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 5630353..c9d2a02 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -171,10 +171,6 @@ enum rpcrdma_chunktype {
 /* temporary static scatter/gather max */
 #define RPCRDMA_MAX_DATA_SEGS  (64)/* max scatter/gather */
 #define RPCRDMA_MAX_SEGS   (RPCRDMA_MAX_DATA_SEGS + 2) /* head+tail = 2 */
-#define MAX_RPCRDMAHDR (\
-   /* max supported RPC/RDMA header */ \
-   sizeof(struct rpcrdma_msg) + (2 * sizeof(u32)) + \
-   (sizeof(struct rpcrdma_read_chunk) * RPCRDMA_MAX_SEGS) + sizeof(u32))
 
 struct rpcrdma_buffer;
 
@@ -401,11 +397,6 @@ void rpcrdma_buffer_put(struct rpcrdma_req *);
 void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
 void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);
 
-int rpcrdma_register_internal(struct rpcrdma_ia *, void *, int,
-   struct ib_mr **, struct ib_sge *);
-int rpcrdma_deregister_internal(struct rpcrdma_ia *,
-   struct ib_mr *, struct ib_sge *);
-
 int rpcrdma_register_external(struct rpcrdma_mr_seg *,
int, int, struct rpcrdma_xprt *);
 int rpcrdma_deregister_external(struct rpcrdma_mr_seg *,


[PATCH v2 11/20] xprtrdma: Take struct ib_device_attr off the stack

2015-01-13 Thread Chuck Lever

Device attributes are large, and are used in more than one place.
Stash a copy in dynamically allocated memory.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   37 +
 net/sunrpc/xprtrdma/xprt_rdma.h |1 +
 2 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aa012a3..123bb04 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -588,8 +588,8 @@ int
 rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 {
int rc, mem_priv;
-   struct ib_device_attr devattr;
struct rpcrdma_ia *ia = &xprt->rx_ia;
+   struct ib_device_attr *devattr = &ia->ri_devattr;
 
ia->ri_id = rpcrdma_create_id(xprt, ia, addr);
if (IS_ERR(ia->ri_id)) {
@@ -605,26 +605,21 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
goto out2;
}
 
-   /*
-* Query the device to determine if the requested memory
-* registration strategy is supported. If it isn't, set the
-* strategy to a globally supported model.
-*/
-   rc = ib_query_device(ia->ri_id->device, &devattr);
+   rc = ib_query_device(ia->ri_id->device, devattr);
if (rc) {
dprintk("RPC:   %s: ib_query_device failed %d\n",
__func__, rc);
goto out3;
}
 
-   if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
+   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
ia->ri_have_dma_lkey = 1;
ia->ri_dma_lkey = ia->ri_id->device->local_dma_lkey;
}
 
if (memreg == RPCRDMA_FRMR) {
/* Requires both frmr reg and local dma lkey */
-   if ((devattr.device_cap_flags &
+   if ((devattr->device_cap_flags &
 (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) !=
(IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) {
dprintk("RPC:   %s: FRMR registration "
@@ -634,7 +629,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
/* Mind the ia limit on FRMR page list depth */
ia->ri_max_frmr_depth = min_t(unsigned int,
RPCRDMA_MAX_DATA_SEGS,
-   devattr.max_fast_reg_page_list_len);
+   devattr->max_fast_reg_page_list_len);
}
}
if (memreg == RPCRDMA_MTHCAFMR) {
@@ -736,20 +731,13 @@ int
 rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
struct rpcrdma_create_data_internal *cdata)
 {
-   struct ib_device_attr devattr;
+   struct ib_device_attr *devattr = &ia->ri_devattr;
struct ib_cq *sendcq, *recvcq;
int rc, err;
 
-   rc = ib_query_device(ia->ri_id->device, &devattr);
-   if (rc) {
-   dprintk("RPC:   %s: ib_query_device failed %d\n",
-   __func__, rc);
-   return rc;
-   }
-
/* check provider's send/recv wr limits */
-   if (cdata->max_requests > devattr.max_qp_wr)
-   cdata->max_requests = devattr.max_qp_wr;
+   if (cdata->max_requests > devattr->max_qp_wr)
+   cdata->max_requests = devattr->max_qp_wr;
 
ep->rep_attr.event_handler = rpcrdma_qp_async_error_upcall;
ep->rep_attr.qp_context = ep;
@@ -784,8 +772,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
}
ep->rep_attr.cap.max_send_wr *= depth;
-   if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr) {
-   cdata->max_requests = devattr.max_qp_wr / depth;
+   if (ep->rep_attr.cap.max_send_wr > devattr->max_qp_wr) {
+   cdata->max_requests = devattr->max_qp_wr / depth;
if (!cdata->max_requests)
return -EINVAL;
ep->rep_attr.cap.max_send_wr = cdata->max_requests *
@@ -868,10 +856,11 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
/* Client offers RDMA Read but does not initiate */
ep->rep_remote_cma.initiator_depth = 0;
-   if (devattr.max_qp_rd_atom > 32)/* arbitrary but <= 255 */
+   if (devattr->max_qp_rd_atom > 32)   /* arbitrary but <= 255 */
ep->rep_remote_cma.responder_resources = 32;
else
-   ep->rep_remote_cma.responder_resources = devattr.max_qp_rd_atom;
+   ep->rep_remote_cma.responder_resources =
+   devattr->max_qp_rd_atom;
 
ep->rep_remote_cma.retry_count = 7;
ep->rep_remote_cma.flow_control = 0;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h

[PATCH v2 13/20] xprtrdma: Simplify synopsis of rpcrdma_buffer_create()

2015-01-13 Thread Chuck Lever

Clean up: There is one call site for rpcrdma_buffer_create(). All of
the arguments there are fields of an rpcrdma_xprt.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |3 +--
 net/sunrpc/xprtrdma/verbs.c |7 +--
 net/sunrpc/xprtrdma/xprt_rdma.h |4 +---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a487bde..808b3c5 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -364,8 +364,7 @@ xprt_setup_rdma(struct xprt_create *args)
 * any inline data. Also specify any padding which will be provided
 * from a preregistered zero buffer.
 */
-   rc = rpcrdma_buffer_create(&new_xprt->rx_buf, new_ep, &new_xprt->rx_ia,
-   &new_xprt->rx_data);
+   rc = rpcrdma_buffer_create(new_xprt);
if (rc)
goto out3;
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 958b372..fd71501 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1161,9 +1161,11 @@ out_free:
 }
 
 int
-rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
-   struct rpcrdma_ia *ia, struct rpcrdma_create_data_internal *cdata)
+rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
 {
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
char *p;
size_t len, rlen, wlen;
int i, rc;
@@ -1200,6 +1202,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 * Register the zeroed pad buffer, if any.
 */
if (cdata->padding) {
+   struct rpcrdma_ep *ep = &r_xprt->rx_ep;
rc = rpcrdma_register_internal(ia, p, cdata->padding,
&ep->rep_pad_mr, &ep->rep_pad);
if (rc)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 2b4e778..5c2fac3 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -354,9 +354,7 @@ int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_ep *,
 /*
  * Buffer calls - xprtrdma/verbs.c
  */
-int rpcrdma_buffer_create(struct rpcrdma_buffer *, struct rpcrdma_ep *,
-   struct rpcrdma_ia *,
-   struct rpcrdma_create_data_internal *);
+int rpcrdma_buffer_create(struct rpcrdma_xprt *);
 void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
 
 struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);


[PATCH v2 18/20] xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep

2015-01-13 Thread Chuck Lever

The rr_base field is the buffer where each RPC/RDMA reply header
lands.

In some cases the RPC reply header also lands in this buffer, just
after the RPC/RDMA header.

The pre-posted receive buffers are supposed to be the same size
on the client and server. For Solaris and Linux, that size is
supposed to be 1024 bytes, the inline threshold.

The size of the rr_base buffer is currently dependent on
RPCRDMA_MAX_DATA_SEGS. When the server constructs a chunk list in
the RPC/RDMA header, each segment in the list takes up a little
room in the buffer.

If we want a large r/wsize maximum, MAX_SEGS will grow
significantly, but notice that the inline threshold size won't
change.

Therefore the inline size is the real limit on the size of the
RPC/RDMA header. The largest RPC reply the client can receive via
RDMA SEND is also no bigger than the inline size.

Thus the size of the pre-posted receive buffer should be exactly the
inline size * 2. The MAX_RPCRDMAHDR term should be replaced, and
rounding up ( 1 << fls(yada) ) is not necessary.

RPC replies received via RDMA WRITE (long replies) are caught in
rq_rcv_buf, which is the second half of the RPC send buffer. I.e.,
such replies are not involved in any way with rr_base.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |5 +++--
 net/sunrpc/xprtrdma/verbs.c |   27 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |   14 ++
 3 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c1d4a09..02efcaa 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -572,6 +572,7 @@ rpcrdma_count_chunks(struct rpcrdma_rep *rep, unsigned int max, int wrchunk, __b
 {
unsigned int i, total_len;
struct rpcrdma_write_chunk *cur_wchunk;
+   char *base = (char *)rdmab_to_msg(rep->rr_rdmabuf);
 
i = be32_to_cpu(**iptrp);
if (i > max)
@@ -599,7 +600,7 @@ rpcrdma_count_chunks(struct rpcrdma_rep *rep, unsigned int max, int wrchunk, __b
return -1;
cur_wchunk = (struct rpcrdma_write_chunk *) w;
}
-   if ((char *) cur_wchunk > rep->rr_base + rep->rr_len)
+   if ((char *)cur_wchunk > base + rep->rr_len)
return -1;
 
*iptrp = (__be32 *) cur_wchunk;
@@ -753,7 +754,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
dprintk("RPC:   %s: short/invalid reply\n", __func__);
goto repost;
}
-   headerp = (struct rpcrdma_msg *) rep->rr_base;
+   headerp = rdmab_to_msg(rep->rr_rdmabuf);
if (headerp->rm_vers != rpcrdma_version) {
dprintk("RPC:   %s: invalid version %d\n",
__func__, be32_to_cpu(headerp->rm_vers));
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c81749b..7aac422 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -298,8 +298,9 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
 
rep->rr_len = wc->byte_len;
ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
-   rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
-   prefetch(rep->rr_base);
+  rdmab_addr(rep->rr_rdmabuf),
+  rep->rr_len, DMA_FROM_DEVICE);
+   prefetch(rdmab_to_msg(rep->rr_rdmabuf));
 
 out_schedule:
list_add_tail(&rep->rr_list, sched_list);
@@ -1092,23 +1093,21 @@ static struct rpcrdma_rep *
 rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
-   size_t rlen = 1 << fls(cdata->inline_rsize +
-  sizeof(struct rpcrdma_rep));
+   size_t rlen = cdata->inline_rsize << 1;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_rep *rep;
int rc;
 
rc = -ENOMEM;
-   rep = kmalloc(rlen, GFP_KERNEL);
+   rep = kzalloc(sizeof(*rep), GFP_KERNEL);
if (rep == NULL)
goto out;
-   memset(rep, 0, sizeof(*rep));
 
-   rc = rpcrdma_register_internal(ia, rep->rr_base, rlen -
-  offsetof(struct rpcrdma_rep, rr_base),
-  &rep->rr_handle, &rep->rr_iov);
-   if (rc)
+   rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, rlen, GFP_KERNEL);
+   if (IS_ERR(rep->rr_rdmabuf)) {
+   rc = PTR_ERR(rep->rr_rdmabuf);
goto out_free;
+   }
 
rep->rr_buffer = &r_xprt->rx_buf;
return rep;
@@ -1306,7 +1305,7 @@ rpcrdma_destroy_rep(struct rpcrdma_ia *ia, struct rpcrdma_rep *rep)
if (!rep)
return;
 
-   rpcrdma_deregister_internal(ia, rep->rr_handle, &rep->rr_iov);
+   rpcrdma_free_regbuf(ia, rep->rr_rdmabuf);
kfree(rep);
 }
 
@@ -2209,11 +2208,13 @@ rpcrdm
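
The sizing change in this patch is easier to follow with numbers. The sketch below is a user-space illustration only: fls_sketch() mimics the kernel's fls() (1-based index of the highest set bit, 0 for an input of 0), old_rep_size() and new_regbuf_size() are hypothetical names, and any struct size passed in is a made-up figure rather than the real sizeof(struct rpcrdma_rep).

```c
#include <assert.h>
#include <stddef.h>

/* Mimics the kernel's fls(): 1-based index of the highest set bit. */
static int fls_sketch(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Old sizing: smallest power of two strictly greater than the sum. */
static size_t old_rep_size(unsigned int inline_rsize,
			   unsigned int struct_size)
{
	return (size_t)1 << fls_sketch(inline_rsize + struct_size);
}

/* New sizing from the patch: exactly twice the inline threshold. */
static size_t new_regbuf_size(unsigned int inline_rsize)
{
	return (size_t)inline_rsize << 1;
}
```

With a 1024-byte inline threshold and an assumed 200-byte structure, the old formula rounds 1224 up to 2048, and that single allocation also had to hold the structure itself. The new formula gives the receive buffer 2048 bytes of its own, with the structure allocated separately.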

[PATCH v2 19/20] xprtrdma: Allocate zero pad separately from rpcrdma_buffer

2015-01-13 Thread Chuck Lever

Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
contiguous memory needed for a buffer pool by moving the zero
pad buffer into a regbuf.

This is for consistency with the other uses of internally
registered memory.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |4 ++--
 net/sunrpc/xprtrdma/verbs.c |   29 ++---
 net/sunrpc/xprtrdma/xprt_rdma.h |3 +--
 3 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 02efcaa..7e9acd9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -549,9 +549,9 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
if (padlen) {
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
 
-   req->rl_send_iov[2].addr = ep->rep_pad.addr;
+   req->rl_send_iov[2].addr = rdmab_addr(ep->rep_padbuf);
req->rl_send_iov[2].length = padlen;
-   req->rl_send_iov[2].lkey = ep->rep_pad.lkey;
+   req->rl_send_iov[2].lkey = rdmab_lkey(ep->rep_padbuf);
 
req->rl_send_iov[3].addr = req->rl_send_iov[1].addr + rpclen;
req->rl_send_iov[3].length = rqst->rq_slen - rpclen;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 7aac422..669f68a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -794,6 +794,14 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
ep->rep_attr.qp_type = IB_QPT_RC;
ep->rep_attr.port_num = ~0;
 
+   if (cdata->padding) {
+   ep->rep_padbuf = rpcrdma_alloc_regbuf(ia, cdata->padding,
+ GFP_KERNEL);
+   if (IS_ERR(ep->rep_padbuf))
+   return PTR_ERR(ep->rep_padbuf);
+   } else
+   ep->rep_padbuf = NULL;
+
dprintk("RPC:   %s: requested max: dtos: send %d recv %d; "
"iovs: send %d recv %d\n",
__func__,
@@ -876,6 +884,7 @@ out2:
dprintk("RPC:   %s: ib_destroy_cq returned %i\n",
__func__, err);
 out1:
+   rpcrdma_free_regbuf(ia, ep->rep_padbuf);
return rc;
 }
 
@@ -902,11 +911,7 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
ia->ri_id->qp = NULL;
}
 
-   /* padding - could be done in rpcrdma_buffer_destroy... */
-   if (ep->rep_pad_mr) {
-   rpcrdma_deregister_internal(ia, ep->rep_pad_mr, &ep->rep_pad);
-   ep->rep_pad_mr = NULL;
-   }
+   rpcrdma_free_regbuf(ia, ep->rep_padbuf);
 
rpcrdma_clean_cq(ep->rep_attr.recv_cq);
rc = ib_destroy_cq(ep->rep_attr.recv_cq);
@@ -1220,12 +1225,10 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
 *   1.  arrays for send and recv pointers
 *   2.  arrays of struct rpcrdma_req to fill in pointers
 *   3.  array of struct rpcrdma_rep for replies
-*   4.  padding, if any
 * Send/recv buffers in req/rep need to be registered
 */
len = buf->rb_max_requests *
(sizeof(struct rpcrdma_req *) + sizeof(struct rpcrdma_rep *));
-   len += cdata->padding;
 
p = kzalloc(len, GFP_KERNEL);
if (p == NULL) {
@@ -1241,18 +1244,6 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
buf->rb_recv_bufs = (struct rpcrdma_rep **) p;
p = (char *) &buf->rb_recv_bufs[buf->rb_max_requests];
 
-   /*
-* Register the zeroed pad buffer, if any.
-*/
-   if (cdata->padding) {
-   struct rpcrdma_ep *ep = &r_xprt->rx_ep;
-   rc = rpcrdma_register_internal(ia, p, cdata->padding,
-   &ep->rep_pad_mr, &ep->rep_pad);
-   if (rc)
-   goto out;
-   }
-   p += cdata->padding;
-
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
switch (ia->ri_memreg_strategy) {
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 2b69316..5630353 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -88,8 +88,7 @@ struct rpcrdma_ep {
int rep_connected;
struct ib_qp_init_attr  rep_attr;
wait_queue_head_t   rep_connect_wait;
-   struct ib_sge   rep_pad;        /* holds zeroed pad */
-   struct ib_mr    *rep_pad_mr;    /* holds zeroed pad */
+   struct rpcrdma_regbuf   *rep_padbuf;
struct rdma_conn_param  rep_remote_cma;
struct sockaddr_storage rep_remote_addr;
struct delayed_work rep_connect_worker;

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 16/20] xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req

2015-01-13 Thread Chuck Lever

Because internal memory registration is an expensive and synchronous
operation, xprtrdma pre-registers send and receive buffers at mount
time, and then re-uses them for each RPC.

A "hardway" allocation is a memory allocation and registration that
replaces a send buffer during the processing of an RPC. Hardway must
be done if the RPC send buffer is too small to accommodate an RPC's
call or reply header.

For xprtrdma, each RPC send buffer is currently part of struct
rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
the address of an RPC send buffer, can find its matching struct
rpcrdma_req and rpcrdma_rep quickly (via container_of / offsetof).
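
As a rough illustration of the container_of / offsetof lookup mentioned
above, here is a userspace sketch. struct demo_req and
demo_buf_to_req() are hypothetical stand-ins, not the xprtrdma
definitions, and the macro is a simplified form of the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace form of container_of(): step back from a member
 * address to the start of the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical request: bookkeeping first, send buffer embedded after it. */
struct demo_req {
	unsigned int	niovs;
	char		sendbuf[64];
};

/* Given only the buffer address, recover the enclosing request. */
static struct demo_req *demo_buf_to_req(void *buffer)
{
	assert(buffer != NULL);
	return container_of(buffer, struct demo_req, sendbuf);
}
```

The lookup works only while the buffer is embedded in the request, so
the buffer's position depends on the size of the members placed before
it.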

That means that hardway currently has to replace a whole rpcrdma_req
when it replaces an RPC send buffer. This is often a fairly hefty
chunk of contiguous memory due to the size of the rl_segments array.

Some obscure re-use of fields in rpcrdma_req is done so that
xprt_rdma_free() can detect replaced buffers, and free them. The
original buffer and rpcrdma_req are restored in the process.

This commit breaks apart the RPC send buffer and struct rpcrdma_req
so that increasing the size of the rl_segments array does not change
the alignment of each RPC send buffer. (Increasing rl_segments is
needed to bump up the maximum r/wsize for NFS/RDMA).

This change opens up some interesting possibilities for improving
the design of xprt_rdma_allocate().

xprt_rdma_allocate() is now the one place where RPC send buffers
are allocated or re-allocated, and they are now always left in place
by xprt_rdma_free().

A large re-allocation that includes both the rl_segments array and
the RPC send buffer is no longer needed. Send buffer re-allocation
becomes quite rare. Good send buffer alignment is guaranteed no
matter what the size of the rl_segments array is.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |6 +-
 net/sunrpc/xprtrdma/transport.c |  146 ---
 net/sunrpc/xprtrdma/verbs.c |   16 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h |   14 +++-
 4 files changed, 78 insertions(+), 104 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index f2eda15..8a6bdbd 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -541,9 +541,9 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
req->rl_send_iov[0].length = hdrlen;
req->rl_send_iov[0].lkey = req->rl_iov.lkey;
 
-   req->rl_send_iov[1].addr = req->rl_iov.addr + (base - req->rl_base);
+   req->rl_send_iov[1].addr = rdmab_addr(req->rl_sendbuf);
req->rl_send_iov[1].length = rpclen;
-   req->rl_send_iov[1].lkey = req->rl_iov.lkey;
+   req->rl_send_iov[1].lkey = rdmab_lkey(req->rl_sendbuf);
 
req->rl_niovs = 2;
 
@@ -556,7 +556,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 
req->rl_send_iov[3].addr = req->rl_send_iov[1].addr + rpclen;
req->rl_send_iov[3].length = rqst->rq_slen - rpclen;
-   req->rl_send_iov[3].lkey = req->rl_iov.lkey;
+   req->rl_send_iov[3].lkey = rdmab_lkey(req->rl_sendbuf);
 
req->rl_niovs = 4;
}
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 808b3c5..a9d5662 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -449,77 +449,72 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
 /*
  * The RDMA allocate/free functions need the task structure as a place
  * to hide the struct rpcrdma_req, which is necessary for the actual send/recv
- * sequence. For this reason, the recv buffers are attached to send
- * buffers for portions of the RPC. Note that the RPC layer allocates
- * both send and receive buffers in the same call. We may register
- * the receive buffer portion when using reply chunks.
+ * sequence.
+ *
+ * The RPC layer allocates both send and receive buffers in the same call
+ * (rq_send_buf and rq_rcv_buf are both part of a single contiguous buffer).
+ * We may register rq_rcv_buf when using reply chunks.
  */
 static void *
 xprt_rdma_allocate(struct rpc_task *task, size_t size)
 {
struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
-   struct rpcrdma_req *req, *nreq;
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct rpcrdma_regbuf *rb;
+   struct rpcrdma_req *req;
+   size_t min_size;
+   gfp_t flags = task->tk_flags & RPC_TASK_SWAPPER ?
+   GFP_ATOMIC : GFP_NOFS;
 
-   req = rpcrdma_buffer_get(&rpcx_to_rdmax(xprt)->rx_buf);
+   req = rpcrdma_buffer_get(&r_xprt->rx_buf);
if (req == NULL)
return NULL;
 
-   if (size > req->rl_size) {
-   dprintk("RPC:   %s: size %zd too large for buffer[%zd]: "
-   "prog %d vers %d proc %d\n",
-   __func__, size, req->rl_size,
-   task->tk_cl

[PATCH v2 15/20] xprtrdma: Add struct rpcrdma_regbuf and helpers

2015-01-13 Thread Chuck Lever

There are several spots that allocate a buffer via kmalloc (usually
contiguously with another data structure) and then register that
buffer internally. I'd like to split the buffers out of these data
structures to allow the data structures to scale.

Start by adding functions that can kmalloc and register a buffer,
and can manage/preserve the buffer's associated ib_sge and ib_mr
fields.
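
The header-plus-payload layout can be sketched in userspace as follows.
This is only an analogue under stated assumptions: struct demo_regbuf,
demo_register() and the key value 42 are hypothetical, malloc() stands
in for kmalloc(), and failure is reported as NULL rather than as an
ERR_PTR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of a registered buffer: a private
 * header followed by the payload it describes. */
struct demo_regbuf {
	size_t		size;	/* payload size in bytes */
	int		key;	/* stand-in for a registration handle */
	unsigned char	base[];	/* payload starts here */
};

/* Stand-in for memory registration; always succeeds in this sketch. */
static int demo_register(void *addr, size_t len, int *key)
{
	assert(key != NULL);
	(void)addr;
	(void)len;
	*key = 42;
	return 0;
}

/* Allocate header and payload in one call, then "register" the payload.
 * On registration failure the allocation is undone before returning. */
static struct demo_regbuf *demo_alloc_regbuf(size_t size)
{
	struct demo_regbuf *rb;

	rb = malloc(sizeof(*rb) + size);
	if (rb == NULL)
		return NULL;

	rb->size = size;
	if (demo_register(rb->base, size, &rb->key) != 0) {
		free(rb);
		return NULL;
	}
	return rb;
}

/* Freeing a NULL regbuf is a no-op, so callers need no extra check. */
static void demo_free_regbuf(struct demo_regbuf *rb)
{
	if (rb != NULL)
		free(rb);
}
```

Keeping the size and registration state in the header means a caller
holds one pointer per buffer, and the structure that owns the buffer
only needs to store that pointer.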

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   55 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |   43 ++
 2 files changed, 98 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 24ea6dd..cdd6aac 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1828,6 +1828,61 @@ rpcrdma_deregister_internal(struct rpcrdma_ia *ia,
return rc;
 }
 
+/**
+ * rpcrdma_alloc_regbuf - kmalloc and register memory for SEND/RECV buffers
+ * @ia: controlling rpcrdma_ia
+ * @size: size of buffer to be allocated, in bytes
+ * @flags: GFP flags
+ *
+ * Returns pointer to private header of an area of internally
+ * registered memory, or an ERR_PTR. The registered buffer follows
+ * the end of the private header.
+ *
+ * xprtrdma uses a regbuf for posting an outgoing RDMA SEND, or for
+ * receiving the payload of RDMA RECV operations. regbufs are not
+ * used for RDMA READ/WRITE operations, thus are registered only for
+ * LOCAL access.
+ */
+struct rpcrdma_regbuf *
+rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t size, gfp_t flags)
+{
+   struct rpcrdma_regbuf *rb;
+   int rc;
+
+   rc = -ENOMEM;
+   rb = kmalloc(sizeof(*rb) + size, flags);
+   if (rb == NULL)
+   goto out;
+
+   rb->rg_size = size;
+   rb->rg_owner = NULL;
+   rc = rpcrdma_register_internal(ia, rb->rg_base, size,
+  &rb->rg_mr, &rb->rg_iov);
+   if (rc)
+   goto out_free;
+
+   return rb;
+
+out_free:
+   kfree(rb);
+out:
+   return ERR_PTR(rc);
+}
+
+/**
+ * rpcrdma_free_regbuf - deregister and free registered buffer
+ * @ia: controlling rpcrdma_ia
+ * @rb: regbuf to be deregistered and freed
+ */
+void
+rpcrdma_free_regbuf(struct rpcrdma_ia *ia, struct rpcrdma_regbuf *rb)
+{
+   if (rb) {
+   rpcrdma_deregister_internal(ia, rb->rg_mr, &rb->rg_iov);
+   kfree(rb);
+   }
+}
+
 /*
  * Wrappers for chunk registration, shared by read/write chunk code.
  */
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 5c2fac3..36c37c6 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -106,6 +106,44 @@ struct rpcrdma_ep {
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 
+/* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
+ *
+ * The below structure appears at the front of a large region of kmalloc'd
+ * memory, which always starts on a good alignment boundary.
+ */
+
+struct rpcrdma_regbuf {
+   size_t  rg_size;
+   struct rpcrdma_req  *rg_owner;
+   struct ib_mr    *rg_mr;
+   struct ib_sge   rg_iov;
+   __be32  rg_base[0] __attribute__ ((aligned(256)));
+};
+
+static inline u64
+rdmab_addr(struct rpcrdma_regbuf *rb)
+{
+   return rb->rg_iov.addr;
+}
+
+static inline u32
+rdmab_length(struct rpcrdma_regbuf *rb)
+{
+   return rb->rg_iov.length;
+}
+
+static inline u32
+rdmab_lkey(struct rpcrdma_regbuf *rb)
+{
+   return rb->rg_iov.lkey;
+}
+
+static inline struct rpcrdma_msg *
+rdmab_to_msg(struct rpcrdma_regbuf *rb)
+{
+   return (struct rpcrdma_msg *)rb->rg_base;
+}
+
 enum rpcrdma_chunktype {
rpcrdma_noch = 0,
rpcrdma_readch,
@@ -372,6 +410,11 @@ int rpcrdma_register_external(struct rpcrdma_mr_seg *,
 int rpcrdma_deregister_external(struct rpcrdma_mr_seg *,
struct rpcrdma_xprt *);
 
+struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(struct rpcrdma_ia *,
+   size_t, gfp_t);
+void rpcrdma_free_regbuf(struct rpcrdma_ia *,
+struct rpcrdma_regbuf *);
+
 /*
  * RPC/RDMA connection management calls - xprtrdma/rpc_rdma.c
  */


[PATCH v2 04/20] xprtrdma: Clean up hdrlen

2015-01-13 Thread Chuck Lever

Clean up: Replace naked integers with a documenting macro.
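
For illustration, the arithmetic behind the macro can be checked in
userspace. The struct and names below are hypothetical stand-ins for
the minimal header (four fixed words plus three empty chunk-list
words), not the definitions in rpc_rdma.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the smallest header: xid, version, credit and
 * type, followed by three "no chunks" words. */
struct demo_min_hdr {
	uint32_t xid;
	uint32_t vers;
	uint32_t credit;
	uint32_t type;
	uint32_t nochunks[3];
};

#define DEMO_HDRLEN_MIN (sizeof(uint32_t) * 7)

/* Compile-time check that the macro and the layout agree. */
static_assert(sizeof(struct demo_min_hdr) == DEMO_HDRLEN_MIN,
	      "minimum header is seven 32-bit words");

/* A reply shorter than the minimum header cannot be parsed. */
static int demo_reply_is_short(size_t len)
{
	return len < DEMO_HDRLEN_MIN;
}
```

Seven 32-bit words come to 28 bytes, the literal the macro replaces.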

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/rpc_rdma.h |5 -
 net/sunrpc/xprtrdma/rpc_rdma.c  |   12 +++-
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
index 1578ed2..f33c5a4 100644
--- a/include/linux/sunrpc/rpc_rdma.h
+++ b/include/linux/sunrpc/rpc_rdma.h
@@ -98,7 +98,10 @@ struct rpcrdma_msg {
} rm_body;
 };
 
-#define RPCRDMA_HDRLEN_MIN 28
+/*
+ * Smallest RPC/RDMA header: rm_xid through rm_type, then rm_nochunks
+ */
+#define RPCRDMA_HDRLEN_MIN (sizeof(__be32) * 7)
 
 enum rpcrdma_errcode {
ERR_VERS = 1,
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 150dd76..dcf5ebc 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -472,7 +472,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
return -EIO;
}
 
-   hdrlen = 28; /*sizeof *headerp;*/
+   hdrlen = RPCRDMA_HDRLEN_MIN;
padlen = 0;
 
/*
@@ -748,7 +748,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
}
return;
}
-   if (rep->rr_len < 28) {
+   if (rep->rr_len < RPCRDMA_HDRLEN_MIN) {
dprintk("RPC:   %s: short/invalid reply\n", __func__);
goto repost;
}
@@ -830,8 +830,9 @@ repost:
} else {
/* else ordinary inline */
rdmalen = 0;
-   iptr = (__be32 *)((unsigned char *)headerp + 28);
-   rep->rr_len -= 28; /*sizeof *headerp;*/
+   iptr = (__be32 *)((unsigned char *)headerp +
+   RPCRDMA_HDRLEN_MIN);
+   rep->rr_len -= RPCRDMA_HDRLEN_MIN;
status = rep->rr_len;
}
/* Fix up the rpc results for upper layer */
@@ -845,7 +846,8 @@ repost:
headerp->rm_body.rm_chunks[2] != xdr_one ||
req->rl_nchunks == 0)
goto badheader;
-   iptr = (__be32 *)((unsigned char *)headerp + 28);
+   iptr = (__be32 *)((unsigned char *)headerp +
+   RPCRDMA_HDRLEN_MIN);
rdmalen = rpcrdma_count_chunks(rep, req->rl_nchunks, 0, &iptr);
if (rdmalen < 0)
goto badheader;


[PATCH v2 10/20] xprtrdma: Free the pd if ib_query_qp() fails

2015-01-13 Thread Chuck Lever

If ib_query_qp() fails or the memory registration mode isn't
supported, don't leak the PD. An orphaned IB/core resource will
cause IB module removal to hang.
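
The unwinding pattern involved can be sketched in userspace. The
counters and demo_open() below are hypothetical; they only model
"acquire in order, release in reverse order on failure".

```c
#include <assert.h>

/* Counters standing in for live resources; hypothetical, for
 * illustration only. */
static int demo_ids_live;
static int demo_pds_live;

/* Acquire an ID, then a PD, then run a step that may fail. Each label
 * releases what was acquired before the failing step, in reverse
 * order of acquisition. */
static int demo_open(int pd_fails, int query_fails)
{
	int rc;

	demo_ids_live++;		/* step 1: create the ID */

	if (pd_fails) {
		rc = -1;
		goto out_id;		/* only the ID exists so far */
	}
	demo_pds_live++;		/* step 2: allocate the PD */

	if (query_fails) {
		rc = -2;
		goto out_pd;		/* both exist; release both */
	}
	return 0;

out_pd:
	demo_pds_live--;
out_id:
	demo_ids_live--;
	assert(demo_ids_live >= 0 && demo_pds_live >= 0);
	return rc;
}
```

Jumping to the earlier label after the PD has been allocated would
skip its release, which is the kind of leak described above.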

Fixes: bd7ed1d13304 ("RPC/RDMA: check selected memory registration ...")
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c61bb61..aa012a3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -614,7 +614,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
if (rc) {
dprintk("RPC:   %s: ib_query_device failed %d\n",
__func__, rc);
-   goto out2;
+   goto out3;
}
 
if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
@@ -672,14 +672,14 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
"phys register failed with %lX\n",
__func__, PTR_ERR(ia->ri_bind_mem));
rc = -ENOMEM;
-   goto out2;
+   goto out3;
}
break;
default:
printk(KERN_ERR "RPC: Unsupported memory "
"registration mode: %d\n", memreg);
rc = -ENOMEM;
-   goto out2;
+   goto out3;
}
dprintk("RPC:   %s: memory registration strategy is %d\n",
__func__, memreg);
@@ -689,6 +689,10 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 
rwlock_init(&ia->ri_qplock);
return 0;
+
+out3:
+   ib_dealloc_pd(ia->ri_pd);
+   ia->ri_pd = NULL;
 out2:
rdma_destroy_id(ia->ri_id);
ia->ri_id = NULL;


[PATCH v2 08/20] xprtrdma: Move credit update to RPC reply handler

2015-01-13 Thread Chuck Lever

Reduce work in the receive CQ handler, which is run at hardware
interrupt level, by moving the RPC/RDMA credit update logic to the
RPC reply handler.

This has some additional benefits: More header sanity checking is
done before trusting the incoming credit value, and the receive CQ
handler no longer touches the RPC/RDMA header. Finally, there is no
longer any need to update and read rb_credits atomically, so the
rb_credits field can be removed.
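
A minimal userspace sketch of the credit handling, assuming the grant
arrives as a big-endian 32-bit word. demo_be32_decode() and
demo_clamp_credits() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a big-endian 32-bit value from wire bytes, independent of the
 * host's byte order. */
static uint32_t demo_be32_decode(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Clamp the server's credit grant to [1, max_requests]. A grant of
 * zero would stall the transport, so it is treated as one. */
static uint32_t demo_clamp_credits(uint32_t granted, uint32_t max_requests)
{
	assert(max_requests >= 1);
	if (granted == 0)
		return 1;
	if (granted > max_requests)
		return max_requests;
	return granted;
}
```

Because the clamped value is computed and consumed in one place, it can
live in a local variable instead of a shared atomic field.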

This further extends work begun by commit e7ce710a8802 ("xprtrdma:
Avoid deadlock when credit window is reset").

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   10 --
 net/sunrpc/xprtrdma/verbs.c |   15 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |1 -
 3 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index dcf5ebc..d731010 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -736,7 +736,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
struct rpc_xprt *xprt = rep->rr_xprt;
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
__be32 *iptr;
-   int rdmalen, status;
+   int credits, rdmalen, status;
unsigned long cwnd;
 
/* Check status. If bad, signal disconnect and return rep to pool */
@@ -871,8 +871,14 @@ badheader:
break;
}
 
+   credits = be32_to_cpu(headerp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_max_requests)
+   credits = r_xprt->rx_buf.rb_max_requests;
+
cwnd = xprt->cwnd;
-   xprt->cwnd = atomic_read(&r_xprt->rx_buf.rb_credits) << RPC_CWNDSHIFT;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)
xprt_release_rqst_cong(rqst->rq_task);
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1000f63..71a071a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -49,6 +49,7 @@
 
 #include 
 #include 
+#include <linux/prefetch.h>
 #include 
 
 #include "xprt_rdma.h"
@@ -298,17 +299,7 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
rep->rr_len = wc->byte_len;
ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
-
-   if (rep->rr_len >= 16) {
-   struct rpcrdma_msg *p = (struct rpcrdma_msg *)rep->rr_base;
-   unsigned int credits = ntohl(p->rm_credit);
-
-   if (credits == 0)
-   credits = 1;/* don't deadlock */
-   else if (credits > rep->rr_buffer->rb_max_requests)
-   credits = rep->rr_buffer->rb_max_requests;
-   atomic_set(&rep->rr_buffer->rb_credits, credits);
-   }
+   prefetch(rep->rr_base);
 
 out_schedule:
list_add_tail(&rep->rr_list, sched_list);
@@ -480,7 +471,6 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
case RDMA_CM_EVENT_DEVICE_REMOVAL:
connstate = -ENODEV;
 connected:
-   atomic_set(&rpcx_to_rdmax(ep->rep_xprt)->rx_buf.rb_credits, 1);
dprintk("RPC:   %s: %sconnected\n",
__func__, connstate > 0 ? "" : "dis");
ep->rep_connected = connstate;
@@ -1186,7 +1176,6 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
 
buf->rb_max_requests = cdata->max_requests;
spin_lock_init(&buf->rb_lock);
-   atomic_set(&buf->rb_credits, 1);
 
/* Need to allocate:
 *   1.  arrays for send and recv pointers
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 532d586..3fcc92b 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -248,7 +248,6 @@ struct rpcrdma_req {
  */
 struct rpcrdma_buffer {
spinlock_t  rb_lock;/* protects indexes */
-   atomic_t    rb_credits; /* most recent server credits */
int rb_max_requests;/* client max requests */
struct list_head rb_mws;/* optional memory windows/fmrs/frmrs */
struct list_head rb_all;


[PATCH v2 06/20] xprtrdma: Remove rpcrdma_ep::rep_ia

2015-01-13 Thread Chuck Lever

Clean up: This field is not used.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |1 -
 net/sunrpc/xprtrdma/xprt_rdma.h |1 -
 2 files changed, 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 56f705d..56e14b3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -825,7 +825,6 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
else if (ep->rep_cqinit <= 2)
ep->rep_cqinit = 0;
INIT_CQCOUNT(ep);
-   ep->rep_ia = ia;
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 9a7aab3..5160a84 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -83,7 +83,6 @@ struct rpcrdma_ep {
atomic_t    rep_cqcount;
int rep_cqinit;
int rep_connected;
-   struct rpcrdma_ia   *rep_ia;
struct ib_qp_init_attr  rep_attr;
wait_queue_head_t   rep_connect_wait;
struct ib_sge   rep_pad;/* holds zeroed pad */


[PATCH v2 01/20] xprtrdma: human-readable completion status

2015-01-13 Thread Chuck Lever

Make it easier to grep the system log for specific error conditions.

The wc.opcode field is not included because opcode numbers are
sparse, and because wc.opcode is not necessarily valid when
completion reports an error.
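
Since the status codes are dense, an array indexed by status is enough,
as long as the lookup is bounded. A userspace sketch with a shortened,
hypothetical table (demo_status_msg and demo_completion_msg() are
illustrative names):

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, shortened table; the index is the numeric status. */
static const char * const demo_status_msg[] = {
	"success",
	"local length error",
	"local QP operation error",
};

/* Map a status code to text, with a fallback for codes beyond the end
 * of the table. */
static const char *demo_completion_msg(unsigned int status)
{
	const char *msg = "unexpected completion error";

	if (status < DEMO_ARRAY_SIZE(demo_status_msg))
		msg = demo_status_msg[status];
	assert(msg != NULL);
	return msg;
}
```

A sparse code space, such as the opcode numbers mentioned above, would
leave holes in a table like this, which is one reason to leave it out.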

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   70 +++
 1 file changed, 57 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c98e406..56f705d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -173,18 +173,54 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
}
 }
 
+static const char * const wc_status[] = {
+   "success",
+   "local length error",
+   "local QP operation error",
+   "local EE context operation error",
+   "local protection error",
+   "WR flushed",
+   "memory management operation error",
+   "bad response error",
+   "local access error",
+   "remote invalid request error",
+   "remote access error",
+   "remote operation error",
+   "transport retry counter exceeded",
+   "RNR retry counter exceeded",
+   "local RDD violation error",
+   "remote invalid RD request",
+   "operation aborted",
+   "invalid EE context number",
+   "invalid EE context state",
+   "fatal error",
+   "response timeout error",
+   "general error",
+};
+
+#define COMPLETION_MSG(status) \
+   ((status) < ARRAY_SIZE(wc_status) ? \
+   wc_status[(status)] : "unexpected completion error")
+
 static void
 rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 {
-   struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   if (likely(wc->status == IB_WC_SUCCESS))
+   return;
 
-   dprintk("RPC:   %s: frmr %p status %X opcode %d\n",
-   __func__, frmr, wc->status, wc->opcode);
+   /* WARNING: Only wr_id and status are reliable at this point */
+   if (wc->wr_id == 0ULL) {
+   if (wc->status != IB_WC_WR_FLUSH_ERR)
+   pr_err("RPC:   %s: SEND: %s\n",
+  __func__, COMPLETION_MSG(wc->status));
+   } else {
+   struct rpcrdma_mw *r;
 
-   if (wc->wr_id == 0ULL)
-   return;
-   if (wc->status != IB_WC_SUCCESS)
-   frmr->r.frmr.fr_state = FRMR_IS_STALE;
+   r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   r->r.frmr.fr_state = FRMR_IS_STALE;
+   pr_err("RPC:   %s: frmr %p (stale): %s\n",
+  __func__, r, COMPLETION_MSG(wc->status));
+   }
 }
 
 static int
@@ -248,16 +284,17 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
struct rpcrdma_rep *rep =
(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
 
-   dprintk("RPC:   %s: rep %p status %X opcode %X length %u\n",
-   __func__, rep, wc->status, wc->opcode, wc->byte_len);
+   /* WARNING: Only wr_id and status are reliable at this point */
+   if (wc->status != IB_WC_SUCCESS)
+   goto out_fail;
 
-   if (wc->status != IB_WC_SUCCESS) {
-   rep->rr_len = ~0U;
-   goto out_schedule;
-   }
+   /* status == SUCCESS means all fields in wc are trustworthy */
if (wc->opcode != IB_WC_RECV)
return;
 
+   dprintk("RPC:   %s: rep %p opcode 'recv', length %u: success\n",
+   __func__, rep, wc->byte_len);
+
rep->rr_len = wc->byte_len;
ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
rep->rr_iov.addr, rep->rr_len, DMA_FROM_DEVICE);
@@ -275,6 +312,13 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
 
 out_schedule:
list_add_tail(&rep->rr_list, sched_list);
+   return;
+out_fail:
+   if (wc->status != IB_WC_WR_FLUSH_ERR)
+   pr_err("RPC:   %s: rep %p: %s\n",
+  __func__, rep, COMPLETION_MSG(wc->status));
+   rep->rr_len = ~0U;
+   goto out_schedule;
 }
 
 static int


[PATCH v2 00/20] NFS/RDMA client for 3.20

2015-01-13 Thread Chuck Lever

The following series of patches for the Linux NFS client breaks up
the per-transport buffer pool data structures to help them scale
better with the size of SEND/RECV buffers (the inline threshold),
the maximum NFS r/wsize, and the number of RDMA credits (concurrent
RPC requests).

The primary change is that the header send buffers have been split
from struct rpcrdma_req. Specific benefits are outlined in the patch
descriptions.

More pre-requisites are required. Changes to raise the maximum
r/wsize and other limits are left for a future merge window.

See the topic branch "nfs-rdma-for-3.20" at:

  git://git.linux-nfs.org/projects/cel/cel-2.6.git

Changes since v1:
 - Rebased on v3.19-rc4
 - One short description fixed
 - linux-rdma included this time (sorry for the noise)

---

Chuck Lever (20):
  xprtrdma: human-readable completion status
  xprtrdma: Modernize htonl and ntohl
  xprtrdma: Display XIDs in host byte order
  xprtrdma: Clean up hdrlen
  xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
  xprtrdma: Remove rpcrdma_ep::rep_ia
  xprtrdma: Remove rl_mr field, and the mr_chunk union
  xprtrdma: Move credit update to RPC reply handler
  xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
  xprtrdma: Free the pd if ib_query_qp() fails
  xprtrdma: Take struct ib_device_attr off the stack
  xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
  xprtrdma: Simplify synopsis of rpcrdma_buffer_create()
  xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
  xprtrdma: Add struct rpcrdma_regbuf and helpers
  xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
  xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
  xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
  xprtrdma: Allocate zero pad separately from rpcrdma_buffer
  xprtrdma: Clean up after adding regbuf management


 include/linux/sunrpc/rpc_rdma.h |   14 +
 include/linux/sunrpc/svc_rdma.h |2 
 net/sunrpc/xprtrdma/rpc_rdma.c  |  108 ++
 net/sunrpc/xprtrdma/transport.c |  179 +++--
 net/sunrpc/xprtrdma/verbs.c |  411 ---
 net/sunrpc/xprtrdma/xprt_rdma.h |  111 +++
 6 files changed, 478 insertions(+), 347 deletions(-)

--
Chuck Lever

[PATCH v2 05/20] xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt

2015-01-13 Thread Chuck Lever

Clean up: Use consistent field names in struct rpcrdma_xprt.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |   19 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |6 +++---
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index bbd6155..ee57513 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -200,9 +200,9 @@ xprt_rdma_free_addresses(struct rpc_xprt *xprt)
 static void
 xprt_rdma_connect_worker(struct work_struct *work)
 {
-   struct rpcrdma_xprt *r_xprt =
-   container_of(work, struct rpcrdma_xprt, rdma_connect.work);
-   struct rpc_xprt *xprt = &r_xprt->xprt;
+   struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
+  rx_connect_worker.work);
+   struct rpc_xprt *xprt = &r_xprt->rx_xprt;
int rc = 0;
 
xprt_clear_connected(xprt);
@@ -235,7 +235,7 @@ xprt_rdma_destroy(struct rpc_xprt *xprt)
 
dprintk("RPC:   %s: called\n", __func__);
 
-   cancel_delayed_work_sync(&r_xprt->rdma_connect);
+   cancel_delayed_work_sync(&r_xprt->rx_connect_worker);
 
xprt_clear_connected(xprt);
 
@@ -374,7 +374,8 @@ xprt_setup_rdma(struct xprt_create *args)
 * connection loss notification is async. We also catch connection loss
 * when reaping receives.
 */
-   INIT_DELAYED_WORK(&new_xprt->rdma_connect, xprt_rdma_connect_worker);
+   INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
+ xprt_rdma_connect_worker);
new_ep->rep_func = rpcrdma_conn_func;
new_ep->rep_xprt = xprt;
 
@@ -434,17 +435,17 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
 
if (r_xprt->rx_ep.rep_connected != 0) {
/* Reconnect */
-   schedule_delayed_work(&r_xprt->rdma_connect,
-   xprt->reestablish_timeout);
+   schedule_delayed_work(&r_xprt->rx_connect_worker,
+ xprt->reestablish_timeout);
xprt->reestablish_timeout <<= 1;
if (xprt->reestablish_timeout > RPCRDMA_MAX_REEST_TO)
xprt->reestablish_timeout = RPCRDMA_MAX_REEST_TO;
else if (xprt->reestablish_timeout < RPCRDMA_INIT_REEST_TO)
xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
} else {
-   schedule_delayed_work(&r_xprt->rdma_connect, 0);
+   schedule_delayed_work(&r_xprt->rx_connect_worker, 0);
if (!RPC_IS_ASYNC(task))
-   flush_delayed_work(&r_xprt->rdma_connect);
+   flush_delayed_work(&r_xprt->rx_connect_worker);
}
 }
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index b799041..9a7aab3 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -318,16 +318,16 @@ struct rpcrdma_stats {
  * during unmount.
  */
 struct rpcrdma_xprt {
-   struct rpc_xprt xprt;
+   struct rpc_xprt rx_xprt;
struct rpcrdma_ia   rx_ia;
struct rpcrdma_ep   rx_ep;
struct rpcrdma_buffer   rx_buf;
struct rpcrdma_create_data_internal rx_data;
-   struct delayed_work rdma_connect;
+   struct delayed_work rx_connect_worker;
struct rpcrdma_stats    rx_stats;
 };
 
-#define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, xprt)
+#define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
 #define rpcx_to_rdmad(x) (rpcx_to_rdmax(x)->rx_data)
 
 /* Setting this to 0 ensures interoperability with early servers.


[PATCH v2 07/20] xprtrdma: Remove rl_mr field, and the mr_chunk union

2015-01-13 Thread Chuck Lever

Clean up.

Since commit 0ac531c18323 ("xprtrdma: Remove REGISTER memory
registration mode"), the rl_mr pointer is no longer used anywhere.

After removal, there's only a single member of the mr_chunk union,
so mr_chunk can be removed as well, in favor of a single pointer
field.

Fixes: 0ac531c18323 ("xprtrdma: Remove REGISTER memory ...")
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   25 -
 net/sunrpc/xprtrdma/xprt_rdma.h |5 +
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 56e14b3..1000f63 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1493,8 +1493,8 @@ rpcrdma_buffer_put_mrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
int i;
 
for (i = 1, seg++; i < RPCRDMA_MAX_SEGS; seg++, i++)
-   rpcrdma_buffer_put_mr(&seg->mr_chunk.rl_mw, buf);
-   rpcrdma_buffer_put_mr(&seg1->mr_chunk.rl_mw, buf);
+   rpcrdma_buffer_put_mr(&seg->rl_mw, buf);
+   rpcrdma_buffer_put_mr(&seg1->rl_mw, buf);
 }
 
 static void
@@ -1580,7 +1580,7 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
list_add(&r->mw_list, stale);
continue;
}
-   req->rl_segments[i].mr_chunk.rl_mw = r;
+   req->rl_segments[i].rl_mw = r;
if (unlikely(i-- == 0))
return req; /* Success */
}
@@ -1602,7 +1602,7 @@ rpcrdma_buffer_get_fmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
r = list_entry(buf->rb_mws.next,
   struct rpcrdma_mw, mw_list);
list_del(&r->mw_list);
-   req->rl_segments[i].mr_chunk.rl_mw = r;
+   req->rl_segments[i].rl_mw = r;
if (unlikely(i-- == 0))
return req; /* Success */
}
@@ -1842,7 +1842,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_mr_seg *seg1 = seg;
-   struct rpcrdma_mw *mw = seg1->mr_chunk.rl_mw;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
struct ib_mr *mr = frmr->fr_mr;
struct ib_send_wr fastreg_wr, *bad_wr;
@@ -1931,12 +1931,12 @@ rpcrdma_deregister_frmr_external(struct rpcrdma_mr_seg *seg,
struct ib_send_wr invalidate_wr, *bad_wr;
int rc;
 
-   seg1->mr_chunk.rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
 
memset(&invalidate_wr, 0, sizeof invalidate_wr);
-   invalidate_wr.wr_id = (unsigned long)(void *)seg1->mr_chunk.rl_mw;
+   invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;
+   invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
read_lock(&ia->ri_qplock);
@@ -1946,7 +1946,7 @@ rpcrdma_deregister_frmr_external(struct rpcrdma_mr_seg *seg,
read_unlock(&ia->ri_qplock);
if (rc) {
/* Force rpcrdma_buffer_get() to retry */
-   seg1->mr_chunk.rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
dprintk("RPC:   %s: failed ib_post_send for invalidate,"
" status %i\n", __func__, rc);
}
@@ -1978,8 +1978,7 @@ rpcrdma_register_fmr_external(struct rpcrdma_mr_seg *seg,
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
-   rc = ib_map_phys_fmr(seg1->mr_chunk.rl_mw->r.fmr,
-   physaddrs, i, seg1->mr_dma);
+   rc = ib_map_phys_fmr(seg1->rl_mw->r.fmr, physaddrs, i, seg1->mr_dma);
if (rc) {
dprintk("RPC:   %s: failed ib_map_phys_fmr "
"%u@0x%llx+%i (%d)... status %i\n", __func__,
@@ -1988,7 +1987,7 @@ rpcrdma_register_fmr_external(struct rpcrdma_mr_seg *seg,
while (i--)
rpcrdma_unmap_one(ia, --seg);
} else {
-   seg1->mr_rkey = seg1->mr_chunk.rl_mw->r.fmr->rkey;
+   seg1->mr_rkey = seg1->rl_mw->r.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
seg1->mr_len = len;
@@ -2005,7 +2004,7 @@ rpcrdma_deregister_fmr_external(struct rpcrdma_mr_seg *seg,
LIST_HEAD(l);
int rc;
 
-   list_add(&seg1->mr_chunk.rl_mw->r.fmr->list, &l);
+   list_add(&seg1->rl_mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
read_lock(&ia->ri_qplock);
while (seg1->mr_nsegs--)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.

[PATCH v2 02/20] xprtrdma: Modernize htonl and ntohl

2015-01-13 Thread Chuck Lever

Clean up: Replace htonl and ntohl with the be32 equivalents.
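
For readers outside the kernel tree, the conversion can be sketched in
portable userspace C. This is only an illustration, not the kernel's
implementation: sketch_cpu_to_be32() and sketch_be32_to_cpu() are
hypothetical stand-ins for cpu_to_be32() and be32_to_cpu().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's cpu_to_be32(). */
static uint32_t sketch_cpu_to_be32(uint32_t host)
{
	uint8_t wire[4];
	uint32_t out;

	wire[0] = (uint8_t)(host >> 24);	/* most significant byte first */
	wire[1] = (uint8_t)(host >> 16);
	wire[2] = (uint8_t)(host >> 8);
	wire[3] = (uint8_t)host;
	memcpy(&out, wire, sizeof(out));
	return out;
}

/* Hypothetical stand-in for be32_to_cpu(): the inverse conversion. */
static uint32_t sketch_be32_to_cpu(uint32_t be)
{
	uint8_t wire[4];

	memcpy(wire, &be, sizeof(wire));
	return ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
	       ((uint32_t)wire[2] << 8) | (uint32_t)wire[3];
}
```

The result is the same on little-endian and big-endian hosts because the
byte order is fixed by the shifts rather than by the CPU.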

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/rpc_rdma.h |9 +++
 include/linux/sunrpc/svc_rdma.h |2 --
 net/sunrpc/xprtrdma/rpc_rdma.c  |   48 +--
 3 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
index b78f16b..1578ed2 100644
--- a/include/linux/sunrpc/rpc_rdma.h
+++ b/include/linux/sunrpc/rpc_rdma.h
@@ -42,6 +42,9 @@
 
 #include 
 
+#define RPCRDMA_VERSION1
+#define rpcrdma_versioncpu_to_be32(RPCRDMA_VERSION)
+
 struct rpcrdma_segment {
__be32 rs_handle;   /* Registered memory handle */
__be32 rs_length;   /* Length of the chunk in bytes */
@@ -115,4 +118,10 @@ enum rpcrdma_proc {
RDMA_ERROR = 4  /* An RPC RDMA encoding error */
 };
 
+#define rdma_msg   cpu_to_be32(RDMA_MSG)
+#define rdma_nomsg cpu_to_be32(RDMA_NOMSG)
+#define rdma_msgp  cpu_to_be32(RDMA_MSGP)
+#define rdma_done  cpu_to_be32(RDMA_DONE)
+#define rdma_error cpu_to_be32(RDMA_ERROR)
+
 #endif /* _LINUX_SUNRPC_RPC_RDMA_H */
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 975da75..ddfe88f 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -63,8 +63,6 @@ extern atomic_t rdma_stat_rq_prod;
 extern atomic_t rdma_stat_sq_poll;
 extern atomic_t rdma_stat_sq_prod;
 
-#define RPCRDMA_VERSION 1
-
 /*
  * Contexts are built when an RDMA request is created and are a
  * record of the resources that can be recovered when the request
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index df01d12..a6fb30b 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -209,9 +209,11 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
if (cur_rchunk) {   /* read */
cur_rchunk->rc_discrim = xdr_one;
/* all read chunks have the same "position" */
-   cur_rchunk->rc_position = htonl(pos);
-   cur_rchunk->rc_target.rs_handle = htonl(seg->mr_rkey);
-   cur_rchunk->rc_target.rs_length = htonl(seg->mr_len);
+   cur_rchunk->rc_position = cpu_to_be32(pos);
+   cur_rchunk->rc_target.rs_handle =
+   cpu_to_be32(seg->mr_rkey);
+   cur_rchunk->rc_target.rs_length =
+   cpu_to_be32(seg->mr_len);
xdr_encode_hyper(
(__be32 *)&cur_rchunk->rc_target.rs_offset,
seg->mr_base);
@@ -222,8 +224,10 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
cur_rchunk++;
r_xprt->rx_stats.read_chunk_count++;
} else {/* write/reply */
-   cur_wchunk->wc_target.rs_handle = htonl(seg->mr_rkey);
-   cur_wchunk->wc_target.rs_length = htonl(seg->mr_len);
+   cur_wchunk->wc_target.rs_handle =
+   cpu_to_be32(seg->mr_rkey);
+   cur_wchunk->wc_target.rs_length =
+   cpu_to_be32(seg->mr_len);
xdr_encode_hyper(
(__be32 *)&cur_wchunk->wc_target.rs_offset,
seg->mr_base);
@@ -257,7 +261,7 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
*iptr++ = xdr_zero; /* encode a NULL reply chunk */
} else {
warray->wc_discrim = xdr_one;
-   warray->wc_nchunks = htonl(nchunks);
+   warray->wc_nchunks = cpu_to_be32(nchunks);
iptr = (__be32 *) cur_wchunk;
if (type == rpcrdma_writech) {
*iptr++ = xdr_zero; /* finish the write chunk list */
@@ -404,11 +408,11 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 
/* build RDMA header in private area at front */
headerp = (struct rpcrdma_msg *) req->rl_base;
-   /* don't htonl XID, it's already done in request */
+   /* don't byte-swap XID, it's already done in request */
headerp->rm_xid = rqst->rq_xid;
-   headerp->rm_vers = xdr_one;
-   headerp->rm_credit = htonl(r_xprt->rx_buf.rb_max_requests);
-   headerp->rm_type = htonl(RDMA_MSG);
+   headerp->rm_vers = rpcrdma_version;
+   headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_max_requests);
+   headerp->rm_type = rdma_msg;
 
/*
 * Chunks needed for results?
@@ -482,11 +486,11 @@ rpcrdma_marshal_req(struct rpc_rqst *rq

[PATCH v2 09/20] xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt

2015-01-13 Thread Chuck Lever

Clean up: The rep_func field always refers to rpcrdma_conn_func().
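
The back-pointer can go because the endpoint is embedded in the
transport structure, so the enclosing structure is recoverable from the
member's address. A minimal userspace sketch of that idea follows; the
sketch_* names are hypothetical and the macro is a simplified stand-in
for the kernel's container_of().

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_ep {
	int connected;
};

struct sketch_xprt {
	int id;
	struct sketch_ep ep;	/* embedded member, like rx_ep */
};

/* Recover the enclosing transport from a pointer to its endpoint. */
static struct sketch_xprt *sketch_ep_to_xprt(struct sketch_ep *ep)
{
	return sketch_container_of(ep, struct sketch_xprt, ep);
}
```

This only works when the member is embedded by value; a separately
allocated endpoint would still need an explicit pointer.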

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |4 +++-
 net/sunrpc/xprtrdma/transport.c |2 --
 net/sunrpc/xprtrdma/verbs.c |6 +++---
 net/sunrpc/xprtrdma/xprt_rdma.h |2 --
 4 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d731010..f2eda15 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -695,7 +695,9 @@ rpcrdma_connect_worker(struct work_struct *work)
 {
struct rpcrdma_ep *ep =
container_of(work, struct rpcrdma_ep, rep_connect_worker.work);
-   struct rpc_xprt *xprt = ep->rep_xprt;
+   struct rpcrdma_xprt *r_xprt =
+   container_of(ep, struct rpcrdma_xprt, rx_ep);
+   struct rpc_xprt *xprt = &r_xprt->rx_xprt;
 
spin_lock_bh(&xprt->transport_lock);
if (++xprt->connect_cookie == 0)/* maintain a reserved value */
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index ee57513..a487bde 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -376,8 +376,6 @@ xprt_setup_rdma(struct xprt_create *args)
 */
INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
  xprt_rdma_connect_worker);
-   new_ep->rep_func = rpcrdma_conn_func;
-   new_ep->rep_xprt = xprt;
 
xprt_rdma_format_addresses(xprt);
xprt->max_payload = rpcrdma_max_payload(new_xprt);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 71a071a..c61bb61 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -154,7 +154,7 @@ rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
-   ep->rep_func(ep);
+   rpcrdma_conn_func(ep);
wake_up_all(&ep->rep_connect_wait);
}
 }
@@ -169,7 +169,7 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
-   ep->rep_func(ep);
+   rpcrdma_conn_func(ep);
wake_up_all(&ep->rep_connect_wait);
}
 }
@@ -474,7 +474,7 @@ connected:
dprintk("RPC:   %s: %sconnected\n",
__func__, connstate > 0 ? "" : "dis");
ep->rep_connected = connstate;
-   ep->rep_func(ep);
+   rpcrdma_conn_func(ep);
wake_up_all(&ep->rep_connect_wait);
/*FALLTHROUGH*/
default:
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 3fcc92b..657c370 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -87,8 +87,6 @@ struct rpcrdma_ep {
wait_queue_head_t   rep_connect_wait;
struct ib_sge   rep_pad;/* holds zeroed pad */
struct ib_mr*rep_pad_mr;/* holds zeroed pad */
-   void(*rep_func)(struct rpcrdma_ep *);
-   struct rpc_xprt *rep_xprt;  /* for rep_func */
struct rdma_conn_param  rep_remote_cma;
struct sockaddr_storage rep_remote_addr;
struct delayed_work rep_connect_worker;


[PATCH v2 03/20] xprtrdma: Display XIDs in host byte order

2015-01-13 Thread Chuck Lever

xprtsock.c and the backchannel code display XIDs in host byte order.
Follow suit in xprtrdma.
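
As a rough userspace illustration of why the conversion matters for log
output: the XID is stored in wire (big-endian) order, so printing it
raw on a little-endian host shows the bytes reversed. The helper names
below are hypothetical; the format string mirrors the 0x%08x used in
the dprintk calls.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for be32_to_cpu(). */
static uint32_t sketch_be32_to_cpu(uint32_t be)
{
	uint8_t wire[4];

	memcpy(wire, &be, sizeof(wire));
	return ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
	       ((uint32_t)wire[2] << 8) | (uint32_t)wire[3];
}

/* Format a wire-order XID in host byte order for a log message. */
static int sketch_format_xid(char *buf, size_t len, uint32_t xid_be)
{
	return snprintf(buf, len, "xid 0x%08x",
			(unsigned int)sketch_be32_to_cpu(xid_be));
}
```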

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a6fb30b..150dd76 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -766,7 +766,8 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
spin_unlock(&xprt->transport_lock);
dprintk("RPC:   %s: reply 0x%p failed "
"to match any request xid 0x%08x len %d\n",
-   __func__, rep, headerp->rm_xid, rep->rr_len);
+   __func__, rep, be32_to_cpu(headerp->rm_xid),
+   rep->rr_len);
 repost:
r_xprt->rx_stats.bad_reply_count++;
rep->rr_func = rpcrdma_reply_handler;
@@ -782,13 +783,14 @@ repost:
spin_unlock(&xprt->transport_lock);
dprintk("RPC:   %s: duplicate reply 0x%p to RPC "
"request 0x%p: xid 0x%08x\n", __func__, rep, req,
-   headerp->rm_xid);
+   be32_to_cpu(headerp->rm_xid));
goto repost;
}
 
dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
"   RPC request 0x%p xid 0x%08x\n",
-   __func__, rep, req, rqst, headerp->rm_xid);
+   __func__, rep, req, rqst,
+   be32_to_cpu(headerp->rm_xid));
 
/* from here on, the reply is no longer an orphan */
req->rl_reply = rep;


[PATCH] infiniband: mlx5: avoid a compile-time warning

2015-01-13 Thread Arnd Bergmann

The return type of find_first_bit() is architecture specific:
on ARM it is 'unsigned int', while the asm-generic code used
on x86 and a lot of other architectures returns 'unsigned long'.

When building the mlx5 driver on ARM, we get a warning about
this:

infiniband/hw/mlx5/mem.c: In function 'mlx5_ib_cont_pages':
infiniband/hw/mlx5/mem.c:84:143: warning: comparison of distinct pointer types lacks a cast
 m = min(m, find_first_bit(&tmp, sizeof(tmp)));

This patch changes the driver to use min_t to make it behave
the same way on all architectures.
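
A simplified illustration of the min_t idea, for readers unfamiliar
with it: both operands are cast to one named type before they are
compared, so their original types no longer have to match.
sketch_min_t() is a hypothetical, portable stand-in. Unlike this
sketch, the kernel macro is written to evaluate each argument only
once.

```c
/* Hypothetical stand-in for min_t(): compare after casting both
 * operands to the same named type. Arguments are evaluated more than
 * once here, so they must not have side effects. */
#define sketch_min_t(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Mixed operand types, as on an architecture where the bit search
 * returns 'unsigned int' while m is 'unsigned long'. */
static unsigned long sketch_smallest(unsigned long m, unsigned int first_bit)
{
	return sketch_min_t(unsigned long, m, first_bit);
}
```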

Signed-off-by: Arnd Bergmann 

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index b56e4c5593ee..611a9fdf2f38 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -81,7 +81,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
for (k = 0; k < len; k++) {
if (!(i & mask)) {
tmp = (unsigned long)pfn;
-   m = min(m, find_first_bit(&tmp, sizeof(tmp)));
+   m = min_t(unsigned long, m, find_first_bit(&tmp, sizeof(tmp)));
skip = 1 << m;
mask = skip - 1;
base = pfn;


[PATCH] mlx5: avoid build warnings on 32-bit

2015-01-13 Thread Arnd Bergmann

The mlx5 driver passes a string pointer in through a 'u64' variable,
which on 32-bit machines causes a build warning:

drivers/net/ethernet/mellanox/mlx5/core/debugfs.c: In function 'qp_read_field':
drivers/net/ethernet/mellanox/mlx5/core/debugfs.c:303:11: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

The code is in fact safe, so we can shut up the warning by adding
extra type casts.
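
The pattern can be sketched in userspace C as follows. The patch casts
through 'unsigned long', which is pointer-sized on Linux; this sketch
uses uintptr_t, the portable userspace equivalent, and all sketch_*
names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper returning a static string, in the spirit of
 * mlx5_qp_state_str(). */
static const char *sketch_state_str(int state)
{
	return state ? "RTS" : "RST";
}

/* Store a pointer in a 64-bit field: cast to a pointer-sized integer
 * first, then widen, so no pointer/integer size mismatch occurs. */
static uint64_t sketch_pack(const char *s)
{
	return (uint64_t)(uintptr_t)s;
}

/* Recover the pointer: narrow to the pointer-sized integer, then cast. */
static const char *sketch_unpack(uint64_t field)
{
	return (const char *)(uintptr_t)field;
}
```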

Signed-off-by: Arnd Bergmann 

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
index 10e1f1a18255..4878025e231c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/debugfs.c
@@ -300,11 +300,11 @@ static u64 qp_read_field(struct mlx5_core_dev *dev, struct mlx5_core_qp *qp,
param = qp->pid;
break;
case QP_STATE:
-   param = (u64)mlx5_qp_state_str(be32_to_cpu(ctx->flags) >> 28);
+   param = (unsigned long)mlx5_qp_state_str(be32_to_cpu(ctx->flags) >> 28);
*is_str = 1;
break;
case QP_XPORT:
-   param = (u64)mlx5_qp_type_str((be32_to_cpu(ctx->flags) >> 16) & 0xff);
+   param = (unsigned long)mlx5_qp_type_str((be32_to_cpu(ctx->flags) >> 16) & 0xff);
*is_str = 1;
break;
case QP_MTU:
@@ -464,7 +464,7 @@ static ssize_t dbg_read(struct file *filp, char __user *buf, size_t count,
 
 
if (is_str)
-   ret = snprintf(tbuf, sizeof(tbuf), "%s\n", (const char *)field);
+   ret = snprintf(tbuf, sizeof(tbuf), "%s\n", (const char *)(unsigned long)field);
else
ret = snprintf(tbuf, sizeof(tbuf), "0x%llx\n", field);
 


[PATCH v2 10/10] svcrdma: Handle additional inline content

2015-01-13 Thread Chuck Lever

Most NFS RPCs place their large payload argument at the end of the
RPC header (e.g., NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
sends the complete RPC header inline, and the payload argument in
the read list. Data in the read list is the last part of the XDR
stream.

One important case is not like this, however. NFSv4 COMPOUND is a
counted array of operations. A WRITE operation, with its large data
payload, can appear in the middle of the compound's operations
array. Thus NFSv4 WRITE compounds can have header content after the
WRITE payload.

The Linux client, for example, performs an NFSv4 WRITE like this:

  { PUTFH, WRITE, GETATTR }

Though RFC 5667 is not precise about this, the proper way to convey
this compound is to place the GETATTR inline, _after_ the front of
the RPC header. The receiver inserts the read list payload into the
XDR stream after the initial WRITE arguments, and before the GETATTR
operation, thanks to the value of the read list "position" field.

The Linux client currently sends the GETATTR at the end of the
RPC/RDMA read list, which is incorrect. It will be corrected in the
future.

The Linux server currently rejects NFSv4 compounds with inline
content after the read list. For the above NFSv4 WRITE compound, the
NFS compound header indicates there are three operations, but the
server finds nonsense when it looks in the XDR stream for the third
operation, and the compound fails with OP_ILLEGAL.

Move trailing inline content to the end of the XDR buffer's page
list. This presents incoming NFSv4 WRITE compounds to NFSD in the
same way the socket transport does.
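
The tail move can be pictured with a small userspace sketch: copy the
trailing bytes into the page list at the current offset and spill onto
the next page when the current one fills up. This is an illustration
under simplifying assumptions (a tiny page size, a tail no longer than
one page, a next page that always exists), and every sketch_* name is
hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8	/* tiny page size, for illustration only */

/* Copy len tail bytes into pages[], starting at page_no/page_offset.
 * Returns the index of the last page written. Assumes
 * len <= SKETCH_PAGE_SIZE and that pages[page_no + 1] exists. */
static int sketch_copy_tail(char pages[][SKETCH_PAGE_SIZE], int page_no,
			    size_t page_offset, const char *src, size_t len)
{
	size_t room = SKETCH_PAGE_SIZE - page_offset;
	size_t first = len < room ? len : room;

	/* Fit as much of the tail on the current page as possible. */
	memcpy(&pages[page_no][page_offset], src, first);

	/* Fit the rest at the start of the next page. */
	if (first < len) {
		page_no++;
		memcpy(&pages[page_no][0], src + first, len - first);
	}
	return page_no;
}
```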

Signed-off-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   55 +++
 1 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index a345cad..f9f13a3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -364,6 +364,56 @@ rdma_rcl_chunk_count(struct rpcrdma_read_chunk *ch)
return count;
 }
 
+/* If there was additional inline content, append it to the end of arg.pages.
+ * Tail copy has to be done after the reader function has determined how many
+ * pages are needed for RDMA READ.
+ */
+static int
+rdma_copy_tail(struct svc_rqst *rqstp, struct svc_rdma_op_ctxt *head,
+  u32 position, u32 byte_count, u32 page_offset, int page_no)
+{
+   char *srcp, *destp;
+   int ret;
+
+   ret = 0;
+   srcp = head->arg.head[0].iov_base + position;
+   byte_count = head->arg.head[0].iov_len - position;
+   if (byte_count > PAGE_SIZE) {
+   dprintk("svcrdma: large tail unsupported\n");
+   return 0;
+   }
+
+   /* Fit as much of the tail on the current page as possible */
+   if (page_offset != PAGE_SIZE) {
+   destp = page_address(rqstp->rq_arg.pages[page_no]);
+   destp += page_offset;
+   while (byte_count--) {
+   *destp++ = *srcp++;
+   page_offset++;
+   if (page_offset == PAGE_SIZE && byte_count)
+   goto more;
+   }
+   goto done;
+   }
+
+more:
+   /* Fit the rest on the next page */
+   page_no++;
+   destp = page_address(rqstp->rq_arg.pages[page_no]);
+   while (byte_count--)
+   *destp++ = *srcp++;
+
+   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no+1];
+   rqstp->rq_next_page = rqstp->rq_respages + 1;
+
+done:
+   byte_count = head->arg.head[0].iov_len - position;
+   head->arg.page_len += byte_count;
+   head->arg.len += byte_count;
+   head->arg.buflen += byte_count;
+   return 1;
+}
+
 static int rdma_read_chunks(struct svcxprt_rdma *xprt,
struct rpcrdma_msg *rmsgp,
struct svc_rqst *rqstp,
@@ -440,9 +490,14 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
head->arg.page_len += pad;
head->arg.len += pad;
head->arg.buflen += pad;
+   page_offset += pad;
}
 
ret = 1;
+   if (position && position < head->arg.head[0].iov_len)
+   ret = rdma_copy_tail(rqstp, head, position,
+byte_count, page_offset, page_no);
+   head->arg.head[0].iov_len = position;
head->position = position;
 
  err:


[PATCH v2 09/10] svcrdma: Move read list XDR round-up logic

2015-01-13 Thread Chuck Lever

This is a pre-requisite for a subsequent patch.

Read list XDR round-up needs to be done _before_ additional inline
content is copied to the end of the XDR buffer's page list. Move
the logic added by commit e560e3b510d2 ("svcrdma: Add zero padding
if the client doesn't send it").
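
The round-up arithmetic is small enough to show on its own. XDR items
are padded to a 4-byte boundary, so the pad is whatever brings the
length up to the next multiple of four. sketch_xdr_pad() is a
hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Bytes of zero padding needed to bring len up to a 4-byte XDR
 * boundary: 0 when already aligned, otherwise 1 to 3. */
static uint32_t sketch_xdr_pad(uint32_t len)
{
	return (4 - (len & 3)) & 3;
}
```

For example, a 5-byte payload needs 3 pad bytes and an 8-byte payload
needs none.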

Signed-off-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   37 ---
 1 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 36cf51a..a345cad 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -43,7 +43,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -434,6 +433,15 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
}
}
 
+   /* Read list may need XDR round-up (see RFC 5666, s. 3.7) */
+   if (page_offset & 3) {
+   u32 pad = 4 - (page_offset & 3);
+
+   head->arg.page_len += pad;
+   head->arg.len += pad;
+   head->arg.buflen += pad;
+   }
+
ret = 1;
head->position = position;
 
@@ -446,32 +454,6 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
return ret;
 }
 
-/*
- * To avoid a separate RDMA READ just for a handful of zero bytes,
- * RFC 5666 section 3.7 allows the client to omit the XDR zero pad
- * in chunk lists.
- */
-static void
-rdma_fix_xdr_pad(struct xdr_buf *buf)
-{
-   unsigned int page_len = buf->page_len;
-   unsigned int size = (XDR_QUADLEN(page_len) << 2) - page_len;
-   unsigned int offset, pg_no;
-   char *p;
-
-   if (size == 0)
-   return;
-
-   pg_no = page_len >> PAGE_SHIFT;
-   offset = page_len & ~PAGE_MASK;
-   p = page_address(buf->pages[pg_no]);
-   memset(p + offset, 0, size);
-
-   buf->page_len += size;
-   buf->buflen += size;
-   buf->len += size;
-}
-
 static int rdma_read_complete(struct svc_rqst *rqstp,
  struct svc_rdma_op_ctxt *head)
 {
@@ -499,7 +481,6 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
}
 
/* Point rq_arg.pages past header */
-   rdma_fix_xdr_pad(&head->arg);
rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
rqstp->rq_arg.page_len = head->arg.page_len;
rqstp->rq_arg.page_base = head->arg.page_base;


[PATCH v2 06/10] svcrdma: Plant reader function in struct svcxprt_rdma

2015-01-13 Thread Chuck Lever

The RDMA reader function doesn't change once an svcxprt_rdma is
instantiated. Instead of checking sc_devcap during every incoming
RPC, set the reader function once when the connection is accepted.
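
The general shape of this change, choosing a handler once at setup
time and calling it through a stored pointer afterwards, can be
sketched as follows. The capability bit, structure, and function names
are all hypothetical and much simpler than the real svcxprt_rdma.

```c
struct sketch_xprt;

typedef int (*sketch_reader_fn)(struct sketch_xprt *xprt,
				unsigned int length);

struct sketch_xprt {
	unsigned int dev_caps;
	sketch_reader_fn reader;	/* chosen once, at accept time */
};

#define SKETCH_CAP_FAST_REG 0x1u	/* hypothetical capability bit */

/* Two placeholder readers; the return value only identifies which
 * one ran. */
static int sketch_read_lcl(struct sketch_xprt *xprt, unsigned int length)
{
	(void)xprt;
	(void)length;
	return 1;
}

static int sketch_read_frmr(struct sketch_xprt *xprt, unsigned int length)
{
	(void)xprt;
	(void)length;
	return 2;
}

/* Inspect the device capabilities once; later calls go straight
 * through xprt->reader with no per-request capability test. */
static void sketch_accept(struct sketch_xprt *xprt, unsigned int caps)
{
	xprt->dev_caps = caps;
	xprt->reader = (caps & SKETCH_CAP_FAST_REG) ?
			sketch_read_frmr : sketch_read_lcl;
}
```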

Signed-off-by: Chuck Lever 
---

 include/linux/sunrpc/svc_rdma.h  |   10 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   71 +++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |2 +
 3 files changed, 39 insertions(+), 44 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 2280325..f161e30 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -150,6 +150,10 @@ struct svcxprt_rdma {
struct ib_cq *sc_rq_cq;
struct ib_cq *sc_sq_cq;
struct ib_mr *sc_phys_mr;   /* MR for server memory */
+   int  (*sc_reader)(struct svcxprt_rdma *,
+ struct svc_rqst *,
+ struct svc_rdma_op_ctxt *,
+ int *, u32 *, u32, u32, u64, bool);
u32  sc_dev_caps;   /* distilled device caps */
u32  sc_dma_lkey;   /* local dma key */
unsigned int sc_frmr_pg_list_len;
@@ -195,6 +199,12 @@ extern int svc_rdma_xdr_get_reply_hdr_len(struct rpcrdma_msg *);
 
 /* svc_rdma_recvfrom.c */
 extern int svc_rdma_recvfrom(struct svc_rqst *);
+extern int rdma_read_chunk_lcl(struct svcxprt_rdma *, struct svc_rqst *,
+  struct svc_rdma_op_ctxt *, int *, u32 *,
+  u32, u32, u64, bool);
+extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
+   struct svc_rdma_op_ctxt *, int *, u32 *,
+   u32, u32, u64, bool);
 
 /* svc_rdma_sendto.c */
 extern int svc_rdma_sendto(struct svc_rqst *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 577f865..c3aebc1 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -117,26 +117,16 @@ static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
return min_t(int, sge_count, xprt->sc_max_sge);
 }
 
-typedef int (*rdma_reader_fn)(struct svcxprt_rdma *xprt,
- struct svc_rqst *rqstp,
- struct svc_rdma_op_ctxt *head,
- int *page_no,
- u32 *page_offset,
- u32 rs_handle,
- u32 rs_length,
- u64 rs_offset,
- int last);
-
 /* Issue an RDMA_READ using the local lkey to map the data sink */
-static int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
-  struct svc_rqst *rqstp,
-  struct svc_rdma_op_ctxt *head,
-  int *page_no,
-  u32 *page_offset,
-  u32 rs_handle,
-  u32 rs_length,
-  u64 rs_offset,
-  int last)
+int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
+   struct svc_rqst *rqstp,
+   struct svc_rdma_op_ctxt *head,
+   int *page_no,
+   u32 *page_offset,
+   u32 rs_handle,
+   u32 rs_length,
+   u64 rs_offset,
+   bool last)
 {
struct ib_send_wr read_wr;
int pages_needed = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
@@ -221,15 +211,15 @@ static int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
 }
 
 /* Issue an RDMA_READ using an FRMR to map the data sink */
-static int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
-   struct svc_rqst *rqstp,
-   struct svc_rdma_op_ctxt *head,
-   int *page_no,
-   u32 *page_offset,
-   u32 rs_handle,
-   u32 rs_length,
-   u64 rs_offset,
-   int last)
+int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
+struct svc_rqst *rqstp,
+struct svc_rdma_op_ctxt *head,
+int *page_no,
+u32 *page_offset,
+u32 rs_handle,
+u32 rs_length,
+u64 rs_offset,
+bool last)
 {
struct ib_send_wr read_wr;
struct ib_send_wr inv_wr;
@@ -374,9 +364,9 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
 {
int page_no, ret;
struct rp

[PATCH v2 05/10] svcrdma: Find rmsgp more reliably

2015-01-13 Thread Chuck Lever

xdr_start() can return the wrong rmsgp address if an assumption
about how the xdr_buf was constructed changes.  When it gets it
wrong, the client receives a reply that has gibberish in the
RPC/RDMA header, preventing it from matching a waiting RPC request.

Instead, make (and document) just one assumption: that the RDMA
header for the client's RPC call is at the start of the first page
in rq_pages.

Signed-off-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   18 --
 1 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 7d79897..7de33d1 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -483,18 +483,6 @@ void svc_rdma_prep_reply_hdr(struct svc_rqst *rqstp)
 {
 }
 
-/*
- * Return the start of an xdr buffer.
- */
-static void *xdr_start(struct xdr_buf *xdr)
-{
-   return xdr->head[0].iov_base -
-   (xdr->len -
-xdr->page_len -
-xdr->tail[0].iov_len -
-xdr->head[0].iov_len);
-}
-
 int svc_rdma_sendto(struct svc_rqst *rqstp)
 {
struct svc_xprt *xprt = rqstp->rq_xprt;
@@ -512,8 +500,10 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 
dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
 
-   /* Get the RDMA request header. */
-   rdma_argp = xdr_start(&rqstp->rq_arg);
+   /* Get the RDMA request header. The receive logic always
+* places this at the start of page 0.
+*/
+   rdma_argp = page_address(rqstp->rq_pages[0]);
 
/* Build an req vec for the XDR */
ctxt = svc_rdma_get_context(rdma);


[PATCH v2 07/10] svcrdma: rc_position sanity checking

2015-01-13 Thread Chuck Lever

An RPC/RDMA client may send large RPC arguments via a read
list. This is a list of scatter/gather elements which convey
RPC call arguments too large to fit in a small RDMA SEND.

Each entry in the read list has a "position" field, whose value is
the byte offset in the XDR stream where the data in that entry is to
be inserted. Entries which share the same "position" value make up
the same RPC argument. The receiver inserts entries with the same
position field value in list order into the XDR stream.

Currently the Linux NFS/RDMA server cannot handle receiving read
chunks in more than one position, mostly because no current client
sends read lists with elements in more than one position. As a
sanity check, ensure that all received chunks have the same
"rc_position."

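The sanity check amounts to walking the list until the zero
discriminator and comparing each entry's position with the first one.
A minimal userspace sketch, with a hypothetical sketch_chunk structure
that keeps its fields in host byte order for simplicity:

```c
#include <stdint.h>

struct sketch_chunk {
	uint32_t discrim;	/* zero marks the end of the list */
	uint32_t position;	/* host byte order in this sketch */
};

/* Returns 1 if every chunk before the terminator shares the first
 * chunk's position, 0 otherwise. */
static int sketch_positions_match(const struct sketch_chunk *ch)
{
	uint32_t position = ch->position;

	for (; ch->discrim != 0; ch++)
		if (ch->position != position)
			return 0;
	return 1;
}
```

An empty list (terminator only) passes trivially, since no entry can
disagree with the first.
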
Signed-off-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index c3aebc1..a67dd1a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -365,6 +365,7 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
int page_no, ret;
struct rpcrdma_read_chunk *ch;
u32 handle, page_offset, byte_count;
+   u32 position;
u64 rs_offset;
bool last;
 
@@ -389,10 +390,17 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
head->arg.len = rqstp->rq_arg.len;
head->arg.buflen = rqstp->rq_arg.buflen;
 
-   page_no = 0; page_offset = 0;
-   for (ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
-ch->rc_discrim != 0; ch++) {
-   handle = be32_to_cpu(ch->rc_target.rs_handle);
+   ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
+   position = be32_to_cpu(ch->rc_position);
+
+   ret = 0;
+   page_no = 0;
+   page_offset = 0;
+   for (; ch->rc_discrim != xdr_zero; ch++) {
+   if (be32_to_cpu(ch->rc_position) != position)
+   goto err;
+
+   handle = be32_to_cpu(ch->rc_target.rs_handle),
byte_count = be32_to_cpu(ch->rc_target.rs_length);
xdr_decode_hyper((__be32 *)&ch->rc_target.rs_offset,
 &rs_offset);


[PATCH v2 08/10] svcrdma: Support RDMA_NOMSG requests

2015-01-13 Thread Chuck Lever

Currently the Linux server cannot decode RDMA_NOMSG type requests.
Operations whose length exceeds the fixed size of RDMA SEND buffers,
like large NFSv4 CREATE(NF4LNK) operations, must be conveyed via
RDMA_NOMSG.

For an RDMA_MSG type request, the client sends the RPC/RDMA header, the
RPC header, and some or all of the NFS arguments via RDMA SEND.

For an RDMA_NOMSG type request, the client sends just the RPC/RDMA
header via RDMA SEND. The request's read list contains elements for
the entire RPC message, including the RPC header.

NFSD expects the RPC/RDMA header and RPC header to be contiguous in
page zero of the XDR buffer. Add logic in the RDMA READ path to make
the read list contents land where the server prefers, when the
incoming message is a type RDMA_NOMSG message.
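
The length bookkeeping for an RDMA_NOMSG request can be sketched as a
worked calculation: the bytes after the RPC/RDMA header that fit in the
first receive buffer count as head, and anything beyond that buffer
counts as page data. This mirrors the adjustment the patch makes in
rdma_read_complete(); the structure and function names are
hypothetical, and the sketch assumes hdr_len does not exceed either of
the other two lengths.

```c
#include <stdint.h>

struct sketch_split {
	uint32_t head_len;	/* bytes treated as the XDR head */
	uint32_t page_len;	/* bytes left in the page list */
};

/* total_len:     total bytes in the buffer, header included
 * first_sge_len: capacity of the first receive buffer
 * hdr_len:       bytes taken up by the RPC/RDMA header */
static struct sketch_split sketch_nomsg_split(uint32_t total_len,
					       uint32_t first_sge_len,
					       uint32_t hdr_len)
{
	struct sketch_split s;

	if (total_len <= first_sge_len) {
		/* Everything fits in the first buffer. */
		s.head_len = total_len - hdr_len;
		s.page_len = 0;
	} else {
		/* The first buffer is full; the rest spills into pages. */
		s.head_len = first_sge_len - hdr_len;
		s.page_len = total_len - first_sge_len;
	}
	return s;
}
```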

Signed-off-by: Chuck Lever 
---

 include/linux/sunrpc/svc_rdma.h |1 +
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   39 +--
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f161e30..c343a94 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
enum ib_wr_opcode wr_op;
enum ib_wc_status wc_status;
u32 byte_len;
+   u32 position;
struct svcxprt_rdma *xprt;
unsigned long flags;
enum dma_data_direction direction;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index a67dd1a..36cf51a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -60,6 +60,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
   struct svc_rdma_op_ctxt *ctxt,
   u32 byte_count)
 {
+   struct rpcrdma_msg *rmsgp;
struct page *page;
u32 bc;
int sge_no;
@@ -82,7 +83,14 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
/* If data remains, store it in the pagelist */
rqstp->rq_arg.page_len = bc;
rqstp->rq_arg.page_base = 0;
-   rqstp->rq_arg.pages = &rqstp->rq_pages[1];
+
+   /* RDMA_NOMSG: RDMA READ data should land just after RDMA RECV data */
+   rmsgp = (struct rpcrdma_msg *)rqstp->rq_arg.head[0].iov_base;
+   if (be32_to_cpu(rmsgp->rm_type) == RDMA_NOMSG)
+   rqstp->rq_arg.pages = &rqstp->rq_pages[0];
+   else
+   rqstp->rq_arg.pages = &rqstp->rq_pages[1];
+
sge_no = 1;
while (bc && sge_no < ctxt->count) {
page = ctxt->pages[sge_no];
@@ -383,7 +391,6 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
 */
head->arg.head[0] = rqstp->rq_arg.head[0];
head->arg.tail[0] = rqstp->rq_arg.tail[0];
-   head->arg.pages = &head->pages[head->count];
head->hdr_count = head->count;
head->arg.page_base = 0;
head->arg.page_len = 0;
@@ -393,9 +400,17 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
position = be32_to_cpu(ch->rc_position);
 
+   /* RDMA_NOMSG: RDMA READ data should land just after RDMA RECV data */
+   if (position == 0) {
+   head->arg.pages = &head->pages[0];
+   page_offset = head->byte_len;
+   } else {
+   head->arg.pages = &head->pages[head->count];
+   page_offset = 0;
+   }
+
ret = 0;
page_no = 0;
-   page_offset = 0;
for (; ch->rc_discrim != xdr_zero; ch++) {
if (be32_to_cpu(ch->rc_position) != position)
goto err;
@@ -418,7 +433,10 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
head->arg.buflen += ret;
}
}
+
ret = 1;
+   head->position = position;
+
  err:
/* Detach arg pages. svc_recv will replenish them */
for (page_no = 0;
@@ -465,6 +483,21 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
put_page(rqstp->rq_pages[page_no]);
rqstp->rq_pages[page_no] = head->pages[page_no];
}
+
+   /* Adjustments made for RDMA_NOMSG type requests */
+   if (head->position == 0) {
+   if (head->arg.len <= head->sge[0].length) {
+   head->arg.head[0].iov_len = head->arg.len -
+   head->byte_len;
+   head->arg.page_len = 0;
+   } else {
+   head->arg.head[0].iov_len = head->sge[0].length -
+   head->byte_len;
+   head->arg.page_len = head->arg.len -
+   head->sge[0].length;
+   }
+   }
+
/* Point rq_arg.pages past header */
rdma_fix_xdr_pad(&head->arg);

[PATCH v2 04/10] svcrdma: Scrub BUG_ON() and WARN_ON() call sites

2015-01-13 Thread Chuck Lever

Current convention is to avoid using BUG_ON() in places where an
oops could cause complete system failure.

Replace BUG_ON() call sites in svcrdma with an assertion error
message and allow execution to continue safely.

Some BUG_ON() calls are removed because they have never fired in
production (that we are aware of).

Some WARN_ON() calls are also replaced where a back trace is not
helpful; e.g., in a workqueue task.

Signed-off-by: Chuck Lever 
---
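
The general pattern applied by this patch (report the inconsistency and
return an error instead of halting) can be sketched in userspace like
this. struct xdr_lengths and check_xdr_lengths() are hypothetical
stand-ins, and fprintf() takes the place of pr_err().

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the length fields of an xdr_buf. */
struct xdr_lengths {
	size_t head_len;
	size_t page_len;
	size_t tail_len;
	size_t total_len;
};

/*
 * Instead of asserting (and halting the machine) when the lengths
 * disagree, log the problem and hand an error back to the caller.
 */
int check_xdr_lengths(const struct xdr_lengths *x)
{
	if (x->total_len != x->head_len + x->page_len + x->tail_len) {
		fprintf(stderr, "sketch: XDR buffer length error\n");
		return -EIO;
	}
	return 0;
}
```

The caller then decides how to unwind, which is what lets execution
continue safely.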

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   11 
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   28 +++-
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   43 +++---
 3 files changed, 49 insertions(+), 33 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index b3b7bb8..577f865 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -95,14 +95,6 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
rqstp->rq_respages = &rqstp->rq_pages[sge_no];
rqstp->rq_next_page = rqstp->rq_respages + 1;
 
-   /* We should never run out of SGE because the limit is defined to
-* support the max allowed RPC data length
-*/
-   BUG_ON(bc && (sge_no == ctxt->count));
-   BUG_ON((rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len)
-  != byte_count);
-   BUG_ON(rqstp->rq_arg.len != byte_count);
-
/* If not all pages were used from the SGL, free the remaining ones */
bc = sge_no;
while (sge_no < ctxt->count) {
@@ -477,8 +469,6 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
int page_no;
int ret;
 
-   BUG_ON(!head);
-
/* Copy RPC pages */
for (page_no = 0; page_no < head->count; page_no++) {
put_page(rqstp->rq_pages[page_no]);
@@ -567,7 +557,6 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
}
dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p, status=%d\n",
ctxt, rdma_xprt, rqstp, ctxt->wc_status);
-   BUG_ON(ctxt->wc_status != IB_WC_SUCCESS);
atomic_inc(&rdma_stat_recv);
 
/* Build up the XDR from the receive buffers. */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 9f1b506..7d79897 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -60,8 +60,11 @@ static int map_xdr(struct svcxprt_rdma *xprt,
u32 page_off;
int page_no;
 
-   BUG_ON(xdr->len !=
-  (xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len));
+   if (xdr->len !=
+   (xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len)) {
+   pr_err("svcrdma: map_xdr: XDR buffer length error\n");
+   return -EIO;
+   }
 
/* Skip the first sge, this is for the RPCRDMA header */
sge_no = 1;
@@ -150,7 +153,11 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
int bc;
struct svc_rdma_op_ctxt *ctxt;
 
-   BUG_ON(vec->count > RPCSVC_MAXPAGES);
+   if (vec->count > RPCSVC_MAXPAGES) {
+   pr_err("svcrdma: Too many pages (%lu)\n", vec->count);
+   return -EIO;
+   }
+
dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
"write_len=%d, vec->sge=%p, vec->count=%lu\n",
rmr, (unsigned long long)to, xdr_off,
@@ -190,7 +197,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
sge_off = 0;
sge_no++;
xdr_sge_no++;
-   BUG_ON(xdr_sge_no > vec->count);
+   if (xdr_sge_no > vec->count) {
+   pr_err("svcrdma: Too many sges (%d)\n", xdr_sge_no);
+   goto err;
+   }
bc -= sge_bytes;
if (sge_no == xprt->sc_max_sge)
break;
@@ -421,7 +431,10 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey;
ctxt->sge[sge_no].length = sge_bytes;
}
-   BUG_ON(byte_count != 0);
+   if (byte_count != 0) {
+   pr_err("svcrdma: Could not map %d bytes\n", byte_count);
+   goto err;
+   }
 
/* Save all respages in the ctxt and remove them from the
 * respages array. They are our pages until the I/O
@@ -442,7 +455,10 @@ static int send_reply(struct svcxprt_rdma *rdma,
}
rqstp->rq_next_page = rqstp->rq_respages + 1;
 
-   BUG_ON(sge_no > rdma->sc_max_sge);
+   if (sge_no > rdma->sc_max_sge) {
+   pr_err("svcrdma: Too many sges (%d)\n", sge_no);
+   goto err;
+   }
memset(&send_wr, 0, sizeof send_wr);
ctxt->wr_op = IB_WR_SEND;
send_wr.wr_id = (unsigned long)ctxt;
diff --git a/net/sunrpc/xprtrdma/svc_rd

[PATCH v2 03/10] svcrdma: Clean up read chunk counting

2015-01-13 Thread Chuck Lever

The byte_count argument is not used, and the function is called
only from one place.

Signed-off-by: Chuck Lever 
---
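
For readers unfamiliar with the read list layout, counting entries up
to a zero discriminator can be sketched in userspace as follows.
struct sketch_read_chunk is a hypothetical, reduced type; the real
struct rpcrdma_read_chunk carries more fields and stores them in
network byte order.

```c
#include <stdint.h>

/* Hypothetical, reduced read chunk: only the discriminator matters here. */
struct sketch_read_chunk {
	uint32_t discrim;   /* non-zero: entry present; zero: end of list */
	uint32_t length;
};

/* Count entries up to (not including) the zero terminator. */
unsigned int sketch_chunk_count(const struct sketch_read_chunk *ch)
{
	unsigned int count;

	for (count = 0; ch->discrim != 0; ch++)
		count++;
	return count;
}
```

Because only the count is compared against the page limit, the byte
total that the old helper also computed is not needed.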

 include/linux/sunrpc/svc_rdma.h |2 --
 net/sunrpc/xprtrdma/svc_rdma_marshal.c  |   16 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   15 ---
 3 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 975da75..2280325 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -178,8 +178,6 @@ struct svcxprt_rdma {
 #define RPCRDMA_MAX_REQ_SIZE4096
 
 /* svc_rdma_marshal.c */
-extern void svc_rdma_rcl_chunk_counts(struct rpcrdma_read_chunk *,
- int *, int *);
 extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg **, struct svc_rqst *);
 extern int svc_rdma_xdr_decode_deferred_req(struct svc_rqst *);
 extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index 65b1462..b681855 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -71,22 +71,6 @@ static u32 *decode_read_list(u32 *va, u32 *vaend)
 }
 
 /*
- * Determine number of chunks and total bytes in chunk list. The chunk
- * list has already been verified to fit within the RPCRDMA header.
- */
-void svc_rdma_rcl_chunk_counts(struct rpcrdma_read_chunk *ch,
-  int *ch_count, int *byte_count)
-{
-   /* compute the number of bytes represented by read chunks */
-   *byte_count = 0;
-   *ch_count = 0;
-   for (; ch->rc_discrim != 0; ch++) {
-   *byte_count = *byte_count + ntohl(ch->rc_target.rs_length);
-   *ch_count = *ch_count + 1;
-   }
-}
-
-/*
  * Decodes a write chunk list. The expected format is as follows:
  *descrim  : xdr_one
  *nchunks  : 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 2c67de0..b3b7bb8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -365,12 +365,22 @@ static int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
return ret;
 }
 
+static unsigned int
+rdma_rcl_chunk_count(struct rpcrdma_read_chunk *ch)
+{
+   unsigned int count;
+
+   for (count = 0; ch->rc_discrim != xdr_zero; ch++)
+   count++;
+   return count;
+}
+
 static int rdma_read_chunks(struct svcxprt_rdma *xprt,
struct rpcrdma_msg *rmsgp,
struct svc_rqst *rqstp,
struct svc_rdma_op_ctxt *head)
 {
-   int page_no, ch_count, ret;
+   int page_no, ret;
struct rpcrdma_read_chunk *ch;
u32 page_offset, byte_count;
u64 rs_offset;
@@ -381,8 +391,7 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
if (!ch)
return 0;
 
-   svc_rdma_rcl_chunk_counts(ch, &ch_count, &byte_count);
-   if (ch_count > RPCSVC_MAXPAGES)
+   if (rdma_rcl_chunk_count(ch) > RPCSVC_MAXPAGES)
return -EINVAL;
 
/* The request is completed when the RDMA_READs complete. The

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 01/10] svcrdma: Clean up dprintk

2015-01-13 Thread Chuck Lever

Nit: Fix inconsistent white space in dprintk messages.

Signed-off-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e011027..2c67de0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -501,8 +501,8 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
ret = rqstp->rq_arg.head[0].iov_len
+ rqstp->rq_arg.page_len
+ rqstp->rq_arg.tail[0].iov_len;
-   dprintk("svcrdma: deferred read ret=%d, rq_arg.len =%d, "
-   "rq_arg.head[0].iov_base=%p, rq_arg.head[0].iov_len = %zd\n",
+   dprintk("svcrdma: deferred read ret=%d, rq_arg.len=%u, "
+   "rq_arg.head[0].iov_base=%p, rq_arg.head[0].iov_len=%zu\n",
ret, rqstp->rq_arg.len, rqstp->rq_arg.head[0].iov_base,
rqstp->rq_arg.head[0].iov_len);
 
@@ -591,8 +591,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
+ rqstp->rq_arg.tail[0].iov_len;
svc_rdma_put_context(ctxt, 0);
  out:
-   dprintk("svcrdma: ret = %d, rq_arg.len =%d, "
-   "rq_arg.head[0].iov_base=%p, rq_arg.head[0].iov_len = %zd\n",
+   dprintk("svcrdma: ret=%d, rq_arg.len=%u, "
+   "rq_arg.head[0].iov_base=%p, rq_arg.head[0].iov_len=%zd\n",
ret, rqstp->rq_arg.len,
rqstp->rq_arg.head[0].iov_base,
rqstp->rq_arg.head[0].iov_len);


[PATCH v2 02/10] svcrdma: Remove unused variable

2015-01-13 Thread Chuck Lever

Nit: remove an unused variable to squelch a compiler warning.

Signed-off-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_transport.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 4e61880..4ba11d0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -687,7 +687,6 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
 {
struct rdma_cm_id *listen_id;
struct svcxprt_rdma *cma_xprt;
-   struct svc_xprt *xprt;
int ret;
 
dprintk("svcrdma: Creating RDMA socket\n");
@@ -698,7 +697,6 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
cma_xprt = rdma_create_xprt(serv, 1);
if (!cma_xprt)
return ERR_PTR(-ENOMEM);
-   xprt = &cma_xprt->sc_xprt;
 
listen_id = rdma_create_id(rdma_listen_handler, cma_xprt, RDMA_PS_TCP,
   IB_QPT_RC);


[PATCH v2 00/10] NFS/RDMA server for 3.20

2015-01-13 Thread Chuck Lever

In addition to miscellaneous clean up, the following series of
patches for the Linux NFS server introduces support in the
server’s RPC/RDMA transport implementation for RDMA_NOMSG type
messages, and fixes a bug that prevents the server from handling
RPC/RDMA messages with inline content following the read list.

These patches are contained in the topic branch "nfsd-rdma-for-3.20"
at:

 git://git.linux-nfs.org/projects/cel/cel-2.6.git

Changes since v1:
 - Rebased on 3.19-rc4
 - Patch descriptions clarified
 - Bug addressed in "svcrdma: Handle additional inline content"

---

Chuck Lever (10):
  svcrdma: Handle additional inline content
  svcrdma: Move read list XDR round-up logic
  svcrdma: Support RDMA_NOMSG requests
  svcrdma: rc_position sanity checking
  svcrdma: Plant reader function in struct svcxprt_rdma
  svcrdma: Find rmsgp more reliably
  svcrdma: Scrub BUG_ON() and WARN_ON() call sites
  svcrdma: Clean up read chunk counting
  svcrdma: Remove unused variable
  svcrdma: Clean up dprintk


 include/linux/sunrpc/svc_rdma.h  |   13 +-
 net/sunrpc/xprtrdma/svc_rdma_marshal.c   |   16 --
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  244 +++---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   46 +++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   47 --
 5 files changed, 217 insertions(+), 149 deletions(-)

-- 
Chuck Lever

Re: [PATCH v1 06/10] svcrdma: Plant reader function in struct svcxprt_rdma

2015-01-13 Thread Steve Wise


On 1/13/2015 4:05 AM, Sagi Grimberg wrote:

On 1/12/2015 6:45 PM, Steve Wise wrote:

On 1/12/2015 10:26 AM, Steve Wise wrote:



-Original Message-
From: Chuck Lever [mailto:chuck.le...@oracle.com]
Sent: Monday, January 12, 2015 10:20 AM
To: Steve Wise
Cc: Sagi Grimberg; linux-rdma@vger.kernel.org; Linux NFS Mailing List
Subject: Re: [PATCH v1 06/10] svcrdma: Plant reader function in
struct svcxprt_rdma


On Jan 12, 2015, at 11:08 AM, Steve Wise
 wrote:




-Original Message-
From: Chuck Lever [mailto:chuck.le...@oracle.com]
Sent: Sunday, January 11, 2015 6:41 PM
To: Sagi Grimberg; Steve Wise
Cc: linux-rdma@vger.kernel.org; Linux NFS Mailing List
Subject: Re: [PATCH v1 06/10] svcrdma: Plant reader function in
struct svcxprt_rdma


On Jan 11, 2015, at 12:45 PM, Sagi Grimberg
 wrote:


On 1/9/2015 9:22 PM, Chuck Lever wrote:

The RDMA reader function doesn't change once an svcxprt is
instantiated. Instead of checking sc_devcap during every incoming
RPC, set the reader function once when the connection is accepted.

General question(s),

Any specific reason why to use FRMR in the server side? And why only
for reads and not writes? Sorry if these are dumb questions...

Steve Wise presented patches a few months back to add FRMR, he
would have to answer this. Steve has a selection of iWARP adapters
and maybe could provide some idea of performance impact. I have
only CX-[23] here.


The rdma rpc server has always tried to use FRMR for rdma reads as
far as I recall.  The patch I submitted refactored the design in
order to make it more efficient and to fix some bugs.  Unlike IB,
the iWARP protocol only allows 1 target/sink SGE in an rdma read
request message, so an FRMR is used to create that single
target/sink SGE allowing 1 read to be submitted instead of many.

How does this work when the client uses PHYSICAL memory registration?

Each page would require a separate rdma read WR.  That is why we use
FRMRs. :)


Correction, each physical scatter gather entry would require a separate
read WR.  There may be contiguous chunks of physical mem that can be
described with one RDMA SGE...



OK, thanks for clarifying that for me.

From the code, I think that FRMR is used also if the buffer can
fit in a single SGE. Wouldn't it be better to skip the Fastreg WR in
this case?



Perhaps.
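
The idea in the quoted patch description (pick the reader once when the
connection is accepted, rather than testing device capabilities on
every RPC) can be illustrated with a small userspace sketch. Every
name here is hypothetical, including SKETCH_CAP_FAST_REG; none of it
is taken from the svcrdma source.

```c
/* Hypothetical capability flag and transport; names are illustrative only. */
#define SKETCH_CAP_FAST_REG 0x1u

struct sketch_xprt;
typedef int (*sketch_reader_fn)(struct sketch_xprt *xprt);

struct sketch_xprt {
	unsigned int devcap;
	sketch_reader_fn reader;   /* chosen once, at accept time */
};

/* Two stand-in readers that only report which one ran. */
int sketch_read_frmr(struct sketch_xprt *xprt) { (void)xprt; return 1; }
int sketch_read_lcl(struct sketch_xprt *xprt)  { (void)xprt; return 2; }

/* Called once when the connection is accepted. */
void sketch_accept(struct sketch_xprt *xprt, unsigned int devcap)
{
	xprt->devcap = devcap;
	xprt->reader = (devcap & SKETCH_CAP_FAST_REG) ?
			sketch_read_frmr : sketch_read_lcl;
}

/* Per-RPC path: no capability test, just an indirect call. */
int sketch_handle_rpc(struct sketch_xprt *xprt)
{
	return xprt->reader(xprt);
}
```

The capability test moves out of the hot path, at the cost of one
indirect call per request.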


Re: [PATCH v1 10/10] svcrdma: Handle additional inline content

2015-01-13 Thread Chuck Lever

On Jan 13, 2015, at 5:11 AM, Sagi Grimberg  wrote:

> On 1/12/2015 3:13 AM, Chuck Lever wrote:
>> 
>> On Jan 11, 2015, at 1:01 PM, Sagi Grimberg  wrote:
>> 
>>> On 1/9/2015 9:23 PM, Chuck Lever wrote:
>>>> Most NFS RPCs place large payload arguments at the end of the RPC
>>>> header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
>>>> sends the complete RPC header inline, and the payload argument in a
>>>> read list.
>>>>
>>>> One important case is not like this, however. NFSv4 WRITE compounds
>>>> can have an operation after the WRITE operation. The proper way to
>>>> convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
>>>> the read list position. (Note Linux clients currently do not do
>>>> this, but they will be changed to do it in the future).
>>>>
>>>> The receiver could put trailing inline content in the XDR tail
>>>> buffer. But the Linux server's NFSv4 compound processing does not
>>>> consider the XDR tail buffer.
>>>>
>>>> So, move trailing inline content to the end of the page list. This
>>>> presents the incoming compound to upper layers the same way the
>>>> socket code does.
>>>>
>>> 
>>> Would this memcpy be saved if you just posted a larger receive buffer
>>> and the client used it "really inline" as part of its post_send?
>> 
>> The receive buffer doesn’t need to be larger. Clients already should
>> construct this trailing inline content in their SEND buffers.
>> 
>> Note that the Linux client doesn’t yet send the extra inline via RDMA
>> SEND, it uses a separate RDMA READ to move the extra bytes, and that’s
>> a bug.
>> 
>> If the client does send this inline as it’s supposed to, the server
>> would receive it in its pre-posted RECV buffer. This patch simply
>> moves that content into the XDR buffer page list, where the server’s
>> XDR decoder can find it.
> 
> Would it make more sense to manipulate pointers instead of copying data?

It would. My first approach was to use the tail iovec in xdr_buf.
Simply point the tail’s iov_base at trailing inline content in the
RECV buffer.

But as mentioned, the server’s XDR decoders don’t look at the tail
iovec.

The socket transport delivers this little piece of data at the end of
the xdr_buf page list, because all it has to do is read data off the
socket and stick it in pages.

So svcrdma can do that too. It’s a little more awkward, but the upper
layer code stays the same.

> But if this is only 16 bytes then maybe it's not worth the trouble…

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
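
The "copy to the end of the page list" approach discussed in this
message can be sketched in userspace as follows. The model is
deliberately flattened: struct sketch_xdr_buf treats the page list as
one contiguous area, and all names are hypothetical rather than taken
from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, flattened stand-in for the page list of an xdr_buf. */
struct sketch_xdr_buf {
	unsigned char *pages;   /* page list modeled as one flat area */
	size_t page_cap;        /* bytes available in that area */
	size_t page_len;        /* bytes of page data currently used */
	size_t len;             /* total message length */
};

/*
 * Copy trailing inline bytes from the receive buffer to the end of the
 * page data, so a decoder that never looks at the tail still sees them.
 * Assumes page_len <= page_cap on entry.
 * Returns 0 on success, -1 if the page area is too small.
 */
int sketch_append_trailing(struct sketch_xdr_buf *buf,
			   const unsigned char *inline_data, size_t inline_len)
{
	if (inline_len > buf->page_cap - buf->page_len)
		return -1;
	memcpy(buf->pages + buf->page_len, inline_data, inline_len);
	buf->page_len += inline_len;
	buf->len += inline_len;
	return 0;
}
```

The copy costs a few bytes of memcpy per request, but the upper-layer
decoder needs no changes.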


Re: [PATCH infiniband-diags] ibdiag_common.c: Add more supported device IDs

2015-01-13 Thread Or Gerlitz

On 1/13/2015 3:09 PM, Hal Rosenstock wrote:
> On 1/13/2015 2:05 AM, Or Gerlitz wrote:
>> On 1/12/2015 9:08 PM, Hal Rosenstock wrote:
>>> Add support for ConnectX-3 and ConnectX-4
>>
>> So... ConnectX-3 isn't supported today?!
>
> It's just device IDs 0x1012 and 0x1013

I don't see these devices in the mlx4 core driver PCI table
(mlx4_pci_table in main.c) nor in libmlx4, not sure what you are
referring to here...

Or.

[PATCH infiniband-diags] vendstat.c: Add additional vendor IDs to ext_fw_info_device

2015-01-13 Thread Hal Rosenstock

Add additional supported device IDs to ext_fw_info_device.
These IDs are ConnectX-4, some ConnectX-3s, and Switch-IB.
This affects is_ext_fw_info_supported which determines
which Mellanox specific vendor MADs are issued to a device.

Signed-off-by: Hal Rosenstock 
---
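
A range-table lookup of the kind this patch extends can be sketched as
follows. This is not the vendstat.c implementation: sketch_ranges and
sketch_id_in_table() are hypothetical names, the table lists only a
few of the ranges from the diff, and it is walked by size rather than
by a terminator entry.

```c
#include <stdint.h>

/* Inclusive [low, high] device ID ranges; values taken from the diff. */
static const uint16_t sketch_ranges[][2] = {
	{0xc738, 0xc738},   /* Switch-X */
	{0xcb84, 0xcb84},   /* Switch-IB */
	{0x1003, 0x1016},   /* Connect-X */
};

/* Return 1 if device_id falls inside any listed range, 0 otherwise. */
int sketch_id_in_table(uint16_t device_id)
{
	unsigned int i;

	for (i = 0; i < sizeof(sketch_ranges) / sizeof(sketch_ranges[0]); i++) {
		if (device_id >= sketch_ranges[i][0] &&
		    device_id <= sketch_ranges[i][1])
			return 1;
	}
	return 0;
}
```

Widening a range's upper bound, as done here for the Connect-X entry,
admits every ID in between without adding rows.
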
diff --git a/src/vendstat.c b/src/vendstat.c
index f28ff02..7fc4a11 100644
--- a/src/vendstat.c
+++ b/src/vendstat.c
@@ -145,8 +145,9 @@ typedef struct {
 static uint16_t ext_fw_info_device[][2] = {
{0x0245, 0x0245},   /* Switch-X */
{0xc738, 0xc738},   /* Switch-X */
+   {0xcb84, 0xcb84},   /* Switch-IB */
{0x01b3, 0x01b3},   /* IS-4 */
-   {0x1003, 0x1011},   /* Connect-X */
+   {0x1003, 0x1016},   /* Connect-X */
{0x, 0x}};
 
 static int is_ext_fw_info_supported(uint16_t device_id) {

[PATCH v2 infiniband-diags] ibdiag_common.c: Add more supported device IDs in is_mlnx_ext_port_info_supported

2015-01-13 Thread Hal Rosenstock

Add support for some ConnectX-3, ConnectX-4, and Switch-IB in
is_mlnx_ext_port_info_supported.

This only affects tool(s) which invoke this routine
(is_mlnx_ext_port_info_supported) which is
currently just ibportstate.

is_mlnx_ext_port_info_supported determines whether or not the MLNX
vendor extended port info SMP is sent which is used to determine FDR10
speed.

Not having the device in the list would only result in FDR10 being
misinterpreted as QDR.

Signed-off-by: Hal Rosenstock 
---
Change since v1:
Improved patch description
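
The effect of the change on individual device IDs can be checked with
a small userspace sketch. Both functions are hypothetical copies of
the device ID test only; they leave out the IBND_CONFIG_MLX_EPI flag
check that the real routine performs first.

```c
#include <stdint.h>

/* Device ID test as it stood before this patch. */
int sketch_epi_supported_old(uint32_t devid)
{
	return devid == 0xc738 || (devid >= 0x1003 && devid <= 0x1011);
}

/* Device ID test with the additional IDs from this patch. */
int sketch_epi_supported_new(uint32_t devid)
{
	return devid == 0xc738 || devid == 0xcb84 ||
	       (devid >= 0x1003 && devid <= 0x1016);
}
```

IDs 0x1012 through 0x1016 and 0xcb84 are rejected by the old test and
accepted by the new one; everything else is unchanged.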

diff --git a/src/ibdiag_common.c b/src/ibdiag_common.c
index 8c749c7..384d342 100644
--- a/src/ibdiag_common.c
+++ b/src/ibdiag_common.c
@@ -499,9 +499,9 @@ conv_cnt_human_readable(uint64_t val64, float *val, int data)
 int is_mlnx_ext_port_info_supported(uint32_t devid)
 {
if (ibd_ibnetdisc_flags & IBND_CONFIG_MLX_EPI) {
-   if (devid == 0xc738)
+   if (devid == 0xc738 || devid == 0xcb84)
return 1;
-   if (devid >= 0x1003 && devid <= 0x1011)
+   if (devid >= 0x1003 && devid <= 0x1016)
return 1;
}
return 0;

[PATCH infiniband-diags] ibnetdisc.c: Add additional supported device IDs to is_mlnx_ext_port_info_supported

2015-01-13 Thread Hal Rosenstock

Similar to the previous ibdiag_common.c patch, this adds additional device IDs
to be supported to the ibnetdiscover internal is_mlnx_ext_port_info_supported routine.
These IDs are ConnectX-4, some ConnectX-3s, and Switch-IB.

Signed-off-by: Hal Rosenstock 
---
diff --git a/libibnetdisc/src/ibnetdisc.c b/libibnetdisc/src/ibnetdisc.c
index 121fe35..f3c6000 100644
--- a/libibnetdisc/src/ibnetdisc.c
+++ b/libibnetdisc/src/ibnetdisc.c
@@ -203,9 +203,9 @@ static int is_mlnx_ext_port_info_supported(ibnd_port_t * port)
 {
uint16_t devid = (uint16_t) mad_get_field(port->node->info, 0, IB_NODE_DEVID_F);
 
-   if (devid == 0xc738)
+   if (devid == 0xc738 || devid == 0xcb84)
return 1;
-   if (devid >= 0x1003 && devid <= 0x1011)
+   if (devid >= 0x1003 && devid <= 0x1016)
return 1;
return 0;
 }

Re: [PATCH infiniband-diags] ibdiag_common.c: Add more supported device IDs

2015-01-13 Thread Hal Rosenstock

On 1/13/2015 2:05 AM, Or Gerlitz wrote:
> On 1/12/2015 9:08 PM, Hal Rosenstock wrote:
>> Add support for ConnectX-3 and ConnectX-4
> 
> So... ConnectX-3 isn't supported today?! 

It's just device IDs 0x1012 and 0x1013.

> on what infrastructure/tools exactly?

This only affects tool(s) which invoke this routine
(is_mlnx_ext_port_info_supported) which is
currently just ibportstate.

is_mlnx_ext_port_info_supported determines whether or not the MLNX
vendor extended port info SMP is sent which is used to determine FDR10
speed.

Not having the device in the list would only result in FDR10 being
misinterpreted as QDR.

Re: [PATCH v1 10/10] svcrdma: Handle additional inline content

2015-01-13 Thread Sagi Grimberg


On 1/12/2015 3:13 AM, Chuck Lever wrote:
>
> On Jan 11, 2015, at 1:01 PM, Sagi Grimberg wrote:
>
>> On 1/9/2015 9:23 PM, Chuck Lever wrote:
>>> Most NFS RPCs place large payload arguments at the end of the RPC
>>> header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
>>> sends the complete RPC header inline, and the payload argument in a
>>> read list.
>>>
>>> One important case is not like this, however. NFSv4 WRITE compounds
>>> can have an operation after the WRITE operation. The proper way to
>>> convey an NFSv4 WRITE is to place the GETATTR inline, but _after_
>>> the read list position. (Note Linux clients currently do not do
>>> this, but they will be changed to do it in the future).
>>>
>>> The receiver could put trailing inline content in the XDR tail
>>> buffer. But the Linux server's NFSv4 compound processing does not
>>> consider the XDR tail buffer.
>>>
>>> So, move trailing inline content to the end of the page list. This
>>> presents the incoming compound to upper layers the same way the
>>> socket code does.
>>
>> Would this memcpy be saved if you just posted a larger receive buffer
>> and the client used it "really inline" as part of its post_send?
>
> The receive buffer doesn’t need to be larger. Clients already should
> construct this trailing inline content in their SEND buffers.
>
> Note that the Linux client doesn’t yet send the extra inline via RDMA
> SEND, it uses a separate RDMA READ to move the extra bytes, and that’s
> a bug.
>
> If the client does send this inline as it’s supposed to, the server
> would receive it in its pre-posted RECV buffer. This patch simply
> moves that content into the XDR buffer page list, where the server’s
> XDR decoder can find it.

Would it make more sense to manipulate pointers instead of copying data?
But if this is only 16 bytes then maybe it's not worth the trouble...

Re: [PATCH v1 05/10] svcrdma: Find rmsgp more reliably

2015-01-13 Thread Sagi Grimberg


On 1/12/2015 2:30 AM, Chuck Lever wrote:
> Hi Sagi-
>
> Thanks for the review.
>
> On Jan 11, 2015, at 12:37 PM, Sagi Grimberg wrote:
>
>> On 1/9/2015 9:22 PM, Chuck Lever wrote:
>>> xdr_start() can return the wrong rmsgp address if an assumption
>>> about how the xdr_buf was constructed changes.  When it gets it
>>> wrong, the client receives a reply that has gibberish in the
>>> RPC/RDMA header, preventing it from matching a waiting RPC request.
>>>
>>> Instead, make (and document) just one assumption: that the RDMA
>>> header for the client's RPC call is at the start of the first page
>>> in rq_pages.
>>
>> Would it make more sense to add another pointer assigned at req
>> initialization (maybe in the RDMA request context) instead of hard
>> coding this assumption? I may be completely wrong here though...
>
> I considered this. I couldn’t find an appropriate place to add
> such a pointer.
>
> I think that’s why xdr_start() was there in the first place: there
> is no convenient place to save a pointer to the request’s RDMA
> header.
>
> Bruce might have other thoughts about this.

Yep, I didn't find any nice place to put that also, thought you might
have an idea...
Sagi.
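
The single assumption named in the quoted patch description (the RDMA
header of the call sits at the start of the first page in rq_pages)
can be modeled in userspace like this. The types and the helper are
hypothetical and much reduced; in the kernel the page would first be
mapped to an address.

```c
#include <stdint.h>

#define SKETCH_MAX_PAGES 4

/* Hypothetical, reduced request: just the page array. */
struct sketch_rqst {
	void *pages[SKETCH_MAX_PAGES];
};

/* Hypothetical header layout; only the first word is modeled. */
struct sketch_rdma_hdr {
	uint32_t xid;
};

/*
 * The single assumption: the transport header of the incoming call
 * sits at the very start of the first page.
 */
struct sketch_rdma_hdr *sketch_get_hdr(struct sketch_rqst *rqst)
{
	return (struct sketch_rdma_hdr *)rqst->pages[0];
}
```

Keeping the lookup in one helper means the assumption is documented in
one place, which is the alternative to storing a separate pointer that
the thread could not find a home for.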

Re: [PATCH v1 06/10] svcrdma: Plant reader function in struct svcxprt_rdma

2015-01-13 Thread Sagi Grimberg


On 1/12/2015 6:45 PM, Steve Wise wrote:

On 1/12/2015 10:26 AM, Steve Wise wrote:



-Original Message-
From: Chuck Lever [mailto:chuck.le...@oracle.com]
Sent: Monday, January 12, 2015 10:20 AM
To: Steve Wise
Cc: Sagi Grimberg; linux-rdma@vger.kernel.org; Linux NFS Mailing List
Subject: Re: [PATCH v1 06/10] svcrdma: Plant reader function in
struct svcxprt_rdma


On Jan 12, 2015, at 11:08 AM, Steve Wise
 wrote:




-Original Message-
From: Chuck Lever [mailto:chuck.le...@oracle.com]
Sent: Sunday, January 11, 2015 6:41 PM
To: Sagi Grimberg; Steve Wise
Cc: linux-rdma@vger.kernel.org; Linux NFS Mailing List
Subject: Re: [PATCH v1 06/10] svcrdma: Plant reader function in
struct svcxprt_rdma


On Jan 11, 2015, at 12:45 PM, Sagi Grimberg
 wrote:


On 1/9/2015 9:22 PM, Chuck Lever wrote:

The RDMA reader function doesn't change once an svcxprt is
instantiated. Instead of checking sc_devcap during every incoming
RPC, set the reader function once when the connection is accepted.

General question(s),

Any specific reason why to use FRMR in the server side? And why only
for reads and not writes? Sorry if these are dumb questions...

Steve Wise presented patches a few months back to add FRMR, he
would have to answer this. Steve has a selection of iWARP adapters
and maybe could provide some idea of performance impact. I have
only CX-[23] here.


The rdma rpc server has always tried to use FRMR for rdma reads as
far as I recall.  The patch I submitted refactored the design in
order to make it more efficient and to fix some bugs.  Unlike IB,
the iWARP protocol only allows 1 target/sink SGE in an rdma read
request message, so an FRMR is used to create that single
target/sink SGE allowing 1 read to be submitted instead of many.

How does this work when the client uses PHYSICAL memory registration?

Each page would require a separate rdma read WR.  That is why we use
FRMRs. :)


Correction, each physical scatter gather entry would require a separate
read WR.  There may be contiguous chunks of physical mem that can be
described with one RDMA SGE...



OK, thanks for clarifying that for me.

From the code, I think that FRMR is used also if the buffer can
fit in a single SGE. Wouldn't it be better to skip the Fastreg WR in
this case?

Sagi.
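
Steve's correction in the quoted text (one read WR per physical
scatter/gather entry, where physically contiguous pages can share an
entry) can be made concrete with a small userspace sketch. It assumes
page-sized, page-aligned buffers and a 4096-byte page;
sketch_count_sges() is a hypothetical helper, not part of svcrdma.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/*
 * Count how many scatter/gather entries are needed to describe a list
 * of page-sized physical addresses when adjacent pages are merged.
 */
size_t sketch_count_sges(const uint64_t *phys, size_t npages)
{
	size_t i, count = 0;

	for (i = 0; i < npages; i++) {
		/* A new entry starts whenever this page does not directly
		 * follow the previous one in physical memory. */
		if (i == 0 || phys[i] != phys[i - 1] + SKETCH_PAGE_SIZE)
			count++;
	}
	return count;
}
```

When the result is 1, the buffer already fits in a single SGE, which
is the case where the question about skipping the fastreg WR applies.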
