[PATCH for-next v3 2/2] rds: ib: Remove two ib_modify_qp() calls

2021-03-31 Thread Håkon Bugge
For some HCAs, ib_modify_qp() is an expensive operation when running
virtualized.

For both the active and passive side, the QP returned by the CM already
has its state set to RTS, so there is no need for the excess RTS -> RTS
transition. With the IB Core's ability to set the RNR Retry timer, we use
that interface to shave off another ib_modify_qp() call.

Fixes: ec16227e1414 ("RDS/IB: Infiniband transport")
Signed-off-by: Håkon Bugge 
---
 net/rds/ib_cm.c  | 35 +--
 net/rds/rdma_transport.c |  1 +
 2 files changed, 2 insertions(+), 34 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index f5cbe96..26b069e 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -68,31 +68,6 @@ static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits)
 }
 
 /*
- * Tune RNR behavior. Without flow control, we use a rather
- * low timeout, but not the absolute minimum - this should
- * be tunable.
- *
- * We already set the RNR retry count to 7 (which is the
- * smallest infinite number :-) above.
- * If flow control is off, we want to change this back to 0
- * so that we learn quickly when our credit accounting is
- * buggy.
- *
- * Caller passes in a qp_attr pointer - don't waste stack spacv
- * by allocation this twice.
- */
-static void
-rds_ib_tune_rnr(struct rds_ib_connection *ic, struct ib_qp_attr *attr)
-{
-   int ret;
-
-   attr->min_rnr_timer = IB_RNR_TIMER_000_32;
-   ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_MIN_RNR_TIMER);
-   if (ret)
-   printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret);
-}
-
-/*
  * Connection established.
  * We get here for both outgoing and incoming connection.
  */
@@ -100,7 +75,6 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 {
struct rds_ib_connection *ic = conn->c_transport_data;
const union rds_ib_conn_priv *dp = NULL;
-   struct ib_qp_attr qp_attr;
__be64 ack_seq = 0;
__be32 credit = 0;
u8 major = 0;
@@ -168,14 +142,6 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 * the posted credit count. */
rds_ib_recv_refill(conn, 1, GFP_KERNEL);
 
-   /* Tune RNR behavior */
-   rds_ib_tune_rnr(ic, &qp_attr);
-
-   qp_attr.qp_state = IB_QPS_RTS;
-   err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE);
-   if (err)
-   printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err);
-
/* update ib_device with this local ipaddr */
err = rds_ib_update_ipaddr(ic->rds_ibdev, &conn->c_laddr);
if (err)
@@ -947,6 +913,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
  event->param.conn.responder_resources,
  event->param.conn.initiator_depth, isv6);
 
+   rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32);
/* rdma_accept() calls rdma_reject() internally if it fails */
if (rdma_accept(cm_id, &conn_param))
rds_ib_conn_error(conn, "rdma_accept failed\n");
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 5f741e5..a9e4ff9 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -87,6 +87,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 
case RDMA_CM_EVENT_ADDR_RESOLVED:
rdma_set_service_type(cm_id, conn->c_tos);
+   rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32);
/* XXX do we need to clean up if this fails? */
ret = rdma_resolve_route(cm_id,
 RDS_RDMA_RESOLVE_TIMEOUT_MS);
-- 
1.8.3.1



[PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-03-31 Thread Håkon Bugge
ib_modify_qp() is an expensive operation on some HCAs when running
virtualized. This series removes two ib_modify_qp() calls from RDS.

I am sending this as a v3, even though it is the first version sent to
net. This is because the IB Core commit has reached v3.

Håkon Bugge (2):
  IB/cma: Introduce rdma_set_min_rnr_timer()
  rds: ib: Remove two ib_modify_qp() calls

 drivers/infiniband/core/cma.c  | 41 ++
 drivers/infiniband/core/cma_priv.h |  2 ++
 include/rdma/rdma_cm.h |  2 ++
 net/rds/ib_cm.c| 35 +---
 net/rds/rdma_transport.c   |  1 +
 5 files changed, 47 insertions(+), 34 deletions(-)

--
1.8.3.1



[PATCH for-next v3 1/2] IB/cma: Introduce rdma_set_min_rnr_timer()

2021-03-31 Thread Håkon Bugge
Introduce the ability for kernel ULPs to adjust the minimum RNR Retry
timer. The INIT -> RTR transition executed by RDMA CM will be used for
this adjustment. This avoids an additional ib_modify_qp() call.

rdma_set_min_rnr_timer() must be called before the call to
rdma_connect() on the active side and before the call to rdma_accept()
on the passive side.

The default value of the RNR Retry timer is zero, which translates to 655
ms. When the receiver is not ready to accept a send message, it
encodes the RNR Retry timer value in the NAK. The requester will then
wait at least the specified time before retrying the send.

The 5-bit value to be supplied to rdma_set_min_rnr_timer() is
documented in IBTA Table 45: "Encoding for RNR NAK Timer Field".
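
As an illustration only (not part of this patch), a kernel ULP on the
passive side would use the new API roughly as follows; the handler name
and surrounding plumbing are made up:

    /* Hypothetical connect-request handler: ask RDMA CM to apply the
     * minimum RNR NAK timer in its INIT -> RTR transition, so no extra
     * ib_modify_qp(IB_QP_MIN_RNR_TIMER) is needed afterwards.
     */
    static int ulp_handle_connect_request(struct rdma_cm_id *cm_id,
                                          struct rdma_conn_param *param)
    {
            int ret;

            /* must be called before rdma_accept() */
            ret = rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32);
            if (ret)
                    return ret;

            return rdma_accept(cm_id, param);
    }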

Signed-off-by: Håkon Bugge 
Acked-by: Jason Gunthorpe 
---
 drivers/infiniband/core/cma.c  | 41 ++
 drivers/infiniband/core/cma_priv.h |  2 ++
 include/rdma/rdma_cm.h |  2 ++
 3 files changed, 45 insertions(+)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9409651..5ce097d 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -852,6 +852,7 @@ static void cma_id_put(struct rdma_id_private *id_priv)
id_priv->id.qp_type = qp_type;
id_priv->tos_set = false;
id_priv->timeout_set = false;
+   id_priv->min_rnr_timer_set = false;
id_priv->gid_type = IB_GID_TYPE_IB;
spin_lock_init(&id_priv->lock);
mutex_init(&id_priv->qp_mutex);
@@ -1141,6 +1142,9 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
if ((*qp_attr_mask & IB_QP_TIMEOUT) && id_priv->timeout_set)
qp_attr->timeout = id_priv->timeout;
 
+   if ((*qp_attr_mask & IB_QP_MIN_RNR_TIMER) && id_priv->min_rnr_timer_set)
+   qp_attr->min_rnr_timer = id_priv->min_rnr_timer;
+
return ret;
 }
 EXPORT_SYMBOL(rdma_init_qp_attr);
@@ -2615,6 +2619,43 @@ int rdma_set_ack_timeout(struct rdma_cm_id *id, u8 timeout)
 }
 EXPORT_SYMBOL(rdma_set_ack_timeout);
 
+/**
+ * rdma_set_min_rnr_timer() - Set the minimum RNR Retry timer of the
+ *                            QP associated with a connection identifier.
+ * @id: Communication identifier associated with the connection.
+ * @min_rnr_timer: 5-bit value encoded as Table 45: "Encoding for RNR NAK
+ *                 Timer Field" in the IBTA specification.
+ *
+ * This function should be called before rdma_connect() on the active
+ * side, and before rdma_accept() on the passive side. The timer value
+ * will be associated with the local QP. When the QP receives a send it
+ * is not ready to handle, typically because the receive queue is empty,
+ * an RNR Retry NAK is returned to the requester with the min_rnr_timer
+ * encoded. The requester will then wait at least the time specified
+ * in the NAK before retrying. The default is zero, which translates
+ * to a minimum RNR Timer value of 655 ms.
+ *
+ * Return: 0 for success
+ */
+int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer)
+{
+   struct rdma_id_private *id_priv;
+
+   /* It is a five-bit value */
+   if (min_rnr_timer & 0xe0)
+   return -EINVAL;
+
+   if (id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT)
+   return -EINVAL;
+
+   id_priv = container_of(id, struct rdma_id_private, id);
+   id_priv->min_rnr_timer = min_rnr_timer;
+   id_priv->min_rnr_timer_set = true;
+
+   return 0;
+}
+EXPORT_SYMBOL(rdma_set_min_rnr_timer);
+
 static void cma_query_handler(int status, struct sa_path_rec *path_rec,
  void *context)
 {
diff --git a/drivers/infiniband/core/cma_priv.h b/drivers/infiniband/core/cma_priv.h
index caece96..bf83d32 100644
--- a/drivers/infiniband/core/cma_priv.h
+++ b/drivers/infiniband/core/cma_priv.h
@@ -86,9 +86,11 @@ struct rdma_id_private {
u8  tos;
u8  tos_set:1;
u8  timeout_set:1;
+   u8  min_rnr_timer_set:1;
u8  reuseaddr;
u8  afonly;
u8  timeout;
+   u8  min_rnr_timer;
enum ib_gid_typegid_type;
 
/*
diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
index 32a67af..8b0f66e 100644
--- a/include/rdma/rdma_cm.h
+++ b/include/rdma/rdma_cm.h
@@ -331,6 +331,8 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 int rdma_set_afonly(struct rdma_cm_id *id, int afonly);
 
 int rdma_set_ack_timeout(struct rdma_cm_id *id, u8 timeout);
+
+int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer);
  /**
  * rdma_get_service_id - Return the IB service ID for a specified address.
  * @id: Communication identifier associated with the address.
-- 
1.8.3.1



Re: [Linux-kernel-mentees] [PATCH net] rds: Prevent kernel-infoleak in rds_notify_queue_get()

2020-07-31 Thread Håkon Bugge



> On 31 Jul 2020, at 13:59, Greg Kroah-Hartman  
> wrote:
> 
> On Fri, Jul 31, 2020 at 01:14:09PM +0200, Håkon Bugge wrote:
>> 
>> 
>>> On 31 Jul 2020, at 11:59, Dan Carpenter  wrote:
>>> 
>>> On Fri, Jul 31, 2020 at 07:53:01AM +0300, Leon Romanovsky wrote:
>>>> On Thu, Jul 30, 2020 at 03:20:26PM -0400, Peilin Ye wrote:
>>>>> rds_notify_queue_get() is potentially copying uninitialized kernel stack
>>>>> memory to userspace since the compiler may leave a 4-byte hole at the end
>>>>> of `cmsg`.
>>>>> 
>>>>> In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which
>>>>> unfortunately does not always initialize that 4-byte hole. Fix it by using
>>>>> memset() instead.
>>>> 
>>>> Of course, this is the difference between "{ 0 }" and "{}" initializations.
>>>> 
>>> 
>>> No, there is no difference.  Even struct assignments like:
>>> 
>>> foo = *bar;
>>> 
>>> can leave struct holes uninitialized.  Depending on the compiler the
>>> assignment can be implemented as a memset() or as a series of struct
>>> member assignments.
>> 
>> What about:
>> 
>> struct rds_rdma_notify {
>>  __u64  user_token;
>>  __s32  status;
>> } __attribute__((packed));
> 
> Why is this still a discussion at all?
> 
> Try it and see, run pahole and see if there are holes in this structure
> (odds are no), you don't need us to say what is happening here...

An older posting had this:

$ pahole -C "rds_rdma_notify" net/rds/recv.o
struct rds_rdma_notify {
__u64  user_token;   /* 0 8 */
__s32  status;   /* 8 4 */

/* size: 16, cachelines: 1, members: 2 */
/* padding: 4 */
/* last cacheline: 16 bytes */
};
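
So there is indeed a 4-byte tail-padding hole. A minimal sketch of the
memset() approach being proposed (the notifier field names and the
put_cmsg() plumbing here are assumptions for illustration, not copied
from net/rds/recv.c):

    struct rds_rdma_notify cmsg;

    memset(&cmsg, 0, sizeof(cmsg));  /* also zeroes the 4 padding bytes */
    cmsg.user_token = notifier->n_user_token;
    cmsg.status = notifier->n_status;

    err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_RDMA_STATUS,
                   sizeof(cmsg), &cmsg);

whereas a "= { 0 };" initializer is not guaranteed to clear that hole, as
discussed above.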


Thxs, Håkon



Re: [Linux-kernel-mentees] [PATCH net] rds: Prevent kernel-infoleak in rds_notify_queue_get()

2020-07-31 Thread Håkon Bugge



> On 31 Jul 2020, at 11:59, Dan Carpenter  wrote:
> 
> On Fri, Jul 31, 2020 at 07:53:01AM +0300, Leon Romanovsky wrote:
>> On Thu, Jul 30, 2020 at 03:20:26PM -0400, Peilin Ye wrote:
>>> rds_notify_queue_get() is potentially copying uninitialized kernel stack
>>> memory to userspace since the compiler may leave a 4-byte hole at the end
>>> of `cmsg`.
>>> 
>>> In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which
>>> unfortunately does not always initialize that 4-byte hole. Fix it by using
>>> memset() instead.
>> 
>> Of course, this is the difference between "{ 0 }" and "{}" initializations.
>> 
> 
> No, there is no difference.  Even struct assignments like:
> 
>   foo = *bar;
> 
> can leave struct holes uninitialized.  Depending on the compiler the
> assignment can be implemented as a memset() or as a series of struct
> member assignments.

What about:

struct rds_rdma_notify {
__u64  user_token;
__s32  status;
} __attribute__((packed));


Thxs, Håkon


> regards,
> dan carpenter
> 



Re: [PATCH net-next] net/rds: Use DMA memory pool allocation for rds_header

2019-10-02 Thread Håkon Bugge



> On 2 Oct 2019, at 07:20, Ka-Cheong Poon  wrote:
> 
> On 10/2/19 1:16 AM, David Miller wrote:
>> From: Ka-Cheong Poon 
>> Date: Mon, 30 Sep 2019 02:08:00 -0700
>>> Currently, RDS calls ib_dma_alloc_coherent() to allocate a large piece
>>> of contiguous DMA coherent memory to store struct rds_header for
>>> sending/receiving packets.  The memory allocated is then partitioned
>>> into struct rds_header.  This is not necessary and can be costly at
>>> times when memory is fragmented.  Instead, RDS should use the DMA
>>> memory pool interface to handle this.
>>> 
>>> Suggested-by: Håkon Bugge 
>>> Signed-off-by: Ka-Cheong Poon 
>> This is trading a one-time overhead for extra levels of dereferencing
>> on every single descriptor access in the fast paths.
>> I do not agree with this tradeoff, please implement this more
>> reasonably.
> 
> 
> The problem with the existing way of pre-allocation is
> that when there are a lot of RDS connections, the call to
> ib_dma_alloc_coherent() can fail because there are not
> enough contiguous memory pages available.  It is causing
> problems in production systems.
> 
> And the i_{recv|send|_hdrs_dma array dereferencing is done
> at send/receive ring initialization and refill.  It is not
> done at every access of the header.

The commit removes costly order:4 allocations (with the default 1024 #entries 
in the recv work-queue). These allocations may need reclaims and/or compaction. 
At worst, they may fail.

Consider a switch failure: thousands of RDS IB connections, and their
corresponding QPs, need to be resurrected. Each QP needs this order:4
allocation, and they are all created in close proximity in time, leading to an
immense memory hog.
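
For reference, the dma_pool approach looks roughly like the sketch below
(the names are illustrative only, not lifted from the actual patch); each
header becomes a small, naturally aligned coherent allocation instead of
one large contiguous region per ring:

    #include <linux/dmapool.h>

    struct dma_pool *pool;
    dma_addr_t hdr_dma;
    void *hdr;

    pool = dma_pool_create("rds_ib_hdrs", ib_dev->dma_device,
                           sizeof(struct rds_header), L1_CACHE_BYTES, 0);
    if (!pool)
            return -ENOMEM;

    /* one struct rds_header at a time; no order:4 allocation needed */
    hdr = dma_pool_zalloc(pool, GFP_KERNEL, &hdr_dma);
    if (!hdr) {
            dma_pool_destroy(pool);
            return -ENOMEM;
    }

    /* ... use hdr/hdr_dma for a ring entry, then on teardown: ... */
    dma_pool_free(pool, hdr, hdr_dma);
    dma_pool_destroy(pool);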


Thxs, Håkon




> Thanks.
> 
> 
> -- 
> K. Poon
> ka-cheong.p...@oracle.com
> 
> 
> 



Re: [PATCH] mlx4_ib: Increase the timeout for CM cache

2019-02-06 Thread Håkon Bugge



> On 6 Feb 2019, at 19:02, jackm  wrote:
> 
> On Wed, 6 Feb 2019 16:40:14 +0100
> Håkon Bugge  wrote:
> 
>> Jack,
>> 
>> A major contributor to the long processing time in the PF driver
>> proxying QP1 packets is:
>> 
>> create_pv_resources
>>   -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,  
>>   NULL, ctx, cq_size, 0);
>> 
>> That is, comp_vector is zero.
>> 
>> Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"),
>> the zero comp_vector has the intent of let the mlx4_core driver
>> select the least used vector.
>> 
>> But, in mlx4_ib_create_cq(), we have:
>> 
>>pr_info("eq_table: %p\n", dev->eq_table);
>>if (dev->eq_table) {
>>  vector = dev->eq_table[mlx4_choose_vector(dev->dev,
>> vector, ibdev->num_comp_vectors)];
>>}
>> 
>>cq->vector = vector;
>> 
>> and dev->eq_table is NULL, so all the CQs for the proxy QPs get
>> comp_vector zero.
>> 
>> I have to make some reservations, as this analysis is based on uek4,
>> but I think the code here is equal upstream, but need to double check.
>> 
>> 
>> Thxs, Håkon
>> 
> Hi Hakon and Jason,
> I was ill today (bad cold, took antihistamines all day, which knocked
> me out).
> I'll get to this tomorrow.

No problem Jack. I actually see that our uek4 is different in 
mlx4_ib_alloc_eqs(), and that may well be the root cause here.

Hence, I have moved the MLs to Bcc, and will get back to you tomorrow.


Thxs, Håkon



Re: [PATCH] mlx4_ib: Increase the timeout for CM cache

2019-02-06 Thread Håkon Bugge



> On 6 Feb 2019, at 09:50, Håkon Bugge  wrote:
> 
> 
> 
>> On 5 Feb 2019, at 23:36, Jason Gunthorpe  wrote:
>> 
>> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>>> Using CX-3 virtual functions, either from a bare-metal machine or
>>> pass-through from a VM, MAD packets are proxied through the PF driver.
>>> 
>>> Since the VMs have separate name spaces for MAD Transaction Ids
>>> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
>>> in a cache.
>>> 
>>> Following the RDMA CM protocol, it is clear when an entry has to
>>> evicted form the cache. But life is not perfect, remote peers may die
>>> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when
>>> the PF driver assumes the remote peer has gone.
>>> 
>>> We have experienced excessive amount of DREQ retries during fail-over
>>> testing, when running with eight VMs per database server.
>>> 
>>> The problem has been reproduced in a bare-metal system using one VM
>>> per physical node. In this environment, running 256 processes in each
>>> VM, each process uses RDMA CM to create an RC QP between himself and
>>> all (256) remote processes. All in all 16K QPs.
>>> 
>>> When tearing down these 16K QPs, excessive DREQ retries (and
>>> duplicates) are observed. With some cat/paste/awk wizardry on the
>>> infiniband_cm sysfs, we observe:
>>> 
>>> dreq:   5007
>>> cm_rx_msgs:
>>> drep:   3838
>>> dreq:  13018
>>>  rep:   8128
>>>  req:   8256
>>>  rtu:   8256
>>> cm_tx_msgs:
>>> drep:   8011
>>> dreq:  68856
>>>  rep:   8256
>>>  req:   8128
>>>  rtu:   8128
>>> cm_tx_retries:
>>> dreq:  60483
>>> 
>>> Note that the active/passive side is distributed.
>>> 
>>> Enabling pr_debug in cm.c gives tons of:
>>> 
>>> [171778.814239]  mlx4_ib_multiplex_cm_handler: id{slave:
>>> 1,sl_cm_id: 0xd393089f} is NULL!
>>> 
>>> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
>>> tear-down phase of the application is reduced from 113 to 67
>>> seconds. Retries/duplicates are also significantly reduced:
>>> 
>>> cm_rx_duplicates:
>>> dreq:   7726
>>> []
>>> cm_tx_retries:
>>> drep:  1
>>> dreq:   7779
>>> 
>>> Increasing the timeout further didn't help, as these duplicates and
>>> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
>>> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
>>> numbers fell down to about one hundred for both of them.
>>> 
>>> Adjustment of the CMA timeout is _not_ part of this commit.
>>> 
>>> Signed-off-by: Håkon Bugge 
>>> ---
>>> drivers/infiniband/hw/mlx4/cm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> Jack? What do you think?
> 
> I am tempted to send a v2 making this a sysctl tuneable. This because, 
> full-rack testing using 8 servers, each with 8 VMs, only showed 33% reduction 
> in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 
> 0xd393089f} is NULL" with this commit.
> 
> But sure, Jack's opinion matters.

Jack,

A major contributor to the long processing time in the PF driver proxying QP1 
packets is:

create_pv_resources
   -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
   NULL, ctx, cq_size, 0);

That is, comp_vector is zero.

Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"), the zero
comp_vector has the intent of letting the mlx4_core driver select the least used
vector.

But, in mlx4_ib_create_cq(), we have:

pr_info("eq_table: %p\n", dev->eq_table);
if (dev->eq_table) {
vector = dev->eq_table[mlx4_choose_vector(dev->dev, vector,
  
ibdev->num_comp_vectors)];
}

cq->vector = vector;

and dev->eq_table is NULL, so all the CQs for the proxy QPs get comp_vector 
zero.

I have to make some reservations, as this analysis is based on uek4, but I
think the code here is the same upstream; I need to double check.


Thxs, Håkon








> 
> 
> Thxs, Håkon
> 
>> 
>>> diff --git a/drivers/infiniband/hw/mlx4/cm.c 
>>> b/drivers/infiniband/hw/mlx4/cm.c
>>> index fedaf8260105..8c79a480f2b7 100644
>>> --- a/drivers/infiniband/hw/mlx4/cm.c
>>> +++ b/drivers/infiniband/hw/mlx4/cm.c
>>> @@ -39,7 +39,7 @@
>>> 
>>> #include "mlx4_ib.h"
>>> 
>>> -#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
>>> +#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
>>> 
>>> struct id_map_entry {
>>> struct rb_node node;
>>> -- 
>>> 2.20.1



Re: [PATCH] mlx4_ib: Increase the timeout for CM cache

2019-02-06 Thread Håkon Bugge



> On 5 Feb 2019, at 23:36, Jason Gunthorpe  wrote:
> 
> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>> Using CX-3 virtual functions, either from a bare-metal machine or
>> pass-through from a VM, MAD packets are proxied through the PF driver.
>> 
>> Since the VMs have separate name spaces for MAD Transaction Ids
>> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
>> in a cache.
>> 
>> Following the RDMA CM protocol, it is clear when an entry has to
>> evicted form the cache. But life is not perfect, remote peers may die
>> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when
>> the PF driver assumes the remote peer has gone.
>> 
>> We have experienced excessive amount of DREQ retries during fail-over
>> testing, when running with eight VMs per database server.
>> 
>> The problem has been reproduced in a bare-metal system using one VM
>> per physical node. In this environment, running 256 processes in each
>> VM, each process uses RDMA CM to create an RC QP between himself and
>> all (256) remote processes. All in all 16K QPs.
>> 
>> When tearing down these 16K QPs, excessive DREQ retries (and
>> duplicates) are observed. With some cat/paste/awk wizardry on the
>> infiniband_cm sysfs, we observe:
>> 
>>  dreq:   5007
>> cm_rx_msgs:
>>  drep:   3838
>>  dreq:  13018
>>   rep:   8128
>>   req:   8256
>>   rtu:   8256
>> cm_tx_msgs:
>>  drep:   8011
>>  dreq:  68856
>>   rep:   8256
>>   req:   8128
>>   rtu:   8128
>> cm_tx_retries:
>>  dreq:  60483
>> 
>> Note that the active/passive side is distributed.
>> 
>> Enabling pr_debug in cm.c gives tons of:
>> 
>> [171778.814239]  mlx4_ib_multiplex_cm_handler: id{slave:
>> 1,sl_cm_id: 0xd393089f} is NULL!
>> 
>> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
>> tear-down phase of the application is reduced from 113 to 67
>> seconds. Retries/duplicates are also significantly reduced:
>> 
>> cm_rx_duplicates:
>>  dreq:   7726
>> []
>> cm_tx_retries:
>>  drep:  1
>>  dreq:   7779
>> 
>> Increasing the timeout further didn't help, as these duplicates and
>> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
>> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
>> numbers fell down to about one hundred for both of them.
>> 
>> Adjustment of the CMA timeout is _not_ part of this commit.
>> 
>> Signed-off-by: Håkon Bugge 
>> ---
>> drivers/infiniband/hw/mlx4/cm.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Jack? What do you think?

I am tempted to send a v2 making this a sysctl tuneable. This is because
full-rack testing using 8 servers, each with 8 VMs, only showed a 33% reduction
in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 
0xd393089f} is NULL" with this commit.

But sure, Jack's opinion matters.


Thxs, Håkon

> 
>> diff --git a/drivers/infiniband/hw/mlx4/cm.c 
>> b/drivers/infiniband/hw/mlx4/cm.c
>> index fedaf8260105..8c79a480f2b7 100644
>> --- a/drivers/infiniband/hw/mlx4/cm.c
>> +++ b/drivers/infiniband/hw/mlx4/cm.c
>> @@ -39,7 +39,7 @@
>> 
>> #include "mlx4_ib.h"
>> 
>> -#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
>> +#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
>> 
>> struct id_map_entry {
>>  struct rb_node node;
>> -- 
>> 2.20.1



[PATCH] mlx4_ib: Increase the timeout for CM cache

2019-01-31 Thread Håkon Bugge
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VMs have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
in a cache.

Following the RDMA CM protocol, it is clear when an entry has to be
evicted from the cache. But life is not perfect; remote peers may die
or be rebooted. Hence, there is a timeout to wipe out a cache entry, when
the PF driver assumes the remote peer has gone.

We have experienced an excessive amount of DREQ retries during fail-over
testing, when running with eight VMs per database server.

The problem has been reproduced in a bare-metal system using one VM
per physical node. In this environment, running 256 processes in each
VM, each process uses RDMA CM to create an RC QP between itself and
all (256) remote processes. All in all, 16K QPs.

When tearing down these 16K QPs, excessive DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe:

  dreq:   5007
cm_rx_msgs:
  drep:   3838
  dreq:  13018
   rep:   8128
   req:   8256
   rtu:   8256
cm_tx_msgs:
  drep:   8011
  dreq:  68856
   rep:   8256
   req:   8128
   rtu:   8128
cm_tx_retries:
  dreq:  60483

Note that the active/passive side is distributed.

Enabling pr_debug in cm.c gives tons of:

[171778.814239]  mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from 113 to 67
seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
  dreq:   7726
[]
cm_tx_retries:
  drep:  1
  dreq:   7779

Increasing the timeout further didn't help, as these duplicates and
retries stem from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about one hundred for both of them.
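
(For reference, the CMA/CM timeout field is exponential, roughly
4.096 usec * 2^timeout:

    4.096 usec * 2^20 ~= 4.3 seconds
    4.096 usec * 2^22 ~= 17.2 seconds

which is where the ~4 and ~17 seconds above come from.)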

Adjustment of the CMA timeout is _not_ part of this commit.

Signed-off-by: Håkon Bugge 
---
 drivers/infiniband/hw/mlx4/cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index fedaf8260105..8c79a480f2b7 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -39,7 +39,7 @@
 
 #include "mlx4_ib.h"
 
-#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
+#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
 
 struct id_map_entry {
struct rb_node node;
-- 
2.20.1



Re: [PATCH net 2/2] net/mlx5e: Cleanup of dcbnl related fields

2018-08-14 Thread Håkon Bugge



> On 14 Aug 2018, at 11:01, Yuval Shaia  wrote:
> 
> On Wed, Aug 08, 2018 at 03:48:08PM -0700, Saeed Mahameed wrote:
>> From: Huy Nguyen 
>> 
>> Remove unused netdev_registered_init/remove in en.h
>> Return ENOSUPPORT if the check MLX5_DSCP_SUPPORTED fails.
> 
> s/ENOSUPPORT/EOPNOTSUPP
> (noted by Haakon)

Sure did,

> 
>> Remove extra white space

and I also said that this has nothing to do with the commit (no matter how 
tempting it can be) ;-)


Thxs, Håkon

>> 
>> Fixes: 2a5e7a1344f4 ("net/mlx5e: Add dcbnl dscp to priority support")
>> Signed-off-by: Huy Nguyen 
>> Cc: Yuval Shaia 
>> Reviewed-by: Parav Pandit 
>> Signed-off-by: Saeed Mahameed 
>> ---
>> drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 --
>> .../ethernet/mellanox/mlx5/core/en_dcbnl.c| 30 +++
>> 2 files changed, 11 insertions(+), 21 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
>> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> index eb9eb7aa953a..405236cf0b04 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> @@ -858,8 +858,6 @@ struct mlx5e_profile {
>>  mlx5e_fp_handle_rx_cqe handle_rx_cqe;
>>  mlx5e_fp_handle_rx_cqe handle_rx_cqe_mpwqe;
>>  } rx_handlers;
>> -void(*netdev_registered_init)(struct mlx5e_priv *priv);
>> -void(*netdev_registered_remove)(struct mlx5e_priv *priv);
>>  int max_tc;
>> };
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c 
>> b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
>> index e33afa8d2417..722998d68564 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
>> @@ -443,16 +443,12 @@ static int mlx5e_dcbnl_ieee_setapp(struct net_device 
>> *dev, struct dcb_app *app)
>>  bool is_new;
>>  int err;
>> 
>> -if (app->selector != IEEE_8021QAZ_APP_SEL_DSCP)
>> -return -EINVAL;
>> -
>> -if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager))
>> -return -EINVAL;
>> -
>> -if (!MLX5_DSCP_SUPPORTED(priv->mdev))
>> -return -EINVAL;
>> +if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager) ||
>> +!MLX5_DSCP_SUPPORTED(priv->mdev))
>> +return -EOPNOTSUPP;
>> 
>> -if (app->protocol >= MLX5E_MAX_DSCP)
>> +if ((app->selector != IEEE_8021QAZ_APP_SEL_DSCP) ||
>> +(app->protocol >= MLX5E_MAX_DSCP))
>>  return -EINVAL;
>> 
>>  /* Save the old entry info */
>> @@ -500,16 +496,12 @@ static int mlx5e_dcbnl_ieee_delapp(struct net_device 
>> *dev, struct dcb_app *app)
>>  struct mlx5e_priv *priv = netdev_priv(dev);
>>  int err;
>> 
>> -if (app->selector != IEEE_8021QAZ_APP_SEL_DSCP)
>> -return -EINVAL;
>> -
>> -if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager))
>> -return -EINVAL;
>> -
>> -if (!MLX5_DSCP_SUPPORTED(priv->mdev))
>> -return -EINVAL;
>> +if  (!MLX5_CAP_GEN(priv->mdev, vport_group_manager) ||
>> + !MLX5_DSCP_SUPPORTED(priv->mdev))
>> +return -EOPNOTSUPP;
>> 
>> -if (app->protocol >= MLX5E_MAX_DSCP)
>> +if ((app->selector != IEEE_8021QAZ_APP_SEL_DSCP) ||
>> +(app->protocol >= MLX5E_MAX_DSCP))
>>  return -EINVAL;
>> 
>>  /* Skip if no dscp app entry */
>> @@ -1146,7 +1138,7 @@ static int mlx5e_set_trust_state(struct mlx5e_priv 
>> *priv, u8 trust_state)
>> {
>>  int err;
>> 
>> -err =  mlx5_set_trust_state(priv->mdev, trust_state);
>> +err = mlx5_set_trust_state(priv->mdev, trust_state);
>>  if (err)
>>  return err;
>>  priv->dcbx_dp.trust_state = trust_state;
>> -- 
>> 2.17.0
>> 



Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-11 Thread Håkon Bugge


> On 11 May 2018, at 01:31, Qing Huang  wrote:
> 
> When a system is under memory presure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
> 
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
> 
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
> 
> With 4KB ICM chunk size, the above issue is fixed.
> 
> However in order to support 4KB ICM chunk size, we need to fix another
> issue in large size kcalloc allocations.
> 
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
> 
> The solution is to use vzalloc to replace kcalloc. There is no need
> for contiguous memory pages for a driver meta data structure (no need
> of DMA ops).
> 
> Signed-off-by: Qing Huang 
> Acked-by: Daniel Jurgens 
> ---
> drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
> 1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
> b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..2b17a4b 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
> #include "fw.h"
> 
> /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in 4KB page size chunks to avoid high order memory
> + * allocations in fragmented/high usage memory situation.
>  */
> enum {
> - MLX4_ICM_ALLOC_SIZE = 1 << 18,
> - MLX4_TABLE_CHUNK_SIZE   = 1 << 18
> + MLX4_ICM_ALLOC_SIZE = 1 << 12,
> + MLX4_TABLE_CHUNK_SIZE   = 1 << 12

Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the 
hw page size is 8KiB.
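
I.e., something like (sketch only):

    enum {
            MLX4_ICM_ALLOC_SIZE   = PAGE_SIZE,
            MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE
    };

so the chunk size follows the architecture's PAGE_SHIFT instead of a
hard-coded 4 KiB.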


Thxs, Håkon

> };
> 
> static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk 
> *chunk)
> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
> mlx4_icm_table *table,
>   obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>   num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
> 
> - table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> + table->icm  = vzalloc(num_icm * sizeof(*table->icm));
>   if (!table->icm)
>   return -ENOMEM;
>   table->virt = virt;
> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
> mlx4_icm_table *table,
>   mlx4_free_icm(dev, table->icm[i], use_coherent);
>   }
> 
> - kfree(table->icm);
> + vfree(table->icm);
> 
>   return -ENOMEM;
> }
> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct 
> mlx4_icm_table *table)
>   mlx4_free_icm(dev, table->icm[i], table->coherent);
>   }
> 
> - kfree(table->icm);
> + vfree(table->icm);
> }
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qp

2018-04-25 Thread Håkon Bugge


> On 25 Apr 2018, at 13:22, Dag Moxnes  wrote:
> 
> The function rds_ib_setup_qp is calling rds_ib_get_client_data and
> should correspondingly call rds_ib_dev_put. This call was lost in
> the non-error path with the introduction of error handling done in
> commit 3b12f73a5c29 ("rds: ib: add error handle")
> 
> Signed-off-by: Dag Moxnes 

Reviewed-by: Håkon Bugge 


Thxs, Håkon

> ---
> net/rds/ib_cm.c |3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
> index eea1d86..13b38ad 100644
> --- a/net/rds/ib_cm.c
> +++ b/net/rds/ib_cm.c
> @@ -547,7 +547,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
>   rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
>ic->i_send_cq, ic->i_recv_cq);
> 
> - return ret;
> + goto out;
> 
> sends_out:
>   vfree(ic->i_sends);
> @@ -572,6 +572,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
>   ic->i_send_cq = NULL;
> rds_ibdev_out:
>   rds_ib_remove_conn(rds_ibdev, conn);
> +out:
>   rds_ib_dev_put(rds_ibdev);
> 
>   return ret;
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH 1/1] Revert "rds: ib: add error handle"

2018-04-24 Thread Håkon Bugge


> On 24 Apr 2018, at 05:27, santosh.shilim...@oracle.com wrote:
> 
> On 4/23/18 6:39 PM, Zhu Yanjun wrote:
>> This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.
>> After long time discussion and investigations, it seems that there
>> is no mem leak. So this patch is reverted.
>> Signed-off-by: Zhu Yanjun 
>> ---
> Well your fix was not for any leaks but just proper labels for
> graceful exits. I don't know which long time discussion
> you are referring but there is no need to revert this change
> unless you see any issue with your change.

As spotted by Dag Moxnes, the patch fails to call rds_ib_dev_put() when 
exiting normally; it is only called on the error exit.


Thxs, Håkon

> 
> Regards,
> Santosh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-20 Thread Håkon Bugge
Hi Sowmini,

A little nit below. And some spelling issues in existing commentary you can 
consider fixing, since you reshuffle this file considerably.


Thxs, Håkon


> On 18 Mar 2018, at 21:45, Sowmini Varadhan  
> wrote:
> 
> On (03/18/18 00:55), Kirill Tkhai wrote:
>> 
>> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
>> another solution to do that, I'm not again that.
> 
> The patch below takes care of this. I've done some preliminary testing,
> and I'll send it upstream after doing additional self-review/testing.
> Please also take a look, if you can, to see if I missed something.
> 
> Thanks for the input,
> 
> --Sowmini
> ---patch follows
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 08ea9cd..87c2643 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
>   return err;
> }
> 
> -static void __net_exit rds_tcp_exit_net(struct net *net)
> -{
> - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -
> - if (rtn->rds_tcp_sysctl)
> - unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> -
> - if (net != &init_net && rtn->ctl_table)
> - kfree(rtn->ctl_table);
> -
> - /* If rds_tcp_exit_net() is called as a result of netns deletion,
> -  * the rds_tcp_kill_sock() device notifier would already have cleaned
> -  * up the listen socket, thus there is no work to do in this function.
> -  *
> -  * If rds_tcp_exit_net() is called as a result of module unload,
> -  * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
> -  * we do need to clean up the listen socket here.
> -  */
> - if (rtn->rds_tcp_listen_sock) {
> - struct socket *lsock = rtn->rds_tcp_listen_sock;
> -
> - rtn->rds_tcp_listen_sock = NULL;
> - rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
> - }
> -}
> -
> -static struct pernet_operations rds_tcp_net_ops = {
> - .init = rds_tcp_init_net,
> - .exit = rds_tcp_exit_net,
> - .id = &rds_tcp_netid,
> - .size = sizeof(struct rds_tcp_net),
> - .async = true,
> -};
> -
> static void rds_tcp_kill_sock(struct net *net)
> {
>   struct rds_tcp_connection *tc, *_tc;
> @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
>   rds_conn_destroy(tc->t_cpath->cp_conn);
> }
> 
> -void *rds_tcp_listen_sock_def_readable(struct net *net)
> +static void __net_exit rds_tcp_exit_net(struct net *net)
> {
>   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> - struct socket *lsock = rtn->rds_tcp_listen_sock;
> 
> - if (!lsock)
> - return NULL;
> + rds_tcp_kill_sock(net);
> 
> - return lsock->sk->sk_user_data;
> + if (rtn->rds_tcp_sysctl)
> + unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> +
> + if (net != &init_net && rtn->ctl_table)
> + kfree(rtn->ctl_table);

Well, this comes from the existing code, but as pointed out by Linus recently, 
kfree() handles NULL pointers. So, if rtn->ctl_table is most likely NULL, the 
code is OK _if_ you add an unlikely() around the if-clause. Otherwise, the 
"&& rtn->ctl_table" can simply be removed.
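
Concretely, using the names from the quoted code above, either of these
two variants would do (sketches):

    /* variant 1: rely on kfree(NULL) being a no-op */
    if (net != &init_net)
            kfree(rtn->ctl_table);

    /* variant 2: keep the check, but mark it as the unlikely path */
    if (unlikely(net != &init_net && rtn->ctl_table))
            kfree(rtn->ctl_table);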

> }
> 
> -static int rds_tcp_dev_event(struct notifier_block *this,
> -  unsigned long event, void *ptr)
> +static struct pernet_operations rds_tcp_net_ops = {
> + .init = rds_tcp_init_net,
> + .exit = rds_tcp_exit_net,
> + .id = &rds_tcp_netid,
> + .size = sizeof(struct rds_tcp_net),
> + .async = true,
> +};
> +
> +void *rds_tcp_listen_sock_def_readable(struct net *net)
> {
> - struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> + struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> + struct socket *lsock = rtn->rds_tcp_listen_sock;
> 
> - /* rds-tcp registers as a pernet subys, so the ->exit will only
> -  * get invoked after network acitivity has quiesced. We need to
> -  * clean up all sockets  to quiesce network activity, and use
> -  * the unregistration of the per-net loopback device as a trigger
> -  * to start that cleanup.
> -  */
> - if (event == NETDEV_UNREGISTER_FINAL &&
> - dev->ifindex == LOOPBACK_IFINDEX)
> - rds_tcp_kill_sock(dev_net(dev));
> + if (!lsock)
> + return NULL;
> 
> - return NOTIFY_DONE;
> + return lsock->sk->sk_user_data;
> }
> 
> -static struct notifier_block rds_tcp_dev_notifier = {
> - .notifier_call= rds_tcp_dev_event,
> - .priority = -10, /* must be called after other network notifiers */
> -};
> -
> /* when sysctl is used to modify some kernel socket parameters,this

s/when/When/
s/parameters,this/parameters, this/

Well, not part of your commit.


>  * function  resets the RDS connections in that netns  so that we can

Two double-space incidents above.

Not part of your commit.

[PATCH net] rds: Fix NULL pointer dereference in __rds_rdma_map

2017-12-06 Thread Håkon Bugge
This is a fix for syzkaller719569, where memory registration was
attempted without any underlying transport being loaded.

Analysis of the case reveals that it is the setsockopt() RDS_GET_MR
(2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable.

Here is an example stack trace when the bug is hit:

BUG: unable to handle kernel NULL pointer dereference at 00c0
IP: __rds_rdma_map+0x36/0x440 [rds]
PGD 2f93d03067 P4D 2f93d03067 PUD 2f93d02067 PMD 0
Oops:  [#1] SMP
Modules linked in: bridge stp llc tun rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache rds binfmt_misc sb_edac intel_powerclamp
coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd
iTCO_wdt mei_me sg iTCO_vendor_support ipmi_si mei ipmi_devintf nfsd
shpchp pcspkr i2c_i801 ioatdma ipmi_msghandler wmi lpc_ich mfd_core
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2
mgag200 i2c_algo_bit drm_kms_helper ixgbe syscopyarea ahci sysfillrect
sysimgblt libahci mdio fb_sys_fops ttm ptp libata sd_mod mlx4_core drm
crc32c_intel pps_core megaraid_sas i2c_core dca dm_mirror
dm_region_hash dm_log dm_mod
CPU: 48 PID: 45787 Comm: repro_set2 Not tainted 4.14.2-3.el7uek.x86_64 #2
Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 
3111 03/03/2017
task: 882f9190db00 task.stack: c9002b994000
RIP: 0010:__rds_rdma_map+0x36/0x440 [rds]
RSP: 0018:c9002b997df0 EFLAGS: 00010202
RAX:  RBX: 882fa2182580 RCX: 
RDX:  RSI: c9002b997e40 RDI: 882fa2182580
RBP: c9002b997e30 R08:  R09: 0002
R10: 885fb29e3838 R11:  R12: 882fa2182580
R13: 882fa2182580 R14: 0002 R15: 2ffc
FS:  7fbffa20b700() GS:882fbfb8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00c0 CR3: 002f98a66006 CR4: 001606e0
Call Trace:
 rds_get_mr+0x56/0x80 [rds]
 rds_setsockopt+0x172/0x340 [rds]
 ? __fget_light+0x25/0x60
 ? __fdget+0x13/0x20
 SyS_setsockopt+0x80/0xe0
 do_syscall_64+0x67/0x1b0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fbff9b117f9
RSP: 002b:7fbffa20aed8 EFLAGS: 0293 ORIG_RAX: 0036
RAX: ffda RBX: 000c84a4 RCX: 7fbff9b117f9
RDX: 0002 RSI: 4114 RDI: 109b
RBP: 7fbffa20af10 R08: 0020 R09: 7fbff9dd7860
R10: 2ffc R11: 0293 R12: 
R13: 7fbffa20b9c0 R14: 7fbffa20b700 R15: 0021

Code: 41 56 41 55 49 89 fd 41 54 53 48 83 ec 18 8b 87 f0 02 00 00 48
89 55 d0 48 89 4d c8 85 c0 0f 84 2d 03 00 00 48 8b 87 00 03 00 00 <48>
83 b8 c0 00 00 00 00 0f 84 25 03 00 00 48 8b 06 48 8b 56 08

The fix is to check the existence of an underlying transport in
__rds_rdma_map().

Signed-off-by: Håkon Bugge 
Reported-by: syzbot 
---
 net/rds/rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 8886f15abe90..bc2f1e0977d6 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -183,7 +183,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
long i;
int ret;
 
-   if (rs->rs_bound_addr == 0) {
+   if (rs->rs_bound_addr == 0 || !rs->rs_transport) {
ret = -ENOTCONN; /* XXX not a great errno */
goto out;
}
-- 
2.13.6



[PATCH net] rds: ib: Fix NULL pointer dereference in debug code

2017-11-07 Thread Håkon Bugge
() and ib_ib_port_recv() statements:

   /* XXX when can this fail? */
   ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+   if (can_wait)
+   usleep_range(1000, 5000);
   rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
(long) ib_sg_dma_address(

The fix is simply to move the rdsdebug() statement up before the
ib_post_recv() and remove the printing of ret, which is taken care of
anyway by the non-debug code.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
Reviewed-by: Wei Lin Guay 
---
 net/rds/ib_recv.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 9722bf8..b4e421a 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -410,14 +410,14 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
break;
}
 
-   /* XXX when can this fail? */
-   ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
-   rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
+   rdsdebug("recv %p ibinc %p page %p addr %lu\n", recv,
 recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
 (long) ib_sg_dma_address(
ic->i_cm_id->device,
-   &recv->r_frag->f_sg),
-   ret);
+   &recv->r_frag->f_sg));
+
+   /* XXX when can this fail? */
+   ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
if (ret) {
rds_ib_conn_error(conn, "recv post on "
   "%pI4 returned %d, disconnecting and "
-- 
2.9.3



[PATCH] rds: ib: Fix uninitialized variable

2017-10-24 Thread Håkon Bugge
send_flags needs to be initialized before calling
rds_ib_set_wr_signal_state().

Signed-off-by: Håkon Bugge 
Acked-by: Santosh Shilimkar 
---
 net/rds/ib_send.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 6ab39db..8f46755 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -792,6 +792,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, struct rm_atomic_op *op)
send->s_atomic_wr.compare_add_mask = op->op_m_fadd.nocarry_mask;
send->s_atomic_wr.swap_mask = 0;
}
+   send->s_wr.send_flags = 0;
nr_sig = rds_ib_set_wr_signal_state(ic, send, op->op_notify);
send->s_atomic_wr.wr.num_sge = 1;
send->s_atomic_wr.wr.next = NULL;
-- 
1.8.3.1



Re: [PATCH] rds: Fix uninitialized variable

2017-10-24 Thread Håkon Bugge
Ignore this, sent the old one.

Håkon

> On 24 Oct 2017, at 18:07, Håkon Bugge  wrote:
> 
> send_flags needs to be initialized before calling
> rds_ib_set_wr_signal_state().
> 
> Signed-off-by: Håkon Bugge 
> ---
> net/rds/ib_send.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
> index 6ab39db..8f46755 100644
> --- a/net/rds/ib_send.c
> +++ b/net/rds/ib_send.c
> @@ -792,6 +792,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, 
> struct rm_atomic_op *op)
>   send->s_atomic_wr.compare_add_mask = op->op_m_fadd.nocarry_mask;
>   send->s_atomic_wr.swap_mask = 0;
>   }
> + send->s_wr.send_flags = 0;
>   nr_sig = rds_ib_set_wr_signal_state(ic, send, op->op_notify);
>   send->s_atomic_wr.wr.num_sge = 1;
>   send->s_atomic_wr.wr.next = NULL;
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[PATCH] rds: Fix uninitialized variable

2017-10-24 Thread Håkon Bugge
send_flags needs to be initialized before calling
rds_ib_set_wr_signal_state().

Signed-off-by: Håkon Bugge 
---
 net/rds/ib_send.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 6ab39db..8f46755 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -792,6 +792,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, struct rm_atomic_op *op)
send->s_atomic_wr.compare_add_mask = op->op_m_fadd.nocarry_mask;
send->s_atomic_wr.swap_mask = 0;
}
+   send->s_wr.send_flags = 0;
nr_sig = rds_ib_set_wr_signal_state(ic, send, op->op_notify);
send->s_atomic_wr.wr.num_sge = 1;
send->s_atomic_wr.wr.next = NULL;
-- 
1.8.3.1



[PATCH] rds: Fix inaccurate accounting of unsignaled wrs

2017-10-24 Thread Håkon Bugge
The number of unsignaled work-requests posted to the IB send queue is
tracked by a counter in the rds_ib_connection struct. When it reaches
zero, or the caller explicitly asks for it, the send-signaled bit is
set in send_flags and the counter is reset. This is performed by the
rds_ib_set_wr_signal_state() function.

However, this function is not always used, which yields inaccurate
accounting. This commit fixes this, re-factors some code bloat related to
the matter, and makes the actual parameter type of the function
consistent.

Signed-off-by: Håkon Bugge 
---
 net/rds/ib_send.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 6ab39db..d07ecb0 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -661,13 +661,15 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
}
}
 
-   rds_ib_set_wr_signal_state(ic, send, 0);
+   rds_ib_set_wr_signal_state(ic, send, false);
 
/*
 * Always signal the last one if we're stopping due to flow control.
 */
-   if (ic->i_flowctl && flow_controlled && i == (work_alloc-1))
-   send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+   if (ic->i_flowctl && flow_controlled && i == (work_alloc - 1)) {
+   rds_ib_set_wr_signal_state(ic, send, true);
+   send->s_wr.send_flags |= IB_SEND_SOLICITED;
+   }
 
if (send->s_wr.send_flags & IB_SEND_SIGNALED)
nr_sig++;
@@ -705,11 +707,8 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
if (scat == &rm->data.op_sg[rm->data.op_count]) {
prev->s_op = ic->i_data_op;
prev->s_wr.send_flags |= IB_SEND_SOLICITED;
-   if (!(prev->s_wr.send_flags & IB_SEND_SIGNALED)) {
-   ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
-   prev->s_wr.send_flags |= IB_SEND_SIGNALED;
-   nr_sig++;
-   }
+   if (!(prev->s_wr.send_flags & IB_SEND_SIGNALED))
+   nr_sig += rds_ib_set_wr_signal_state(ic, prev, true);
ic->i_data_op = NULL;
}
 
-- 
1.8.3.1



[PATCH net v2] rds: Fix incorrect statistics counting

2017-09-06 Thread Håkon Bugge
In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock and has incremented the send_gen,
it is considered a race and we yield. The code incrementing the
s_send_lock_queue_raced statistics counter did not count this event
correctly.

This commit counts the race condition correctly.

Changes from v1:
- Removed check for *someone_on_xmit()*
- Fixed incorrect indentation

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---
 net/rds/send.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 058a407..b52cdc8 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -428,14 +428,18 @@ int rds_send_xmit(struct rds_conn_path *cp)
 * some work and we will skip our goto
 */
if (ret == 0) {
+   bool raced;
+
smp_mb();
+   raced = send_gen != READ_ONCE(cp->cp_send_gen);
+
if ((test_bit(0, &conn->c_map_queued) ||
-!list_empty(&cp->cp_send_queue)) &&
-   send_gen == READ_ONCE(cp->cp_send_gen)) {
-   rds_stats_inc(s_send_lock_queue_raced);
+   !list_empty(&cp->cp_send_queue)) && !raced) {
if (batch_count < send_batch_count)
goto restart;
queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+   } else if (raced) {
+   rds_stats_inc(s_send_lock_queue_raced);
}
}
 out:
-- 
2.9.3



Re: [PATCH net] rds: Fix incorrect statistics counting

2017-09-06 Thread Håkon Bugge

> On 6 Sep 2017, at 17:58, Santosh Shilimkar  
> wrote:
> 
> On 9/6/2017 8:29 AM, Håkon Bugge wrote:
>> In rds_send_xmit() there is logic to batch the sends. However, if
>> another thread has acquired the lock, it is considered a race and we
>> yield. The code incrementing the s_send_lock_queue_raced statistics
>> counter did not count this event correctly.
>> This commit removes a small race in determining the race and
>> increments the statistics counter correctly.
>> Signed-off-by: Håkon Bugge 
>> Reviewed-by: Knut Omang 
>> ---
>>  net/rds/send.c | 16 +---
>>  1 file changed, 13 insertions(+), 3 deletions(-)
> Those counters are not really to give that accurate so
> am not very keen to add additional cycles in send paths
> and add additional code. Have you seen any real issue
> or this is just a observation. s_send_lock_queue_raced
> counter is never used to check for smaller increments
> and hence the question.

Hi Santosh,


Yes, I agree regarding the accuracy of s_send_lock_queue_raced. But the main point is 
that the existing code counts some partial share of the cases when it is _not_ raced.

So, in the critical path, my patch adds one test_bit(), which hits the local 
CPU cache, if not raced. If raced, some other thread is in control, so I would 
not think the added cycles would make any big difference.

I can send a v2 where the race tightening is removed if you like.


Thxs, Håkon



[PATCH net] rds: Fix incorrect statistics counting

2017-09-06 Thread Håkon Bugge
In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock, it is considered a race and we
yield. The code incrementing the s_send_lock_queue_raced statistics
counter did not count this event correctly.

This commit removes a small race in determining the race and
increments the statistics counter correctly.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---
 net/rds/send.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 058a407..ecfe0b5 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -101,6 +101,11 @@ void rds_send_path_reset(struct rds_conn_path *cp)
 }
 EXPORT_SYMBOL_GPL(rds_send_path_reset);
 
+static bool someone_in_xmit(struct rds_conn_path *cp)
+{
+   return test_bit(RDS_IN_XMIT, &cp->cp_flags);
+}
+
 static int acquire_in_xmit(struct rds_conn_path *cp)
 {
return test_and_set_bit(RDS_IN_XMIT, &cp->cp_flags) == 0;
@@ -428,14 +433,19 @@ int rds_send_xmit(struct rds_conn_path *cp)
 * some work and we will skip our goto
 */
if (ret == 0) {
+   bool raced;
+
smp_mb();
+   raced = someone_in_xmit(cp) ||
+   send_gen != READ_ONCE(cp->cp_send_gen);
+
if ((test_bit(0, &conn->c_map_queued) ||
-!list_empty(&cp->cp_send_queue)) &&
-   send_gen == READ_ONCE(cp->cp_send_gen)) {
-   rds_stats_inc(s_send_lock_queue_raced);
+   !list_empty(&cp->cp_send_queue)) && !raced) {
if (batch_count < send_batch_count)
goto restart;
queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+   } else if (raced) {
+   rds_stats_inc(s_send_lock_queue_raced);
}
}
 out:
-- 
2.9.3



[PATCH net] rds: Fix non-atomic operation on shared flag variable

2017-09-05 Thread Håkon Bugge
The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
Reviewed-by: Wei Lin Guay 
---
 net/rds/send.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 41b9f0f..058a407 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
len = ntohl(rm->m_inc.i_hdr.h_len);
if (cp->cp_unacked_packets == 0 ||
cp->cp_unacked_bytes < len) {
-   __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+   set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 
cp->cp_unacked_packets =
rds_sysctl_max_unacked_packets;
@@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 * throughput hits a certain threshold.
 */
if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
-   __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+   set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 
list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
-- 
2.9.3



[PATCH net] rds: Reintroduce statistics counting

2017-08-08 Thread Håkon Bugge
In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection
while senders are present"), refilling the receive queue was removed
from rds_ib_recv(), along with the increment of
s_ib_rx_refill_from_thread.

Commit 73ce4317bf98 ("RDS: make sure we post recv buffers")
re-introduces filling the receive queue from rds_ib_recv(), but does
not add the statistics counter. rds_ib_recv() was later renamed to
rds_ib_recv_path().

This commit reintroduces the statistics counting of
s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
Reviewed-by: Wei Lin Guay 
---
 net/rds/ib_recv.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index e10624a..9722bf8 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -1015,8 +1015,10 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
if (rds_ib_ring_empty(&ic->i_recv_ring))
rds_ib_stats_inc(s_ib_rx_ring_empty);
 
-   if (rds_ib_ring_low(&ic->i_recv_ring))
+   if (rds_ib_ring_low(&ic->i_recv_ring)) {
rds_ib_recv_refill(conn, 0, GFP_NOWAIT);
+   rds_ib_stats_inc(s_ib_rx_refill_from_cq);
+   }
 }
 
 int rds_ib_recv_path(struct rds_conn_path *cp)
@@ -1029,6 +1031,7 @@ int rds_ib_recv_path(struct rds_conn_path *cp)
if (rds_conn_up(conn)) {
rds_ib_attempt_ack(ic);
rds_ib_recv_refill(conn, 0, GFP_KERNEL);
+   rds_ib_stats_inc(s_ib_rx_refill_from_thread);
}
 
return ret;
-- 
2.9.3



Re: [PATCH net] rds: Make sure updates to cp_send_gen can be observed

2017-07-20 Thread Håkon Bugge

> On 20 Jul 2017, at 13:02, Sowmini Varadhan  
> wrote:
> 
> On (07/20/17 12:28), Håkon Bugge wrote:
>> cp->cp_send_gen is treated as a normal variable, although it may be
>> used by different threads.
> 
> I'm confused by that assertion. If you look at the comments right
> above the change in your patch, there is a note that 
> acquire_in_xmit/release_in_xmit are the synchronization/serialization 
> points.
> 
> Can you please clarify?

The way the original code works is that the compiler is allowed to keep
the value of "cp->cp_send_gen + 1" in a register. The compiler has no
requirement to store this value to memory before leaving the function or
calling another one.

Further, said register can be used in the comparison outside the
acquire_in_xmit/release_in_xmit protection, at which point another thread may
have changed its value.


Thxs, Håkon

> 
>> --- a/net/rds/send.c
>> +++ b/net/rds/send.c
>> @@ -170,8 +170,8 @@ int rds_send_xmit(struct rds_conn_path *cp)
>>   * The acquire_in_xmit() check above ensures that only one
>>   * caller can increment c_send_gen at any time.
>>   */
>> -cp->cp_send_gen++;
>> -send_gen = cp->cp_send_gen;
>> +send_gen = READ_ONCE(cp->cp_send_gen) + 1;
>> +WRITE_ONCE(cp->cp_send_gen, send_gen);
>> 
> 
> --Sowmini
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[PATCH net] rds: Make sure updates to cp_send_gen can be observed

2017-07-20 Thread Håkon Bugge
cp->cp_send_gen is treated as a normal variable, although it may be
used by different threads.

This is fixed by using {READ,WRITE}_ONCE when it is incremented and
READ_ONCE when it is read outside the {acquire,release}_in_xmit
protection.

Normative reference from the Linux-Kernel Memory Model:

Loads from and stores to shared (but non-atomic) variables should
be protected with the READ_ONCE(), WRITE_ONCE(), and
ACCESS_ONCE().

Clause 5.1.2.4/25 in the C standard is also relevant.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---
 net/rds/send.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 5cc6403..fa0368c 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -170,8 +170,8 @@ int rds_send_xmit(struct rds_conn_path *cp)
 * The acquire_in_xmit() check above ensures that only one
 * caller can increment c_send_gen at any time.
 */
-   cp->cp_send_gen++;
-   send_gen = cp->cp_send_gen;
+   send_gen = READ_ONCE(cp->cp_send_gen) + 1;
+   WRITE_ONCE(cp->cp_send_gen, send_gen);
 
/*
 * rds_conn_shutdown() sets the conn state and then tests RDS_IN_XMIT,
@@ -431,7 +431,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
smp_mb();
if ((test_bit(0, &conn->c_map_queued) ||
 !list_empty(&cp->cp_send_queue)) &&
-   send_gen == cp->cp_send_gen) {
+   send_gen == READ_ONCE(cp->cp_send_gen)) {
rds_stats_inc(s_send_lock_queue_raced);
if (batch_count < send_batch_count)
goto restart;
-- 
2.9.3