[PATCH for-next v3 2/2] rds: ib: Remove two ib_modify_qp() calls
For some HCAs, ib_modify_qp() is an expensive operation running virtualized. For both the active and passive side, the QP returned by the CM has the state set to RTS, so no need for this excess RTS -> RTS transition. With IB Core's ability to set the RNR Retry timer, we use this interface to shave off another ib_modify_qp(). Fixes: ec16227e1414 ("RDS/IB: Infiniband transport") Signed-off-by: Håkon Bugge --- net/rds/ib_cm.c | 35 +-- net/rds/rdma_transport.c | 1 + 2 files changed, 2 insertions(+), 34 deletions(-) diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index f5cbe96..26b069e 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -68,31 +68,6 @@ static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) } /* - * Tune RNR behavior. Without flow control, we use a rather - * low timeout, but not the absolute minimum - this should - * be tunable. - * - * We already set the RNR retry count to 7 (which is the - * smallest infinite number :-) above. - * If flow control is off, we want to change this back to 0 - * so that we learn quickly when our credit accounting is - * buggy. - * - * Caller passes in a qp_attr pointer - don't waste stack spacv - * by allocation this twice. - */ -static void -rds_ib_tune_rnr(struct rds_ib_connection *ic, struct ib_qp_attr *attr) -{ - int ret; - - attr->min_rnr_timer = IB_RNR_TIMER_000_32; - ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_MIN_RNR_TIMER); - if (ret) - printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret); -} - -/* * Connection established. * We get here for both outgoing and incoming connection. */ @@ -100,7 +75,6 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even { struct rds_ib_connection *ic = conn->c_transport_data; const union rds_ib_conn_priv *dp = NULL; - struct ib_qp_attr qp_attr; __be64 ack_seq = 0; __be32 credit = 0; u8 major = 0; @@ -168,14 +142,6 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even * the posted credit count. */ rds_ib_recv_refill(conn, 1, GFP_KERNEL); - /* Tune RNR behavior */ - rds_ib_tune_rnr(ic, &qp_attr); - - qp_attr.qp_state = IB_QPS_RTS; - err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); - if (err) - printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err); - /* update ib_device with this local ipaddr */ err = rds_ib_update_ipaddr(ic->rds_ibdev, &conn->c_laddr); if (err) @@ -947,6 +913,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, event->param.conn.responder_resources, event->param.conn.initiator_depth, isv6); + rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32); /* rdma_accept() calls rdma_reject() internally if it fails */ if (rdma_accept(cm_id, &conn_param)) rds_ib_conn_error(conn, "rdma_accept failed\n"); diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 5f741e5..a9e4ff9 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -87,6 +87,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id, case RDMA_CM_EVENT_ADDR_RESOLVED: rdma_set_service_type(cm_id, conn->c_tos); + rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32); /* XXX do we need to clean up if this fails? */ ret = rdma_resolve_route(cm_id, RDS_RDMA_RESOLVE_TIMEOUT_MS); -- 1.8.3.1
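For reference, a minimal sketch of the passive-side ordering this patch relies on, using the rdma_set_min_rnr_timer() interface introduced in patch 1/2: the timer must be set before rdma_accept(). The handler name and parameters below are hypothetical, not the RDS code itself.

#include <rdma/rdma_cm.h>
#include <rdma/ib_verbs.h>

/* Hypothetical passive-side handler: set the RNR timer before rdma_accept() */
static int example_handle_connect_request(struct rdma_cm_id *cm_id,
					  struct rdma_conn_param *conn_param)
{
	int ret;

	/* Applied by RDMA CM in the INIT -> RTR transition, no extra ib_modify_qp() */
	ret = rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32);
	if (ret)
		return ret;

	/* rdma_accept() calls rdma_reject() internally if it fails */
	return rdma_accept(cm_id, conn_param);
}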
[PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS
ib_modify_qp() is an expensive operation on some HCAs when running virtualized. This series removes two ib_modify_qp() calls from RDS. I am sending this as a v3, even though it is the first version sent to net. This is because the IB Core commit has reached v3.

Håkon Bugge (2):
  IB/cma: Introduce rdma_set_min_rnr_timer()
  rds: ib: Remove two ib_modify_qp() calls

 drivers/infiniband/core/cma.c      | 41 ++
 drivers/infiniband/core/cma_priv.h |  2 ++
 include/rdma/rdma_cm.h             |  2 ++
 net/rds/ib_cm.c                    | 35 +---
 net/rds/rdma_transport.c           |  1 +
 5 files changed, 47 insertions(+), 34 deletions(-)

--
1.8.3.1
[PATCH for-next v3 1/2] IB/cma: Introduce rdma_set_min_rnr_timer()
Introduce the ability for kernel ULPs to adjust the minimum RNR Retry timer. The INIT -> RTR transition executed by RDMA CM will be used for this adjustment. This avoids an additional ib_modify_qp() call. rdma_set_min_rnr_timer() must be called before the call to rdma_connect() on the active side and before the call to rdma_accept() on the passive side. The default value of RNR Retry timer is zero, which translates to 655 ms. When the receiver is not ready to accept a send messages, it encodes the RNR Retry timer value in the NAK. The requestor will then wait at least the specified time value before retrying the send. The 5-bit value to be supplied to the rdma_set_min_rnr_timer() is documented in IBTA Table 45: "Encoding for RNR NAK Timer Field". Signed-off-by: Håkon Bugge Acked-by: Jason Gunthorpe --- drivers/infiniband/core/cma.c | 41 ++ drivers/infiniband/core/cma_priv.h | 2 ++ include/rdma/rdma_cm.h | 2 ++ 3 files changed, 45 insertions(+) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 9409651..5ce097d 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -852,6 +852,7 @@ static void cma_id_put(struct rdma_id_private *id_priv) id_priv->id.qp_type = qp_type; id_priv->tos_set = false; id_priv->timeout_set = false; + id_priv->min_rnr_timer_set = false; id_priv->gid_type = IB_GID_TYPE_IB; spin_lock_init(&id_priv->lock); mutex_init(&id_priv->qp_mutex); @@ -1141,6 +1142,9 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, if ((*qp_attr_mask & IB_QP_TIMEOUT) && id_priv->timeout_set) qp_attr->timeout = id_priv->timeout; + if ((*qp_attr_mask & IB_QP_MIN_RNR_TIMER) && id_priv->min_rnr_timer_set) + qp_attr->min_rnr_timer = id_priv->min_rnr_timer; + return ret; } EXPORT_SYMBOL(rdma_init_qp_attr); @@ -2615,6 +2619,43 @@ int rdma_set_ack_timeout(struct rdma_cm_id *id, u8 timeout) } EXPORT_SYMBOL(rdma_set_ack_timeout); +/** + * rdma_set_min_rnr_timer() - Set the minimum RNR Retry timer of the + * QP associated with a connection identifier. + * @id: Communication identifier to associated with service type. + * @min_rnr_timer: 5-bit value encoded as Table 45: "Encoding for RNR NAK + *Timer Field" in the IBTA specification. + * + * This function should be called before rdma_connect() on active + * side, and on passive side before rdma_accept(). The timer value + * will be associated with the local QP. When it receives a send it is + * not read to handle, typically if the receive queue is empty, an RNR + * Retry NAK is returned to the requester with the min_rnr_timer + * encoded. The requester will then wait at least the time specified + * in the NAK before retrying. The default is zero, which translates + * to a minimum RNR Timer value of 655 ms. 
+ * + * Return: 0 for success + */ +int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer) +{ + struct rdma_id_private *id_priv; + + /* It is a five-bit value */ + if (min_rnr_timer & 0xe0) + return -EINVAL; + + if (id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->min_rnr_timer = min_rnr_timer; + id_priv->min_rnr_timer_set = true; + + return 0; +} +EXPORT_SYMBOL(rdma_set_min_rnr_timer); + static void cma_query_handler(int status, struct sa_path_rec *path_rec, void *context) { diff --git a/drivers/infiniband/core/cma_priv.h b/drivers/infiniband/core/cma_priv.h index caece96..bf83d32 100644 --- a/drivers/infiniband/core/cma_priv.h +++ b/drivers/infiniband/core/cma_priv.h @@ -86,9 +86,11 @@ struct rdma_id_private { u8 tos; u8 tos_set:1; u8 timeout_set:1; + u8 min_rnr_timer_set:1; u8 reuseaddr; u8 afonly; u8 timeout; + u8 min_rnr_timer; enum ib_gid_typegid_type; /* diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 32a67af..8b0f66e 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -331,6 +331,8 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, int rdma_set_afonly(struct rdma_cm_id *id, int afonly); int rdma_set_ack_timeout(struct rdma_cm_id *id, u8 timeout); + +int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer); /** * rdma_get_service_id - Return the IB service ID for a specified address. * @id: Communication identifier associated with the address. -- 1.8.3.1
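A minimal sketch of how a kernel ULP might use the new interface on the active side, assuming only what the patch documents (call it before rdma_connect()); the function name below is hypothetical.

#include <rdma/rdma_cm.h>
#include <rdma/ib_verbs.h>

static int example_ulp_connect(struct rdma_cm_id *cm_id,
			       struct rdma_conn_param *conn_param)
{
	int ret;

	/* 5-bit IBTA Table 45 encoding; IB_RNR_TIMER_000_32 is 0.32 ms */
	ret = rdma_set_min_rnr_timer(cm_id, IB_RNR_TIMER_000_32);
	if (ret)
		return ret;

	/* The value is applied when RDMA CM performs the INIT -> RTR transition */
	return rdma_connect(cm_id, conn_param);
}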
Re: [Linux-kernel-mentees] [PATCH net] rds: Prevent kernel-infoleak in rds_notify_queue_get()
> On 31 Jul 2020, at 13:59, Greg Kroah-Hartman > wrote: > > On Fri, Jul 31, 2020 at 01:14:09PM +0200, Håkon Bugge wrote: >> >> >>> On 31 Jul 2020, at 11:59, Dan Carpenter wrote: >>> >>> On Fri, Jul 31, 2020 at 07:53:01AM +0300, Leon Romanovsky wrote: >>>> On Thu, Jul 30, 2020 at 03:20:26PM -0400, Peilin Ye wrote: >>>>> rds_notify_queue_get() is potentially copying uninitialized kernel stack >>>>> memory to userspace since the compiler may leave a 4-byte hole at the end >>>>> of `cmsg`. >>>>> >>>>> In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which >>>>> unfortunately does not always initialize that 4-byte hole. Fix it by using >>>>> memset() instead. >>>> >>>> Of course, this is the difference between "{ 0 }" and "{}" initializations. >>>> >>> >>> No, there is no difference. Even struct assignments like: >>> >>> foo = *bar; >>> >>> can leave struct holes uninitialized. Depending on the compiler the >>> assignment can be implemented as a memset() or as a series of struct >>> member assignments. >> >> What about: >> >> struct rds_rdma_notify { >> __u64 user_token; >> __s32 status; >> } __attribute__((packed)); > > Why is this still a discussion at all? > > Try it and see, run pahole and see if there are holes in this structure > (odds are no), you don't need us to say what is happening here... An older posting had this: $ pahole -C "rds_rdma_notify" net/rds/recv.o struct rds_rdma_notify { __u64 user_token; /* 0 8 */ __s32 status; /* 8 4 */ /* size: 16, cachelines: 1, members: 2 */ /* padding: 4 */ /* last cacheline: 16 bytes */ }; Thxs, Håkon
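A small user-space illustration of the point being made, assuming the 64-bit layout shown by pahole above (struct and function names are illustrative): the unpacked struct has 4 bytes of tail padding, and only memset() is guaranteed to clear those bytes before the structure is copied to user space.

#include <string.h>

struct example_rdma_notify {		/* mirrors the pahole layout above */
	unsigned long long user_token;	/* offset 0, size 8 */
	int status;			/* offset 8, size 4 */
};					/* size 16: 4 bytes of tail padding */

static void example_fill(struct example_rdma_notify *out,
			 unsigned long long token, int status)
{
	struct example_rdma_notify cmsg;

	/*
	 * memset() clears all 16 bytes, including the padding. An
	 * initializer such as "= { 0 }" or "= {}" only guarantees the
	 * named members; the hole may still contain stale stack data.
	 */
	memset(&cmsg, 0, sizeof(cmsg));
	cmsg.user_token = token;
	cmsg.status = status;

	*out = cmsg;	/* a plain struct assignment may also skip the padding */
}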
Re: [Linux-kernel-mentees] [PATCH net] rds: Prevent kernel-infoleak in rds_notify_queue_get()
> On 31 Jul 2020, at 11:59, Dan Carpenter wrote: > > On Fri, Jul 31, 2020 at 07:53:01AM +0300, Leon Romanovsky wrote: >> On Thu, Jul 30, 2020 at 03:20:26PM -0400, Peilin Ye wrote: >>> rds_notify_queue_get() is potentially copying uninitialized kernel stack >>> memory to userspace since the compiler may leave a 4-byte hole at the end >>> of `cmsg`. >>> >>> In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which >>> unfortunately does not always initialize that 4-byte hole. Fix it by using >>> memset() instead. >> >> Of course, this is the difference between "{ 0 }" and "{}" initializations. >> > > No, there is no difference. Even struct assignments like: > > foo = *bar; > > can leave struct holes uninitialized. Depending on the compiler the > assignment can be implemented as a memset() or as a series of struct > member assignments. What about: struct rds_rdma_notify { __u64 user_token; __s32 status; } __attribute__((packed)); Thxs, Håkon > regards, > dan carpenter >
Re: [PATCH net-next] net/rds: Use DMA memory pool allocation for rds_header
> On 2 Oct 2019, at 07:20, Ka-Cheong Poon wrote: > > On 10/2/19 1:16 AM, David Miller wrote: >> From: Ka-Cheong Poon >> Date: Mon, 30 Sep 2019 02:08:00 -0700 >>> Currently, RDS calls ib_dma_alloc_coherent() to allocate a large piece >>> of contiguous DMA coherent memory to store struct rds_header for >>> sending/receiving packets. The memory allocated is then partitioned >>> into struct rds_header. This is not necessary and can be costly at >>> times when memory is fragmented. Instead, RDS should use the DMA >>> memory pool interface to handle this. >>> >>> Suggested-by: Håkon Bugge >>> Signed-off-by: Ka-Cheong Poon >> This is trading a one-time overhead for extra levels of dereferencing >> on every single descriptor access in the fast paths. >> I do not agree with this tradeoff, please implement this more >> reasonably. > > > The problem with the existing way of pre-allocation is > that when there are a lot of RDS connections, the call to > ib_dma_alloc_coherent() can fail because there are not > enough contiguous memory pages available. It is causing > problems in production systems. > > And the i_{recv|send|_hdrs_dma array dereferencing is done > at send/receive ring initialization and refill. It is not > done at every access of the header. The commit removes costly order:4 allocations (with the default 1024 #entries in the recv work-queue). These allocations may need reclaims and/or compaction. At worst, they may fail. Consider a switch failure, thousands of RDS IB connections, and their corresponding QPs, need to be resurrected. Each QP needs this order:4 allocation, and they are all created in close proximity in time, leading to an immense memory hog. Thxs, Håkon > Thanks. > > > -- > K. Poon > ka-cheong.p...@oracle.com > > >
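A hedged sketch of the approach under discussion using the kernel DMA pool API, so each struct rds_header comes from a small pool allocation rather than one large ib_dma_alloc_coherent() region; all names here are illustrative and this is not the posted patch.

#include <linux/dmapool.h>
#include <linux/device.h>
#include <linux/cache.h>
#include <linux/gfp.h>
#include <linux/errno.h>

struct example_hdrs {
	struct dma_pool *pool;
	void *hdr;
	dma_addr_t hdr_dma;
};

static int example_alloc_hdr(struct device *dma_dev, struct example_hdrs *ex,
			     size_t hdr_size)
{
	/* Pool of small, coherent, cache-line aligned header buffers */
	ex->pool = dma_pool_create("example_rds_hdr", dma_dev, hdr_size,
				   L1_CACHE_BYTES, 0);
	if (!ex->pool)
		return -ENOMEM;

	/* No high-order contiguous allocation is needed per connection */
	ex->hdr = dma_pool_zalloc(ex->pool, GFP_KERNEL, &ex->hdr_dma);
	if (!ex->hdr) {
		dma_pool_destroy(ex->pool);
		ex->pool = NULL;
		return -ENOMEM;
	}
	return 0;
}

static void example_free_hdr(struct example_hdrs *ex)
{
	dma_pool_free(ex->pool, ex->hdr, ex->hdr_dma);
	dma_pool_destroy(ex->pool);
}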
Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
> On 6 Feb 2019, at 19:02, jackm wrote: > > On Wed, 6 Feb 2019 16:40:14 +0100 > Håkon Bugge wrote: > >> Jack, >> >> A major contributor to the long processing time in the PF driver >> proxying QP1 packets is: >> >> create_pv_resources >> -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler, >> NULL, ctx, cq_size, 0); >> >> That is, comp_vector is zero. >> >> Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"), >> the zero comp_vector has the intent of let the mlx4_core driver >> select the least used vector. >> >> But, in mlx4_ib_create_cq(), we have: >> >>pr_info("eq_table: %p\n", dev->eq_table); >>if (dev->eq_table) { >> vector = dev->eq_table[mlx4_choose_vector(dev->dev, >> vector, ibdev->num_comp_vectors)]; >>} >> >>cq->vector = vector; >> >> and dev->eq_table is NULL, so all the CQs for the proxy QPs get >> comp_vector zero. >> >> I have to make some reservations, as this analysis is based on uek4, >> but I think the code here is equal upstream, but need to double check. >> >> >> Thxs, Håkon >> > Hi Hakon and Jason, > I was ill today (bad cold, took antihistamines all day, which knocked > me out). > I'll get to this tomorrow. No problem Jack. I actually see that our uek4 is different in mlx4_ib_alloc_eqs(), and that may well be the root cause here. Hence, moved the MLs to Bcc, and get back to you tomorrow. Thxs, Håkon
Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
> On 6 Feb 2019, at 09:50, Håkon Bugge wrote: > > > >> On 5 Feb 2019, at 23:36, Jason Gunthorpe wrote: >> >> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote: >>> Using CX-3 virtual functions, either from a bare-metal machine or >>> pass-through from a VM, MAD packets are proxied through the PF driver. >>> >>> Since the VMs have separate name spaces for MAD Transaction Ids >>> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping >>> in a cache. >>> >>> Following the RDMA CM protocol, it is clear when an entry has to >>> evicted form the cache. But life is not perfect, remote peers may die >>> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when >>> the PF driver assumes the remote peer has gone. >>> >>> We have experienced excessive amount of DREQ retries during fail-over >>> testing, when running with eight VMs per database server. >>> >>> The problem has been reproduced in a bare-metal system using one VM >>> per physical node. In this environment, running 256 processes in each >>> VM, each process uses RDMA CM to create an RC QP between himself and >>> all (256) remote processes. All in all 16K QPs. >>> >>> When tearing down these 16K QPs, excessive DREQ retries (and >>> duplicates) are observed. With some cat/paste/awk wizardry on the >>> infiniband_cm sysfs, we observe: >>> >>> dreq: 5007 >>> cm_rx_msgs: >>> drep: 3838 >>> dreq: 13018 >>> rep: 8128 >>> req: 8256 >>> rtu: 8256 >>> cm_tx_msgs: >>> drep: 8011 >>> dreq: 68856 >>> rep: 8256 >>> req: 8128 >>> rtu: 8128 >>> cm_tx_retries: >>> dreq: 60483 >>> >>> Note that the active/passive side is distributed. >>> >>> Enabling pr_debug in cm.c gives tons of: >>> >>> [171778.814239] mlx4_ib_multiplex_cm_handler: id{slave: >>> 1,sl_cm_id: 0xd393089f} is NULL! >>> >>> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the >>> tear-down phase of the application is reduced from 113 to 67 >>> seconds. Retries/duplicates are also significantly reduced: >>> >>> cm_rx_duplicates: >>> dreq: 7726 >>> [] >>> cm_tx_retries: >>> drep: 1 >>> dreq: 7779 >>> >>> Increasing the timeout further didn't help, as these duplicates and >>> retries stem from a too short CMA timeout, which was 20 (~4 seconds) >>> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the >>> numbers fell down to about one hundred for both of them. >>> >>> Adjustment of the CMA timeout is _not_ part of this commit. >>> >>> Signed-off-by: Håkon Bugge >>> --- >>> drivers/infiniband/hw/mlx4/cm.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> Jack? What do you think? > > I am tempted to send a v2 making this a sysctl tuneable. This because, > full-rack testing using 8 servers, each with 8 VMs, only showed 33% reduction > in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: > 0xd393089f} is NULL" with this commit. > > But sure, Jack's opinion matters. Jack, A major contributor to the long processing time in the PF driver proxying QP1 packets is: create_pv_resources -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler, NULL, ctx, cq_size, 0); That is, comp_vector is zero. Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"), the zero comp_vector has the intent of let the mlx4_core driver select the least used vector. 
But, in mlx4_ib_create_cq(), we have: pr_info("eq_table: %p\n", dev->eq_table); if (dev->eq_table) { vector = dev->eq_table[mlx4_choose_vector(dev->dev, vector, ibdev->num_comp_vectors)]; } cq->vector = vector; and dev->eq_table is NULL, so all the CQs for the proxy QPs get comp_vector zero. I have to make some reservations, as this analysis is based on uek4, but I think the code here is equal upstream, but need to double check. Thxs, Håkon > > > Thxs, Håkon > >> >>> diff --git a/drivers/infiniband/hw/mlx4/cm.c >>> b/drivers/infiniband/hw/mlx4/cm.c >>> index fedaf8260105..8c79a480f2b7 100644 >>> --- a/drivers/infiniband/hw/mlx4/cm.c >>> +++ b/drivers/infiniband/hw/mlx4/cm.c >>> @@ -39,7 +39,7 @@ >>> >>> #include "mlx4_ib.h" >>> >>> -#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ) >>> +#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ) >>> >>> struct id_map_entry { >>> struct rb_node node; >>> -- >>> 2.20.1
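A sketch only of the kind of spreading the analysis points to, i.e. not handing comp_vector zero to every tunnel CQ; it uses the current ib_create_cq()/ib_cq_init_attr interface rather than the older signature quoted above, and the helper and counter are hypothetical.

#include <rdma/ib_verbs.h>
#include <linux/atomic.h>

/* Round-robin completion vectors so proxy-QP CQs do not all share EQ 0 */
static struct ib_cq *example_create_spread_cq(struct ib_device *ibdev,
					      ib_comp_handler comp_handler,
					      void *cq_context, int cqe,
					      atomic_t *next_vector)
{
	struct ib_cq_init_attr attr = {
		.cqe = cqe,
		.comp_vector = (u32)atomic_inc_return(next_vector) %
			       ibdev->num_comp_vectors,
	};

	return ib_create_cq(ibdev, comp_handler, NULL, cq_context, &attr);
}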
Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
> On 5 Feb 2019, at 23:36, Jason Gunthorpe wrote: > > On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote: >> Using CX-3 virtual functions, either from a bare-metal machine or >> pass-through from a VM, MAD packets are proxied through the PF driver. >> >> Since the VMs have separate name spaces for MAD Transaction Ids >> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping >> in a cache. >> >> Following the RDMA CM protocol, it is clear when an entry has to >> evicted form the cache. But life is not perfect, remote peers may die >> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when >> the PF driver assumes the remote peer has gone. >> >> We have experienced excessive amount of DREQ retries during fail-over >> testing, when running with eight VMs per database server. >> >> The problem has been reproduced in a bare-metal system using one VM >> per physical node. In this environment, running 256 processes in each >> VM, each process uses RDMA CM to create an RC QP between himself and >> all (256) remote processes. All in all 16K QPs. >> >> When tearing down these 16K QPs, excessive DREQ retries (and >> duplicates) are observed. With some cat/paste/awk wizardry on the >> infiniband_cm sysfs, we observe: >> >> dreq: 5007 >> cm_rx_msgs: >> drep: 3838 >> dreq: 13018 >> rep: 8128 >> req: 8256 >> rtu: 8256 >> cm_tx_msgs: >> drep: 8011 >> dreq: 68856 >> rep: 8256 >> req: 8128 >> rtu: 8128 >> cm_tx_retries: >> dreq: 60483 >> >> Note that the active/passive side is distributed. >> >> Enabling pr_debug in cm.c gives tons of: >> >> [171778.814239] mlx4_ib_multiplex_cm_handler: id{slave: >> 1,sl_cm_id: 0xd393089f} is NULL! >> >> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the >> tear-down phase of the application is reduced from 113 to 67 >> seconds. Retries/duplicates are also significantly reduced: >> >> cm_rx_duplicates: >> dreq: 7726 >> [] >> cm_tx_retries: >> drep: 1 >> dreq: 7779 >> >> Increasing the timeout further didn't help, as these duplicates and >> retries stem from a too short CMA timeout, which was 20 (~4 seconds) >> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the >> numbers fell down to about one hundred for both of them. >> >> Adjustment of the CMA timeout is _not_ part of this commit. >> >> Signed-off-by: Håkon Bugge >> --- >> drivers/infiniband/hw/mlx4/cm.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) > > Jack? What do you think? I am tempted to send a v2 making this a sysctl tuneable. This because, full-rack testing using 8 servers, each with 8 VMs, only showed 33% reduction in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit. But sure, Jack's opinion matters. Thxs, Håkon > >> diff --git a/drivers/infiniband/hw/mlx4/cm.c >> b/drivers/infiniband/hw/mlx4/cm.c >> index fedaf8260105..8c79a480f2b7 100644 >> --- a/drivers/infiniband/hw/mlx4/cm.c >> +++ b/drivers/infiniband/hw/mlx4/cm.c >> @@ -39,7 +39,7 @@ >> >> #include "mlx4_ib.h" >> >> -#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ) >> +#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ) >> >> struct id_map_entry { >> struct rb_node node; >> -- >> 2.20.1
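A hedged sketch of what such a tunable could look like, using a module parameter for brevity (the posted change, if any, may well use sysctl instead); the parameter name is hypothetical.

#include <linux/module.h>
#include <linux/jiffies.h>

static unsigned int cm_cache_timeout_secs = 30;
module_param(cm_cache_timeout_secs, uint, 0644);
MODULE_PARM_DESC(cm_cache_timeout_secs,
		 "Seconds before an idle proxied CM id cache entry is reclaimed");

/* Replaces the hard-coded CM_CLEANUP_CACHE_TIMEOUT constant */
static inline unsigned long cm_cleanup_cache_timeout(void)
{
	return cm_cache_timeout_secs * HZ;
}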
[PATCH] mlx4_ib: Increase the timeout for CM cache
Using CX-3 virtual functions, either from a bare-metal machine or pass-through from a VM, MAD packets are proxied through the PF driver. Since the VMs have separate name spaces for MAD Transaction Ids (TIDs), the PF driver has to re-map the TIDs and keep the book keeping in a cache. Following the RDMA CM protocol, it is clear when an entry has to evicted form the cache. But life is not perfect, remote peers may die or be rebooted. Hence, it's a timeout to wipe out a cache entry, when the PF driver assumes the remote peer has gone. We have experienced excessive amount of DREQ retries during fail-over testing, when running with eight VMs per database server. The problem has been reproduced in a bare-metal system using one VM per physical node. In this environment, running 256 processes in each VM, each process uses RDMA CM to create an RC QP between himself and all (256) remote processes. All in all 16K QPs. When tearing down these 16K QPs, excessive DREQ retries (and duplicates) are observed. With some cat/paste/awk wizardry on the infiniband_cm sysfs, we observe: dreq: 5007 cm_rx_msgs: drep: 3838 dreq: 13018 rep: 8128 req: 8256 rtu: 8256 cm_tx_msgs: drep: 8011 dreq: 68856 rep: 8256 req: 8128 rtu: 8128 cm_tx_retries: dreq: 60483 Note that the active/passive side is distributed. Enabling pr_debug in cm.c gives tons of: [171778.814239] mlx4_ib_multiplex_cm_handler: id{slave: 1,sl_cm_id: 0xd393089f} is NULL! By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the tear-down phase of the application is reduced from 113 to 67 seconds. Retries/duplicates are also significantly reduced: cm_rx_duplicates: dreq: 7726 [] cm_tx_retries: drep: 1 dreq: 7779 Increasing the timeout further didn't help, as these duplicates and retries stem from a too short CMA timeout, which was 20 (~4 seconds) on the systems. By increasing the CMA timeout to 22 (~17 seconds), the numbers fell down to about one hundred for both of them. Adjustment of the CMA timeout is _not_ part of this commit. Signed-off-by: Håkon Bugge --- drivers/infiniband/hw/mlx4/cm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c index fedaf8260105..8c79a480f2b7 100644 --- a/drivers/infiniband/hw/mlx4/cm.c +++ b/drivers/infiniband/hw/mlx4/cm.c @@ -39,7 +39,7 @@ #include "mlx4_ib.h" -#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ) +#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ) struct id_map_entry { struct rb_node node; -- 2.20.1
Re: [PATCH net 2/2] net/mlx5e: Cleanup of dcbnl related fields
> On 14 Aug 2018, at 11:01, Yuval Shaia wrote: > > On Wed, Aug 08, 2018 at 03:48:08PM -0700, Saeed Mahameed wrote: >> From: Huy Nguyen >> >> Remove unused netdev_registered_init/remove in en.h >> Return ENOSUPPORT if the check MLX5_DSCP_SUPPORTED fails. > > s/ENOSUPPORT/EOPNOTSUPP > (noted by Haakon) Sure did, > >> Remove extra white space and I also said that this has nothing to with the commit (no matter how tempting it can be) ;-) Thxs, Håkon >> >> Fixes: 2a5e7a1344f4 ("net/mlx5e: Add dcbnl dscp to priority support") >> Signed-off-by: Huy Nguyen >> Cc: Yuval Shaia >> Reviewed-by: Parav Pandit >> Signed-off-by: Saeed Mahameed >> --- >> drivers/net/ethernet/mellanox/mlx5/core/en.h | 2 -- >> .../ethernet/mellanox/mlx5/core/en_dcbnl.c| 30 +++ >> 2 files changed, 11 insertions(+), 21 deletions(-) >> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h >> b/drivers/net/ethernet/mellanox/mlx5/core/en.h >> index eb9eb7aa953a..405236cf0b04 100644 >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h >> @@ -858,8 +858,6 @@ struct mlx5e_profile { >> mlx5e_fp_handle_rx_cqe handle_rx_cqe; >> mlx5e_fp_handle_rx_cqe handle_rx_cqe_mpwqe; >> } rx_handlers; >> -void(*netdev_registered_init)(struct mlx5e_priv *priv); >> -void(*netdev_registered_remove)(struct mlx5e_priv *priv); >> int max_tc; >> }; >> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c >> b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c >> index e33afa8d2417..722998d68564 100644 >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c >> @@ -443,16 +443,12 @@ static int mlx5e_dcbnl_ieee_setapp(struct net_device >> *dev, struct dcb_app *app) >> bool is_new; >> int err; >> >> -if (app->selector != IEEE_8021QAZ_APP_SEL_DSCP) >> -return -EINVAL; >> - >> -if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager)) >> -return -EINVAL; >> - >> -if (!MLX5_DSCP_SUPPORTED(priv->mdev)) >> -return -EINVAL; >> +if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager) || >> +!MLX5_DSCP_SUPPORTED(priv->mdev)) >> +return -EOPNOTSUPP; >> >> -if (app->protocol >= MLX5E_MAX_DSCP) >> +if ((app->selector != IEEE_8021QAZ_APP_SEL_DSCP) || >> +(app->protocol >= MLX5E_MAX_DSCP)) >> return -EINVAL; >> >> /* Save the old entry info */ >> @@ -500,16 +496,12 @@ static int mlx5e_dcbnl_ieee_delapp(struct net_device >> *dev, struct dcb_app *app) >> struct mlx5e_priv *priv = netdev_priv(dev); >> int err; >> >> -if (app->selector != IEEE_8021QAZ_APP_SEL_DSCP) >> -return -EINVAL; >> - >> -if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager)) >> -return -EINVAL; >> - >> -if (!MLX5_DSCP_SUPPORTED(priv->mdev)) >> -return -EINVAL; >> +if (!MLX5_CAP_GEN(priv->mdev, vport_group_manager) || >> + !MLX5_DSCP_SUPPORTED(priv->mdev)) >> +return -EOPNOTSUPP; >> >> -if (app->protocol >= MLX5E_MAX_DSCP) >> +if ((app->selector != IEEE_8021QAZ_APP_SEL_DSCP) || >> +(app->protocol >= MLX5E_MAX_DSCP)) >> return -EINVAL; >> >> /* Skip if no dscp app entry */ >> @@ -1146,7 +1138,7 @@ static int mlx5e_set_trust_state(struct mlx5e_priv >> *priv, u8 trust_state) >> { >> int err; >> >> -err = mlx5_set_trust_state(priv->mdev, trust_state); >> +err = mlx5_set_trust_state(priv->mdev, trust_state); >> if (err) >> return err; >> priv->dcbx_dp.trust_state = trust_state; >> -- >> 2.17.0 >>
Re: [PATCH] mlx4_core: allocate 4KB ICM chunks
> On 11 May 2018, at 01:31, Qing Huang wrote: > > When a system is under memory presure (high usage with fragments), > the original 256KB ICM chunk allocations will likely trigger kernel > memory management to enter slow path doing memory compact/migration > ops in order to complete high order memory allocations. > > When that happens, user processes calling uverb APIs may get stuck > for more than 120s easily even though there are a lot of free pages > in smaller chunks available in the system. > > Syslog: > ... > Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task > oracle_205573_e:205573 blocked for more than 120 seconds. > ... > > With 4KB ICM chunk size, the above issue is fixed. > > However in order to support 4KB ICM chunk size, we need to fix another > issue in large size kcalloc allocations. > > E.g. > Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk > size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt > entry). So we need a 16MB allocation for a table->icm pointer array to > hold 2M pointers which can easily cause kcalloc to fail. > > The solution is to use vzalloc to replace kcalloc. There is no need > for contiguous memory pages for a driver meta data structure (no need > of DMA ops). > > Signed-off-by: Qing Huang > Acked-by: Daniel Jurgens > --- > drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c > b/drivers/net/ethernet/mellanox/mlx4/icm.c > index a822f7a..2b17a4b 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/icm.c > +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c > @@ -43,12 +43,12 @@ > #include "fw.h" > > /* > - * We allocate in as big chunks as we can, up to a maximum of 256 KB > - * per chunk. > + * We allocate in 4KB page size chunks to avoid high order memory > + * allocations in fragmented/high usage memory situation. > */ > enum { > - MLX4_ICM_ALLOC_SIZE = 1 << 18, > - MLX4_TABLE_CHUNK_SIZE = 1 << 18 > + MLX4_ICM_ALLOC_SIZE = 1 << 12, > + MLX4_TABLE_CHUNK_SIZE = 1 << 12 Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the hw page size is 8KiB. Thxs, Håkon > }; > > static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk > *chunk) > @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct > mlx4_icm_table *table, > obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; > num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; > > - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); > + table->icm = vzalloc(num_icm * sizeof(*table->icm)); > if (!table->icm) > return -ENOMEM; > table->virt = virt; > @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct > mlx4_icm_table *table, > mlx4_free_icm(dev, table->icm[i], use_coherent); > } > > - kfree(table->icm); > + vfree(table->icm); > > return -ENOMEM; > } > @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct > mlx4_icm_table *table) > mlx4_free_icm(dev, table->icm[i], table->coherent); > } > > - kfree(table->icm); > + vfree(table->icm); > } > -- > 2.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
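A small sketch of the two review points, assuming nothing beyond what is discussed above: sizing the chunks from the architecture's page size rather than a hard-coded 4 KiB, and using vzalloc() for the pointer array, which needs no physically contiguous pages.

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* PAGE_SIZE is 4 KiB on x86-64 but e.g. 8 KiB on SPARC */
#define EXAMPLE_ICM_ALLOC_SIZE		PAGE_SIZE
#define EXAMPLE_TABLE_CHUNK_SIZE	PAGE_SIZE

/* The table of chunk pointers is driver metadata; vmalloc space is fine */
static void **example_alloc_icm_array(unsigned int num_icm)
{
	return vzalloc(num_icm * sizeof(void *));
}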
Re: [PATCH] rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qp
> On 25 Apr 2018, at 13:22, Dag Moxnes wrote: > > The function rds_ib_setup_qp is calling rds_ib_get_client_data and > should correspondingly call rds_ib_dev_put. This call was lost in > the non-error path with the introduction of error handling done in > commit 3b12f73a5c29 ("rds: ib: add error handle") > > Signed-off-by: Dag Moxnes Reviewed-by: Håkon Bugge Thxs, Håkon > --- > net/rds/ib_cm.c |3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c > index eea1d86..13b38ad 100644 > --- a/net/rds/ib_cm.c > +++ b/net/rds/ib_cm.c > @@ -547,7 +547,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) > rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd, >ic->i_send_cq, ic->i_recv_cq); > > - return ret; > + goto out; > > sends_out: > vfree(ic->i_sends); > @@ -572,6 +572,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) > ic->i_send_cq = NULL; > rds_ibdev_out: > rds_ib_remove_conn(rds_ibdev, conn); > +out: > rds_ib_dev_put(rds_ibdev); > > return ret; > -- > 1.7.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] Revert "rds: ib: add error handle"
> On 24 Apr 2018, at 05:27, santosh.shilim...@oracle.com wrote: > > On 4/23/18 6:39 PM, Zhu Yanjun wrote: >> This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc. >> After long time discussion and investigations, it seems that there >> is no mem leak. So this patch is reverted. >> Signed-off-by: Zhu Yanjun >> --- > Well your fix was not for any leaks but just proper labels for > graceful exits. I don't know which long time discussion > you are referring but there is no need to revert this change > unless you see any issue with your change. As spotted by Dag Moxnes, the patch misses the call to rds_ib_dev_put() when exiting normally, only on the error exit. Thxs, Håkon > > Regards, > Santosh > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
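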
Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
Hi Sowmini, A little nit below. And some spelling issues in existing commentary you can consider fixing, since you reshuffle this file considerable. Thxs, Håkon > On 18 Mar 2018, at 21:45, Sowmini Varadhan > wrote: > > On (03/18/18 00:55), Kirill Tkhai wrote: >> >> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is >> another solution to do that, I'm not again that. > > The patch below takes care of this. I've done some preliminary testing, > and I'll send it upstream after doing additional self-review/testing. > Please also take a look, if you can, to see if I missed something. > > Thanks for the input, > > --Sowmini > ---patch follows > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 08ea9cd..87c2643 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net) > return err; > } > > -static void __net_exit rds_tcp_exit_net(struct net *net) > -{ > - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); > - > - if (rtn->rds_tcp_sysctl) > - unregister_net_sysctl_table(rtn->rds_tcp_sysctl); > - > - if (net != &init_net && rtn->ctl_table) > - kfree(rtn->ctl_table); > - > - /* If rds_tcp_exit_net() is called as a result of netns deletion, > - * the rds_tcp_kill_sock() device notifier would already have cleaned > - * up the listen socket, thus there is no work to do in this function. > - * > - * If rds_tcp_exit_net() is called as a result of module unload, > - * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then > - * we do need to clean up the listen socket here. > - */ > - if (rtn->rds_tcp_listen_sock) { > - struct socket *lsock = rtn->rds_tcp_listen_sock; > - > - rtn->rds_tcp_listen_sock = NULL; > - rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w); > - } > -} > - > -static struct pernet_operations rds_tcp_net_ops = { > - .init = rds_tcp_init_net, > - .exit = rds_tcp_exit_net, > - .id = &rds_tcp_netid, > - .size = sizeof(struct rds_tcp_net), > - .async = true, > -}; > - > static void rds_tcp_kill_sock(struct net *net) > { > struct rds_tcp_connection *tc, *_tc; > @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net) > rds_conn_destroy(tc->t_cpath->cp_conn); > } > > -void *rds_tcp_listen_sock_def_readable(struct net *net) > +static void __net_exit rds_tcp_exit_net(struct net *net) > { > struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); > - struct socket *lsock = rtn->rds_tcp_listen_sock; > > - if (!lsock) > - return NULL; > + rds_tcp_kill_sock(net); > > - return lsock->sk->sk_user_data; > + if (rtn->rds_tcp_sysctl) > + unregister_net_sysctl_table(rtn->rds_tcp_sysctl); > + > + if (net != &init_net && rtn->ctl_table) > + kfree(rtn->ctl_table); Well, this comes from the existing code, but as pointed out by Linus recently, kfree() handles NULL pointers. So, if rtn->ctl_table is NULL most likely, the code is OK _if_ you add an unlikely() around the if-clause. Otherwise, the “ && rtn->ctl_table” can simply be removed. 
> } > > -static int rds_tcp_dev_event(struct notifier_block *this, > - unsigned long event, void *ptr) > +static struct pernet_operations rds_tcp_net_ops = { > + .init = rds_tcp_init_net, > + .exit = rds_tcp_exit_net, > + .id = &rds_tcp_netid, > + .size = sizeof(struct rds_tcp_net), > + .async = true, > +}; > + > +void *rds_tcp_listen_sock_def_readable(struct net *net) > { > - struct net_device *dev = netdev_notifier_info_to_dev(ptr); > + struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); > + struct socket *lsock = rtn->rds_tcp_listen_sock; > > - /* rds-tcp registers as a pernet subys, so the ->exit will only > - * get invoked after network acitivity has quiesced. We need to > - * clean up all sockets to quiesce network activity, and use > - * the unregistration of the per-net loopback device as a trigger > - * to start that cleanup. > - */ > - if (event == NETDEV_UNREGISTER_FINAL && > - dev->ifindex == LOOPBACK_IFINDEX) > - rds_tcp_kill_sock(dev_net(dev)); > + if (!lsock) > + return NULL; > > - return NOTIFY_DONE; > + return lsock->sk->sk_user_data; > } > > -static struct notifier_block rds_tcp_dev_notifier = { > - .notifier_call= rds_tcp_dev_event, > - .priority = -10, /* must be called after other network notifiers */ > -}; > - > /* when sysctl is used to modify some kernel socket parameters,this s/when/When/ s/parameters,this/parameters, this/ Well, not part of your commit. > * function resets the RDS connections in that netns so that we can Two double spaces incidents above Not part of y
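On the kfree() remark earlier in this review, a tiny illustration of why the extra check is optional: kfree(NULL) is defined to be a no-op, so the two cleanups below are equivalent (names are illustrative).

#include <linux/slab.h>
#include <linux/sysctl.h>
#include <net/net_namespace.h>

static void example_cleanup_explicit(struct net *net, struct ctl_table *tbl)
{
	if (net != &init_net && tbl)	/* explicit NULL check */
		kfree(tbl);
}

static void example_cleanup_simpler(struct net *net, struct ctl_table *tbl)
{
	if (net != &init_net)		/* kfree() already handles NULL */
		kfree(tbl);
}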
[PATCH net] rds: Fix NULL pointer dereference in __rds_rdma_map
This is a fix for syzkaller719569, where memory registration was attempted without any underlying transport being loaded. Analysis of the case reveals that it is the setsockopt() RDS_GET_MR (2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable. Here is an example stack trace when the bug is hit: BUG: unable to handle kernel NULL pointer dereference at 00c0 IP: __rds_rdma_map+0x36/0x440 [rds] PGD 2f93d03067 P4D 2f93d03067 PUD 2f93d02067 PMD 0 Oops: [#1] SMP Modules linked in: bridge stp llc tun rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rds binfmt_misc sb_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul c rc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt mei_me sg iTCO_vendor_support ipmi_si mei ipmi_devintf nfsd shpchp pcspkr i2c_i801 ioatd ma ipmi_msghandler wmi lpc_ich mfd_core auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 mgag200 i2c_algo_bit drm_kms_helper ixgbe syscopyarea ahci sysfillrect sysimgblt libahci mdio fb_sys_fops ttm ptp libata sd_mod mlx4_core drm crc32c_intel pps_core megaraid_sas i2c_core dca dm_mirror dm_region_hash dm_log dm_mod CPU: 48 PID: 45787 Comm: repro_set2 Not tainted 4.14.2-3.el7uek.x86_64 #2 Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 3111 03/03/2017 task: 882f9190db00 task.stack: c9002b994000 RIP: 0010:__rds_rdma_map+0x36/0x440 [rds] RSP: 0018:c9002b997df0 EFLAGS: 00010202 RAX: RBX: 882fa2182580 RCX: RDX: RSI: c9002b997e40 RDI: 882fa2182580 RBP: c9002b997e30 R08: R09: 0002 R10: 885fb29e3838 R11: R12: 882fa2182580 R13: 882fa2182580 R14: 0002 R15: 2ffc FS: 7fbffa20b700() GS:882fbfb8() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 00c0 CR3: 002f98a66006 CR4: 001606e0 Call Trace: rds_get_mr+0x56/0x80 [rds] rds_setsockopt+0x172/0x340 [rds] ? __fget_light+0x25/0x60 ? __fdget+0x13/0x20 SyS_setsockopt+0x80/0xe0 do_syscall_64+0x67/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7fbff9b117f9 RSP: 002b:7fbffa20aed8 EFLAGS: 0293 ORIG_RAX: 0036 RAX: ffda RBX: 000c84a4 RCX: 7fbff9b117f9 RDX: 0002 RSI: 4114 RDI: 109b RBP: 7fbffa20af10 R08: 0020 R09: 7fbff9dd7860 R10: 2ffc R11: 0293 R12: R13: 7fbffa20b9c0 R14: 7fbffa20b700 R15: 0021 Code: 41 56 41 55 49 89 fd 41 54 53 48 83 ec 18 8b 87 f0 02 00 00 48 89 55 d0 48 89 4d c8 85 c0 0f 84 2d 03 00 00 48 8b 87 00 03 00 00 <48> 83 b8 c0 00 00 00 00 0f 84 25 03 00 0 0 48 8b 06 48 8b 56 08 The fix is to check the existence of an underlying transport in __rds_rdma_map(). Signed-off-by: Håkon Bugge Reported-by: syzbot --- net/rds/rdma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rds/rdma.c b/net/rds/rdma.c index 8886f15abe90..bc2f1e0977d6 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -183,7 +183,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, long i; int ret; - if (rs->rs_bound_addr == 0) { + if (rs->rs_bound_addr == 0 || !rs->rs_transport) { ret = -ENOTCONN; /* XXX not a great errno */ goto out; } -- 2.13.6
[PATCH net] rds: ib: Fix NULL pointer dereference in debug code
() and ib_ib_port_recv() statements: /* XXX when can this fail? */ ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); + if (can_wait) + usleep_range(1000, 5000); rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv, recv->r_ibinc, sg_page(&recv->r_frag->f_sg), (long) ib_sg_dma_address( The fix is simply to move the rdsdebug() statement up before the ib_post_recv() and remove the printing of ret, which is taken care of anyway by the non-debug code. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang Reviewed-by: Wei Lin Guay --- net/rds/ib_recv.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c index 9722bf8..b4e421a 100644 --- a/net/rds/ib_recv.c +++ b/net/rds/ib_recv.c @@ -410,14 +410,14 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp) break; } - /* XXX when can this fail? */ - ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); - rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv, + rdsdebug("recv %p ibinc %p page %p addr %lu\n", recv, recv->r_ibinc, sg_page(&recv->r_frag->f_sg), (long) ib_sg_dma_address( ic->i_cm_id->device, - &recv->r_frag->f_sg), - ret); + &recv->r_frag->f_sg)); + + /* XXX when can this fail? */ + ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); if (ret) { rds_ib_conn_error(conn, "recv post on " "%pI4 returned %d, disconnecting and " -- 2.9.3
[PATCH] rds: ib: Fix uninitialized variable
send_flags needs to be initialized before calling rds_ib_set_wr_signal_state(). Signed-off-by: Håkon Bugge Acked-by: Santosh Shilimkar --- net/rds/ib_send.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index 6ab39db..8f46755 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -792,6 +792,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, struct rm_atomic_op *op) send->s_atomic_wr.compare_add_mask = op->op_m_fadd.nocarry_mask; send->s_atomic_wr.swap_mask = 0; } + send->s_wr.send_flags = 0; nr_sig = rds_ib_set_wr_signal_state(ic, send, op->op_notify); send->s_atomic_wr.wr.num_sge = 1; send->s_atomic_wr.wr.next = NULL; -- 1.8.3.1
Re: [PATCH] rds: Fix uninitialized variable
Ignore this, sent the old one. Håkon > On 24 Oct 2017, at 18:07, Håkon Bugge wrote: > > send_flags needs to be initialized before calling > rds_ib_set_wr_signal_state(). > > Signed-off-by: Håkon Bugge > --- > net/rds/ib_send.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c > index 6ab39db..8f46755 100644 > --- a/net/rds/ib_send.c > +++ b/net/rds/ib_send.c > @@ -792,6 +792,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, > struct rm_atomic_op *op) > send->s_atomic_wr.compare_add_mask = op->op_m_fadd.nocarry_mask; > send->s_atomic_wr.swap_mask = 0; > } > + send->s_wr.send_flags = 0; > nr_sig = rds_ib_set_wr_signal_state(ic, send, op->op_notify); > send->s_atomic_wr.wr.num_sge = 1; > send->s_atomic_wr.wr.next = NULL; > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] rds: Fix uninitialized variable
send_flags needs to be initialized before calling rds_ib_set_wr_signal_state(). Signed-off-by: Håkon Bugge --- net/rds/ib_send.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index 6ab39db..8f46755 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -792,6 +792,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, struct rm_atomic_op *op) send->s_atomic_wr.compare_add_mask = op->op_m_fadd.nocarry_mask; send->s_atomic_wr.swap_mask = 0; } + send->s_wr.send_flags = 0; nr_sig = rds_ib_set_wr_signal_state(ic, send, op->op_notify); send->s_atomic_wr.wr.num_sge = 1; send->s_atomic_wr.wr.next = NULL; -- 1.8.3.1
[PATCH] rds: Fix inaccurate accounting of unsignaled wrs
The number of unsignaled work-requests posted to the IB send queue is tracked by a counter in the rds_ib_connection struct. When it reaches zero, or the caller explicitly asks for it, the send-signaled bit is set in send_flags and the counter is reset. This is performed by the rds_ib_set_wr_signal_state() function. However, this function is not always used which yields inaccurate accounting. This commit fixes this, re-factors a code bloat related to the matter, and makes the actual parameter type to the function consistent. Signed-off-by: Håkon Bugge --- net/rds/ib_send.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index 6ab39db..d07ecb0 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -661,13 +661,15 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, } } - rds_ib_set_wr_signal_state(ic, send, 0); + rds_ib_set_wr_signal_state(ic, send, false); /* * Always signal the last one if we're stopping due to flow control. */ - if (ic->i_flowctl && flow_controlled && i == (work_alloc-1)) - send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + if (ic->i_flowctl && flow_controlled && i == (work_alloc - 1)) { + rds_ib_set_wr_signal_state(ic, send, true); + send->s_wr.send_flags |= IB_SEND_SOLICITED; + } if (send->s_wr.send_flags & IB_SEND_SIGNALED) nr_sig++; @@ -705,11 +707,8 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, if (scat == &rm->data.op_sg[rm->data.op_count]) { prev->s_op = ic->i_data_op; prev->s_wr.send_flags |= IB_SEND_SOLICITED; - if (!(prev->s_wr.send_flags & IB_SEND_SIGNALED)) { - ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; - prev->s_wr.send_flags |= IB_SEND_SIGNALED; - nr_sig++; - } + if (!(prev->s_wr.send_flags & IB_SEND_SIGNALED)) + nr_sig += rds_ib_set_wr_signal_state(ic, prev, true); ic->i_data_op = NULL; } -- 1.8.3.1
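A generic sketch of the signaling pattern the commit message describes, not the RDS helper itself (names are illustrative): a per-connection countdown decides when a work request must carry IB_SEND_SIGNALED, and the return value lets the caller keep an accurate count of expected completions.

#include <rdma/ib_verbs.h>

struct example_conn {
	unsigned int unsignaled_wrs;	/* countdown to the next signaled WR */
	unsigned int max_unsig_wrs;	/* batch size, e.g. from a sysctl */
};

static int example_set_wr_signal_state(struct example_conn *c,
					struct ib_send_wr *wr, bool notify)
{
	/* Signal when the batch is exhausted or when the caller insists */
	if (c->unsignaled_wrs-- == 0 || notify) {
		c->unsignaled_wrs = c->max_unsig_wrs;
		wr->send_flags |= IB_SEND_SIGNALED;
		return 1;
	}
	return 0;
}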
[PATCH net v2] rds: Fix incorrect statistics counting
In rds_send_xmit() there is logic to batch the sends. However, if another thread has acquired the lock and has incremented the send_gen, it is considered a race and we yield. The code incrementing the s_send_lock_queue_raced statistics counter did not count this event correctly. This commit counts the race condition correctly. Changes from v1: - Removed check for *someone_on_xmit()* - Fixed incorrect indentation Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang --- net/rds/send.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index 058a407..b52cdc8 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -428,14 +428,18 @@ int rds_send_xmit(struct rds_conn_path *cp) * some work and we will skip our goto */ if (ret == 0) { + bool raced; + smp_mb(); + raced = send_gen != READ_ONCE(cp->cp_send_gen); + if ((test_bit(0, &conn->c_map_queued) || -!list_empty(&cp->cp_send_queue)) && - send_gen == READ_ONCE(cp->cp_send_gen)) { - rds_stats_inc(s_send_lock_queue_raced); + !list_empty(&cp->cp_send_queue)) && !raced) { if (batch_count < send_batch_count) goto restart; queue_delayed_work(rds_wq, &cp->cp_send_w, 1); + } else if (raced) { + rds_stats_inc(s_send_lock_queue_raced); } } out: -- 2.9.3
Re: [PATCH net] rds: Fix incorrect statistics counting
> On 6 Sep 2017, at 17:58, Santosh Shilimkar > wrote: > > On 9/6/2017 8:29 AM, Håkon Bugge wrote: >> In rds_send_xmit() there is logic to batch the sends. However, if >> another thread has acquired the lock, it is considered a race and we >> yield. The code incrementing the s_send_lock_queue_raced statistics >> counter did not count this event correctly. >> This commit removes a small race in determining the race and >> increments the statistics counter correctly. >> Signed-off-by: Håkon Bugge >> Reviewed-by: Knut Omang >> --- >> net/rds/send.c | 16 +--- >> 1 file changed, 13 insertions(+), 3 deletions(-) > Those counters are not really to give that accurate so > am not very keen to add additional cycles in send paths > and add additional code. Have you seen any real issue > or this is just a observation. s_send_lock_queue_raced > counter is never used to check for smaller increments > and hence the question. Hi Santosh, Yes, I agree with accuracy of s_send_lock_queue_raced. But the main point is that the existing code counts some partial share of when it is _not_ raced. So, in the critical path, my patch adds one test_bit(), which hits the local CPU cache, if not raced. If raced, some other thread is in control, so I would not think the added cycles would make any big difference. I can send a v2 where the race tightening is removed if you like. Thxs, Håkon
[PATCH net] rds: Fix incorrect statistics counting
In rds_send_xmit() there is logic to batch the sends. However, if another thread has acquired the lock, it is considered a race and we yield. The code incrementing the s_send_lock_queue_raced statistics counter did not count this event correctly. This commit removes a small race in determining the race and increments the statistics counter correctly. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang --- net/rds/send.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index 058a407..ecfe0b5 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -101,6 +101,11 @@ void rds_send_path_reset(struct rds_conn_path *cp) } EXPORT_SYMBOL_GPL(rds_send_path_reset); +static bool someone_in_xmit(struct rds_conn_path *cp) +{ + return test_bit(RDS_IN_XMIT, &cp->cp_flags); +} + static int acquire_in_xmit(struct rds_conn_path *cp) { return test_and_set_bit(RDS_IN_XMIT, &cp->cp_flags) == 0; @@ -428,14 +433,19 @@ int rds_send_xmit(struct rds_conn_path *cp) * some work and we will skip our goto */ if (ret == 0) { + bool raced; + smp_mb(); + raced = someone_in_xmit(cp) || + send_gen != READ_ONCE(cp->cp_send_gen); + if ((test_bit(0, &conn->c_map_queued) || -!list_empty(&cp->cp_send_queue)) && - send_gen == READ_ONCE(cp->cp_send_gen)) { - rds_stats_inc(s_send_lock_queue_raced); + !list_empty(&cp->cp_send_queue)) && !raced) { if (batch_count < send_batch_count) goto restart; queue_delayed_work(rds_wq, &cp->cp_send_w, 1); + } else if (raced) { + rds_stats_inc(s_send_lock_queue_raced); } } out: -- 2.9.3
[PATCH net] rds: Fix non-atomic operation on shared flag variable
The bits in m_flags in struct rds_message are used for a plurality of reasons, and from different contexts. To avoid any missing updates to m_flags, use the atomic set_bit() instead of the non-atomic equivalent. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang Reviewed-by: Wei Lin Guay --- net/rds/send.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index 41b9f0f..058a407 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp) len = ntohl(rm->m_inc.i_hdr.h_len); if (cp->cp_unacked_packets == 0 || cp->cp_unacked_bytes < len) { - __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); cp->cp_unacked_packets = rds_sysctl_max_unacked_packets; @@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn, * throughput hits a certain threshold. */ if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2) - __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); list_add_tail(&rm->m_sock_item, &rs->rs_send_queue); set_bit(RDS_MSG_ON_SOCK, &rm->m_flags); -- 2.9.3
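A minimal illustration (not RDS code) of the difference the patch relies on: __set_bit() is a plain read-modify-write of the word, so a concurrent update of another bit in the same word can be lost, whereas set_bit() is atomic. The flag names below are illustrative.

#include <linux/bitops.h>

#define EXAMPLE_MSG_ACK_REQUIRED	1
#define EXAMPLE_MSG_ON_SOCK		3

static void example_mark_flags(unsigned long *m_flags)
{
	/*
	 * Non-atomic: load, OR, store. If another context sets a different
	 * bit of *m_flags between the load and the store, that update is
	 * overwritten.
	 */
	__set_bit(EXAMPLE_MSG_ACK_REQUIRED, m_flags);

	/* Atomic read-modify-write: safe against concurrent bit updates */
	set_bit(EXAMPLE_MSG_ON_SOCK, m_flags);
}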
[PATCH net] rds: Reintroduce statistics counting
In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection while senders are present"), refilling the receive queue was removed from rds_ib_recv(), along with the increment of s_ib_rx_refill_from_thread. Commit 73ce4317bf98 ("RDS: make sure we post recv buffers") re-introduces filling the receive queue from rds_ib_recv(), but does not add the statistics counter. rds_ib_recv() was later renamed to rds_ib_recv_path(). This commit reintroduces the statistics counting of s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang Reviewed-by: Wei Lin Guay --- net/rds/ib_recv.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c index e10624a..9722bf8 100644 --- a/net/rds/ib_recv.c +++ b/net/rds/ib_recv.c @@ -1015,8 +1015,10 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, if (rds_ib_ring_empty(&ic->i_recv_ring)) rds_ib_stats_inc(s_ib_rx_ring_empty); - if (rds_ib_ring_low(&ic->i_recv_ring)) + if (rds_ib_ring_low(&ic->i_recv_ring)) { rds_ib_recv_refill(conn, 0, GFP_NOWAIT); + rds_ib_stats_inc(s_ib_rx_refill_from_cq); + } } int rds_ib_recv_path(struct rds_conn_path *cp) @@ -1029,6 +1031,7 @@ int rds_ib_recv_path(struct rds_conn_path *cp) if (rds_conn_up(conn)) { rds_ib_attempt_ack(ic); rds_ib_recv_refill(conn, 0, GFP_KERNEL); + rds_ib_stats_inc(s_ib_rx_refill_from_thread); } return ret; -- 2.9.3
Re: [PATCH net] rds: Make sure updates to cp_send_gen can be observed
> On 20 Jul 2017, at 13:02, Sowmini Varadhan > wrote: > > On (07/20/17 12:28), H??kon Bugge wrote: >> cp->cp_send_gen is treated as a normal variable, although it may be >> used by different threads. > > I'm confused by that assertion. If you look at the comments right > above the change in your patch, there is a note that > acquire_in_xmit/release_in_xmit are the synchronization/serialization > points. > > Can you please clarify? The way the original code works, is that it is allowed for the compiler to keep the value of “cp->cp_send_gen + 1” in a register. The compiler has no requirement to store this value to memory, before leaving the function or calling another one. Further, said register can be used in the comparison outside the acquire_in_xmit/release_in_xmit, at which point another thread may have changed its value. Thxs, Håkon > >> --- a/net/rds/send.c >> +++ b/net/rds/send.c >> @@ -170,8 +170,8 @@ int rds_send_xmit(struct rds_conn_path *cp) >> * The acquire_in_xmit() check above ensures that only one >> * caller can increment c_send_gen at any time. >> */ >> -cp->cp_send_gen++; >> -send_gen = cp->cp_send_gen; >> +send_gen = READ_ONCE(cp->cp_send_gen) + 1; >> +WRITE_ONCE(cp->cp_send_gen, send_gen); >> > > --Sowmini > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
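A minimal sketch of the point above, with illustrative names: with plain accesses the compiler may keep the incremented value in a register and place the load and store wherever it likes, while the _ONCE accessors force exactly one load and one store of the shared variable at this point.

#include <linux/compiler.h>

struct example_path {
	unsigned int send_gen;		/* shared, non-atomic */
};

static unsigned int example_bump_gen(struct example_path *cp)
{
	unsigned int send_gen;

	send_gen = READ_ONCE(cp->send_gen) + 1;	/* exactly one load */
	WRITE_ONCE(cp->send_gen, send_gen);	/* exactly one store */

	return send_gen;	/* later comparisons use this local copy */
}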
[PATCH net] rds: Make sure updates to cp_send_gen can be observed
cp->cp_send_gen is treated as a normal variable, although it may be used by different threads. This is fixed by using {READ,WRITE}_ONCE when it is incremented and READ_ONCE when it is read outside the {acquire,release}_in_xmit protection. Normative reference from the Linux-Kernel Memory Model: Loads from and stores to shared (but non-atomic) variables should be protected with the READ_ONCE(), WRITE_ONCE(), and ACCESS_ONCE(). Clause 5.1.2.4/25 in the C standard is also relevant. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang --- net/rds/send.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index 5cc6403..fa0368c 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -170,8 +170,8 @@ int rds_send_xmit(struct rds_conn_path *cp) * The acquire_in_xmit() check above ensures that only one * caller can increment c_send_gen at any time. */ - cp->cp_send_gen++; - send_gen = cp->cp_send_gen; + send_gen = READ_ONCE(cp->cp_send_gen) + 1; + WRITE_ONCE(cp->cp_send_gen, send_gen); /* * rds_conn_shutdown() sets the conn state and then tests RDS_IN_XMIT, @@ -431,7 +431,7 @@ int rds_send_xmit(struct rds_conn_path *cp) smp_mb(); if ((test_bit(0, &conn->c_map_queued) || !list_empty(&cp->cp_send_queue)) && - send_gen == cp->cp_send_gen) { + send_gen == READ_ONCE(cp->cp_send_gen)) { rds_stats_inc(s_send_lock_queue_raced); if (batch_count < send_batch_count) goto restart; -- 2.9.3