Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-15 Thread santosh shilimkar

On 12/9/2015 10:42 AM, Christoph Hellwig wrote:

On Tue, Dec 08, 2015 at 07:52:03PM -0500, ira.weiny wrote:

Searching patchworks...

I'm a bit worried about the size of the patch and I would like to see it split
up for review.  But I agree Christoph's method is better long term.


I'd be happy to split it up if I could see a way to split it.  So if
anyone has an idea you're welcome!


Christoph, do you have this on github somewhere?  Perhaps it is split but I'm
not finding it on patchworks?


No need for github, we have much better (and older) git hosting sites :)

http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr


net/rds/ib.c   |  34 ++---
net/rds/iw.c   |  23 +--

For RDS changes,
Acked-by: Santosh Shilimkar

I will try and find some time to test them out. Thanks !!


Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-25 Thread santosh shilimkar

On 11/25/2015 1:00 AM, Christoph Hellwig wrote:

On Tue, Nov 24, 2015 at 01:54:02PM -0800, santosh shilimkar wrote:

As already indicated to Sagi [1], RDS IB FR support is work in
progress and I was hoping to get it ready for 4.5. There are a few
issues we found with one of the HCAs, and hence the progress
slowed down. Looking at where we are, the 4.6 merge window seems
realistic for me to get the RDS FR support in.


Ok.


Now on the iWARP transport itself, it has been a bit tough because
of the lack of hardware to test on. I have been requesting tests in
previous RDS patches and haven't seen any interest so far. If
this continues to be the trend, I might as well get rid of
RDS iWARP support after a couple of merge windows.


I'd say drop the current iWarp transport if it's not testable.  The
only real difference between IB and iWarp is the need to create
an MR for the RDMA READ sink, and we're much better off adding that into
the current IB transport if new iWarp users show up.


Agreed. I will probably do it along with the RDS IB FR support in 4.6.

Regards,
Santosh


Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-25 Thread santosh shilimkar

On 11/25/2015 10:22 AM, Or Gerlitz wrote:

On Wed, Nov 25, 2015 at 7:09 PM, santosh shilimkar
<santosh.shilim...@oracle.com> wrote:

As already indicated to Sagi [1], RDS IB FR support is work in
progress and I was hoping to get it ready for 4.5.


This is really good news! Can you please elaborate a bit on the
design changes this move introduces in RDS?


Yeah. It has been a bit of a pain point, since the need
was to keep the RDS design the same and retrofit the FR
support so that it can co-exist with the existing deployed
FMR code.

Leaving the details for the code review, but at a very high
level:

- Have to split the poll CQ handling so that
send + FR WR completions can be handled together.
The FR CQ handler and the reg/inv WR prep mark the MR state
as INVALID, VALID or STALE appropriately.
- Allocate 2X space on the WR and WC queues during queue setup.
- Manage the MR reg/inv based on the space available
in the FR WR ring (actually it is just a counter). This is a
bit tricky because RDS does MR operations via sendmsg()
as well as directly through the socket APIs, so it needs
co-ordination.

I am hoping the above remains true when the code actually
makes it to the list, but that is how things stand as
of now.
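
A rough, self-contained sketch of the MR-state and WR-ring-counter idea
described above. The RDS FR code had not been posted at this point, so every
name here (rds_ib_fr_state, rds_ib_fr_ring, rds_ib_fr_ring_get/put) is
hypothetical, and it is written as plain C11 rather than kernel code:

#include <stdatomic.h>
#include <stdbool.h>

enum rds_ib_fr_state {
        RDS_IB_FR_INVALID,      /* invalidated, free for a new FR registration */
        RDS_IB_FR_VALID,        /* registration completed, usable for RDMA */
        RDS_IB_FR_STALE,        /* completion error seen, must not be reused */
};

struct rds_ib_fr_mr {
        enum rds_ib_fr_state state;
        /* the real structure would also carry the ib_mr, sg list, etc. */
};

/* The "FR WR ring" is really just a counter sized for the extra 2X WR slots. */
struct rds_ib_fr_ring {
        atomic_int free_slots;
};

/* Reserve a slot before posting a registration or invalidation work request. */
static bool rds_ib_fr_ring_get(struct rds_ib_fr_ring *ring)
{
        if (atomic_fetch_sub(&ring->free_slots, 1) > 0)
                return true;
        atomic_fetch_add(&ring->free_slots, 1);  /* ring full: undo and fail */
        return false;
}

/* Called from the shared send + FR completion handler. */
static void rds_ib_fr_ring_put(struct rds_ib_fr_ring *ring,
                               struct rds_ib_fr_mr *mr,
                               bool wc_success, bool was_invalidate)
{
        if (!wc_success)
                mr->state = RDS_IB_FR_STALE;
        else
                mr->state = was_invalidate ? RDS_IB_FR_INVALID : RDS_IB_FR_VALID;
        atomic_fetch_add(&ring->free_slots, 1);
}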

Regards,
Santosh



Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-24 Thread santosh shilimkar

Hi Christoph,

On 11/23/2015 10:52 PM, Christoph Hellwig wrote:

On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:

On 11/23/2015 5:14 PM, Chuck Lever wrote:

FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.


I thought FMR support was about to be removed in the core.


Seems like everyone would love to kill it, but no one dares to do
it just yet.  Reasons to keep FMRs:

  - mthca doesn't support FRs but hasn't been staged out
  - RDS only supports FMRs for the IB transport (it does support FRs for
an entirely separate iWarp transport.  I'd love to know why we can't
just use that iWarp transport for IB as well).
  - mlx4 hardware might be slower with FRs than FMRs (Or mentioned this
in reply to the iSER remote invalidation series).

So at least for 4.5 we're unlikely to be able to get rid of it, due to
the RDS issue alone.  We'll then need performance numbers for mlx4,
and figure out how much we care about mthca.


As already indicated to Sagi [1], RDS IB FR support is work in
progress and I was hoping to get it ready for 4.5. There are a few
issues we found with one of the HCAs, and hence the progress
slowed down. Looking at where we are, the 4.6 merge window seems
realistic for me to get the RDS FR support in.

Now on the iWARP transport itself, it has been a bit tough because
of the lack of hardware to test on. I have been requesting tests in
previous RDS patches and haven't seen any interest so far. If
this continues to be the trend, I might as well get rid of
RDS iWARP support after a couple of merge windows.

Regards,
Santosh
[1] http://www.spinics.net/lists/linux-nfs/msg53909.html


[PATCH v2 04/14] RDS: Use per-bucket rw lock for bind hash-table

2015-09-30 Thread Santosh Shilimkar
One global lock protecting a hash-table with 1024 buckets isn't
efficient, and it shows up in massive systems with truckloads
of RDS sockets serving multiple databases. The
perf data clearly highlights the contention on the rw
lock in these massive workloads.

When the contention gets worse, the code gets into a state where
it decides to back off on the lock. So while it has interrupts disabled,
it sits there backing off and retrying the lock acquisition. This causes
the system to become sluggish and eventually all sorts of bad things happen.

The simple fix is to move the lock into the hash bucket and
use a per-bucket lock to improve the scalability.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/af_rds.c |  2 ++
 net/rds/bind.c   | 47 ---
 net/rds/rds.h|  1 +
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index dc08766..384ea1e 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -582,6 +582,8 @@ static int rds_init(void)
 {
int ret;
 
+   rds_bind_lock_init();
+
ret = rds_conn_init();
if (ret)
goto out;
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 166c605..bc6b93e 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -38,22 +38,27 @@
 #include 
 #include "rds.h"
 
+struct bind_bucket {
+   rwlock_t            lock;
+   struct hlist_head   head;
+};
+
 #define BIND_HASH_SIZE 1024
-static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_RWLOCK(rds_bind_lock);
+static struct bind_bucket bind_hash_table[BIND_HASH_SIZE];
 
-static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
+static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port)
 {
return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) &
  (BIND_HASH_SIZE - 1));
 }
 
 /* must hold either read or write lock (write lock for insert != NULL) */
-static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
+static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket,
+   __be32 addr, __be16 port,
struct rds_sock *insert)
 {
struct rds_sock *rs;
-   struct hlist_head *head = hash_to_bucket(addr, port);
+   struct hlist_head *head = &bucket->head;
u64 cmp;
u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
 
@@ -91,10 +96,11 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
struct rds_sock *rs;
unsigned long flags;
+   struct bind_bucket *bucket = hash_to_bucket(addr, port);
 
-   read_lock_irqsave(&rds_bind_lock, flags);
-   rs = rds_bind_lookup(addr, port, NULL);
-   read_unlock_irqrestore(&rds_bind_lock, flags);
+   read_lock_irqsave(&bucket->lock, flags);
+   rs = rds_bind_lookup(bucket, addr, port, NULL);
+   read_unlock_irqrestore(&bucket->lock, flags);
 
if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
rds_sock_put(rs);
@@ -113,6 +119,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
unsigned long flags;
int ret = -EADDRINUSE;
u16 rover, last;
+   struct bind_bucket *bucket;
 
if (*port != 0) {
rover = be16_to_cpu(*port);
@@ -122,13 +129,15 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
last = rover - 1;
}
 
-   write_lock_irqsave(&rds_bind_lock, flags);
-
do {
struct rds_sock *rrs;
if (rover == 0)
rover++;
-   rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs);
+
+   bucket = hash_to_bucket(addr, cpu_to_be16(rover));
+   write_lock_irqsave(&bucket->lock, flags);
+   rrs = rds_bind_lookup(bucket, addr, cpu_to_be16(rover), rs);
+   write_unlock_irqrestore(&bucket->lock, flags);
if (!rrs) {
*port = rs->rs_bound_port;
ret = 0;
@@ -140,16 +149,16 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
}
} while (rover++ != last);
 
-   write_unlock_irqrestore(&rds_bind_lock, flags);
-
return ret;
 }
 
 void rds_remove_bound(struct rds_sock *rs)
 {
unsigned long flags;
+   struct bind_bucket *bucket =
+   hash_to_bucket(rs->rs_bound_addr, rs->rs_bound_port);
 
-   write_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&bucket->lock, flags);
 
if (rs->rs_bound_addr) {
rdsdebug("rs %p unbinding from %pI4:%d\n",
@@ -161,7 +170,7 @@ void rds_remove_bound(struct rds_sock *rs)
rs->rs_bound_addr = 0;
}
 
-   write_unlock_irqrestore(_bind_loc

[PATCH v2 14/14] RDS: IB: split mr pool to improve 8K messages performance

2015-09-30 Thread Santosh Shilimkar
8K message sizes are a pretty important use case for current RDS
workloads, so we make provision to have 8K MRs available from the pool.
Based on the number of SGs in the RDS message, we pick a pool to use.

Also, to make sure that we don't under-utilise MRs when, say, 8K messages
dominate, which could lead to the 8K pool being exhausted, we fall back
to the 1M pool till the 8K pool recovers for use.

This helps to push at least ~55 kB/s of bidirectional data, which
is a nice improvement.
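
A rough sketch of that pool-selection idea (not the patch code itself: the
helper name and the 8K page threshold are hypothetical, the item_count and
max_items_soft fields are assumed from the existing pool structure, while
mr_8k_pool/mr_1m_pool match the diff below):

/* Hypothetical threshold: an "8K" message spans at most two pages. */
#define RDS_8K_MSG_NPAGES  2

static struct rds_ib_mr_pool *rds_ib_pick_pool(struct rds_ib_device *rds_ibdev,
                                               int npages)
{
        /* Small (<= 8K) messages come from the dedicated 8K pool. */
        struct rds_ib_mr_pool *pool = (npages <= RDS_8K_MSG_NPAGES) ?
                        rds_ibdev->mr_8k_pool : rds_ibdev->mr_1m_pool;

        /* If 8K requests dominate and that pool runs dry, temporarily
         * borrow from the 1M pool until the 8K pool recovers.
         */
        if (pool == rds_ibdev->mr_8k_pool &&
            atomic_read(&pool->item_count) >= pool->max_items_soft)
                pool = rds_ibdev->mr_1m_pool;

        return pool;
}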

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/ib.c   |  47 +
 net/rds/ib.h   |  43 ---
 net/rds/ib_rdma.c  | 101 +
 net/rds/ib_stats.c |  18 ++
 4 files changed, 147 insertions(+), 62 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 883813a..a833ab7 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -43,14 +43,14 @@
 #include "rds.h"
 #include "ib.h"
 
-static unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE;
-unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned 
MRs */
+unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
+unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(fmr_pool_size, int, 0444);
-MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA");
-module_param(fmr_message_size, int, 0444);
-MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer");
+module_param(rds_ib_fmr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
+module_param(rds_ib_fmr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -97,8 +97,10 @@ static void rds_ib_dev_free(struct work_struct *work)
struct rds_ib_device *rds_ibdev = container_of(work,
struct rds_ib_device, free_work);
 
-   if (rds_ibdev->mr_pool)
-   rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
+   if (rds_ibdev->mr_8k_pool)
+   rds_ib_destroy_mr_pool(rds_ibdev->mr_8k_pool);
+   if (rds_ibdev->mr_1m_pool)
+   rds_ib_destroy_mr_pool(rds_ibdev->mr_1m_pool);
if (rds_ibdev->pd)
ib_dealloc_pd(rds_ibdev->pd);
 
@@ -148,9 +150,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-   rds_ibdev->max_fmrs = dev_attr->max_mr ?
-   min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
-   fmr_pool_size;
+   rds_ibdev->max_1m_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, (dev_attr->max_mr / 2),
+ rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+
+   rds_ibdev->max_8k_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, ((dev_attr->max_mr / 2) * RDS_MR_8K_SCALE),
+ rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = dev_attr->max_qp_rd_atom;
@@ -162,12 +168,25 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
-   if (IS_ERR(rds_ibdev->mr_pool)) {
-   rds_ibdev->mr_pool = NULL;
+   rds_ibdev->mr_1m_pool =
+   rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_1M_POOL);
+   if (IS_ERR(rds_ibdev->mr_1m_pool)) {
+   rds_ibdev->mr_1m_pool = NULL;
goto put_dev;
}
 
+   rds_ibdev->mr_8k_pool =
+   rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_8K_POOL);
+   if (IS_ERR(rds_ibdev->mr_8k_pool)) {
+   rds_ibdev->mr_8k_pool = NULL;
+   goto put_dev;
+   }
+
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+dev_attr->max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
+rds_ibdev->max_8k_fmrs);
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 3a8cd31..f17d095 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -9,8 +9,11 @@
 #include "rds.h"
 #include "

[PATCH v2 07/14] RDS: IB: ack more receive completions to improve performance

2015-09-30 Thread Santosh Shilimkar
For better performance, we split the receive completion IRQ handler. That
lets us acknowledge several work completion (WC) events in one call. We also
limit the WCs to a maximum of 32 to avoid latency. Acknowledging several
completions in one call instead of one call per completion provides better
performance, since fewer mutual-exclusion locks need to be taken.

In the next patch, the send completion handling is also split; it re-uses
poll_cq() and hence the code is moved to ib_cm.c.
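
The batched handling boils down to draining the CQ in chunks. A minimal
sketch of that shape (illustrative only; ib_poll_cq() is the standard verbs
call, but the helper name and the callback here are not from the patch):

#include <rdma/ib_verbs.h>

#define RDS_IB_WC_MAX  32

static void poll_cq_batched(struct ib_cq *cq, struct ib_wc *wcs,
                            void (*handle)(struct ib_wc *wc))
{
        int nr, i;

        /* Drain the CQ in batches of up to RDS_IB_WC_MAX completions so one
         * tasklet run services many completions, instead of re-arming the
         * handler for every single work completion.
         */
        while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0)
                for (i = 0; i < nr; i++)
                        handle(&wcs[i]);
}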

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/ib.h   |  28 +--
 net/rds/ib_cm.c|  70 ++-
 net/rds/ib_recv.c  | 136 +++--
 net/rds/ib_stats.c |   3 +-
 4 files changed, 132 insertions(+), 105 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f1fd5ff..727759b 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -24,6 +24,8 @@
 
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
+#define RDS_IB_WC_MAX  32
+
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
 
@@ -89,6 +91,20 @@ struct rds_ib_work_ring {
atomic_t        w_free_ctr;
 };
 
+/* Rings are posted with all the allocations they'll need to queue the
+ * incoming message to the receiving socket so this can't fail.
+ * All fragments start with a header, so we can make sure we're not receiving
+ * garbage, and we can tell a small 8 byte fragment from an ACK frame.
+ */
+struct rds_ib_ack_state {
+   u64 ack_next;
+   u64 ack_recv;
+   unsigned int    ack_required:1;
+   unsigned int    ack_next_valid:1;
+   unsigned int    ack_recv_valid:1;
+};
+
+
 struct rds_ib_device;
 
 struct rds_ib_connection {
@@ -102,6 +118,10 @@ struct rds_ib_connection {
struct ib_pd    *i_pd;
struct ib_cq    *i_send_cq;
struct ib_cq    *i_recv_cq;
+   struct ib_wc    i_recv_wc[RDS_IB_WC_MAX];
+
+   /* interrupt handling */
+   struct tasklet_struct   i_recv_tasklet;
 
/* tx */
struct rds_ib_work_ring i_send_ring;
@@ -112,7 +132,6 @@ struct rds_ib_connection {
atomic_t        i_signaled_sends;
 
/* rx */
-   struct tasklet_struct   i_recv_tasklet;
struct mutex    i_recv_mutex;
struct rds_ib_work_ring i_recv_ring;
struct rds_ib_incoming  *i_ibinc;
@@ -199,13 +218,14 @@ struct rds_ib_statistics {
uint64_t    s_ib_connect_raced;
uint64_t    s_ib_listen_closed_stale;
uint64_t    s_ib_tx_cq_call;
+   uint64_t    s_ib_evt_handler_call;
+   uint64_t    s_ib_tasklet_call;
uint64_t    s_ib_tx_cq_event;
uint64_t    s_ib_tx_ring_full;
uint64_t    s_ib_tx_throttle;
uint64_t    s_ib_tx_sg_mapping_failure;
uint64_t    s_ib_tx_stalled;
uint64_t    s_ib_tx_credit_updates;
-   uint64_t    s_ib_rx_cq_call;
uint64_t    s_ib_rx_cq_event;
uint64_t    s_ib_rx_ring_empty;
uint64_t    s_ib_rx_refill_from_cq;
@@ -324,7 +344,8 @@ void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
-void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
+struct rds_ib_ack_state *state);
 void rds_ib_recv_tasklet_fn(unsigned long data);
 void rds_ib_recv_init_ring(struct rds_ib_connection *ic);
 void rds_ib_recv_clear_ring(struct rds_ib_connection *ic);
@@ -332,6 +353,7 @@ void rds_ib_recv_init_ack(struct rds_ib_connection *ic);
 void rds_ib_attempt_ack(struct rds_ib_connection *ic);
 void rds_ib_ack_send_complete(struct rds_ib_connection *ic);
 u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic);
+void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, int ack_required);
 
 /* ib_ring.c */
 void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 9043f5c..28e0979 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -216,6 +216,72 @@ static void rds_ib_cq_event_handler(struct ib_event 
*event, void *data)
 event->event, ib_event_msg(event->event), data);
 }
 
+/* Plucking the oldest entry from the ring can be done concurrently with
+ * the thread refilling the ring.  Each ring operation is protected by
+ * spinlocks and the transient state of refilling doesn't change the
+ * recording of which entry is oldest.
+ *
+ * This relies on IB only calling one cq comp_handler for each cq so that
+ * there will only be one caller of rds_recv_incoming() per RDS connection.

[PATCH v2 13/14] RDS: IB: use max_mr from HCA caps than max_fmr

2015-09-30 Thread Santosh Shilimkar
All HCA drivers seem to populate the max_mr caps, and only a few of
them populate both max_mr and max_fmr.

Hence update the RDS code to make use of max_mr.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/ib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 2d3f2ab..883813a 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -148,8 +148,8 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-   rds_ibdev->max_fmrs = dev_attr->max_fmr ?
-   min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) :
+   rds_ibdev->max_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
fmr_pool_size;
 
rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
-- 
1.9.1



[PATCH v2 09/14] RDS: IB: handle rds_ibdev release case instead of crashing the kernel

2015-09-30 Thread Santosh Shilimkar
From: Santosh Shilimkar <ssant...@kernel.org>

Just in case we are still handling the QP receive completion while the
rds_ibdev is released, drop the connection instead of crashing the kernel.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
---
 net/rds/ib_cm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8f51d0d..2b2370e 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -285,7 +285,8 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
struct rds_ib_device *rds_ibdev = ic->rds_ibdev;
struct rds_ib_ack_state state;
 
-   BUG_ON(!rds_ibdev);
+   if (!rds_ibdev)
+   rds_conn_drop(conn);
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-- 
1.9.1



[PATCH v2 05/14] RDS: defer the over_batch work to send worker

2015-09-30 Thread Santosh Shilimkar
Currently the process gives up if its send work is over the batch limit.
The work queue will get kicked to finish off any other requests.
This fixes the remainder condition from commit 443be0e5affe ("RDS: make
sure not to loop forever inside rds_send_xmit").

The restart condition is only for the case where we reached the
over_batch code for some other reason, so we just retry once more
before giving up.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/send.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 4df61a5..f1e709c 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -423,7 +423,9 @@ over_batch:
!list_empty(&conn->c_send_queue)) &&
send_gen == conn->c_send_gen) {
rds_stats_inc(s_send_lock_queue_raced);
-   goto restart;
+   if (batch_count < 1024)
+   goto restart;
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
}
}
 out:
-- 
1.9.1



[PATCH v2 06/14] RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL

2015-09-30 Thread Santosh Shilimkar
In the transport-independent rds_sendmsg(), we shouldn't make decisions based
on RDS_LL_SEND_FULL, which is used to manage the ring for RDMA-based
transports. We can safely issue rds_send_xmit() and then use its
return value to decide on deferred work. This will also fix
the scenario where at times we see connections stuck with
the LL_SEND_FULL bit set and never cleared.

We kick krdsd any time we see -ENOMEM or -EAGAIN from the
ring allocation code.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/send.c| 10 ++
 net/rds/threads.c |  2 ++
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index f1e709c..9d8b52d 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1122,8 +1122,9 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
 */
rds_stats_inc(s_send_queued);
 
-   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-   rds_send_xmit(conn);
+   ret = rds_send_xmit(conn);
+   if (ret == -ENOMEM || ret == -EAGAIN)
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return payload_len;
@@ -1179,8 +1180,9 @@ rds_send_pong(struct rds_connection *conn, __be16 dport)
rds_stats_inc(s_send_queued);
rds_stats_inc(s_send_pong);
 
-   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-   queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+   ret = rds_send_xmit(conn);
+   if (ret == -ENOMEM || ret == -EAGAIN)
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return 0;
diff --git a/net/rds/threads.c b/net/rds/threads.c
index dc2402e..454aa6d 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -162,7 +162,9 @@ void rds_send_worker(struct work_struct *work)
int ret;
 
if (rds_conn_state(conn) == RDS_CONN_UP) {
+   clear_bit(RDS_LL_SEND_FULL, &conn->c_flags);
ret = rds_send_xmit(conn);
+   cond_resched();
rdsdebug("conn %p ret %d\n", conn, ret);
switch (ret) {
case -EAGAIN:
-- 
1.9.1



[PATCH v2 02/14] RDS: make socket bind/release locking scheme simple and more efficient

2015-09-30 Thread Santosh Shilimkar
The RDS bind and release locking scheme is very inefficient. It
uses RCU for maintaining the bind hash-table, which is great, but
it also needs to hold a spinlock for [add/remove]_bound(). So
for the overall use case, the hash-table's concurrent speedup doesn't pay off.
In fact the blocking nature of synchronize_rcu() makes the RDS
socket shutdown too slow, which hurts RDS performance since
connection shutdown and re-connect happen quite often to
maintain the RC part of the protocol.

So we make the locking scheme simpler and more efficient by
replacing the spinlocks with reader/writer locks and getting rid
of RCU for the bind hash-table.

In a subsequent patch, we also convert the global lock to a per-bucket
lock to reduce the global lock contention.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/af_rds.c |  6 --
 net/rds/bind.c   | 35 +++
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a2f28a6..dc08766 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -72,13 +72,7 @@ static int rds_release(struct socket *sock)
rds_clear_recv_queue(rs);
rds_cong_remove_socket(rs);
 
-   /*
-* the binding lookup hash uses rcu, we need to
-* make sure we synchronize_rcu before we free our
-* entry
-*/
rds_remove_bound(rs);
-   synchronize_rcu();
 
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
diff --git a/net/rds/bind.c b/net/rds/bind.c
index dd666fb..01989e2 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -40,7 +40,7 @@
 
 #define BIND_HASH_SIZE 1024
 static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_SPINLOCK(rds_bind_lock);
+static DEFINE_RWLOCK(rds_bind_lock);
 
 static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
 {
@@ -48,6 +48,7 @@ static struct hlist_head *hash_to_bucket(__be32 addr, __be16 
port)
  (BIND_HASH_SIZE - 1));
 }
 
+/* must hold either read or write lock (write lock for insert != NULL) */
 static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
struct rds_sock *insert)
 {
@@ -56,30 +57,24 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 
port,
u64 cmp;
u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
 
-   rcu_read_lock();
-   hlist_for_each_entry_rcu(rs, head, rs_bound_node) {
+   hlist_for_each_entry(rs, head, rs_bound_node) {
cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
  be16_to_cpu(rs->rs_bound_port);
 
-   if (cmp == needle) {
-   rcu_read_unlock();
+   if (cmp == needle)
return rs;
-   }
}
-   rcu_read_unlock();
 
if (insert) {
/*
 * make sure our addr and port are set before
-* we are added to the list, other people
-* in rcu will find us as soon as the
-* hlist_add_head_rcu is done
+* we are added to the list.
 */
insert->rs_bound_addr = addr;
insert->rs_bound_port = port;
rds_sock_addref(insert);
 
-   hlist_add_head_rcu(&insert->rs_bound_node, head);
+   hlist_add_head(&insert->rs_bound_node, head);
}
return NULL;
 }
@@ -93,8 +88,11 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 
port,
 struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
struct rds_sock *rs;
+   unsigned long flags;
 
+   read_lock_irqsave(&rds_bind_lock, flags);
rs = rds_bind_lookup(addr, port, NULL);
+   read_unlock_irqrestore(&rds_bind_lock, flags);
 
if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
rds_sock_addref(rs);
@@ -103,6 +101,7 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 
rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
ntohs(port));
+
return rs;
 }
 
@@ -121,7 +120,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
last = rover - 1;
}
 
-   spin_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&rds_bind_lock, flags);
 
do {
if (rover == 0)
@@ -135,7 +134,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
}
} while (rover++ != last);
 
-   spin_unlock_irqrestore(&rds_bind_lock, flags);
+   write_unlock_irqrestore(&rds_bind_lock, flags);
 
return ret;
 }
@@ -144,19 +143,19 @@ void rds_remove_bound(struct rds_sock *rs)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&rds_bind_lock, flags);
 
if (rs

[PATCH v2 08/14] RDS: IB: split send completion handling and do batch ack

2015-09-30 Thread Santosh Shilimkar
Similar to what we did with the receive CQ completion handling, we split
the transmit completion handler so that we can implement batched
work completion handling.

We re-use the poll_cq() routine and make use of RDS_IB_SEND_OP to
distinguish send completions from receive completions and invoke the
corresponding event handler.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/ib.h   |   6 ++-
 net/rds/ib_cm.c|  45 --
 net/rds/ib_send.c  | 110 +
 net/rds/ib_stats.c |   1 -
 net/rds/send.c |   1 +
 5 files changed, 98 insertions(+), 65 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 727759b..3a8cd31 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -25,6 +25,7 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
+#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
@@ -118,9 +119,11 @@ struct rds_ib_connection {
struct ib_pd    *i_pd;
struct ib_cq    *i_send_cq;
struct ib_cq    *i_recv_cq;
+   struct ib_wc    i_send_wc[RDS_IB_WC_MAX];
struct ib_wc    i_recv_wc[RDS_IB_WC_MAX];
 
/* interrupt handling */
+   struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
 
/* tx */
@@ -217,7 +220,6 @@ struct rds_ib_device {
 struct rds_ib_statistics {
uint64_t    s_ib_connect_raced;
uint64_t    s_ib_listen_closed_stale;
-   uint64_t    s_ib_tx_cq_call;
uint64_t    s_ib_evt_handler_call;
uint64_t    s_ib_tasklet_call;
uint64_t    s_ib_tx_cq_event;
@@ -371,7 +373,7 @@ extern wait_queue_head_t rds_ib_ring_empty_wait;
 void rds_ib_xmit_complete(struct rds_connection *conn);
 int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
unsigned int hdr_off, unsigned int sg, unsigned int off);
-void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 void rds_ib_send_init_ring(struct rds_ib_connection *ic);
 void rds_ib_send_clear_ring(struct rds_ib_connection *ic);
 int rds_ib_xmit_rdma(struct rds_connection *conn, struct rm_rdma_op *op);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 28e0979..8f51d0d 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -250,11 +250,34 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+
+   if (wc->wr_id & RDS_IB_SEND_OP)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
}
}
 }
 
+static void rds_ib_tasklet_fn_send(unsigned long data)
+{
+   struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
+   struct rds_connection *conn = ic->conn;
+   struct rds_ib_ack_state state;
+
+   rds_ib_stats_inc(s_ib_tasklet_call);
+
+   memset(&state, 0, sizeof(state));
+   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
+   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+
+   if (rds_conn_up(conn) &&
+   (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
+   test_bit(0, &conn->c_map_queued)))
+   rds_send_xmit(ic->conn);
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -304,6 +327,18 @@ static void rds_ib_qp_event_handler(struct ib_event 
*event, void *data)
}
 }
 
+static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, void *context)
+{
+   struct rds_connection *conn = context;
+   struct rds_ib_connection *ic = conn->c_transport_data;
+
+   rdsdebug("conn %p cq %p\n", conn, cq);
+
+   rds_ib_stats_inc(s_ib_evt_handler_call);
+
+   tasklet_schedule(&ic->i_send_tasklet);
+}
+
 /*
  * This needs to be very careful to not leave IS_ERR pointers around for
  * cleanup to trip over.
@@ -337,7 +372,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_pd = rds_ibdev->pd;
 
cq_attr.cqe = ic->i_send_ring.w_nr + 1;
-   ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
+
+   ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 r

[PATCH v2 01/14] RDS: use kfree_rcu in rds_ib_remove_ipaddr

2015-09-30 Thread Santosh Shilimkar
synchronize_rcu() unnecessarily slows down the socket shutdown
path. It is used just to kfree() the IP addresses in rds_ib_remove_ipaddr(),
which is a perfect use case for kfree_rcu().

So let's use that to gain some speedup.

Signed-off-by: Santosh Shilimkar <ssant...@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
 net/rds/ib.h  | 1 +
 net/rds/ib_rdma.c | 6 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index aae60fd..f1fd5ff 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -164,6 +164,7 @@ struct rds_ib_connection {
 struct rds_ib_ipaddr {
struct list_headlist;
__be32  ipaddr;
+   struct rcu_head rcu;
 };
 
 struct rds_ib_device {
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 251d1ce..872f523 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -159,10 +159,8 @@ static void rds_ib_remove_ipaddr(struct rds_ib_device 
*rds_ibdev, __be32 ipaddr)
}
spin_unlock_irq(_ibdev->spinlock);
 
-   if (to_free) {
-   synchronize_rcu();
-   kfree(to_free);
-   }
+   if (to_free)
+   kfree_rcu(to_free, rcu);
 }
 
 int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
-- 
1.9.1



[PATCH v2 00/14] RDS: connection scalability and performance improvements

2015-09-30 Thread Santosh Shilimkar
[v2]:
Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from the
earlier version [1]. I plan to address the hash table scalability using
re-sizable hash tables, as suggested by David Laight and David Miller [2].

This series addresses RDS connection bottlenecks on massive workloads and
improves the RDMA performance by almost 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability, where several
hundred thousand end points and tens of thousands of local processes
are operating over tens of thousands of sockets. Being RC (reliable connection),
socket bind and release happen very often, and any inefficiency in the
bind hash look-ups hurts the overall system performance. The RDS bind hash-table
uses a global spinlock, which is the biggest bottleneck. To make matters worse,
it uses RCU inside the global lock for the hash buckets.
This is addressed by simply using a per-bucket rw lock, which makes the
locking simple and very efficient. The hash table size is still an issue and
I plan to address it by using re-sizable hash tables as suggested on the list.

For the RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve this. RDS 8K messages being one of the
key use cases, the MR pool is adapted to have 8K MRs along with the default 1M
MRs. And while doing this, a few fixes and a couple of bottlenecks seen with
rds_sendmsg() are addressed.

The series applies against 4.3-rc1 as well as net-next. It is tested on Oracle
hardware with an IB fabric for both bcopy as well as RDMA mode. RDS TCP is
tested with an iXGB NIC. Like last time, the iWARP transport is untested with
these changes. The patchset is also available at the git repo below:

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v2

As a side note, the IB HCA driver I used for testing misses at least 3
important upstream patches needed to see the full-blown IB performance, and I
am hoping to get those into mainline with their help.

Santosh Shilimkar (14):
  RDS: use kfree_rcu in rds_ib_remove_ipaddr
  RDS: make socket bind/release locking scheme simple and more efficient
  RDS: fix rds_sock reference bug while doing bind
  RDS: Use per-bucket rw lock for bind hash-table
  RDS: defer the over_batch work to send worker
  RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL
  RDS: IB: ack more receive completions to improve performance
  RDS: IB: split send completion handling and do batch ack
  RDS: IB: handle rds_ibdev release case instead of crashing the kernel
  RDS: IB: fix the rds_ib_fmr_wq kick call
  RDS: IB: use already available pool handle from ibmr
  RDS: IB: mark rds_ib_fmr_wq static
  RDS: IB: use max_mr from HCA caps than max_fmr
  RDS: IB: split mr pool to improve 8K messages performance

 net/rds/af_rds.c   |   8 +---
 net/rds/bind.c |  76 ++
 net/rds/ib.c   |  47 --
 net/rds/ib.h   |  78 +++---
 net/rds/ib_cm.c| 114 ++--
 net/rds/ib_rdma.c  | 116 ++---
 net/rds/ib_recv.c  | 136 +++--
 net/rds/ib_send.c  | 110 ---
 net/rds/ib_stats.c |  22 +
 net/rds/rds.h  |   1 +
 net/rds/send.c |  15 --
 net/rds/threads.c  |   2 +
 12 files changed, 445 insertions(+), 280 deletions(-)

-- 
1.9.1

Regards,
Santosh

[1] https://lkml.org/lkml/2015/9/19/384
[2] https://lkml.org/lkml/2015/9/21/828





Re: [PATCH v1 00/24] New fast registration API

2015-09-22 Thread santosh shilimkar

On 9/22/2015 12:56 AM, Sagi Grimberg wrote:

On 9/22/2015 10:19 AM, Sagi Grimberg wrote:


As mentioned earlier, I have a WIP RDS fastreg branch [3]
which is functional (at least I can RDMA messages across
nodes ;-)).


Nice!


So merging [2] and [3], I created [4] and applied
a delta change based on your other patches. I saw an ib_post_send()
failure, with my HCA driver returning '-EINVAL'. I didn't
debug it further, but at least opcode and num_sge were set
correctly, so I shouldn't have seen it. So I did a memset()
on reg_wr, which seems to have fixed the ib_post_send()
failure.


Yep - that was my fault. When converting the ULPs I optimized by
removing the memset, but I forgot to set reg_wr.wr.next = NULL where
the ULP needed it. This caused the driver to read a second bogus work
request. Steve just reported this as well, so I'll fix that in v2.


Ahh, right. There can be a chain of WRs.



But I got remote access errors, which tells me that I
have messed up the setup (rkey, sge setup or access flags).


One thing that pops out is that in the old API the MR was registered
with iova_start = 0 (which is probably what was sent to the peer),
but in the new API the iova is implicitly sg_dma_address(&sg[0]).

The registered MR holds these attributes in:
mr->rkey
mr->iova
mr->length

These should be passed to a peer to perform rdma.


right.


Ohh, I just read the RDS 3.1 specification (for the first time..) and I
noticed that the RDS 3.1 header extension contains only a 32-bit offset
parameter. Why is that, anyway? Why not 64 bit, so it can be a valid mapped
address? Also the code doesn't use it at all and always passes 0 (which
is buggy if sg[0] has an offset from a page).

This won't work with the proposed API, as the iova is 64 bit (as all other
existing RDMA protocols use 64-bit addresses).

In any event, I'd much rather add ib_map_mr_sg_zbva() just for RDS
to use instead of polluting the API with an iova argument, but I think
that the RDS spec can be updated to use 64-bit offsets and align with all
other RDMA protocols (it has enough space in h_exthdr, which is 128 bit).


RDS assumes it's an offset and hence it has been used as 32 bit. I need
to look through this carefully though, because all the existing
applications use this header format. There is also RDMA read/write
byte information sent as part of the header (not in upstream code yet),
so the space might be less. But point taken. Will look into it.


I was thinking of:
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e7e0251..61fcab4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3033,6 +3033,21 @@ int ib_map_mr_sg(struct ib_mr *mr,
  unsigned int sg_nents,
  unsigned int page_size);

+static inline int
+ib_map_mr_sg_zbva(struct ib_mr *mr,
+ struct scatterlist *sg,
+ unsigned int sg_nents,
+ unsigned int page_size)
+{
+   int rc;
+
+   rc = ib_map_mr_sg(mr, sg, sg_nents, page_size);
+   if (likely(!rc))
+   mr->iova &= ((u64)page_size - 1);
+
+   return rc;
+}
+
  int ib_sg_to_pages(struct ib_mr *mr,
struct scatterlist *sgl,
unsigned int sg_nents,
--

Thoughts?

Santosh, can you use that one instead and let us know if
it resolves your issue?


Unfortunately this change still doesn't fix the issue.


I think you should make sure to correctly construct the
h_exthdr with: rds_rdma_make_cookie(mr->rkey, (u32)mr->iova)


Will look into it. Thanks for suggestion.

Regards,
Santosh


Re: [PATCH 00/15] RDS: connection scalability and performance improvements

2015-09-21 Thread santosh shilimkar

On 9/20/2015 1:37 AM, Sagi Grimberg wrote:

On 9/20/2015 2:04 AM, Santosh Shilimkar wrote:

This series addresses RDS connection bottlenecks on massive workloads and
improves the RDMA performance by almost 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability, where several
hundred thousand end points and tens of thousands of local processes
are operating over tens of thousands of sockets. Being RC (reliable connection),
socket bind and release happen very often, and any inefficiency in the
bind hash look-ups hurts the overall system performance. The RDS bind
hash-table uses a global spinlock, which is the biggest bottleneck. To make
matters worse, it uses RCU inside the global lock for the hash buckets.
This is addressed by simply using a per-bucket rw lock, which makes the
locking simple and very efficient. The hash table size is also scaled up
accordingly.

For the RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve this. RDS 8K messages being one of the
key use cases, the MR pool is adapted to have 8K MRs along with the default 1M
MRs. And while doing this, a few fixes and a couple of bottlenecks seen with
rds_sendmsg() are addressed.


Hi Santosh,

I think you can get a more effective code review if you CC the
linux-rdma mailing list.


I will do that from next time. Thanks Sagi !!

Regards,
Santosh


Re: [PATCH v2 12/12] rds/ib: Remove ib_get_dma_mr calls

2015-08-13 Thread santosh shilimkar

On 7/30/2015 4:22 PM, Jason Gunthorpe wrote:

The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
  net/rds/ib.c  | 8 
  net/rds/ib.h  | 2 --
  net/rds/ib_cm.c   | 4 +---
  net/rds/ib_recv.c | 6 +++---
  net/rds/ib_send.c | 8 
  5 files changed, 8 insertions(+), 20 deletions(-)


I wanted to try this series earlier but couldn't because of
broken RDS RDMA. Now that I have that fixed with a bunch of patches soon
to be posted, I tried the series. It works as expected.

The RDS change also looks straightforward since ib_get_dma_mr()
is being used for local write.

So feel free to add below tag if you need one.

Tested-Acked-by: Santosh Shilimkar santosh.shilim...@oracle.com

