RE: ibv_post_send/recv kernel path optimizations

2011-01-24 Thread Walukiewicz, Miroslaw
Sean,

The assumption here is that the user-space library prepares the vendor-specific 
data in user space using a shared page allocated by the vendor driver. Here, 
information about posted buffers is passed not through ib_wr but via the 
shared page. That is the reason why the pointers indicating ib_wr in post_send are 
not set; they are not passed to the kernel at all, to avoid copying them in the kernel. 

As there is no ib_wr structure in the kernel, there is no reference to bad_wr or to 
a buffer that failed in this context, so the only reasonable information about 
operation state that can be passed via bad_wr is binary: 
operation successful (bad_wr == 0) or not (bad_wr != 0). 

Instead of using a specific case for RAW_QP, it is possible to pass some 
information about the buffer-posting method by
enum ib_qp_create_flags {
	IB_QP_CREATE_IPOIB_UD_LSO		= 1 << 0,
	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1
};
extending it with IB_QP_CREATE_USE_SHARED_PAGE	= 1 << 2.

In that case the new method could be used for any type of QP and will be backward 
compatible.

Regards,

Mirek
-Original Message-
From: Hefty, Sean 
Sent: Friday, January 21, 2011 4:50 PM
To: Walukiewicz, Miroslaw; Roland Dreier
Cc: Or Gerlitz; Jason Gunthorpe; linux-rdma@vger.kernel.org
Subject: RE: ibv_post_send/recv kernel path optimizations

 + qp = idr_read_qp(cmd.qp_handle, file->ucontext);
 + if (!qp)
 + goto out_raw_qp;
 +
 + if (qp->qp_type == IB_QPT_RAW_ETH) {
 + resp.bad_wr = 0;
 + ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code.  It 
adds assumptions to uverbs about the underlying driver implementation and ties 
that to the QP type.  I don't know if it makes more sense to key off something 
in the cmd or define some other property of the QP, but the NULL parameters 
into post_send are non-intuitive.
 
 + if (ret)
 + resp.bad_wr = cmd.wr_count;

Is this always the case?

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: ibv_post_send/recv kernel path optimizations

2011-01-21 Thread Walukiewicz, Miroslaw
Roland, 

You are right that the idr implementation introduces an insignificant change in 
performance. I made a version with idr and semaphore usage and I see a 
minimal change compared to the hash table. Now only a shared page is used instead 
of kmalloc and copy_to_user.

I simplified the changes to uverbs and achieved what I wanted in performance. Now 
the patch looks like below.

Are these changes acceptable for kernel.org?

Regards,

Mirek

--- ../SOURCES_19012011/ofa_kernel-1.5.3/drivers/infiniband/core/uverbs_cmd.c   
2011-01-19 05:37:55.0 +0100
+++ ofa_kernel-1.5.3_idr_qp/drivers/infiniband/core/uverbs_cmd.c
2011-01-21 04:10:07.0 +0100
@@ -1449,15 +1449,29 @@
 
if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
return -EINVAL;
+   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+   if (!qp)
+   goto out_raw_qp;
+
+   if (qp->qp_type == IB_QPT_RAW_ETH) {
+   resp.bad_wr = 0;
+   ret = qp->device->post_send(qp, NULL, NULL);
+   if (ret)
+   resp.bad_wr = cmd.wr_count;
+
+   if (copy_to_user((void __user *) (unsigned long)
+   cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   put_qp_read(qp);
+   goto out_raw_qp;
+   }
 
user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
if (!user_wr)
return -ENOMEM;
 
-   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-   if (!qp)
-   goto out;
-
is_ud = qp-qp_type == IB_QPT_UD;
sg_ind = 0;
last = NULL;
@@ -1577,9 +1591,8 @@
wr = next;
}
 
-out:
kfree(user_wr);
-
+out_raw_qp:
return ret ? ret : in_len;
 }
 
@@ -1681,16 +1694,31 @@
if (copy_from_user(&cmd, buf, sizeof cmd))
return -EFAULT;
 
+   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+   if (!qp)
+   goto out_raw_qp;
+
+   if (qp->qp_type == IB_QPT_RAW_ETH) {
+   resp.bad_wr = 0;
+   ret = qp->device->post_recv(qp, NULL, NULL);
+   if (ret)
+   resp.bad_wr = cmd.wr_count;
+
+   if (copy_to_user((void __user *) (unsigned long)
+   cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   put_qp_read(qp);
+   goto out_raw_qp;
+   }
+
wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
   in_len - sizeof cmd, cmd.wr_count,
   cmd.sge_count, cmd.wqe_size);
if (IS_ERR(wr))
return PTR_ERR(wr);
 
-   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-   if (!qp)
-   goto out;
-
resp.bad_wr = 0;
ret = qp->device->post_recv(qp, wr, &bad_wr);
 
@@ -1707,13 +1735,13 @@
 &resp, sizeof resp))
ret = -EFAULT;
 
-out:
while (wr) {
next = wr->next;
kfree(wr);
wr = next;
}
 
+out_raw_qp:
return ret ? ret : in_len;
 }




-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Monday, January 10, 2011 9:38 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

  You are right that most of the speed-up is coming from avoiding semaphores, 
  but not only.
  
  From the oprof traces, the semaphores made half of the difference.
  
  The next one was copy_from_user and kmalloc/kfree usage (in my proposal the 
  shared-page method is used instead)

OK, but in any case the switch from idr to hash table seems to be
insignificant.  I agree that using a shared page is a good idea, but
removing locking needed for correctness is not a good optimization.

  In my opinion, the responsibility for cases like protection of QP
  against destroy during buffer post (and other similar cases) should
  be moved to vendor driver. The OFED code should move only the code
  path to driver.

Not sure what OFED code you're talking about.  We're discussing the
kernel uverbs code, right?

In any case I'd be interested in seeing how it looks if you move the
protection into the individual drivers.  I'd be worried about having to
duplicate the same code everywhere (which leads to bugs in individual
drivers) -- I guess this could be resolved by having the code be a
library that individual drivers call into.  But also I'm not sure if I
see how you could make such a scheme work -- you need to make sure that
the data structures used in the uverbs dispatch to drivers remain
consistent.

In the end I don't think we should go too far optimizing the
non-kernel-bypass case

RE: ibv_post_send/recv kernel path optimizations

2011-01-21 Thread Hefty, Sean
 + qp = idr_read_qp(cmd.qp_handle, file->ucontext);
 + if (!qp)
 + goto out_raw_qp;
 +
 + if (qp->qp_type == IB_QPT_RAW_ETH) {
 + resp.bad_wr = 0;
 + ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code.  It 
adds assumptions to uverbs about the underlying driver implementation and ties 
that to the QP type.  I don't know if it makes more sense to key off something 
in the cmd or define some other property of the QP, but the NULL parameters 
into post_send are non-intuitive.
 
 + if (ret)
 + resp.bad_wr = cmd.wr_count;

Is this always the case?

- Sean


RE: ibv_post_send/recv kernel path optimizations

2011-01-10 Thread Walukiewicz, Miroslaw
Roland,

You are right that most of the speed-up is coming from avoiding semaphores, 
but not only.

From the oprof traces, the semaphores made half of the difference.

The next one was copy_from_user and kmalloc/kfree usage (in my proposal the 
shared-page method is used instead)

In my opinion, the responsibility for cases like protecting a QP against 
destroy during buffer post (and other similar cases) should be moved to the vendor 
driver. The OFED code should only route the call path to the driver. 

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Wednesday, January 05, 2011 7:17 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

  The patch for ofed-1.5.3 looks like below. I will try to push it to 
  kernel.org after porting.
  
  Now the uverbs post_send/post_recv path is modified to do a pre-lookup
  for RAW_ETH QPs. When a RAW_ETH QP is found, the driver-specific path
  is used for posting buffers, for example using a shared-page approach in
  cooperation with the user-space library

I don't quite see why a hash table helps performance much vs. an IDR.
Is the actual IDR lookup a significant part of the cost?  (By the way,
instead of list_head you could use hlist_head to make your hash table
denser and save cache footprint -- that way an 8-entry table on 64-bit
systems fits in one cacheline)

Also it seems that you get rid of all the locking on QPs when you look
them up in your hash table.  What protects against userspace posting a
send in one thread and destroying the QP in another thread, and ending
up having the destroy complete before the send is posted (leading to
use-after-free in the kernel)?

I would guess that your speedup is really coming from getting rid of
locking that is actually required for correctness.  Maybe I'm wrong
though, I'm just guessing here.

 - R.


Re: ibv_post_send/recv kernel path optimizations

2011-01-10 Thread Roland Dreier
  You are right that most of the speed-up is coming from avoiding semaphores, 
  but not only.
  
  From the oprof traces, the semaphores made half of the difference.
  
  The next one was copy_from_user and kmalloc/kfree usage (in my proposal the 
  shared-page method is used instead)

OK, but in any case the switch from idr to hash table seems to be
insignificant.  I agree that using a shared page is a good idea, but
removing locking needed for correctness is not a good optimization.

  In my opinion, the responsibility for cases like protection of QP
  against destroy during buffer post (and other similar cases) should
  be moved to vendor driver. The OFED code should move only the code
  path to driver.

Not sure what OFED code you're talking about.  We're discussing the
kernel uverbs code, right?

In any case I'd be interested in seeing how it looks if you move the
protection into the individual drivers.  I'd be worried about having to
duplicate the same code everywhere (which leads to bugs in individual
drivers) -- I guess this could be resolved by having the code be a
library that individual drivers call into.  But also I'm not sure if I
see how you could make such a scheme work -- you need to make sure that
the data structures used in the uverbs dispatch to drivers remain
consistent.

In the end I don't think we should go too far optimizing the
non-kernel-bypass case of verbs -- the main thing we're designing for is
kernel bypass hardware, after all.  Perhaps you could make your case go
faster by using a different file descriptor for each QP or something
(you could pass the fd back as part of the driver-specific create QP path)?

 - R.


Re: ibv_post_send/recv kernel path optimizations

2011-01-05 Thread Roland Dreier
  The patch for ofed-1.5.3 looks like below. I will try to push it to 
  kernel.org after porting.
  
  Now the uverbs post_send/post_recv path is modified to do a pre-lookup
  for RAW_ETH QPs. When a RAW_ETH QP is found, the driver-specific path
  is used for posting buffers, for example using a shared-page approach in
  cooperation with the user-space library

I don't quite see why a hash table helps performance much vs. an IDR.
Is the actual IDR lookup a significant part of the cost?  (By the way,
instead of list_head you could use hlist_head to make your hash table
denser and save cache footprint -- that way an 8-entry table on 64-bit
systems fits in one cacheline)

Also it seems that you get rid of all the locking on QPs when you look
them up in your hash table.  What protects against userspace posting a
send in one thread and destroying the QP in another thread, and ending
up having the destroy complete before the send is posted (leading to
use-after-free in the kernel)?

I would guess that your speedup is really coming from getting rid of
locking that is actually required for correctness.  Maybe I'm wrong
though, I'm just guessing here.

 - R.


Re: ibv_post_send/recv kernel path optimizations

2010-12-27 Thread Or Gerlitz
Jason Gunthorpe wrote:
 Walukiewicz, Miroslaw wrote:

 called for many QPs, there is a single entry point to
 ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
 case there is a lookup to QP store (idr_read_qp) necessary to find a
 correct ibv_qp Structure, what is a big time consumer on the path.

 I don't think this should be such a big problem. The simplest solution
 would be to front the idr_read_qp with a small learning hashing table.

yes, there must be a few ways (e.g. as Jason suggested) to do this house-keeping
much more efficiently, in a manner that fits the fast path - which maybe wasn't the 
mindset 
when this code was written, as its primary use was to invoke control-plane 
commands.

 The NES IMA kernel path also has such a QP lookup, but the QP number
 format is designed to make such lookups very quick.  The QP numbers in
 OFED are not defined, so generic lookup functions like idr_read_qp() must be 
 used.

 Maybe look at moving the QPN to ibv_qp translation into the driver
 then - or better yet, move allocation out of the driver, if Mellanox
 could change their FW.. You are right that we could do this much
 faster if the QPN was structured in some way

I think there should be some validation at the uverbs level, as the caller is an 
untrusted
user-space application, e.g. in a similar way to each system call done on a 
file descriptor

Or.



RE: ibv_post_send/recv kernel path optimizations

2010-12-27 Thread Walukiewicz, Miroslaw
user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
if (!user_wr)
return -ENOMEM;
@@ -1579,7 +1649,7 @@ out_put:
 
 out:
kfree(user_wr);
-
+out_raw_qp:
return ret ? ret : in_len;
 }
 
@@ -1664,7 +1734,6 @@ err:
kfree(wr);
wr = next;
}
-
return ERR_PTR(ret);
 }
 
@@ -1681,6 +1750,25 @@ ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
if (copy_from_user(&cmd, buf, sizeof cmd))
return -EFAULT;
 
+   mutex_lock(&file->mutex);
+   qp = raw_qp_lookup(cmd.qp_handle, file->ucontext);
+   mutex_unlock(&file->mutex);
+   if (qp) {
+   if (qp->qp_type == IB_QPT_RAW_ETH) {
+   resp.bad_wr = 0;
+   ret = qp->device->post_recv(qp, NULL, &bad_wr);
+   if (ret)
+   resp.bad_wr = cmd.wr_count;
+
+   if (copy_to_user((void __user *) (unsigned long)
+   cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   goto out_raw_qp;
+   }
+   }
+
wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
   in_len - sizeof cmd, cmd.wr_count,
   cmd.sge_count, cmd.wqe_size);
@@ -1713,7 +1801,7 @@ out:
kfree(wr);
wr = next;
}
-
+out_raw_qp:
return ret ? ret : in_len;
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 
f5b054a..adf1dd8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -838,6 +838,8 @@ struct ib_fmr_attr {
u8  page_shift;
 };
 
+#define MAX_RAW_QP_HASH 8
+
 struct ib_ucontext {
struct ib_device   *device;
	struct list_head	pd_list;
@@ -848,6 +850,7 @@ struct ib_ucontext {
	struct list_head	srq_list;
	struct list_head	ah_list;
	struct list_head	xrc_domain_list;
+	struct list_head	raw_qp_hash[MAX_RAW_QP_HASH];
int closing;
 };
 
@@ -859,6 +862,7 @@ struct ib_uobject {
int id; /* index into kernel idr */
struct kref ref;
struct rw_semaphore mutex;  /* protects .live */
+	struct list_head	raw_qp_list;	/* raw qp hash */
int live;
 };

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Monday, December 27, 2010 1:39 PM
To: Jason Gunthorpe; Walukiewicz, Miroslaw
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

Jason Gunthorpe wrote:
 Walukiewicz, Miroslaw wrote:

 called for many QPs, there is a single entry point to
 ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
 case there is a lookup to QP store (idr_read_qp) necessary to find a
 correct ibv_qp Structure, what is a big time consumer on the path.

 I don't think this should be such a big problem. The simplest solution
 would be to front the idr_read_qp with a small learning hashing table.

yes, there must be a few ways (e.g. as Jason suggested) to do this house-keeping
much more efficiently, in a manner that fits the fast path - which maybe wasn't the 
mindset 
when this code was written, as its primary use was to invoke control-plane 
commands.

 The NES IMA kernel path also has such a QP lookup, but the QP number
 format is designed to make such lookups very quick.  The QP numbers in
 OFED are not defined, so generic lookup functions like idr_read_qp() must be 
 used.

 Maybe look at moving the QPN to ibv_qp translation into the driver
 then - or better yet, move allocation out of the driver, if Mellanox
 could change their FW.. You are right that we could do this much
 faster if the QPN was structured in some way

I think there should be some validation at the uverbs level, as the caller is an 
untrusted
user-space application, e.g. in a similar way to each system call done on a 
file descriptor

Or.



Re: ibv_post_send/recv kernel path optimizations

2010-12-27 Thread Or Gerlitz

On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
I implemented the very small hash table and I achieved performance 
comparable to previous solution.


Just to clarify: when saying "achieved performance comparable to 
previous solution", do you refer to the approach which bypasses uverbs on 
the post-send path? Also, why enhance only the raw QP flow?


Or.



RE: ibv_post_send/recv kernel path optimizations

2010-12-27 Thread Walukiewicz, Miroslaw
 Just to clarify: when saying "achieved performance comparable to 
 previous solution", do you refer to the approach which bypasses uverbs on 
 the post-send path? Also, why enhance only the raw QP flow?

I compare to my previous solution, which used a private device for passing information 
about packets. Compared to the current approach I see more than 20% improvement.

This solution introduces a new path for posting buffers using a shared-page 
approach. It works the following way:
1. Create a RAW QP and add it to the raw QP hash list. 
2. The user-space library mmaps the shared page (this is a device-specific action 
and must be implemented separately per driver).
3. During buffer posting, the library puts buffer info into the shared page and 
calls uverbs.
4. uverbs detects the raw QP and informs the driver, bypassing the current path.

The solution cannot be shared between RDMA drivers because it needs a redesign of 
the driver (the shared-page format is vendor specific).
Currently only the NES driver implements the RAW QP path through the kernel (other 
vendors use a pure user-space solution), so no other vendor will use this path.

There is a possibility to add a new QP capability or attribute that will inform 
uverbs that the new transmit path is used, so the solution could then be 
extended to all drivers.

Mirek



-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Monday, December 27, 2010 4:22 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
 I implemented the very small hash table and I achieved performance 
 comparable to previous solution.

Just to clarify: when saying "achieved performance comparable to 
previous solution", do you refer to the approach which bypasses uverbs on 
the post-send path? Also, why enhance only the raw QP flow?

Or.



RE: ibv_post_send/recv kernel path optimizations

2010-12-14 Thread Walukiewicz, Miroslaw
Or,

I looked into the shared-page approach for passing post_send/post_recv info. I still 
have some concerns.

The shared page must be allocated per QP, and there should be a common way to 
allocate such a page for each driver.

As Jason and Roland said, the best way to pass this parameter through mmap is the 
offset. There is no common way the offset is used per driver; it is a 
driver-specific parameter.

The next problem is how many shared pages the driver should allocate to share with 
user space. They must have room for every buffer posted by the application. This 
is a big concern for post_recv, where a large number of buffers are posted.
The current implementation has no such limit. 

Even if a common offset definition were defined and accepted, the shared 
page must be stored in the ib_qp structure. 
When post_send is called for many QPs, there is a single entry point to 
ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that case a 
lookup into the QP store (idr_read_qp) is necessary to find the correct ibv_qp 
structure, which is a big time consumer on the path. 

The NES IMA kernel path also has such a QP lookup, but the QP number format is 
designed to make such lookups very quick.
The QP numbers in OFED are not defined, so generic lookup functions like 
idr_read_qp() must be used.

Regards,

Mirek


-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
 From the trace it looks like __up_read() - 11% - wastes most of the time. It is 
 called from idr_read_qp when put_uobj_read is called. if 
 (copy_from_user(&cmd, buf, sizeof cmd)) - 5% - is called twice from 
 ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame... 
 and __kmalloc/kfree - 5% - is the third function that has a big meaning. It is 
 called twice for each frame transmitted. It is about 20% of performance loss 
 compared to the nes_ud_sksq path, which we miss when we use the OFED path.

 What I can modify is a kmalloc/kfree optimization - it is possible to make 
 allocations only at start and use pre-allocated buffers. I don't see any way 
 to optimize the idr_read_qp usage or copy_user. In the current approach we 
 use a shared page and a separate nes_ud_sksq handle for each created QP, so 
 there is no need for any user-space data copy or QP lookup.
As was mentioned earlier on this thread, and repeated here, the 
kmalloc/kfree can be removed; as for the 2nd copy_from_user, I don't see 
why the ib uverbs flow (BTW - the data path has nothing to do with the 
rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced, 
e.g. to support a shared page which is allocated & mmapped from uverbs to 
user space and used in the same manner your implementation does. The 1st 
copy_from_user should add pretty much nothing, and if it does, it can be 
replaced with a different user/kernel IPC mechanism which costs less. So 
we're basically left with the idr_read_qp; I wonder what other 
people think if/how this can be optimized?

Or.


Re: ibv_post_send/recv kernel path optimizations

2010-12-14 Thread Jason Gunthorpe
On Tue, Dec 14, 2010 at 02:12:56PM +, Walukiewicz, Miroslaw wrote:
 Or,
 
 I looked into shared page approach of passing post_send/post_recv
 info. I still have some concerns.
 
 The shared page must be allocated per QP and there should be a
 common way to allocate such page for each driver.

Why must it be common?

Aren't these pages just hidden away in your driver library or is
something visible to the app? I get that they are somehow used to
insert/remove the extra header information that the hardware cannot
stuff, but how?

 As Jason and Roland said, the best way to pass this parameter
 through mmap is offset. There is no common way how the Offset is
 used per driver and it is driver specific parameter.

Right, offset is driver specific, it is only used between your
userspace driver library and your kernel driver. You can define
whatever you want.

So use something like QPN*4096 + 1

 The next problem is how many shared pages should driver allocate to
 share with user-space. They should contain place for each posted
 buffer by application. It is a big concern to post_recv where large
 number of buffers are posted.  Current implementation has no such
 limit.

I don't see the problem. mmap also has a length argument. So
mmap(0,16*4096,PROT_READ|PROT_WRITE,MAP_SHARED,fd,QPN*4096+1)

Means map 16 pages associated with QPN. You don't have to have the
offset and length 'make sense' they are just parameters.

 Even the common offset definition would be defined and accepted, the
 shared page must be stored in ib_qp structure.  When a post_send is

You don't need a common definition, it is driver specific.

 called for many QPs, there is a single entry point to
 ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
 case there is a lookup to QP store (idr_read_qp) necessary to find a
 correct ibv_qp Structure, what is a big time consumer on the path.

I don't think this should be such a big problem. The simplest solution
would be to front the idr_read_qp with a small learning hashing table.
If you have only one active QP per verbs instance then you avoid the
idr calls. I'm guessing your raw ethernet app uses one QP?

 The NES IMA kernel path also has such QP lookup but the QP number
 format is designed to make such lookup very quickly.  The QP numbers
 in OFED are not defined so generic lookup functions like
 idr_read_qp() must be use.

Maybe look at moving the QPN to ibv_qp translation into the driver
then - or better yet, move allocation out of the driver, if Mellanox
could change their FW.. You are right that we could do this much
faster if the QPN was structured in some way

Jason


RE: ibv_post_send/recv kernel path optimizations

2010-12-08 Thread Walukiewicz, Miroslaw
Or, 

I don't see why the ib uverbs flow (BTW - the data path has nothing to do with 
the 
rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced 
e.g. to support a shared page which is allocated & mmapped from uverbs to 
user space and used in the same manner your implementation does.

The problem that I see is that mmap is currently used for mapping the 
doorbell page in different drivers.

We can use it for mapping a page for transmit/receive operations once we are 
able to differentiate whether we need to map the doorbell or our shared page. 

The second problem is that this rx/tx mmap should map a separate page per QP 
to avoid unnecessary QP lookups, so the page identifier passed to mmap should be 
based on the QP identifier. 

I cannot find specific code for /dev/infiniband/uverbsX. Does this device 
driver share the same functions as /dev/infiniband/rdmacm, or does it have its own 
implementation? 

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
 From the trace it looks like __up_read() - 11% - wastes most of the time. It is 
 called from idr_read_qp when put_uobj_read is called. if 
 (copy_from_user(&cmd, buf, sizeof cmd)) - 5% - is called twice from 
 ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame... 
 and __kmalloc/kfree - 5% - is the third function that has a big meaning. It is 
 called twice for each frame transmitted. It is about 20% of performance loss 
 compared to the nes_ud_sksq path, which we miss when we use the OFED path.

 What I can modify is a kmalloc/kfree optimization - it is possible to make 
 allocations only at start and use pre-allocated buffers. I don't see any way 
 to optimize the idr_read_qp usage or copy_user. In the current approach we 
 use a shared page and a separate nes_ud_sksq handle for each created QP, so 
 there is no need for any user-space data copy or QP lookup.
As was mentioned earlier on this thread, and repeated here, the 
kmalloc/kfree can be removed; as for the 2nd copy_from_user, I don't see 
why the ib uverbs flow (BTW - the data path has nothing to do with the 
rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced, 
e.g. to support a shared page which is allocated & mmapped from uverbs to 
user space and used in the same manner your implementation does. The 1st 
copy_from_user should add pretty much nothing, and if it does, it can be 
replaced with a different user/kernel IPC mechanism which costs less. So 
we're basically left with the idr_read_qp; I wonder what other 
people think if/how this can be optimized?

Or.


Re: ibv_post_send/recv kernel path optimizations

2010-12-08 Thread Jason Gunthorpe
On Wed, Dec 08, 2010 at 12:14:51PM +, Walukiewicz, Miroslaw wrote:
 Or, 
 
  I don't see why the ib uverbs flow (BTW - the data path has nothing to do 
  with the 
  rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced 
  e.g. to support a shared page which is allocated & mmapped from uverbs to 
  user space and used in the same manner your implementation does.
 
 The problem that I see is that the mmap is currently used for
 mapping of doorbell page in different drivers.
 
 We can use it for mapping a page for transmit/receive operation when
 we are able to differentiate that we need to map Doorbell or our
 shared page.

There is the 64-bit file offset field to mmap which I think is
driver-specific. You could use 0 for the doorbell page, QPN*PAGE_SIZE
+ QPN_OFFSET for the per-QP page, etc..
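The offset scheme suggested above could be sketched roughly like this; QPN_OFFSET, UD_PAGE_SIZE and the helper names are illustrative assumptions for the sketch, not anything taken from an existing driver:

```c
#include <assert.h>
#include <stdint.h>

#define UD_PAGE_SIZE  4096UL
#define QPN_OFFSET    (1UL << 40)  /* hypothetical base marking per-QP pages */

/* Doorbell page lives at file offset 0; one shared page per QP, keyed
 * by the QP number, lives above QPN_OFFSET. */
static uint64_t qp_page_offset(uint32_t qpn)
{
	return QPN_OFFSET + (uint64_t)qpn * UD_PAGE_SIZE;
}

/* Inverse mapping: how the kernel side would recover the QPN from the
 * offset passed to its mmap handler. */
static uint32_t offset_to_qpn(uint64_t offset)
{
	return (uint32_t)((offset - QPN_OFFSET) / UD_PAGE_SIZE);
}
```

User space would then pass qp_page_offset(qpn) as the mmap offset once per QP, so the kernel can hand back that QP's page without any per-packet lookup on the data path.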

 The second problem is that this rx/tx mmap should map a separate
 page per QP to avoid unnecessary QP lookups, so the page identifier
 passed to mmap should be based on the QP identifier.
 
 I cannot find the specific code for /dev/infiniband/uverbsX. Does this
 device driver share the same functions as /dev/infiniband/rdmacm,
 or does it have its own implementation?

It is in drivers/infiniband/core/uverbs*

For mmap the call is just routed to the driver's ib_dev mmap function,
so you can do whatever you want in your driver and match the
functionality in your userspace libibverbs driver library.
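As a rough illustration of that routing, a driver's mmap hook could branch on the file offset alone to pick the resource being mapped. The names and the offset layout here are made up for the sketch, mirroring the QPN-keyed scheme discussed earlier in the thread:

```c
#include <assert.h>
#include <stdint.h>

#define UD_PAGE_SIZE  4096UL
#define QPN_OFFSET    (1UL << 40)  /* hypothetical base marking per-QP pages */

enum map_target { MAP_DOORBELL, MAP_QP_PAGE, MAP_INVALID };

/* Decide, from the mmap file offset alone, which resource the caller
 * asked for.  A real driver's ib_dev mmap function would branch the
 * same way before remapping the chosen page into the user VMA. */
static enum map_target classify_offset(uint64_t off)
{
	if (off == 0)
		return MAP_DOORBELL;
	if (off >= QPN_OFFSET && (off - QPN_OFFSET) % UD_PAGE_SIZE == 0)
		return MAP_QP_PAGE;
	return MAP_INVALID;
}
```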

I think you should be able to implement your driver-specific
optimization within the uverbs framework - that would be best all
round.

Jason


Re: ibv_post_send/recv kernel path optimizations

2010-12-08 Thread Roland Dreier
  The problem that I see is that the mmap is currently used for mapping
  of doorbell page in different drivers.

The driver can use different offsets into the file to map different
things.  For example I believe ehca, ipath and qib already do this.

  I cannot find the specific code for /dev/infiniband/uverbsX. Does this
  device driver share the same functions as /dev/infiniband/rdmacm,
  or does it have its own implementation?

It is in drivers/infiniband/core/uverbs_main.c.

 - R.


RE: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)

2010-11-26 Thread Walukiewicz, Miroslaw
Some time ago we discussed the possibility of removing the usage of nes_ud_sksq 
for the IMA driver, as it blocks pushing the IMA solution to kernel.org. 

The proposal was to use the OFED optimized transmit path via 
/dev/infiniband/rdma_cm instead of the private nes_ud_sksq device.

I made an implementation of such a solution to check the performance impact 
and look for ways to optimize the existing code. 

I made a simple send test (sendto in kernel) using my NEHALEM i7 machine. 
The current nes_ud_sksq implementation achieved about 1.25 million pkts/sec.
The OFED path (with the rdma_cm call) achieved about 0.9 million pkts/sec.


I ran oprofile on the rdma_cm code and got the following results:

samples   %        linenr info                app name             symbol name
2586067   24.5323  nes_uverbs.c:558           libnes-rdmav2.so     nes_upoll_cq
1198042   11.3650  (no location information)  vmlinux              __up_read
539258     5.1156  (no location information)  vmlinux              copy_user_generic_string
407884     3.8693  msa_verbs.c:1692           libmsa.so.1.0.0      msa_post_send
304569     2.8892  msa_verbs.c:2098           libmsa.so.1.0.0      usq_sendmsg_noblock
299954     2.8455  (no location information)  vmlinux              __kmalloc
297463     2.8218  (no location information)  libibverbs.so.1.0.0  /usr/lib64/libibverbs.so.1.0.0
267951     2.5419  uverbs_cmd.c:1433          ib_uverbs.ko         ib_uverbs_post_send
264709     2.5111  (no location information)  vmlinux              kfree
205107     1.9457  port.c:2947                libmsa.so.1.0.0      sendto
146225     1.3871  (no location information)  vmlinux              __down_read
145941     1.3844  (no location information)  libpthread-2.5.so    __write_nocancel
139934     1.3275  nes_ud.c:1746              iw_nes.ko            nes_ud_post_send_new_path
131879     1.2510  send.c:32                  msa_tst              blocking_test_send(void*)
127519     1.2097  (no location information)  vmlinux              system_call
123552     1.1721  port.c:858                 libmsa.so.1.0.0      find_mcast
109249     1.0364  nes_verbs.c:3478           iw_nes.ko            nes_post_send
92060      0.8733  (no location information)  vmlinux              vfs_write
90187      0.8555  uverbs_cmd.c:144           ib_uverbs.ko         __idr_get_uobj
89563      0.8496  nes_uverbs.c:1460          libnes-rdmav2.so     nes_upost_send

From the trace it looks like __up_read() (11%) wastes the most time. It 
is called from idr_read_qp() when put_uobj_read() is called. 

if (copy_from_user(&cmd, buf, sizeof cmd))
        return -EFAULT;

The copy_from_user() above accounts for 5%; it is called twice 
from ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame.

__kmalloc/kfree (5%) is the third significant contributor. The pair is 
called twice for each frame transmitted.

In total this is about a 20% performance loss compared to the nes_ud_sksq path 
when we use the OFED path. 

What I can modify is the kmalloc/kfree usage - it is possible to perform 
allocation only at start and use pre-allocated buffers.
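The pre-allocation idea could look roughly like the user-space sketch below: reserve one fixed-size scratch buffer per context at setup time and reuse it for every post, instead of a kmalloc/kfree pair per frame. The struct and function names are illustrative, not from the actual driver, and the real change would of course live in the kernel uverbs path:

```c
#include <stdlib.h>

#define WR_BUF_SIZE 4096  /* assumed upper bound on a posted work request */

struct post_ctx {
	char *wr_buf;  /* allocated once, reused for every post_send */
};

/* One-time setup: replaces the per-frame kmalloc. */
static int post_ctx_init(struct post_ctx *ctx)
{
	ctx->wr_buf = malloc(WR_BUF_SIZE);
	return ctx->wr_buf ? 0 : -1;
}

/* Per-frame "allocation" is now just handing back the scratch buffer. */
static void *post_ctx_get_buf(struct post_ctx *ctx, size_t len)
{
	if (len > WR_BUF_SIZE)
		return NULL;  /* caller would fall back to a real allocation */
	return ctx->wr_buf;
}

/* One-time teardown: replaces the per-frame kfree. */
static void post_ctx_destroy(struct post_ctx *ctx)
{
	free(ctx->wr_buf);
}
```

This removes two allocator calls from the per-frame fast path at the cost of a bounded buffer size per context.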

I don't see any way to optimize the idr_read_qp usage or copy_user. In the 
current approach we use a shared page and a separate nes_ud_sksq handle for 
each created QP, so there is no need for any user-space data copy or QP lookup. 

Do you have any idea how can we optimize this path?

Regards,

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Thursday, November 25, 2010 4:01 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Roland Dreier; Hefty, Sean; 
linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations (was: uverbs: handle 
large number of entries)

Jason Gunthorpe wrote:
 Hmm, considering your list is everything but Mellanox, maybe it makes much 
 more sense to push the copy_to_user down into the driver - ie a 
 ibv_poll_cq_user - then the driver can construct each CQ entry on the stack 
 and copy it to userspace, avoid the double copy, allocation and avoid any 
 fixed overhead of ibv_poll_cq.

 A bigger change to be sure, but remember this old thread:
 http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html
 2x improvement by removing allocs on the post path..
Hi Mirek,

Any updates on your findings with the patches?

Or.

