RE: ibv_post_send/recv kernel path optimizations
Sean,

The assumption here is that the user-space library prepares the vendor-specific data in user space, using a shared page allocated by the vendor driver. Information about posted buffers is passed not through ib_wr but through the shared page. That is why the ib_wr pointers in post_send are not set: they are not passed to the kernel at all, to avoid copying them in the kernel. As there is no ib_wr structure in the kernel, there is no reference to bad_wr or to the buffer that failed in this context, so the only reasonable information about operation state that bad_wr can carry is binary: operation successful (bad_wr == 0) or not (bad_wr != 0).

Instead of using a specific case for the RAW QP, it is possible to pass information about the buffer-posting method through the creation flags:

enum ib_qp_create_flags {
	IB_QP_CREATE_IPOIB_UD_LSO		= 1 << 0,
	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1
};

extending it with

	IB_QP_CREATE_USE_SHARED_PAGE		= 1 << 2,

In that case the new method could be used for any type of QP and would be backward compatible.

Regards,
Mirek

-----Original Message-----
From: Hefty, Sean
Sent: Friday, January 21, 2011 4:50 PM
To: Walukiewicz, Miroslaw; Roland Dreier
Cc: Or Gerlitz; Jason Gunthorpe; linux-rdma@vger.kernel.org
Subject: RE: ibv_post_send/recv kernel path optimizations

+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code. It adds assumptions to uverbs about the underlying driver implementation and ties that to the QP type. I don't know if it makes more sense to key off something in the cmd or define some other property of the QP, but the NULL parameters into post_send are non-intuitive.

+	if (ret)
+		resp.bad_wr = cmd.wr_count;

Is this always the case?
- Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: ibv_post_send/recv kernel path optimizations
Roland,

You are right that the idr implementation introduces an insignificant change in performance. I made a version with idr and semaphore usage and I see a minimal change compared to the hash table. Now only a shared page is used instead of kmalloc and copy_to_user. I simplified the changes to uverbs and achieved the performance I wanted. Now the patch looks like below. Are these changes acceptable for k.org?

Regards,
Mirek

--- ../SOURCES_19012011/ofa_kernel-1.5.3/drivers/infiniband/core/uverbs_cmd.c	2011-01-19 05:37:55.0 +0100
+++ ofa_kernel-1.5.3_idr_qp/drivers/infiniband/core/uverbs_cmd.c	2011-01-21 04:10:07.0 +0100
@@ -1449,15 +1449,29 @@
 	if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
 		return -EINVAL;

+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_send(qp, NULL, NULL);
+		if (ret)
+			resp.bad_wr = cmd.wr_count;
+
+		if (copy_to_user((void __user *) (unsigned long) cmd.response,
+				 &resp, sizeof resp))
+			ret = -EFAULT;
+		put_qp_read(qp);
+		goto out_raw_qp;
+	}

 	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
 	if (!user_wr)
 		return -ENOMEM;

-	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-	if (!qp)
-		goto out;
-
 	is_ud = qp->qp_type == IB_QPT_UD;
 	sg_ind = 0;
 	last = NULL;
@@ -1577,9 +1591,8 @@
 		wr = next;
 	}

-out:
 	kfree(user_wr);
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }

@@ -1681,16 +1694,31 @@
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;

+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_recv(qp, NULL, NULL);
+		if (ret)
+			resp.bad_wr = cmd.wr_count;
+
+		if (copy_to_user((void __user *) (unsigned long) cmd.response,
+				 &resp, sizeof resp))
+			ret = -EFAULT;
+		put_qp_read(qp);
+		goto out_raw_qp;
+	}
+
 	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd, in_len - sizeof cmd,
 				       cmd.wr_count, cmd.sge_count, cmd.wqe_size);
 	if (IS_ERR(wr))
 		return PTR_ERR(wr);

-	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-	if (!qp)
-		goto out;
-
 	resp.bad_wr = 0;
 	ret = qp->device->post_recv(qp, wr, &bad_wr);
@@ -1707,13 +1735,13 @@
 			 &resp, sizeof resp))
 		ret = -EFAULT;

-out:
 	while (wr) {
 		next = wr->next;
 		kfree(wr);
 		wr = next;
 	}
+out_raw_qp:
 	return ret ? ret : in_len;
 }

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Monday, January 10, 2011 9:38 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

> You are right that most of the speed-up is coming from avoiding
> semaphores, but not only. From the oprof traces, the semaphores made
> half of the difference. The next one was copy_from_user and
> kmalloc/kfree usage (in my proposal the shared page method is used
> instead).

OK, but in any case the switch from idr to hash table seems to be insignificant. I agree that using a shared page is a good idea, but removing locking needed for correctness is not a good optimization.

> In my opinion, the responsibility for cases like protection of a QP
> against destroy during buffer post (and other similar cases) should be
> moved to the vendor driver. The OFED code should only move the code
> path to the driver.

Not sure what OFED code you're talking about. We're discussing the kernel uverbs code, right?
In any case I'd be interested in seeing how it looks if you move the protection into the individual drivers. I'd be worried about having to duplicate the same code everywhere (which leads to bugs in individual drivers) -- I guess this could be resolved by having the code be a library that individual drivers call into. But also I'm not sure if I see how you could make such a scheme work -- you need to make sure that the data structures used in the uverbs dispatch to drivers remain consistent. In the end I don't think we should go too far optimizing the non-kernel-bypass case of verbs -- the main thing we're designing for is kernel bypass hardware, after all.
RE: ibv_post_send/recv kernel path optimizations
+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code. It adds assumptions to uverbs about the underlying driver implementation and ties that to the QP type. I don't know if it makes more sense to key off something in the cmd or define some other property of the QP, but the NULL parameters into post_send are non-intuitive.

+	if (ret)
+		resp.bad_wr = cmd.wr_count;

Is this always the case?

- Sean
RE: ibv_post_send/recv kernel path optimizations
Roland,

You are right that most of the speed-up comes from avoiding semaphores, but not only. From the oprof traces, the semaphores accounted for half of the difference. The next was copy_from_user and kmalloc/kfree usage (in my proposal the shared page method is used instead).

In my opinion, the responsibility for cases like protection of a QP against destroy during buffer post (and other similar cases) should be moved to the vendor driver. The OFED code should only move the code path to the driver.

Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Wednesday, January 05, 2011 7:17 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

> The patch for ofed-1.5.3 looks like below. I will try to push it to
> kernel.org after porting. Now the uverbs post_send/post_recv path is
> modified to make a pre-lookup for RAW_ETH QPs. When a RAW_ETH QP is
> found, the driver-specific path is used for posting buffers, for
> example using a shared page approach in cooperation with the
> user-space library.

I don't quite see why a hash table helps performance much vs. an IDR. Is the actual IDR lookup a significant part of the cost? (By the way, instead of list_head you could use hlist_head to make your hash table denser and save cache footprint -- that way an 8-entry table on 64-bit systems fits in one cacheline)

Also it seems that you get rid of all the locking on QPs when you look them up in your hash table. What protects against userspace posting a send in one thread and destroying the QP in another thread, and ending up having the destroy complete before the send is posted (leading to a use-after-free in the kernel)? I would guess that your speedup is really coming from getting rid of locking that is actually required for correctness. Maybe I'm wrong though, I'm just guessing here.

 - R.
Re: ibv_post_send/recv kernel path optimizations
> You are right that most of the speed-up comes from avoiding
> semaphores, but not only. From the oprof traces, the semaphores
> accounted for half of the difference. The next was copy_from_user and
> kmalloc/kfree usage (in my proposal the shared page method is used
> instead).

OK, but in any case the switch from idr to hash table seems to be insignificant. I agree that using a shared page is a good idea, but removing locking needed for correctness is not a good optimization.

> In my opinion, the responsibility for cases like protection of a QP
> against destroy during buffer post (and other similar cases) should be
> moved to the vendor driver. The OFED code should only move the code
> path to the driver.

Not sure what OFED code you're talking about. We're discussing the kernel uverbs code, right?

In any case I'd be interested in seeing how it looks if you move the protection into the individual drivers. I'd be worried about having to duplicate the same code everywhere (which leads to bugs in individual drivers) -- I guess this could be resolved by having the code be a library that individual drivers call into. But also I'm not sure if I see how you could make such a scheme work -- you need to make sure that the data structures used in the uverbs dispatch to drivers remain consistent.

In the end I don't think we should go too far optimizing the non-kernel-bypass case of verbs -- the main thing we're designing for is kernel bypass hardware, after all. Perhaps you could make your case go faster by using a different file descriptor for each QP or something (you could pass the fd back as part of the driver-specific create QP path)?

 - R.
Re: ibv_post_send/recv kernel path optimizations
> The patch for ofed-1.5.3 looks like below. I will try to push it to
> kernel.org after porting. Now the uverbs post_send/post_recv path is
> modified to make a pre-lookup for RAW_ETH QPs. When a RAW_ETH QP is
> found, the driver-specific path is used for posting buffers, for
> example using a shared page approach in cooperation with the
> user-space library.

I don't quite see why a hash table helps performance much vs. an IDR. Is the actual IDR lookup a significant part of the cost? (By the way, instead of list_head you could use hlist_head to make your hash table denser and save cache footprint -- that way an 8-entry table on 64-bit systems fits in one cacheline)

Also it seems that you get rid of all the locking on QPs when you look them up in your hash table. What protects against userspace posting a send in one thread and destroying the QP in another thread, and ending up having the destroy complete before the send is posted (leading to a use-after-free in the kernel)? I would guess that your speedup is really coming from getting rid of locking that is actually required for correctness. Maybe I'm wrong though, I'm just guessing here.

 - R.
Re: ibv_post_send/recv kernel path optimizations
Jason Gunthorpe wrote:
> Walukiewicz, Miroslaw wrote:
>> called for many QPs, there is a single entry point to
>> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
>> case a lookup into the QP store (idr_read_qp) is necessary to find the
>> correct ibv_qp structure, which is a big time consumer on the path.
> I don't think this should be such a big problem. The simplest solution
> would be to front the idr_read_qp with a small learning hashing table.

yes, there must be a few ways (e.g. as Jason suggested) to do this house-keeping much more efficiently, in a manner that fits the fast path - which maybe wasn't the mindset when this code was written, as its primary use was to invoke control plane commands.

>> The NES IMA kernel path also has such a QP lookup but the QP number
>> format is designed to make such a lookup very quick. The QP numbers in
>> OFED are not defined, so generic lookup functions like idr_read_qp()
>> must be used.
> Maybe look at moving the QPN to ibv_qp translation into the driver then
> - or better yet, move allocation out of the driver, if Mellanox could
> change their FW.. You are right that we could do this much faster if
> the QPN was structured in some way

I think there should be some validation on the uverbs level, as the caller is an untrusted user space application, e.g. in a similar way to what is done for each system call on a file descriptor.

Or.
RE: ibv_post_send/recv kernel path optimizations
 	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
 	if (!user_wr)
 		return -ENOMEM;
@@ -1579,7 +1649,7 @@ out_put:
 out:
 	kfree(user_wr);
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }

@@ -1664,7 +1734,6 @@ err:
 		kfree(wr);
 		wr = next;
 	}
-
 	return ERR_PTR(ret);
 }

@@ -1681,6 +1750,25 @@ ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;

+	mutex_lock(&file->mutex);
+	qp = raw_qp_lookup(cmd.qp_handle, file->ucontext);
+	mutex_unlock(&file->mutex);
+	if (qp) {
+		if (qp->qp_type == IB_QPT_RAW_ETH) {
+			resp.bad_wr = 0;
+			ret = qp->device->post_recv(qp, NULL, &bad_wr);
+			if (ret)
+				resp.bad_wr = cmd.wr_count;
+
+			if (copy_to_user((void __user *) (unsigned long) cmd.response,
+					 &resp, sizeof resp))
+				ret = -EFAULT;
+			goto out_raw_qp;
+		}
+	}
+
 	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd, in_len - sizeof cmd,
 				       cmd.wr_count, cmd.sge_count, cmd.wqe_size);

@@ -1713,7 +1801,7 @@ out:
 		kfree(wr);
 		wr = next;
 	}
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f5b054a..adf1dd8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -838,6 +838,8 @@ struct ib_fmr_attr {
 	u8	page_shift;
 };

+#define MAX_RAW_QP_HASH 8
+
 struct ib_ucontext {
 	struct ib_device       *device;
 	struct list_head	pd_list;
@@ -848,6 +850,7 @@ struct ib_ucontext {
 	struct list_head	srq_list;
 	struct list_head	ah_list;
 	struct list_head	xrc_domain_list;
+	struct list_head	raw_qp_hash[MAX_RAW_QP_HASH];
 	int			closing;
 };

@@ -859,6 +862,7 @@ struct ib_uobject {
 	int			id;		/* index into kernel idr */
 	struct kref		ref;
 	struct rw_semaphore	mutex;		/* protects .live */
+	struct list_head	raw_qp_list;	/* raw qp hash */
 	int			live;
 };

Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Monday, December 27, 2010 1:39 PM
To: Jason Gunthorpe; Walukiewicz, Miroslaw
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

Jason Gunthorpe wrote:
> Walukiewicz, Miroslaw wrote:
>> called for many QPs, there is a single entry point to
>> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
>> case a lookup into the QP store (idr_read_qp) is necessary to find the
>> correct ibv_qp structure, which is a big time consumer on the path.
> I don't think this should be such a big problem. The simplest solution
> would be to front the idr_read_qp with a small learning hashing table.

yes, there must be a few ways (e.g. as Jason suggested) to do this house-keeping much more efficiently, in a manner that fits the fast path - which maybe wasn't the mindset when this code was written, as its primary use was to invoke control plane commands.

>> The NES IMA kernel path also has such a QP lookup but the QP number
>> format is designed to make such a lookup very quick. The QP numbers in
>> OFED are not defined, so generic lookup functions like idr_read_qp()
>> must be used.
> Maybe look at moving the QPN to ibv_qp translation into the driver then
> - or better yet, move allocation out of the driver, if Mellanox could
> change their FW.. You are right that we could do this much faster if
> the QPN was structured in some way

I think there should be some validation on the uverbs level, as the caller is an untrusted user space application, e.g. in a similar way to what is done for each system call on a file descriptor.

Or.
Re: ibv_post_send/recv kernel path optimizations
On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
> I implemented the very small hash table and I achieved performance
> comparable to the previous solution.

Just to clarify, when saying "achieved performance comparable to the previous solution", do you refer to the approach which bypasses uverbs on the post send path? Also, why enhance only the raw qp flow?

Or.
RE: ibv_post_send/recv kernel path optimizations
> Just to clarify, when saying "achieved performance comparable to the
> previous solution", do you refer to the approach which bypasses uverbs
> on the post send path? Also, why enhance only the raw qp flow?

I compare to my previous solution using a private device for passing information about packets. Compared to the current approach I see more than 20% improvement.

This solution introduces a new path for posting buffers using a shared page approach. It works the following way:

1. create a RAW QP and add it to the raw QP hash list.
2. the user-space library mmaps the shared page (this is a device-specific action and must be implemented separately per driver).
3. during buffer posting, the library puts buffer info into the shared page and calls uverbs.
4. uverbs detects the raw QP and informs the driver, bypassing the current path.

The solution cannot be shared between RDMA drivers because it needs a redesign of the driver (the shared page format is vendor specific). Now only the NES driver implements the RAW QP path through the kernel (other vendors use a pure user-space solution), so no other vendor will use this path. There is a possibility to add a new QP capability or attribute that informs uverbs that the new transmit path is used; then the solution could be extended to all drivers.

Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Monday, December 27, 2010 4:22 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
> I implemented the very small hash table and I achieved performance
> comparable to the previous solution.

Just to clarify, when saying "achieved performance comparable to the previous solution", do you refer to the approach which bypasses uverbs on the post send path? Also, why enhance only the raw qp flow?

Or.
RE: ibv_post_send/recv kernel path optimizations
Or,

I looked into the shared page approach of passing post_send/post_recv info. I still have some concerns.

The shared page must be allocated per QP, and there should be a common way to allocate such a page for each driver. As Jason and Roland said, the best way to pass this parameter through mmap is the offset. There is no common way the offset is used per driver; it is a driver-specific parameter.

The next problem is how many shared pages the driver should allocate to share with user space. They should contain a place for each buffer posted by the application. It is a big concern for post_recv, where large numbers of buffers are posted. The current implementation has no such limit.

Even if a common offset definition were defined and accepted, the shared page must be stored in the ib_qp structure. When post_send is called for many QPs, there is a single entry point to ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that case a lookup into the QP store (idr_read_qp) is necessary to find the correct ibv_qp structure, which is a big time consumer on the path.

The NES IMA kernel path also has such a QP lookup, but the QP number format is designed to make such a lookup very quick. The QP numbers in OFED are not defined, so generic lookup functions like idr_read_qp() must be used.

Regards,
Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> From the trace it looks like __up_read() - 11% - wastes most of the
> time. It is called from idr_read_qp when put_uobj_read is called.
> if (copy_from_user(&cmd, buf, sizeof cmd)) - 5% - is called twice from
> ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each
> frame... and __kmalloc/kfree - 5% - is the third function that has a
> big meaning.
> It is called twice for each frame transmitted. It is about 20% of
> performance loss compared to the nes_ud_sksq path, which we miss when
> we use the OFED path. What I can modify is a kmalloc/kfree
> optimization - it is possible to make the allocation only at start and
> use pre-allocated buffers. I don't see any way to optimize the
> idr_read_qp usage or copy_user.
>
> In the current approach we use a shared page and a separate
> nes_ud_sksq handle for each created QP, so there is no need for any
> user-space data copy or QP lookup.

As was mentioned earlier on this thread, and repeated here, the kmalloc/kfree can be removed. As for the 2nd copy_from_user, I don't see why the ib uverbs flow (BTW - the data path has nothing to do with the rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced, e.g. to support a shared page which is allocated and mmapped from uverbs to user space and used in the same manner your implementation does. The 1st copy_from_user should add pretty much nothing, and if it does, it can be replaced with a different user/kernel IPC mechanism which costs less. So we're basically left with the idr_read_qp; I wonder what other people think on if/how this can be optimized?

Or.
Re: ibv_post_send/recv kernel path optimizations
On Tue, Dec 14, 2010 at 02:12:56PM +0000, Walukiewicz, Miroslaw wrote:

> Or, I looked into the shared page approach of passing
> post_send/post_recv info. I still have some concerns. The shared page
> must be allocated per QP and there should be a common way to allocate
> such a page for each driver.

Why must it be common? Aren't these pages just hidden away in your driver library, or is something visible to the app? I get that they are somehow used to insert/remove the extra header information that the hardware cannot stuff, but how?

> As Jason and Roland said, the best way to pass this parameter through
> mmap is the offset. There is no common way the offset is used per
> driver; it is a driver-specific parameter.

Right, the offset is driver specific; it is only used between your userspace driver library and your kernel driver. You can define whatever you want. So use something like QPN*4096 + 1.

> The next problem is how many shared pages the driver should allocate
> to share with user space. They should contain a place for each buffer
> posted by the application. It is a big concern for post_recv, where
> large numbers of buffers are posted. The current implementation has no
> such limit.

I don't see the problem. mmap also has a length argument. So

  mmap(0, 16*4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, QPN*4096 + 1)

means map 16 pages associated with the QPN. You don't have to have the offset and length 'make sense'; they are just parameters.

> Even if a common offset definition were defined and accepted, the
> shared page must be stored in the ib_qp structure. When post_send is

You don't need a common definition, it is driver specific.

> called for many QPs, there is a single entry point to
> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
> case a lookup into the QP store (idr_read_qp) is necessary to find the
> correct ibv_qp structure, which is a big time consumer on the path.

I don't think this should be such a big problem. The simplest solution would be to front the idr_read_qp with a small learning hashing table.
If you have only one active QP per verbs instance then you avoid the idr calls. I'm guessing your raw ethernet app uses one QP?

> The NES IMA kernel path also has such a QP lookup but the QP number
> format is designed to make such a lookup very quick. The QP numbers in
> OFED are not defined, so generic lookup functions like idr_read_qp()
> must be used.

Maybe look at moving the QPN to ibv_qp translation into the driver then - or better yet, move allocation out of the driver, if Mellanox could change their FW.. You are right that we could do this much faster if the QPN was structured in some way.

Jason
RE: ibv_post_send/recv kernel path optimizations
Or,

> I don't see why the ib uverbs flow (BTW - the data path has nothing to
> do with the rdma_cm, you're working with /dev/infiniband/uverbsX)
> can't be enhanced, e.g. to support a shared page which is allocated
> and mmapped from uverbs to user space and used in the same manner your
> implementation does.

The problem that I see is that mmap is currently used for mapping the doorbell page in different drivers. We can use it for mapping a page for transmit/receive operation only when we are able to differentiate whether we need to map the doorbell or our shared page.

The second problem is that this rx/tx mmap should map a separate page per QP to avoid unnecessary QP lookups, so the page identifier passed to mmap should be based on a QP identifier.

I cannot find the specific code for /dev/infiniband/uverbsX. Does this device driver share the same functions as /dev/infiniband/rdmacm, or does it have its own implementation?

Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> From the trace it looks like __up_read() - 11% - wastes most of the
> time. It is called from idr_read_qp when put_uobj_read is called.
> if (copy_from_user(&cmd, buf, sizeof cmd)) - 5% - is called twice from
> ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each
> frame... and __kmalloc/kfree - 5% - is the third function that has a
> big meaning. It is called twice for each frame transmitted. It is
> about 20% of performance loss compared to the nes_ud_sksq path, which
> we miss when we use the OFED path. What I can modify is a
> kmalloc/kfree optimization - it is possible to make the allocation
> only at start and use pre-allocated buffers. I don't see any way to
> optimize the idr_read_qp usage or copy_user.
> In the current approach we use a shared page and a separate
> nes_ud_sksq handle for each created QP, so there is no need for any
> user-space data copy or QP lookup.

As was mentioned earlier on this thread, and repeated here, the kmalloc/kfree can be removed. As for the 2nd copy_from_user, I don't see why the ib uverbs flow (BTW - the data path has nothing to do with the rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced, e.g. to support a shared page which is allocated and mmapped from uverbs to user space and used in the same manner your implementation does. The 1st copy_from_user should add pretty much nothing, and if it does, it can be replaced with a different user/kernel IPC mechanism which costs less. So we're basically left with the idr_read_qp; I wonder what other people think on if/how this can be optimized?

Or.
Re: ibv_post_send/recv kernel path optimizations
On Wed, Dec 08, 2010 at 12:14:51PM +0000, Walukiewicz, Miroslaw wrote:

> Or,
>
>> I don't see why the ib uverbs flow (BTW - the data path has nothing
>> to do with the rdma_cm, you're working with /dev/infiniband/uverbsX)
>> can't be enhanced, e.g. to support a shared page which is allocated
>> and mmapped from uverbs to user space and used in the same manner
>> your implementation does.
>
> The problem that I see is that mmap is currently used for mapping the
> doorbell page in different drivers. We can use it for mapping a page
> for transmit/receive operation only when we are able to differentiate
> whether we need to map the doorbell or our shared page.

There is the 64 bit file offset field to mmap which I think is driver-specific. You could use 0 for the doorbell page, QPN*PAGE_SIZE + QPN_OFFSET for the per-QP page, etc..

> The second problem is that this rx/tx mmap should map a separate page
> per QP to avoid unnecessary QP lookups, so the page identifier passed
> to mmap should be based on a QP identifier.
>
> I cannot find the specific code for /dev/infiniband/uverbsX. Does this
> device driver share the same functions as /dev/infiniband/rdmacm, or
> does it have its own implementation?

It is in drivers/infiniband/core/uverbs*. For mmap the call is just routed to the driver's ib_dev mmap function, so you can do whatever you want in your driver and match the functionality in your userspace libibverbs driver library. I think you should be able to implement your driver-specific optimization within the uverbs framework - that would be best all round.

Jason
Re: ibv_post_send/recv kernel path optimizations
> The problem that I see is that mmap is currently used for mapping the
> doorbell page in different drivers.

The driver can use different offsets into the file to map different things. For example, I believe ehca, ipath and qib already do this.

> I cannot find the specific code for /dev/infiniband/uverbsX. Does this
> device driver share the same functions as /dev/infiniband/rdmacm, or
> does it have its own implementation?

It is in drivers/infiniband/core/uverbs_main.c.

 - R.
RE: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)
Some time ago we discussed the possibility of removing the use of
nes_ud_sksq for the IMA driver, as it blocks pushing the IMA solution
to kernel.org. The proposal was to use the OFED transmit-optimized path
through /dev/infiniband/rdma_cm instead of the private nes_ud_sksq
device.

I made an implementation of that solution to check the performance
impact and to look for ways to optimize the existing code.

I ran a simple send test (sendto in kernel) on my NEHALEM i7 machine.
The current nes_ud_sksq implementation achieved about 1.25 million
pkts/sec. The OFED path (with the rdma_cm call) achieved about 0.9
million pkts/sec.

I ran oprofile on the rdma_cm code and got the following results:

samples   %        linenr info                 app name              symbol name
2586067   24.5323  nes_uverbs.c:558            libnes-rdmav2.so      nes_upoll_cq
1198042   11.3650  (no location information)   vmlinux               __up_read
 539258    5.1156  (no location information)   vmlinux               copy_user_generic_string
 407884    3.8693  msa_verbs.c:1692            libmsa.so.1.0.0       msa_post_send
 304569    2.8892  msa_verbs.c:2098            libmsa.so.1.0.0       usq_sendmsg_noblock
 299954    2.8455  (no location information)   vmlinux               __kmalloc
 297463    2.8218  (no location information)   libibverbs.so.1.0.0   /usr/lib64/libibverbs.so.1.0.0
 267951    2.5419  uverbs_cmd.c:1433           ib_uverbs.ko          ib_uverbs_post_send
 264709    2.5111  (no location information)   vmlinux               kfree
 205107    1.9457  port.c:2947                 libmsa.so.1.0.0       sendto
 146225    1.3871  (no location information)   vmlinux               __down_read
 145941    1.3844  (no location information)   libpthread-2.5.so     __write_nocancel
 139934    1.3275  nes_ud.c:1746               iw_nes.ko             nes_ud_post_send_new_path
 131879    1.2510  send.c:32                   msa_tst               blocking_test_send(void*)
 127519    1.2097  (no location information)   vmlinux               system_call
 123552    1.1721  port.c:858                  libmsa.so.1.0.0       find_mcast
 109249    1.0364  nes_verbs.c:3478            iw_nes.ko             nes_post_send
  92060    0.8733  (no location information)   vmlinux               vfs_write
  90187    0.8555  uverbs_cmd.c:144            ib_uverbs.ko          __idr_get_uobj
  89563    0.8496  nes_uverbs.c:1460           libnes-rdmav2.so      nes_upost_send

From the trace it looks like __up_read() - 11% - wastes the most time.
It is called from idr_read_qp when put_uobj_read is called.

copy_from_user() - 5% - is the next contributor; it is called twice
from ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per
frame:

	if (copy_from_user(&cmd, buf, sizeof cmd))
		return -EFAULT;

__kmalloc/kfree - 5% - is the third pair with a big impact. It is
called twice for each transmitted frame.

That adds up to about 20% of performance lost compared to the
nes_ud_sksq path, which we give up when we use the OFED path.

What I can modify is the kmalloc/kfree part - it is possible to
allocate only at start and use pre-allocated buffers. I don't see any
way to optimize the idr_read_qp usage or the user copy. In the current
approach we use a shared page and a separate nes_ud_sksq handle for
each created QP, so there is no need for any user-space data copy or QP
lookup.

Do you have any idea how we can optimize this path?

Regards,

Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Thursday, November 25, 2010 4:01 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)

Jason Gunthorpe wrote:
> Hmm, considering your list is everything but Mellanox, maybe it makes
> much more sense to push the copy_to_user down into the driver - ie a
> ibv_poll_cq_user - then the driver can construct each CQ entry on the
> stack and copy it to userspace, avoid the double copy, allocation and
> avoid any fixed overhead of ibv_poll_cq.
>
> A bigger change to be sure, but remember this old thread:
> http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html
> 2x improvement by removing allocs on the post path..

Hi Mirek,

Any updates on your findings with the patches?

Or.
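The pre-allocation idea Mirek proposes - allocate work-request buffers once at QP creation and recycle them, instead of kmalloc/kfree per posted frame - can be sketched as a simple free list. This is a userspace illustration under assumed names (`wr_pool`, `wr_buf`, etc. are made up); a kernel version would also need locking or per-CPU pools on the hot path.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: a fixed pool of work-request buffers threaded
 * onto a singly-linked free list.  Get/put on the hot path are O(1)
 * pointer swaps with no allocator calls per frame. */

struct wr_buf {
    struct wr_buf *next;
    char payload[256];          /* space for the copied work request */
};

struct wr_pool {
    struct wr_buf *free_list;   /* buffers available for the next post */
    struct wr_buf *storage;     /* single allocation made at init time */
};

/* one allocation at QP-creation time instead of one per frame */
static int wr_pool_init(struct wr_pool *p, size_t n)
{
    p->storage = calloc(n, sizeof(*p->storage));
    if (!p->storage)
        return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < n; i++) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
    return 0;
}

/* hot path: take a buffer, or NULL if the pool is exhausted */
static struct wr_buf *wr_pool_get(struct wr_pool *p)
{
    struct wr_buf *b = p->free_list;
    if (b)
        p->free_list = b->next;
    return b;
}

/* hot path: return a buffer to the pool after the send completes */
static void wr_pool_put(struct wr_pool *p, struct wr_buf *b)
{
    b->next = p->free_list;
    p->free_list = b;
}
```

This removes the two allocator calls per frame that the profile attributes about 5% of the time to; it does nothing for the idr_read_qp or copy_from_user costs.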
Re: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)
Jason Gunthorpe wrote:
> Hmm, considering your list is everything but Mellanox, maybe it makes
> much more sense to push the copy_to_user down into the driver - ie a
> ibv_poll_cq_user - then the driver can construct each CQ entry on the
> stack and copy it to userspace, avoid the double copy, allocation and
> avoid any fixed overhead of ibv_poll_cq.
>
> A bigger change to be sure, but remember this old thread:
> http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html
> 2x improvement by removing allocs on the post path..

Hi Mirek,

Any updates on your findings with the patches?

Or.
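Jason's suggestion - build each CQ entry on the stack inside the driver's poll routine and copy it straight to the caller's buffer, avoiding a heap-allocated intermediate array - can be sketched as below. This is a hypothetical userspace simulation: the `sim_cq` ring and all names are made up, and `memcpy` stands in for `copy_to_user`.

```c
#include <stdint.h>
#include <string.h>

/* Completion entry as handed back to the caller */
struct cq_entry {
    uint64_t wr_id;
    int      status;
};

/* Minimal simulated completion queue: a small ring with pending
 * entries between tail and head */
struct sim_cq {
    struct cq_entry ring[16];
    int head, tail;
};

/* Driver-side poll in the suggested style: one stack temporary per
 * entry, copied directly to the caller's buffer - no per-call
 * allocation and no second copy through an intermediate array. */
static int poll_cq_direct(struct sim_cq *cq, struct cq_entry *user_buf, int nwc)
{
    int polled = 0;
    while (polled < nwc && cq->tail != cq->head) {
        struct cq_entry e = cq->ring[cq->tail];     /* built on the stack */
        cq->tail = (cq->tail + 1) % 16;
        memcpy(&user_buf[polled++], &e, sizeof e);  /* stands in for copy_to_user */
    }
    return polled;
}
```

The design point is that the per-entry temporary lives on the stack, so the poll path has no allocator traffic at all - the same property the old thread Jason references credits with the 2x improvement on the post path.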