RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
I agree with you that changing the kernel ABI is not necessary. I will follow your directions regarding a single allocation at start.

Regards,
Mirek

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Friday, August 06, 2010 5:58 PM
To: Walukiewicz, Miroslaw
Cc: linux-rdma@vger.kernel.org
Subject: Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

> The proposed path optimization is removing of dynamic allocations by
> redefining a structure definition passed to kernel.
>
> To
>
> struct ibv_post_send {
>     __u32 command;
>     __u16 in_words;
>     __u16 out_words;
>     __u64 response;
>     __u32 qp_handle;
>     __u32 wr_count;
>     __u32 sge_count;
>     __u32 wqe_size;
>     struct ibv_kern_send_wr send_wr[512];
> };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?
(ie allocate memory but don't free the first time into post_send, and
only reallocate if a bigger send request comes, and only free when
destroying the context)

 - R.

--
Roland Dreier <rola...@cisco.com>
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
Hello Jason,

> Do you have any benchmarks that show the alloca is a measurable
> overhead?

We changed the overall path (both kernel and user space) to an
allocation-less approach, and we achieved twice better latency using the
call to the kernel driver. I have no data on which part is dominant -
kernel or user space. I think I will have some measurements next week,
so I will share my results.

> Roland is right, all you really need is a per-context (+per-cpu?)
> buffer you can grab, fill, and put back.

I agree. I will go in this direction.

Regards,
Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Friday, August 06, 2010 6:33 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; linux-rdma@vger.kernel.org
Subject: Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

> Currently the transmit/receive path works the following way:
>
> User calls ibv_post_send(), where a vendor-specific function is
> called. When the path should go through the kernel,
> ibv_cmd_post_send() is called. The function creates the POST_SEND
> message body that is passed to the kernel. As the number of sges is
> unknown, a dynamic allocation for the message body is performed.
> (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead? I'm pretty skeptical... alloca will generally boil down to one
or two assembly instructions adjusting the stack pointer, and not even
that if you are lucky and it can be merged into the function prologue.

> In the kernel the message body is parsed, and a structure of wr and
> sges is recreated using dynamic allocations in the kernel. The goal
> of this operation is having a similar structure like in user space.

.. the kmalloc call(s) on the other hand definitely seem worth looking
at ..
> In the kernel, in ib_uverbs_post_send(), instead of dynamic
> allocation of the ib_send_wr structures, a table of 512 ib_send_wr
> structures will be defined, and all entries will be linked into a
> unidirectional list, so the qp->device->post_send(qp, wr, bad_wr) API
> will not be changed.

Isn't there a kernel API already for managing a pool of pre-allocated
fixed-size allocations? It isn't clear to me that is even necessary;
Roland is right, all you really need is a per-context (+per-cpu?) buffer
you can grab, fill, and put back.

> As far as I know, no driver uses that kernel path for posting
> buffers, so iWARP multicast acceleration implemented in the NES
> driver would be the first application that can utilize the optimized
> path.

??

Jason
Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
> The proposed path optimization is removing of dynamic allocations by
> redefining a structure definition passed to kernel.
>
> To
>
> struct ibv_post_send {
>     __u32 command;
>     __u16 in_words;
>     __u16 out_words;
>     __u64 response;
>     __u32 qp_handle;
>     __u32 wr_count;
>     __u32 sge_count;
>     __u32 wqe_size;
>     struct ibv_kern_send_wr send_wr[512];
> };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?
(ie allocate memory but don't free the first time into post_send, and
only reallocate if a bigger send request comes, and only free when
destroying the context)

 - R.

--
Roland Dreier <rola...@cisco.com>
Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

> Currently the transmit/receive path works the following way:
>
> User calls ibv_post_send(), where a vendor-specific function is
> called. When the path should go through the kernel,
> ibv_cmd_post_send() is called. The function creates the POST_SEND
> message body that is passed to the kernel. As the number of sges is
> unknown, a dynamic allocation for the message body is performed.
> (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead? I'm pretty skeptical... alloca will generally boil down to one
or two assembly instructions adjusting the stack pointer, and not even
that if you are lucky and it can be merged into the function prologue.

> In the kernel the message body is parsed, and a structure of wr and
> sges is recreated using dynamic allocations in the kernel. The goal
> of this operation is having a similar structure like in user space.

.. the kmalloc call(s) on the other hand definitely seem worth looking
at ..

> In the kernel, in ib_uverbs_post_send(), instead of dynamic
> allocation of the ib_send_wr structures, a table of 512 ib_send_wr
> structures will be defined, and all entries will be linked into a
> unidirectional list, so the qp->device->post_send(qp, wr, bad_wr) API
> will not be changed.

Isn't there a kernel API already for managing a pool of pre-allocated
fixed-size allocations? It isn't clear to me that is even necessary;
Roland is right, all you really need is a per-context (+per-cpu?) buffer
you can grab, fill, and put back.

> As far as I know, no driver uses that kernel path for posting
> buffers, so iWARP multicast acceleration implemented in the NES
> driver would be the first application that can utilize the optimized
> path.

??

Jason
Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
On Fri, 2010-08-06 at 03:03 -0700, Walukiewicz, Miroslaw wrote:

> Currently the ibv_post_send()/ibv_post_recv() path through the kernel
> (using /dev/infiniband/rdmacm) could be optimized by removing dynamic
> memory allocations on the path.
>
> Currently the transmit/receive path works the following way:
>
> User calls ibv_post_send(), where a vendor-specific function is
> called. When the path should go through the kernel,
> ibv_cmd_post_send() is called. The function creates the POST_SEND
> message body that is passed to the kernel. As the number of sges is
> unknown, a dynamic allocation for the message body is performed.
> (see libibverbs/src/cmd.c)
>
> In the kernel the message body is parsed, and a structure of wr and
> sges is recreated using dynamic allocations in the kernel. The goal
> of this operation is having a similar structure like in user space.
>
> The proposed path optimization is removing of dynamic allocations by
> redefining the structure definition passed to the kernel.
>
> From
>
> struct ibv_post_send {
>     __u32 command;
>     __u16 in_words;
>     __u16 out_words;
>     __u64 response;
>     __u32 qp_handle;
>     __u32 wr_count;
>     __u32 sge_count;
>     __u32 wqe_size;
>     struct ibv_kern_send_wr send_wr[0];
> };
>
> To
>
> struct ibv_post_send {
>     __u32 command;
>     __u16 in_words;
>     __u16 out_words;
>     __u64 response;
>     __u32 qp_handle;
>     __u32 wr_count;
>     __u32 sge_count;
>     __u32 wqe_size;
>     struct ibv_kern_send_wr send_wr[512];
> };
>
> A similar change is required in the kernel struct ib_uverbs_post_send
> defined in /ofa_kernel/include/rdma/ib_uverbs.h
>
> This change limits the number of send_wr passed from unlimited
> (assured by dynamic allocation) to a reasonable number of 512. I
> think this number should be the max number of QP entries available to
> send. As all IB/iWARP applications are low-latency applications, the
> number of WRs passed is never unlimited.
>
> As a result, instead of dynamic allocation, ibv_cmd_post_send() fills
> the proposed structure directly and passes it to the kernel. Whenever
> the number of send_wr entries exceeds the limit, the ENOMEM error is
> returned.
> In the kernel, in ib_uverbs_post_send(), instead of dynamic
> allocation of the ib_send_wr structures, a table of 512 ib_send_wr
> structures will be defined, and all entries will be linked into a
> unidirectional list, so the qp->device->post_send(qp, wr, bad_wr) API
> will not be changed.
>
> As far as I know, no driver uses that kernel path for posting
> buffers, so iWARP multicast acceleration implemented in the NES
> driver would be the first application that can utilize the optimized
> path.
>
> Regards,
> Mirek
>
> Signed-off-by: Mirek Walukiewicz <miroslaw.walukiew...@intel.com>

The libipathverbs.so plug-in for libibverbs and the ib_ipath and ib_qib
kernel modules use this path for ibv_post_send().