RE: [PATCH 0/4] add RAW Packet QP type
Regarding VLANs: for a RAW QP, a better solution is to allow inserting a VLAN per packet, by adding new flags to ib_post_send together with a special field carrying the VLAN ID. On ingress, it would be interesting to see the ingress VLAN in the CQE, by introducing a new field plus flags indicating VLAN presence.

As Steve mentioned, HW checksum offload is also necessary in the API for sending IP fragments. When an IP packet is not fragmented, it should be sent with the L3/L4 checksums computed by HW; when fragments are sent, only the L3 checksum computation is requested.

Regards,
Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Steve Wise
Sent: Tuesday, January 17, 2012 4:08 PM
To: Or Gerlitz
Cc: Roland Dreier; linux-rdma; Christoph Lameter; Liran Liss
Subject: Re: [PATCH 0/4] add RAW Packet QP type

On 01/17/2012 05:34 AM, Or Gerlitz wrote:
> The new qp type designated usage is from user-space in Ethernet environments,
> e.g by applications that do TCP/IP offloading. Only processes with the NET_RAW
> capability may open such qp. The name raw packet was selected to resemble the
> similarity to AF_PACKET / SOL_RAW sockets. Applications that use this qp type
> should deal with whole packets, including link level headers.
>
> This series allows to create such QPs and send packets over them. To receive
> packets, flow steering support has to be added to the verbs and low-level
> drivers. Flow steering is the ability to program the HCA to direct a packet
> which matches a given flow specification to a given QP. Flow specs set by
> applications are typically made of L3 (IP) and L4 (TCP/UDP) based tuples,
> where network drivers typically use L2 based tuples. Core and mlx4 patches
> for flow steering are expected in the coming weeks.
Hey Or,

I think this series should add some new send flags for HW that does checksum offload. For example, cxgb4 supports these:

enum {                          /* TX_PKT_XT checksum types */
        TX_CSUM_TCP    = 0,
        TX_CSUM_UDP    = 1,
        TX_CSUM_CRC16  = 4,
        TX_CSUM_CRC32  = 5,
        TX_CSUM_CRC32C = 6,
        TX_CSUM_FCOE   = 7,
        TX_CSUM_TCPIP  = 8,
        TX_CSUM_UDPIP  = 9,
        TX_CSUM_TCPIP6 = 10,
        TX_CSUM_UDPIP6 = 11,
        TX_CSUM_IP     = 12,
};

I'm sure mlx4 has this sort of functionality too?

Another form of HW assist is VLAN insertion/extraction. The API should provide a way to specify whether a VLAN ID should be inserted by HW on egress and removed from a packet on ingress (and passed to the app via the CQE). In fact, we probably want a way to associate a VLAN with a RAW QP, maybe as a QP attribute?

Also, on ingress, most hardware can do INET checksum validation, and a way to indicate the results to the application is needed. Perhaps flags in the CQE? The cxgb4 device provides many fields on an ingress packet completion that would be useful for user mode applications, including indications of MAC RX errors, protocol length vs packet length mismatches, IP version not 4 or 6, and more. Does mlx4 have these sorts of indications on ingress packet CQEs?

Food for thought.

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
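The per-packet VLAN and checksum assists discussed in the two messages above could be sketched as follows. All names here (the flag bits, the `raw_send_wr` struct, the CQE flags) are illustrative assumptions for this thread's proposal, not anything from the existing verbs headers or the cxgb4/mlx4 drivers:

```c
#include <assert.h>
#include <stdint.h>

/* Egress side: per-WR flags plus a VLAN ID field (hypothetical layout). */
enum raw_send_flags {
    RAW_SEND_IP_CSUM      = 1 << 0, /* HW computes L3 checksum only (IP fragments) */
    RAW_SEND_L4_CSUM      = 1 << 1, /* HW computes the L4 checksum as well */
    RAW_SEND_VLAN_PRESENT = 1 << 2, /* HW inserts vlan_id into the outgoing frame */
};

struct raw_send_wr {
    uint32_t send_flags;
    uint16_t vlan_id; /* 12-bit VLAN ID, valid iff RAW_SEND_VLAN_PRESENT */
};

/* Request HW VLAN insertion; returns 0, or -1 for an out-of-range ID. */
static int raw_wr_set_vlan(struct raw_send_wr *wr, uint16_t vid)
{
    if (vid > 0xFFF)
        return -1;
    wr->vlan_id = vid;
    wr->send_flags |= RAW_SEND_VLAN_PRESENT;
    return 0;
}

/* Ingress side: CQE flags reporting what HW validated or stripped. */
enum raw_cqe_flags {
    RAW_CQE_L3_CSUM_OK   = 1 << 0,
    RAW_CQE_L4_CSUM_OK   = 1 << 1,
    RAW_CQE_VLAN_PRESENT = 1 << 2, /* VLAN was stripped; ID reported in the CQE */
};

/* The payload checksum is trustworthy only if HW validated both layers. */
static int raw_cqe_csum_ok(uint32_t cqe_flags)
{
    uint32_t both = RAW_CQE_L3_CSUM_OK | RAW_CQE_L4_CSUM_OK;
    return (cqe_flags & both) == both;
}
```

The split between `RAW_SEND_IP_CSUM` and `RAW_SEND_L4_CSUM` mirrors Mirek's point: fragments can only request the L3 sum, while whole packets request both.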
RE: ibv_post_send/recv kernel path optimizations
Sean,

The assumption here is that the user-space library prepares the vendor-specific data in user space, using a shared page allocated by the vendor driver. Information about posted buffers is passed not through ib_wr but through the shared page. That is why the pointers to ib_wr in post_send are not set: they are not passed to the kernel at all, to avoid copying them in the kernel. As there is no ib_wr structure in the kernel, there is no reference to bad_wr or to the buffer that failed, so in this context the only reasonable information about the operation state that can be passed through bad_wr is binary: operation successful (bad_wr == 0) or not (bad_wr != 0).

Instead of special-casing the RAW QP, it is possible to signal the buffer-posting method through

enum ib_qp_create_flags {
        IB_QP_CREATE_IPOIB_UD_LSO             = 1 << 0,
        IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK = 1 << 1
};

extending it with

        IB_QP_CREATE_USE_SHARED_PAGE          = 1 << 2,

In that case the new method could be used for any type of QP and would be backward compatible.

Regards,
Mirek

-----Original Message-----
From: Hefty, Sean
Sent: Friday, January 21, 2011 4:50 PM
To: Walukiewicz, Miroslaw; Roland Dreier
Cc: Or Gerlitz; Jason Gunthorpe; linux-rdma@vger.kernel.org
Subject: RE: ibv_post_send/recv kernel path optimizations

> +	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
> +	if (!qp)
> +		goto out_raw_qp;
> +
> +	if (qp->qp_type == IB_QPT_RAW_ETH) {
> +		resp.bad_wr = 0;
> +		ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code. It adds assumptions to uverbs about the underlying driver implementation and ties that to the QP type. I don't know if it makes more sense to key off something in the cmd or define some other property of the QP, but the NULL parameters into post_send are non-intuitive.

> +	if (ret)
> +		resp.bad_wr = cmd.wr_count;

Is this always the case?
- Sean
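The create-flag extension Mirek proposes above can be sketched directly; the first two values are the ones quoted from ib_verbs.h, while `IB_QP_CREATE_USE_SHARED_PAGE` is the proposed addition, not an existing flag:

```c
#include <assert.h>

/* ib_qp_create_flags as quoted in the message above, extended with the
 * proposed shared-page bit; only the first two values exist today. */
enum ib_qp_create_flags {
    IB_QP_CREATE_IPOIB_UD_LSO             = 1 << 0,
    IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK = 1 << 1,
    IB_QP_CREATE_USE_SHARED_PAGE          = 1 << 2, /* proposed, hypothetical */
};

/* uverbs could key the bypass path off this bit instead of the QP type,
 * which addresses Sean's objection about tying behaviour to IB_QPT_RAW_ETH. */
static int qp_uses_shared_page(unsigned int create_flags)
{
    return !!(create_flags & IB_QP_CREATE_USE_SHARED_PAGE);
}
```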
RE: ibv_post_send/recv kernel path optimizations
Roland,

You are right that the idr implementation introduces an insignificant change in performance. I made the version with idr and semaphore usage and I see a minimal change compared to the hash table. Now only a shared page is used instead of kmalloc and copy_to_user. I simplified the changes to uverbs and I achieved what I wanted in performance. Now the patch looks like below. Are these changes acceptable for k.org?

Regards,
Mirek

--- ../SOURCES_19012011/ofa_kernel-1.5.3/drivers/infiniband/core/uverbs_cmd.c	2011-01-19 05:37:55.0 +0100
+++ ofa_kernel-1.5.3_idr_qp/drivers/infiniband/core/uverbs_cmd.c	2011-01-21 04:10:07.0 +0100
@@ -1449,15 +1449,29 @@
 	if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
 		return -EINVAL;
 
+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_send(qp, NULL, NULL);
+		if (ret)
+			resp.bad_wr = cmd.wr_count;
+
+		if (copy_to_user((void __user *) (unsigned long) cmd.response,
+				 &resp, sizeof resp))
+			ret = -EFAULT;
+		put_qp_read(qp);
+		goto out_raw_qp;
+	}
 
 	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
 	if (!user_wr)
 		return -ENOMEM;
 
-	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-	if (!qp)
-		goto out;
-
 	is_ud = qp->qp_type == IB_QPT_UD;
 	sg_ind = 0;
 	last = NULL;
@@ -1577,9 +1591,8 @@
 		wr = next;
 	}
 
-out:
 	kfree(user_wr);
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }
@@ -1681,16 +1694,31 @@
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_recv(qp, NULL, NULL);
+		if (ret)
+			resp.bad_wr = cmd.wr_count;
+
+		if (copy_to_user((void __user *) (unsigned long) cmd.response,
+				 &resp, sizeof resp))
+			ret = -EFAULT;
+		put_qp_read(qp);
+		goto out_raw_qp;
+	}
+
 	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
 				       in_len - sizeof cmd,
 				       cmd.wr_count, cmd.sge_count, cmd.wqe_size);
 	if (IS_ERR(wr))
 		return PTR_ERR(wr);
 
-	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-	if (!qp)
-		goto out;
-
 	resp.bad_wr = 0;
 	ret = qp->device->post_recv(qp, wr, &bad_wr);
 
@@ -1707,13 +1735,13 @@
 			 &resp, sizeof resp))
 		ret = -EFAULT;
 
-out:
 	while (wr) {
 		next = wr->next;
 		kfree(wr);
 		wr = next;
 	}
 
+out_raw_qp:
 	return ret ? ret : in_len;
 }

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Monday, January 10, 2011 9:38 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

> You are right that most of the speed-up is coming from avoiding semaphores,
> but not only.
>
> From the oprof traces, the semaphores made half of the difference.
>
> The next one was copy_from_user and kmalloc/kfree usage (in my proposal the
> shared page method is used instead)

OK, but in any case the switch from idr to hash table seems to be insignificant. I agree that using a shared page is a good idea, but removing locking needed for correctness is not a good optimization.

> In my opinion, the responsibility for cases like protection of QP
> against destroy during buffer post (and other similar cases) should
> be moved to the vendor driver. The OFED code should move only the code
> path to the driver.

Not sure what OFED code you're talking about. We're discussing the kernel uverbs code, right?
In any case I'd be interested in seeing how it looks if you move the protection into the individual drivers. I'd be worried about having to duplicate the same code everywhere (which leads to bugs in individual drivers) -- I guess this could be resolved by having the code be a library that individual drivers call into. But also I'm not sure if I see how you could make such a scheme work -- you need to make sure that the data str
RE: Problems with ibv_post_send and completion queues
Hello Manoj,

Responding to your questions:

1. The 510 limit is the HW maximum QP length for the NE020 card, so it cannot be increased.

2. The NE020 driver keeps posted buffers on a FIFO-like queue, which is 510 entries long in your application. Head and tail pointers are maintained so the QP cannot overflow. The head pointer is updated during post_send and the tail is updated during poll_cq on the CQ assigned to your QP. When you call post_send without checking the CQ, the tail pointer is never updated, so after 510 post_send calls the QP looks full to the driver and in effect you cannot send more data using that QP.

Regards,
Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Manoj Nambiar
Sent: Wednesday, January 19, 2011 12:21 PM
To: linux-rdma@vger.kernel.org
Subject: Problems with ibv_post_send and completion queues

Hi,

I am writing some rdma based programs using the Intel NetEffect iWARP NICs. I am running into the following problems with my code:

1. I can only set maximum work requests to 510 using rdma_create_qp; otherwise it gives me an error: "libnes: nes_ucreate_qp Bad sq attr parameters max_send_wr=511 max_send_sge=1". Is there a way to increase this?

2. Is there a way to do RDMA writes without using a completion queue? I use an alternative channel to determine whether my work requests were correctly executed. When I tried to do so I could send 510 (maybe related to the previous question) work requests successfully. After that ibv_post_send returns 22. Repeatedly retrying ibv_post_send doesn't seem to clear the problem; it returns the same error code. I checked the error code, which tells me "invalid arguments" - I am unable to make sense of this. Is there a way to clean up the work requests in the system?

I am creating the queue pair with sq_sig_all = 0 in struct ibv_qp_init_attr and am not setting IBV_SEND_SIGNALED in the send_flags member of struct ibv_send_wr. This means I do not get any completion events. When calling rdma_create_qp I initialize the send_cq and recv_cq of the struct ibv_qp_init_attr with the completion queue created using ibv_create_cq (with the cqe parameter the same as cap.max_send_wr in struct ibv_qp_init_attr). (I think that creating a completion queue is a must for creating an rdma queue pair - pls correct me if I am wrong.)

Pls note - I do not get this problem when I poll completion queues (when the rdma queue pair is created with sq_sig_all = 0 && IBV_SEND_SIGNALED set in the flags of the work requests to ibv_post_send).

Thanks,
Manoj Nambiar
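The head/tail accounting Mirek describes can be sketched as a simple ring; names and structure are illustrative, not the actual libnes implementation:

```c
#include <assert.h>

/* Head/tail accounting as described above: head advances on post_send,
 * tail advances when poll_cq reaps a completion.  With a 510-entry ring,
 * 510 unreaped posts make the QP look full to the driver. */
#define SQ_DEPTH 510

struct sq_ring {
    unsigned int head; /* bumped by post_send */
    unsigned int tail; /* bumped by poll_cq */
};

static unsigned int sq_in_flight(const struct sq_ring *r)
{
    return r->head - r->tail; /* safe across unsigned wrap-around */
}

/* Fails once 510 posts are outstanding, mirroring the behaviour Manoj saw. */
static int sq_post(struct sq_ring *r)
{
    if (sq_in_flight(r) >= SQ_DEPTH)
        return -1;
    r->head++;
    return 0;
}

static void sq_reap(struct sq_ring *r)
{
    if (sq_in_flight(r) > 0)
        r->tail++;
}
```

This is why retrying ibv_post_send without polling the CQ cannot clear the error: nothing ever moves the tail.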
RE: ibv_post_send/recv kernel path optimizations
Roland,

You are right that most of the speed-up comes from avoiding semaphores, but not only. From the oprof traces, the semaphores made half of the difference. The next one was copy_from_user and kmalloc/kfree usage (in my proposal the shared page method is used instead).

In my opinion, the responsibility for cases like protecting a QP against destroy during a buffer post (and other similar cases) should be moved to the vendor driver. The OFED code should only move the code path to the driver.

Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Wednesday, January 05, 2011 7:17 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

> The patch for ofed-1.5.3 looks like below. I will try to push it to
> kernel.org after porting.
>
> Now an uverbs post_send/post_recv path is modified to make a pre-lookup
> for RAW_ETH QPs. When a RAW_ETH QP is found the driver specific path
> is used for posting buffers, for example using a shared page approach in
> cooperation with the user-space library.

I don't quite see why a hash table helps performance much vs. an IDR. Is the actual IDR lookup a significant part of the cost? (By the way, instead of list_head you could use hlist_head to make your hash table denser and save cache footprint -- that way an 8-entry table on 64-bit systems fits in one cacheline.)

Also it seems that you get rid of all the locking on QPs when you look them up in your hash table. What protects against userspace posting a send in one thread and destroying the QP in another thread, and ending up having the destroy complete before the send is posted (leading to a use-after-free in the kernel)? I would guess that your speedup is really coming from getting rid of locking that is actually required for correctness. Maybe I'm wrong though, I'm just guessing here.

- R.
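The small table Roland suggests can be sketched with hlist-style singly linked buckets: 8 one-pointer heads are 64 bytes on a 64-bit system, i.e. a single cacheline. This sketch only shows the lookup structure; it deliberately says nothing about the destroy-vs-post locking Roland raises, which such a table does not solve by itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 8-bucket QP hash; names are invented for this sketch. */
#define QP_HASH_BUCKETS 8

struct qp_node {
    uint32_t qpn;
    struct qp_node *next; /* singly linked, like a kernel hlist */
};

struct qp_hash {
    struct qp_node *bucket[QP_HASH_BUCKETS]; /* 8 * 8 = 64 bytes on 64-bit */
};

static unsigned int qp_hash_fn(uint32_t qpn)
{
    return qpn & (QP_HASH_BUCKETS - 1);
}

static void qp_hash_insert(struct qp_hash *h, struct qp_node *n)
{
    unsigned int b = qp_hash_fn(n->qpn);
    n->next = h->bucket[b];
    h->bucket[b] = n;
}

static struct qp_node *qp_hash_lookup(struct qp_hash *h, uint32_t qpn)
{
    struct qp_node *n;
    for (n = h->bucket[qp_hash_fn(qpn)]; n; n = n->next)
        if (n->qpn == qpn)
            return n;
    return NULL;
}
```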
RE: ibv_post_send/recv kernel path optimizations
> Just to clarify, when saying "achieved performance comparable to
> previous solution" you refer to the approach which bypasses uverbs on
> the post send path? Also, why enhance only the raw qp flow?

I compare to my previous solution, which used a private device for passing information about packets. Compared to the current approach I see more than 20% improvement.

This solution introduces a new path for posting buffers using a shared page. It works the following way:

1. Create a RAW qp and add it to the raw QP hash list.
2. The user-space library mmaps the shared page (this is a device-specific action and must be implemented separately for each driver).
3. During buffer posting the library puts the buffer info into the shared page and calls uverbs.
4. uverbs detects the raw qp and informs the driver, bypassing the current path.

The solution cannot be shared between RDMA drivers because it needs a redesign of the driver (the shared page format is vendor specific). Currently only the NES driver implements the RAW QP path through the kernel (other vendors use a pure user-space solution), so no other vendor will use this path. There is a possibility to add a new QP capability or attribute that will inform uverbs that the new transmit path is used; then the solution could be extended to all drivers.

Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Monday, December 27, 2010 4:22 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
> I implemented the very small hash table and I achieved performance
> comparable to previous solution.

Just to clarify, when saying "achieved performance comparable to previous solution" you refer to the approach which bypasses uverbs on the post send path? Also, why enhance only the raw qp flow?

Or.
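The 4-step shared-page flow described above reduces to one dispatch decision in uverbs: for a RAW QP, call the driver with a NULL WR list and let it read the WRs from the mmapped shared page. All names below are simplified stand-ins for the kernel structures, sketched in user space:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dispatch for the shared-page bypass. */
enum qp_type { QPT_UD, QPT_RAW_ETH };

struct qp;
typedef int (*post_send_fn)(struct qp *qp, const void *wr);

struct qp {
    enum qp_type type;
    post_send_fn post_send;
};

static const void *g_last_wr; /* records what the driver stub was handed */

static int stub_post_send(struct qp *qp, const void *wr)
{
    (void)qp;
    g_last_wr = wr;
    return 0;
}

static int uverbs_post_send(struct qp *qp, const void *user_wr)
{
    if (qp->type == QPT_RAW_ETH)
        return qp->post_send(qp, NULL); /* bypass: WRs live in the shared page */
    return qp->post_send(qp, user_wr);  /* legacy unmarshalled path */
}
```

Keying the bypass off the QP type is exactly what Sean objected to earlier in the thread; a create flag or QP attribute would carry the same decision more cleanly.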
RE: ibv_post_send/recv kernel path optimizations
+				 cmd.response,
+				 &resp,
+				 sizeof resp))
+			ret = -EFAULT;
+		goto out_raw_qp;
+	}
+	}
+
 	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
 	if (!user_wr)
 		return -ENOMEM;
@@ -1579,7 +1649,7 @@ out_put:
 out:
 	kfree(user_wr);
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }
@@ -1664,7 +1734,6 @@ err:
 		kfree(wr);
 		wr = next;
 	}
-
 	return ERR_PTR(ret);
 }
@@ -1681,6 +1750,25 @@ ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;
 
+	mutex_lock(&file->mutex);
+	qp = raw_qp_lookup(cmd.qp_handle, file->ucontext);
+	mutex_unlock(&file->mutex);
+	if (qp) {
+		if (qp->qp_type == IB_QPT_RAW_ETH) {
+			resp.bad_wr = 0;
+			ret = qp->device->post_recv(qp, NULL, &bad_wr);
+			if (ret)
+				resp.bad_wr = cmd.wr_count;
+
+			if (copy_to_user((void __user *) (unsigned long)
+					 cmd.response,
+					 &resp, sizeof resp))
+				ret = -EFAULT;
+			goto out_raw_qp;
+		}
+	}
+
 	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
 				       in_len - sizeof cmd,
 				       cmd.wr_count, cmd.sge_count, cmd.wqe_size);
@@ -1713,7 +1801,7 @@ out:
 		kfree(wr);
 		wr = next;
 	}
-
+out_raw_qp:
 	return ret ? ret : in_len;
 }

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f5b054a..adf1dd8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -838,6 +838,8 @@ struct ib_fmr_attr {
 	u8	page_shift;
 };
 
+#define MAX_RAW_QP_HASH 8
+
 struct ib_ucontext {
 	struct ib_device       *device;
 	struct list_head	pd_list;
@@ -848,6 +850,7 @@ struct ib_ucontext {
 	struct list_head	srq_list;
 	struct list_head	ah_list;
 	struct list_head	xrc_domain_list;
+	struct list_head	raw_qp_hash[MAX_RAW_QP_HASH];
 	int			closing;
 };
@@ -859,6 +862,7 @@ struct ib_uobject {
 	int			id;		/* index into kernel idr */
 	struct kref		ref;
 	struct rw_semaphore	mutex;		/* protects .live */
+	struct list_head	raw_qp_list;	/* raw qp hash */
 	int			live;
 };

Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Monday, December 27, 2010 1:39 PM
To: Jason Gunthorpe; Walukiewicz, Miroslaw
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

Jason Gunthorpe wrote:
> Walukiewicz, Miroslaw wrote:
>> called for many QPs, there is a single entry point to
>> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
>> case there is a lookup to QP store (idr_read_qp) necessary to find a
>> correct ibv_qp structure, which is a big time consumer on the path.

> I don't think this should be such a big problem. The simplest solution
> would be to front the idr_read_qp with a small learning hashing table.

yes, there must be a few ways (e.g as Jason suggested) to do this house-keeping much more efficiently, in a manner that fits the fast path - which maybe wasn't the mindset when this code was written, as its primary use was to invoke control plane commands.

>> The NES IMA kernel path also has such QP lookup but the QP number
>> format is designed to make such lookup very quick. The QP numbers in
>> OFED are not defined so generic lookup functions like idr_read_qp() must be
>> used.

> Maybe look at moving the QPN to ibv_qp translation into the driver
> then - or better yet, move allocation out of the driver, if Mellanox
> could change their FW.. You are right that we could do this much
> faster if the QPN was structured in some way

I think there should be some validation on the uverbs level, as the caller is an untrusted user space application, e.g in a similar way to what is done for each system call made on a file descriptor.

Or.
RE: ibv_post_send/recv kernel path optimizations
Or,

I looked into the shared page approach for passing post_send/post_recv info. I still have some concerns.

The shared page must be allocated per QP, and there should be a common way to allocate such a page for each driver. As Jason and Roland said, the best way to pass this parameter through mmap is the offset. But there is no common definition of how the offset is used per driver; it is a driver-specific parameter.

The next problem is how many shared pages a driver should allocate to share with user space. They must have room for each buffer posted by the application. This is a big concern for post_recv, where a large number of buffers are posted. The current implementation has no such limit.

Even if a common offset definition were agreed and accepted, the shared page must be stored in the ib_qp structure. When post_send is called for many QPs, there is a single entry point to ib_uverbs_post_send using a write to /dev/infiniband/uverbsX. In that case a lookup into the QP store (idr_read_qp) is necessary to find the correct ibv_qp structure, which is a big time consumer on the path. The NES IMA kernel path also has such a QP lookup, but the QP number format is designed to make the lookup very quick. The QP numbers in OFED are not defined, so generic lookup functions like idr_read_qp() must be used.

Regards,
Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> From the trace it looks like the __up_read() - 11% wastes most of time. It is
> called from idr_read_qp when a put_uobj_read is called. if
> (copy_from_user(&cmd, buf, sizeof cmd)) - 5% it is called twice from
> ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame...
> and __kmalloc/kfree - 5% is the third function that has a big meaning. It is
> called twice for each frame transmitted. It is about 20% of performance loss
> comparing to nes_ud_sksq path which we miss when we use a OFED path.
>
> What I can modify is a kmalloc/kfree optimization - it is possible to make
> allocation only at start and use pre-allocated buffers. I don't see any way
> for optimization of idr_read_qp usage or copy_user. In current approach we
> use a shared page and a separate nes_ud_sksq handle for each created QP so
> there is no need for any user space data copy or QP lookup.

As was mentioned earlier on this thread, and repeated here, the kmalloc/kfree can be removed. As for the 2nd copy_from_user, I don't see why the ib uverbs flow (BTW - the data path has nothing to do with the rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced e.g to support a shared page which is allocated && mmapped from uverbs to user space and used in the same manner your implementation does. The 1st copy_from_user should add pretty much nothing, and if it does, it can be replaced with a different user/kernel IPC mechanism which costs less. So we're basically left with the idr_read_qp; I wonder what other people think about if/how this can be optimized?

Or.
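The "which page does this mmap offset mean" problem raised above has a common solution: pack a region type and the QP number into the page-aligned offset. The bit layout below is purely illustrative, not any driver's actual ABI:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: offset = (region:8 | qpn:24) << PAGE_SHIFT,
 * so the result stays page aligned and the driver's mmap handler can
 * tell a doorbell mapping from a per-QP shared-page mapping. */
#define PAGE_SHIFT 12

enum mmap_region {
    REGION_DOORBELL    = 0,
    REGION_SHARED_PAGE = 1,
};

static uint64_t encode_mmap_offset(unsigned int region, uint32_t qpn)
{
    return (((uint64_t)region << 24) | (qpn & 0xFFFFFFu)) << PAGE_SHIFT;
}

static unsigned int offset_region(uint64_t off)
{
    return (unsigned int)((off >> PAGE_SHIFT) >> 24);
}

static uint32_t offset_qpn(uint64_t off)
{
    return (uint32_t)((off >> PAGE_SHIFT) & 0xFFFFFFu);
}
```

Encoding the QP number in the offset also addresses the per-QP page requirement from the message above: user space mmaps one distinct offset per QP, and the driver never needs a lookup on the data path to find the page.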
RE: ibv_post_send/recv kernel path optimizations
Or,

> I don't see why the ib uverbs flow (BTW - the data path has nothing to do with
> the rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced
> e.g to support a shared page which is allocated && mmapped from uverbs to
> user space and used in the same manner your implementation does.

The problem that I see is that mmap is currently used for mapping the doorbell page in different drivers. We can use it for mapping a page for transmit/receive operations only when we are able to differentiate whether we need to map the doorbell or our shared page.

The second problem is that this rx/tx mmap should map a separate page per QP to avoid the unnecessary QP lookups, so the page identifier passed to mmap should be based on a QP identifier.

I cannot find the specific code for /dev/infiniband/uverbsX. Does this device driver share the same functions as /dev/infiniband/rdma_cm, or does it have its own implementation?

Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> Form the trace it looks like the __up_read() - 11% wastes most of time. It is
> called from idr_read_qp when a put_uobj_read is called. if
> (copy_from_user(&cmd, buf, sizeof cmd)) - 5% it is called twice from
> ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame...
> and __kmalloc/kfree - 5% is the third function that has a big meaning. It is
> called twice for each frame transmitted. It is about 20% of performance loss
> comparing to nes_ud_sksq path which we miss when we use a OFED path.
>
> What I can modify is a kmalloc/kfree optimization - it is possible to make
> allocation only at start and use pre-allocated buffers. I don't see any way
> for optimization of idr_read_qp usage or copy_user. In current approach we
> use a shared page and a separate nes_ud_sksq handle for each created QP so
> there is no need for any user space data copy or QP lookup.

As was mentioned earlier on this thread, and repeated here, the kmalloc/kfree can be removed. As for the 2nd copy_from_user, I don't see why the ib uverbs flow (BTW - the data path has nothing to do with the rdma_cm, you're working with /dev/infiniband/uverbsX) can't be enhanced e.g to support a shared page which is allocated && mmapped from uverbs to user space and used in the same manner your implementation does. The 1st copy_from_user should add pretty much nothing, and if it does, it can be replaced with a different user/kernel IPC mechanism which costs less. So we're basically left with the idr_read_qp; I wonder what other people think about if/how this can be optimized?

Or.
RE: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)
Some time ago we discussed the possibility of removing the use of nes_ud_sksq in the IMA driver, as a blocker for pushing the IMA solution to kernel.org. The proposal was to use the OFED transmit-optimized path via /dev/infiniband/rdma_cm instead of the private nes_ud_sksq device.

I made an implementation of this solution to check the performance impact and to look for ways to optimize the existing code. I ran a simple send test (sendto in kernel) on my NEHALEM i7 machine. The current nes_ud_sksq implementation achieved about 1.25 mln pkts/sec. The OFED path (with the rdma_cm call) achieved about 0.9 mln pkts/sec.

I ran oprofile on the rdma_cm code and got the following results:

samples   %        linenr info                 app name             symbol name
2586067   24.5323  nes_uverbs.c:558            libnes-rdmav2.so     nes_upoll_cq
1198042   11.3650  (no location information)   vmlinux              __up_read
 539258    5.1156  (no location information)   vmlinux              copy_user_generic_string
 407884    3.8693  msa_verbs.c:1692            libmsa.so.1.0.0      msa_post_send
 304569    2.8892  msa_verbs.c:2098            libmsa.so.1.0.0      usq_sendmsg_noblock
 299954    2.8455  (no location information)   vmlinux              __kmalloc
 297463    2.8218  (no location information)   libibverbs.so.1.0.0  /usr/lib64/libibverbs.so.1.0.0
 267951    2.5419  uverbs_cmd.c:1433           ib_uverbs.ko         ib_uverbs_post_send
 264709    2.5111  (no location information)   vmlinux              kfree
 205107    1.9457  port.c:2947                 libmsa.so.1.0.0      sendto
 146225    1.3871  (no location information)   vmlinux              __down_read
 145941    1.3844  (no location information)   libpthread-2.5.so    __write_nocancel
 139934    1.3275  nes_ud.c:1746               iw_nes.ko            nes_ud_post_send_new_path
 131879    1.2510  send.c:32                   msa_tst              blocking_test_send(void*)
 127519    1.2097  (no location information)   vmlinux              system_call
 123552    1.1721  port.c:858                  libmsa.so.1.0.0      find_mcast
 109249    1.0364  nes_verbs.c:3478            iw_nes.ko            nes_post_send
  92060    0.8733  (no location information)   vmlinux              vfs_write
  90187    0.8555  uverbs_cmd.c:144            ib_uverbs.ko         __idr_get_uobj
  89563    0.8496  nes_uverbs.c:1460           libnes-rdmav2.so     nes_upost_send

From the trace it looks like __up_read() - 11% - wastes the most time. It is called from idr_read_qp when put_uobj_read is called.

The copy_from_user(&cmd, buf, sizeof cmd) - 5% - is called twice from ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame.

And __kmalloc/kfree - 5% - is the third function that matters; it is called twice for each frame transmitted.

Together this is about 20% of performance loss compared to the nes_ud_sksq path, which we miss when we use the OFED path.

What I can modify is the kmalloc/kfree optimization - it is possible to allocate only at start and use pre-allocated buffers. I don't see any way to optimize the idr_read_qp usage or copy_user. In the current approach we use a shared page and a separate nes_ud_sksq handle for each created QP, so there is no need for any user-space data copy or QP lookup.

Do you have any idea how we can optimize this path?

Regards,
Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Thursday, November 25, 2010 4:01 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)

Jason Gunthorpe wrote:
> Hmm, considering your list is everything but Mellanox, maybe it makes much
> more sense to push the copy_to_user down into the driver - ie a
> ibv_poll_cq_user - then the driver can construct each CQ entry on the stack
> and copy it to userspace, avoid the double copy, allocation and avoid any
> fixed overhead of ibv_poll_cq.
>
> A bigger change to be sure, but remember this old thread:
> http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html
> 2x improvement by removing allocs on the post path..

Hi Mirek,

Any updates on your findings with the patches?

Or.
RE: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
Hello Jason, Do you have any benchmarks that show the alloca is a measurable overhead? We changed overall path (both kernel and user space) to allocation-less approach and We achieved twice better latency using call to kernel driver. I have no data which path Is dominant - kernel or user space. I think I will have some measurements next week, so I will share My results. Roland is right, all you really need is a per-context (+per-cpu?) buffer you can grab, fill, and put back. I agree. I will go into this direction. Regards, Mirek -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Friday, August 06, 2010 6:33 PM To: Walukiewicz, Miroslaw Cc: Roland Dreier; linux-rdma@vger.kernel.org Subject: Re: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote: > Currently the transmit/receive path works following way: User calls > ibv_post_send() where vendor specific function is called. When the > path should go through kernel the ibv_cmd_post_send() is called. > The function creates the POST_SEND message body that is passed to > kernel. As the number of sges is unknown the dynamic allocation for > message body is performed. (see libibverbs/src/cmd.c) Do you have any benchmarks that show the alloca is a measurable overhead? I'm pretty skeptical... alloca will generally boil down to one or two assembly instructions adjusting the stack pointer, and not even that if you are lucky and it can be merged into the function prologe. > In the kernel the message body is parsed and a structure of wr and > sges is recreated using dynamic allocations in kernel The goal of > this operation is having a similar structure like in user space. .. the kmalloc call(s) on the other hand definately seems worth looking at .. 
> In the kernel, in ib_uverbs_post_send(), instead of dynamic allocation of
> the ib_send_wr structures, a table of 512 ib_send_wr structures will be
> defined and all entries will be linked into a unidirectional list, so the
> qp->device->post_send(qp, wr, &bad_wr) API will not be changed.

Isn't there a kernel API already for managing a pool of pre-allocated fixed-size allocations? It isn't clear to me that is even necessary. Roland is right, all you really need is a per-context (+per-cpu?) buffer you can grab, fill, and put back.

> As far as I know, no driver uses that kernel path for posting buffers, so
> the iWARP multicast acceleration implemented in the NES driver would be a
> first application that can utilize the optimized path.

??

Jason
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
I agree with you that changing the kernel ABI is not necessary. I will follow your directions regarding a single allocation at start.

Regards, Mirek

-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Friday, August 06, 2010 5:58 PM
To: Walukiewicz, Miroslaw
Cc: linux-rdma@vger.kernel.org
Subject: Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

> The proposed path optimization is removing of dynamic allocations
> by redefining a structure definition passed to the kernel.
> To
>
> struct ibv_post_send {
>         __u32 command;
>         __u16 in_words;
>         __u16 out_words;
>         __u64 response;
>         __u32 qp_handle;
>         __u32 wr_count;
>         __u32 sge_count;
>         __u32 wqe_size;
>         struct ibv_kern_send_wr send_wr[512];
> };

I don't see how this can possibly work. Where does the scatter/gather list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to get rid of dynamic allocations... can't you just have the kernel keep a cached send_wr allocation (say, per user context) and reuse that? (i.e. allocate memory but don't free it the first time into post_send, only reallocate if a bigger send request comes, and only free when destroying the context)

 - R.
--
Roland Dreier || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
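Roland's cached-allocation suggestion can be sketched in user-space C as follows. This is a hypothetical `wr_cache` helper, not the actual uverbs code: allocate on first use, grow only when a larger request arrives, and free only when the context is destroyed.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical per-context cache: the buffer survives across
 * post_send calls, so the common case does no allocation at all. */
struct wr_cache {
    void   *buf;
    size_t  size;
};

/* Return a buffer of at least `need` bytes, growing only if the
 * cached one is too small. */
static void *wr_cache_get(struct wr_cache *c, size_t need)
{
    if (need > c->size) {
        void *p = realloc(c->buf, need);
        if (!p)
            return NULL;
        c->buf  = p;
        c->size = need;
    }
    return c->buf;
}

/* Free only when the owning context goes away. */
static void wr_cache_destroy(struct wr_cache *c)
{
    free(c->buf);
    c->buf  = NULL;
    c->size = 0;
}
```

In the kernel the same pattern would hang off the user context, using krealloc()/kfree() under a per-context lock.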
[RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations
Currently the ibv_post_send()/ibv_post_recv() path through the kernel (using /dev/infiniband/rdmacm) could be optimized by removing dynamic memory allocations on the path.

Currently the transmit/receive path works the following way: User calls ibv_post_send() where a vendor-specific function is called. When the path should go through the kernel, ibv_cmd_post_send() is called. The function creates the POST_SEND message body that is passed to the kernel. As the number of sges is unknown, a dynamic allocation for the message body is performed. (see libibverbs/src/cmd.c)

In the kernel the message body is parsed and a structure of wr and sges is recreated using dynamic allocations in the kernel. The goal of this operation is having a similar structure like in user space.

The proposed path optimization is the removal of dynamic allocations by redefining the structure definition passed to the kernel.

From

struct ibv_post_send {
        __u32 command;
        __u16 in_words;
        __u16 out_words;
        __u64 response;
        __u32 qp_handle;
        __u32 wr_count;
        __u32 sge_count;
        __u32 wqe_size;
        struct ibv_kern_send_wr send_wr[0];
};

To

struct ibv_post_send {
        __u32 command;
        __u16 in_words;
        __u16 out_words;
        __u64 response;
        __u32 qp_handle;
        __u32 wr_count;
        __u32 sge_count;
        __u32 wqe_size;
        struct ibv_kern_send_wr send_wr[512];
};

A similar change is required in the kernel struct ib_uverbs_post_send defined in /ofa_kernel/include/rdma/ib_uverbs.h

This change limits the number of send_wr entries passed from unlimited (assured by dynamic allocation) to a reasonable number of 512. I think this number should be the maximum number of QP entries available to send. As all IB/iWARP applications are low-latency applications, the number of WRs passed per call is never unbounded.

As a result, instead of dynamic allocation, ibv_cmd_post_send() fills the proposed structure directly and passes it to the kernel. Whenever the number of send_wr entries exceeds the limit, the ENOMEM error is returned.
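The user-space side of the proposal can be sketched as follows. These are simplified stand-in structures, not the real ABI: ibv_cmd_post_send() would write into the fixed-size array directly and fail with ENOMEM past the 512-entry limit.

```c
#include <errno.h>
#include <string.h>

#define MAX_WR 512

/* Simplified stand-in for struct ibv_kern_send_wr. */
struct kern_send_wr {
    unsigned long wr_id;
    unsigned int  num_sge;
};

/* Fixed-size command body, as in the proposed struct ibv_post_send. */
struct post_send_cmd {
    unsigned int qp_handle;
    unsigned int wr_count;
    struct kern_send_wr send_wr[MAX_WR];
};

/* Fill the command directly -- no dynamic allocation on the path;
 * exceeding the limit is reported as ENOMEM, as the proposal states. */
static int fill_post_send(struct post_send_cmd *cmd,
                          const struct kern_send_wr *wrs, unsigned int n)
{
    if (n > MAX_WR)
        return ENOMEM;
    cmd->wr_count = n;
    memcpy(cmd->send_wr, wrs, n * sizeof(*wrs));
    return 0;
}
```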
In the kernel, in ib_uverbs_post_send(), instead of dynamic allocation of the ib_send_wr structures, a table of 512 ib_send_wr structures will be defined and all entries will be linked into a unidirectional list, so the qp->device->post_send(qp, wr, &bad_wr) API will not be changed.

As far as I know, no driver uses that kernel path for posting buffers, so the iWARP multicast acceleration implemented in the NES driver would be a first application that can utilize the optimized path.

Regards, Mirek

Signed-off-by: Mirek Walukiewicz
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
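The kernel-side idea — a fixed table of work-request entries linked into a unidirectional list so the post_send() driver API stays untouched — can be sketched like this (a simplified stand-in struct, not the real ib_send_wr):

```c
#include <stddef.h>

#define MAX_SEND_WR 512

/* Simplified stand-in for struct ib_send_wr; the real kernel
 * structure also carries the opcode, sg_list, and so on. */
struct send_wr {
    unsigned long   wr_id;
    struct send_wr *next;
};

/* Link the first wr_count entries of a fixed table into a
 * unidirectional list, so the existing
 * qp->device->post_send(qp, wr, &bad_wr) call can be used unchanged. */
static struct send_wr *link_wr_table(struct send_wr *table,
                                     unsigned int wr_count)
{
    unsigned int i;

    if (wr_count == 0 || wr_count > MAX_SEND_WR)
        return NULL;            /* caller maps this to -ENOMEM */
    for (i = 0; i + 1 < wr_count; i++)
        table[i].next = &table[i + 1];
    table[wr_count - 1].next = NULL;
    return &table[0];
}
```

Returning NULL for wr_count above 512 corresponds to the ENOMEM case described in the proposal.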
RE: InfiniBand/RDMA merge plans for 2.6.36
Hello Or,

My patch was implemented using the most effective method available for the current version of the code, and it is ready and working as a functionality. For me, a separate matter from functionality is the optimization of the existing SW paths in the InfiniBand code and the modification of earlier-written code to the new interfaces. Tomorrow I will start a new discussion regarding an optimization of the post_send/post_recv path. Thank you for the reminder.

Regards, Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Thursday, August 05, 2010 2:28 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; linux-ker...@vger.kernel.org; linux-rdma@vger.kernel.org
Subject: Re: InfiniBand/RDMA merge plans for 2.6.36

Walukiewicz, Miroslaw wrote:
> Hello Roland, What about a series from Aleksey Senin [...] And my patch
> RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
> https://patchwork.kernel.org/patch/110252

Hi Mirek,

Reading your response @ http://marc.info/?l=linux-rdma&m=127954552519544 to the comments made during the review, I was under the impression that you're going to try and modify the NES implementation, isn't this the case any more?

Or.
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: InfiniBand/RDMA merge plans for 2.6.36
Hello Roland,

What about the series from Aleksey Senin:

[PATCH V1 0/4] New RAW_PACKET QP type
[PATCH V1 1/4] Rename RAW_ETY to RAW_ETHERTYPE
[PATCH V1 2/4] New RAW_PACKET QP type definition
[PATCH V1 3/4] Security check on QP type
[PATCH V1 4/4] Add RAW_PACKET to ib_attach/detach mcast calls

And my patch:

RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
https://patchwork.kernel.org/patch/110252/

I see only one patch from that series in your plans:

Aleksey Senin (1):
  IB: Rename RAW_ETY to RAW_ETHERTYPE

Regards, Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Thursday, August 05, 2010 1:34 AM
To: linux-ker...@vger.kernel.org; linux-rdma@vger.kernel.org
Subject: InfiniBand/RDMA merge plans for 2.6.36

Since 2.6.35 is here, it's probably a good time to talk about 2.6.36 merge plans. All the pending things that I'm aware of are listed below. Boilerplate: If something isn't already in my tree and it isn't listed below, I probably missed it or dropped it unintentionally. Please remind me.

As usual, when submitting a patch:

- Give a good changelog that explains what issue your patch addresses, how you address the issue, how serious the issue is, and any other information that would be useful to someone evaluating your patch now, or trying to understand it years from now.

- Please make sure that you include a "Signed-off-by:" line, and put any extra junk that should not go into the final kernel log *after* the "---" line so that git tools strip it off automatically. Make the subject line be appropriate for inclusion in the kernel log as well once the leading "[PATCH ...]" stuff is stripped off. I waste a lot of time fixing patches by hand that could otherwise be spent doing something productive like watching youtube.

- Run your patch through checkpatch.pl so I don't have to nag you to fix trivial issues (or spend time fixing them myself).
- Check your patch over at least enough so I don't see a memory leak or deadlock as soon as I look at it.

- Build your patch with sparse checking ("C=2 CF=-D__CHECK_ENDIAN__") and make sure it doesn't introduce new warnings. (A big bonus in goodwill for sending patches that fix old warnings)

- Test your patch on a kernel with things like slab debugging and lockdep turned on.

And while you're waiting for me to get to your patch, I sure wouldn't mind if you read and commented on someone else's patch. We currently have a big imbalance between people who are writing patches (many) and people who are reviewing patches (mostly me). None of this means you shouldn't remind me about pending patches, since I often lose track of things and drop them accidentally.

I don't think it makes sense to break down what I merged by topics this time around -- there wasn't anything big that I can think of. It was really just a matter of small improvements, fixes, and cleanups all over.

Here are a few topics I'm tracking that are not ready in time for the 2.6.36 window and will need to wait for 2.6.37 at least:

- XRC. While I think we made significant progress here, the fact is that this is not ready to merge at the beginning of the merge window, and so we'll need to keep working on it and wait for the next merge window. I think this is just blocked on me at the moment.

- IBoE. Same as XRC; we made significant progress (and I opened an iboe branch to track this), and I think we have finally gotten the user-kernel interface nailed down, but it's just too late.

- ummunotify-as-part-of-uverbs. I'm working on this but don't have anything ready for the merge window.

- AF_IB work. I have not even had a chance to think about this yet, since I haven't dug through earlier backlog items.

- mlx4 SR-IOV support. See AF_IB above.
Here are all the patches I already have in my for-next branch:

Aleksey Senin (1):
  IB: Rename RAW_ETY to RAW_ETHERTYPE

Alexander Schmidt (3):
  IB/ehca: Fix bitmask handling for lock_hcalls
  IB/ehca: Catch failing ioremap()
  IB/ehca: Init irq tasklet before irq can happen

Arnd Bergmann (1):
  IB/qib: Use generic_file_llseek

Bart Van Assche (3):
  IB/srp: Use print_hex_dump()
  IB/srp: Make receive buffer handling more robust
  IB/srp: Export req_lim via sysfs

Ben Hutchings (1):
  IB/ipath: Fix probe failure path

Chien Tung (1):
  RDMA/nes: Store and print eeprom version

Dan Carpenter (2):
  RDMA/cxgb4: Remove unneeded assignment
  RDMA/cxgb3: Clean up signed check of unsigned variable

Dave Olson (1):
  IB/qib: Allow PSM to select from multiple port assignment algorithms

David Rientjes (1):
  RDMA/cxgb4: Remove dependency on __GFP_NOFAIL

Faisal Latif (1):
  RDMA/nes: Fix hangs on ifdown

Ira Weiny (1):
  IB/qib: Allow writes to the diag_counters to be able to clear them

Miroslaw Walukiew
RE: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
Hello Or,

I am still thinking about other options and I'm not yet sure which option I should choose for the implementation. I agree with you that it is possible to fix the post_send path in OFED. Let me think about it for a few more days.

Regards, Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Sunday, July 18, 2010 6:52 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org; aleks...@voltaire.com
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

> I don't think there are applications around which would use raw qp AND
> are linked against libibverbs-1.0, such that they would exercise the 1_0
> wrapper, so we can ignore the 1st allocation, the one at the wrapper code.
> As for the 2nd allocation, since a WQE --posting-- is synchronous,
> using the maximal values specified during the creation of the QP, I
> believe that this allocation can be done once per QP and used later.
[...]

Hi Mirek, any comment on my response to the NES patch you sent?

Or.

>
>> dive to kernel:
>> ib_uverbs_post_send()
>> user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); <- 3. dyn alloc
>> next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
>>        user_wr->num_sge * sizeof (struct ib_sge),
>>        GFP_KERNEL); <- 4. dyn alloc
>> And now there is the final call to the driver.

> ~same here for #4 you can compute/allocate once the maximal possible
> size for "next" per qp and use it later. As for #3, this needs further
> thinking.
>
> But before diving into all these design changes, what was the penalty
> introduced by these allocations? is it in packets-per-second, latency?
>
>> Diving to the kernel is treated as something like passing a signal to
>> the kernel that there is prepared information to post_send/post_recv.
>> The information about buffers is passed through a shared page (available
>> to userspace through mmap) to avoid copying of data. The write() op is
>> used to pass the signal about post_send.
>> The read() op is used to pass
>> information about post_recv(). We avoid additional copying of the data
>> that way.

> thanks for the heads-up, I took a look and this user/kernel shared
> memory page is used to hold the work-request, nothing to do with data.
>
> As for the work request, you still have to copy it in user space from
> the user work request to the library mmaped buffer. So the only
> difference would be the copy_from_user done by uverbs, for a few tens of
> bytes; can you tell if/what is the extra penalty introduced by this copy?
>
>> struct nes_ud_send_wr {
>>        u32 wr_cnt;
>>        u32 qpn;
>>        u32 flags;
>>        u32 resv[1];
>>        struct ib_sge sg_list[64];
>> };
>>
>> struct nes_ud_recv_wr {
>>        u32 wr_cnt;
>>        u32 qpn;
>>        u32 resv[2];
>>        struct ib_sge sg_list[64];
>> };

> Looking at struct nes_ud_send/recv_wr, I wasn't sure I follow: the same
> instance can be used to post a list of work requests, where each work
> request is limited to one SGE, am I correct?
>
> I don't think there is a need to support posting 64 --send-- requests;
> for recv it might make sense, but it could be done in a
> "batch/background" flow, thoughts?
>
> Or.

-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
FW: [PATCH] RDMA/nes: corrected link type for nes cards
Now the correct interface link type is set for ibv_query_port().

Signed-off-by: Mirek Walukiewicz
---
 drivers/infiniband/hw/nes/nes_verbs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index f179586..45bf56c 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -599,7 +599,7 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
 	props->active_width = IB_WIDTH_4X;
 	props->active_speed = 1;
 	props->max_msg_sz = 0x8000;
-
+	props->link_layer = IB_LINK_LAYER_ETHERNET;
 	return 0;
 }
[PATCH] RDMA/nes: corrected firmware version update
Now the firmware version is read from the correct place.

Signed-off-by: Mirek Walukiewicz
---
 drivers/infiniband/hw/nes/nes_verbs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 0abd4f2..f179586 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -520,7 +520,7 @@ static int nes_query_device(struct ib_device *ibdev, struct ib_device_attr *prop
 	memset(props, 0, sizeof(*props));
 	memcpy(&props->sys_image_guid, nesvnic->netdev->dev_addr, 6);
-	props->fw_ver = nesdev->nesadapter->fw_ver;
+	props->fw_ver = nesdev->nesadapter->firmware_version;
 	props->device_cap_flags = nesdev->nesadapter->device_cap_flags;
 	props->vendor_id = nesdev->nesadapter->vendor_id;
 	props->vendor_part_id = nesdev->nesadapter->vendor_part_id;
RE: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
Hello Or,

> I still don't see what is the performance issue with the uverbs
> post_send/post_recv and, if there is such, why it can't be fixed, to
> avoid introducing a lib/driver nes special char device. Could you explain
> it with some more details? You mentioned the rdma-cm device file, but the
> uverbs cmd api is used by libibverbs / uverbs and not by librdmacm /
> rdma-ucm, which is anyway a slow path.

From my measurements it looks like the problem is related to memory allocation in the user-space and kernel paths, which is a very, very expensive operation. Look at the tx path (rx is very similar):

ibv_post_send()
  post_send_wrapper_1_0
    for (w = wr; w; w = w->next) {
        real_wr = alloca(sizeof *real_wr); <- 1. dyn alloc
        real_wr->wr_id = w->wr_id;

next the call to the HW-specific part, which prepares the message to send:

    cmd = alloca(cmd_size); <- 2. dyn alloc
    IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp);

dive to kernel:

ib_uverbs_post_send()
    user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); <- 3. dyn alloc
    next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
                   user_wr->num_sge * sizeof (struct ib_sge),
                   GFP_KERNEL); <- 4. dyn alloc

And now there is the final call to the driver. Adding the additional device makes it possible to dive to the kernel without those memory allocations.

> Also, I understand that the .read (.write) entry maps to posting a
> receive (send) buffer; what is the use case for the .mmap entry?

Not exactly. Diving to the kernel is treated as something like passing a signal to the kernel that there is prepared information for post_send/post_recv. The information about buffers is passed through a shared page (available to userspace through mmap) to avoid copying of data. The write() op is used to pass the signal about post_send. The read() op is used to pass information about post_recv(). We avoid additional copying of the data that way.
> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
>                   nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
>                   nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>
> +     if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> +             return 0;

> isn't a raw qp associated with a specific port of the device?

In the NES architecture, the QP type and number define a specific device or port. It is a one-to-one mapping.

Regards, Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Tuesday, July 06, 2010 10:50 AM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org; aleks...@voltaire.com
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver
> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> +     .owner = THIS_MODULE,
> +     .open = nes_ud_sksq_open,
> +     .release = nes_ud_sksq_close,
> +     .write = nes_ud_sksq_write,
> +     .read = nes_ud_sksq_read,
> +     .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> +     .minor = MISC_DYNAMIC_MINOR,
> +     .name = "nes_ud_sksq",
> +     .fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver" email thread, e.g. at the below links, you say

> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is shared with
> all other user-kernel communication and it is quite complex. It is a perfect path
> for QP/CQ/PD/mem management but for me it is too complex for traffic acceleration.
> The user<->kernel path through additional driver, shared page for lkey/vaddr/len
> passing and SW memory translation in kernel is much more effective.
http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what is the performance issue with the uverbs post_send/post_recv and, if there is such, why it can't be fixed, to avoid introducing a lib/driver nes special char device. Could you explain it with some more details? You mentioned the rdma-cm device file, but the uverbs cmd api is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway a slow path.

Also, I understand that the .read (.write) entry maps to posting a receive (send) buffer; what is the use case for the .mmap entry?

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/d
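The shared-page mechanism described in this thread — WR parameters written into an mmap'ed page, with write()/read() on the char device acting only as doorbells — can be illustrated in plain user space. This is a hypothetical layout, not the actual nes_ud_sksq ABI, and a temporary file stands in for the character device:

```c
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical work-request layout living in the shared page;
 * the real driver ABI will differ. */
struct shared_wr {
    unsigned long wr_id;
    unsigned long addr;    /* user buffer address */
    unsigned int  length;
    unsigned int  lkey;
};

/* Map one page of the given fd MAP_SHARED, so user space and the
 * "kernel side" (here: reads through the fd) see the same bytes
 * without any copy per post. */
static struct shared_wr *map_wr_page(int fd)
{
    long psz = sysconf(_SC_PAGESIZE);
    void *p;

    if (ftruncate(fd, psz) != 0)
        return NULL;
    p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

After the user fills `struct shared_wr` through the mapping, a one-byte write() on the device fd would be enough to tell the driver "a new WR is ready" — no allocation and no WR copy on the fast path.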
RE: Name for a new type of QP
I would prefer the name IBV_QPT_FRAME, as this is an L2-layer QP. The term "packet" is reserved for L3.

Regards, Mirek

-Original Message-
From: Moni Shoua [mailto:mo...@voltaire.com]
Sent: Wednesday, June 23, 2010 11:20 AM
To: linux-rdma
Cc: Walukiewicz, Miroslaw; Roland Dreier; al...@voltaire.com
Subject: Name for a new type of QP

Hi,

This message follows a discussion in the EWG mailing list. We want to promote a patch that enables use of a new QP type. This QP type lets the user post_send() data to its SQ and treat it as the entire packet, including headers. An example of use with this QP is sending Ethernet packets from userspace (and enjoying kernel bypass). An open question in this matter is how we should name this QP type. The first name, IBV_QPT_RAW_ETH, seems to be too similar to the existing type IBV_QPT_RAW_ETY. My suggestions (which were posted in a different thread) are:

IBV_QPT_FRAME
IBV_QPT_PACKET
IBV_QPT_NOHDR

Please make your comments and send your suggestions. When we decide on a name we will send a patch that enables the use of this QP type.

thanks
Moni
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
Thanks Moni, I treated them as accepted.

Sean, the response to your question is that the RAW_QP patches must be accepted first to have that application working.

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Moni Shoua
Sent: Monday, June 21, 2010 1:17 PM
To: Walukiewicz, Miroslaw
Cc: Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

Walukiewicz, Miroslaw wrote:
> No, no more changes are necessary. It is a standalone application.
>
> Mirek
>
Are you sure? AFAIK, patches to the kernel and libibverbs are required, which were not accepted yet.
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
No, no more changes are necessary. It is a standalone application.

Mirek

-Original Message-
From: Hefty, Sean
Sent: Monday, June 14, 2010 6:33 PM
To: Walukiewicz, Miroslaw
Cc: linux-rdma@vger.kernel.org
Subject: RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

> The patch adds a new test application describing the usage of
> IBV_QPT_RAW_ETH for IPv4 multicast acceleration on iWARP cards.

Are any changes still needed to the kernel or libibverbs to make this work?
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
Hello Or,

> any reason not to patch mckey to support both IB and Ethernet raw QPs?

mckey works on the UD QP type and mcraw works on the RAW QP type. The data payloads prepared for UD and RAW_QP are on different layers. mckey uses rdma_join_multicast(), which triggers a state machine for IB multicast joining. mcraw does not trigger such a state machine because, for sending Ethernet multicast, there is no need for any multicast-join state machine. The multicast destination address on Ethernet is determined by the multicast group address. For me, the API changes between mcraw and mckey are quite large, and adding additional options could be confusing for users.

> does raw qp has any relation to the iWARP/TOE HW stack?

Yes. As I said, the logic for joining a multicast group is different for IB and Ethernet (I mean Ethernet in general, not iWARP-specific). mcraw handles an Ethernet path for sending multicasts that could be similar for nes and mlx4.

Regards, Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Friday, June 11, 2010 10:13 PM
To: Walukiewicz, Miroslaw
Cc: Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

Walukiewicz, Miroslaw wrote:
> The patch adds a new test application describing the usage of
> IBV_QPT_RAW_ETH for IPv4 multicast acceleration on iWARP cards.
> See man mcraw for parameters description

So this is the only raw qp related patch to librdmacm? any reason not to patch mckey to support both IB and Ethernet raw QPs? does the raw qp have any relation to the iWARP/TOE HW stack? there's also a raw qp patch posted to ewg for mlx4 which has no backing iwarp logic.
Or
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
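Mirek's point that the Ethernet multicast destination is fully determined by the group address can be made concrete: RFC 1112 maps an IPv4 multicast group onto MAC 01:00:5e plus the low 23 bits of the group address, so the L2 destination can be computed locally with no join handshake. A small sketch:

```c
#include <stdint.h>

/* Derive the Ethernet multicast MAC from an IPv4 multicast group,
 * per RFC 1112. `group` is the group address as a host-order 32-bit
 * value, e.g. 0xE0010203 for 224.1.2.3. */
static void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;   /* top bit masked: only 23 bits map */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

Because only 23 of the 28 group-address bits survive the mapping, 32 IPv4 groups share each multicast MAC, which is why receivers still filter on the full IP group address.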
FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
The patch adds a new test application describing the usage of IBV_QPT_RAW_ETH for IPv4 multicast acceleration on iWARP cards. See the mcraw man page for parameter descriptions.

Signed-off-by: Mirek Walukiewicz
---
diff --git a/Makefile.am b/Makefile.am
index 4ddbcfa..0132b36 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -18,7 +18,7 @@ src_librdmacm_la_LDFLAGS = -version-info 1 -export-dynamic \
 src_librdmacm_la_DEPENDENCIES = $(srcdir)/src/librdmacm.map

 bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy examples/mckey \
-	examples/rdma_client examples/rdma_server
+	examples/rdma_client examples/rdma_server examples/mcraw
 examples_ucmatose_SOURCES = examples/cmatose.c
 examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la
 examples_rping_SOURCES = examples/rping.c
@@ -31,6 +31,8 @@ examples_rdma_client_SOURCES = examples/rdma_client.c
 examples_rdma_client_LDADD = $(top_builddir)/src/librdmacm.la
 examples_rdma_server_SOURCES = examples/rdma_server.c
 examples_rdma_server_LDADD = $(top_builddir)/src/librdmacm.la
+examples_mcraw_SOURCES = examples/mcraw.c
+examples_mcraw_LDADD = $(top_builddir)/src/librdmacm.la

 librdmacmincludedir = $(includedir)/rdma
 infinibandincludedir = $(includedir)/infiniband
@@ -77,7 +79,8 @@ man_MANS = \
 	man/udaddy.1 \
 	man/mckey.1 \
 	man/rping.1 \
-	man/rdma_cm.7
+	man/rdma_cm.7 \
+	man/mcraw.1

 EXTRA_DIST = include/rdma/rdma_cma_abi.h include/rdma/rdma_cma.h \
 	include/infiniband/ib.h include/rdma/rdma_verbs.h \
diff --git a/examples/mcraw.c b/examples/mcraw.c
new file mode 100644
index 000..864c20d
--- /dev/null
+++ b/examples/mcraw.c
@@ -0,0 +1,897 @@
+/*
+ * Copyright (c) 2010 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include + +#define IB_SEND_IP_CSUM0x10 +#define IMA_VLAN_FLAG 0x20 + +#define VLAN_PRIORITY 0x0 + +#define UDP_HEADER_SIZE(sizeof(struct udphdr)) + +#define HEADER_LEN14 + 28 + +struct cmatest_node { + int id; + struct rdma_cm_id *cma_id; + int connected; + struct ibv_pd *pd; + struct ibv_cq *scq; + struct ibv_cq *rcq; + struct ibv_mr *mr; + struct ibv_ah *ah; + uint32_tremote_qpn; + uint32_tremote_qkey; + uint8_t *mem; + struct ibv_comp_channel *channel; +}; + +struct cmatest { + struct rdma_event_channel *channel; + struct cmatest_node *nodes; + int conn_index; + int connects_left; + + struct sockaddr_in6 dst_in; + struct sockaddr *dst_addr; + struct sockaddr_in6 src_in; + struct sockaddr *src_addr; + int fd[1024]; +}; + +static struct cmatest test; +static int connections = 1; +static int message_size = 100; +static int message_count = 10; +static int is_sender; +static int unmapped_addr; +static char *dst_addr; +static char *src_addr; +static enum rdma_port_space port_space = RDMA_PS_UDP; + +int vlan_flag; +int vlan_ident; + +static int cq_len = 512; +static int qp_len = 256; + +uint16_t IP_CRC(void *buf, int hdr_len) +{ + unsigned long sum = 0; + const uint16_t *ip1; + + ip1
RE: [PATCH v2] libibverbs: add path record definitions to sa.h
Hello Steve,

I want to add a change preventing creation of the L2 RAW_QPT by unprivileged users (only uid = 0 will be able to perform such an operation). What is the best place to make such a change: ibv_create_qp in libibverbs (verbs.c), or letting NIC vendors decide whether they want to expose such an API to users or to root? In that case the change would be needed only in the libnes library.

Regards, Mirek

-Original Message-
From: Steve Wise [mailto:sw...@opengridcomputing.com]
Sent: Wednesday, May 19, 2010 6:00 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; Hefty, Sean; linux-rdma
Subject: Re: [PATCH v2] libibverbs: add path record definitions to sa.h

Walukiewicz, Miroslaw wrote:
> Hello Steve,
>
> Do you plan some changes in the core code related to RAW_QPT?
>

The only changes I see needed to the kernel core is the mcast change you already proposed, to allow mcast attach/detach for RAW_ETY qps...

> Could you explain to me better what "privileged interface" means to you?
>

I just mean that allocating these raw qps should only be allowed by effective UID 0. This is analogous to PF_PACKET sockets, which are privileged as well.

Steve.
-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2] libibverbs: add path record definitions to sa.h
Hello Steve,

Do you plan some changes in the core code related to RAW_QPT?

Could you explain to me what a "privileged interface" means to you?

Regards,
Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Steve Wise
Sent: Tuesday, May 18, 2010 4:04 PM
To: Roland Dreier
Cc: Hefty, Sean; linux-rdma
Subject: Re: [PATCH v2] libibverbs: add path record definitions to sa.h

Roland Dreier wrote:
> > Can you add the RAW_ETY qp type in this release as well?
>
> To be honest I haven't looked at the iWARP datagram stuff at all. I'm
> not sure overloading the RAW_ETY QP type is necessarily the right thing
> to do -- it has quite different (never implemented) semantics in the IB
> case. Is there any overview of what you guys are planning as far as
> how work requests are created for such QPs?
>

The RAW_ETY qp would be just that: a kernel-bypass/user-mode qp that allows sending/receiving Ethernet packets. It would also provide a way for user applications to join/leave Ethernet mcast groups (which requires an rdma core kernel change that Intel posted too). What the iWARP vendors are doing on top of that is implementing some form of UDP in user mode. The main goal here is to provide an ultra-low-latency UDP multicast and unicast channel for important market segments that desire this paradigm. Also, due to the nature of this (send/recv raw eth frames), the interface would be privileged.

If you want to wait, then later I'll post patches on how this is being done for cxgb4. But I thought adding the RAW_ETY type was definitely a common requirement for Intel and Chelsio.

Steve.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
Steve Wise wrote:
> Is this all just optimizing mcast packets?

The RAW ETH QP API could be used to accelerate sending and receiving any L2 packets; it depends on the application and HW setup. We use it for accelerating multicast traffic.

Steve Wise wrote:
> Does this raw qp service share the mac address with the ports being used by the host stack? Or does each raw qp get its own mac address?

We use the MAC address of the port as the source MAC. The destination MAC is derived from the multicast group. In theory it is possible to use another MAC for unicast traffic acceleration, but that is much more complex because of generating correct ARP responses and the HW's ability to steer unicast packets to the correct QPs.

Steve Wise wrote:
> Do you all have a user mode UDP/IP running on this raw qp?

Yes, we use a modified mckey for most tests.

Steve Wise wrote:
> If so, does it use its own IP address separate from the host stack or does it share the host's IP address?

Our test application shares the IP address of the host interface as its source IP address.

Regards,
Mirek

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Steve Wise
Sent: Wednesday, May 05, 2010 4:56 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type

> I see here some misunderstanding. Let me explain better how our transmit path
> works.
>
> In our implementation we use the normal memory registration path using ibv_reg_mr,
> and we use ibv_post_send() with lkey/vaddr/len.
>
> The implementation of ibv_post_send (nes_post_send in libnes) for the RAW QP
> passes the lkey/virtual_addr/len information to our device driver (ud_post_send)
> through a shared page. There is no data copy here, and the driver is
> used only for fast synchronization.
> Because our RAW ETH QP must use physical addresses only, ud_post_send() in
> the kernel makes a virtual-to-physical memory translation and accesses the QP HW
> for packet transmission. The packet buffer memory was previously registered and
> pinned by ibv_reg_mr, which provides the information needed for making such a
> translation.
>

I see. Thanks!

> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is
> shared with all other user-kernel communication and it is quite complex. It
> is a perfect path for QP/CQ/PD/mem management, but for me it is too complex
> for traffic acceleration.
>
> The user<->kernel path through an additional driver, a shared page for
> lkey/vaddr/len passing, and SW memory translation in the kernel is much more
> effective.
>
> Maybe it is a good idea to make that API more official after some kind of
> standardization. Our tests proved that it works: we achieved twice the
> performance and much better latency. That could open the door to adding some
> non-RDMA devices to the set of devices supported by the OFED API.
>

Sounds good. Do you have specific perf numbers to share? Is this all just optimizing mcast packets?

Also:

Does this raw qp service share the mac address with the ports being used by the host stack? Or does each raw qp get its own mac address?

Do you all have a user mode UDP/IP running on this raw qp? If so, does it use its own IP address separate from the host stack or does it share the host's IP address?

Thanks,

Steve.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
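Mirek notes above that the destination MAC is derived from the multicast group. For IPv4 the standard derivation (RFC 1112) maps the low 23 bits of the group address into the 01:00:5e Ethernet OUI; a small sketch of that mapping:

```c
#include <stdint.h>

/* Map an IPv4 multicast group address (host byte order) to its
 * Ethernet multicast MAC per RFC 1112: the fixed 01:00:5e prefix
 * followed by the low 23 bits of the group address.  Because 5 bits
 * of the group are discarded, 32 IPv4 groups share each MAC. */
void mcast_ip_to_mac(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of the low 24 dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

For example, 224.1.1.1 maps to 01:00:5e:01:01:01, and 239.129.1.1 maps to the same MAC, which is why L2 MAC filters enabled via ibv_attach_mcast can pass traffic for several overlapping groups.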
RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
Steve,

> ud_post_send and friends implement the transmit path for IMA. Our RAW ETH QP
> needs access to physical addresses from user space. For security reasons
> we have to make the virtual-to-physical address translation in the kernel.
>

Steve Wise wrote:
> But why couldn't you just use the normal memory registration paths? IE the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs in the ibv_post_send() which could do kernel bypass.

I see here some misunderstanding. Let me explain better how our transmit path works.

In our implementation we use the normal memory registration path using ibv_reg_mr, and we use ibv_post_send() with lkey/vaddr/len.

The implementation of ibv_post_send (nes_post_send in libnes) for the RAW QP passes the lkey/virtual_addr/len information to our device driver (ud_post_send) through a shared page. There is no data copy here, and the driver is used only for fast synchronization.

Because our RAW ETH QP must use physical addresses only, ud_post_send() in the kernel makes a virtual-to-physical memory translation and accesses the QP HW for packet transmission. The packet buffer memory was previously registered and pinned by ibv_reg_mr, which provides the information needed for making such a translation.

Steve Wise wrote:
> Seems like maybe you could fix the non-bypass post_send/recv paths instead of implementing an entirely new user<->kernel interface...

The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is shared with all other user-kernel communication and it is quite complex. It is a perfect path for QP/CQ/PD/mem management, but for me it is too complex for traffic acceleration.

The user<->kernel path through an additional driver, a shared page for lkey/vaddr/len passing, and SW memory translation in the kernel is much more effective.

Maybe it is a good idea to make that API more official after some kind of standardization. Our tests proved that it works: we achieved twice the performance and much better latency.
That could open the door to adding some non-RDMA devices to the set of devices supported by the OFED API.

Regards,
Mirek

-----Original Message-----
From: Steve Wise [mailto:sw...@opengridcomputing.com]
Sent: Tuesday, May 04, 2010 8:15 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type

Walukiewicz, Miroslaw wrote:
> Hello Steve,
>
> Our HW QP is not a UD-type QP but an L2 raw QP. In the verbs API there is an assumption
> that the user provides only the data payload for TX and similarly receives only the
> payload. The protocol headers (in the UD case, MAC/IP/UDP) are attached by HW.
>
> Our QP implementation in HW does not provide the possibility of attaching
> headers by HW for UD traffic, so for multicast acceleration we chose the L2 raw
> path. It adds some overhead for the user application, but it is still a zero-copy
> approach.
>
> I thought about simulating the UD path using an L2 raw QP to get the same
> result as for a true UD QP (the user handles the payload only). Such an approach costs
> an additional copy of the payload in SW, because the headers must be placed first and
> then the payload in a single tx buffer. The situation is similar for rx: the payload
> must be copied into the posted buffers, or the data delivered at some offset.
>
> ud_post_send and friends implement the transmit path for IMA. Our RAW ETH QP
> needs access to physical addresses from user space. For security reasons
> we have to make the virtual-to-physical address translation in the kernel.
>

But why couldn't you just use the normal memory registration paths? IE the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs in the ibv_post_send() which could do kernel bypass.

> Unfortunately the OFED path for ibv_post_send diving into the kernel is quite slow
> due to a number of dynamic memory allocations in the path. We chose to
> create our own private post_send channel to increase tx bandwidth, using
> ud_post_send and friends.
Seems like maybe you could fix the non-bypass post_send/recv paths instead of implementing an entirely new user<->kernel interface...

Steve.

> Regards,
>
> Mirek
>
> -----Original Message-----
> From: Steve Wise [mailto:sw...@opengridcomputing.com]
> Sent: Tuesday, May 04, 2010 7:19 PM
> To: Walukiewicz, Miroslaw
> Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast
> acceleration over IB_QPT_RAW_ETY QP type
>
> Hey Mirek,
>
> It looks like this patch adds a new file interface for a UD service.
> Why didn't you extend the existing UD interface as needed?
>
> What IO is supported with these changes? IMA via the raw QP, but what
> are ud_post_send() and friends used for?
>
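The thread describes a shared page carrying lkey/vaddr/len per posted buffer. The struct below is a hypothetical illustration of such a fixed-layout descriptor (the name, field order, and packing are assumptions; the actual nes/IMA layout is not shown in the thread):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-SGE descriptor a user-space library could write
 * into a page shared with the driver: the registered lkey, the
 * buffer length, and the user virtual address that the kernel side
 * translates to a physical address before handing the buffer to HW. */
struct raw_qp_sge_desc {
	uint32_t lkey;
	uint32_t len;
	uint64_t vaddr;
};
```

A fixed, naturally aligned layout like this is what makes the fast path cheap: the kernel can read the descriptor straight out of the shared page with no parsing and no per-post allocations.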
RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
Hello Steve,

Our HW QP is not a UD-type QP but an L2 raw QP. In the verbs API there is an assumption that the user provides only the data payload for TX and similarly receives only the payload. The protocol headers (in the UD case, MAC/IP/UDP) are attached by HW.

Our QP implementation in HW does not provide the possibility of attaching headers by HW for UD traffic, so for multicast acceleration we chose the L2 raw path. It adds some overhead for the user application, but it is still a zero-copy approach.

I thought about simulating the UD path using an L2 raw QP to get the same result as for a true UD QP (the user handles the payload only). Such an approach costs an additional copy of the payload in SW, because the headers must be placed first and then the payload in a single tx buffer. The situation is similar for rx: the payload must be copied into the posted buffers, or the data delivered at some offset.

ud_post_send and friends implement the transmit path for IMA. Our RAW ETH QP needs access to physical addresses from user space. For security reasons we have to make the virtual-to-physical address translation in the kernel.

Unfortunately the OFED path for ibv_post_send diving into the kernel is quite slow due to a number of dynamic memory allocations in the path. We chose to create our own private post_send channel to increase tx bandwidth, using ud_post_send and friends.

Regards,
Mirek

-----Original Message-----
From: Steve Wise [mailto:sw...@opengridcomputing.com]
Sent: Tuesday, May 04, 2010 7:19 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type

Hey Mirek,

It looks like this patch adds a new file interface for a UD service. Why didn't you extend the existing UD interface as needed?

What IO is supported with these changes? IMA via the raw QP, but what are ud_post_send() and friends used for?

Steve.
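The copy cost described above comes from prepending the 14-byte Ethernet, 20-byte IPv4, and 8-byte UDP headers in software (14 + 28 = 42 bytes in total). A hedged sketch of that framing step; the function is illustrative, not the driver's code:

```c
#include <stdint.h>
#include <string.h>

/* Header bytes a user application must supply itself when UDP is
 * carried over an L2 raw QP: Ethernet + IPv4 + UDP = 42 bytes. */
#define ETH_HDR_LEN 14
#define IP_HDR_LEN  20	/* IPv4 without options */
#define UDP_HDR_LEN 8
#define L2_UDP_OVERHEAD (ETH_HDR_LEN + IP_HDR_LEN + UDP_HDR_LEN)

/* The extra SW copy described in the thread: place headers first,
 * then the payload, in a single contiguous tx buffer.  Returns the
 * total frame length.  Header fields are left zeroed here; a real
 * sender fills in MACs, IP addresses, ports, and checksums. */
size_t frame_udp_payload(uint8_t *txbuf, const void *payload, size_t len)
{
	memset(txbuf, 0, L2_UDP_OVERHEAD);
	memcpy(txbuf + L2_UDP_OVERHEAD, payload, len);
	return L2_UDP_OVERHEAD + len;
}
```

The zero-copy alternative Mirek alludes to is to build the 42 header bytes in place ahead of a pre-registered payload buffer, so only headers are written and the payload is never moved.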
miroslaw.walukiew...@intel.com wrote:
> This patch implements iWARP multicast acceleration (IMA)
> over the IB_QPT_RAW_ETY QP type in the nes driver.
>
> The application creates a raw eth QP (IBV_QPT_RAW_ETH in user space) and
> manages multicast membership via ibv_attach_mcast and ibv_detach_mcast calls.
>
> Calling ibv_attach_mcast/ibv_detach_mcast has the effect of
> enabling/disabling L2 MAC address filters in HW.
>
> Signed-off-by: Mirek Walukiewicz
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
How to use IB_QPT_RAW_ETY QP from user space.
Hello,

I am looking for the equivalent of the IB_QPT_RAW_ETY definition from ib_verbs.h in libibverbs/include/infiniband/verbs.h.

ib_verbs.h:

enum ib_qp_type {
	/*
	 * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries
	 * here (and in that order) since the MAD layer uses them as
	 * indices into a 2-entry table.
	 */
	IB_QPT_SMI,
	IB_QPT_GSI,
	IB_QPT_RC,
	IB_QPT_UC,
	IB_QPT_UD,
	IB_QPT_RAW_IPV6,
	IB_QPT_RAW_ETY
};

In verbs.h I see only:

enum ibv_qp_type {
	IBV_QPT_RC = 2,
	IBV_QPT_UC,
	IBV_QPT_UD,
	IBV_QPT_XRC
};

Is something missing in libibverbs' verbs.h?

Regards,
Mirek
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
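One reason a user-space raw type cannot simply mirror the kernel numbering: the kernel enum gives IB_QPT_RAW_IPV6 = 5 and IB_QPT_RAW_ETY = 6, while libibverbs already assigned IBV_QPT_XRC = 5. The sketch below illustrates the numbering with hypothetical names (these are not actual libibverbs definitions):

```c
/* Hypothetical extension of the user-space enum.  RC stays pinned
 * at 2 (the kernel reserves 0 and 1 for the MAD-only SMI/GSI types),
 * XRC keeps its existing slot at 5, and a raw Ethernet type would
 * take the next free value. */
enum my_ibv_qp_type {
	MY_IBV_QPT_RC = 2,
	MY_IBV_QPT_UC,		/* 3 */
	MY_IBV_QPT_UD,		/* 4 */
	MY_IBV_QPT_XRC,		/* 5: already taken, unlike kernel RAW_IPV6 */
	MY_IBV_QPT_RAW_ETY	/* 6: happens to match the kernel value */
};
```

Because the two enums diverge at value 5, any provider translating user-space QP types to kernel ones needs an explicit mapping rather than a cast.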