Re: [opensm] RFC: new routing options
Hi Al,

This looks really great! One question: have you tried benchmarking the BW
with up/down routing using the guid_routing_order_file option w/o your new
features?

-- YK

On 08-Oct-10 7:40 PM, Albert Chu wrote:

Hey Sasha,

We recently got a new cluster and I've been experimenting with some routing
changes to improve the average bandwidth of the cluster. They are attached
as patches with a description of the routing goals below. We're using
mpiGraph (http://sourceforge.net/projects/mpigraph/) to measure min, peak,
and average send/recv bandwidth across the cluster. What we found with the
original updn routing was an average of around 420 MB/s send bandwidth and
508 MB/s recv bandwidth. The following two patches were able to get the
average send bandwidth up to 1045 MB/s and recv bandwidth up to 1228 MB/s.

I'm sure this is only round 1 of the patches and I'm looking for comments.
Many areas could be cleaned up w/ some rearchitecture or struct changes,
but I simply implemented the most non-invasive version first. I'm also open
to name changes on the options. BTW, b/c of the old management tree on the
git server, the following patches were developed on an internal LLNL tree.
I'll rebase after the up2date tree is on the openfabrics server.

1) Port Shifting

This is similar to what was done with some of the LMC > 0 code. Congestion
would occur due to alignment of routes w/ common traffic patterns. However,
we found that it was also necessary for LMC=0 and only for used ports. For
example, let's say there are 4 ports (called A, B, C, D) and we are routing
lids 1-9 through them. Suppose only routing through A, B, and C will reach
lids 1-9. The LFT would normally be:

A: 1 4 7
B: 2 5 8
C: 3 6 9
D:

Port Shifting would make this:

A: 1 6 8
B: 2 4 9
C: 3 5 7
D:

This option by itself improved the mpiGraph average send/recv bandwidth
from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
2) Remote Guid Sorting

Most core/spine switches we've seen have had line boards connected to spine
boards in a consistent pattern. However, we recently got some Qlogic
switches that connect from line/leaf boards to spine boards in a (to the
casual observer) random pattern. I'm sure there was a good electrical/board
reason for this design, but it does hurt routing b/c some of the opensm
routing algorithms didn't account for this assumption. Here's an output
from iblinkinfo as an example.

Switch 0x00066a00ec0029b8 ibcore1 L123:
180  1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 254 19[ ] ibsw55 ( )
180  2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 253 19[ ] ibsw56 ( )
180  3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 258 19[ ] ibsw57 ( )
180  4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 257 19[ ] ibsw58 ( )
180  5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 256 19[ ] ibsw59 ( )
180  6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 255 19[ ] ibsw60 ( )
180  7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 261 19[ ] ibsw61 ( )
180  8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 262 19[ ] ibsw62 ( )
180  9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 260 19[ ] ibsw63 ( )
180 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 259 19[ ] ibsw64 ( )
180 11[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 284 19[ ] ibsw65 ( )
180 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 285 19[ ] ibsw66 ( )
180 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 2227 19[ ] ibsw67 ( )
180 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 283 19[ ] ibsw68 ( )
180 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 267 19[ ] ibsw69 ( )
180 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 270 19[ ] ibsw70 ( )
180 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 269 19[ ] ibsw71 ( )
180 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 268 19[ ] ibsw72 ( )
180 19[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 222 17[ ] ibcore1 S117B ( )
180 20[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 209 19[ ] ibcore1 S211B ( )
180 21[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 218 21[ ] ibcore1 S117A ( )
180 22[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 192 23[ ] ibcore1 S215B ( )
180 23[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 85 15[ ] ibcore1 S209A ( )
180 24[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 182 13[ ] ibcore1 S215A ( )
180 25[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 200 11[ ] ibcore1 S115B ( )
180 26[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 129 25[ ] ibcore1 S209B ( )
180 27[ ] ==( 4X 10.0 Gbps
Re: [PATCH] mlx4: Limit num of fast reg WRs
On Tue, Oct 12, 2010 at 12:13:26AM +0200, Or Gerlitz wrote:

Guys, can you clarify if the hardware limitation is 511 entries, or is it
(PAGE_SIZE / sizeof(pointer)) - 1, which is 4096 / 8 - 1 = 511 but can
change if the page size gets bigger or smaller?

The limit is 511 entries. After I posted this patch, I was told that there
is yet another constraint on the page list: the buffer containing the list
must not cross a page boundary. So I was thinking about the best way to
deal with this. One way is to always allocate a whole page and map it using
dma_map_page(page, DMA_TO_DEVICE), something like this (not a complete
patch, just the idea).

diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 83e3cc7..e9b2c8a 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -237,18 +237,23 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device
 	if (!mfrpl->ibfrpl.page_list)
 		goto err_free;

-	mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
-						     size, &mfrpl->map,
-						     GFP_KERNEL);
+	mfrpl->mapped_page_list = (__be64 *)__get_free_page(GFP_KERNEL);
 	if (!mfrpl->mapped_page_list)
 		goto err_free;

-	WARN_ON(mfrpl->map & 0x3f);
+	mfrpl->map = dma_map_single(ibdev->dma_device, mfrpl->mapped_page_list,
+				    PAGE_SIZE, DMA_TO_DEVICE);
+	if (dma_mapping_error(ibdev->dma_device, mfrpl->map))
+		goto err_page;
+
+	return &mfrpl->ibfrpl;
+err_page:
+	free_page((unsigned long) mfrpl->mapped_page_list);
+
 err_free:
-	kfree(mfrpl->ibfrpl.page_list);
 	kfree(mfrpl);
 	return ERR_PTR(-ENOMEM);
 }

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the
body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch v3] infiniband: uverbs: handle large number of entries
In the original code there was a potential integer overflow if you passed
in a large cmd.ne. The calls to kmalloc() would allocate smaller buffers
than intended, leading to memory corruption. There was also an information
leak. Documentation/infiniband/user_verbs.txt suggests this function is
meant for unprivileged access.

Jason Gunthorpe suggested that I should modify it to pass the data to the
user bit by bit and avoid the kmalloc() entirely.

CC: sta...@kernel.org
Signed-off-by: Dan Carpenter <erro...@gmail.com>
---
Please, please, check this. I think I've done it right, but I don't have
the hardware and can not test it. It's strange to me that we return in_len
on success. struct ib_uverbs_poll_cq_resp is used by userspace libraries,
right? Otherwise I could delete it.

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6fcfbeb..b0788b6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -891,68 +891,89 @@ out:
 	return ret ? ret : in_len;
 }

+static int copy_header_to_user(void __user *dest, u32 count)
+{
+	u32 header[2]; /* the second u32 is reserved */
+
+	memset(header, 0, sizeof(header));
+	header[0] = count;
+	if (copy_to_user(dest, header, sizeof(header)))
+		return -EFAULT;
+	return 0;
+}
+
+static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
+{
+	struct ib_uverbs_wc tmp;
+
+	memset(&tmp, 0, sizeof(tmp));
+
+	tmp.wr_id		= wc->wr_id;
+	tmp.status		= wc->status;
+	tmp.opcode		= wc->opcode;
+	tmp.vendor_err		= wc->vendor_err;
+	tmp.byte_len		= wc->byte_len;
+	tmp.ex.imm_data		= (__u32 __force) wc->ex.imm_data;
+	tmp.qp_num		= wc->qp->qp_num;
+	tmp.src_qp		= wc->src_qp;
+	tmp.wc_flags		= wc->wc_flags;
+	tmp.pkey_index		= wc->pkey_index;
+	tmp.slid		= wc->slid;
+	tmp.sl			= wc->sl;
+	tmp.dlid_path_bits	= wc->dlid_path_bits;
+	tmp.port_num		= wc->port_num;
+
+	if (copy_to_user(dest, &tmp, sizeof(tmp)))
+		return -EFAULT;
+	return 0;
+}
+
 ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 			  const char __user *buf,
 			  int in_len, int out_len)
 {
 	struct ib_uverbs_poll_cq       cmd;
-	struct ib_uverbs_poll_cq_resp *resp;
+	u8 __user                     *header_ptr;
+	u8 __user                     *data_ptr;
 	struct ib_cq                  *cq;
-	struct ib_wc                  *wc;
-	int                            ret = 0;
+	struct ib_wc                   wc;
+	u32                            count = 0;
+	int                            ret;
 	int                            i;
-	int                            rsize;

 	if (copy_from_user(&cmd, buf, sizeof cmd))
 		return -EFAULT;

-	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
-	if (!wc)
-		return -ENOMEM;
-
-	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-	resp = kmalloc(rsize, GFP_KERNEL);
-	if (!resp) {
-		ret = -ENOMEM;
-		goto out_wc;
-	}
-
 	cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0);
-	if (!cq) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (!cq)
+		return -EINVAL;

-	resp->count = ib_poll_cq(cq, cmd.ne, wc);
+	/* we copy a struct ib_uverbs_poll_cq_resp to user space */
+	header_ptr = (void __user *)(unsigned long)cmd.response;
+	data_ptr = header_ptr + sizeof(u32) * 2;

-	put_cq_read(cq);
+	for (i = 0; i < cmd.ne; i++) {
+		ret = ib_poll_cq(cq, 1, &wc);
+		if (ret < 0)
+			goto out_put;
+		if (!ret)
+			break;

-	for (i = 0; i < resp->count; i++) {
-		resp->wc[i].wr_id 	   = wc[i].wr_id;
-		resp->wc[i].status 	   = wc[i].status;
-		resp->wc[i].opcode 	   = wc[i].opcode;
-		resp->wc[i].vendor_err 	   = wc[i].vendor_err;
-		resp->wc[i].byte_len 	   = wc[i].byte_len;
-		resp->wc[i].ex.imm_data    = (__u32 __force) wc[i].ex.imm_data;
-		resp->wc[i].qp_num 	   = wc[i].qp->qp_num;
-		resp->wc[i].src_qp 	   = wc[i].src_qp;
-		resp->wc[i].wc_flags 	   = wc[i].wc_flags;
-		resp->wc[i].pkey_index 	   = wc[i].pkey_index;
-		resp->wc[i].slid 	   = wc[i].slid;
-		resp->wc[i].sl 		   = wc[i].sl;
-		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
-		resp->wc[i].port_num 	   = wc[i].port_num;
+		ret = copy_wc_to_user(data_ptr, &wc);
+		if (ret)
+			goto out_put;
+		data_ptr +=
Trying to link with DAT 2.0 function
My motivation for using dat_cno_fd_create() is that I am able to register a
file descriptor with a reactor (all events go through a reactor which has
multiple I/O sources, including I/O which is not at all tied to uDAPL). An
application is able to work on other tasks while waiting for the reactor to
call back on the file descriptor when an event is available.

To achieve the same behavior with dat_cno_wait(), I would have to spawn off
another thread which blocks on dat_cno_wait(), then notify the reactor (to
queue up a reactor event) when dat_cno_wait() unblocks, lock critical
sections of code, etc. If dat_cno_fd_create() were available to me, it
would seem to be a cleaner way to achieve this functionality.

-----Original Message-----
From: Davis, Arlin R [mailto:arlin.r.da...@intel.com]
Sent: Monday, October 11, 2010 5:34 PM
To: Young, Eric R.; linux-rdma@vger.kernel.org
Subject: EXTERNAL: RE: Trying to link with DAT 2.0 function

Do you have a roadmap available? Is this planned to be implemented in the
near future?

There are no plans. I really don't know how this call even made it into the
specification, given that DAT is supposed to be O/S agnostic. In any case,
can you use dat_cno_wait() on top of the EVDs as a means to support/trigger
multiple event streams? What is driving your choice to use
dat_cno_fd_create()? Maybe we can come up with an alternative with the
existing API.

Arlin
Re: [opensm] RFC: new routing options
Hey Yevgeny,

Yes, I tried that and it didn't have much of an effect. Ever since Sasha
put in his routing sorted by switch load (sort_ports_by_switch_load() in
osm_ucast_mgr.c), guid_routing_order isn't really necessary (as long as
most of the cluster is up).

Al

On Tue, 2010-10-12 at 00:59 -0700, Yevgeny Kliteynik wrote:

Hi Al,

This looks really great! One question: have you tried benchmarking the BW
with up/down routing using the guid_routing_order_file option w/o your new
features?

-- YK

On 08-Oct-10 7:40 PM, Albert Chu wrote:

Hey Sasha,

We recently got a new cluster and I've been experimenting with some routing
changes to improve the average bandwidth of the cluster. They are attached
as patches with a description of the routing goals below. We're using
mpiGraph (http://sourceforge.net/projects/mpigraph/) to measure min, peak,
and average send/recv bandwidth across the cluster. What we found with the
original updn routing was an average of around 420 MB/s send bandwidth and
508 MB/s recv bandwidth. The following two patches were able to get the
average send bandwidth up to 1045 MB/s and recv bandwidth up to 1228 MB/s.

I'm sure this is only round 1 of the patches and I'm looking for comments.
Many areas could be cleaned up w/ some rearchitecture or struct changes,
but I simply implemented the most non-invasive version first. I'm also open
to name changes on the options. BTW, b/c of the old management tree on the
git server, the following patches were developed on an internal LLNL tree.
I'll rebase after the up2date tree is on the openfabrics server.

1) Port Shifting

This is similar to what was done with some of the LMC > 0 code. Congestion
would occur due to alignment of routes w/ common traffic patterns. However,
we found that it was also necessary for LMC=0 and only for used ports. For
example, let's say there are 4 ports (called A, B, C, D) and we are routing
lids 1-9 through them. Suppose only routing through A, B, and C will reach
lids 1-9.
Work completions generated after a queue pair has made the transition to an error state
Hello,

Has anyone already tried to process the work completions generated by a HCA
after the state of a queue pair has been changed to IB_QPS_ERR? With the
hardware/firmware/driver combination I have tested I have observed the
following:

* Multiple completions with the same wr_id and nonzero (error) status were
  received by the application, while all work requests queued with the flag
  IB_SEND_SIGNALED had a unique wr_id.
* Completions with non-zero (error) status and a wr_id / opcode combination
  were received that were never queued by the application.

Note: some work requests were queued with and some without the flag
IB_SEND_SIGNALED. I'm not sure however whether that has anything to do with
the observed behavior.

This behavior is easy to reproduce. If I interpret the InfiniBand
Architecture Specification correctly, this behavior is non-compliant. Has
anyone been looking into this before?

Bart.
Re: Work completions generated after a queue pair has made the transition to an error state
Bart Van Assche <bvanass...@acm.org> wrote:

Has anyone been looking into this before?

Nope, never ever. What HCA is that?

Or.
Re: Work completions generated after a queue pair has made the transition to an error state
On Tue, Oct 12, 2010 at 8:50 PM, Ralph Campbell <ralph.campb...@qlogic.com> wrote:

On Tue, 2010-10-12 at 11:38 -0700, Bart Van Assche wrote:

Hello,

Has anyone already tried to process the work completions generated by a HCA
after the state of a queue pair has been changed to IB_QPS_ERR? With the
hardware/firmware/driver combination I have tested I have observed the
following:

* Multiple completions with the same wr_id and nonzero (error) status were
  received by the application, while all work requests queued with the flag
  IB_SEND_SIGNALED had a unique wr_id.
* Completions with non-zero (error) status and a wr_id / opcode combination
  were received that were never queued by the application.

Note: some work requests were queued with and some without the flag
IB_SEND_SIGNALED. I'm not sure however whether that has anything to do with
the observed behavior.

This behavior is easy to reproduce. If I interpret the InfiniBand
Architecture Specification correctly, this behavior is non-compliant. Has
anyone been looking into this before?

I haven't seen it. It isn't supposed to happen. What hardware and software
are you using and how do you reproduce it?

Hello Ralph and Or,

The way I reproduce that behavior is by modifying the state of a queue pair
into IB_QPS_ERR while RDMA is ongoing. The application, which is
multithreaded, performs RDMA by calling ib_post_recv() and ib_post_send()
(opcodes IB_WR_SEND, IB_WR_RDMA_READ and IB_WR_RDMA_WRITE). This has been
observed with the mlx4 driver, a ConnectX HCA and firmware version 2.7.0.

Bart.
[RFC 0/2] IB/umad: Export mad snooping to userspace
The kernel mad interface allows a client to view all sent and received
MADs. This has proven to be a useful debugging technique when paired with
the external kernel module, madeye. However, madeye was never intended to
be submitted upstream.

A couple of alternatives have been proposed for making this functionality
available in the upstream kernel: using trace events or exporting the
snooping interface to user space. This patch series takes the latter
approach.

In addition to snooping MADs simply for debugging purposes, applications
can be constructed to examine and act on MAD traffic. For example, a daemon
could snoop SA queries and CM messages as part of providing a path record
caching service. It could cache snooped path records and use CM timeouts as
an indication that cached data may be stale. Because such services may
become crucial to support large clusters, the desire is to add mad snooping
capabilities to the stack directly, rather than using a debug interface.

These patches compile, but have not been tested. If this approach is
acceptable, I will modify libibumad to work with the proposed changes. I
will also create a userspace version of madeye as a new ib-diag. Finally,
the IB ACM will eventually be updated to monitor CM response timeouts.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
[RFC 1/2] IB/mad: Simplify snooping interface
In preparation for exporting the kernel mad snooping capability to user
space, remove all code originally inserted as place holders and simplify
the mad snooping interface.

For performance reasons, we want to filter which mads are reported to
clients of the snooping interface at the lowest level, but we also don't
want to perform complex filtering at that level. As a trade-off, we allow
filtering based on mgmt_class, attr_id, and mad request status. The
reasoning behind these choices is to allow a user to filter traffic to a
specific service (the SA or CM), for a well known purpose (path record
queries or multicast joins), or view only operations that have failed.
Filtering based on mgmt_class and attr_id was used by the external madeye
debug module, so we have some precedence that filtering at that level is
usable.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 drivers/infiniband/core/mad.c      |   86 ++--
 drivers/infiniband/core/mad_priv.h |    2 -
 include/rdma/ib_mad.h              |   51 ++---
 3 files changed, 68 insertions(+), 71 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index ef1304f..b90f7f0 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -381,22 +381,6 @@ error1:
 }
 EXPORT_SYMBOL(ib_register_mad_agent);

-static inline int is_snooping_sends(int mad_snoop_flags)
-{
-	return (mad_snoop_flags &
-		(/*IB_MAD_SNOOP_POSTED_SENDS |
-		   IB_MAD_SNOOP_RMPP_SENDS |*/
-		 IB_MAD_SNOOP_SEND_COMPLETIONS /*|
-		 IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/));
-}
-
-static inline int is_snooping_recvs(int mad_snoop_flags)
-{
-	return (mad_snoop_flags &
-		(IB_MAD_SNOOP_RECVS /*|
-		 IB_MAD_SNOOP_RMPP_RECVS*/));
-}
-
 static int register_snoop_agent(struct ib_mad_qp_info *qp_info,
 				struct ib_mad_snoop_private *mad_snoop_priv)
 {
@@ -434,8 +418,8 @@ out:
 struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
 					   u8 port_num,
 					   enum ib_qp_type qp_type,
-					   int mad_snoop_flags,
-					   ib_mad_snoop_handler snoop_handler,
+					   struct ib_mad_snoop_reg_req *snoop_reg_req,
+					   ib_mad_send_handler send_handler,
 					   ib_mad_recv_handler recv_handler,
 					   void *context)
 {
@@ -444,12 +428,6 @@ struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
 	struct ib_mad_snoop_private *mad_snoop_priv;
 	int qpn;

-	/* Validate parameters */
-	if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) ||
-	    (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) {
-		ret = ERR_PTR(-EINVAL);
-		goto error1;
-	}
 	qpn = get_spl_qp_index(qp_type);
 	if (qpn == -1) {
 		ret = ERR_PTR(-EINVAL);
@@ -471,11 +449,11 @@ struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
 	mad_snoop_priv->qp_info = &port_priv->qp_info[qpn];
 	mad_snoop_priv->agent.device = device;
 	mad_snoop_priv->agent.recv_handler = recv_handler;
-	mad_snoop_priv->agent.snoop_handler = snoop_handler;
+	mad_snoop_priv->agent.send_handler = send_handler;
+	mad_snoop_priv->reg_req = *snoop_reg_req;
 	mad_snoop_priv->agent.context = context;
 	mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp;
 	mad_snoop_priv->agent.port_num = port_num;
-	mad_snoop_priv->mad_snoop_flags = mad_snoop_flags;
 	init_completion(&mad_snoop_priv->comp);
 	mad_snoop_priv->snoop_index = register_snoop_agent(
 						&port_priv->qp_info[qpn],
@@ -592,10 +570,35 @@ static void dequeue_mad(struct ib_mad_list_head *mad_list)
 	spin_unlock_irqrestore(&mad_queue->lock, flags);
 }

+static int snoop_check_filter(struct ib_mad_snoop_private *mad_snoop_priv,
+			      struct ib_mad_hdr *mad_hdr, enum ib_wc_status status)
+{
+	struct ib_mad_snoop_reg_req *reg = &mad_snoop_priv->reg_req;
+
+	if (reg->errors && !mad_hdr->status &&
+	    (status == IB_WC_SUCCESS || status == IB_WC_WR_FLUSH_ERR))
+		return 0;
+
+	if (reg->mgmt_class) {
+		if (reg->mgmt_class != mad_hdr->mgmt_class)
+			return 0;
+
+		if (reg->attr_id && reg->attr_id != mad_hdr->attr_id)
+			return 0;
+
+		if (reg->mgmt_class_version &&
+		    reg->mgmt_class_version != mad_hdr->class_version)
+			return 0;
+
+		if (is_vendor_class(reg->mgmt_class) && is_vendor_oui(reg->oui)
[RFC 2/2] IB/umad: Export mad snooping capability to userspace
Export the mad snooping capability to user space clients through the
existing umad interface. This will allow users to capture MAD data for
debugging, plus it allows for services to act on MAD traffic that occurs.
For example, a daemon could snoop SA queries and CM messages as part of
providing a path record caching service. (It could cache snooped path
records, record the average time needed for the SA to respond to queries,
use CM timeouts as an indication that cached data may be stale, etc.)
Because such services may become crucial to support large clusters, mad
snooping capabilities are not limited to a debugging interface.

Backwards compatibility is maintained by using the upper bit of the QPN to
indicate if a user is registering to send/receive MADs or only wishes to
snoop traffic.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 drivers/infiniband/core/user_mad.c |  134 ++--
 include/rdma/ib_user_mad.h         |   33 -
 2 files changed, 143 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 5fa8569..e666038 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -252,6 +252,80 @@ err1:
 	ib_free_recv_mad(mad_recv_wc);
 }

+static void snoop_send_handler(struct ib_mad_agent *agent,
+			       struct ib_mad_send_wc *send_wc)
+{
+	struct ib_umad_file *file = agent->context;
+	struct ib_umad_packet *packet;
+	struct ib_mad_send_buf *msg = send_wc->send_buf;
+	struct ib_rmpp_mad *rmpp_mad;
+	int data_len;
+	u32 seg_num;
+
+	data_len = msg->seg_count ? msg->seg_size : msg->data_len;
+	packet = kzalloc(sizeof *packet + msg->hdr_len + data_len, GFP_KERNEL);
+	if (!packet)
+		return;
+
+	packet->length = msg->hdr_len + data_len;
+	packet->mad.hdr.status = send_wc->status;
+	packet->mad.hdr.timeout_ms = msg->timeout_ms;
+	packet->mad.hdr.retries = msg->retries;
+	packet->mad.hdr.length = hdr_size(file) + packet->length;
+
+	if (msg->seg_count) {
+		rmpp_mad = msg->mad;
+		seg_num = be32_to_cpu(rmpp_mad->rmpp_hdr.seg_num);
+		memcpy(packet->mad.data, msg->mad, msg->hdr_len);
+		memcpy(((u8 *) packet->mad.data) + msg->hdr_len,
+		       ib_get_rmpp_segment(msg, seg_num), data_len);
+	} else {
+		memcpy(packet->mad.data, msg->mad, packet->length);
+	}
+
+	if (queue_packet(file, agent, packet))
+		kfree(packet);
+}
+
+static void snoop_recv_handler(struct ib_mad_agent *agent,
+			       struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_umad_file *file = agent->context;
+	struct ib_umad_packet *packet;
+	struct ib_mad_recv_buf *recv_buf = &mad_recv_wc->recv_buf;
+
+	packet = kzalloc(sizeof *packet + sizeof *recv_buf->mad, GFP_KERNEL);
+	if (!packet)
+		return;
+
+	packet->length = sizeof *recv_buf->mad;
+	packet->mad.hdr.length = hdr_size(file) + packet->length;
+	packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp);
+	packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid);
+	packet->mad.hdr.sl = mad_recv_wc->wc->sl;
+	packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits;
+	packet->mad.hdr.pkey_index = mad_recv_wc->wc->pkey_index;
+	packet->mad.hdr.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH);
+	if (packet->mad.hdr.grh_present) {
+		struct ib_ah_attr ah_attr;
+
+		ib_init_ah_from_wc(agent->device, agent->port_num,
+				   mad_recv_wc->wc, &mad_recv_wc->recv_buf.grh,
+				   &ah_attr);
+
+		packet->mad.hdr.gid_index = ah_attr.grh.sgid_index;
+		packet->mad.hdr.hop_limit = ah_attr.grh.hop_limit;
+		packet->mad.hdr.traffic_class = ah_attr.grh.traffic_class;
+		memcpy(packet->mad.hdr.gid, &ah_attr.grh.dgid, 16);
+		packet->mad.hdr.flow_label = cpu_to_be32(ah_attr.grh.flow_label);
+	}
+
+	memcpy(packet->mad.data, recv_buf->mad, packet->length);
+
+	if (queue_packet(file, agent, packet))
+		kfree(packet);
+}
+
 static ssize_t copy_recv_mad(struct ib_umad_file *file, char __user *buf,
 			     struct ib_umad_packet *packet, size_t count)
 {
@@ -603,8 +677,9 @@ static int ib_umad_reg_agent(struct ib_umad_file *file, void __user *arg,
 {
 	struct ib_user_mad_reg_req ureq;
 	struct ib_mad_reg_req req;
+	struct ib_mad_snoop_reg_req snoop_req;
 	struct ib_mad_agent *agent = NULL;
-	int agent_id;
+	int agent_id, snoop;
 	int ret;

 	mutex_lock(&file->port->file_mutex);
@@ -620,6 +695,8 @@ static int ib_umad_reg_agent(struct ib_umad_file *file, void __user *arg,
 		goto out;
 	}

+	snoop = ureq.qpn
Re: Work completions generated after a queue pair has made the transition to an error state
On Tue, Oct 12, 2010 at 08:58:59PM +0200, Bart Van Assche wrote:

On Tue, Oct 12, 2010 at 8:50 PM, Ralph Campbell <ralph.campb...@qlogic.com> wrote:

On Tue, 2010-10-12 at 11:38 -0700, Bart Van Assche wrote:

Hello,

Has anyone already tried to process the work completions generated by a HCA
after the state of a queue pair has been changed to IB_QPS_ERR? With the
hardware/firmware/driver combination I have tested I have observed the
following:

* Multiple completions with the same wr_id and nonzero (error) status were
  received by the application, while all work requests queued with the flag
  IB_SEND_SIGNALED had a unique wr_id.

I assume your QP is configured for selective signalling, right? This means
that for successful processing of a work request there will not be any
completion. But for an unsuccessful WR, the hardware should generate a
completion. For these cases it is worth having a meaningful wr_id.

* Completions with non-zero (error) status and a wr_id / opcode combination
  were received that were never queued by the application.

In case of error the opcode of the completed operation is not provided. I
am not sure why.

Note: some work requests were queued with and some without the flag
IB_SEND_SIGNALED. I'm not sure however whether that has anything to do with
the observed behavior.

If you have WRs for which you did not set IB_SEND_SIGNALED, they are not
considered completed before a completion entry is pushed to the CQ that
corresponds to that send queue. I am not sure if it means that all the WRs
in the send queue should be completed with error.

This behavior is easy to reproduce. If I interpret the InfiniBand
Architecture Specification correctly, this behavior is non-compliant. Has
anyone been looking into this before?

I haven't seen it. It isn't supposed to happen. What hardware and software
are you using and how do you reproduce it?

Hello Ralph and Or,

The way I reproduce that behavior is by modifying the state of a queue pair
into IB_QPS_ERR while RDMA is ongoing. The application, which is
multithreaded, performs RDMA by calling ib_post_recv() and ib_post_send()
(opcodes IB_WR_SEND, IB_WR_RDMA_READ and IB_WR_RDMA_WRITE). This has been
observed with the mlx4 driver, a ConnectX HCA and firmware version 2.7.0.

Bart.
[PATCH 0/2] svcrdma: NFSRDMA Server fixes for 2.6.37
Hi Bruce,

These fixes are ready for 2.6.37. They fix two bugs in the server-side
NFSRDMA transport.

Thanks,
Tom

---

Tom Tucker (2):
      svcrdma: Cleanup DMA unmapping in error paths.
      svcrdma: Change DMA mapping logic to avoid the page_address kernel API

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   19 ---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    |   82 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   41 +++
 3 files changed, 92 insertions(+), 50 deletions(-)

--
Signed-off-by: Tom Tucker <t...@ogc.us>
[PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API
There was logic in the send path that assumed that a page containing data to send to the client has a KVA. This is not always the case and can result in data corruption when page_address returns zero and we end up DMA mapping zero. This patch changes the bus mapping logic to avoid page_address() where necessary and converts all calls from ib_dma_map_single to ib_dma_map_page in order to keep the map/unmap calls symmetric.

Signed-off-by: Tom Tucker t...@ogc.us
---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  | 18 ---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 80 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c | 18 +++
 3 files changed, 78 insertions(+), 38 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0194de8..926bdb4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -263,9 +263,9 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
 	frmr->page_list_len = PAGE_ALIGN(byte_count) >> PAGE_SHIFT;
 	for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
 		frmr->page_list->page_list[page_no] =
-			ib_dma_map_single(xprt->sc_cm_id->device,
-					  page_address(rqstp->rq_arg.pages[page_no]),
-					  PAGE_SIZE, DMA_FROM_DEVICE);
+			ib_dma_map_page(xprt->sc_cm_id->device,
+					rqstp->rq_arg.pages[page_no], 0,
+					PAGE_SIZE, DMA_FROM_DEVICE);
 		if (ib_dma_mapping_error(xprt->sc_cm_id->device,
 					 frmr->page_list->page_list[page_no]))
 			goto fatal_err;
@@ -309,17 +309,21 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
 			     int count)
 {
 	int i;
+	unsigned long off;
 	ctxt->count = count;
 	ctxt->direction = DMA_FROM_DEVICE;
 	for (i = 0; i < count; i++) {
 		ctxt->sge[i].length = 0; /* in case map fails */
 		if (!frmr) {
+			BUG_ON(0 == virt_to_page(vec[i].iov_base));
+			off = (unsigned long)vec[i].iov_base & ~PAGE_MASK;
 			ctxt->sge[i].addr =
-				ib_dma_map_single(xprt->sc_cm_id->device,
-						  vec[i].iov_base,
-						  vec[i].iov_len,
-						  DMA_FROM_DEVICE);
+				ib_dma_map_page(xprt->sc_cm_id->device,
+						virt_to_page(vec[i].iov_base),
+						off,
+						vec[i].iov_len,
+						DMA_FROM_DEVICE);
 			if (ib_dma_mapping_error(xprt->sc_cm_id->device,
 						 ctxt->sge[i].addr))
 				return -EINVAL;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index b15e1eb..d4f5e0e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -70,8 +70,8 @@
  * on extra page for the RPCRMDA header.
  */
 static int fast_reg_xdr(struct svcxprt_rdma *xprt,
-		       struct xdr_buf *xdr,
-		       struct svc_rdma_req_map *vec)
+			struct xdr_buf *xdr,
+			struct svc_rdma_req_map *vec)
 {
 	int sge_no;
 	u32 sge_bytes;
@@ -96,21 +96,25 @@ static int fast_reg_xdr(struct svcxprt_rdma *xprt,
 	vec->count = 2;
 	sge_no++;
-	/* Build the FRMR */
+	/* Map the XDR head */
 	frmr->kva = frva;
 	frmr->direction = DMA_TO_DEVICE;
 	frmr->access_flags = 0;
 	frmr->map_len = PAGE_SIZE;
 	frmr->page_list_len = 1;
+	page_off = (unsigned long)xdr->head[0].iov_base & ~PAGE_MASK;
 	frmr->page_list->page_list[page_no] =
-		ib_dma_map_single(xprt->sc_cm_id->device,
-				  (void *)xdr->head[0].iov_base,
-				  PAGE_SIZE, DMA_TO_DEVICE);
+		ib_dma_map_page(xprt->sc_cm_id->device,
+				virt_to_page(xdr->head[0].iov_base),
+				page_off,
+				PAGE_SIZE - page_off,
+				DMA_TO_DEVICE);
 	if (ib_dma_mapping_error(xprt->sc_cm_id->device,
 				 frmr->page_list->page_list[page_no]))
 		goto fatal_err;
 	atomic_inc(&xprt->sc_dma_used);
+	/* Map the XDR page list */
 	page_off = xdr->page_base;
 	page_bytes = xdr->page_len + page_off;
 	if (!page_bytes)
@@ -128,9 +132,9 @@ static int fast_reg_xdr(struct svcxprt_rdma
[PATCH 2/2] svcrdma: Cleanup DMA unmapping in error paths.
There are several error paths in the code that do not unmap DMA. This patch adds calls to svc_rdma_unmap_dma to free these DMA contexts.

Signed-off-by: Tom Tucker t...@opengridcomputing.com
---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  1 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c | 29 ++---
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 926bdb4..df67211 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -495,6 +495,7 @@ next_sge:
 		printk(KERN_ERR "svcrdma: Error %d posting RDMA_READ\n", err);
 		set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+		svc_rdma_unmap_dma(ctxt);
 		svc_rdma_put_context(ctxt, 0);
 		goto out;
 	}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index d4f5e0e..249a835 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -367,6 +367,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
 		goto err;
 	return 0;
 err:
+	svc_rdma_unmap_dma(ctxt);
+	svc_rdma_put_frmr(xprt, vec->frmr);
 	svc_rdma_put_context(ctxt, 0);
 	/* Fatal error, close transport */
 	return -EIO;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 23f90c3..d22a44d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -511,9 +511,9 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
 		ctxt->sge[sge_no].addr = pa;
 		ctxt->sge[sge_no].length = PAGE_SIZE;
 		ctxt->sge[sge_no].lkey = xprt->sc_dma_lkey;
+		ctxt->count = sge_no + 1;
 		buflen += PAGE_SIZE;
 	}
-	ctxt->count = sge_no;
 	recv_wr.next = NULL;
 	recv_wr.sg_list = &ctxt->sge[0];
 	recv_wr.num_sge = ctxt->count;
@@ -529,6 +529,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
 	return ret;
 err_put_ctxt:
+	svc_rdma_unmap_dma(ctxt);
 	svc_rdma_put_context(ctxt, 1);
 	return -ENOMEM;
 }
@@ -1306,7 +1307,6 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
 			 enum rpcrdma_errcode err)
 {
 	struct ib_send_wr err_wr;
-	struct ib_sge sge;
 	struct page *p;
 	struct svc_rdma_op_ctxt *ctxt;
 	u32 *va;
@@ -1319,26 +1319,27 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
 	/* XDR encode error */
 	length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
+	ctxt = svc_rdma_get_context(xprt);
+	ctxt->direction = DMA_FROM_DEVICE;
+	ctxt->count = 1;
+	ctxt->pages[0] = p;
+
 	/* Prepare SGE for local address */
-	sge.addr = ib_dma_map_page(xprt->sc_cm_id->device,
-				   p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
-	if (ib_dma_mapping_error(xprt->sc_cm_id->device, sge.addr)) {
+	ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
+					    p, 0, length, DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
 		put_page(p);
 		return;
 	}
 	atomic_inc(&xprt->sc_dma_used);
-	sge.lkey = xprt->sc_dma_lkey;
-	sge.length = length;
-
-	ctxt = svc_rdma_get_context(xprt);
-	ctxt->count = 1;
-	ctxt->pages[0] = p;
+	ctxt->sge[0].lkey = xprt->sc_dma_lkey;
+	ctxt->sge[0].length = length;
 	/* Prepare SEND WR */
 	memset(&err_wr, 0, sizeof err_wr);
 	ctxt->wr_op = IB_WR_SEND;
 	err_wr.wr_id = (unsigned long)ctxt;
-	err_wr.sg_list = &sge;
+	err_wr.sg_list = ctxt->sge;
 	err_wr.num_sge = 1;
 	err_wr.opcode = IB_WR_SEND;
 	err_wr.send_flags = IB_SEND_SIGNALED;
@@ -1348,9 +1349,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
 	if (ret) {
 		dprintk("svcrdma: Error %d posting send for protocol error\n",
 			ret);
-		ib_dma_unmap_page(xprt->sc_cm_id->device,
-				  sge.addr, PAGE_SIZE,
-				  DMA_FROM_DEVICE);
+		svc_rdma_unmap_dma(ctxt);
 		svc_rdma_put_context(ctxt, 1);
 	}
 }
Re: [RFC 0/2] IB/umad: Export mad snooping to userspace
On Tue, Oct 12, 2010 at 12:10:37PM -0700, Hefty, Sean wrote: The kernel mad interface allows a client to view all sent and received MADs. This has proven to be a useful debugging technique when paired with the external kernel module, madeye. However, madeye was never intended to be submitted upstream. A couple of alternatives have been proposed for making this functionality available in the upstream kernel, using trace events or exporting the snooping interface to user space. This patch series takes the latter approach. TBH, I think this would be much better off integrating with the existing paths tcpdump etc. use rather than yet again something new and unique. I think everyone who has to actually use the IB stuff in real life would be ecstatic if wireshark just worked... Yes, I realize that is a bit awkward.. But maybe it is time we had a netdev for the raw IB device? Jason
Re: [PATCH] mlx4: Limit num of fast reg WRs
After I posted this patch, I was told that there is yet another constraint on the page list: The buffer containing the list must not cross a page boundary. So I was thinking what is the best way to deal with this. One way is to always allocate a whole page and map it using dma_map_page(page, DMA_TO_DEVICE), something like this (not a complete patch, just the idea). Is there any chance of the dma_alloc_coherent() in the current code allocating memory that crosses a page boundary? - R.
RE: [RFC 0/2] IB/umad: Export mad snooping to userspace
TBH, I think this would be much better off integrating with the existing paths tcpdump etc. use rather than yet again something new This ties in with the existing MAD interface, which isn't going away anytime soon, if ever.
Re: [patch v3] infiniband: uverbs: handle large number of entries
On Tue, Oct 12, 2010 at 01:31:17PM +0200, Dan Carpenter wrote: In the original code there was a potential integer overflow if you passed in a large cmd.ne. The calls to kmalloc() would allocate smaller buffers than intended, leading to memory corruption. Keep in mind these are probably performance sensitive APIs. I was imagining batching a small number and then copy_to_user? No idea what the various performance trade-offs are.. Please, please, check this. I think I've done it right, but I don't have the hardware and can not test it. Nor do I.. I actually don't know what hardware uses this path? The Mellanox cards use a user-space only version. Maybe an iWARP card? I kinda recall some recent messages concerning memory allocations in these paths for iWARP. I wonder if removing the allocation is such a big win that the larger number of copy_to_user calls does not matter? It's strange to me that we return in_len on success. Agree.. +static int copy_header_to_user(void __user *dest, u32 count) +{ + u32 header[2]; /* the second u32 is reserved */ + + memset(header, 0, sizeof(header)); Don't you need header[0] = count ? Maybe: u32 header[2] = {count}; And let the compiler 0 the other word optimally. Also, I'm not sure it matters here, since you are zeroing user memory that isn't currently used.. +static int copy_wc_to_user(void __user *dest, struct ib_wc *wc) +{ + struct ib_uverbs_wc tmp; + + memset(&tmp, 0, sizeof(tmp)); I'd really like to see that memset go away for performance. Again maybe use named initializers and let the compiler zero the uninitialized (does it zero padding, I wonder?). Or pre-zero this memory outside the loop.. Jason
Re: [RFC 0/2] IB/umad: Export mad snooping to userspace
On Tue, Oct 12, 2010 at 01:54:54PM -0700, Hefty, Sean wrote: TBH, I think this would be much better off integrating with the existing paths tcpdump etc. use rather than yet again something new This ties in with the existing MAD interface, which isn't going away anytime soon, if ever. I didn't say the MAD interface was going away, I said it was not the interface everything else in the kernel uses for packet capture. Jason
Re: [PATCH] Make multicast and path record queue flexible.
On Tue, Oct 12, 2010 at 06:29:53PM +0200, Alekseys Senin wrote: On Tue, 2010-10-05 at 14:12 -0500, Christoph Lameter wrote: On Tue, 5 Oct 2010, Jason Gunthorpe wrote: On Tue, Oct 05, 2010 at 06:07:37PM +0200, Aleksey Senin wrote: When using a slow SM, allow more packets to be buffered before the answer comes back. This patch is based on an idea of Christoph Lameter's. http://lists.openfabrics.org/pipermail/general/2009-June/059853.html IMHO, I think it is better to send multicasts to the broadcast MLID than to queue them.. More like ethernet that way. I agree. We had similar ideas. However, the kernel does send igmp reports to the MC address not to 224.0.0.2. We would have to redirect at the IB layer until multicast via MLID becomes functional. We cannot tell when that will be the case. But what if it will not be available for some reason? How long should we wait? Do we need to implement another queue/counter/timeout? If you follow the scheme I outlined - where traffic to a MGID that doesn't yet have a MLID is routed to the broadcast MLID - then you do it until you get a MLID, with periodic retries/refreshes of the SA operation. This is similar to how ethernet works, and is generally harmless. Better to have a working, but suboptimal network, than one that is busted. Jason
Re: [PATCH] mlx4: Limit num of fast reg WRs
On Tue, Oct 12, 2010 at 01:37:37PM -0700, Roland Dreier wrote: Is there any chance of the dma_alloc_coherent() in the current code allocating memory that crosses a page boundary? You mean that the allocation is aligned at least to its size? I could not find any commitment to this anywhere.
RE: [RFC 0/2] IB/umad: Export mad snooping to userspace
TBH, I think this would be much better off integrating with the existing paths tcpdump etc. use rather than yet again something new This ties in with the existing MAD interface, which isn't going away anytime soon, if ever. I didn't say the MAD interface was going away, I said it was not the interface everything else in the kernel uses for packet capture. My focus is tying this functionality in with the existing IB stack. The MAD, verbs, and HCA drivers do not use net_device, sk_buff, or anything in netdev, and I don't have the time or inclination to try to add it. We have an interface that allows registration to receive MADs; this provides a simple extension of that interface. I'm mainly interested in capturing MAD data, not all packets. We don't have access to any of the headers, or even an easy way to know the destination for sent MADs. - Sean
Opensm crash with OFED 1.5
Folks: I have a multi-processor machine, running FedoraCore 12. I have installed OFED 1.5. Everything seems to come up ok, I can look at the ibstat and it shows that the Mellanox card stats etc... As soon as I start opensm, I get the following kernel oops and the machine locks up. Any ideas Thanks, Suri -- Oct 12 17:19:38 localhost OpenSM[2617]: OpenSM 3.3.5#012 Oct 12 17:19:38 localhost OpenSM[2617]: Entering DISCOVERING state#012 Oct 12 17:20:20 localhost kernel: ib0: ib_query_gid() failed Oct 12 17:20:30 localhost kernel: ib0: ib_query_port failed Oct 12 17:20:52 localhost kernel: BUG: soft lockup - CPU#15 stuck for 61s! [opensm:2637] Oct 12 17:20:52 localhost kernel: Modules linked in: fuse sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad iw_nes libcrc32c iw_cxgb3 cxgb3 mlx4_en mlx4_ib ib_mthca ib_mad ib_core dm_multipath uinput mlx4_core igb i2c_i801 joydev dca i2c_core iTCO_wdt iTCO_vendor_support mpt2sas scsi_transport_sas [last unloaded: microcode] Oct 12 17:20:52 localhost kernel: CPU 15: Oct 12 17:20:52 localhost kernel: Modules linked in: fuse sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad iw_nes libcrc32c iw_cxgb3 cxgb3 mlx4_en mlx4_ib ib_mthca ib_mad ib_core dm_multipath uinput mlx4_core igb i2c_i801 joydev dca i2c_core iTCO_wdt iTCO_vendor_support mpt2sas scsi_transport_sas [last unloaded: microcode] Oct 12 17:20:52 localhost kernel: Pid: 2637, comm: opensm Not tainted 2.6.31.5-127.fc12.x86_64 #1 X8DTH-i/6/iF/6F Oct 12 17:20:52 localhost kernel: RIP: 0010:[81203558] [81203558] __bitmap_empty+0x0/0x64 Oct 12 17:20:52 localhost kernel: RSP: 0018:880c174bbd90 EFLAGS: 0246 Oct 12 17:20:52 localhost kernel: RAX: RBX: 880c174bbdd8 RCX: 0001 Oct 12 17:20:52 localhost kernel: RDX: 
818ba920 RSI: 0100 RDI: 818ba918 Oct 12 17:20:52 localhost kernel: RBP: 8101286e R08: R09: 0004 Oct 12 17:20:52 localhost kernel: R10: 0004 R11: 0206 R12: 880c174bbdd8 Oct 12 17:20:52 localhost kernel: R13: 8101286e R14: 810dc920 R15: 880c174bbcf8 Oct 12 17:20:52 localhost kernel: FS: 7ff2d02e7710() GS:c90001e0() knlGS: Oct 12 17:20:52 localhost kernel: CS: 0010 DS: ES: CR0: 80050033 Oct 12 17:20:52 localhost kernel: CR2: 0041f0c0 CR3: 000c19074000 CR4: 06e0 Oct 12 17:20:52 localhost kernel: DR0: DR1: DR2: Oct 12 17:20:52 localhost kernel: DR3: DR6: 0ff0 DR7: 0400 Oct 12 17:20:52 localhost kernel: Call Trace: Oct 12 17:20:52 localhost kernel: [810383f2] ? native_flush_tlb_others+0xc3/0xf2 Oct 12 17:20:52 localhost kernel: [8103859d] ? flush_tlb_mm+0x6f/0x76 Oct 12 17:20:52 localhost kernel: [810debbc] ? mprotect_fixup+0x480/0x611 Oct 12 17:20:52 localhost kernel: [810da81d] ? free_pgtables+0xa9/0xcc Oct 12 17:20:52 localhost kernel: [810f185d] ? virt_to_head_page+0xe/0x2f Oct 12 17:20:52 localhost kernel: [810deee9] ? sys_mprotect+0x19c/0x227 Oct 12 17:20:52 localhost kernel: [81011cf2] ? system_call_fastpath+0x16/0x1b
linux-next: manual merge of the bkl-llseek tree with the infiniband tree
Hi Arnd, Today's linux-next merge of the bkl-llseek tree got a conflict in drivers/infiniband/hw/cxgb4/device.c between commit 8bbac892fb75d20fa274ca026e24faf00afbf9dd (RDMA/cxgb4: Add default_llseek to debugfs files) from the infiniband tree and commit 9711569d06e7df5f02a943fc4138fb152526e719 (llseek: automatically add .llseek fop) from the bkl-llseek tree. Not really a conflict. The infiniband tree patch is a superset of the part of the bkl-llseek commit that affects this file. -- Cheers, Stephen Rothwell s...@canb.auug.org.au http://www.canb.auug.org.au/~sfr/