Re: list corruption in IPOIB
which list_del do you mean? in ipoib_cm_tx_start? On Mon, May 20, 2013 at 11:05 AM, Or Gerlitz ogerl...@mellanox.com wrote: On 19/05/2013 12:17, Jack Wang wrote: we added inject_bug sysfs node to make function run into error case, like something below. Yes, you are right, we want to speedup the bug reproduce process, and we saw the warning and come to conclusion the neigh-list corrupted some where. What's your opinion? Yes, for the synthetic experiment you made, there's a possible point here: Under the CM error flow (e.g your injected error), we delete a neighbour, but we also do it from the flow that flush neighbours (e.g when changing the device mode from UD to CM, as you did in your script). When this happens concurrently, these two code pieces call for deleting the neighbour from the list. So the spinlock might not be enough and we should have do list_del_init(neigh-list) instead of list_del, helps? Or. -- Mit freundlichen Grüßen, Best Regards, Jack Wang Linux Kernel Developer Storage ProfitBricks GmbH The IaaS-Company. ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 6098 56991-308 Fax: +49 30 6098 56992-203 Email: jinpu.w...@profitbricks.com URL: http://www.profitbricks.com Sitz der Gesellschaft: Berlin. Registergericht: Amtsgericht Charlottenburg, HRB 125506 B. Geschäftsführer: Andreas Gauger, Achim Weiss. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On 20/05/2013 12:10, Jinpu Wang wrote: which list_del do you mean? in ipoib_cm_tx_start? yes, but not only, you can start with 5KG hammer and convert all thesehits to list_del_init linux-2.6]# grep list_del drivers/infiniband/ulp/ipoib/*.c | grep neigh drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
libibverbs / libmlx4 release
Hi Roland, Following what we discussed last week during the Linux Foundation EU summit, I think it would be good to follow what you said and have a point release for libibverbs and libmlx4 before we pull in the verbs extensions framework and features that use it (XRC, Flow-Steering, etc more fun). I mentioned to you that we have some more libmlx4 patches, but its totally OK for us to submit them after that release, makes sense? Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
A quick test show the list_corruption warning is gone, after I convert all list_del(neigh-list) to list_del_list(neigh-list). Test is still running, will update status if anything wrong. Thanks Or. On Mon, May 20, 2013 at 12:58 PM, Or Gerlitz ogerl...@mellanox.com wrote: On 20/05/2013 12:10, Jinpu Wang wrote: which list_del do you mean? in ipoib_cm_tx_start? yes, but not only, you can start with 5KG hammer and convert all thesehits to list_del_init linux-2.6]# grep list_del drivers/infiniband/ulp/ipoib/*.c | grep neigh drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list); -- Mit freundlichen Grüßen, Best Regards, Jack Wang Linux Kernel Developer Storage ProfitBricks GmbH The IaaS-Company. ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 6098 56991-308 Fax: +49 30 6098 56992-203 Email: jinpu.w...@profitbricks.com URL: http://www.profitbricks.com Sitz der Gesellschaft: Berlin. Registergericht: Amtsgericht Charlottenburg, HRB 125506 B. Geschäftsführer: Andreas Gauger, Achim Weiss. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On 20/05/2013 15:46, Jinpu Wang wrote: A quick test show the list_corruption warning is gone, after I convert all list_del(neigh-list) to list_del_list(neigh-list). yes, but this wasn't your original problem or was it? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On 5/20/2013 3:58 PM, Jack Wang wrote: I haven't reproduced the original bug we saw in our production environment BUG: unable to handle kernel at 0008 IP: [a0206c30] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib] ... RIP: 0010:[a0206c30] [a0206c30] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib] RSP: 0018:8825fdcbddb0 EFLAGS: 00010086 RAX: 0246 RBX: 8807b59c29c0 RCX: RDX: 44000602 RSI: 0246 RDI: 8810026527c0 RBP: 881002652000 R08: 00015360 R09: dead00200200 R10: dead00100100 R11: 0001 R12: 0001 R13: R14: 8810026523a0 R15: 8810026527c0 FS: 7f4c9a325700() GS:880807c0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 0008 CR3: 002605e3a000 CR4: 000407f0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process kworker/u:3 (pid: 61374, threadinfo 8825fdcbc000, task 8807fd0eafb0) Stack: 8820043303c0 880807d52700 8807fd0eafb0 8825fdcbdde0 8810026533b8 a0039868 8825fdcbdde0 8805fd549a00 81b9d480 8807fd2f4000 a0206b50 Call Trace: [a0039868] ? process_req+0xe8/0x1a0 [ib_addr] [a0206b50] ? ipoib_cm_tx_handler+0x2d0/0x2d0 [ib_ipoib] [81052d64] ? process_one_work+0x114/0x470 [81055033] ? worker_thread+0x163/0x3e0 [81054ed0] ? manage_workers+0x200/0x200 [81054ed0] ? manage_workers+0x200/0x200 [8105963e] ? kthread+0x9e/0xb0 [8167e9e4] ? kernel_thread_helper+0x4/0x10 [810595a0] ? kthread_freezable_should_stop+0x60/0x60 [8167e9e0] ? gs_change+0x13/0x13 ... [a01fec30] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib] RSP 881d275f1db0 ---[ end trace 38ff082cbc03dd75 ]--- Kernel panic - not syncing: Fatal exception in interrupt , only the A variant of the crash in has been reproduced: WARNING: at lib/list_debug.c:49 __list_del_entry+0x63/0xd0() Hardware name: System Product Name list_del corruption, 88020dbd3080-next is LIST_POISON1 (dead00100100) Modules linked in: ... Pid: 16248, comm: iperf Tainted: GW 3.4.23-pserver+ #76 Call Trace: IRQ [8103c21f] warn_slowpath_common+0x7f/0xc0 [8103c316] warn_slowpath_fmt+0x46/0x50 [81428563] ? do_raw_spin_lock+0xd3/0x140 [81428883] __list_del_entry+0x63/0xd0 [81428901] list_del+0x11/0x40 [a02f64c5] ipoib_cm_handle_tx_wc+0x225/0x380 [ib_ipoib] [a02eea44] ipoib_poll+0x164/0x190 [ib_ipoib] [815d91fd] net_rx_action+0x13d/0x320 [81044f29] ? __do_softirq+0x89/0x380 [81044f98] __do_softirq+0xf8/0x380 [8174632c] call_softirq+0x1c/0x30 EOI [81004305] do_softirq+0x95/0xd0 [815daacc] ? dev_queue_xmit+0x29c/0xbf0 [8104461b] local_bh_enable+0xeb/0xf0 [815daacc] dev_queue_xmit+0x29c/0xbf0 [815da830] ? ptype_seq_start+0xb0/0xb0 [815e0d87] neigh_connected_output+0xc7/0x110 [8109f36d] ? trace_hardirqs_on+0xd/0x10 [81617386] ip_finish_output2+0x1c6/0x460 [8161723a] ? ip_finish_output2+0x7a/0x460 [81619033] ip_finish_output+0xc3/0x230 [81619510] ip_output+0xa0/0x110 [8161764d] ip_local_out+0x2d/0x90 [816176cb] ip_send_skb+0x1b/0x60 [8163f27b] udp_send_skb+0x10b/0x380 [815c3a70] ? sock_def_wakeup+0x1b0/0x1b0 [81616e90] ? ip_append_page+0x530/0x530 [81641462] udp_sendmsg+0x3b2/0xb50 [8173c530] ? retint_restore_args+0x13/0x13 [8164d9a0] ? inet_create+0x5b0/0x5b0 [815c2310] ? sock_update_classid+0x150/0x2b0 [8164dacb] inet_sendmsg+0x12b/0x240 [8164d9a0] ? inet_create+0x5b0/0x5b0 [815c2272] ? sock_update_classid+0xb2/0x2b0 [815c2310] ? sock_update_classid+0x150/0x2b0 [815bda40] sock_aio_write+0x190/0x1b0 [8142214e] ? trace_hardirqs_on_thunk+0x3a/0x3f [8116e82a] do_sync_write+0xea/0x130 [8109bfdd] ? trace_hardirqs_off+0xd/0x10 [811713d3] ? fget_light+0x43/0x490 [813b14f3] ? security_file_permission+0x23/0x90 [8116ee82] vfs_write+0x172/0x190 [8116ef91] sys_write+0x51/0x90 [81744de9] system_call_fastpath+0x16/0x1b ---[ end trace 66110390802a41db ]--- after apply commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27 Author: Shlomo Pongratz shlo...@mellanox.com mailto:shlo...@mellanox.com Date: Mon Aug 13 14:39:49 2012 + IB/ipoib: Add missing locking when CM object is deleted Above warning is gone, but we still see the warning at the begin of this thread. 2013/5/20 Or Gerlitz ogerl...@mellanox.com mailto:ogerl...@mellanox.com On 20/05/2013 15:46, Jinpu Wang wrote: A
Re: list corruption in IPOIB
Hi Jack, I don't understand what is the current status, that is what do you see now after applying the patches. If you don't get the original bug why did you gave the trace of it? Or is it a new trace? It is not clear from your mail. Please add only the trace of the current issue. Best regards, S.P. Hi Shlomo, Sorry for confusion. Current list corruption is gone in my preliminary test, after I changed list_del to list_del_init as Or suggested. As Or asked for the original bug, so I just want to show him the whole story. Regards, Jack -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MLX4 Cq Question
On Saturday 18 May 2013 00:37, Roland Dreier wrote: On Fri, May 17, 2013 at 12:25 PM, Tom Tucker t...@opengridcomputing.com wrote: I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition if I go look at the code paths where cq are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this. This is a bit tricky. If you look at void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = priv-cq_table; int err; err = mlx4_HW2SW_CQ(dev, NULL, cq-cqn); if (err) mlx4_warn(dev, HW2SW_CQ failed (%d) for CQN %06x\n, err, cq-cqn); synchronize_irq(priv-eq_table.eq[cq-vector].irq); spin_lock_irq(cq_table-lock); radix_tree_delete(cq_table-tree, cq-cqn); spin_unlock_irq(cq_table-lock); if (atomic_dec_and_test(cq-refcount)) complete(cq-free); wait_for_completion(cq-free); mlx4_cq_free_icm(dev, cq-cqn); } you see that when freeing a CQ, we first do the HW2SW_CQ firmware command; once this command completes, no more events will be generated for that CQ. Then we do synchronize_irq for the CQ's interrupt vector. Once that completes, no more completion handlers will be running for the CQ, so we can safely delete the CQ from the radix tree (relying on the radix tree's safety of deleting one entry while possibly looking up other entries, so no lock is needed). We also use the lock to synchronize against the CQ event function, which as you noted does take the lock too. Basic idea is that we're tricky and careful so we can make the fast path (completion interrupt handling) lock-free, but then use locks and whatever else needed in the slow path (CQ async event handling, CQ destroy). - R. === Roland, unfortunately we have seen that we need some locking on the cq completion handler (there is a stack trace which resulted from this lack of proper locking). In our current driver, we are using the patch below (which uses RCU locking instead of spinlocks). I can prepare a proper patch for the upstream kernel. === net/mlx4_core: Fix racy flow in the driver CQ completion handler The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock the radix tree which is used to manage the table of CQs, nor does it increase the reference count of the CQ before invoking the user provided callback (and decrease it afterwards). This is racy and can cause use-after-free, null pointer dereference, etc, which result in kernel crashes. To fix this, we must do the following in mlx4_cq_completion: - increase the ref count on the cq before invoking the user callback, and decrement it after the callback. - Place a lock around the radix tree lookup/ref-count-increase Using an irq spinlock will not fix this issue. The problem is that under VPI, the ETH interface uses multiple msix irq's, which can result in one cq completion event interrupting another in-progress cq completion event. A deadlock results when the handler for the first cq completion grabs the spinlock, and is interrupted by the second completion before it has a chance to release the spinlock. The handler for the second completion will deadlock waiting for the spinlock to be released. The proper fix is to use the RCU mechanism for locking radix-tree accesses in the cq completion event handler (The radix-tree implementation uses the RCU mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the updater, will do the job). Note that the same issue exists in mlx4_cq_event() (the cq async event handler), which also takes the same lock on the radix tree. Here, we replace the spinlock with an rcu_read_lock(). This patch was motivated by the following report from the field: [...] box panic'ed when trying to find a completion queue. There is no corruption but there is a possible race which could result in mlx4_cq_completion getting wrong height of the radix tree and following a bit too deep into the chains. In the other code which uses this radix tree the access is protected by the lock but mlx4_cq_completion is running in the interrupt context and cannot take locks, so instead it runs without any protection whatsoever. The stack trace below is from the mlnx ofed 1.5.3 driver running under RHEL5.7. (this driver uses the upstream kernel
Re: MLX4 Cq Question
On Mon, May 20, 2013 at 7:53 AM, Jack Morgenstein ja...@dev.mellanox.co.il wrote: This is racy and can cause use-after-free, null pointer dereference, etc, which result in kernel crashes. Sounds fine and I'd be happy to apply your final patch, but I'd be curious to know what the race is in more detail. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libibverbs / libmlx4 release
On Mon, May 20, 2013 at 5:37 AM, Or Gerlitz ogerl...@mellanox.com wrote: Following what we discussed last week during the Linux Foundation EU summit, I think it would be good to follow what you said and have a point release for libibverbs and libmlx4 before we pull in the verbs extensions framework and features that use it (XRC, Flow-Steering, etc more fun). I mentioned to you that we have some more libmlx4 patches, but its totally OK for us to submit them after that release, makes sense? That's fine, I'll do the releases this week. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH libibverbs v5 5/7] libibverbs: Add ibv_open_qp
On 3/26/2013 4:14 PM, sean.he...@intel.com wrote: From: Sean Hefty sean.he...@intel.com XRC receive QPs are shareable across multiple processes. Allow any process with access to the xrc domain to open an existing QP. After opening the QP, the process will receive events related to the QP and be able to modify the QP. Hey Sean, So is it intended that cooperating processes sharing an xrc domain will choose one process to create the xrc qp, and the rest will open it? Steve. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH libibverbs v5 5/7] libibverbs: Add ibv_open_qp
So is it intended that cooperating processes sharing an xrc domain will choose one process to create the xrc qp, and the rest will open it? yes - The QPN of the shared QP must be known between the cooperating processes. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libibverbs / libmlx4 release
On 05/20/2013 12:49 PM, Roland Dreier wrote: On Mon, May 20, 2013 at 5:37 AM, Or Gerlitz ogerl...@mellanox.com wrote: Following what we discussed last week during the Linux Foundation EU summit, I think it would be good to follow what you said and have a point release for libibverbs and libmlx4 before we pull in the verbs extensions framework and features that use it (XRC, Flow-Steering, etc more fun). I mentioned to you that we have some more libmlx4 patches, but its totally OK for us to submit them after that release, makes sense? That's fine, I'll do the releases this week. If you haven't already done so, can you please update to the latest autotools and reroll the autotools files when you do the release? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On Mon, May 20, 2013 at 5:36 PM, Jack Wang jinpu.w...@profitbricks.com wrote: Sorry for confusion. Current list corruption is gone in my preliminary test, after I changed list_del to list_del_init as Or suggested. As Or asked for the original bug, so I just want to show him the whole story. I am still not clear if the bug you saw in your production environment is gone with the list_del_init patch applied, please clarify. Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
rockets feedbacks?
Hi Sean, Do we have some public quoted usages/feedback for rsockets? I think you've mentioned something during the panel at the Linux EU summit last week but I am not sure... Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On 2013年05月20日 21:00, Or Gerlitz wrote: On Mon, May 20, 2013 at 5:36 PM, Jack Wang jinpu.w...@profitbricks.com wrote: Sorry for confusion. Current list corruption is gone in my preliminary test, after I changed list_del to list_del_init as Or suggested. As Or asked for the original bug, so I just want to show him the whole story. I am still not clear if the bug you saw in your production environment is gone with the list_del_init patch applied, please clarify. Or. The bug in our production environment is introduced in our backport about ipoib fixes from mainline, and when we hit that bug we reverted back to old kernel without the backport patch, and the bug didn't happen for now. This list_del_init patch do fix list corruption warning, but it's not the one we hit in production, the list corruption is reproduced in our test setup with bug injection patch iperf -P 50 mode switch. Is this clear for you now? Jack -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On Mon, May 20, 2013 at 10:38 PM, Jack Wang jinpu.w...@profitbricks.com wrote: The bug in our production environment is introduced in our backport about ipoib fixes from mainline, and when we hit that bug we reverted back to old kernel without the backport patch, and the bug didn't happen for now. This list_del_init patch do fix list corruption warning, but it's not the one we hit in production, the list corruption is reproduced in our test setup with bug injection patch iperf -P 50 mode switch. Is this clear for you now? NO, you say that the list_del_init patch eliminates the list corruption warning, does the list corruption is still reproduced in your test setup even when the patch is applied?! what's the trace? and what is the trace you see in your production when using kernel X (which?) patches with commit Y (which?) Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MLX4 Cq Question
Hi Guys, One other quick one. I've received conflicting claims on the validity of the wc.opcode when wc.status != 0 for mlx4 hardware. My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an error and therefore, the only way to recover what the opcode was is through the wr_id you used when submitting the WR. Is my reading of the code correct? Thanks, Tom On 5/20/13 9:53 AM, Jack Morgenstein wrote: On Saturday 18 May 2013 00:37, Roland Dreier wrote: On Fri, May 17, 2013 at 12:25 PM, Tom Tucker t...@opengridcomputing.com wrote: I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition if I go look at the code paths where cq are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this. This is a bit tricky. If you look at void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = priv-cq_table; int err; err = mlx4_HW2SW_CQ(dev, NULL, cq-cqn); if (err) mlx4_warn(dev, HW2SW_CQ failed (%d) for CQN %06x\n, err, cq-cqn); synchronize_irq(priv-eq_table.eq[cq-vector].irq); spin_lock_irq(cq_table-lock); radix_tree_delete(cq_table-tree, cq-cqn); spin_unlock_irq(cq_table-lock); if (atomic_dec_and_test(cq-refcount)) complete(cq-free); wait_for_completion(cq-free); mlx4_cq_free_icm(dev, cq-cqn); } you see that when freeing a CQ, we first do the HW2SW_CQ firmware command; once this command completes, no more events will be generated for that CQ. Then we do synchronize_irq for the CQ's interrupt vector. Once that completes, no more completion handlers will be running for the CQ, so we can safely delete the CQ from the radix tree (relying on the radix tree's safety of deleting one entry while possibly looking up other entries, so no lock is needed). We also use the lock to synchronize against the CQ event function, which as you noted does take the lock too. Basic idea is that we're tricky and careful so we can make the fast path (completion interrupt handling) lock-free, but then use locks and whatever else needed in the slow path (CQ async event handling, CQ destroy). - R. === Roland, unfortunately we have seen that we need some locking on the cq completion handler (there is a stack trace which resulted from this lack of proper locking). In our current driver, we are using the patch below (which uses RCU locking instead of spinlocks). I can prepare a proper patch for the upstream kernel. === net/mlx4_core: Fix racy flow in the driver CQ completion handler The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock the radix tree which is used to manage the table of CQs, nor does it increase the reference count of the CQ before invoking the user provided callback (and decrease it afterwards). This is racy and can cause use-after-free, null pointer dereference, etc, which result in kernel crashes. To fix this, we must do the following in mlx4_cq_completion: - increase the ref count on the cq before invoking the user callback, and decrement it after the callback. - Place a lock around the radix tree lookup/ref-count-increase Using an irq spinlock will not fix this issue. The problem is that under VPI, the ETH interface uses multiple msix irq's, which can result in one cq completion event interrupting another in-progress cq completion event. A deadlock results when the handler for the first cq completion grabs the spinlock, and is interrupted by the second completion before it has a chance to release the spinlock. The handler for the second completion will deadlock waiting for the spinlock to be released. The proper fix is to use the RCU mechanism for locking radix-tree accesses in the cq completion event handler (The radix-tree implementation uses the RCU mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the updater, will do the job). Note that the same issue exists in mlx4_cq_event() (the cq async event handler), which also takes the same lock on the radix tree. Here, we replace the spinlock with an rcu_read_lock(). This patch was motivated by the following report from the field: [...] box panic'ed when trying to find a completion queue. There is no corruption but there is a possible race which could
RE: rockets feedbacks?
Do we have some public quoted usages/feedback for rsockets? I think you've mentioned something during the panel at the Linux EU summit last week but I am not sure... Most feedback I can think of has come via private emails or personal interactions, especially specific details of various usage models. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list corruption in IPOIB
On 2013年05月20日 21:50, Or Gerlitz wrote: On Mon, May 20, 2013 at 10:38 PM, Jack Wang jinpu.w...@profitbricks.com wrote: The bug in our production environment is introduced in our backport about ipoib fixes from mainline, and when we hit that bug we reverted back to old kernel without the backport patch, and the bug didn't happen for now. This list_del_init patch do fix list corruption warning, but it's not the one we hit in production, the list corruption is reproduced in our test setup with bug injection patch iperf -P 50 mode switch. Is this clear for you now? NO, you say that the list_del_init patch eliminates the list corruption warning, does the list corruption is still reproduced in your test setup even when the patch is applied?! what's the trace? and what is the trace you see in your production when using kernel X (which?) patches with commit Y (which?) Or. No, the list_del_init fixed the list corruption warning, no list corruption anymore. The trace is in my earlier mail in this thread. Jack -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: MLX4 Cq Question
My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an error and therefore, the only way to recover what the opcode was is through the wr_id you used when submitting the WR. Is my reading of the code correct? I believe this is true wrt the IB spec. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MLX4 Cq Question
On 5/20/13 2:58 PM, Hefty, Sean wrote: My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an error and therefore, the only way to recover what the opcode was is through the wr_id you used when submitting the WR. Is my reading of the code correct? I believe this is true wrt the IB spec. Thanks, this was my recollection as well. Tom -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rockets feedbacks?
On Mon, May 20, 2013 at 10:52 PM, Hefty, Sean sean.he...@intel.com wrote: Do we have some public quoted usages/feedback for rsockets? I think you've mentioned something during the panel at the Linux EU summit last week but I am not sure... Most feedback I can think of has come via private emails or personal interactions, especially specific details of various usage models. So if you were pushing these private conversations to linux-rdma, more have been known on rsockets for the benefit of all... oh well. I think you mentioned something re Intel HPC group, or I am wrong? Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: rockets feedbacks?
So if you were pushing these private conversations to linux-rdma, more have been known on rsockets for the benefit of all... oh well. I think you mentioned something re Intel HPC group, or I am wrong? rsockets will continue to be supported by myself and Intel going forward. The rsocket work originated from HPC related work done that I was doing with Xeon Phi, and targeted Intel MPI and OSU MPI as its initial application. The Xeon Phi processors are slower than the traditional Xeon cores; I believe I mentioned that this was a problem, since the socket API requires data copies in practice. So, extensions to the rsocket API is needed to support zero-copy data transfers. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] make read_config() more robust
Hi, Please find three patches to protect libibverbs from using invalid, unsecure configuration files. Thoses configurations files are usually located in /etc/libibverbs.d/ and contains the name of a shared library to dlopen(). Only legitimate shared libraries should be loaded by libibverbs, so it must be careful on the configuration files used. Regards. Yann Droneaud (3): read_config: ignore files beginning with '.' read_config: ignore directory entry with backup suffix (~) read_config: skip file/directory with unsecure permissions src/init.c | 27 ++- 1 file changed, 26 insertions(+), 1 deletion(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] read_config: ignore files beginning with '.'
Files beginning with a dot are mostly current and parent directories or, by convention, hidden files. Those path are skipped in find_sysfs_dev(). Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/init.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/init.c b/src/init.c index 96ecf04..c880b68 100644 --- a/src/init.c +++ b/src/init.c @@ -308,6 +308,9 @@ static void read_config(void) while ((dent = readdir(conf_dir))) { struct stat buf; + if (dent-d_name[0] == '.') + continue; + if (asprintf(path, %s/%s, IBV_CONFIG_DIR, dent-d_name) 0) { fprintf(stderr, PFX Warning: couldn't read config file %s/%s.\n, IBV_CONFIG_DIR, dent-d_name); -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] read_config: ignore directory entry with backup suffix (~)
Try to protect libibverbs from hand modified configuration files. Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/init.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/init.c b/src/init.c index c880b68..1981da7 100644 --- a/src/init.c +++ b/src/init.c @@ -311,6 +311,9 @@ static void read_config(void) if (dent-d_name[0] == '.') continue; + if (dent-d_name[0] == '\0' || dent-d_name[strlen(dent-d_name) - 1] == '~') + continue; + if (asprintf(path, %s/%s, IBV_CONFIG_DIR, dent-d_name) 0) { fprintf(stderr, PFX Warning: couldn't read config file %s/%s.\n, IBV_CONFIG_DIR, dent-d_name); -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] read_config: skip file/directory with unsecure permissions
libibverbs must refuse to load arbitrary shared objects. This patch check the configuration directory and files for - being owned by root; - not being writable by others. Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/init.c | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/src/init.c b/src/init.c index 1981da7..a37b57d 100644 --- a/src/init.c +++ b/src/init.c @@ -294,10 +294,24 @@ static void read_config_file(const char *path) static void read_config(void) { + struct stat buf; DIR *conf_dir; struct dirent *dent; char *path; + if (stat(IBV_CONFIG_DIR, buf) || !S_ISDIR(buf.st_mode)) { + fprintf(stderr, PFX Warning: couldn't stat config directory '%s'.\n, + IBV_CONFIG_DIR); + return; + } + + if (buf.st_uid != 0 || buf.st_gid != 0 || + (buf.st_mode S_IWOTH) != 0) { + fprintf(stderr, PFX Warning: unsecure config directory '%s'.\n, + IBV_CONFIG_DIR); + return; + } + conf_dir = opendir(IBV_CONFIG_DIR); if (!conf_dir) { fprintf(stderr, PFX Warning: couldn't open config directory '%s'.\n, @@ -306,8 +320,6 @@ static void read_config(void) } while ((dent = readdir(conf_dir))) { - struct stat buf; - if (dent-d_name[0] == '.') continue; @@ -329,6 +341,13 @@ static void read_config(void) if (!S_ISREG(buf.st_mode)) goto next; + if (buf.st_uid != 0 || buf.st_gid != 0 || + (buf.st_mode S_IWOTH) != 0) { + fprintf(stderr, PFX Warning: unsecure config file '%s'.\n, + path); + goto next; + } + read_config_file(path); next: free(path); -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] add RAW Packet QP type
On Tue, Jan 17, 2012 at 10:21:28AM -0600, Steve Wise wrote: On 01/17/2012 09:59 AM, Or Gerlitz wrote: On 1/17/2012 5:08 PM, Steve Wise wrote: I think this series should add some new send flags for HW that does checksum offload [...] also, on ingress, most hardware can do INET checksum validation, and a way to indicate the results to the application is needed. Perhaps flags in the CQE? [...] another form of HW assist is with VLAN insertion/extraction. The API should provide a way to specify if a VLAN ID should be inserted by HW and removed from a packet on ingress (and passed to the app via the CQE). In fact, we probably want a way to associate a VLAN with a RAW QP, maybe as a QP attribute? Hi Steve, Nice to see how a discussion is quickly revived when the subject is close to people's needs... anyway, yep, sure, this submission allows for basic functionality, as I mentioned and there's more to add here, but this could (should...) be done gradually, i.e in steps that adds on the basic patch. Now to the points you've raised: Sure, we want to be able to do checksum offload on TX and on RX as well. Adding checksum offload support will be done with a patch to libibverbs which is similar in spirit to commit e0605d IB/core: Add IP checksum offload support subject to reporting the RX checksum okay through new IB_WC_IP_CSUM_OK bit in ibv_wc-wc_flags along the lines of the patch I sent last week to the kernel. So we will have a new device capability and two new bit flags IB_SEND_IP_CSUM and IB_WC_IP_CSUM_OK, no ABI breaking... happy. Sounds good, with the caveat that we need more bits to specify L4 CSUM generation as well as L3 (and L2 CRC). Hi Or, I appologize if I missed it, but did any support for L3/L4 CSUM generation get added? Doesn't look like the upstream libibverbs has it, and I don't seem to see any patches floating around. Thanks, Shawn -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] configure: use automake's option subdir-objects
Following advice in Autotool Mythbuster [1], option subdir-objects can be used to have Makefiles create object files in the same directory than theirs source files. It reduces clobbering in the build directory. [1] Autotool Mythbuster, by Diego Elio Flameeyes Petten`o http://www.flameeyes.eu/autotools-mythbuster/automake/nonrecursive.html Signed-off-by: Yann Droneaud ydrone...@opteya.com --- configure.ac | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index a1d5365..231083c 100644 --- a/configure.ac +++ b/configure.ac @@ -6,7 +6,7 @@ AC_CONFIG_SRCDIR([src/ibverbs.h]) AC_CONFIG_AUX_DIR(config) AC_CONFIG_MACRO_DIR(config) AC_CONFIG_HEADER(config.h) -AM_INIT_AUTOMAKE([foreign]) +AM_INIT_AUTOMAKE([foreign subdir-objects]) m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])]) dnl Checks for programs -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] configure: apply updates proposed by autoupdate
'autoupdate' is a tool to help developer to update configure.ac. This patch apply a few fixes as suggested by autoupdate. It was tested on Debian 6.0.7 (Squeeze) and Fedora 17 (Beefy Miracle). Signed-off-by: Yann Droneaud ydrone...@opteya.com --- configure.ac | 17 - 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/configure.ac b/configure.ac index 8f79891..a1d5365 100644 --- a/configure.ac +++ b/configure.ac @@ -1,7 +1,7 @@ dnl Process this file with autoconf to produce a configure script. -AC_PREREQ(2.57) -AC_INIT(libibverbs, 1.1.6, linux-rdma@vger.kernel.org) +AC_PREREQ([2.67]) +AC_INIT([libibverbs],[1.1.6],[linux-rdma@vger.kernel.org]) AC_CONFIG_SRCDIR([src/ibverbs.h]) AC_CONFIG_AUX_DIR(config) AC_CONFIG_MACRO_DIR(config) @@ -11,14 +11,12 @@ m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])]) dnl Checks for programs AC_PROG_CC -AC_GNU_SOURCE +AC_USE_SYSTEM_EXTENSIONS AC_PROG_LN_S -AC_PROG_LIBTOOL - LT_INIT AC_ARG_WITH([valgrind], -AC_HELP_STRING([--with-valgrind], +AS_HELP_STRING([--with-valgrind], [Enable Valgrind annotations (small runtime overhead, default NO)])) if test x$with_valgrind = x || test x$with_valgrind = xno; then want_valgrind=no @@ -84,9 +82,10 @@ fi AC_SUBST(LIBIBVERBS_VERSION_SCRIPT) AC_CACHE_CHECK(for .symver assembler support, ac_cv_asm_symver_support, -[AC_TRY_COMPILE(, [asm(symbol:\n.symver symbol, api@ABI\n);], -ac_cv_asm_symver_support=yes, -ac_cv_asm_symver_support=no)]) +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[]], +[[asm(symbol:\n.symver symbol, api@ABI\n);]])], +[ac_cv_asm_symver_support=yes], +[ac_cv_asm_symver_support=no])]) if test $ac_cv_asm_symver_support = yes; then AC_DEFINE([HAVE_SYMVER_SUPPORT], 1, [assembler has .symver support]) fi -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] open files with close on exec flag
File opened by libibverbs are not supposed to be inherited across exec*(), most of the files are of no use for another program, and others cannot be used without the associated memory mapping. This patch changes open() and fopen() to always set close on exec flag. This patch also add checks to configure to guess if fopen() supports e flag. If O_CLOEXEC and SOCK_CLOEXEC are supported, fopen() should support e. If not supported, its discarded according to POSIX. Many operating systems have support for fopen(e). You might find more information about close on exec in the following articles: - Excuse me son, but your code is leaking !!! by Dan Walsh http://danwalsh.livejournal.com/53603.html - Secure File Descriptor Handling by Ulrich Drepper http://udrepper.livejournal.com/20407.html Note: this patch won't set close on exec flag on file descriptors created by the kernel for completion channel and such. This should be addressed by a kernel patch. Signed-off-by: Yann Droneaud ydrone...@opteya.com --- configure.ac | 21 + src/device.c | 2 +- src/init.c | 2 +- src/memory.c | 2 +- src/sysfs.c | 2 +- 5 files changed, 25 insertions(+), 4 deletions(-) diff --git a/configure.ac b/configure.ac index efdc5ac..8f79891 100644 --- a/configure.ac +++ b/configure.ac @@ -44,9 +44,30 @@ AC_CHECK_HEADER(valgrind/memcheck.h, [if test $want_valgrind = yes; then AC_MSG_ERROR([Valgrind memcheck support requested, but valgrind/memcheck.h not found.]) fi]) +AC_CHECK_HEADERS([fcntl.h sys/socket.h]) dnl Checks for typedefs, structures, and compiler characteristics. AC_C_CONST +AC_CHECK_DECLS([O_CLOEXEC],,[AC_DEFINE([O_CLOEXEC],[0], [Defined to 0 if not provided])], +[[ +#ifdef HAVE_FCNTL_H +# include fcntl.h +#endif +]]) +AC_CHECK_DECLS([SOCK_CLOEXEC],,[AC_DEFINE([SOCK_CLOEXEC],[0],[Defined to 0 if not provided])], +[[ +#ifdef HAVE_SYS_SOCKET_H +# include sys/socket.h +#endif +]]) + +AC_CACHE_CHECK(for close on exec modifier for fopen(), ac_cv_feature_stream_cloexec_flag, + [if test $ac_cv_have_decl_O_CLOEXEC = yes ; then +if test $ac_cv_have_decl_SOCK_CLOEXEC = yes ; then +ac_cv_feature_stream_cloexec_flag=e +fi +fi]) +AC_DEFINE_UNQUOTED([STREAM_CLOEXEC], $ac_cv_feature_stream_cloexec_flag, [fopen() modifier for setting close on exec flag]) AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, [if test -n `$LD --help /dev/null 2/dev/null | grep version-script`; then diff --git a/src/device.c b/src/device.c index 5798895..1923fa5 100644 --- a/src/device.c +++ b/src/device.c @@ -135,7 +135,7 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device) * We'll only be doing writes, but we need O_RDWR in case the * provider needs to mmap() the file. */ - cmd_fd = open(devpath, O_RDWR); + cmd_fd = open(devpath, O_RDWR | O_CLOEXEC); free(devpath); if (cmd_fd 0) diff --git a/src/init.c b/src/init.c index 8d6786e..8e93f3f 100644 --- a/src/init.c +++ b/src/init.c @@ -243,7 +243,7 @@ static void read_config_file(const char *path) size_t buflen = 0; ssize_t len; - conf = fopen(path, r); + conf = fopen(path, r STREAM_CLOEXEC); if (!conf) { fprintf(stderr, PFX Warning: couldn't read config file %s.\n, path); diff --git a/src/memory.c b/src/memory.c index 7d97e55..e9d1eec 100644 --- a/src/memory.c +++ b/src/memory.c @@ -109,7 +109,7 @@ static unsigned long get_page_size(void *base) pid = getpid(); snprintf(buf, sizeof(buf), /proc/%d/smaps, pid); - file = fopen(buf, r); + file = fopen(buf, r STREAM_CLOEXEC); if (!file) goto out; diff --git a/src/sysfs.c b/src/sysfs.c index 85aee39..e031631 100644 --- a/src/sysfs.c +++ b/src/sysfs.c @@ -85,7 +85,7 @@ int ibv_read_sysfs_file(const char *dir, const char *file, if (asprintf(path, %s/%s, dir, file) 0) return -1; - fd = open(path, O_RDONLY); + fd = open(path, O_RDONLY | O_CLOEXEC); if (fd 0) { free(path); return -1; -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 0/9] Add receive Flow Steering support
On Wed, Apr 24, 2013 at 04:58:43PM +0300, Or Gerlitz wrote: Hi Roland, all The first five patches in the series are mlx4 DMFS (Device Managed Flow Steering) pre-patches needed for flow steering access from the mlx4 IB driver. net/mlx4_core: Move DMFS HW structs to common header file net/mlx4: Match DMFS promiscuous field names to firmware spec net/mlx4_core: Change few DMFS fields names to match firmare spec net/mlx4_core: Directly expose fields of DMFS HW rule control segment net/mlx4_core: Expose few helpers to fill DMFS HW strucutures We're submitting them through the rdma tree as the next patch depend on them, please apply them for 3.10 this way (accepting also the next patches) or another (even if they are not accepted now), so we will not have extra dependencies with further changes on that area. The next four patches add Flow Steering support to the kernel IB core, to uverbs and to the mlx4 IB (verbs) driver along with one patch to uverbs which adds some code to support extensions. IB/core: Add receive Flow Steering support IB/core: Infra-structure to support verbs extensions through uverbs IB/core: Export ib_create/destroy_flow through uverbs IB/mlx4: Add receive Flow Steering support The main patch which introduces the Flow-Steering API is IB/core: Add receive Flow Steering support, see its change log. Looking on the Network Adapter Flow Steering slides from Tzahi Oved which he presented on the annual OFA 2012 meeting could be helpful https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/518-network-adapter-flow-steering.html Hi Or, Are there any patches for libibverbs to add ibv_create_flow/ibv_destroy_flow? And are there any needed patches for libmlx4? I'm building up a stack so we can begin testing this series. Thanks, Shawn -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html