Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Ralph Campbell wrote: On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. I see other manifestations resulting in different crashes : :mon t [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70 9:mon e cpu 0x9: Vector: 300 (Data Access) at [c0074603b720] pc: c05ac390: ._spin_lock+0x20/0xc8 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c0074603b9a0 msr: 80009032 dar: 3a0 dsisr: 4000 current = 0xc00756ce8b00 paca= 0xc0f63800 pid = 18095, comm = ipoib 9:mon Recreating the crash has been tricky. I have tried several several hundred times today to unload and reload IPoIB while there is traffic and no crashes happened. I took a closer look at the IPoIB CM code and I see a few things that look suspicious. In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). Is this a potential race? ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected by netif_tx_lock(dev) or stopping the ipoib network device. I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called without netif_tx_lock(dev) held. Is that correct? Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
On Fri, 2010-07-16 at 02:13 -0700, Pradeep Satyanarayana wrote: Ralph Campbell wrote: On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. I see other manifestations resulting in different crashes : :mon t [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70 9:mon e cpu 0x9: Vector: 300 (Data Access) at [c0074603b720] pc: c05ac390: ._spin_lock+0x20/0xc8 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c0074603b9a0 msr: 80009032 dar: 3a0 dsisr: 4000 current = 0xc00756ce8b00 paca= 0xc0f63800 pid = 18095, comm = ipoib 9:mon Recreating the crash has been tricky. I have tried several several hundred times today to unload and reload IPoIB while there is traffic and no crashes happened. I took a closer look at the IPoIB CM code and I see a few things that look suspicious. In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). Is this a potential race? ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected by netif_tx_lock(dev) or stopping the ipoib network device. I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() appears to be called without netif_tx_lock(dev) held. Is that correct? Thanks Pradeep ipoib_neigh_cleanup() is called by neigh_cleanup_and_release() when freeing a struct neighbour. I assume the Linux network stack is not going to call into the IPoIB driver to send sk_buffs in that case but I could be wrong. If it can, then you are correct that the netif_tx_lock(dev) should be acquired. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. I see other manifestations resulting in different crashes : :mon t [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70 9:mon e cpu 0x9: Vector: 300 (Data Access) at [c0074603b720] pc: c05ac390: ._spin_lock+0x20/0xc8 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c0074603b9a0 msr: 80009032 dar: 3a0 dsisr: 4000 current = 0xc00756ce8b00 paca= 0xc0f63800 pid = 18095, comm = ipoib 9:mon Recreating the crash has been tricky. I have tried several several hundred times today to unload and reload IPoIB while there is traffic and no crashes happened. I took a closer look at the IPoIB CM code and I see a few things that look suspicious. In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). Is this a potential race? In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, tx-flags) in ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a clear_bit(IPOIB_FLAG_OPER_UP, tx-flags) called before calling ipoib_cm_destroy_tx() only in select cases. Was that intended? Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
I will write up a description of the locking as I understand it and the changes I made. Give me a day or two to write it up and check it. On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. I see other manifestations resulting in different crashes : :mon t [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70 9:mon e cpu 0x9: Vector: 300 (Data Access) at [c0074603b720] pc: c05ac390: ._spin_lock+0x20/0xc8 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c0074603b9a0 msr: 80009032 dar: 3a0 dsisr: 4000 current = 0xc00756ce8b00 paca= 0xc0f63800 pid = 18095, comm = ipoib 9:mon Recreating the crash has been tricky. I have tried several several hundred times today to unload and reload IPoIB while there is traffic and no crashes happened. I took a closer look at the IPoIB CM code and I see a few things that look suspicious. In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). Is this a potential race? In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, tx-flags) in ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a clear_bit(IPOIB_FLAG_OPER_UP, tx-flags) called before calling ipoib_cm_destroy_tx() only in select cases. Was that intended? Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Pradeep Satyanarayana wrote: Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. I see other manifestations resulting in different crashes : :mon t [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70 9:mon e cpu 0x9: Vector: 300 (Data Access) at [c0074603b720] pc: c05ac390: ._spin_lock+0x20/0xc8 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c0074603b9a0 msr: 80009032 dar: 3a0 dsisr: 4000 current = 0xc00756ce8b00 paca= 0xc0f63800 pid = 18095, comm = ipoib 9:mon Recreating the crash has been tricky. I have tried several several hundred times today to unload and reload IPoIB while there is traffic and no crashes happened. I took a closer look at the IPoIB CM code and I see a few things that look suspicious. In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix dangling pointer). Is this a potential race? ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected by netif_tx_lock(dev) or stopping the ipoib network device. It all depends on what pointer or data structure you think is being accessed while free or being modified without the proper protection. In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, tx-flags) in ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's patch) there is a clear_bit(IPOIB_FLAG_OPER_UP, tx-flags) called before calling ipoib_cm_destroy_tx() only in select cases. Was that intended? The v4 patch comments explain the changes: http://www.spinics.net/lists/linux-rdma/msg03733.html Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx has completed the RC QP creation process via the CM instead of simply when ipoib_cm_create_tx() allocates the structure. The test and clear was used to indicate the struct ipoib_cm_tx had been put on the destroy list and the reaper thread woken up. Now ipoib_cm_destroy_tx() uses the tx-neigh pointer != NULL to indicate that ipoib_cm_destroy_tx() has started the destroy process. ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv-lock are held to protect tx-neigh. Thanks Pradeep The longer write up on locking is turning out to be very complex. I will keep working on it but I think it will be just as hard to understand as slogging through the code. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Pradeep Satyanarayana wrote: Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. I see other manifestations resulting in different crashes : :mon t [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib] [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70 9:mon e cpu 0x9: Vector: 300 (Data Access) at [c0074603b720] pc: c05ac390: ._spin_lock+0x20/0xc8 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] sp: c0074603b9a0 msr: 80009032 dar: 3a0 dsisr: 4000 current = 0xc00756ce8b00 paca= 0xc0f63800 pid = 18095, comm = ipoib 9:mon Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Roland Dreier wrote: I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash: I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed? This is a new crash. Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana prade...@linux.vnet.ibm.com wrote: I realize that the following patch: https://patchwork.kernel.org/patch/97243/ is queued in your backlog of patches and unlikely that it will go into 2.6.35. What are the chances that it will make it into 2.6.36? This patch has fixed a a rarely seen crash and we would like it to go upstream ASAP. The following comment was made on that patch by Ralph Campbell (see also http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03125.html): Quite right. I should also use list_for_each_entry_safe(). I will fix this. This makes me wonder whether version three of this patch can go in unmodified ? Bart. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Bart Van Assche wrote: On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana prade...@linux.vnet.ibm.com wrote: I realize that the following patch: https://patchwork.kernel.org/patch/97243/ is queued in your backlog of patches and unlikely that it will go into 2.6.35. What are the chances that it will make it into 2.6.36? This patch has fixed a a rarely seen crash and we would like it to go upstream ASAP. The following comment was made on that patch by Ralph Campbell (see also http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03125.html): Quite right. I should also use list_for_each_entry_safe(). I will fix this. This makes me wonder whether version three of this patch can go in unmodified ? There was a version 4 that followed. That was what I was referring to. Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana prade...@linux.vnet.ibm.com wrote: Bart Van Assche wrote: On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana prade...@linux.vnet.ibm.com wrote: I realize that the following patch: https://patchwork.kernel.org/patch/97243/ is queued in your backlog of patches and unlikely that it will go into 2.6.35. What are the chances that it will make it into 2.6.36? This patch has fixed a a rarely seen crash and we would like it to go upstream ASAP. The following comment was made on that patch by Ralph Campbell (see also http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03125.html): Quite right. I should also use list_for_each_entry_safe(). I will fix this. This makes me wonder whether version three of this patch can go in unmodified ? There was a version 4 that followed. That was what I was referring to. Thanks for the info -- I had missed version four of that patch. Now that I had a look at it, why does the comment above ipoib_cm_flush_path() say that it removes all entries while the loop inside that function is stopped after the first entry has been found and removed ? Why does that function use list_for_each_entry_safe() while only a single entry is removed ? Bart. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
I realize that the following patch: https://patchwork.kernel.org/patch/97243/ is queued in your backlog of patches and unlikely that it will go into 2.6.35. What are the chances that it will make it into 2.6.36? This patch has fixed a a rarely seen crash and we would like it to go upstream ASAP. How much testing and/or review have you done for this patch? What is the impact of the bug in the cases you have seen? - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Roland, I realize that the following patch: https://patchwork.kernel.org/patch/97243/ is queued in your backlog of patches and unlikely that it will go into 2.6.35. What are the chances that it will make it into 2.6.36? This patch has fixed a a rarely seen crash and we would like it to go upstream ASAP. Thanks Pradeep -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html