Re: [ewg] Re: Possible process deadlock in RMPP flow
Sean Hefty wrote:
> I can't find anything off in the code for this.

In the end it turned out to be a firmware (FW) issue, which is fixed in our
new 2.7.0 release.

Tziporet
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] Re: Possible process deadlock in RMPP flow
On Mon, Oct 19, 2009 at 01:30:47PM -0700, Sean Hefty wrote:
> I can't find anything off in the code for this. It's odd, since
> unregister_mad_agent() does:
>
>     flush_workqueue(port_priv->wq);
>     ib_cancel_rmpp_recvs(mad_agent_priv);
>
> and ib_cancel_rmpp_recvs() does:
>
>     spin_lock_irqsave(&agent->lock, flags);
>     list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
>         cancel_delayed_work(&rmpp_recv->timeout_work);
>         cancel_delayed_work(&rmpp_recv->cleanup_work);
>     }
>     spin_unlock_irqrestore(&agent->lock, flags);
>
>     flush_workqueue(agent->qp_info->port_priv->wq);
>
> which basically just flushes the same work queue.
>
> I haven't been able to reproduce the problem, but I'm running the latest
> kernel - not sure that matters in this case. Does ibnetdiscover just hang
> forever at the end of the test when this occurs? Is there any more
> information available?

We are checking whether the problem is a firmware bug; it looks like it is.
Once we verify this I will send an update.
RE: [ewg] Re: Possible process deadlock in RMPP flow
>>> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
>>> different problem. Once I have information whether the patch Roland
>>> posted fixed it I will update the list.
>>
>> Eli, did you find a commit that fixes the problem you reported on?
>>
>> Or.
>
> Not yet :-(

I can't find anything off in the code for this. It's odd, since
unregister_mad_agent() does:

    flush_workqueue(port_priv->wq);
    ib_cancel_rmpp_recvs(mad_agent_priv);

and ib_cancel_rmpp_recvs() does:

    spin_lock_irqsave(&agent->lock, flags);
    list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
        cancel_delayed_work(&rmpp_recv->timeout_work);
        cancel_delayed_work(&rmpp_recv->cleanup_work);
    }
    spin_unlock_irqrestore(&agent->lock, flags);

    flush_workqueue(agent->qp_info->port_priv->wq);

which basically just flushes the same work queue.

I haven't been able to reproduce the problem, but I'm running the latest
kernel - not sure that matters in this case. Does ibnetdiscover just hang
forever at the end of the test when this occurs? Is there any more
information available?

- Sean
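As an illustration of the ordering Sean describes (flush the queue, cancel the delayed work, flush the same queue again), here is a single-threaded toy model in C. It is only a sketch: every `toy_*` name is hypothetical and none of it is kernel API. It assumes that a delayed item can be cancelled only while its timer is still pending, and that a flush only has to deal with items already on the queue. Because it has no concurrency, it cannot reproduce the hang itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8

/* Life cycle of one work item in the toy model (hypothetical states). */
enum work_state { WORK_IDLE, WORK_DELAYED, WORK_QUEUED, WORK_DONE };

struct toy_wq {
    enum work_state items[MAX_WORK];
    size_t count;
};

/* Arm a delayed item: its timer is pending, it is not on the queue yet. */
size_t toy_queue_delayed(struct toy_wq *wq)
{
    assert(wq->count < MAX_WORK);
    wq->items[wq->count] = WORK_DELAYED;
    return wq->count++;
}

/* The timer expires: the item moves onto the queue. */
void toy_timer_fire(struct toy_wq *wq, size_t id)
{
    assert(id < wq->count);
    if (wq->items[id] == WORK_DELAYED)
        wq->items[id] = WORK_QUEUED;
}

/* Cancel succeeds only while the timer is still pending.
 * Returns 1 if the item was cancelled, 0 otherwise. */
int toy_cancel_delayed(struct toy_wq *wq, size_t id)
{
    assert(id < wq->count);
    if (wq->items[id] != WORK_DELAYED)
        return 0;
    wq->items[id] = WORK_IDLE;
    return 1;
}

/* Flush: complete everything currently queued.
 * Returns how many items the caller had to wait for. */
size_t toy_flush(struct toy_wq *wq)
{
    size_t ran = 0;
    for (size_t i = 0; i < wq->count; i++) {
        if (wq->items[i] == WORK_QUEUED) {
            wq->items[i] = WORK_DONE;
            ran++;
        }
    }
    return ran;
}
```

In this model, if one of two delayed items has already fired, the first flush completes it, the cancel step removes the other, and the second flush finds nothing left to wait for. Under those assumptions the second flush should return at once, which is why a process stuck there looks odd.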
Re: [ewg] Re: Possible process deadlock in RMPP flow
Eli Cohen wrote:
> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
> different problem. Once I have information whether the patch Roland
> posted fixed it I will update the list.

Eli, did you find a commit that fixes the problem you reported on?

Or.
Re: [ewg] Re: Possible process deadlock in RMPP flow
Or Gerlitz wrote:
> Eli Cohen wrote:
>> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
>> different problem. Once I have information whether the patch Roland
>> posted fixed it I will update the list.
>
> Eli, did you find a commit that fixes the problem you reported on?
>
> Or.

Not yet :-(

Tziporet
[ewg] Re: Possible process deadlock in RMPP flow
On Thu, Sep 24, 2009 at 08:53:24AM -0700, Sean Hefty wrote:
>> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
>> different problem. Once I have information whether the patch Roland
>> posted fixed it I will update the list.
>
> If ibnetdiscover doesn't use RMPP as Hal indicated, I don't think Roland's
> patch will help.

Right, it doesn't help. Still, it appears that ibnetdiscover triggers this
problem, and the lockup seems to occur in ib_cancel_rmpp_recvs(), waiting
for flush_workqueue() to return. Do you know which apps or ULPs make use of
RMPP? Any other ideas what this could be?
[ewg] Re: Possible process deadlock in RMPP flow
Eli Cohen wrote:
> On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
>> What kernel does 1.4.2 map to?
>
> I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL 5.3

Yes, the usual mess: OFED X is based on kernel Y1 but with some additions
from kernel Y2 plus plenty of unreviewed and non-merged patches. Distro Z
picks OFED X and the result is 99% unsupportable, as Roland said. Somehow
this OFED creature is still hanging around, working on the next damage
it's going to bring into this world (code name 1.5).

Eli, here's a little tip for you: I had the displeasure of resolving a
bunch of support cases originating from the fact that the two-year-old
commit below missed some OFED version (sorry, I forgot the number...).
Maybe it will help you as well?

Under a normal setting, if this commit actually solved a bug being hit by
many customers, someone would have opened a distro bugzilla case saying
"please pick this commit for your kernel", and the customers would either
have waited for the next distro update or used an intermediate distro
kernel. Currently, I understand that distros are picking OFED versions and
that's it.

Or.

commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
Author: Sean Hefty sean.he...@intel.com
Date: Fri Nov 30 17:30:18 2007 -0800

    IB/mad: Fix incorrect access to items on local_list
[ewg] Re: Possible process deadlock in RMPP flow
On Thu, Sep 24, 2009 at 09:38:43AM +0300, Or Gerlitz wrote:
> commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
> Author: Sean Hefty sean.he...@intel.com
> Date: Fri Nov 30 17:30:18 2007 -0800
>
>     IB/mad: Fix incorrect access to items on local_list

Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
different problem. Once I have information whether the patch Roland posted
fixed it I will update the list.
[ewg] RE: Possible process deadlock in RMPP flow
> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
> different problem. Once I have information whether the patch Roland
> posted fixed it I will update the list.

If ibnetdiscover doesn't use RMPP as Hal indicated, I don't think Roland's
patch will help.
[ewg] RE: Possible process deadlock in RMPP flow
> ibnetdiscover D 80149b8d 0 26968 26544 (L-TLB)
>  8102c900bd88 0046 81037e8e 81037e8e02e8 8102c900bd78 000a 8102c5b50820 81038a929820 011837bf6105 0ede 8102c5b50a08 0001
> Call Trace:
>  [80064207] wait_for_completion+0x79/0xa2
>  [8008b4cc] default_wake_function+0x0/0xe
>  [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
>  [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
>  [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
>  [80012e22] __fput+0xae/0x198
>  [80023de6] filp_close+0x5c/0x64
>  [800393df] put_files_struct+0x63/0xae
>  [80015b26] do_exit+0x31c/0x911
>  [8004971a] cpuset_exit+0x0/0x6c
>  [8005e116] system_call+0x7e/0x83
>
> From the dump it seems that the process is waiting on the call to
> flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
> OFED 1.4.2.

Roland just submitted a patch in this area yesterday. I don't know if the
patch would fix their issue, but it may be worth trying. What kernel does
1.4.2 map to?

What RMPP messages does ibnetdiscover use? If the program is completing
successfully, there may be a different race with the rmpp cleanup. I'll see
if anything else stands out in that area.

- Sean
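For reading frames such as those in the call trace quoted in this message, a small userspace helper can split each one into module, symbol, offset and function size. This is a minimal sketch: `struct frame` and `parse_frame` are hypothetical names, and the line format is assumed from the frames as they appear in this thread (bracketed address, optional `:module:` prefix, then `symbol+0xoffset/0xsize`).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One decoded call-trace frame (hypothetical helper type). */
struct frame {
    char module[64];      /* empty string for core-kernel symbols */
    char symbol[128];
    unsigned long offset; /* offset of the return address in the function */
    unsigned long size;   /* total size of the function */
};

/* Parse "[addr] symbol+0xoff/0xsize" or "[addr] :module:symbol+0xoff/0xsize".
 * Returns 1 on success, 0 if the line does not look like a frame. */
int parse_frame(const char *line, struct frame *out)
{
    assert(line != NULL && out != NULL);

    /* Skip the bracketed address and any spaces after it. */
    const char *p = strchr(line, ']');
    if (p == NULL)
        return 0;
    p++;
    while (*p == ' ')
        p++;

    out->module[0] = '\0';
    if (*p == ':') {
        /* Frame inside a loadable module. */
        return sscanf(p, ":%63[^:]:%127[^+]+%lx/%lx",
                      out->module, out->symbol,
                      &out->offset, &out->size) == 4;
    }
    /* Frame in the core kernel: no module prefix. */
    return sscanf(p, "%127[^+]+%lx/%lx",
                  out->symbol, &out->offset, &out->size) == 3;
}
```

Applied to `[882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde`, the helper yields module `ib_mad`, symbol `ib_cancel_rmpp_recvs`, offset 0x87 and size 0xde, that is, a return address 0x87 bytes into a 0xde-byte function.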
[ewg] Re: Possible process deadlock in RMPP flow
On Wed, Sep 23, 2009 at 12:08 PM, Sean Hefty sean.he...@intel.com wrote:
>> ibnetdiscover D 80149b8d 0 26968 26544 (L-TLB)
>>  8102c900bd88 0046 81037e8e 81037e8e02e8 8102c900bd78 000a 8102c5b50820 81038a929820 011837bf6105 0ede 8102c5b50a08 0001
>> Call Trace:
>>  [80064207] wait_for_completion+0x79/0xa2
>>  [8008b4cc] default_wake_function+0x0/0xe
>>  [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
>>  [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
>>  [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
>>  [80012e22] __fput+0xae/0x198
>>  [80023de6] filp_close+0x5c/0x64
>>  [800393df] put_files_struct+0x63/0xae
>>  [80015b26] do_exit+0x31c/0x911
>>  [8004971a] cpuset_exit+0x0/0x6c
>>  [8005e116] system_call+0x7e/0x83
>>
>> From the dump it seems that the process is waiting on the call to
>> flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
>> OFED 1.4.2.
>
> Roland just submitted a patch in this area yesterday. I don't know if the
> patch would fix their issue, but it may be worth trying. What kernel does
> 1.4.2 map to?
>
> What RMPP messages does ibnetdiscover use?

None AFAIK.

-- Hal

> If the program is completing successfully, there may be a different race
> with the rmpp cleanup. I'll see if anything else stands out in that area.
>
> - Sean
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[ewg] Re: Possible process deadlock in RMPP flow
On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
> Roland just submitted a patch in this area yesterday. I don't know if the
> patch would fix their issue, but it may be worth trying. What kernel does
> 1.4.2 map to?

I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL 5.3.
Thanks, we'll try this.

> What RMPP messages does ibnetdiscover use? If the program is completing
> successfully, there may be a different race with the rmpp cleanup. I'll
> see if anything else stands out in that area.
>
> - Sean