Re: [ewg] Re: Possible process deadlock in RMPP flow

2009-10-20 Thread Tziporet Koren

Sean Hefty wrote:
I can't find anything off in the code for this.  

In the end it turned out to be a FW issue, which is fixed in our new 2.7.0 release.

Tziporet
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] Re: Possible process deadlock in RMPP flow

2009-10-20 Thread Eli Cohen
On Mon, Oct 19, 2009 at 01:30:47PM -0700, Sean Hefty wrote:
 
 I can't find anything off in the code for this.  It's odd, since
 unregister_mad_agent() does:
 
 flush_workqueue(port_priv->wq);
 ib_cancel_rmpp_recvs(mad_agent_priv);
 
 and ib_cancel_rmpp_recvs() does:
 
 spin_lock_irqsave(&agent->lock, flags);
 list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
         cancel_delayed_work(&rmpp_recv->timeout_work);
         cancel_delayed_work(&rmpp_recv->cleanup_work);
 }
 spin_unlock_irqrestore(&agent->lock, flags);
 
 flush_workqueue(agent->qp_info->port_priv->wq);
 
 which basically just flushes the same work queue.
 
 I haven't been able to reproduce the problem, but I'm running the latest kernel
 - not sure that matters in this case.  Does ibnetdiscover just hang forever at
 the end of the test when this occurs?  Is there any more information available?
 

We are checking whether the problem is a firmware bug; it looks like it is.
Once we verify this I will send an update.


RE: [ewg] Re: Possible process deadlock in RMPP flow

2009-10-19 Thread Sean Hefty
 Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
 different problem. Once I have information whether the patch Roland
 posted fixed it I will update the list.
 Eli, did you find a commit that fixes the problem you reported on?

 Or.


Not yet :-(

I can't find anything off in the code for this.  It's odd, since
unregister_mad_agent() does:

flush_workqueue(port_priv->wq);
ib_cancel_rmpp_recvs(mad_agent_priv);

and ib_cancel_rmpp_recvs() does:

spin_lock_irqsave(&agent->lock, flags);
list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
        cancel_delayed_work(&rmpp_recv->timeout_work);
        cancel_delayed_work(&rmpp_recv->cleanup_work);
}
spin_unlock_irqrestore(&agent->lock, flags);

flush_workqueue(agent->qp_info->port_priv->wq);

which basically just flushes the same work queue.

I haven't been able to reproduce the problem, but I'm running the latest kernel
- not sure that matters in this case.  Does ibnetdiscover just hang forever at
the end of the test when this occurs?  Is there any more information available?
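
For readers following the cancel-then-flush sequence quoted above, here is a rough userspace model of it. It is only a sketch and not the kernel workqueue API: every toy_* name is made up for illustration, and the `completes` flag is an assumed stand-in for a handler that never finishes. It shows the two properties the discussion relies on: cancelling removes only work that has not started, and a flush has to wait for every item still queued, so a single item that never completes keeps the flusher waiting.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical userspace stand-in for a kernel work item. */
struct toy_work {
    int pending;    /* queued and not yet run or cancelled */
    int completes;  /* 0 models a handler that never finishes */
};

/* Hypothetical stand-in for a work queue: a fixed array of items. */
struct toy_wq {
    struct toy_work *items[MAX_WORK];
    size_t count;
};

/* Queue an item; returns 1 on success, 0 if full or already pending. */
static int toy_queue_work(struct toy_wq *wq, struct toy_work *w)
{
    if (wq->count == MAX_WORK || w->pending)
        return 0;
    w->pending = 1;
    wq->items[wq->count++] = w;
    return 1;
}

/* Cancel an item that has not run yet; returns 1 if it was pending. */
static int toy_cancel_work(struct toy_work *w)
{
    int was_pending = w->pending;

    w->pending = 0;
    return was_pending;
}

/* Returns 1 if a flush would return, 0 if it would wait forever. */
static int toy_flush_would_return(struct toy_wq *wq)
{
    size_t i;

    for (i = 0; i < wq->count; i++) {
        struct toy_work *w = wq->items[i];

        if (!w->pending)
            continue;       /* cancelled earlier: nothing to wait for */
        if (!w->completes)
            return 0;       /* the flusher waits on this item indefinitely */
        w->pending = 0;     /* item ran to completion */
    }
    wq->count = 0;
    return 1;
}
```

In this model, queueing two items, cancelling both and then flushing returns immediately, which matches the expectation that the second flush in ib_cancel_rmpp_recvs() should be harmless. The flush only fails to return when some other queued item never completes.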

- Sean 



Re: [ewg] Re: Possible process deadlock in RMPP flow

2009-10-04 Thread Or Gerlitz

Eli Cohen wrote:
Thanks Or. This one is already in OFED 1.4.2 but apparently this is a 
different problem. Once I have information whether the patch Roland 
posted fixed it I will update the list.

Eli, did you find a commit that fixes the problem you reported on?

Or.




Re: [ewg] Re: Possible process deadlock in RMPP flow

2009-10-04 Thread Tziporet Koren

Or Gerlitz wrote:

Eli Cohen wrote:
Thanks Or. This one is already in OFED 1.4.2 but apparently this is a 
different problem. Once I have information whether the patch Roland 
posted fixed it I will update the list.

Eli, did you find a commit that fixes the problem you reported on?

Or.



Not yet :-(
Tziporet


[ewg] Re: Possible process deadlock in RMPP flow

2009-09-27 Thread Eli Cohen
On Thu, Sep 24, 2009 at 08:53:24AM -0700, Sean Hefty wrote:
 Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
 different problem. Once I have information whether the patch Roland
 posted fixed it I will update the list.
 
 If ibnetdiscover doesn't use RMPP as Hal indicated, I don't think Roland's patch
 will help.

Right, it doesn't help. Still, it appears that ibnetdiscover triggers
this problem, and the hang seems to occur in ib_cancel_rmpp_recvs(),
waiting for flush_workqueue() to return. Do you know which apps or
ULPs make use of RMPP?
Any other ideas what this could be?


[ewg] Re: Possible process deadlock in RMPP flow

2009-09-24 Thread Or Gerlitz
Eli Cohen wrote:
 On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
 What kernel does 1.4.2 map to?
 I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL 5.3

Yes, the usual mess: ofed X is based on kernel Y1 but with some additions from 
kernel Y2 plus plenty of unreviewed and non-merged patches. Distro Z picks ofed 
X and the result is 99% unsupportable as Roland said. Somehow this ofed 
creature is still hanging around, working on the next damage it's going to 
bring into this world (code name 1.5)

Eli, here's a little tip for you: I had the displeasure of resolving a bunch of 
support cases originating from the fact that the 2-year-old commit below 
was missing from some ofed version (sorry, forgot the number...). Maybe it would 
help you as well?

Under a normal setting, if this commit actually solved a bug being hit by many 
customers, someone would have opened a distro bugzilla case saying, please 
pick this commit for your kernel, and the customers would have either waited for 
the next distro update or used a distro intermediate kernel. Currently, I 
understand that distros are picking ofed versions and that's it.

Or.

commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
Author: Sean Hefty sean.he...@intel.com
Date:   Fri Nov 30 17:30:18 2007 -0800

IB/mad: Fix incorrect access to items on local_list




[ewg] Re: Possible process deadlock in RMPP flow

2009-09-24 Thread Eli Cohen
On Thu, Sep 24, 2009 at 09:38:43AM +0300, Or Gerlitz wrote:
 
 commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
 Author: Sean Hefty sean.he...@intel.com
 Date:   Fri Nov 30 17:30:18 2007 -0800
 
 IB/mad: Fix incorrect access to items on local_list
 
Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
different problem. Once I have information whether the patch Roland
posted fixed it I will update the list.


[ewg] RE: Possible process deadlock in RMPP flow

2009-09-24 Thread Sean Hefty
Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
different problem. Once I have information whether the patch Roland
posted fixed it I will update the list.

If ibnetdiscover doesn't use RMPP as Hal indicated, I don't think Roland's patch
will help.



[ewg] RE: Possible process deadlock in RMPP flow

2009-09-23 Thread Sean Hefty
ibnetdiscover D 80149b8d 0 26968  26544
(L-TLB)
 8102c900bd88 0046 81037e8e 81037e8e02e8
 8102c900bd78 000a 8102c5b50820 81038a929820
 011837bf6105 0ede 8102c5b50a08 0001
Call Trace:
 [80064207] wait_for_completion+0x79/0xa2
 [8008b4cc] default_wake_function+0x0/0xe
 [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
 [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
 [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
 [80012e22] __fput+0xae/0x198
 [80023de6] filp_close+0x5c/0x64
 [800393df] put_files_struct+0x63/0xae
 [80015b26] do_exit+0x31c/0x911
 [8004971a] cpuset_exit+0x0/0x6c
 [8005e116] system_call+0x7e/0x83

From the dump it seems that the process is waiting on the call to
flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
OFED 1.4.2.
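
As an aid to reading the trace above: each frame has the form symbol+0xoffset/0xsize, where the offset is the position inside the function and the size is the function's length, and frames from loadable modules carry a :module: prefix. The small userspace helper below is a sketch for splitting such a frame into its parts. The names `trace_sym` and `parse_trace_sym` are hypothetical and not part of any kernel tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one decoded call-trace frame. */
struct trace_sym {
    char module[32];        /* empty for core kernel symbols */
    char name[64];          /* function name */
    unsigned long offset;   /* offset into the function */
    unsigned long size;     /* total size of the function */
};

/* Parse "[:module:]symbol+0xOFF/0xSIZE"; returns 1 on success, 0 otherwise. */
static int parse_trace_sym(const char *s, struct trace_sym *out)
{
    out->module[0] = '\0';

    if (*s == ':') {
        /* Module frames look like ":ib_mad:ib_cancel_rmpp_recvs+0x87/0xde". */
        const char *end = strchr(s + 1, ':');
        size_t len;

        if (!end)
            return 0;
        len = (size_t)(end - (s + 1));
        if (len >= sizeof(out->module))
            return 0;
        memcpy(out->module, s + 1, len);
        out->module[len] = '\0';
        s = end + 1;
    }

    /* %lx accepts the leading "0x"; the name stops at the '+' sign. */
    return sscanf(s, "%63[^+]+%lx/%lx",
                  out->name, &out->offset, &out->size) == 3;
}
```

Applied to the ib_cancel_rmpp_recvs frame, this yields module ib_mad, offset 0x87 and size 0xde, i.e. the task is blocked partway through that function, in the wait_for_completion() call reached from it.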

Roland just submitted a patch in this area yesterday.  I don't know if the patch
would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
to?

What RMPP messages does ibnetdiscover use?  If the program is completing
successfully, there may be a different race with the rmpp cleanup.  I'll see if
anything else stands out in that area.

- Sean



[ewg] Re: Possible process deadlock in RMPP flow

2009-09-23 Thread Hal Rosenstock
On Wed, Sep 23, 2009 at 12:08 PM, Sean Hefty sean.he...@intel.com wrote:

 ibnetdiscover D 80149b8d 0 26968  26544
 (L-TLB)
  8102c900bd88 0046 81037e8e 81037e8e02e8
  8102c900bd78 000a 8102c5b50820 81038a929820
  011837bf6105 0ede 8102c5b50a08 0001
 Call Trace:
  [80064207] wait_for_completion+0x79/0xa2
  [8008b4cc] default_wake_function+0x0/0xe
  [882271d9] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
  [88224485] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
  [883983e9] :ib_umad:ib_umad_close+0x9d/0xd6
  [80012e22] __fput+0xae/0x198
  [80023de6] filp_close+0x5c/0x64
  [800393df] put_files_struct+0x63/0xae
  [80015b26] do_exit+0x31c/0x911
  [8004971a] cpuset_exit+0x0/0x6c
  [8005e116] system_call+0x7e/0x83
 
 From the dump it seems that the process is waiting on the call to
 flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
 OFED 1.4.2.

 Roland just submitted a patch in this area yesterday.  I don't know if the patch
 would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
 to?

 What RMPP messages does ibnetdiscover use?


None AFAIK.

-- Hal


   If the program is completing
 successfully, there may be a different race with the rmpp cleanup.  I'll see if
 anything else stands out in that area.

 - Sean

 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ewg] Re: Possible process deadlock in RMPP flow

2009-09-23 Thread Eli Cohen
On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
 
 Roland just submitted a patch in this area yesterday.  I don't know if the patch
 would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
 to?
I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL
5.3.
Thanks, we'll try this.

 
 What RMPP messages does ibnetdiscover use?  If the program is completing
 successfully, there may be a different race with the rmpp cleanup.  I'll see if
 anything else stands out in that area.
 
 - Sean
 