Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-16 Thread Eli Cohen
On Mon, Mar 15, 2010 at 08:25:51AM -0800, Josh England wrote:
> Everything has MT264328 ConnectX cards using the mlx4_ib driver.
> Boot/file servers are using an HP OEM 2.7.000 firmware.  Compute nodes
> have cards using Sun OEM 2.6.200 FW.
 

You probably mean MT26428? Anyway, do you still see the post send
failed messages? If you do, could you apply this patch so we'll have
better insight as for the reason?
http://patchwork.kernel.org/patch/83593/
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-16 Thread Josh England
On Mon, Mar 15, 2010 at 11:33 PM, Eli Cohen e...@dev.mellanox.co.il wrote:
> On Mon, Mar 15, 2010 at 08:25:51AM -0800, Josh England wrote:
>> Everything has MT264328 ConnectX cards using the mlx4_ib driver.
>> Boot/file servers are using an HP OEM 2.7.000 firmware.  Compute nodes
>> have cards using Sun OEM 2.6.200 FW.


> You probably mean MT26428?

Yeah...threw an extra digit in there...

> Anyway, do you still see the post send
> failed messages? If you do, could you apply this patch so we'll have
> better insight as for the reason?
> http://patchwork.kernel.org/patch/83593/

I'll throw the patch in and try to get some datagram-mode testing in
soon.  I haven't gone back since the CM code fix.

-JE


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-13 Thread Eli Cohen
On Thu, Mar 11, 2010 at 01:38:38PM -0800, Roland Dreier wrote:
 
> I do worry (as Moni mentioned) that this doesn't explain why you would
> get send failures in this case, but the patch itself is well-explained
> and looks obviously correct so I think we should apply it.

It could be a problem in the hardware driver.
Josh, can you tell what kind of hardware you were using?


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Roland Dreier
good debugging, applied thanks.

I do worry (as Moni mentioned) that this doesn't explain why you would
get send failures in this case, but the patch itself is well-explained
and looks obviously correct so I think we should apply it.
-- 
Roland Dreier  rola...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Ralph Campbell
On Thu, 2010-03-11 at 13:38 -0800, Roland Dreier wrote:
> good debugging, applied thanks.
>
> I do worry (as Moni mentioned) that this doesn't explain why you would
> get send failures in this case, but the patch itself is well-explained
> and looks obviously correct so I think we should apply it.

Well, after more testing it seems there may still be a problem.
I haven't isolated it yet though. I could definitely use help
reviewing the code changes.



Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Ralph Campbell
Sorry, I was referring to my patch not Eli's.

On Thu, 2010-03-11 at 13:41 -0800, Ralph Campbell wrote:
> On Thu, 2010-03-11 at 13:38 -0800, Roland Dreier wrote:
> > good debugging, applied thanks.
> >
> > I do worry (as Moni mentioned) that this doesn't explain why you would
> > get send failures in this case, but the patch itself is well-explained
> > and looks obviously correct so I think we should apply it.
>
> Well, after more testing it seems there may still be a problem.
> I haven't isolated it yet though. I could definitely use help
> reviewing the code changes.
 


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Roland Dreier
> Sorry, I was referring to my patch not Eli's.

Heh, I never would have said anything about your patch being obvious.
I skimmed yours once but I do want to read it more carefully.

Did you ever say what test case you are using to provoke the problem you're 
fixing?
-- 
Roland Dreier  rola...@cisco.com


Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue

2010-03-11 Thread Ralph Campbell
On Thu, 2010-03-11 at 13:52 -0800, Roland Dreier wrote:
> > Sorry, I was referring to my patch not Eli's.
>
> Heh, I never would have said anything about your patch being obvious.
> I skimmed yours once but I do want to read it more carefully.
>
> Did you ever say what test case you are using to provoke the problem you're
> fixing?

I think I did but it is just UDP stress tests in general.
Throwing in some link failures and switching between connected
and datagram modes helps too. netperf, qperf, etc. should work.
Anything which causes the connected mode QP to fail should
exercise the fix too.
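
A minimal sketch of the kind of test described above, for anyone without
netperf or qperf at hand. It is not the script used in this thread: the
function names, burst sizes, and round count are illustrative assumptions.
It alternates the IPoIB mode between bursts of UDP traffic and counts how
many sends succeed.

```python
import socket
from pathlib import Path


def set_ipoib_mode(mode_file, mode):
    """Write 'connected' or 'datagram' to an IPoIB mode attribute file."""
    if mode not in ("connected", "datagram"):
        raise ValueError("unknown IPoIB mode: %s" % mode)
    Path(mode_file).write_text(mode + "\n")


def udp_blast(dest, count, size=1024):
    """Send `count` UDP datagrams of `size` bytes; return the number of successful sends."""
    payload = b"x" * size
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            try:
                sock.sendto(payload, dest)
                sent += 1
            except OSError:
                # Failed sends are what the test tries to provoke; keep going.
                pass
    return sent


def stress(dest, mode_file, rounds=4, per_round=10000):
    """Alternate between connected and datagram mode, blasting UDP after each switch."""
    results = []
    for i in range(rounds):
        mode = "connected" if i % 2 == 0 else "datagram"
        set_ipoib_mode(mode_file, mode)
        results.append((mode, udp_blast(dest, per_round)))
    return results
```

On a real system the mode file would typically be /sys/class/net/ib0/mode
(the interface name ib0 is an assumption, and writing to it needs root), and
`dest` would be the (address, port) of a peer reachable over the IPoIB
interface. Link failures still have to be injected separately.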
