Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
On Mon, Mar 15, 2010 at 08:25:51AM -0800, Josh England wrote:
> Everything has MT264328 ConnectX cards using the mlx4_ib driver. Boot/file servers are using an HP OEM 2.7.000 firmware. Compute nodes have cards using Sun OEM 2.6.200 FW.

You probably mean MT26428?

Anyway, do you still see the post send failed messages? If you do, could you apply this patch so we'll have better insight as for the reason?
http://patchwork.kernel.org/patch/83593/
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
On Mon, Mar 15, 2010 at 11:33 PM, Eli Cohen e...@dev.mellanox.co.il wrote:
> On Mon, Mar 15, 2010 at 08:25:51AM -0800, Josh England wrote:
> > Everything has MT264328 ConnectX cards using the mlx4_ib driver. Boot/file servers are using an HP OEM 2.7.000 firmware. Compute nodes have cards using Sun OEM 2.6.200 FW.
>
> You probably mean MT26428?

Yeah...threw an extra digit in there...

> Anyway, do you still see the post send failed messages? If you do, could you apply this patch so we'll have better insight as for the reason?
> http://patchwork.kernel.org/patch/83593/

I'll throw the patch in and try to get some datagram-mode testing in soon. I haven't gone back since the CM code fix.

-JE
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
On Thu, Mar 11, 2010 at 01:38:38PM -0800, Roland Dreier wrote:
> I do worry (as Moni mentioned) that this doesn't explain why you would get send failures in this case, but the patch itself is well-explained and looks obviously correct so I think we should apply it.

It could be a problem in the hardware driver. Josh, can you tell what kind of hardware you were using?
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
good debugging, applied thanks. I do worry (as Moni mentioned) that this doesn't explain why you would get send failures in this case, but the patch itself is well-explained and looks obviously correct so I think we should apply it.
--
Roland Dreier rola...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
On Thu, 2010-03-11 at 13:38 -0800, Roland Dreier wrote:
> good debugging, applied thanks. I do worry (as Moni mentioned) that this doesn't explain why you would get send failures in this case, but the patch itself is well-explained and looks obviously correct so I think we should apply it.

Well, after more testing it seems there may still be a problem. I haven't isolated it yet though. I could definitely use help reviewing the code changes.
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
Sorry, I was referring to my patch not Eli's.

On Thu, 2010-03-11 at 13:41 -0800, Ralph Campbell wrote:
> On Thu, 2010-03-11 at 13:38 -0800, Roland Dreier wrote:
> > good debugging, applied thanks. I do worry (as Moni mentioned) that this doesn't explain why you would get send failures in this case, but the patch itself is well-explained and looks obviously correct so I think we should apply it.
>
> Well, after more testing it seems there may still be a problem. I haven't isolated it yet though. I could definitely use help reviewing the code changes.
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
> Sorry, I was referring to my patch not Eli's.

Heh, I never would have said anything about your patch was obvious. I skimmed yours once but I do want to read it more carefully. Did you ever say what test case you are using to provoke the problem you're fixing?
--
Roland Dreier rola...@cisco.com
Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
On Thu, 2010-03-11 at 13:52 -0800, Roland Dreier wrote:
> > Sorry, I was referring to my patch not Eli's.
>
> Heh, I never would have said anything about your patch was obvious. I skimmed yours once but I do want to read it more carefully. Did you ever say what test case you are using to provoke the problem you're fixing?

I think I did but it is just UDP stress tests in general. Throwing in some link failures and switching between connected and datagram modes helps too. netperf, qperf, etc. should work. Anything which causes the connected mode QP to fail should exercise the fix too.
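
Editorial sketch: the "UDP stress tests in general" mentioned in the last message could look roughly like the Python below. It is only an illustration of the idea, not the test anyone in this thread ran; the thread names netperf and qperf as the real tools. The function name `udp_stress` and its parameters are hypothetical. It sends fixed-size datagrams as fast as it can for a set time and counts how many sends the kernel rejected, which is the kind of failure that would show up as send errors on a stuck tx queue. Link failures and connected/datagram mode switches would have to be triggered separately while it runs.

```python
import socket
import time


def udp_stress(host, port, payload_size=1024, duration=1.0):
    """Blast UDP datagrams at (host, port) for `duration` seconds.

    Hypothetical helper for illustration only.
    Returns a (sent, errors) tuple: datagrams accepted by the kernel
    and sendto() calls that raised OSError.
    """
    payload = b"x" * payload_size
    sent = 0
    errors = 0
    deadline = time.monotonic() + duration
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        while time.monotonic() < deadline:
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # A full or stalled transmit path surfaces here.
                errors += 1
    return sent, errors


if __name__ == "__main__":
    # Assumption: 192.0.2.10 stands in for a peer reachable over the
    # IPoIB interface under test; replace it with a real address.
    sent, errors = udp_stress("192.0.2.10", 9000, duration=10.0)
    print(f"sent={sent} errors={errors}")
```

A steadily rising error count, or a sent count that stops growing between runs, would be the cue to check the kernel log for the send failures discussed earlier in the thread.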