I just posted a patch which might fix your problem. Please try it and let us know if it fixed anything.
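
While testing, one thing worth checking: in the connected-mode log quoted below, tx_head and tx_tail are identical in every report, which suggests the send queue is not draining at all between watchdog firings. Here is a minimal Python sketch for checking a saved dmesg excerpt for that pattern. The helper names are made up for illustration, and the regular expression assumes the exact message format quoted below.

```python
import re

# Matches lines such as:
#   ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
# The format is assumed from the quoted log, not taken from the driver source.
QUEUE_RE = re.compile(
    r"(?P<dev>\w+): queue stopped (?P<stopped>\d+), "
    r"tx_head (?P<head>\d+), tx_tail (?P<tail>\d+)"
)


def parse_queue_reports(log_text):
    """Return (dev, stopped, tx_head, tx_tail) tuples found in log_text."""
    return [
        (
            m.group("dev"),
            int(m.group("stopped")),
            int(m.group("head")),
            int(m.group("tail")),
        )
        for m in QUEUE_RE.finditer(log_text)
    ]


def counters_stalled(reports):
    """True if there are at least two reports and all carry the same counters."""
    distinct = {(head, tail) for _, _, head, tail in reports}
    return len(reports) >= 2 and len(distinct) == 1


if __name__ == "__main__":
    sample = (
        "ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464\n"
        "ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464\n"
    )
    reports = parse_queue_reports(sample)
    print("stalled" if counters_stalled(reports) else "progressing")
```

If the counters do change between reports after applying the patch, that would be useful to mention when you report back.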

On Tue, Mar 02, 2010 at 01:54:09PM -0800, Josh England wrote:
> Hello,
>
> I've been running into several issues using IPoIB. The 2 primary uses
> are for read-only NFS to the clients (over TCP) and access to an
> ethernet-connected parallel filesystem (Panasas) through router nodes
> passing IPoIB<-->10GbE.
>
> All nodes are running CentOS 5.3 and OFED 1.4.2, although I have played
> with OFED 1.5 and seen similar results. Client nodes mount their NFS
> root from boot servers via IPoIB with a ratio of 80:1. The boot servers
> are the ones that seem to have issues. The fabric itself consists of
> ~1000 nodes interconnected such that there is 2:1 oversubscription
> within any single rack, and 20:1 oversubscription between racks
> (through the core switch). I don't know how much the oversubscription
> comes into play here as I can reproduce the error within a single rack.
>
> In datagram mode, I see errors on the boot servers of the form:
>
>   ib0: post_send failed
>   ib0: post_send failed
>   ib0: post_send failed
>
> When using connected mode, I hit a different error:
>
>   NETDEV WATCHDOG: ib0: transmit timed out
>   ib0: transmit timeout: latency 1999 msecs
>   ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>   NETDEV WATCHDOG: ib0: transmit timed out
>   ib0: transmit timeout: latency 2999 msecs
>   ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>   ...
>   ...
>   NETDEV WATCHDOG: ib0: transmit timed out
>   ib0: transmit timeout: latency 61824999 msecs
>   ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>
> The errors seem to hit only after NFS comes into play. Once it
> starts, the NETDEV WATCHDOG messages continue until I run
> 'ifconfig ib0 down up'. I've tried tuning send_queue_size and
> recv_queue_size on both sides, the txqueuelen of the ib0 interface, and
> the NFS rsize/wsize. None of it seems to help greatly. Does anyone have
> any ideas about what I can do to try to fix these problems?
>
> -JE
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html