On Thursday (11/15/2007 at 12:23PM -0800), [EMAIL PROTECTED] wrote:
>
> We have a large (~1800 node) IB cluster of x86_64 machines, and
> we're having some significant problems with IPoIB.
>
> The thing that all the IPoIB failures have in common seems to be
> an appearance of a "CQ overrun" in syslog, e.g.:
>
> ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
>
> From there things go badly in different ways - tx_timeouts,
> oopses, etc. Sometimes things just start working again after
> a few minutes.
>
> The appearance of these failures seems to be well correlated
> with the size of the machine. I don't think there are any problems
> until the machine is built up to about its maximum size, and
> then they become pretty common.
>
> We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.
>
> Does this ring a bell with anyone?
I can perhaps elaborate a little more on the test case we are using to
expose this situation...

On 1024 (or more) nodes, nttcp -i is started as a "tcp socket server".
Eight copies are started, each on a different tcp port (5000 ... 5007).
On another client node, as few as 1024 and as many as 8192 nttcp clients
are launched from that node to all of the 1024 others. We can have one
connection between the client and each node, or we can have eight
connections between the client and each node. The nttcp test is run for
120 secs, and in these scenarios all connections get established, nttcp
moves data, and never fails. We get the expected performance.

If the node count is increased to 1152, things start to become
unreliable. We will see connections fail to be established when we try
to do 8 per node. If we do one per node, they will all establish and
run. In fact, we can do one per node across 1664 nodes and that will
succeed also. So the problem seems to be related to the total number of
nodes on the fabric as well as how many TCP connections you try to
establish to each node.

One is tempted to believe it is a problem at the single node that is
opening all of these connections to the others... but the failure
occurs on the nodes being connected to -- the nttcp servers -- with the
CQ overrun and TX WATCHDOG TIMEOUTS, etc. The final outcome is that we
lose all TCP connectivity over IB to the affected nodes for some period
of time. Sometimes they come back, sometimes they don't, and sometimes
it's seconds and sometimes it's minutes before they come back. Not very
deterministic.

cje
--
Chris Elmquist              mailto:[EMAIL PROTECTED]  (651)683-3093
Silicon Graphics, Inc.      Eagan, MN
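
A minimal sketch of how a fan-out test along these lines could be driven,
assuming ssh access to the target nodes and that the local nttcp build
accepts a -p <port> option; only the "nttcp -i" server mode, the
5000-5007 port range, and the 1-vs-8-connections-per-node cases come
from the description above -- the host naming and client argument order
are placeholders:

#!/usr/bin/env python
# Hypothetical driver for a test like the one described above: start
# eight "nttcp -i" servers (ports 5000-5007) on every target node, then
# fan out client connections from this single node. Host names, ssh
# usage, and the -p port flag are assumptions, not taken from the mail.

import subprocess

NODES = ["node%04d" % n for n in range(1, 1153)]   # e.g. 1152 targets
PORTS = range(5000, 5008)                          # 8 servers per node

def start_servers():
    # Launch one "nttcp -i" per port on each target node via ssh.
    # (Illustrative only; a real harness would background and rate-limit
    # these ssh sessions.)
    for node in NODES:
        for port in PORTS:
            subprocess.Popen(
                ["ssh", node, "nttcp", "-i", "-p", str(port)])

def start_clients(conns_per_node):
    # From this node, open conns_per_node connections to every server
    # node, one per port, and wait for all of them to finish.
    procs = []
    for node in NODES:
        for port in list(PORTS)[:conns_per_node]:
            procs.append(subprocess.Popen(
                ["nttcp", "-p", str(port), node]))
    for p in procs:
        p.wait()

if __name__ == "__main__":
    start_servers()
    start_clients(conns_per_node=8)   # 8 conns/node is where it breaks

Sweeping conns_per_node between 1 and 8, and the node count between 1024
and 1664, covers the working and failing cases described above.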
