Roland,

So you are right, it is not a moving target. After repeating the IOZONE tests several times, I narrowed the culprit down to one server, on3-ib. Parallel I/O made it a bit difficult to chase down :-(
BTW, the state of the IPoIB network seemed fine after the failed test, and the mthca counters are moving up nicely. Do you still think this is a crash of the HCA firmware? Should I call Mellanox?

Thanks,
Helen

---------- Original Message -----------------
>From [EMAIL PROTECTED] Thu Oct 13 15:13:16 2005
>
>    Helen> It doesn't seem like shrinking the TCP window had helped.
>    Helen> I captured the Dmesg log from Lustre server and associated
>    Helen> client reporting IOZONE error.
>
>What is the state of the system after you start seeing the ib0
>transmit time out messages?  Does IPoIB work at all?  Is the HCA
>responsive at all -- for example what do you see if you do
>
>    cat /sys/class/infiniband/mthca0/ports/1/state
>
>or
>
>    cat /sys/class/infiniband/mthca0/ports/1/counters/*
>
>    Helen> BTW, this problem is a moving target so it is hard to
>    Helen> believe that it is hardware related(?) BTW, I am using the
>    Helen> mellanox DDR switch and HCA.
>
>Not sure what you mean by a moving target... the symptoms really look
>like a crash of the HCA firmware to me.
>
>Thanks,
>  Roland
>_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
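
For anyone who wants to repeat the "are the counters moving" check without eyeballing two `cat` runs, here is a rough sketch in Python. It reads the same sysfs files mentioned above (`state` and `counters/*` under the port directory) twice and prints which counters changed. The function names (`read_port_state`, `read_counters`, `counter_deltas`, `report`) and the 5-second interval are my own illustrative choices, not part of any OpenIB tool, and the sketch assumes each counter file holds a single integer.

```python
import os
import time


def read_port_state(port_dir):
    """Return the contents of the port's 'state' file, e.g. '4: ACTIVE'."""
    with open(os.path.join(port_dir, "state")) as f:
        return f.read().strip()


def read_counters(port_dir):
    """Read every file under <port_dir>/counters into a name -> int dict.

    Files that cannot be read or parsed as an integer are skipped
    (assumption: each counter file holds one integer value).
    """
    counters = {}
    counters_dir = os.path.join(port_dir, "counters")
    for name in sorted(os.listdir(counters_dir)):
        try:
            with open(os.path.join(counters_dir, name)) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return counters


def counter_deltas(before, after):
    """Return after - before for every counter present in both samples."""
    return {name: after[name] - before[name]
            for name in before if name in after}


def report(port_dir, interval=5.0):
    """Print the port state, then the counters that changed over `interval` seconds."""
    print("state:", read_port_state(port_dir))
    before = read_counters(port_dir)
    time.sleep(interval)
    after = read_counters(port_dir)
    for name, delta in sorted(counter_deltas(before, after).items()):
        if delta != 0:
            print(f"{name}: +{delta}" if delta > 0 else f"{name}: {delta}")
```

To use it on the server in question, call `report("/sys/class/infiniband/mthca0/ports/1")` while traffic is running. If the state file reads fine and the transmit/receive counters show nonzero deltas, the HCA is at least still responding to those queries.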