Roland,

So you are right, it is not a moving target.  After repeating 
the IOZONE tests several times, I narrowed the culprit down to
one server, on3-ib.  Parallel I/O had made it a bit difficult to 
chase down :-(  

BTW, the state of the IPoIB network seemed fine after the failed
test, and the mthca counters are moving up nicely.  Do you still 
think this is a crash of the HCA firmware?  Should I call Mellanox? 
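
For reference, a minimal sketch of how "counters moving up" could be
checked: read every counter file twice and print the difference.  The
sysfs directory is the one from the cat command quoted below; the
function names and the 5-second interval are illustrative assumptions,
not part of any existing tool.

```python
import os
import time


def read_counters(counter_dir):
    """Return {counter_name: value} for each file holding a plain integer."""
    values = {}
    for name in sorted(os.listdir(counter_dir)):
        path = os.path.join(counter_dir, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or are not plain integers.
            continue
    return values


def counter_deltas(before, after):
    """Return after - before for every counter present in both snapshots."""
    return {name: after[name] - before[name]
            for name in before if name in after}


def watch(counter_dir, interval=5.0):
    """Take two snapshots `interval` seconds apart and print the changes."""
    before = read_counters(counter_dir)
    time.sleep(interval)
    after = read_counters(counter_dir)
    for name, delta in sorted(counter_deltas(before, after).items()):
        print(f"{name}: {delta:+d}")
```

On the server this would be called as
watch("/sys/class/infiniband/mthca0/ports/1/counters").  Under this
sketch's assumptions, all-zero deltas while traffic is running would
suggest the port has stopped passing data.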

Thanks,
Helen


---------- Original Message -----------------
>From [EMAIL PROTECTED] Thu Oct 13 15:13:16 2005
>
>    Helen> It doesn't seem like shrinking the TCP window had helped.
>    Helen> I captured the Dmesg log from Lustre server and associated
>    Helen> client reporting IOZONE error.
>
>What is the state of the system after you start seeing the ib0
>transmit time out messages?  Does IPoIB work at all?  Is the HCA
>responsive at all -- for example what do you see if you do
>
>  cat /sys/class/infiniband/mthca0/ports/1/state
>
>or
>
>  cat /sys/class/infiniband/mthca0/ports/1/counters/*
>
>    Helen> BTW, this problem is a moving target so it is hard to
>    Helen> believe that it is hardware related(?)  BTW, I am using the
>    Helen> mellanox DDR switch and HCA.
>
>Not sure what you mean by a moving target... the symptoms really look
>like a crash of the HCA firmware to me.
>
>Thanks,
>  Roland
>
_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
