[MTT users] RETRY EXCEEDED ERROR

2008-07-31 Thread Rafael Folco
Hi, I need some help, please. I'm running a set of MTT tests on my cluster and I'm having issues on one particular node:
[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 268

Re: [MTT users] RETRY EXCEEDED ERROR

2008-07-31 Thread Pavel Shamis (Pasha)
The "RETRY EXCEEDED ERROR" error is related to IB, not MTT. It says that IB failed to send a packet from machine 10.2.1.90 to 10.2.1.50. You need to run your IB network monitoring tool and find the issue. Usually a bad cable in the IB fabric causes such errors. Regards, P
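
When many such lines show up in test output, it can help to tally which source/destination pairs are failing before inspecting cables. The following is only a rough sketch: the regular expression is modeled on the single error line quoted in this thread, and the names `RETRY_RE`, `parse_retry_error`, `count_failing_pairs`, and `SAMPLE_LINE` are hypothetical helpers, not part of MTT or Open MPI. Other Open MPI versions may word the message differently.

```python
import re
from collections import Counter

# Assumption: the pattern follows the one error line quoted in this thread.
RETRY_RE = re.compile(
    r"from (?P<src>\d+\.\d+\.\d+\.\d+) to: (?P<dst>\d+\.\d+\.\d+\.\d+) "
    r"error polling HP CQ with status (?P<status>.+?) "
    r"status number (?P<num>\d+)"
)

# The error line exactly as reported in the original message.
SAMPLE_LINE = (
    "[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] "
    "from 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status "
    "RETRY EXCEEDED ERROR status number 12 for wr_id 268"
)


def parse_retry_error(line):
    """Return (src, dst, status, status_number), or None if the line does not match."""
    m = RETRY_RE.search(line)
    if m is None:
        return None
    return m.group("src"), m.group("dst"), m.group("status"), int(m.group("num"))


def count_failing_pairs(lines):
    """Count how often each (src, dst) pair appears in matching error lines."""
    pairs = Counter()
    for line in lines:
        parsed = parse_retry_error(line)
        if parsed is not None:
            src, dst, _status, _num = parsed
            pairs[(src, dst)] += 1
    return pairs


if __name__ == "__main__":
    print(parse_retry_error(SAMPLE_LINE))
    print(count_failing_pairs([SAMPLE_LINE, "unrelated output line"]))
```

If a single pair dominates the tally, that points at the link or ports between those two hosts rather than at either host in general.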

Re: [MTT users] RETRY EXCEEDED ERROR

2008-07-31 Thread Rafael Folco
Thanks for the response, Pasha. Yes, I agree this is some issue with the IB network. I came to the list looking for some previous experience of other users... I wonder why 10.2.1.90 works with all other nodes, 10.2.1.50 works with all other nodes as well, but they can't work together. Maybe OFED li

Re: [MTT users] RETRY EXCEEDED ERROR

2008-07-31 Thread Jeff Squyres
On Jul 31, 2008, at 12:42 PM, Rafael Folco wrote: Thanks for the response, Pasha. Yes, I agree this is some issue with the IB network. I came to the list looking for some previous experience of other users... I wonder why 10.2.1.90 works with all other nodes, 10.2.1.50 works with all other nod