On May 13, 2009, at 4:55 PM, Åke Sandgren wrote:

I'm having a problem: I get the "error polling LP CQ with status
RNR..." error on an otherwise completely empty system.
There are no errors visible in the error counters in any of the HCAs or
switches or anywhere else.

I'm running OMPI 1.3.2 built with pathscale 3.2

If I add -mca btl 'ofud,self,sm' the same code works OK.


Interesting. I have only done very limited testing with ofud; are you saying that you get these errors if you "--mca btl openib,sm,self"?
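
For reference, the two BTL selections being compared can be written out side by side. This is only a sketch: the application name `./my_mpi_app` is hypothetical, and the rank count of 128 is an assumption derived from the nodes=16:ppn=8 layout mentioned in this thread. The snippet only prints the command lines so that they can be pasted into a batch job; it does not launch anything itself.

```shell
# Hypothetical application and rank count (assumed: 16 nodes x 8 ppn = 128 ranks).
APP=./my_mpi_app
NP=128

# Selection reported to work: the UD-based ofud BTL.
CMD_OFUD="mpirun -np $NP --mca btl ofud,self,sm $APP"

# Selection being asked about: the openib BTL chosen explicitly.
CMD_OPENIB="mpirun -np $NP --mca btl openib,sm,self $APP"

# Dry run: print both command lines instead of executing them.
echo "$CMD_OFUD"
echo "$CMD_OPENIB"
```

Running both variants on the same node allocation would show whether the RNR error follows the openib BTL specifically or also appears with the default BTL selection.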

It usually only shows up on runs with nodes=16:ppn=8 or higher, i.e. 8x8
works ok.

This might very well be a pathscale problem since when running with the
debug version of ompi 1.3.2 the problem goes away.

Complete error is:
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 465284992 opcode -1 vendor error 135 qp_idx 0

Any ideas as to where in the ompi code I should start reducing optimization
levels to pinpoint this?



Do you have a simple reproducer test case, perchance?

--
Jeff Squyres
Cisco Systems

