Bogdan Costescu wrote:

Brett Pemberton <br...@vpac.org> wrote:

[[1176,1],0][btl_openib_component.c:2905:handle_wc] from
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0

I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with
all versions of OpenMPI that I have tried (1.2.x and pre-1.3) and some
MVAPICH versions, from which I have concluded that the problem lies in
the lower levels (OFED or IB card firmware). Indeed after the
installation of OFED 1.3.x and a possible firmware update (not sure
about the firmware as I don't admin that cluster), these errors have
disappeared.


I can confirm this: I had a similar problem over Christmas, for which I asked for help on this list. The problem turned out to be not with OpenMPI but with the OFED stack: upgrading OFED and the firmware (the OFED drivers had been complaining that the firmware was too old) fixed it. We did both upgrades at once, so, as in Brett's case, I am not sure which one played the major role.

Biagio

--
=========================================================

Dr. Biagio Lucini                               
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284

=========================================================
