
I've had a couple of errors recently, of the form:

[[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

My first thought was to increase the retry count, but it is already at maximum.

I've checked the connections between the two nodes, and they seem OK:

[root@tango090 ~]# ibv_rc_pingpong
  local address:  LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0
  remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2
8192000 bytes in 0.07 seconds = 996.93 Mbit/sec
1000 iters in 0.07 seconds = 65.74 usec/iter
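
As a sanity check on those numbers, here is a small Python sketch of the arithmetic. It is my own calculation, not part of ibv_rc_pingpong, and it assumes the bandwidth figure is simply total bits divided by total elapsed time. The variable names are made up for illustration; the values are copied from the output above.

```python
# Figures copied from the ibv_rc_pingpong output above.
total_bytes = 8_192_000
iterations = 1000
usec_per_iter = 65.74

# Assumption: bandwidth = total bits / total elapsed time.
elapsed_usec = iterations * usec_per_iter

# Bits per microsecond is numerically the same as Mbit/sec.
mbit_per_sec = total_bytes * 8 / elapsed_usec

print(f"{elapsed_usec / 1e6:.2f} seconds, {mbit_per_sec:.1f} Mbit/sec")
```

Under that assumption the result lands within rounding of the reported 996.93 Mbit/sec, so the two summary lines agree with each other and the link itself looks healthy in this test.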

How can I stop this happening in the future, without increasing the retry count?


        / Brett

Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899

