We are trying to run Open MPI with OFED-1.5 on the 2.6.31-rt11-preempt-rt kernel and see the following error:

[[45393,1],8][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc] from elm3b107 to: elm3b17 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1289846528 opcode -1782678528 vendor error 244 qp_idx 0

At this point I looked at the mlx4 diag counters and saw some non-zero values. Since we were attempting a series of runs, we don't know when the counters increased from 0. Do these counters have any correlation to the above MPI error?

[r...@elm3b17 diag_counters]# pwd
/sys/class/infiniband/mlx4_0/diag_counters
[r...@elm3b17 diag_counters]#
[r...@elm3b17 diag_counters]# cat rq_num_rnr
19
[r...@elm3b17 diag_counters]# cat rq_num_wrfe
2009
[r...@elm3b17 diag_counters]# cat sq_num_tree
12
[r...@elm3b17 diag_counters]# cat sq_num_wrfe
12
[r...@elm3b17 diag_counters]#

Similarly, here are the counters on elm3b107:

[r...@elm3b107 diag_counters]# cat rq_num_wrfe
5156
[r...@elm3b107 diag_counters]# cat sq_num_rnr
18
[r...@elm3b107 diag_counters]# cat sq_num_tree
20
[r...@elm3b107 diag_counters]# cat sq_num_wrfe
20
[r...@elm3b107 diag_counters]#

We are using ConnectX dual-port DDR HCAs (FW version 2.6).

What does the vendor error 244 mean? Any suggestions for debugging this further?

Thanks
Pradeep
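
One way to find out when the counters move would be to snapshot the diag_counters directory before and after each run and print only the values that changed. A minimal sketch in Python, assuming every entry in that directory reads back as a single integer (entries that do not are skipped); the function names and the wrapper usage are illustrative and are not part of OFED or Open MPI:

```python
import os
import subprocess
import sys

# Path taken from the shell session above; adjust for other HCAs.
DIAG_DIR = "/sys/class/infiniband/mlx4_0/diag_counters"


def read_counters(path=DIAG_DIR):
    """Return {counter_name: value} for every entry that reads as an integer."""
    counters = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue
        try:
            with open(full) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or are not a plain integer.
            continue
    return counters


def diff_counters(before, after):
    """Return {counter_name: delta} for counters whose value changed."""
    changed = {}
    for name, value in after.items():
        delta = value - before.get(name, 0)
        if delta != 0:
            changed[name] = delta
    return changed


if __name__ == "__main__":
    # Everything after the script name is treated as the command to wrap.
    before = read_counters()
    if len(sys.argv) > 1:
        subprocess.call(sys.argv[1:])
    after = read_counters()
    for name, delta in sorted(diff_counters(before, after).items()):
        print(f"{name}: {delta:+d}")
```

Saved under a hypothetical name such as diag_diff.py, it would be run on each node around a single job, e.g. `python diag_diff.py <command>`, so that any non-zero delta can be tied to that one run rather than to the whole series.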