Matt Hughes wrote:
2009/2/26 Brett Pemberton <br...@vpac.org>:
[[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0

What OS are you using?
CentOS 5

I've seen this error and many other InfiniBand-related errors on Red Hat Enterprise Linux 4 Update 4, with ConnectX cards and various versions of OFED, up to version 1.3. Depending on the MCA parameters, I also see hangs often enough to make native InfiniBand unusable on this OS.
I'd appreciate some advice on whether I'm using OFED correctly. I'm running OFED 1.4, but only the userland, not the kernel modules. Is this a bad idea? Basically, I recompile the OFED src.rpms for:

dapl, libibcm, libibcommon, libibmad, libibumad, libibverbs, libmthca, librdmacm, libsdp, mstflint

and install them onto CentOS, upgrading the in-distro versions. Should I also be compiling ofa_kernel? Could this be causing problems?

As explained off-list, I'm running the most recent firmware for my cards, although the release is quite old:
hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:              0002:c902:0024:3c6c
        sys_image_guid:         0002:c902:0024:3c6f
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               MT_03B0140001
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       34
                        port_lmc:       0x00

cheers,

     / Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899