OpenMPI folks:
I am having trouble running a specific program (ScalIT, a code produced
and maintained by one of the research groups here at TTU) over
InfiniBand. The symptoms conflict, which has made the problem hard to
diagnose:
1) Other programs (like NWChem) run with OpenMPI over multiple nodes
using InfiniBand without any problems at all.
2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.
3) ScalIT runs with OpenMPI on a single node without error.
4) ScalIT dies at a particular place when run with OpenMPI over
multiple nodes (20).
I don't know whether it is a hardware problem (but other codes work just
fine) or a programming error in ScalIT (but it works without
modification on other clusters).
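To help narrow it down, the failing run can be compared against one that bypasses InfiniBand entirely by restricting the BTLs. This is only a sketch; the hostfile name, process count, and executable path are placeholders for this cluster's actual setup:

```shell
# Baseline: force TCP transport only, so the openib BTL is never used.
# (hostfile, -np, and executable are placeholders)
mpirun --mca btl tcp,self -np 20 -hostfile hosts ./scalit

# Comparison: restrict to openib (plus self), so the IB path is
# definitely the one carrying inter-node traffic.
mpirun --mca btl openib,self -np 20 -hostfile hosts ./scalit
```

If the TCP-only run completes where the openib run dies, that points at the IB path (hardware, firmware, or the openib BTL) rather than a logic error in ScalIT itself.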
The error I am getting is:
local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index
2232620)
[ 0] 000014bc
[ 4] 00000000
[ 8] 00000000
[ c] 00000000
[10] 026f3410
[14] 00000000
[18] 00009005
[1c] ff100000
[[44095,1],45][btl_openib_component.c:3492:handle_wc] from
compute-6-13.local to: compute-3-11 error polling LP CQ with status
LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0
vendor error 111 qp_idx 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 45 with PID 27168 on
node compute-6-13.local exiting improperly. There are two reasons this
could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.
`uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64
#1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"
`ibv_devinfo` returns:
hca_id: mthca0
transport: InfiniBand (0)
fw_ver: 1.2.0
node_guid: 0005:ad00:001f:fed8
sys_image_guid: 0005:ad00:0100:d050
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_03B0120002
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 39
port_lmc: 0x00
link_layer: IB
Any help in tracking down the problem is greatly appreciated.
--
T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061
(806) 834-0813 (voice); (806) 742-1289 (fax)