OpenMPI folks:

I am having trouble running a specific program (ScalIT, a code produced and maintained by one of the research groups here at TTU) over InfiniBand. The information I have is conflicting, which has so far made it impossible for me to diagnose the problem:

1) Other programs (like NWChem) run with OpenMPI over multiple nodes using InfiniBand without any problems at all.

2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.

3) ScalIT runs with OpenMPI on a single node without error.

4) ScalIT dies at a particular place when run with OpenMPI over multiple nodes (20).

I don't know whether it is a hardware problem (but other codes work just fine) or a programming error in ScalIT (but it works without modification on other clusters).
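
A minimal cross-node test along the following lines might help separate the two (this is only a sketch of my own, not taken from ScalIT; the 32 MB message size and 100 iterations are arbitrary). If it dies over InfiniBand across nodes but runs clean on a single node, the fabric/HCA side looks suspect; if it runs clean everywhere, ScalIT's communication pattern is more likely the culprit.

/* Simple even->odd rank large-message test over whatever BTL OpenMPI picks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 4 * 1024 * 1024;               /* 4M doubles = 32 MB */
    double *buf = malloc((size_t)count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    for (int iter = 0; iter < 100; iter++) {
        if (rank % 2 == 0 && rank + 1 < size)
            MPI_Send(buf, count, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        else if (rank % 2 == 1)
            MPI_Recv(buf, count, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("cross-node send/recv test completed\n");

    free(buf);
    MPI_Finalize();
    return 0;
}

I would build this with the same mpicc wrapper and launch it with the same hostfile and mpirun options that ScalIT gets.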

The error I am getting is:
local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index 2232620)
  [ 0] 000014bc
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 026f3410
  [14] 00000000
  [18] 00009005
  [1c] ff100000
[[44095,1],45][btl_openib_component.c:3492:handle_wc] from compute-6-13.local to: compute-3-11 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0 vendor error 111 qp_idx 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 45 with PID 27168 on
node compute-6-13.local exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
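
For reference, the init/finalize rule that message describes amounts to the pairing below (a minimal C illustration, not ScalIT code); from the QP error above it looks like rank 45 died between the two calls rather than never making them.

/* Every rank must call MPI_Init before any communication and MPI_Finalize
 * before exiting; a rank that aborts between the two (as rank 45 apparently
 * did here) produces exactly this "exiting improperly" report. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);               /* all ranks call this first */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d alive\n", rank);      /* application work goes here */

    MPI_Finalize();                       /* and all ranks call this last */
    return 0;
}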

I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.

`uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"

ibv_devinfo returns
hca_id: mthca0
        transport:                      InfiniBand (0)
        fw_ver:                         1.2.0
        node_guid:                      0005:ad00:001f:fed8
        sys_image_guid:                 0005:ad00:0100:d050
        vendor_id:                      0x02c9
        vendor_part_id:                 25204
        hw_ver:                         0xA0
        board_id:                       MT_03B0120002
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               39
                        port_lmc:               0x00
                        link_layer:             IB


Any help in tracking down the problem is greatly appreciated.

--
T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)
