The test was indeed fine; I have also verified it on 30 processors. Meanwhile I tried OMPI 1.3RC2, with which the application fails over InfiniBand; I hope this gives some clue (or at least proves useful for finalising the Open MPI 1.3 release). As a reminder, I use the OFED 1.2.5 release. The only change with respect to last time is the use of OMPI 1.3RC2 instead of 1.2.8. To avoid boring the list, I will not repeat details I have already provided (such as the command-line parameters), on which we seem to have agreed there is no problem. However, if you want to know more, please ask.

The error file as produced by SGE is attached.

Thanks,
Biagio

Lenny Verkhovsky wrote:
Hi, just to make sure:

you wrote in the previous mail that you tested IMB-MPI1 and it
"reports for the last test" ..., with the results given for
"#processes = 6". Since you have 4- and 8-core machines, that test
could have run entirely on a single 8-core machine over shared memory,
rather than over InfiniBand as you assumed.

You can rerun the IMB-MPI1 test with -mca btl self,openib to make sure
that the test does not use shared memory or tcp.
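
For instance, assuming the IMB binary sits in the current directory and
the job has six nodes available (both placeholders), an invocation along
these lines runs one rank per node and restricts the BTLs to self and
openib:

  mpirun -np 6 --bynode -mca btl self,openib ./IMB-MPI1 Barrier

With one rank per node, every Barrier message has to cross the
InfiniBand fabric, so a clean run really does exercise the openib BTL.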

Lenny.



On 12/24/08, Biagio Lucini <b.luc...@swansea.ac.uk> wrote:
Pavel Shamis (Pasha) wrote:

Biagio Lucini wrote:

Hello,

I am new to this list, where I hope to find a solution to a problem
that I have been having for quite a long time.

I run various versions of Open MPI (from 1.1.2 to 1.2.8) on a cluster
with InfiniBand interconnects that I both use and administer. The
OpenFabrics stack is OFED-1.2.5, and the compilers are gcc 4.2 and
Intel. The queue manager is SGE 6.0u8.

Do you use the Open MPI version that is included in OFED? Were you able
to run basic OFED/OMPI tests/benchmarks between two nodes?
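
(A typical verbs-level check between two nodes, with nodeA/nodeB as
placeholder host names, would be something along the lines of

   nodeA$ ibv_rc_pingpong           # server side, waits for a connection
   nodeB$ ibv_rc_pingpong nodeA     # client side, connects to nodeA

followed by an MPI-level benchmark such as IMB run over the openib BTL.)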


 Hi,

 Yes to both questions: the OMPI version is the one that comes with OFED
(1.1.2-1), and the basic tests run fine. For instance, IMB-MPI1 (which is
more than basic, as far as I can see) reports for the last test:

 #---------------------------------------------------
 # Benchmarking Barrier
 # #processes = 6
 #---------------------------------------------------
  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        22.93        22.95        22.94


 for the openib,self btl (6 processes, all processes on different nodes)
 and

 #---------------------------------------------------
 # Benchmarking Barrier
 # #processes = 6
 #---------------------------------------------------
  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000       191.30       191.42       191.34

 for the tcp,self btl (same test)

 No anomalies for other tests (ping-pong, all-to-all etc.)

 Thanks,
 Biagio


 --
 =========================================================

 Dr. Biagio Lucini
 Department of Physics, Swansea University
 Singleton Park, SA2 8PP Swansea (UK)
 Tel. +44 (0)1792 602284

 =========================================================

[[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],12][btl_openib_component.c:2893:handle_wc] from node23 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],8][btl_openib_component.c:2893:handle_wc] from node9 to: node11 error 
polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 
13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],11][btl_openib_component.c:2893:handle_wc] from node20 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],9][btl_openib_component.c:2893:handle_wc] from node18 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],4][btl_openib_component.c:2893:handle_wc] from node13 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],3][btl_openib_component.c:2893:handle_wc] from node12 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],6][btl_openib_component.c:2893:handle_wc] from node15 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],1][btl_openib_component.c:2893:handle_wc] from node10 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],7][btl_openib_component.c:2893:handle_wc] from node16 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],5][btl_openib_component.c:2893:handle_wc] from node14 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],10][btl_openib_component.c:2893:handle_wc] from node21 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],14][btl_openib_component.c:2893:handle_wc] from node19 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],2][btl_openib_component.c:2893:handle_wc] from node10 to: node11 
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status 
number 13 for wr_id 37779456 opcode 0 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded.  In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.

This error usually means one of two things:

1. There is something awry within the network fabric itself.  
2. A bug in Open MPI has caused flow control to malfunction.

#1 is usually more likely.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   node24
  Local device: mthca0
  Peer host:    node11

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 18133 on
node node13.cluster exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
[node11:21331] 13 more processes have sent help message help-mpi-btl-openib.txt 
/ pp rnr retry exceeded
[node11:21331] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
