[OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Dave Love
I'm trying to test some new nodes with ConnectX adaptors, and failing to
get (so far just) IMB to run on them.

The binary runs on the same cluster using TCP, or using PSM on some
other IB nodes.  A rebuilt PMB and various existing binaries work with
openib on the ConnectX nodes when run in exactly the same way as IMB,
i.e. this seems to be specific to the combination of IMB and openib.

It seems rather bizarre, and in the absence of hints from a web search
I have no idea how to debug why it fails even to attempt the openib BTL
in this case.  I can't get any openib-related information using the
obvious MCA verbosity flags.  Can anyone make suggestions?
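
Concretely, the sort of invocation I mean is along these lines (a
sketch: the hostfile name and process count are just placeholders, and
btl_base_verbose is the obvious sort of verbosity flag I'm referring to):

  mpirun --mca btl openib,sm,self --mca btl_base_verbose 100 \
         -np 16 -hostfile hosts ./IMB-MPI1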

I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
nodes).  I'm not sure what else might be relevant.  The output from
trying to run IMB follows, for what it's worth.

  --
  At least one pair of MPI processes are unable to reach each other for
  MPI communications.  This means that no Open MPI device has indicated
  that it can be used to communicate between these processes.  This is
  an error; Open MPI requires that all MPI processes be able to reach
  each other.  This error can sometimes be the result of forgetting to
  specify the "self" BTL.

  Process 1 ([[25307,1],2]) is on host: lvgig116
  Process 2 ([[25307,1],12]) is on host: lvgig117
  BTLs attempted: self sm

  Your MPI job is now going to abort; sorry.
  --
  --
  It looks like MPI_INIT failed for some reason; your parallel process is
  likely to abort.  There are many reasons that a parallel process can
  fail during MPI_INIT; some of which are due to configuration or environment
  problems.  This failure appears to be an internal failure; here's some
  additional information (which may only be relevant to an Open MPI
  developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
  --
  *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.
  [lvgig116:8052] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
  *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.

  ...

  [lvgig116:07931] 19 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
  [lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
  [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure



Re: [OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Peter Kjellström
On Monday, March 21, 2011 12:25:37 pm Dave Love wrote:
> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.
...
> I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
> 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
> nodes).  I'm not sure what else might be relevant.  The output from
> trying to run IMB follows, for what it's worth.
> 
>  
> --
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error;
> Open MPI requires that all MPI processes be able to reach each other. 
> This error can sometimes be the result of forgetting to specify the "self"
> BTL.
> 
> Process 1 ([[25307,1],2]) is on host: lvgig116
> Process 2 ([[25307,1],12]) is on host: lvgig117
> BTLs attempted: self sm

Are you sure you launched it correctly and that you have (re)built OpenMPI 
against your Redhat-5 ib stack?
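
A quick way to check the second point (just a sketch; it assumes the
ompi_info from the rebuilt install is the one in your PATH) is

  ompi_info | grep openib

which should list the openib btl component if it was built in.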
 
>   Your MPI job is now going to abort; sorry.
...
>   [lvgig116:07931] 19 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter

Seems to me that OpenMPI gave up because it didn't succeed in initializing any 
inter-node btl/mtl.

I'd suggest you try (roughly in order; example commands are sketched below):

 1) ibstat on all nodes to verify that your ib interfaces are up
 2) try a verbs level test (like ib_write_bw) to verify data can flow
 3) make sure your OpenMPI was built with the redhat libibverbs-devel present
(=> a suitable openib btl is built).

/Peter

> "orte_base_help_aggregate" to 0 to see all help / error messages
> [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime
> / mpi_init:startup:internal-failure




Re: [OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Dave Love
Peter Kjellström writes:

> Are you sure you launched it correctly and that you have (re)built OpenMPI 
> against your Redhat-5 ib stack?

Yes.  I had to rebuild because I'd omitted openib when we only needed
psm.  As I said, I did exactly the same thing successfully with PMB
(initially because I wanted to try an old binary, and PMB was lying
around).

>>   Your MPI job is now going to abort; sorry.
> ...
>>   [lvgig116:07931] 19 more processes have sent help message
>> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter
>
> Seems to me that OpenMPI gave up because it didn't succeed in initializing
> any inter-node btl/mtl.

Sure, but why won't it load the btl under IMB when it will under PMB
(and other codes like XHPL), and how do I get any diagnostics?

My boss has just stumbled upon a reference while looking for something
else.  It looks as if it's an OFED bug entry, but I can't find a working
OFED tracker or any reference to the bug other than (the equivalent of)
http://lists.openfabrics.org/pipermail/ewg/2010-March/014983.html :

  1976  maj jsquyres at cisco.com   errors running IMB over openmpi-1.4.1

I guess Jeff will enlighten me if/when he spots this.  (Thanks in
advance, obviously.)



Re: [OMPI users] bizarre failure with IMB/openib

2011-03-22 Thread Dave Love
Dave Love writes:

> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.

I suspect this is https://svn.open-mpi.org/trac/ompi/ticket/1919.  I'm
rather surprised it isn't an FAQ (in the sense of being actually
frequently asked, not that someone should have written it up).