[OMPI users] bizarre failure with IMB/openib
I'm trying to test some new nodes with ConnectX adaptors, and failing to get (so far just) IMB to run on them. The binary runs on the same cluster using TCP, or using PSM on some other IB nodes. A rebuilt PMB and various existing binaries work with openib on the ConnectX nodes, run exactly the same way as IMB; i.e. this seems to be something specific to IMB and openib. It seems rather bizarre, and I have no idea how to debug it in the absence of hints from a web search, i.e. why it has failed to attempt the openib BTL in this case. I can't get any openib-related information using the obvious MCA verbosity flags. Can anyone make suggestions?

I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic nodes). I'm not sure what else might be relevant. The output from trying to run IMB follows, for what it's worth.

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for MPI
communications.  This means that no Open MPI device has indicated that it
can be used to communicate between these processes.  This is an error;
Open MPI requires that all MPI processes be able to reach each other.
This error can sometimes be the result of forgetting to specify the
"self" BTL.

  Process 1 ([[25307,1],2]) is on host: lvgig116
  Process 2 ([[25307,1],12]) is on host: lvgig117
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[lvgig116:8052] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
...
[lvgig116:07931] 19 more processes have sent help message
help-mca-bml-r2.txt / unreachable proc
[lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
[lvgig116:07931] 19 more processes have sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
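For concreteness, the kind of invocation and verbosity flags meant above would be roughly as follows (a sketch only: the hostfile name, process count, and verbosity level are placeholders, not the actual command used):

```shell
# Hypothetical run of IMB with the BTL list from the post, asking the
# BTL framework to report which components it opens, selects or rejects.
mpirun -np 16 --hostfile ./hosts \
       --mca btl openib,sm,self \
       --mca btl_base_verbose 100 \
       ./IMB-MPI1 PingPong
```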
Re: [OMPI users] bizarre failure with IMB/openib
On Monday, March 21, 2011 12:25:37 pm Dave Love wrote:
> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.
...
> I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
> 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
> nodes).  I'm not sure what else might be relevant.  The output from
> trying to run IMB follows, for what it's worth.
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error;
> Open MPI requires that all MPI processes be able to reach each other.
> This error can sometimes be the result of forgetting to specify the
> "self" BTL.
>
>   Process 1 ([[25307,1],2]) is on host: lvgig116
>   Process 2 ([[25307,1],12]) is on host: lvgig117
>   BTLs attempted: self sm

Are you sure you launched it correctly and that you have (re)built OpenMPI against your RedHat 5 IB stack?

> Your MPI job is now going to abort; sorry.
...
> [lvgig116:07931] 19 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [lvgig116:07931] 19 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure

Seems to me that OpenMPI gave up because it didn't succeed in initializing any inter-node btl/mtl. I'd suggest you try (roughly in order):

1) ibstat on all nodes, to verify that your IB interfaces are up
2) a verbs-level test (like ib_write_bw), to verify that data can flow
3) making sure your OpenMPI was built with the RedHat libibverbs-devel present (=> a suitable openib btl is built).

/Peter
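The three suggestions above can be sketched as shell commands (the hostname is a placeholder; ib_write_bw comes from the OFED perftest package, and the ompi_info check is one way, under those assumptions, to confirm the openib BTL got built):

```shell
# 1) On every node: check the IB port state
ibstat                      # look for "State: Active" on the relevant port

# 2) Verbs-level bandwidth test between two nodes.
#    On the server node, start the listener:
ib_write_bw
#    On the client node (lvgig116 is a placeholder hostname):
ib_write_bw lvgig116

# 3) Confirm the openib BTL component exists in this Open MPI build
ompi_info | grep openib     # should show an "MCA btl: openib" line
```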
Re: [OMPI users] bizarre failure with IMB/openib
Peter Kjellström writes:

> Are you sure you launched it correctly and that you have (re)built OpenMPI
> against your Redhat-5 ib stack?

Yes. I had to rebuild because I'd omitted openib when we only needed psm. As I said, I did exactly the same thing successfully with PMB (initially because I wanted to try an old binary, and PMB was lying around).

>> Your MPI job is now going to abort; sorry.
> ...
>> [lvgig116:07931] 19 more processes have sent help message
>> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter
>
> Seems to me that OpenMPI gave up because it didn't succeed in initializing
> any inter-node btl/mtl.

Sure, but why won't it load the btl under IMB when it will under PMB (and other codes like XHPL), and how do I get any diagnostics?

My boss has just stumbled upon a reference while looking for something else. It looks as if it's an OFED bug entry, but I can't find an operational version of an OFED tracker, or any other reference to the bug than (the equivalent of) http://lists.openfabrics.org/pipermail/ewg/2010-March/014983.html :

  1976  maj  jsquyres at cisco.com  errors running IMB over openmpi-1.4.1

I guess Jeff will enlighten me if/when he spots this. (Thanks in advance, obviously.)
Re: [OMPI users] bizarre failure with IMB/openib
Dave Love writes:

> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.

I suspect this is https://svn.open-mpi.org/trac/ompi/ticket/1919. I'm rather surprised it isn't an FAQ (actually frequently asked, not meaning someone should have written it up).