We have too many discussion threads overlapping on the same email chain - so let’s break the discussion on the OFI problem into its own chain.
We have been investigating this locally and found a number of conflicts between the MTLs and the OFI/BTL, which step on each other. The correct solution is to move endpoint creation/reporting into the opal/mca/common area, but that is going to take some work and will likely impact release schedules.

Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix the problem in master, and then consider bringing it back as a package to v4.1 or v4.2.

Comments? If we agree, I’ll file a PR to remove it.

Ralph

> Begin forwarded message:
>
> From: Peter Kjellström <c...@nsc.liu.se>
> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
> Date: September 20, 2018 at 5:18:35 AM PDT
> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
> Cc: Open MPI Developers <devel@lists.open-mpi.org>
> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>
> On Wed, 19 Sep 2018 16:24:53 +0000
> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>
>> I performed some tests on our Omnipath cluster, and I have a mixed
>> bag of results with 4.0.0rc1
>
> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
> very similar results.
>
>> compute-1-1.local.4351PSM2 has not been initialized
>> compute-1-0.local.3826PSM2 has not been initialized
>
> yup I too see these.
>
>> mpirun detected that one or more processes exited with non-zero
>> status, thus causing the job to be terminated. The first process to
>> do so was:
>>
>> Process name: [[38418,1],1]
>> Exit code: 255
>>
>> ----------------------------------------------------------------------------
>
> yup.
>
>>
>> 2. The ofi mtl does not work at all on our Omnipath cluster. If
>> I try to force it using ‘mpirun –mca mtl ofi …’ I get the following
>> error message.
>
> Yes ofi seems broken. But not even disabling it helps me completely (I
> see "mca_btl_ofi.so [.] mca_btl_ofi_component_progress" in my
> perf top...
>
>> 3. The openib btl component is always getting in the way with
>> annoying warnings. It is not really used, but constantly complains:
> ...
>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>> help-mpi-btl-openib.txt / ib port not selected
>
> Yup.
>
> ...
>> So bottom line, if I do
>>
>> mpirun –mca btl^openib –mca mtl^ofi ….
>>
>> my tests finish correctly, although mpirun will still return an error.
>
> I get some things to work with this approach (two ranks on two nodes
> for example). But a lot of things crash rather hard:
>
> $ mpirun -mca btl ^openib -mca mtl
> ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
> --------------------------------------------------------------------------
> PSM2 was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
>
> Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> n909.279895hfi_userinit: assign_context command failed: Device or
> resource busy n909.279895psmi_context_open: hfi_userinit: failed,
> trying again (1/3)
> ...
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [n908:298761] *** An error occurred in MPI_Init
> [n908:298761] *** reported by process [4092002305,59]
> [n908:298761] *** on a NULL communicator
> [n908:298761] *** Unknown error
> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort, [n908:298761] *** and potentially your MPI job)
> [n907:407748] 255 more processes have sent help message
> help-mtl-psm2.txt / unable to open endpoint [n907:407748] Set MCA
> parameter "orte_base_help_aggregate" to 0 to see all help / error
> messages [n907:407748] 127 more processes have sent help message
> help-mpi-runtime.txt / mpi_init:startup:internal-failure
> [n907:407748] 56 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>
> If I disable psm2 too I get it to run (apparently on vader?)
>
> /Peter K
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel