I suspect it is a question of what you tested and in which scenarios. The problem is that it can bite someone, and there isn't a clean/obvious solution that doesn't require the user to do something - e.g., having to know that they need to disable a BTL. Matias has proposed an mca-based approach, but I would much rather we just fix this correctly. Bandaids have a habit of becoming permanently forgotten - until someone pulls on one and things unravel.
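(For anyone bitten by this in the meantime, the user-side workaround being discussed amounts to excluding components by hand. A sketch of a per-user MCA params file - the component names are the ones from this thread, and whether you need both depends on your build:)

```
# $HOME/.openmpi/mca-params.conf -- per-user MCA parameter defaults.
# "^" means "every component except the listed ones", so this excludes
# the ofi and openib BTLs while leaving the rest of the selection alone.
btl = ^ofi,openib
```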
> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>
> In the summer, I tested this BTL along with the MTL and was able to use both
> of them interchangeably with no problem. I don't know what changed. libpsm2?
>
> Arm
>
> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
> We have too many discussion threads overlapping on the same email chain - so
> let's break the discussion on the OFI problem into its own chain.
>
> We have been investigating this locally and found there are a number of
> conflicts between the MTLs and the OFI BTL stepping on each other. The
> correct solution is to move endpoint creation/reporting into the
> opal/mca/common area, but that is going to take some work and will likely
> impact release schedules.
>
> Accordingly, we propose to remove the OFI BTL component from v4.0.0, fix the
> problem in master, and then consider bringing it back as a package in v4.1 or
> v4.2.
>
> Comments? If we agree, I'll file a PR to remove it.
> Ralph
>
>> Begin forwarded message:
>>
>> From: Peter Kjellström <c...@nsc.liu.se>
>> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>> Date: September 20, 2018 at 5:18:35 AM PDT
>> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
>> Cc: Open MPI Developers <devel@lists.open-mpi.org>
>> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>>
>> On Wed, 19 Sep 2018 16:24:53 +0000
>> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>>
>>> I performed some tests on our Omnipath cluster, and I have a mixed
>>> bag of results with 4.0.0rc1
>>
>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>> very similar results.
>>
>>> compute-1-1.local.4351PSM2 has not been initialized
>>> compute-1-0.local.3826PSM2 has not been initialized
>>
>> yup I too see these.
>>
>>> mpirun detected that one or more processes exited with non-zero
>>> status, thus causing the job to be terminated. The first process to
>>> do so was:
>>>
>>>   Process name: [[38418,1],1]
>>>   Exit code: 255
>>>
>>> ----------------------------------------------------------------------------
>>
>> yup.
>>
>>> 2. The ofi mtl does not work at all on our Omnipath cluster. If
>>> I try to force it using 'mpirun -mca mtl ofi ...' I get the following
>>> error message.
>>
>> Yes ofi seems broken. But not even disabling it helps me completely (I
>> see "mca_btl_ofi.so [.] mca_btl_ofi_component_progress" in my
>> perf top...)
>>
>>> 3. The openib btl component is always getting in the way with
>>> annoying warnings. It is not really used, but constantly complains:
>> ...
>>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>>> help-mpi-btl-openib.txt / ib port not selected
>>
>> Yup.
>>
>> ...
>>> So bottom line, if I do
>>>
>>>   mpirun -mca btl ^openib -mca mtl ^ofi ...
>>>
>>> my tests finish correctly, although mpirun will still return an error.
>>
>> I get some things to work with this approach (two ranks on two nodes
>> for example). But a lot of things crash rather hard:
>>
>> $ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>> --------------------------------------------------------------------------
>> PSM2 was unable to open an endpoint. Please make sure that the network
>> link is active on the node and the hardware is functioning.
>>
>>   Error: Failure in initializing endpoint
>> --------------------------------------------------------------------------
>> n909.279895 hfi_userinit: assign_context command failed: Device or
>> resource busy
>> n909.279895 psmi_context_open: hfi_userinit: failed, trying again (1/3)
>> ...
>> PML add procs failed
>>   --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [n908:298761] *** An error occurred in MPI_Init
>> [n908:298761] *** reported by process [4092002305,59]
>> [n908:298761] *** on a NULL communicator
>> [n908:298761] *** Unknown error
>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [n908:298761] ***    and potentially your MPI job)
>> [n907:407748] 255 more processes have sent help message
>> help-mtl-psm2.txt / unable to open endpoint
>> [n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> [n907:407748] 127 more processes have sent help message
>> help-mpi-runtime.txt / mpi_init:startup:internal-failure
>> [n907:407748] 56 more processes have sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>
>> If I disable psm2 too I get it to run (apparently on vader?)
>>
>> /Peter K
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
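(The workarounds quoted above can be kept in one place so the same exclusion string is used everywhere. A minimal sketch - the helper name `build_excludes` is mine, not from the thread; the `^` exclusion syntax for `-mca btl`/`-mca mtl` is as Edgar and Peter use it above:)

```shell
#!/bin/sh
# Build the "-mca <framework> ^<list>" exclusion flags discussed in the
# thread: build_excludes <btls-to-exclude> <mtls-to-exclude>
build_excludes() {
    # "--" keeps printf from treating the leading "-mca" as an option.
    printf -- '-mca btl ^%s -mca mtl ^%s' "$1" "$2"
}

# Edgar's workaround (drop the openib BTL and the ofi MTL); on a real
# cluster this would be:  mpirun $(build_excludes openib ofi) ./my_test
build_excludes openib ofi; echo

# Peter's stronger variant: also drop the ofi BTL and the psm2 MTL,
# which for on-node runs falls back to vader/self.
build_excludes openib,ofi ofi,psm2; echo
```

The same exclusions can be set through environment variables instead (`OMPI_MCA_btl='^openib,ofi'`, `OMPI_MCA_mtl='^ofi,psm2'`), which is sometimes more convenient under a batch scheduler.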