Why not simply ompi_ignore it? Removing a component only to bring it back later would force us to lose all of its history. I would rather add an .ompi_ignore and give power users an opportunity to continue playing with it.
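Roughly what I have in mind (untested sketch; I'm assuming autogen.pl still honors .ompi_ignore/.ompi_unignore and that the component lives at opal/mca/btl/ofi):

  # skip the component by default: an empty .ompi_ignore makes
  # autogen.pl leave it out of the generated build system
  touch opal/mca/btl/ofi/.ompi_ignore

  # power users opt back in by listing their username in .ompi_unignore,
  # then regenerating and rebuilding as usual
  echo "$USER" > opal/mca/btl/ofi/.ompi_unignore
  ./autogen.pl && ./configure ... && make

That keeps the code and its history in the tree while making sure nobody picks the component up by accident.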
George.

On Thu, Sep 20, 2018 at 8:04 PM Ralph H Castain <r...@open-mpi.org> wrote:

> I already suggested the configure option, but it doesn't solve the problem. I wouldn't be terribly surprised to find that Cray also has an undetected problem given the nature of the issue - just a question of the amount of testing, variety of environments, etc.
>
> Nobody has to wait for the next major release, though that isn't so far off anyway - there has never been an issue with bringing in a new component during a release series.
>
> Let's just fix this the right way and bring it into 4.1 or 4.2. We may want to look at fixing the osc/rdma/ofi bandaid as well while we are at it.
>
> Ralph
>
> On Sep 20, 2018, at 4:45 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>
> I understand and agree with your point. My initial email was just out of curiosity.
>
> Howard tested this BTL for Cray in the summer as well, so this seems to affect only OPA hardware.
>
> I just remember that in the summer I had to make some changes in libpsm2 to get this BTL to work on OPA. Maybe that is the problem, since the default libpsm2 won't work.
>
> So maybe we can fix this at the configure step: detect the libpsm2 version and don't build the component if it isn't sufficient.
>
> Another idea is that we don't build this BTL by default, so users on Cray hardware can still use it if they want (just rebuild with the BTL enabled) - we would only need to verify that it still works on Cray. That way, OFI stakeholders would not have to wait until the next major release to get this in.
>
> Arm
>
> On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain <r...@open-mpi.org> wrote:
>
>> I suspect it is a question of what you tested and in which scenarios. The problem is that it can bite someone, and there isn't a clean/obvious solution that doesn't require the user to do something - e.g., having to know that they need to disable a BTL. Matias has proposed an MCA-based approach, but I would much rather we just fix this correctly. Bandaids have a habit of becoming permanently forgotten - until someone pulls on one and things unravel.
>>
>> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>>
>> In the summer, I tested this BTL along with the MTL and was able to use both of them interchangeably with no problem. I don't know what changed - libpsm2?
>>
>> Arm
>>
>> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>
>>> We have too many discussion threads overlapping on the same email chain - so let's break the discussion of the OFI problem into its own chain.
>>>
>>> We have been investigating this locally and found that there are a number of conflicts between the MTLs and the OFI BTL stepping on each other. The correct solution is to move endpoint creation/reporting into the opal/mca/common area, but that is going to take some work and will likely impact release schedules.
>>>
>>> Accordingly, we propose to remove the OFI BTL component from v4.0.0, fix the problem in master, and then consider bringing it back as a package in v4.1 or v4.2.
>>>
>>> Comments? If we agree, I'll file a PR to remove it.
>>>
>>> Ralph
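FWIW, even before any of that lands, keeping the component out of people's way does not strictly require deleting it; something along these lines should work (option and file names from memory - double-check against your install, and <prefix> is the installation prefix):

  # build-time: tell configure not to build the ofi BTL at all
  ./configure --enable-mca-no-build=btl-ofi ...

  # run-time: deselect it for a single job
  mpirun -mca btl ^ofi ...

  # or for every job on an installation, via the default params file
  echo "btl = ^ofi" >> <prefix>/etc/openmpi-mca-params.conf

Of course that is exactly the kind of band-aid Ralph is worried about - the user still has to know to do it.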
>>> Begin forwarded message:
>>>
>>> From: Peter Kjellström <c...@nsc.liu.se>
>>> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>>> Date: September 20, 2018 at 5:18:35 AM PDT
>>> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
>>> Cc: Open MPI Developers <devel@lists.open-mpi.org>
>>> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>>>
>>> On Wed, 19 Sep 2018 16:24:53 +0000 "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>>>
>>>> I performed some tests on our Omnipath cluster, and I have a mixed bag of results with 4.0.0rc1.
>>>
>>> I've also tried it on our OPA cluster (skylake + centos-7 + inbox) with very similar results.
>>>
>>>> compute-1-1.local.4351PSM2 has not been initialized
>>>> compute-1-0.local.3826PSM2 has not been initialized
>>>
>>> yup, I too see these.
>>>
>>>> mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
>>>>
>>>> Process name: [[38418,1],1]
>>>> Exit code: 255
>>>> ----------------------------------------------------------------------------
>>>
>>> yup.
>>>
>>>> 2. The ofi mtl does not work at all on our Omnipath cluster. If I try to force it using 'mpirun -mca mtl ofi ...' I get the following error message.
>>>
>>> Yes, ofi seems broken. But not even disabling it helps me completely (I still see "mca_btl_ofi.so [.] mca_btl_ofi_component_progress" in my perf top...)
>>>
>>>> 3. The openib btl component is always getting in the way with annoying warnings. It is not really used, but constantly complains:
>>>>
>>>> ...
>>>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
>>>
>>> Yup.
>>>
>>>> ...
>>>>
>>>> So bottom line, if I do
>>>>
>>>> mpirun -mca btl ^openib -mca mtl ^ofi ...
>>>>
>>>> my tests finish correctly, although mpirun will still return an error.
>>>
>>> I get some things to work with this approach (two ranks on two nodes, for example), but a lot of things crash rather hard:
>>>
>>> $ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>>> --------------------------------------------------------------------------
>>> PSM2 was unable to open an endpoint. Please make sure that the network link is active on the node and the hardware is functioning.
>>>
>>> Error: Failure in initializing endpoint
>>> --------------------------------------------------------------------------
>>> n909.279895hfi_userinit: assign_context command failed: Device or resource busy
>>> n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
>>> ...
>>> PML add procs failed
>>> --> Returned "Error" (-1) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>> [n908:298761] *** An error occurred in MPI_Init
>>> [n908:298761] *** reported by process [4092002305,59]
>>> [n908:298761] *** on a NULL communicator
>>> [n908:298761] *** Unknown error
>>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [n908:298761] ***    and potentially your MPI job)
>>> [n907:407748] 255 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
>>> [n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [n907:407748] 127 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
>>> [n907:407748] 56 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>
>>> If I disable psm2 too I get it to run (apparently on vader?)
>>>
>>> /Peter K
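To make Peter's last data point easier to reproduce: I read "disable psm2 too" as adding it to the MTL exclusion, i.e. roughly the following (the component lists are my guess, not verified on an OPA system):

  # exclude the psm2 MTL as well as ofi, on top of the earlier exclusions
  mpirun -mca btl ^openib,ofi -mca mtl ^ofi,psm2 ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1

  # or pin the selection explicitly so nothing OPA-specific is picked:
  # ob1 PML with shared memory (vader) inside a node and tcp across nodes
  mpirun -mca pml ob1 -mca btl self,vader,tcp ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1

Note that vader is shared-memory only, so if those runs really spanned nodes the inter-node traffic presumably fell back to tcp rather than vader.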
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel