I suspect it is a question of what you tested and in which scenarios. The problem 
is that it can bite someone, and there isn’t a clean or obvious solution that 
doesn’t require the user to do something, e.g., having to know that they 
need to disable a BTL. Matias has proposed an MCA-based approach, but I would 
much rather we just fix this correctly. Band-aids have a habit of becoming 
permanently forgotten - until someone pulls on one and things unravel.
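
To make the "user has to know" burden concrete, here is a rough sketch of what disabling a BTL by hand looks like. It relies on the standard OMPI_MCA_<param> environment-variable form of "--mca <param> <value>"; ./my_app is a placeholder, and whether excluding the OFI BTL alone is sufficient is exactly what is in question in this thread, so treat it as an illustration rather than a verified workaround.

```shell
# Exclude the OFI BTL for every mpirun launched from this shell.
# OMPI_MCA_btl is the environment-variable form of "--mca btl <value>";
# the leading ^ means "everything except the listed components".
export OMPI_MCA_btl='^ofi'

# Equivalent one-off form on the command line (./my_app is a placeholder):
#   mpirun --mca btl ^ofi -np 2 ./my_app
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -np 2 ./my_app
else
    # No MPI installation or no test binary here: just show the setting.
    echo "btl selection: ${OMPI_MCA_btl}"
fi
```

Nothing in the default output tells a user that this setting is needed, which is why I would rather not depend on it.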


> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon 
> <tpati...@vols.utk.edu> wrote:
> 
> In the summer, I tested this BTL along with the MTL and was able to use both 
> of them interchangeably with no problem. I don't know what changed. libpsm2?
> 
> 
> Arm
> 
> 
> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
> We have too many discussion threads overlapping on the same email chain - so 
> let’s break the discussion on the OFI problem into its own chain.
> 
> We have been investigating this locally and found there are a number of 
> conflicts between the MTLs and the OFI/BTL stepping on each other. The 
> correct solution is to move endpoint creation/reporting into the 
> opal/mca/common area, but that is going to take some work and will likely 
> impact release schedules.
> 
> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix the 
> problem in master, and then consider bringing it back as a package to v4.1 or 
> v4.2.
> 
> Comments? If we agree, I’ll file a PR to remove it.
> Ralph
> 
> 
>> Begin forwarded message:
>> 
>> From: Peter Kjellström <c...@nsc.liu.se>
>> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>> Date: September 20, 2018 at 5:18:35 AM PDT
>> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
>> Cc: Open MPI Developers <devel@lists.open-mpi.org>
>> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>> 
>> On Wed, 19 Sep 2018 16:24:53 +0000
>> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>> 
>>> I performed some tests on our Omnipath cluster, and I have a mixed
>>> bag of results with 4.0.0rc1
>> 
>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>> very similar results.
>> 
>>> compute-1-1.local.4351PSM2 has not been initialized
>>> compute-1-0.local.3826PSM2 has not been initialized
>> 
>> yup I too see these.
>> 
>>> mpirun detected that one or more processes exited with non-zero
>>> status, thus causing the job to be terminated. The first process to
>>> do so was:
>>> 
>>>              Process name: [[38418,1],1]
>>>              Exit code:    255
>>>              
>>> ----------------------------------------------------------------------------
>> 
>> yup.
>> 
>>> 
>>> 2.       The ofi mtl does not work at all on our Omnipath cluster. If
>>> I try to force it using ‘mpirun -mca mtl ofi …’ I get the following
>>> error message.
>> 
>> Yes, ofi seems broken. But not even disabling it helps me completely (I
>> see "mca_btl_ofi.so           [.] mca_btl_ofi_component_progress" in my
>> perf top).
>> 
>>> 3.       The openib btl component is always getting in the way with
>>> annoying warnings. It is not really used, but constantly complains:
>> ...
>>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>>> help-mpi-btl-openib.txt / ib port not selected
>> 
>> Yup.
>> 
>> ...
>>> So bottom line, if I do
>>> 
>>> mpirun -mca btl ^openib -mca mtl ^ofi ….
>>> 
>>> my tests finish correctly, although mpirun will still return an error.
>> 
>> I get some things to work with this approach (two ranks on two nodes
>> for example). But a lot of things crash rather hard:
>> 
>> $ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>> --------------------------------------------------------------------------
>> PSM2 was unable to open an endpoint. Please make sure that the network
>> link is active on the node and the hardware is functioning.
>> 
>>  Error: Failure in initializing endpoint
>> --------------------------------------------------------------------------
>> n909.279895hfi_userinit: assign_context command failed: Device or resource busy
>> n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
>> ...
>>  PML add procs failed
>>  --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [n908:298761] *** An error occurred in MPI_Init
>> [n908:298761] *** reported by process [4092002305,59]
>> [n908:298761] *** on a NULL communicator
>> [n908:298761] *** Unknown error
>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [n908:298761] ***    and potentially your MPI job)
>> [n907:407748] 255 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
>> [n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [n907:407748] 127 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
>> [n907:407748] 56 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>> 
>> If I disable psm2 too I get it to run (apparently on vader?)
>> 
>> /Peter K
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://lists.open-mpi.org/mailman/listinfo/devel 
>> <https://lists.open-mpi.org/mailman/listinfo/devel>
