Why not simply .ompi_ignore it? Removing the component only to bring it back
later would force us to lose all its history. I would rather add an
.ompi_ignore and give power users an opportunity to continue playing with it.
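
For reference, a minimal sketch of the mechanism, assuming the component
lives under opal/mca/btl/ofi (the username below is just an example):

  # autogen.pl skips any component directory that contains .ompi_ignore
  touch opal/mca/btl/ofi/.ompi_ignore

  # users listed in .ompi_unignore still get the component built
  echo "someuser" > opal/mca/btl/ofi/.ompi_unignore

Power users who want to keep playing with it add themselves to
.ompi_unignore and re-run autogen.pl.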

  George.


On Thu, Sep 20, 2018 at 8:04 PM Ralph H Castain <r...@open-mpi.org> wrote:

> I already suggested the configure option, but it doesn’t solve the
> problem. I wouldn’t be terribly surprised to find that Cray also has an
> undetected problem given the nature of the issue - just a question of the
> amount of testing, variety of environments, etc.
>
> Nobody has to wait for the next major release, though that isn’t so far
> off anyway - there has never been an issue with bringing in a new component
> during a release series.
>
> Let’s just fix this the right way and bring it into 4.1 or 4.2. We may
> want to look at fixing the osc/rdma/ofi bandaid as well while we are at it.
>
> Ralph
>
>
> On Sep 20, 2018, at 4:45 PM, Patinyasakdikul, Thananon <
> tpati...@vols.utk.edu> wrote:
>
> I understand and agree with your point. My initial email was just out of
> curiosity.
>
> Howard tested this BTL on Cray over the summer as well, so the problem
> seems to affect only OPA hardware.
>
> I just remembered that over the summer I had to make some changes in
> libpsm2 to get this BTL to work on OPA. Maybe that is the problem: the
> stock libpsm2 won't work.
>
> So maybe we can fix this at configure time: detect the libpsm2 version
> and don't build the component if it is too old.
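>
> A rough sketch of such a probe, with a made-up version cutoff since we'd
> first have to pin down which libpsm2 actually works:
>
>   # compile-time check of the PSM2_VERNO_* macros from <psm2.h>;
>   # "2.33" is a placeholder minimum, not a verified number
>   printf '#include <psm2.h>\n#if PSM2_VERNO_MAJOR < 2 || (PSM2_VERNO_MAJOR == 2 && PSM2_VERNO_MINOR < 33)\n#error libpsm2 too old\n#endif\nint main(void){return 0;}\n' > conftest.c
>   if cc -c conftest.c -o conftest.o 2>/dev/null; then
>     echo "libpsm2 looks new enough: build btl/ofi"
>   else
>     echo "libpsm2 too old or missing: skip btl/ofi"
>   fi
>   rm -f conftest.c conftest.o
>
> The real check would live in the component's configure.m4, of course.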
>
> Another idea: maybe we don't build this BTL by default, so users with
> Cray hardware can still opt in by rebuilding with the component enabled.
> We just need to verify that it still works on Cray. This way, OFI
> stakeholders do not have to wait until the next major release to get this in.
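>
> Something along these lines, using the existing --enable-mca-no-build
> configure option (a sketch of the packaging default, not a tested recipe):
>
>   # default build: ship v4.0.0 without btl/ofi
>   ./configure --enable-mca-no-build=btl-ofi ...
>
>   # opting back in: configure without that flag, then select the
>   # component explicitly at run time
>   mpirun --mca btl self,vader,ofi ./a.out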
>
>
> Arm
>
>
> On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain <r...@open-mpi.org> wrote:
>
>> I suspect it is a question of what you tested and in which scenarios. The
>> problem is that it can bite someone, and there isn't a clean/obvious
>> solution that doesn't require the user to do something - e.g., having to
>> know that they need to disable a BTL. Matias has proposed an MCA-based
>> approach, but I would much rather we just fix this correctly. Bandaids have
>> a habit of becoming permanently forgotten - until someone pulls on one and
>> things unravel.
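>>
>> For the record, the kind of user-side workaround in question - the
>> leading ^ means "exclude these components":
>>
>>   # per-job
>>   mpirun --mca btl ^ofi ./a.out
>>
>>   # or persistently, in $HOME/.openmpi/mca-params.conf
>>   btl = ^ofi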
>>
>>
>> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <
>> tpati...@vols.utk.edu> wrote:
>>
>> Over the summer I tested this BTL alongside the MTL and was able to use
>> both of them interchangeably with no problem. I don't know what changed.
>> libpsm2?
>>
>>
>> Arm
>>
>>
>> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>
>>> We have too many discussion threads overlapping on the same email chain
>>> - so let’s break the discussion on the OFI problem into its own chain.
>>>
>>> We have been investigating this locally and found a number of places
>>> where the MTLs and the OFI BTL step on each other. The correct solution
>>> is to move endpoint creation/reporting into the opal/mca/common area,
>>> but that is going to take some work and will likely impact release
>>> schedules.
>>>
>>> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix
>>> the problem in master, and then consider bringing it back in v4.1 or
>>> v4.2.
>>>
>>> Comments? If we agree, I’ll file a PR to remove it.
>>> Ralph
>>>
>>>
>>> Begin forwarded message:
>>>
>>> From: Peter Kjellström <c...@nsc.liu.se>
>>> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>>> Date: September 20, 2018 at 5:18:35 AM PDT
>>> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
>>> Cc: Open MPI Developers <devel@lists.open-mpi.org>
>>> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>>>
>>> On Wed, 19 Sep 2018 16:24:53 +0000
>>> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>>>
>>> I performed some tests on our Omnipath cluster, and I have a mixed
>>> bag of results with 4.0.0rc1.
>>>
>>>
>>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>>> very similar results.
>>>
>>> compute-1-1.local.4351PSM2 has not been initialized
>>> compute-1-0.local.3826PSM2 has not been initialized
>>>
>>>
>>> yup I too see these.
>>>
>>> mpirun detected that one or more processes exited with non-zero
>>> status, thus causing the job to be terminated. The first process to
>>> do so was:
>>>
>>>              Process name: [[38418,1],1]
>>>              Exit code:    255
>>>
>>>              
>>> ----------------------------------------------------------------------------
>>>
>>>
>>> yup.
>>>
>>>
>>> 2. The ofi mtl does not work at all on our Omnipath cluster. If I try
>>> to force it using 'mpirun --mca mtl ofi ...' I get the following error
>>> message.
>>>
>>>
>>> Yes, ofi seems broken. But not even disabling it helps me completely (I
>>> still see "mca_btl_ofi.so [.] mca_btl_ofi_component_progress" in my
>>> perf top...)
>>>
>>> 3. The openib btl component is always getting in the way with annoying
>>> warnings. It is not really used, but constantly complains:
>>>
>>> ...
>>>
>>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>>> help-mpi-btl-openib.txt / ib port not selected
>>>
>>>
>>> Yup.
>>>
>>> ...
>>>
>>> So bottom line, if I do
>>>
>>> mpirun --mca btl ^openib --mca mtl ^ofi ...
>>>
>>> my tests finish correctly, although mpirun will still return an error.
>>>
>>>
>>> I get some things to work with this approach (two ranks on two nodes,
>>> for example). But a lot of things crash rather hard:
>>>
>>> $ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>>>
>>> --------------------------------------------------------------------------
>>> PSM2 was unable to open an endpoint. Please make sure that the network
>>> link is active on the node and the hardware is functioning.
>>>
>>>  Error: Failure in initializing endpoint
>>>
>>> --------------------------------------------------------------------------
>>> n909.279895hfi_userinit: assign_context command failed: Device or resource busy
>>> n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
>>> ...
>>>  PML add procs failed
>>>  --> Returned "Error" (-1) instead of "Success" (0)
>>>
>>> --------------------------------------------------------------------------
>>> [n908:298761] *** An error occurred in MPI_Init
>>> [n908:298761] *** reported by process [4092002305,59]
>>> [n908:298761] *** on a NULL communicator
>>> [n908:298761] *** Unknown error
>>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>>  will now abort, [n908:298761] ***    and potentially your MPI job)
>>> [n907:407748] 255 more processes have sent help message
>>>  help-mtl-psm2.txt / unable to open endpoint [n907:407748] Set MCA
>>>  parameter "orte_base_help_aggregate" to 0 to see all help / error
>>>  messages [n907:407748] 127 more processes have sent help message
>>>  help-mpi-runtime.txt / mpi_init:startup:internal-failure
>>>  [n907:407748] 56 more processes have sent help message
>>>  help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>
>>> If I disable psm2 too I get it to run (apparently on vader?)
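>>>
>>> For reference, the combined exclusion that gets things running here (a
>>> single leading ^ negates the whole comma-separated list):
>>>
>>> $ mpirun -mca btl ^openib -mca mtl ^ofi,psm2 \
>>>     ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>>>
>>> With both MTLs excluded the cm PML drops out, so intra-node traffic
>>> presumably lands on the vader BTL.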
>>>
>>> /Peter K
>>
>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
