Re: [OMPI devel] Coll/tuned: Is coll_tuned_use_dynamic_rules MCA parameter still usefull?

2022-08-04 Thread George Bosilca via devel
The idea here is that the dynamic rules are defined by an entire set of parameters, and that we want a quick way to allow OMPI to ignore them all. If we follow your suggestion and remove coll_tuned_use_dynamic_rules, then turning dynamic rules on or off involves a lot of changes to the MCA file

Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

2022-05-04 Thread George Bosilca via devel
Damien, As Gilles indicated, an example would be great. Meanwhile, as you already have access to the root cause with a debugger, can you check which branch of the if regarding the communicator type in the ompi_coll_base_retain_datatypes_w function is taken? What is the communicator type? Intra or

Re: [OMPI devel] Eager and Rendezvous Implementation

2021-10-19 Thread George Bosilca via devel
No, the PML UCX in OMPI is just a shim layer of translation from our API into UCX API. The selection of the communication protocol you are interested in is deep inside the UCX code. You will need to talk with UCX developers about that. George. On Tue, Oct 19, 2021 at 10:57 AM Masoud

Re: [OMPI devel] Eager and Rendezvous Implementation

2021-10-19 Thread George Bosilca via devel
Masoud, The protocol selection and implementation in OMPI is only available for the PML OB1; other PMLs make their own internal selection, which is usually maintained in some other code base. For OB1, the selection starts in ompi/mca/pml/ob1/pml_ob1_sendreq.c in the function

Re: [OMPI devel] Question regarding the completion of btl_flush

2021-09-29 Thread George Bosilca via devel
question is which one is broken (and then we’ll have > to figure out how to fix…). > > > > Brian > > > > On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" < > devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> >

Re: [OMPI devel] Question regarding the completion of btl_flush

2021-09-28 Thread George Bosilca via devel
Based on my high-level understanding of the code path and according to the UCX implementation of the flush, the required level of completion is local. George. On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel wrote: > Dear All, > > > > I have a question regarding the completion semantics

Re: [OMPI devel] [EXTERNAL] RE: How to display device selection or routing info

2021-08-20 Thread George Bosilca via devel
Larry, There is no simple answer to your question, as it depends on many factors, both software and hardware. A user-selectable PML (our high-level messaging layer) component decides which protocol is used to move the data around and on which hardware. At this level you have a choice between OB1

Re: [OMPI devel] mpirun alternative

2021-03-09 Thread George Bosilca via devel
Gabriel, Awesome, good luck. I have no idea which are or are not necessary for a proper functioning daemon. To me all of the ones you have here seem critical. Ralph would be a better source of information regarding the daemons' requirements. Thanks, George. On Tue, Mar 9, 2021 at 10:25 AM

Re: [OMPI devel] mpirun alternative

2021-03-05 Thread George Bosilca via devel
Gabriel, You should be able to. Here are at least 2 different ways of doing this. 1. Purely MPI. Start singletons (or smaller groups), and connect via sockets using MPI_Comm_join. You can set up your own DNS-like service, with the goal of having the independent MPI jobs leave a trace there, such

Re: [OMPI devel] Warning

2020-05-15 Thread George Bosilca via devel
Luis, Every now and then we remove warnings from the code. In this particular instance the meaning of the code is correct: the ompi_info_t structure starts with an opal_info_t, but removing the warnings is good policy. In general we can either cast the ompi_info_t pointer directly to an
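The layout described in this message, a structure whose first member is another structure, can be sketched in plain C. This is only an illustration: base_info_t, derived_info_t, the field names and the helper functions are hypothetical stand-ins, not the real opal_info_t / ompi_info_t definitions or any Open MPI API. It shows two ways of handing the outer structure to code that expects the inner one: an explicit pointer cast, and taking the address of the embedded member, which needs no cast.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real Open MPI definitions. */
typedef struct base_info {
    int entries;            /* illustrative field */
} base_info_t;

typedef struct derived_info {
    base_info_t super;      /* must stay the first member for the cast below */
    int handle;             /* illustrative field */
} derived_info_t;

/* Fail at compile time if someone moves `super` away from offset 0. */
static_assert(offsetof(derived_info_t, super) == 0,
              "super must be the first member of derived_info_t");

/* A function that only knows about the base type. */
static int base_info_count(const base_info_t *info)
{
    return info->entries;
}

/* Option 1: explicit cast. C guarantees that a pointer to a structure,
 * suitably converted, points to its first member. */
static int count_via_cast(const derived_info_t *info)
{
    return base_info_count((const base_info_t *) info);
}

/* Option 2: address of the embedded member. No cast is needed and the
 * compiler checks the types. */
static int count_via_member(const derived_info_t *info)
{
    return base_info_count(&info->super);
}
```

Both helpers read the same field. The second form stays correct even if the member is later moved, which is one reason it tends to produce fewer warnings than a cast.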

Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread George Bosilca via devel
John, The common denominator across all these errors is an error from connect while trying to connect to 10.71.2.58 on port 1024. Who is 10.71.2.58? Is the firewall open? Is port 1024 open for connections? George. On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <

Re: [OMPI devel] MPI_Info args to spawn - resolving deprecated values?

2020-04-08 Thread George Bosilca via devel
Deprecate, warn and convert seems reasonable. But for how long? As the number of automatic conversions OMPI supports has shown a tendency to increase, and as these conversions happen all over the code base, we might want to set up a well-defined path to deprecation, what and when has been

Re: [OMPI devel] --mca coll choices

2020-04-07 Thread George Bosilca via devel
All the collective decisions are made on the first collective on each communicator. So basically you can change the MCA or pvar before the first collective in a communicator to affect how the decision selection is made. I have posted a few examples over the years on the mailing list. George. On

Re: [OMPI devel] Dynamic topologies using MPI_Dist_graph_create

2020-04-06 Thread George Bosilca via devel
Bradley, You call them through a blocking MPI function; the operation is therefore completed by the time you return from the MPI call. So, long story short, you should be safe calling dist_graph_create in a loop. The segfault indicates a memory issue with some of the internals of the treematch. Do

Re: [OMPI devel] Open MPI BTL TCP interface mapping

2020-01-09 Thread George Bosilca via devel
Will, The 7134 issue is complex in its interactions with the rest of the TCP BTL, and I could not find the time to look at it carefully enough (or test it on AWS). But maybe you can address my main concern here. The #7134 interface selection will have an impact on the traffic distribution among the

Re: [OMPI devel] Reachable framework integration

2020-01-02 Thread George Bosilca via devel
Ralph, I think the first use is still pending reviews (more precisely my review) at https://github.com/open-mpi/ompi/pull/7134. George. On Wed, Jan 1, 2020 at 9:53 PM Ralph Castain via devel < devel@lists.open-mpi.org> wrote: > Hey folks > > I can't find where the opal/reachable framework

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
> did after over 3 years running the code without interruption. I doubt > anyone had ever run the code for such a long sample interval. We found out > because we missed recording an important earthquake a week after the race > condition was tripped. Murphy's law triumphs again. :) >

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
If the issue was some kind of memory consistency problem between threads, then printing that variable in the context of the debugger would show the value of debugger_event_active being false. volatile is not a memory barrier, it simply forces a load for each access of the data, allowing us to weakly

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
I don't think there is a need for any protection around that variable. It will change value only once (in a callback triggered from opal_progress), and the volatile guarantees that loads will be issued for every access, so the waiting thread will eventually notice the change. George. On Tue, Nov
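For readers following this thread, the difference between "a load is issued on every access" and "a memory barrier" can be sketched as below. This is not the Open MPI code: event_active, payload and the three functions are hypothetical names, and the sketch assumes C11 atomics and POSIX threads are available. A volatile flag that changes once is enough for a waiter that only needs to notice the change eventually. If the waiter must also see other data written before the flag changed, the flag needs ordering semantics, shown here with a release store and an acquire load.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; these are not the Open MPI variables. */
static atomic_bool event_active = true;  /* flag cleared once by the "callback" */
static int payload = 0;                  /* ordinary data published with the flag */

/* Plays the role of the callback: write the data, then clear the flag.
 * The release store orders the payload write before the flag change. */
static void *signal_event(void *arg)
{
    (void) arg;
    payload = 42;
    atomic_store_explicit(&event_active, false, memory_order_release);
    return NULL;
}

/* Spin until the flag clears. The acquire load guarantees that the read
 * of payload below observes the value written before the release store. */
static int wait_for_event(void)
{
    while (atomic_load_explicit(&event_active, memory_order_acquire)) {
        /* busy wait; real code would drive a progress function here */
    }
    return payload;
}

/* Start the signaling thread, wait for the flag, return what was seen. */
static int run_demo(void)
{
    pthread_t thread;

    payload = 0;
    atomic_store(&event_active, true);
    if (pthread_create(&thread, NULL, signal_event, NULL) != 0) {
        return -1;
    }
    int seen = wait_for_event();
    pthread_join(thread, NULL);
    return seen;
}
```

With a plain volatile flag instead of the atomic, the loop would still terminate on common platforms, but nothing in the language would order the payload write relative to the flag, which is the distinction the message draws.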

Re: [OMPI devel] Anyone have any thoughts about cache-alignment issue in osc/sm?

2019-09-13 Thread George Bosilca via devel
I think we can remove the header, we don't use it anymore. I commented on the issue. George. On Thu, Sep 12, 2019 at 5:23 PM Geoffrey Paulsen via devel < devel@lists.open-mpi.org> wrote: > Does anyone have any thoughts about the cache-alignment issue in osc/sm, > reported in