Re: [OMPI devel] Coll/tuned: Is coll_tuned_use_dynamic_rules MCA parameter still useful?

2022-08-04 Thread George Bosilca via devel
The idea here is that the dynamic rules are defined by an entire set of
parameters, and that we want a quick way to allow OMPI to ignore them all.
If we follow your suggestion and remove coll_tuned_use_dynamic_rules, then
turning dynamic rules on/off involves a lot of changes to the MCA file
(instead of a single change with coll_tuned_use_dynamic_rules).
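
A minimal sketch of the kind of MCA parameter file this is about (the
per-collective parameter names below follow the usual coll/tuned naming and
are shown only as an illustration):

# openmpi-mca-params.conf
# Single switch: set to 0 (or drop the line) and OMPI ignores everything below.
coll_tuned_use_dynamic_rules = 1
# Per-collective algorithm choices (illustrative values):
coll_tuned_bcast_algorithm = 6
coll_tuned_allreduce_algorithm = 4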

  George.


On Mon, Jun 27, 2022 at 8:40 AM GERMAIN, Florent via devel <
devel@lists.open-mpi.org> wrote:

> Hi,
>
> I wonder if the coll_tuned_use_dynamic_rules MCA parameter is still useful.
>
> Reminder of this MCA parameter's behavior:
> Default value: false
> When activated, it allows the use of dynamic rules of the coll/tuned
> component (algorithm choice and parameters through MCA parameters).
>
> Is there any need or argument for keeping this parameter? (except “it was
> here yesterday”)
>
> The issue encountered here is that, with verbose=0 and
> coll_tuned_use_dynamic_rules=false, all the MCA parameters driving the
> dynamic behavior of the coll/tuned component are silently ignored.
> Forgetting to switch it on leads to silent unwanted behavior, which is
> kind of annoying.
>
> Is there something I am missing about this MCA parameter?
>
> I suggest removing the coll_tuned_use_dynamic_rules MCA parameter or at
> least setting its default value to true.
>
>
>
> I can make the changes and open a PR if we want to remove this parameter.
>
>
>
> Regards,
>
> *Florent Germain*
>
> Development Engineer
>
> florent.germ...@atos.net
>
> 1 Rue de Provence
>
> 38130 Echirolles
>
> Atos.net
>
>
>
>


Re: [OMPI devel] Possible Bug / Invalid Read in Ialltoallw

2022-05-04 Thread George Bosilca via devel
Damian,

As Gilles indicated, an example would be great. Meanwhile, since you already
have access to the root cause with a debugger, can you check which branch of
the if statement on the communicator type in the
ompi_coll_base_retain_datatypes_w function is taken? What is the
communicator type? Intra or inter? With or without topology?

Thanks,
  George.
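
For context, here is a minimal, hedged C sketch of the kind of communicator
setup described in the report quoted below (2D MPI_Cart_create, MPI_Cart_sub
with a possibly degenerate dimension, then MPI_Ialltoallw). It is only an
illustration of the pattern, not Damian's code and not a guaranteed reproducer:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 2D grid; one dimension may end up being 1. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    /* Keep only the first dimension: the sub-communicator can have size 1. */
    int remain[2] = {1, 0};
    MPI_Comm sub;
    MPI_Cart_sub(cart, remain, &sub);

    int n;
    MPI_Comm_size(sub, &n);

    /* One count/displacement/datatype entry per rank of the sub-communicator. */
    int *counts = calloc(n, sizeof(int));
    int *displs = calloc(n, sizeof(int));
    MPI_Datatype *types = malloc(n * sizeof(MPI_Datatype));
    for (int i = 0; i < n; i++) types[i] = MPI_INT;

    int sendbuf = 0, recvbuf = 0;
    MPI_Request req;
    MPI_Ialltoallw(&sendbuf, counts, displs, types,
                   &recvbuf, counts, displs, types, sub, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(counts); free(displs); free(types);
    MPI_Comm_free(&sub);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}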


On Wed, May 4, 2022 at 9:35 AM Gilles Gouaillardet via devel <
devel@lists.open-mpi.org> wrote:

> Damian,
>
> Thanks for the report!
>
> could you please trim your program and share it so I can have a look?
>
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 4, 2022 at 10:27 PM Damian Marek via devel <
> devel@lists.open-mpi.org> wrote:
>
>> Hello,
>>
>> I have been getting intermittent memory corruptions and segmentation
>> faults while using Ialltoallw in OpenMPI v4.0.3. Valgrind also reports an
>> invalid read in the "ompi_coll_base_retain_datatypes_w" function defined in
>> "coll_base_util.c".
>>
>> Running with a debug build of ompi, an assertion fails as well:
>>
>> base/coll_base_util.c:274: ompi_coll_base_retain_datatypes_w: Assertion
>> `OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (stypes[i]))->obj_magic_id' failed.
>>
>> I think it is related to the fact that I am using a communicator created
>> with 2D MPI_Cart_create followed by getting 2 subcommunicators from
>> MPI_Cart_sub, in some cases one of the dimensions is 1. In
>> "ompi_coll_base_retain_datatypes_w" the neighbour count is used to find
>> "rcount" and "scount" at line 267. In my bug case it returns 2 for both,
>> but I believe it should be 1 since that is the comm size and the amount of
>> memory I have allocated for sendtypes and recvtypes. Then, an invalid read
>> happens at 274 and 280.
>>
>> Regards,
>> Damian
>>
>>
>>
>>
>>
>>
>>


Re: [OMPI devel] Eager and Rendezvous Implementation

2021-10-19 Thread George Bosilca via devel
No, the PML UCX in OMPI is just a shim layer translating from our API into
the UCX API. The selection of the communication protocol you are interested
in happens deep inside the UCX code. You will need to talk with the UCX
developers about that.

  George.


On Tue, Oct 19, 2021 at 10:57 AM Masoud Hemmatpour 
wrote:

>
> Hi  George,
>
> Thank you very much for your reply. I use UCX for the communication. Is it
> somewhere in pml_ucx.c?
>
> Thanks,
>
>
>
> On Tue, Oct 19, 2021 at 4:41 PM George Bosilca 
> wrote:
>
>> Masoud,
>>
>> The protocol selection and implementation in OMPI is only available for
>> the PML OB1, other PMLs make their own internal selection that is usually
>> maintained in some other code base.
>>
>> For OB1, the selection starts in ompi/mca/pml/ob1/pml_ob1_sendreq.c in
>> the function mca_pml_ob1_send_request_start, where the sender decides what
>> protocol might be best to use according to its memory layout and
>> message size. This decision is then encapsulated in a matching header that
>> is forwarded to the peer. Once the matching is done on the receiving
>> processor (in pml_ob1_recvfrag.c starting from match_one), the receiver can
>> confirm the protocol proposed by the sender or can fall back to a different
>> protocol (such as pipeline send/recv).
>>
>> If you have questions let me know.
>>   George.
>>
>>
>>
>> On Tue, Oct 19, 2021 at 10:15 AM Masoud Hemmatpour via devel <
>> devel@lists.open-mpi.org> wrote:
>>
>>> Hello all,
>>>
>>> I am new to Open MPI source code. I am trying to understand the Eager
>>> and Rendezvous
>>> implementation  inside ompi code base. Could you please help and refer
>>> me to the source file?
>>> I read a bit on OMPI, then PML and BTL but I am still not sure what is
>>> going on.
>>>
>>> Thanks!
>>>
>>>
>>>


Re: [OMPI devel] Eager and Rendezvous Implementation

2021-10-19 Thread George Bosilca via devel
Masoud,

The protocol selection and implementation in OMPI is only available for the
PML OB1, other PMLs make their own internal selection that is usually
maintained in some other code base.

For OB1, the selection starts in ompi/mca/pml/ob1/pml_ob1_sendreq.c in the
function mca_pml_ob1_send_request_start, where the sender decides what
protocol might be best to use according to its memory layout and
message size. This decision is then encapsulated in a matching header that
is forwarded to the peer. Once the matching is done on the receiving
processor (in pml_ob1_recvfrag.c starting from match_one), the receiver can
confirm the protocol proposed by the sender or can fall back to a different
protocol (such as pipeline send/recv).
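
A hedged aside for readers exploring those code paths: the eager/rendezvous
switchover is driven by the per-BTL eager limit, so you can exercise the
different OB1 branches from the outside by shrinking or growing that limit.
For example, with the TCP BTL ("./a.out" is just a placeholder):

mpirun --mca pml ob1 --mca btl tcp,self --mca btl_tcp_eager_limit 4096 -np 2 ./a.out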

If you have questions let me know.
  George.



On Tue, Oct 19, 2021 at 10:15 AM Masoud Hemmatpour via devel <
devel@lists.open-mpi.org> wrote:

> Hello all,
>
> I am new to Open MPI source code. I am trying to understand the Eager and
> Rendezvous
> implementation  inside ompi code base. Could you please help and refer me
> to the source file?
> I read a bit on OMPI, then PML and BTL but I am still not sure what is
> going on.
>
> Thanks!
>
>
>


Re: [OMPI devel] Question regarding the completion of btl_flush

2021-09-29 Thread George Bosilca via devel
Brian,

My comment was mainly about the BTL code. MPI_Win_fence does not require
remote completion: the call only guarantees that all outbound operations
have been locally completed, and that all inbound operations from other
sources on the process are also complete. I agree with you on the Win_flush
implementation we have; it only guarantees the first part, and assumes the
barrier will drain the network of all pending messages.

You're right, the current implementation assumes that the MPI_Barrier,
having a more synchronizing behavior and requiring more messages to be
exchanged between the participants, might increase the likelihood that, even
with overtaking, all pending messages have reached their destination.

  George.
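
For reference, a minimal MPI-level sketch of the fence pattern being debated:
whatever "completion" means inside the BTL, after the closing MPI_Win_fence
the target is allowed to read remotely written data. This is only an
illustration of the user-visible contract, not the osc/rdma implementation:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = -1, value = 42;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open the epoch */
    if (rank == 0 && size > 1)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);               /* close the epoch: the Put must now be
                                            visible at the target, however the OSC
                                            component gets there internally */
    if (rank == 1)
        printf("rank 1 sees %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}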


On Tue, Sep 28, 2021 at 10:36 PM Barrett, Brian  wrote:

> George –
>
>
>
> Is your comment about the code path referring to the BTL code or the OSC
> RDMA code?  The OSC code seems to expect remote completion, at least for
> the fence operation.  Fence is implemented as a btl flush followed by a
> window-wide barrier.  There’s no ordering specified between the RDMA
> operations completed by the flush and the send messages in the collective,
> so overtaking is possible.  Given that the BTL and the UCX PML (or OFI MTL
> or whatever) are likely using different QPs, ordering of the packets is
> doubtful.
>
>
>
> Like you, we saw that many BTLs appear to only guarantee local completion
> with flush().  So the question is which one is broken (and then we’ll have
> to figure out how to fix…).
>
>
>
> Brian
>
>
>
> On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" <
> devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org>
> wrote:
>
>
>
>
>
> Based on my high-level understanding of the code path and according to the
> UCX implementation of the flush, the required level of completion is local.
>
>
>
>   George.
>
>
>
>
>
> On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel <
> devel@lists.open-mpi.org> wrote:
>
> Dear All,
>
>
>
> I have a question regarding the completion semantics of btl_flush,
>
>
>
> In opal/mca/btl/btl.h,
>
>
>
>
> https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146
>
>
>
> the comment about btl_flush says:
>
>
>
> * This function returns when all outstanding RDMA (put, get, atomic)
> operations
>
> * that were started prior to the flush call have completed.
>
>
>
> However, it is not clear to me what “complete” actually means. E.g., does
> it mean local completion (the action on the RDMA initiator side has completed),
> or does it mean “remote completion” (the action on the RDMA remote side has
> completed)? We are interested in this because for many RDMA btls, “local
> completion” does not equal “remote completion”.
>
>
>
> From the way btl_flush is used in osc/rdma’s fence operation (which is a
> call to flush followed by a MPI_Barrier), we think that btl_flush should
> mean remote completion, but want to get the clarification from the
> community.
>
>
>
> Sincerely,
>
>
>
> Wei Zhang
>
>
>
>


Re: [OMPI devel] Question regarding the completion of btl_flush

2021-09-28 Thread George Bosilca via devel
Based on my high-level understanding of the code path and according to the
UCX implementation of the flush, the required level of completion is local.

  George.


On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel 
wrote:

> Dear All,
>
>
>
> I have a question regarding the completion semantics of btl_flush,
>
>
>
> In opal/mca/btl/btl.h,
>
>
>
>
> https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146
>
>
>
> the comment about btl_flush says:
>
>
>
> * This function returns when all outstanding RDMA (put, get, atomic)
> operations
>
> * that were started prior to the flush call have completed.
>
>
>
> However, it is not clear to me what “complete” actually means. E.g., does
> it mean local completion (the action on the RDMA initiator side has completed),
> or does it mean “remote completion” (the action on the RDMA remote side has
> completed)? We are interested in this because for many RDMA btls, “local
> completion” does not equal “remote completion”.
>
>
>
> From the way btl_flush is used in osc/rdma’s fence operation (which is a
> call to flush followed by a MPI_Barrier), we think that btl_flush should
> mean remote completion, but want to get the clarification from the
> community.
>
>
>
> Sincerely,
>
>
>
> Wei Zhang
>
>
>


Re: [OMPI devel] [EXTERNAL] RE: How to display device selection or routing info

2021-08-20 Thread George Bosilca via devel
Larry,

There is no simple answer to your question, as it depends on many software
and hardware factors. A user-selectable PML (our high-level messaging
layer) component will decide what protocol is used to move the data
around and over what hardware. At this level you have a choice between OB1
(which will then use BTLs to move data), UCX (which will use its own
internal capabilities, TLs), or CM (which will then default to an MTL, aka
specialized for each type of hardware). For any PML other than OB1, the steps
leading to the route selection are taken internally by the software component
used by the corresponding PML/MTL, and OMPI has basically no control over the
decision. As a user you might be able to drive this decision using
specialized parameters for the underlying library, but as far as I know
there is no way to extract this information in a portable way.

Let’s now assume you use the OB1 PML, which in turn will use the BTLs to
drive the data movement. The decision on what BTL to use will then be
driven by the exclusivity, latency and bandwidth exposed by the
BTLs themselves. Some BTLs will play nicely with others (aka sharing the
load between 2 peers, or allowing multiple instances of itself), while
others will prefer a more dedicated connection. One of the reasons OMPI
does not have a proper mechanism to expose the routing decisions to the
user is this multiplicity of possibilities. The best we have right now is
the ompi_display_comm MCA parameter, which will show you the BTLs initialized
for the run (which is a superset of what will be used). Try adding `--mca
ompi_display_comm MPI_Finalize` to your mpirun command.

Hopefully this answers your inquiry, at least partially.
  George.


On Fri, Aug 20, 2021 at 11:39 Baker, Lawrence M via devel <
devel@lists.open-mpi.org> wrote:

> Florent,
>
> I agree with your description for the information I have already found
> through the use of -mca btl_base_verbose 10.  I am requesting some way of
> obtaining the device selection and configuration information without the
> rest of the tracing.
>
> As far as routing, I was not referring to the OS network routing, but the
> routing layer that must exist in OpenMPI.  When OMPI must send a message
> from one process to another, it has to pick a path, namely the one among
> what may be several possibilities that it has decided is the shortest when
> mpirun started the run.  The internal OMPI routing table would be a summary
> of all the device choices made my mpirun, I assume.
>
> Thank you,
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
>
>
>
> > On Aug 19 2021, at 11:30:58 PM, GERMAIN, Florent <
> florent.germ...@atos.net> wrote:
> >
> >
> >
> >
> >
> >
> > Hello,
> > Open MPI have given to you the only information it has: message is given
> to mlx4_0:1.
> > As you are using InfiniBand network, the routing tables are handled by
> the ib runtime.
> > You can gather information about ib routing through the ibroute command.
> > I hope it will help you.
> >
> > Florent Germain
> > Development Engineer
> > florent.germ...@atos.net
> > 1 Rue de Provence
> > 38130 Echirolles
> > Atos.net
> >
> > -----Original Message-----
> > From: devel  On behalf of Baker,
> Lawrence M via devel
> > Sent: Friday, August 20, 2021 04:19
> > To: Geoffrey Paulsen via devel 
> > Cc: Baker, Lawrence M 
> > Subject: [OMPI devel] How to display device selection or routing info
> >
> > I am running into dead ends trying to find an mpirun option to display
> the device selected for each process.  I use
> >
> > mpirun --report-bindings
> >
> > to view the bindings to nodes/cpus/cores.  But I cannot find an
> equivalent option that displays the hardware path used for OpenMPI
> messaging.  Or the device parameters that go with that, if they are
> programmable.  I have found I can see the openib message selecting
> InfiniBand when I add
> >
> > -mca btl_base_verbose 10
> >
> > [compute-0-30.local:02597] [rank=0] openib: using port mlx4_0:1
> >
> > But, I get lots of other messages that are just noise for this purpose.
> >
> > There must be a routing table somewhere being constructed (ORTE?).  I
> could not find an MCA option to show that.
> >
> > Is there a better way to get this information?  My Google searching has
> not turned up anything.  If not, I'd like to put in a request for an mpirun
> option similar to --report-bindings that displays the routing table and/or
> device bindings and programmable settings used.
> >
> > Thank you very much.
> >
> > Larry Baker
> > US Geological Survey
> > 650-329-5608
> > ba...@usgs.gov
> >
> >
>
>


Re: [OMPI devel] mpirun alternative

2021-03-09 Thread George Bosilca via devel
Gabriel,

Awesome, good luck.

I have no idea which are or are not necessary for a properly functioning
daemon. To me all of the ones you have here seem critical. Ralph would be a
better source of information regarding the daemons' requirements.

Thanks,
  George.


On Tue, Mar 9, 2021 at 10:25 AM Gabriel Tanase 
wrote:

> George,
> I started to digg more in option 2 as you describe it. I believe I can
> make that work.
> For example I created this fake ssh :
>
> $ cat ~/bin/ssh
> #!/bin/bash
> fname=env.$$
> echo ">>>>>>>>>>>>> ssh" >> $fname
> env >>$fname
> echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>" >>$fname
> echo $@ >>$fname
>
> And this one prints all args that the remote process will receive:
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> -x 10.0.35.43 orted -mca ess "env" -mca ess_base_jobid "2752512000" -mca
> ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex
> "ip-[2:10]-0-16-120,[2:10].0.35.43,[2:10].0.35.42@0(3)" -mca orte_hnp_uri
> "2752512000.0;tcp://10.0.16.120:44789" -mca plm "rsh" --tree-spawn -mca
> routed "radix" -mca orte_parent_uri "2752512000.0;tcp://10.0.16.120:44789"
> -mca rmaps_base_mapping_policy "node" -mca pmix "^s1,s2,cray,isolated"
>
> Now I am thinking that probably I don't even need to create all those
> openmpi env variables, as I am hoping the orted that will be started remotely
> will start the final executable with the right env set. Does this sound
> right?
>
> Thx,
> --Gabriel
>
>
> On Fri, Mar 5, 2021 at 3:15 PM George Bosilca  wrote:
>
>> Gabriel,
>>
>> You should be able to. Here are at least 2 different ways of doing this.
>>
>> 1. Purely MPI. Start singletons (or smaller groups), and connect via
>> sockets using MPI_Comm_join. You can set up your own DNS-like service, with
>> the goal of having the independent MPI jobs leave a trace there, such that
>> they can find each other and create the initial socket.
>>
>> 2. You could replace ssh/rsh with a no-op script (that returns success
>> such that the mpirun process thinks it successfully started the processes),
>> and then handcraft the environment as you did for GASNet.
>>
>> 3. We have support for DVM (Distributed Virtual Machine), which basically
>> creates an independent service that different mpiruns can connect to in
>> order to retrieve information. The mpiruns using this DVM run as singletons,
>> and fall back to MPI_Comm_connect/accept to recreate an MPI world.
>>
>> Good luck,
>>   George.
>>
>>
>> On Fri, Mar 5, 2021 at 2:08 PM Ralph Castain via devel <
>> devel@lists.open-mpi.org> wrote:
>>
>>> I'm afraid that won't work - there is no way for the job to "self
>>> assemble". One could create a way to do it, but it would take some
>>> significant coding in the guts of OMPI to get there.
>>>
>>>
>>> On Mar 5, 2021, at 9:40 AM, Gabriel Tanase via devel <
>>> devel@lists.open-mpi.org> wrote:
>>>
>>> Hi all,
>>> I decided to use mpi as the messaging layer for a multihost database.
>>> However within my org I faced very strong opposition to allow passwordless
>>> ssh or rsh. For security reasons we want to minimize the opportunities to
>>> execute arbitrary codes on the db clusters. I don;t want to run other
>>> things like slurm, etc.
>>>
>>> My question would be: Is there a way to start an mpi application by
>>> running certain binaries on each host? E.g., if my executable is "myapp",
>>> can I start a server (orted???) on host zero and then start myapp on each
>>> host with the right env variables set (for specifying the rank, num ranks,
>>> etc.)
>>>
>>> For example when using another messaging API (GASnet) I was able to
>>> start a server on host zero and then manually start the application binary
>>> on each host (with some environment variables properly set) and all was
>>> good.
>>>
>>> I tried to reverse engineer a little the env variables used by mpirun
>>> (mpirun -np 2 env) and then I copied these env variables in a shell script
>>> prior to invoking my hello world mpirun but I got an error message implying
>>> a server is not present:
>>>
>>> PMIx_Init failed for the following reason:
>>>
>>>

Re: [OMPI devel] mpirun alternative

2021-03-05 Thread George Bosilca via devel
Gabriel,

You should be able to. Here are at least 2 different ways of doing this.

1. Purely MPI. Start singletons (or smaller groups), and connect via
sockets using MPI_Comm_join. You can set up your own DNS-like service, with
the goal of having the independent MPI jobs leave a trace there, such that
they can find each other and create the initial socket.

2. You could replace ssh/rsh with a no-op script (that returns success such
that the mpirun process thinks it successfully started the processes), and
then handcraft the environment as you did for GASNet.

3. We have support for DVM (Distributed Virtual Machine), which basically
creates an independent service that different mpiruns can connect to in
order to retrieve information. The mpiruns using this DVM run as singletons,
and fall back to MPI_Comm_connect/accept to recreate an MPI world.

Good luck,
  George.
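
For completeness, here is a minimal, error-handling-free sketch of option 1
(two singleton MPI processes joined over an ordinary TCP socket with
MPI_Comm_join). The command-line arguments and port are placeholders, and the
DNS-like rendezvous service mentioned above is left out:

#include <mpi.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    if (argc < 3) MPI_Abort(MPI_COMM_WORLD, 1);

    /* argv[1] = "server" or "client", argv[2] = port, argv[3] = server IP */
    int port = atoi(argv[2]);
    int fd;

    if (strcmp(argv[1], "server") == 0) {
        int lsock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_addr.s_addr = INADDR_ANY,
                                    .sin_port = htons(port) };
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsock, 1);
        fd = accept(lsock, NULL, NULL);
        close(lsock);
    } else {
        if (argc < 4) MPI_Abort(MPI_COMM_WORLD, 1);
        fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(port) };
        inet_pton(AF_INET, argv[3], &addr.sin_addr);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    }

    /* Both sides call MPI_Comm_join on their end of the socket. */
    MPI_Comm inter, merged;
    MPI_Comm_join(fd, &inter);
    MPI_Intercomm_merge(inter, 0, &merged);

    int rank, wsize;
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &wsize);
    printf("joined world: rank %d of %d\n", rank, wsize);

    close(fd);
    MPI_Comm_free(&merged);
    MPI_Comm_free(&inter);
    MPI_Finalize();
    return 0;
}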


On Fri, Mar 5, 2021 at 2:08 PM Ralph Castain via devel <
devel@lists.open-mpi.org> wrote:

> I'm afraid that won't work - there is no way for the job to "self
> assemble". One could create a way to do it, but it would take some
> significant coding in the guts of OMPI to get there.
>
>
> On Mar 5, 2021, at 9:40 AM, Gabriel Tanase via devel <
> devel@lists.open-mpi.org> wrote:
>
> Hi all,
> I decided to use mpi as the messaging layer for a multihost database.
> However, within my org I faced very strong opposition to allowing passwordless
> ssh or rsh. For security reasons we want to minimize the opportunities to
> execute arbitrary code on the db clusters. I don't want to run other
> things like slurm, etc.
>
> My question would be: Is there a way to start an mpi application by
> running certain binaries on each host? E.g., if my executable is "myapp",
> can I start a server (orted???) on host zero and then start myapp on each
> host with the right env variables set (for specifying the rank, num ranks,
> etc.)
>
> For example when using another messaging API (GASnet) I was able to start
> a server on host zero and then manually start the application binary on
> each host (with some environment variables properly set) and all was good.
>
> I tried to reverse engineer a little the env variables used by mpirun
> (mpirun -np 2 env) and then I copied these env variables in a shell script
> prior to invoking my hello world mpirun but I got an error message implying
> a server is not present:
>
> PMIx_Init failed for the following reason:
>
>   NOT-SUPPORTED
>
> Open MPI requires access to a local PMIx server to execute. Please ensure
> that either you are operating in a PMIx-enabled environment, or use
> "mpirun"
> to execute the job.
>
> Here is the shell script for host0:
>
> $ cat env1.sh
> #!/bin/bash
>
> export OMPI_COMM_WORLD_RANK=0
> export PMIX_NAMESPACE=mpirun-38f9d3525c2c-53291@1
> export PRTE_MCA_prte_base_help_aggregate=0
> export TERM_PROGRAM=Apple_Terminal
> export OMPI_MCA_num_procs=2
> export TERM=xterm-256color
> export SHELL=/bin/bash
> export PMIX_VERSION=4.1.0a1
> export OPAL_USER_PARAMS_GIVEN=1
> export TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T/
> export
> Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.HCXmdRI1WL/Render
> export PMIX_SERVER_URI41=mpirun-38f9d3525c2c-53291@0.0;tcp4://
> 192.168.0.180:52093
> export TERM_PROGRAM_VERSION=421.2
> export PMIX_RANK=0
> export TERM_SESSION_ID=18212D82-DEB2-4AE8-A271-FB47AC71337B
> export OMPI_COMM_WORLD_LOCAL_RANK=0
> export OMPI_ARGV=
> export OMPI_MCA_initial_wdir=/Users/igtanase/ompi
> export USER=igtanase
> export OMPI_UNIVERSE_SIZE=2
> export SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.PhcplcX3pC/Listeners
> export OMPI_COMMAND=./exe
> export __CF_USER_TEXT_ENCODING=0x54984577:0x0:0x0
> export
> OMPI_FILE_LOCATION=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T//prte.38f9d3525c2c.1419265399/dvm.53291/1/0
> export PMIX_SERVER_URI21=mpirun-38f9d3525c2c-53291@0.0;tcp4://
> 192.168.0.180:52093
> export
> PATH=/Users/igtanase/ompi/bin/:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
> export OMPI_COMM_WORLD_LOCAL_SIZE=2
> export PRTE_MCA_pmix_session_server=1
> export PWD=/Users/igtanase/ompi
> export OMPI_COMM_WORLD_SIZE=2
> export OMPI_WORLD_SIZE=2
> export LANG=en_US.UTF-8
> export XPC_FLAGS=0x0
> export PMIX_GDS_MODULE=hash
> export XPC_SERVICE_NAME=0
> export HOME=/Users/igtanase
> export SHLVL=2
> export PMIX_SECURITY_MODE=native
> export PMIX_HOSTNAME=38f9d3525c2c
> export LOGNAME=igtanase
> export OMPI_WORLD_LOCAL_SIZE=2
> export PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
> export PRTE_LAUNCHED=1
> export
> PMIX_SERVER_TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T//prte.38f9d3525c2c.1419265399/dvm.53291
> export OMPI_COMM_WORLD_NODE_RANK=0
> export OMPI_MCA_cpu_type=x86_64
> export PMIX_SYSTEM_TMPDIR=/var/folders/_k/c4_xr5vd14j97fw7j8vzmd45_9hjbq/T/
> export PMIX_SERVER_URI4=mpirun-38f9d3525c2c-53291@0.0;tcp4://
> 192.168.0.180:52093
> export OMPI_NUM_APP_CTX=1
> export SECURITYSESSIONID=186a9
> export PMIX_SERVER_URI3=mpirun-38f9d3525c2c-53291@0.0;tcp4://
> 192

Re: [OMPI devel] Warning

2020-05-15 Thread George Bosilca via devel
Luis,

Every now and then we remove warnings from the code. In this particular
instance the meaning of the code is correct (the ompi_info_t structure starts
with an opal_info_t), but removing the warnings is good policy.

In general we can either cast the ompi_info_t pointer directly to an
opal_info_t pointer, or access the super field in the ompi_info_t structure
(as is done in ompi/mpi/c/win_create_dynamic.c). As in this instance we
are explicitly passing one of the MPI predefined info objects (without an
equivalent at the OPAL level), it is clearer to cast it.
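
To make the two alternatives concrete, a hedged fragment (OMPI-internal code,
not standalone; it assumes the ompi/opal types and the win.h prototype quoted
below):

/* Option 1: cast the MPI-level handle, valid because ompi_info_t starts
 * with an opal_info_t: */
res = ompi_win_create_dynamic((opal_info_t *) MPI_INFO_NULL, comm, &win);

/* Option 2: pass the embedded opal_info_t through the "super" field, as
 * ompi/mpi/c/win_create_dynamic.c does (here "info" is an ompi_info_t *): */
res = ompi_win_create_dynamic(&info->super, comm, &win);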

  George.


On Fri, May 15, 2020 at 6:37 AM Luis via devel 
wrote:

> Hi OMPI devs,
>
> I was wondering if this warning is expected; if not, how should we
> internally call ompi_win_create_dynamic?
>
> res = ompi_win_create_dynamic(MPI_INFO_NULL, comm, &win);
>  ^
> In file included from pnbc_osc_internal.h:40,
>  from pnbc_osc_iallreduce.c:21:
> ../../../../ompi/win/win.h:143:42: note: expected ‘opal_info_t *’ {aka
> ‘struct opal_info_t *’} but argument is of type ‘struct ompi_info_t *’
>  int ompi_win_create_dynamic(opal_info_t *info, ompi_communicator_t
> *comm, ompi_win_t **newwin);
>
>
> Regards,
> Luis
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>


Re: [OMPI devel] OMPI master fatal error in pml_ob1_sendreq.c

2020-05-04 Thread George Bosilca via devel
John,

The common denominator across all these errors is an error from connect()
while trying to connect to 10.71.2.58 on port 1024. Who is 10.71.2.58? Is
the firewall open? Is port 1024 allowed to connect to?
  George.


On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <
devel@lists.open-mpi.org> wrote:

> Inline below...
>
> On 2020-05-04 11:09, Ralph Castain via devel wrote:
>
> Staring at this some more, I do have the following questions:
>
> * in your first case, it looks like "prte" was started from microway3 -
> correct?
>
> Yes, "prte" was started from microway3.
>
>
> * in the second case, that worked, it looks like "mpirun" was executed
> from microway1 - correct?
>
> No, "mpirun" was executed from microway3.
>
>
> * in the third case, you state that "mpirun" was again executed from
> microway3, and the process output confirms that
>
> Yes, "mpirun" was started from microway3.
>
>
> I'm wondering if the issue here might actually be that PRRTE expects the
> ordering of hosts in the hostfile to start with the host it is sitting on -
> i.e., if the node index number between the various daemons is getting
> confused. Can you perhaps see what happens with the failing cases if you
> put microway3 at the top of the hostfile and execute prte/mpirun from
> microway3 as before?
>
> OK, the first failing case:
>
> mic:/amd/home/jdelsign/PMIx>pterm
> pterm failed to initialize, likely due to no DVM being available
> mic:/amd/home/jdelsign/PMIx>cat myhostfile3
> microway3 slots=16
> microway1 slots=16
> microway2 slots=16
> mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile3 --daemonize
> mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name
> --personality ompi ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway1
> Hello from proc (2): microway2.totalviewtech.com
> --
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway1
>   PID:292266
>   Message:connect() to 10.71.2.58:1024 failed
>   Error:  No route to host (113)
> --
> [microway1:292266]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> mic:/amd/home/jdelsign/PMIx>hostname
> microway3.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>
>
> And the second failing test case:
>
> mic:/amd/home/jdelsign/PMIx>pterm
> pterm failed to initialize, likely due to no DVM being available
> mic:/amd/home/jdelsign/PMIx>cat myhostfile3+2
> microway3 slots=16
> microway2 slots=16
> mic:/amd/home/jdelsign/PMIx>
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name
> --personality ompi --hostfile myhostfile3+2 ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway3.totalviewtech.com
> --
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway3
>   PID:271144
>   Message:connect() to 10.71.2.58:1024 failed
>   Error:  No route to host (113)
> --
> [microway3.totalviewtech.com:271144]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> Hello from proc (2): microway2.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>
>
> So, AFAICT, host name order didn't matter.
>
> Cheers, John D.
>
>
>
>
>
> On May 4, 2020, at 7:34 AM, John DelSignore via devel <
> devel@lists.open-mpi.org> wrote:
>
> Hi folks,
>
> I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and
> rebuilt. I'm running a very simple test code on three Centos 7.[56] nodes
> named microway[123] over TCP. I'm seeing a fatal error similar to the
> following:
>
> [microway3.totalviewtech.com:227713]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
>
> The case of prun launching an OMPI code does not work correctly. The MPI
> processes seem to launch OK, but there is the following OMPI error at the
> point where the processes communicate. In the following case, I have DVM
> running on three nodes "microway[123]":
>
> mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name
> --personality ompi ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway1
> Hello from proc (2): microway2.totalviewtech.com
> --
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway1
>   PID:282716

Re: [OMPI devel] MPI_Info args to spawn - resolving deprecated values?

2020-04-08 Thread George Bosilca via devel
Deprecate, warn, and convert seems reasonable. But for how long?

As the number of automatic conversions OMPI supports has shown a tendency
to increase, and as these conversions happen all over the code base, we
might want to set up a well-defined path to deprecation: what has been
deprecated and when, for how long we intend to keep them or their conversion
around, and finally when they should be completely removed (or moved into a
state where we warn but do not convert).

  George.




On Wed, Apr 8, 2020 at 10:33 AM Jeff Squyres (jsquyres) via devel <
devel@lists.open-mpi.org> wrote:

> On Apr 8, 2020, at 9:51 AM, Ralph Castain via devel <
> devel@lists.open-mpi.org> wrote:
> >
> > We have deprecated a number of cmd line options (e.g., bynode, npernode,
> npersocket) - what do we want to do about their MPI_Info equivalents when
> calling comm_spawn?
> >
> > Do I silently convert them? Should we output a deprecation warning?
> Return an error?
>
>
> We should probably do something similar to what happens on the command
> line (i.e., warn and convert).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>


Re: [OMPI devel] --mca coll choices

2020-04-07 Thread George Bosilca via devel
All the collective decisions are made on the first collective on each
communicator. So basically you can change the MCA parameter or pvar before
the first collective on a communicator to affect how the decision selection
is made. I have posted a few examples over the years on the mailing list.
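
A hedged sketch of that approach using the MPI_T control-variable interface
(the cvar names below follow the usual coll/tuned MCA parameter naming and
are assumptions about your build; check them with ompi_info, and note they
may be registered read-only in some builds):

#include <mpi.h>
#include <stdio.h>

static void set_int_cvar(const char *name, int value)
{
    int idx, count;
    MPI_T_cvar_handle handle;

    if (MPI_T_cvar_get_index(name, &idx) != MPI_SUCCESS) {
        fprintf(stderr, "cvar %s not found\n", name);
        return;
    }
    MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
    MPI_T_cvar_write(handle, &value);    /* fails if the cvar is read-only */
    MPI_T_cvar_handle_free(&handle);
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    /* Enable the dynamic rules and pick a specific bcast algorithm ... */
    set_int_cvar("coll_tuned_use_dynamic_rules", 1);
    set_int_cvar("coll_tuned_bcast_algorithm", 3);

    /* ... before the first collective runs on the communicator we care about. */
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);

    int value = 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, comm);

    MPI_Comm_free(&comm);
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}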

  George.


On Tue, Apr 7, 2020 at 3:44 PM Josh Hursey via devel <
devel@lists.open-mpi.org> wrote:

> If you run with "--mca coll_base_verbose 10" it will display a priority
> list of the components chosen per communicator created. You will see
> something like:
> coll:base:comm_select: new communicator: MPI_COMM_WORLD (cid 0)
> coll:base:comm_select: selecting   basic, priority  10, Enabled
> coll:base:comm_select: selecting  libnbc, priority  10, Enabled
> coll:base:comm_select: selecting   tuned, priority  30, Enabled
>
> Where the 'tuned' component has the highest priority - so OMPI will pick
> its version of a collective operation (e.g., MPI_Bcast), if present, over
> the collective operation of lower priority component.
>
> I'm not sure if there is something finer-grained in each of the components
> on which specific collective function is being used or not.
>
> -- Josh
>
>
> On Tue, Apr 7, 2020 at 1:59 PM Luis Cebamanos 
> wrote:
>
>> Hi Josh,
>>
>> It makes sense, thanks. Is there a debug flag that prints out which
>> component is chosen?
>>
>> Regards,
>> Luis
>>
>>
>> On 07/04/2020 19:42, Josh Hursey via devel wrote:
>>
>> Good question. The reason for this behavior is that the Open MPI
>> coll(ective) framework does not require that every component (e.g.,
>> 'basic', 'tuned', 'libnbc') implement all of the collective operations. It
>> requires instead that the composition of the available components (e.g.,
>> basic + libnbc) provides the full set of collective operations.
>>
>> This is nice for a collective implementor since they can focus on the
>> collective operations they want in their component, but it does mean that
>> the end-user needs to know about this composition behavior.
>>
>> The command below will show you all of the available collective
>> components in your Open MPI build.
>> ompi_info | grep " coll"
>>
>> 'self' and 'libnbc' probably need to be included in all of your
>> runs, maybe 'inter' as well. The others like 'tuned' and 'basic' may be
>> able to be swapped out.
>>
>> To compare 'basic' vs 'tuned' you can run:
>>  --mca coll basic,libnbc,self
>> and
>>  --mca coll tuned,libnbc,self
>>
>> It is worth noting that some of the components like 'sync' are utilities
>> that add functionality on top of the other collectives - in the case of
>> 'sync' it will add a barrier before/after N collective calls.
>>
>>
>>
>> On Tue, Apr 7, 2020 at 10:54 AM Luis Cebamanos via devel <
>> devel@lists.open-mpi.org> wrote:
>>
>>> Hello developers,
>>>
>>> I am trying to debug the mca choices the library is taking for
>>> collective operations. The reason is that I want to force the library
>>> to choose a particular module and compare it with a different one.
>>> One thing I have noticed is that I can do:
>>>
>>> mpirun --mca coll basic,libnbc  --np 4 ./iallreduce
>>>
>>> for an "iallreduce" operation, but I get an error if I do
>>>
>>> mpirun --mca coll libnbc  --np 4 ./iallreduce
>>> or
>>> mpirun --mca coll basic  --np 4 ./iallreduce
>>>
>>>
>>> --
>>> Although some coll components are available on your system, none of
>>> them said that they could be used for iallgather on a new communicator.
>>>
>>> This is extremely unusual -- either the "basic", "libnbc" or "self"
>>> components
>>> should be able to be chosen for any communicator.  As such, this
>>> likely means that something else is wrong (although you should double
>>> check that the "basic", "libnbc" and "self" coll components are
>>> available on
>>> your system -- check the output of the "ompi_info" command).
>>> A coll module failed to finalize properly when a communicator that was
>>> using it was destroyed.
>>>
>>> This is somewhat unusual: the module itself may be at fault, or this
>>> may be a symptom of another issue (e.g., a memory problem).
>>>
>>>   mca_coll_base_comm_select(MPI_COMM_WORLD) failed
>>>--> Returned "Not found" (-13) instead of "Success" (0)
>>>
>>>
>>> Can you please help?
>>>
>>> Regards,
>>> Luis
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>
>>
>> --
>> Josh Hursey
>> IBM Spectrum MPI Developer
>>
>>
>>
>
> --
> Josh Hursey
> IBM Spectrum MPI Developer
>


Re: [OMPI devel] Dynamic topologies using MPI_Dist_graph_create

2020-04-06 Thread George Bosilca via devel
Bradley,

You call them through a blocking MPI function, so the operation is
completed by the time you return from the MPI call. Short story: you
should be safe calling dist_graph_create in a loop.
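
A minimal, hypothetical sketch of that usage pattern (each pass builds a new
reordered distributed-graph communicator, here a simple ring, and frees the
previous one; this is not Bradley's code):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int iter = 0; iter < 10; iter++) {
        /* Each rank declares one edge: itself -> (rank + 1) % size. */
        int source = rank;
        int degree = 1;
        int destination = (rank + 1) % size;

        MPI_Comm graph;
        MPI_Dist_graph_create(MPI_COMM_WORLD, 1, &source, &degree,
                              &destination, MPI_UNWEIGHTED, MPI_INFO_NULL,
                              1 /* reorder */, &graph);

        /* ... evaluate the candidate topology here ... */

        MPI_Comm_free(&graph);  /* the blocking call has returned, so it is
                                   safe to free and go around the loop again */
    }

    MPI_Finalize();
    return 0;
}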

The segfault indicates a memory issue with some of the internals of the
treematch. Do you have an example that reproduces this issue so that I can
take a look and fix it?

Thanks,
  George.


On Mon, Apr 6, 2020 at 11:31 AM Bradley Morgan via devel <
devel@lists.open-mpi.org> wrote:

> Hello OMPI Developers and Community,
>
> I am interested in investigating dynamic runtime optimization of MPI
> topologies using an evolutionary approach.
>
> My initial testing is resulting in segfaults/sigabrts when I attempt to
> iteratively create a new communicator with reordering enabled, e.g…
>
> [1] Signal: Segmentation fault: 11 (11)
> [1] Signal code: Address not mapped (1)
> [1] Failing at address: 0x0
> [1] [ 0] 0   libsystem_platform.dylib0x7fff69dff42d
> _sigtramp + 29
> [1] [ 1] 0   mpi_island_model_ea 0x00010032
> mpi_island_model_ea + 50
> [1] [ 2] 0   mca_topo_treematch.so   0x000105ddcbf9
> free_list_child + 41
> [1] [ 3] 0   mca_topo_treematch.so   0x000105ddcbf9
> free_list_child + 41
> [1] [ 4] 0   mca_topo_treematch.so   0x000105ddcd1f
> tm_free_tree + 47
> [1] [ 5] 0   mca_topo_treematch.so   0x000105dd6967
> mca_topo_treematch_dist_graph_create + 9479
> [1] [ 6] 0   libmpi.40.dylib 0x0001001992e0
> MPI_Dist_graph_create + 640
> [1] [ 7] 0   mpi_island_model_ea 0x000150c7
> main + 1831
>
>
> I see in some documentation that MPI_Dist_graph_create is not interrupt
> safe, which I interpret to mean it is not really designed for iterative use
> without some sort of safeguard to keep it from overlapping.
>
> I guess my question is, are the topology mapping functions really meant to
> be called in iteration, or are they meant for single use?
>
> If you guys think this is something that might be possible, do you have
> any suggestions for calling the topology mapping iteratively or any hints,
> docs, etc. on what else might be going wrong here?
>
>
> Thanks,
>
> Bradley
>
>
>
>
>
>


Re: [OMPI devel] Open MPI BTL TCP interface mapping

2020-01-09 Thread George Bosilca via devel
Will,

The 7134 issue is complex in its interactions with the rest of the TCP BTL,
and I could not find the time to look at it carefully enough (or test it on
AWS). But maybe you can address my main concern here. The #7134 interface
selection will have an impact on the traffic distribution among the
different sockets by altering the interface selection on the links we have
in the TCP BTL (which allow us to increase the bandwidth by multiplexing
the streams between peers). I have the feeling they are not nicely
collaborating to increase the total bandwidth, but that instead they will
prevent each other from functioning efficiently.

  George.


On Thu, Jan 9, 2020 at 2:36 PM Zhang, William via devel <
devel@lists.open-mpi.org> wrote:

> Hello devel,
>
>
>
> Thanks George for reviewing: https://github.com/open-mpi/ompi/pull/7167
>
>
>
> Can I get a review (not from Brian) for this patch as well:
> https://github.com/open-mpi/ompi/pull/7134
>
>
>
> These PR’s fix common matching bugs that users utilizing the tcp btl
> encounter. It has been proven to fix issue
> https://github.com/open-mpi/ompi/issues/7115 – it’s also the first
> utilization of the Reachability framework, which can provide valuable
> reference material.
>
>
>
> Thanks,
>
> William Zhang
>
>
>
> P.S.
>
> I will start increasing the frequency of these reminders, since these PR’s
> are 2+ months old.
>


Re: [OMPI devel] Reachable framework integration

2020-01-02 Thread George Bosilca via devel
Ralph,

I think the first use is still pending reviews (more precisely my review)
at https://github.com/open-mpi/ompi/pull/7134.

  George.


On Wed, Jan 1, 2020 at 9:53 PM Ralph Castain via devel <
devel@lists.open-mpi.org> wrote:

> Hey folks
>
> I can't find where the opal/reachable framework is being used in OMPI. I
> would like to utilize it in the PRRTE oob/tcp component, but need some
> guidance on how to do so, or pointers to an example.
>
> Ralph
>
>
>


Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
As indicated by this discussion, the proper usage of volatile is certainly
misunderstood.

However, the usage of volatile we are doing in this particular
instance is correct and valid even in multi-threaded cases. We are using it
for a __single-trigger__, __one-way__ synchronization similar to point 2 in
the link you posted, aka a variable modified in another context that is
used __once__.
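
For readers skimming the thread, a stripped-down illustration of the pattern
under discussion (one-shot flag, a poller plus a single writer). Whether plain
volatile is sufficient here under the C memory model is exactly what the rest
of this thread debates, so treat this as a sketch of the idiom, not an
endorsement:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int event_active = 1;   /* flipped exactly once, elsewhere */

static void *release_cb(void *arg)
{
    (void)arg;
    sleep(1);               /* stands in for the debugger-release event */
    event_active = 0;       /* single store; the poller only needs to see it eventually */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, release_cb, NULL);

    while (event_active) {  /* volatile forces a fresh load on every pass */
        usleep(100);        /* stands in for opal_progress() */
    }

    printf("released\n");
    pthread_join(&t, NULL);
    return 0;
}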

Here are some well-documented usage scenarios, with a way better
explanation than mine: [1] and [2].

  George.

[1] https://barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword (see the
"Multithreaded Applications" section)
[2] https://www.geeksforgeeks.org/understanding-volatile-qualifier-in-c/ (see
point 2)


On Tue, Nov 12, 2019 at 4:57 PM Austen W Lauria via devel <
devel@lists.open-mpi.org> wrote:

> I agree that the use of volatile is insufficient if we want to adhere to
> proper multi-threaded programming standards:
>
> "Note that volatile variables are not suitable for communication between
> threads; they do not offer atomicity, synchronization, or memory ordering.
> A read from a volatile variable that is modified by another thread without
> synchronization or concurrent modification from two unsynchronized threads
> is undefined behavior due to a data race."
>
> https://en.cppreference.com/w/c/language/volatile
>
> With proper synchronization, the volatile isn't needed at all for
> multi-threaded programming.
>
> While for this issue the problem is not the use of volatile, it's just a
> ticking time bomb either way. That said I don't know how important the MPIR
> path is here since I understand it is being deprecated.
>
>
> From: Larry Baker via devel 
> To: Open MPI Developers 
> Cc: Larry Baker , devel 
> Date: 11/12/2019 04:38 PM
> Subject: Re: [OMPI devel] [EXTERNAL] Open MPI v4.0.1: Process is hanging
> inside MPI_Init() when debugged with TotalView
> Sent by: "devel" 
> --
>
>
>
> "allowing us to weakly synchronize two threads" concerns me if the
> synchronization is important or must be reliable. I do not understand how
> volatile alone provides reliable synchronization without a mechanism to
> order visible changes to memory. If the flag(s) in question are supposed
> to indicate some state has changed in this weakly synchronized behavior,
> without proper memory barriers, there is no guarantee that memory changes
> will be viewed by the two threads in the same order they were issued. It is
> quite possible that the updated state that is flagged as being "good" or
> "done" or whatever will not yet be visible across multiple cores, even
> though the updated flag indicator may have become visible. Only if the flag
> itself is the data can this work, it seems to me. If it is a flag that
> something has been completed, volatile is not sufficient to guarantee the
> corresponding changes in state will be visible. I have had such experience
> from code that used volatile as a proxy for memory barriers. I was told "it
> has never been a problem". Rare events can, and do, occur. In my case, it
> did after over 3 years running the code without interruption. I doubt
> anyone had ever run the code for such a long sample interval. We found out
> because we missed recording an important earthquake a week after the race
> condition was tripped. Murphy's law triumphs again. :)
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> *ba...@usgs.gov* 
>
>
>
>On 12 Nov 2019, at 1:05:31 PM, George Bosilca via devel <
>   *devel@lists.open-mpi.org* > wrote:
>
>   If the issue was some kind of memory consistently between threads,
>   then printing that variable in the context of the debugger would show 
> the
>   value of debugger_event_active being false.
>
>   volatile is not a memory barrier, it simply forces a load for each
>   access of the data, allowing us to weakly synchronize two threads, as 
> long
>   as we dot expect the synchronization to be immediate.
>
>   Anyway, good to see that the issue has been solved.
>
>   George.
>
>
>   On Tue, Nov 12, 2019 at 2:25 PM John DelSignore via devel <
>   *devel@lists.open-mpi.org* > wrote:
>  Hi Austen,
>
>  Thanks for the reply. What I am seeing is consistent with your
>  thought, in that when I see the hang, one 

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
If the issue was some kind of memory consistency between threads, then
printing that variable in the context of the debugger would show the value
of debugger_event_active being false.

volatile is not a memory barrier, it simply forces a load for each access
of the data, allowing us to weakly synchronize two threads, as long as we
do not expect the synchronization to be immediate.

Anyway, good to see that the issue has been solved.

  George.


On Tue, Nov 12, 2019 at 2:25 PM John DelSignore via devel <
devel@lists.open-mpi.org> wrote:

> Hi Austen,
>
> Thanks for the reply. What I am seeing is consistent with your thought, in
> that when I see the hang, one or more processes did not have a flag
> updated. I don't understand how the Open MPI code works well enough to say
> if it is a memory barrier problem or not. It almost looks like a event
> delivery or dropped event problem to me.
>
> The place in the MPI_init() code where the MPI processes hang and the
> number of "hung" processes seems to *vary* from run to run. In some cases
> the processes are waiting for an event or waiting for a fence (whatever
> that is).
>
> I did the following run today, which shows that it can hang waiting for an
> event that apparently was not generated or was dropped:
>
>1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
>2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
>3. TV attached to all four of the MPI processes *and* left all five
>processes stopped.
>4. Continued all of the processes/threads and let them run freely for
>about 60 seconds. They should have run to completion in that amount of 
> time.
>5. Halted all of the processes. I included an aggregated backtrace of
>all of the processes below.
>
> In this particular run, all four MPI processes were waiting in
> ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:
>
> /* let the MPI progress engine run while we wait for debugger
> release */
> OMPI_WAIT_FOR_COMPLETION(debugger_event_active);
>
> I don't know how that is supposed to work, but I can clearly see that
> debugger_event_active was *true* in all of the processes, even though TV
> set MPIR_debug_gate to 1:
>
> d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
> Thread 2.1:
>  debugger_event_active = true (1)
> Thread 3.1:
>  debugger_event_active = true (1)
> Thread 4.1:
>  debugger_event_active = true (1)
> Thread 5.1:
>  debugger_event_active = true (1)
> d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
> Thread 2.1:
>  MPIR_debug_gate = 0x0001 (1)
> Thread 3.1:
>  MPIR_debug_gate = 0x0001 (1)
> Thread 4.1:
>  MPIR_debug_gate = 0x0001 (1)
> Thread 5.1:
>  MPIR_debug_gate = 0x0001 (1)
> d1.<>
>
> I think the _*release_fn()* function in *rte_orte_module.c* is supposed
> to set *debugger_event_active* to *false*, but that apparently did not
> happen in this case. So, AFAICT, the reason debugger_event_active would
> *not* be set to false is that the event was never delivered, so the
> _release_fn() function was never called. If that's the case, then the lack
> of a memory barrier is probably a moot point, and the problem is likely
> related to event generation or dropped events.
>
> Cheers, John D.
>
> FWIW: Here's the aggregated backtrace after the whole job was allowed to
> run freely for about 60 seconds, and then stopped:
>
> d1.<> f g w -g f+l
>
> +/
>  +__clone : 5:12[0-3.2-3, p1.2-5]
>  |+start_thread
>  | +listen_thread : 1:2[p1.3, p1.5]
>  | |+__select_nocancel
>  | +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
>  | |+opal_libevent2022_event_base_loop@event.c#1630
>  | | +poll_dispatch@poll.c#165
>  | |  +__poll_nocancel
>  | +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
>  | |+opal_libevent2022_event_base_loop@event.c#1630
>  | | +epoll_dispatch@epoll.c#407
>  | |  +__epoll_wait_nocancel
>  | +progress_engine : 1:2[p1.2, p1.4]
>  |  +opal_libevent2022_event_base_loop@event.c#1630
>  |   +epoll_dispatch@epoll.c#407 : 1:1[p1.2]
>  |   |+__epoll_wait_nocancel
>  |   +poll_dispatch@poll.c#165 : 1:1[p1.4]
>  |+__poll_nocancel
>  +_start : 5:5[0-3.1, p1.1]
>   +__libc_start_main
>+*main@cpi.c#27  : 4:4[0-3.1]*
>|+PMPI_Init@pinit.c#67
>| +ompi_mpi_init@ompi_mpi_init.c#890
>|  +*ompi_rte_wait_for_debugger@rte_orte_module.c#196
> *
>|   +opal_progress@opal_progress.c#245 : 1:1[0.1]
>|   |+opal_progress_events@opal_progress.c#191
>|   | +opal_libevent2022_event_base_loop@event.c#1630
>|   |  +poll_dispatch@poll.c#165
>|   |   +__poll_nocancel
>|   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
>|+opal_progress_events@opal_progress.c#191
>| +opal_libevent2022_event_base_loop@event.c#1630
>|  +poll_dispatch@poll.c#165
>|   +__poll_nocancel
>+orterun : 1:1[p1.1]
> +opal_libevent2022_event_base_loop@event.c#1630
>  +poll_dispatch@poll.c#165
>   +__poll_nocancel
>
> d1.<>
>
>
> On 11/12/19 9:4

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
I don't think there is a need for any protection around that variable. It will
change value only once (in a callback triggered from opal_progress), and
the volatile guarantees that loads will be issued for every access, so the
waiting thread will eventually notice the change.

 George.


On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <
devel@lists.open-mpi.org> wrote:

> Could it be that some processes are not seeing the flag get updated? I
> don't think just using a simple while loop with a volatile variable is
> sufficient in all cases in a multi-threaded environment. It's my
> understanding that the volatile keyword just tells the compiler to not
> optimize or do anything funky with it - because it can change at any time.
> However, this doesn't provide any memory barrier - so it's possible that
> the thread polling on this variable is never seeing the update.
>
> Looking at the code - I see:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
> do { \
> opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
> "%s lazy waiting on RTE event at %s:%d", \
> OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
> __FILE__, __LINE__); \
> while ((flg)) { \
> opal_progress(); \
> usleep(100); \
> } \
> }while(0);
>
> I think replacing that with:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
> do { \
> opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
> "%s lazy waiting on RTE event at %s:%d", \
> OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
> __FILE__, __LINE__); \
>
> pthread_mutex_lock(&lock); \
> while ((flg)) { \
> /* releases the lock while waiting for a signal from another thread to wake up */ \
> pthread_cond_wait(&cond, &lock); \
> } \
> pthread_mutex_unlock(&lock); \
>
> }while(0);
>
> is much more standard when dealing with threads updating a shared variable
> - and might lead to a more expected result in this case.
>
> On the other end, this would require the thread updating this variable to:
>
> pthread_mutex_lock(&lock);
> flg = new_val;
> pthread_cond_signal(&cond);
> pthread_mutex_unlock(&lock);
>
> This provides the memory barrier for the thread polling on the flag to see
> the update - something the volatile keyword doesn't do on its own. I think
> it's also much cleaner as it eliminates an arbitrary sleep from the code -
> which I see as a good thing as well.
>
>
>
> From: "Ralph Castain via devel" 
> To: "OpenMPI Devel" 
> Cc: "Ralph Castain" 
> Date: 11/12/2019 09:24 AM
> Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging
> inside MPI_Init() when debugged with TotalView
> Sent by: "devel" 
> --
>
>
>
>
>
> > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <
> devel@lists.open-mpi.org> wrote:
> >
> > John,
> >
> > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> >
> >
> > is a simple loop that periodically checks the (volatile) "active"
> condition, that is expected to be updated by an other thread.
> > So if you set your breakpoint too early, and **all** threads are stopped
> when this breakpoint is hit, you might experience
> > what looks like a race condition.
> > I guess a similar scenario can occur if the breakpoint is set in
> mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
> > from sending the message to all MPI tasks)
> >
> >
> >
> > Ralph,
> >
> > does the v4.0.x branch still need the oob/tcp progress thread running
> inside the MPI app?
> > or are we missing some commits (since all interactions with mpirun/orted
> are handled by PMIx, at least in the master branch) ?
>
> IIRC, that progress thread only runs if explicitly asked to do so by MCA
> param. We don't need that code any more as PMIx takes care of it.
>
> >
> > Cheers,
> >
> > Gilles
> >
> > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
> >> Hi John
> >>
> >> Sorry to say, but there is no way to really answer your question as the
> OMPI community doesn't actively test MPIR support. I haven't seen any
> reports of hangs during MPI_Init from any release series, including 4.x. My
> guess is that it may have something to do with the debugger interactions as
> opposed to being a true race condition.
> >>
> >> Ralph
> >>
> >>
> >>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <
> devel@lists.open-mpi.org  >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> An LLNL TotalView user on a Mac reported that their MPI job was
> hanging inside MPI_Init() when started under the control of TotalView. They
> were using Open MPI 4.0.1, and TotalView was using the MPIR Interface
> (sorry, we don't support the PMIx debugging hooks yet).
> >>>
> >>> I was able to reproduce the hang on my own Linux system with my own
> build of Open MPI 4.0.1, which I built 

Re: [OMPI devel] Anyone have any thoughts about cache-alignment issue in osc/sm?

2019-09-13 Thread George Bosilca via devel
I think we can remove the header, we don't use it anymore. I commented on
the issue.

  George.


On Thu, Sep 12, 2019 at 5:23 PM Geoffrey Paulsen via devel <
devel@lists.open-mpi.org> wrote:

> Does anyone have any thoughts about the cache-alignment issue in osc/sm,
> reported in https://github.com/open-mpi/ompi/issues/6950?
>

Re: [OMPI devel] Memory performance with Bcast

2019-03-21 Thread George Bosilca
Marcin,

I am not sure I understand your question: a bcast is a collective operation
that must be posted by all participants. Independently of the level at which
the bcast is serviced, if some of the participants have not posted their
participation in the collective, only partial progress can be made.

  George.


On Thu, Mar 21, 2019 at 12:24 PM Joshua Ladd  wrote:

> Marcin,
>
> HPC-X implements the MPI BCAST operation by leveraging hardware multicast
> capabilities. Starting with HPC-X v2.3 we introduced a new multicast based
> algorithm for large messages as well. Hardware multicast scales as O(1)
> modulo switch hops. It is the most efficient way to broadcast a message in
> an IB network.
>
> Hope this helps.
>
> Best,
>
> Josh
>
>
>
> On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski <
> marcin.krotkiew...@gmail.com> wrote:
>
>> Thanks, George! So, the function you mentioned is used when I turn off
>> HCOLL and use OpenMPI's tuned coll instead. That helps a lot. Another thing
>> that makes me think is that in my case the data is sent to the targets
>> asynchronously, or rather - it is a 'put' operation in nature, and the
>> targets don't know, when the data is ready. I guess the tree algorithms you
>> mentioned require active participation of all nodes, otherwise the
>> algorithm will not progress? Is it enough to call any MPI routine to assure
>> progression, or do I have to call the matching Bcast?
>>
>> Anyone from Mellanox here, who knows how HCOLL does this internally?
>> Especially on the EDR architecture. Is there any hardware aid?
>>
>> Thanks!
>>
>> Marcin
>>
>>
>> On 3/20/19 5:10 PM, George Bosilca wrote:
>>
>> If you have support for FCA then it might happen that the collective will
>> use the hardware support. In any case, most of the bcast algorithms have a
>> logarithmic behavior, so there will be at most O(log(P)) memory accesses on
>> the root.
>>
>> If you want to take a look at the code in OMPI to understand what
>> function is called in your specific case head to ompi/mca/coll/tuned/ and
>> search for the ompi_coll_tuned_bcast_intra_dec_fixed function
>> in coll_tuned_decision_fixed.c.
>>
>>   George.
>>
>>
>> On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <
>> marcin.krotkiew...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> I'm wondering about the details of Bcast implementation in OpenMPI. I'm
>>> specifically interested in IB interconnects, but information about other
>>> architectures (and OpenMPI in general) would also be very useful.
>>>
>>> I am working with a code, which sends the same  (large) message to a
>>> bunch of 'neighboring' processes. Somewhat like a ghost-zone exchange,
>>> but the message is the same for all neighbors. Since memory bandwidth is
>>> a scarce resource, I'd like to make sure we send the message with fewest
>>> possible memory accesses.
>>>
>>> Hence the question: what does OpenMPI (and specifically for the IB case
>>> - the HPCX) do in such case? Does it get the buffer from memory O(1)
>>> times to send it to n peers, and the broadcast is orchestrated by the
>>> hardware? Or does it have to read the memory O(n) times? Is it more
>>> efficient to use Bcast, or is it the same as implementing the operation
>>> by n distinct send / put operations? Finally, is there any way to use
>>> the RMA put method with multiple targets, so that I only have to read
>>> the host memory once, and the switches / HCA take care of the rest?
>>>
>>> Thanks a lot for any insights!
>>>
>>> Marcin
>>>
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/devel
>>
>>
>> ___
>> devel mailing 
>> listde...@lists.open-mpi.orghttps://lists.open-mpi.org/mailman/listinfo/devel
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Memory performance with Bcast

2019-03-20 Thread George Bosilca
If you have support for FCA then it might happen that the collective will
use the hardware support. In any case, most of the bcast algorithms have a
logarithmic behavior, so there will be at most O(log(P)) memory accesses on
the root.

If you want to take a look at the code in OMPI to understand what function
is called in your specific case head to ompi/mca/coll/tuned/ and search for
the ompi_coll_tuned_bcast_intra_dec_fixed function
in coll_tuned_decision_fixed.c.
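
As a rough illustration of the logarithmic behavior mentioned above, a
binomial-tree broadcast boils down to something like the following sketch
(root fixed at rank 0; not the actual tuned implementation, which also
handles segmentation, non-zero roots and datatype packing):

#include <mpi.h>

/* Minimal binomial-tree bcast sketch: every process receives the buffer at
 * most once and forwards it to at most log2(P) children, so the root reads
 * its buffer only O(log P) times. */
static void bcast_binomial_sketch(void *buf, int count, MPI_Datatype dtype,
                                  MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Receive once from the parent (all ranks except the root). */
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, dtype, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Forward to the children below the bit at which we received. */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size) {
            MPI_Send(buf, count, dtype, rank + mask, 0, comm);
        }
        mask >>= 1;
    }
}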

  George.


On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <
marcin.krotkiew...@gmail.com> wrote:

> Hi!
>
> I'm wondering about the details of Bcast implementation in OpenMPI. I'm
> specifically interested in IB interconnects, but information about other
> architectures (and OpenMPI in general) would also be very useful.
>
> I am working with a code, which sends the same  (large) message to a
> bunch of 'neighboring' processes. Somewhat like a ghost-zone exchange,
> but the message is the same for all neighbors. Since memory bandwidth is
> a scarce resource, I'd like to make sure we send the message with fewest
> possible memory accesses.
>
> Hence the question: what does OpenMPI (and specifically for the IB case
> - the HPCX) do in such case? Does it get the buffer from memory O(1)
> times to send it to n peers, and the broadcast is orchestrated by the
> hardware? Or does it have to read the memory O(n) times? Is it more
> efficient to use Bcast, or is it the same as implementing the operation
> by n distinct send / put operations? Finally, is there any way to use
> the RMA put method with multiple targets, so that I only have to read
> the host memory once, and the switches / HCA take care of the rest?
>
> Thanks a lot for any insights!
>
> Marcin
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Proposal: Github "stale" bot

2019-03-19 Thread George Bosilca
:+1:

George.



On Tue, Mar 19, 2019 at 12:45 PM Jeff Squyres (jsquyres) via devel <
devel@lists.open-mpi.org> wrote:

> I have proposed the use of the Github Probot "stale" bot:
>
> https://probot.github.io/apps/stale/
> https://github.com/open-mpi/ompi/pull/6495
>
> The short version of what this bot does is:
>
> 1. After a period of inactivity, a label will be applied to mark an issue
> as stale, and optionally a comment will be posted to notify contributors
> that the Issue or Pull Request will be closed.
>
> 2. If the Issue or Pull Request is updated, or anyone comments, then the
> stale label is removed and nothing further is done until it becomes stale
> again.
>
> 3. If no more activity occurs, the Issue or Pull Request will be
> automatically closed with an optional comment.
>
> Specifically, the PR I propose sets the Stalebot config as:
>
> - After 60 days of inactivity, issues/PRs will get a warning
> - After 7 more days of inactivity, issues/PRs will be closed and the "Auto
> closed" label will be applied
> - Issues/PRs with the "help wanted" or "good first issue" will be ignored
> by the Stalebot
>
> Thoughts?
>
> If we move ahead with this: given that this will apply to *all* OMPI
> issues/PRs, we might want to take a whack at closing a whole pile of old
> issues/PRs first before unleashing the Stalebot.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Error in TCP BTL??

2018-10-01 Thread George Bosilca
https://github.com/open-mpi/ompi/pull/5819 will ease the pain. I couldn't
figure out what exactly triggers this, but apparently recent versions of OSX
refuse to bind with port 0.

  George.



On Mon, Oct 1, 2018 at 4:12 PM Jeff Squyres (jsquyres) via devel <
devel@lists.open-mpi.org> wrote:

> I get that 100% time in the runs on MacOS, too (with today's HEAD):
>
> --
> $ mpirun -np 4 --mca btl tcp,self ring_c
> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
> [JSQUYRES-M-26UT][[5535,1],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect]
> bind() failed: Invalid argument (22)
> [JSQUYRES-M-26UT:85104] *** An error occurred in MPI_Send
> [JSQUYRES-M-26UT:85104] *** reported by process [362741761,0]
> [JSQUYRES-M-26UT:85104] *** on communicator MPI_COMM_WORLD
> [JSQUYRES-M-26UT:85104] *** MPI_ERR_OTHER: known error not in list
> [JSQUYRES-M-26UT:85104] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [JSQUYRES-M-26UT:85104] ***and potentially your MPI job)
> --
>
>
> > On Oct 1, 2018, at 2:12 PM, Ralph H Castain  wrote:
> >
> > I’m getting this error when trying to run a simple ring program on my
> Mac:
> >
> >
> [Ralphs-iMac-2.local][[21423,14],0][btl_tcp_endpoint.c:742:mca_btl_tcp_endpoint_start_connect]
> bind() failed: Invalid argument (22)
> >
> > Anyone recognize the problem? It causes the job to immediately abort.
> This is with current head of master this morning - it was working when I
> last used it, but it has been an unknown period of time.
> > Ralph
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread George Bosilca
Sorry, I missed the 4.0 on the PR (despite it being the first thing in the title).

  George.


> On Sep 20, 2018, at 22:15 , Ralph H Castain  wrote:
> 
> That’s why we are leaving it in master - only removing it from release branch 
> 
> Sent from my iPhone
> 
> On Sep 20, 2018, at 7:02 PM, George Bosilca  <mailto:bosi...@icl.utk.edu>> wrote:
> 
>> Why not simply ompi_ignore it? Removing a component to bring it back later 
>> would force us to lose all history. I would rather add an .ompi_ignore and 
>> give power users an opportunity to continue playing with it.
>> 
>>   George.
>> 
>> 
>> On Thu, Sep 20, 2018 at 8:04 PM Ralph H Castain > <mailto:r...@open-mpi.org>> wrote:
>> I already suggested the configure option, but it doesn’t solve the problem. 
>> I wouldn’t be terribly surprised to find that Cray also has an undetected 
>> problem given the nature of the issue - just a question of the amount of 
>> testing, variety of environments, etc.
>> 
>> Nobody has to wait for the next major release, though that isn’t so far off 
>> anyway - there has never been an issue with bringing in a new component 
>> during a release series.
>> 
>> Let’s just fix this the right way and bring it into 4.1 or 4.2. We may want 
>> to look at fixing the osc/rdma/ofi bandaid as well while we are at it.
>> 
>> Ralph
>> 
>> 
>>> On Sep 20, 2018, at 4:45 PM, Patinyasakdikul, Thananon 
>>> mailto:tpati...@vols.utk.edu>> wrote:
>>> 
>>> I understand and agree with your point. My initial email is just out of 
>>> curiosity.
>>> 
>>> Howard tested this BTL for Cray in the summer as well. So this seems to 
>>> only affect OPA hardware.
>>> 
>>> I just remember that in the summer, I have to make some change in libpsm2 
>>> to get this BTL to work for OPA.  Maybe this is the problem as the default 
>>> libpsm2 won't work.
>>> 
>>> So maybe we can fix this in configure step to detect version of libpsm2 and 
>>> dont build if we are not satisfied.
>>> 
>>> Another idea is maybe we dont build this BTL by default. So the user with 
>>> Cray hardware can still use it if they want. (Just rebuild with the btl)  - 
>>> We just need to verify if it still works on Cray.  This way, OFI 
>>> stakeholders does not have to wait until next major release to get this in.
>>> 
>>> 
>>> Arm
>>> 
>>> 
>>> On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain >> <mailto:r...@open-mpi.org>> wrote:
>>> I suspect it is a question of what you tested and in which scenarios. 
>>> Problem is that it can bite someone and there isn’t a clean/obvious 
>>> solution that doesn’t require the user to do something - e.g., like having 
>>> to know that they need to disable a BTL. Matias has proposed an mca-based 
>>> approach, but I would much rather we just fix this correctly. Bandaids have 
>>> a habit of becoming permanently forgotten - until someone pulls on it and 
>>> things unravel.
>>> 
>>> 
>>>> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon 
>>>> mailto:tpati...@vols.utk.edu>> wrote:
>>>> 
>>>> In the summer, I tested this BTL along with the MTL and was able to use 
>>>> both of them interchangeably with no problem. I don't know what changed. 
>>>> libpsm2?
>>>> 
>>>> 
>>>> Arm
>>>> 
>>>> 
>>>> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain >>> <mailto:r...@open-mpi.org>> wrote:
>>>> We have too many discussion threads overlapping on the same email chain - 
>>>> so let’s break the discussion on the OFI problem into its own chain.
>>>> 
>>>> We have been investigating this locally and found there are a number of 
>>>> conflicts between the MTLs and the OFI/BTL stepping on each other. The 
>>>> correct solution is to move endpoint creation/reporting into a the 
>>>> opal/mca/common area, but that is going to take some work and will likely 
>>>> impact release schedules.
>>>> 
>>>> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix 
>>>> the problem in master, and then consider bringing it back as a package to 
>>>> v4.1 or v4.2.
>>>> 
>>>> Comments? If we agree, I’ll file a PR to remove it.
>>>> Ralph
>>>> 
>>>> 
>>>>> Begin forwarded mess

Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread George Bosilca
Why not simply ompi_ignore it? Removing a component to bring it back later
would force us to lose all history. I would rather add an .ompi_ignore
and give power users an opportunity to continue playing with it.

  George.


On Thu, Sep 20, 2018 at 8:04 PM Ralph H Castain  wrote:

> I already suggested the configure option, but it doesn’t solve the
> problem. I wouldn’t be terribly surprised to find that Cray also has an
> undetected problem given the nature of the issue - just a question of the
> amount of testing, variety of environments, etc.
>
> Nobody has to wait for the next major release, though that isn’t so far
> off anyway - there has never been an issue with bringing in a new component
> during a release series.
>
> Let’s just fix this the right way and bring it into 4.1 or 4.2. We may
> want to look at fixing the osc/rdma/ofi bandaid as well while we are at it.
>
> Ralph
>
>
> On Sep 20, 2018, at 4:45 PM, Patinyasakdikul, Thananon <
> tpati...@vols.utk.edu> wrote:
>
> I understand and agree with your point. My initial email is just out of
> curiosity.
>
> Howard tested this BTL for Cray in the summer as well. So this seems to
> only affect OPA hardware.
>
> I just remember that in the summer, I have to make some change in libpsm2
> to get this BTL to work for OPA.  Maybe this is the problem as the default
> libpsm2 won't work.
>
> So maybe we can fix this in configure step to detect version of libpsm2
> and dont build if we are not satisfied.
>
> Another idea is maybe we dont build this BTL by default. So the user with
> Cray hardware can still use it if they want. (Just rebuild with the btl)  -
> We just need to verify if it still works on Cray.  This way, OFI
> stakeholders does not have to wait until next major release to get this in.
>
>
> Arm
>
>
> On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain  wrote:
>
>> I suspect it is a question of what you tested and in which scenarios.
>> Problem is that it can bite someone and there isn’t a clean/obvious
>> solution that doesn’t require the user to do something - e.g., like having
>> to know that they need to disable a BTL. Matias has proposed an mca-based
>> approach, but I would much rather we just fix this correctly. Bandaids have
>> a habit of becoming permanently forgotten - until someone pulls on it and
>> things unravel.
>>
>>
>> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <
>> tpati...@vols.utk.edu> wrote:
>>
>> In the summer, I tested this BTL along with the MTL and was able to use
>> both of them interchangeably with no problem. I don't know what changed.
>> libpsm2?
>>
>>
>> Arm
>>
>>
>> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain  wrote:
>>
>>> We have too many discussion threads overlapping on the same email chain
>>> - so let’s break the discussion on the OFI problem into its own chain.
>>>
>>> We have been investigating this locally and found there are a number of
>>> conflicts between the MTLs and the OFI/BTL stepping on each other. The
>>> correct solution is to move endpoint creation/reporting into a the
>>> opal/mca/common area, but that is going to take some work and will likely
>>> impact release schedules.
>>>
>>> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix
>>> the problem in master, and then consider bringing it back as a package to
>>> v4.1 or v4.2.
>>>
>>> Comments? If we agree, I’ll file a PR to remove it.
>>> Ralph
>>>
>>>
>>> Begin forwarded message:
>>>
>>> *From: *Peter Kjellström 
>>> *Subject: **Re: [OMPI devel] Announcing Open MPI v4.0.0rc1*
>>> *Date: *September 20, 2018 at 5:18:35 AM PDT
>>> *To: *"Gabriel, Edgar" 
>>> *Cc: *Open MPI Developers 
>>> *Reply-To: *Open MPI Developers 
>>>
>>> On Wed, 19 Sep 2018 16:24:53 +
>>> "Gabriel, Edgar"  wrote:
>>>
>>> I performed some tests on our Omnipath cluster, and I have a mixed
>>> bag of results with 4.0.0rc1
>>>
>>>
>>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>>> very similar results.
>>>
>>> compute-1-1.local.4351PSM2 has not been initialized
>>> compute-1-0.local.3826PSM2 has not been initialized
>>>
>>>
>>> yup I too see these.
>>>
>>> mpirun detected that one or more processes exited with non-zero
>>> status, thus causing the job to be terminated. The first process to
>>> do so was:
>>>
>>>  Process name: [[38418,1],1]
>>>  Exit code:255
>>>
>>>  
>>> 
>>>
>>>
>>> yup.
>>>
>>>
>>> 2.   The ofi mtl does not work at all on our Omnipath cluster. If
>>> I try to force it using ‘mpirun –mca mtl ofi …’ I get the following
>>> error message.
>>>
>>>
>>> Yes ofi seems broken. But not even disabling it helps me completely (I
>>> see "mca_btl_ofi.so   [.] mca_btl_ofi_component_progress" in my
>>> perf top...
>>>
>>> 3.   The openib btl component is always getting in the way with
>>> annoying warnings. It is not really used, but constantly complains:
>>>
>>> ...
>>>

Re: [OMPI devel] Network simulation from within OpenMPI

2018-09-17 Thread George Bosilca
Millian,

The level of interposition you need depends on what exactly you are trying
to simulate and at what granularity. If you want to simulate the different
protocols (small, eager, PUT, GET, pipelining) supported by our default PML,
OB1, then you need to provide a BTL (with the exclusive flag and with
support for self and local processes). If, instead, you are interested in a
higher-level but simplified simulation of messages, then you can go directly
for a PML. If you plan to provide support for one-sided communications, you
might also want to implement an OSC module.

  George.

PS: Ping me outside the mailing list, I'll be happy to go over the
high-level design of the different components related to communications in
the OMPI stack.



On Mon, Sep 17, 2018 at 1:08 PM Millian Poquet 
wrote:

> Hello everyone,
>
> We are working on the simulation of MPI applications with the SimGrid
> simulation framework. We would like to make simulation possible from within
> OpenMPI (instead of using our own MPI implementation). To this end, we plan
> to implement a set of modules so that network transfers are simulated
> instead of being actually performed.
>
> We have hacked a working ODLS to spawn an additional orchestration process
> and to hack the processes' environment. We are currently trying to
> implement simulation in a custom BTL.
>
> Do you think that BTL is our best entry point component to simulate the
> network transfers? The PML also looks very appealing but at this moment we
> do not fully understand the role of all OpenMPI components and how they
> interact.
>
> Best regards,
> --
> Dr. Millian Poquet
> Postdoc Researcher, Myriads Team, Inria/IRISA
> https://mpoquet.github.io
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Collective communication algorithms

2018-03-26 Thread George Bosilca
Mikhail,

Some of these algorithms have been left out for practical reasons: they
did not behave better than existing algorithms in any case. Others (such
as Traff's butterfly or the double tree) were left out because the
implementation effort shifted to other types of collectives, or because
there was a lack of manpower.

It would be interesting to have optimized algorithms for some of the
collectives that have been ignored for a long time (scan and exscan), as
well as some of the newer algorithms (Traff's butterfly). I'm looking
forward to your pull request.
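
For the exscan case, for example, a recursive-doubling version boils down to
the sketch below (illustrative only: it assumes a commutative operation and a
power-of-two communicator size; a real implementation obviously has to handle
the general cases):

#include <mpi.h>

/* Recursive-doubling exclusive scan (sum) sketch.  On return *result holds
 * the sum of sendval over all lower ranks; it is left untouched on rank 0,
 * as MPI_Exscan allows. */
static void exscan_recursive_doubling(double sendval, double *result,
                                      MPI_Comm comm)
{
    int rank, size, have_result = 0;
    double partial = sendval;   /* running sum over the ranks covered so far */

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        double recvd;
        MPI_Sendrecv(&partial, 1, MPI_DOUBLE, partner, 0,
                     &recvd,   1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        if (partner < rank) {   /* the partner covers only lower ranks */
            *result = have_result ? *result + recvd : recvd;
            have_result = 1;
        }
        partial += recvd;       /* partial now covers both halves */
    }
}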

  George.




On Fri, Mar 23, 2018 at 11:58 PM, Mikhail Kurnosov 
wrote:

> Dear Devel List,
>
> Current version of collective communication frameworks basic, base and
> tuned does not include implementations of some well-known algorithms:
>
> * MPI_Bcast: knomial, knomial with segmentation, binomial scatter +
> recursive doubling allgather, binomial scatter + ring allgather
> * MPI_Gather: binomial with segmentation
> * MPI_Reduce_scatter_block: Traff’s butterfly, recursive doubling with
> vector halving, recursive doubling, pairwise exch. with scattered
> destinations
> * MPI_Exscan: recursive doubling
> * MPI_Scan: recursive doubling
> * MPI_Alleduce: Rabensifner’s, knomial
> * MPI_Reduce: Rabensifner’s, knomial
>
> As far as I know, some of well-known collective algorithms have already
> been implemented in the Open MPI. But for the different reasons these
> algorithms are not used.
> I am implementing the above algorithms for my research. Does it make
> sense to implement these algorithms in the Open MPI?
>
>
> Thanks,
> Mikhail Kurnosov
>
> --
> Computer Systems Department
> Siberian State University of Telecommunications and Information Sciences
> 86 Kirova str., Novosibirsk, Russia
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Default tag for OFI MTL

2018-03-04 Thread George Bosilca
On Sat, Mar 3, 2018 at 6:35 PM, Cabral, Matias A 
wrote:

> Hi George,
>
>
>
> Thanks for the feedback, appreciated.  Few questions/comments:
>
>
>
> > Regarding the tag with your proposal the OFI MTL will support a wider
> range of tags than the OB1 PML, where we are limited to 16 bits. Just make
> sure you correctly expose your tag limit via the MPI_TAG_UB.
>
>
>
> I will take a look at MPI_TAG_UB.
>

It is a predefined attribute and should be automatically set by the MPI
layer using the pml_max_tag field of the selected PML.


> > I personally would prefer a solution where we can alter the
> distribution of bits between bits in the cid and tag at compile time.
>
>
>
> Sure, I can do this. What would you suggest for plan B? Fewer tag bits and
> more cid ones? Numbers?
>

As I mentioned, the PML (OB1) only supports 16-bit tags (in fact 15, because
negative tags are reserved for OMPI internal usage). I do not recall any
complaints about this limit. Targeting consistency across PMLs provides
user-friendliness, thus a default of 16 bits for the tag and everything
else for the cid might be a sensible choice.

George.


> >. We can also envision this selection to be driven by an MCA parameter,
> but this might be too costly
>
>
>
> I did think about it. However, as you say, I’m not yet convinced it is
> worth it:
>
> a)  I will be soon reviewing synchronous send protocol. Not reviewed
> thoroughly yet, but I’m quite sure I can reduce it to use 2 bits (maybe
> just 1). Freeing 2 (or 3) more bits for cids or ranks.
>
> b)  Most of the providers TODAY effectively support FI_REMOTE_CQ_DATA
> and FI_DIRECTED_RECV (psm2, gni, verbs;ofi_rxm, sockets). This is just a
> fallback for potential new ones.  FI_DIRECTED_RECV is necessary to
> discriminate the source at RX time when the source is not in the tag.
>
> c)   I will include build_time_plan_B you just suggested ;)
>
>
>
> Thanks, again.
>
>
>
> _MAC
>
>
>
> *From:* devel [mailto:devel-boun...@lists.open-mpi.org] *On Behalf Of *George
> Bosilca
> *Sent:* Saturday, March 03, 2018 6:29 AM
> *To:* Open MPI Developers 
> *Subject:* Re: [OMPI devel] Default tag for OFI MTL
>
>
>
> Hi Matias,
>
>
>
> Relaxing the restriction on the number of ranks is definitively a good
> thing. The cost will be reflected on the number of communicators and tags,
> and we must be careful how we balance this.
>
>
>
> Assuming context_id is the communicator cid, with 10 bits you can only
> support 1024. A little low, even lower than MVAPICH. The way we allocate
> cid is very sparse, and with a limited number of possible cid, we might run
> in troubles very quickly for the few applications that are using a large
> number of communicators, and for the resilience support. Yet another reason
> to revisit the cid allocation in the short term.
>
>
>
> Regarding the tag with your proposal the OFI MTL will support a wider
> range of tags than the OB1 PML, where we are limited to 16 bits. Just make
> sure you correctly expose your tag limit via the MPI_TAG_UB.
>
>
>
> I personally would prefer a solution where we can alter the distribution
> of bits between bits in the cid and tag at compile time. We can also
> envision this selection to be driven by an MCA parameter, but this might be
> too costly.
>
>   George.
>
>
>
>
>
>
>
>
>
> On Sat, Mar 3, 2018 at 2:56 AM, Cabral, Matias A <
> matias.a.cab...@intel.com> wrote:
>
> Hi all,
>
>
>
> I’m working on extending the OFI MTL to support FI_REMOTE_CQ_DATA (1) to
> extend the number of ranks currently supported by the MTL. Currently
> limited to only 16 bits included in the OFI tag (2). After the feature is
> implemented there will be no limitation for providers that support
> FI_REMOTE_CQ_DATA and FI_DIRECTED_RECEIVE (3). However, there will be a
> fallback mode for providers that do not support these features and I would
> like to get consensus on the default tag distribution. This is my proposal:
>
>
>
> * Default: No FI_REMOTE_CQ_DATA
>
> * 01234567 01| 234567 01234567 0123| 4567 |01234567 01234567 01234567
> 01234567
>
> * context_id   |source rank |proto|  message
> tag
>
>
>
> #define MTL_OFI_CONTEXT_MASK(0xFFC0ULL)
>
> #define MTL_OFI_SOURCE_MASK (0x00300ULL)
>
> #define MTL_OFI_SOURCE_BITS_COUNT   (18) /* 262,143 ranks */
>
> #define MTL_OFI_CONTEXT_BITS_COUNT  (10) /* 1,023 communicators */
>
> #define MTL_OFI_TAG_BITS_COUNT  (32) /* no restrictions */
>
> #define MTL_OFI_PROTO_BITS_COUNT(4)
>
>
>
> No

Re: [OMPI devel] Default tag for OFI MTL

2018-03-03 Thread George Bosilca
Hi Matias,

Relaxing the restriction on the number of ranks is definitely a good
thing. The cost will be reflected in the number of communicators and tags,
and we must be careful how we balance this.

Assuming context_id is the communicator cid, with 10 bits you can only
support 1024. A little low, even lower than MVAPICH. The way we allocate
cids is very sparse, and with a limited number of possible cids, we might
run into trouble very quickly for the few applications that are using a
large number of communicators, and for the resilience support. Yet another
reason to revisit the cid allocation in the short term.

Regarding the tag: with your proposal the OFI MTL will support a wider range
of tags than the OB1 PML, where we are limited to 16 bits. Just make sure
you correctly expose your tag limit via the MPI_TAG_UB.

I personally would prefer a solution where we can alter the distribution of
bits between the cid and the tag at compile time. We can also envision
this selection to be driven by an MCA parameter, but this might be too
costly.
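
Something along these lines is what I have in mind for the compile-time split
(a sketch with made-up names, not the actual mtl_ofi_types.h code; the
source-rank and protocol fields from the proposal are lumped together here):

#include <stdint.h>

#ifndef OFI_SKETCH_TAG_BITS
#define OFI_SKETCH_TAG_BITS   16   /* default: match OB1's 16-bit tags */
#endif
#define OFI_SKETCH_OTHER_BITS 22   /* 18 source-rank + 4 protocol bits */
#define OFI_SKETCH_CID_BITS   (64 - OFI_SKETCH_OTHER_BITS - OFI_SKETCH_TAG_BITS)
#define OFI_SKETCH_TAG_MASK   ((UINT64_C(1) << OFI_SKETCH_TAG_BITS) - 1)

/* Pack cid | (source,proto) | tag into the 64-bit OFI match bits. */
static inline uint64_t ofi_sketch_pack(uint64_t cid, uint64_t other,
                                       uint64_t tag)
{
    return (cid << (OFI_SKETCH_OTHER_BITS + OFI_SKETCH_TAG_BITS)) |
           (other << OFI_SKETCH_TAG_BITS) |
           (tag & OFI_SKETCH_TAG_MASK);
}

static inline int ofi_sketch_unpack_tag(uint64_t match_bits)
{
    return (int)(match_bits & OFI_SKETCH_TAG_MASK);
}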

  George.




On Sat, Mar 3, 2018 at 2:56 AM, Cabral, Matias A 
wrote:

> Hi all,
>
>
>
> I’m working on extending the OFI MTL to support FI_REMOTE_CQ_DATA (1) to
> extend the number of ranks currently supported by the MTL. Currently
> limited to only 16 bits included in the OFI tag (2). After the feature is
> implemented there will be no limitation for providers that support
> FI_REMOTE_CQ_DATA and FI_DIRECTED_RECEIVE (3). However, there will be a
> fallback mode for providers that do not support these features and I would
> like to get consensus on the default tag distribution. This is my proposal:
>
>
>
> * Default: No FI_REMOTE_CQ_DATA
>
> * 01234567 01| 234567 01234567 0123| 4567 |01234567 01234567 01234567
> 01234567
>
> * context_id   |source rank |proto|  message
> tag
>
>
>
> #define MTL_OFI_CONTEXT_MASK(0xFFC0ULL)
>
> #define MTL_OFI_SOURCE_MASK (0x00300ULL)
>
> #define MTL_OFI_SOURCE_BITS_COUNT   (18) /* 262,143 ranks */
>
> #define MTL_OFI_CONTEXT_BITS_COUNT  (10) /* 1,023 communicators */
>
> #define MTL_OFI_TAG_BITS_COUNT  (32) /* no restrictions */
>
> #define MTL_OFI_PROTO_BITS_COUNT(4)
>
>
>
> Notes:
>
> -  More ranks and fewer context ids than the current
> implementation.
>
> -  Moved the protocol bits from the most significant bits because
> some providers may reserve starting from there (see mem_tag_format (4)) and
> sync send will not work.
>
>
>
> Thoughts?
>
>
>
> Today we had a call with Howard (LANL), John and Hamuri (HPE) and briefly
> talked about this, and also thought about sending this email as a query to
> find other developers keeping an eye on OFI support in OMPI.
>
>
>
> Thanks,
>
> _MAC
>
>
>
>
>
> (1)https://ofiwg.github.io/libfabric/master/man/fi_cq.3.html
>
> (2)https://github.com/open-mpi/ompi/blob/master/ompi/mca/mtl/
> ofi/mtl_ofi_types.h#L70
>
> (3)https://ofiwg.github.io/libfabric/master/man/fi_getinfo.3.html
>
> (4)https://ofiwg.github.io/libfabric/master/man/fi_endpoint.3.html
>
>
>
>
>
>
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] OSC module change

2017-11-28 Thread George Bosilca
Hi Brian,

Let me start by explaining why we need the communicator. We need to
translate a local rank to a global rank (aka the rank in your
MPI_COMM_WORLD), so that the communication map we provide makes sense. The
only way today is to go back to a communicator and then basically translate
a rank between this communicator and MPI_COMM_WORLD. We could use the gid,
but then we would have a hash table lookup for every operation.
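
At the MPI level, the translation we need is essentially what
MPI_Group_translate_ranks provides; schematically, for every window the
monitoring has to do something like this (win_comm stands for the
communicator the window was created over):

#include <mpi.h>

/* Translate a rank local to the window's communicator into the
 * corresponding rank in MPI_COMM_WORLD (sketch). */
static int local_to_world_rank(MPI_Comm win_comm, int local_rank)
{
    MPI_Group win_group, world_group;
    int world_rank;

    MPI_Comm_group(win_comm, &win_group);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_translate_ranks(win_group, 1, &local_rank,
                              world_group, &world_rank);
    MPI_Group_free(&win_group);
    MPI_Group_free(&world_group);
    return world_rank;
}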

While a communicator is not needed internally by an OSC, in the MPI world all
windows start with a communicator. This is the reason why I was proposing
the change: not to force a window to create or hold a communicator, but
simply because the existence of a communicator linked to the window is more
or less enforced by the MPI standard.

  George.



On Tue, Nov 28, 2017 at 1:02 PM, Barrett, Brian via devel <
devel@lists.open-mpi.org> wrote:

> The objection I have to this is that it forces an implementation where
> every one-sided component is backed by a communicator.  While that’s the
> case today, it’s certainly not required.  If you look at Portal 4, for
> example, there’s one collective call outside of initialization, and that’s
> a barrier in MPI_FENCE.  The SM component is the same way and given some of
> the use cases for shared memory allocation using the SM component, it’s
> very possible that we’ll be faced with a situation where creating a
> communicator per SM region is too expensive in terms of overall
> communicator count.
>
> I guess a different question would be what you need the communicator for.
> It shouldn’t have any useful semantic meaning, so why isn’t a silent
> implementation detail for the monitoring component?
>
> Brian
>
>
> On Nov 28, 2017, at 8:45 AM, George Bosilca  wrote:
>
> Devels,
>
> We would like to change the definition of the OSC module to move the
> communicator one level up from the different module structures into the
> base OSC module. The reason for this, as well as a lengthy discussion on
> other possible solutions can be found in https://github.com/open-mpi/
> ompi/pull/4527.
>
> We need to take a decision on this asap, to prepare the PR for the 3.1.
> Please comment asap.
>
>   George.
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] OSC module change

2017-11-28 Thread George Bosilca
Devels,

We would like to change the definition of the OSC module to move the
communicator one level up from the different module structures into the
base OSC module. The reason for this, as well as a lengthy discussion on
other possible solutions can be found in
https://github.com/open-mpi/ompi/pull/4527.

We need to take a decision on this asap, to prepare the PR for the 3.1.
Please comment asap.

  George.
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread George Bosilca
Samuel,

You are right, we use qsort to sort the keys, but the qsort only applies to
participants with the same color. So the worst-case complexity of the qsort
should only be reached when most of the processes participate with the
same color.

What I think is the OMPI problem in this area is the selection of the next
cid for the newly created communicator. We are doing the selection of the
cid on the original communicator, and this basically accounts for a
significant increase in the duration, as we will need to iterate longer to
converge to a common cid.
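
Schematically, the cid agreement behaves like the simplified loop below (not
the actual Open MPI cid-selection code), which is why starting from different
locally-free values on the original communicator makes it iterate longer:

#include <mpi.h>

#define MAX_CID 4096
static char cid_in_use[MAX_CID];              /* toy local bookkeeping */

static int lowest_free_cid_at_least(int floor)
{
    int c = floor;
    while (c < MAX_CID && cid_in_use[c]) c++;
    return c;
}

/* Every participant proposes its lowest locally unused cid, the max is
 * taken over the old communicator, and the loop repeats until that value
 * is free everywhere. */
static int agree_on_cid(MPI_Comm old_comm)
{
    int proposal = lowest_free_cid_at_least(0);
    for (;;) {
        int candidate, ok_local, ok_global;
        MPI_Allreduce(&proposal, &candidate, 1, MPI_INT, MPI_MAX, old_comm);
        ok_local = (lowest_free_cid_at_least(candidate) == candidate);
        MPI_Allreduce(&ok_local, &ok_global, 1, MPI_INT, MPI_MIN, old_comm);
        if (ok_global) {
            cid_in_use[candidate] = 1;        /* claim it locally */
            return candidate;
        }
        proposal = lowest_free_cid_at_least(candidate + 1);
    }
}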

We haven't made any improvements in this area for the last few years; we
simply transformed the code to use non-blocking communications instead of
blocking, but this has little impact on the performance of the split itself.

  George.


On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams  wrote:

> I'll ask my collaborators if they've submitted a ticket.
> (they have the accounts; built the code; ran the code; observed the issues)
>
> I believe the issue on MPICH was a qsort issue and not a Allreduce issue.
> When this is coupled with the fact that it looked like qsort is called in
> ompi_comm_split (https://github.com/open-mpi/ompi/blob/
> a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c), I
> wanted to raise the issue so that it may be investigated to understand
> whether users can naively blunder into worst case computational complexity
> issues.
>
> We've been running hpgmg-fv (not -fe).  They were using the flux variants
> (requires local.mk build operators.flux.c instead of operators.fv4.c) and
> they are a couple commits behind.  Regardless, this issue has persisted on
> K for several years.  By default, it will build log(N) subcommunicators
> where N is the problem size.  Weak scaling experiments has shown
> comm_split/dup times growing consistently with worst case complexity.  That
> being said, AMR codes might rebuild the sub communicators as they
> regrid/adapt.
>
>
>
>
>
>
>
>
> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> > Samuel,
> >
> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
> > is based on Open MPI.
> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> > From a support perspective, this should be reported to the HPCI
> > helpdesk http://www.hpci-office.jp/pages/e_support
> >
> > As far as i understand, Fujitsu MPI currently available on K is not
> > based on the latest Open MPI.
> > I suspect most of the time is spent trying to find the new
> > communicator ID (CID) when a communicator is created (vs figuring out
> > the new ranks)
> > iirc, on older versions of Open MPI, that was implemented with as many
> > MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common
> > unused CID for the newly created communicator.
> >
> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> > your program, only one MPI_Allreduce() should be involved per
> > MPI_Comm_dup().
> > But if you do the same thing in the middle of your run, and after each
> > rank has a different lower unused CID, the performances can be (much)
> > worst.
> > If i understand correctly your description of the issue, that would
> > explain the performance discrepancy between static vs dynamic
> > communicator creation time.
> >
> > fwiw, this part has been (highly) improved in the latest releases of
> Open MPI.
> >
> > If your benchmark is available for download, could you please post a
> link ?
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams 
> wrote:
> >> Some of my collaborators have had issues with one of my benchmarks at
> high concurrency (82K MPI procs) on the K machine in Japan.  I believe K
> uses OpenMPI and the issues has been tracked to time in
> MPI_Comm_dup/Comm_split increasing quadratically with process concurrency.
> At 82K processes, each call to dup/split is taking 15s to complete.  These
> high times restrict comm_split/dup to be used statically (at the beginning)
> and not dynamically in an application.
> >>
> >> I had a similar issue a few years ago on ANL/Mira/MPICH where they
> called qsort to split the ranks.  Although qsort/quicksort has ideal
> computational complexity of O(PlogP)  [P is the number of MPI ranks], it
> can have worst case complexity of O(P^2)... at 82K, P/logP is a 5000x
> slowdown.
> >>
> >> Can you confirm whether qsort (or the like) is (still) used in these
> routines in OpenMPI?  It seems mergesort (worst case complexity of PlogP)
> would be a more scalable approach.  I have not observed this issue on the
> Cray MPICH implementation and the Mira MPICH issues has since been resolved.
> >>
> >>
> >> ___
> >> devel mailing list
> >> devel@lists.open-mpi.org
> >> https://lists.open-mpi.org/mailman/listinfo/devel
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://l

Re: [OMPI devel] Open MPI3.0

2017-10-22 Thread George Bosilca
Thanks Gilles.

George.


On Mon, Oct 23, 2017 at 12:34 AM, Gilles Gouaillardet 
wrote:

> George,
>
>
> since this is an automatically generated file (at configure time), this is
> likely a packaging issue in upstream PMIx
>
> i made https://github.com/pmix/pmix/pull/567 in order to fix that.
>
>
> fwiw, nightly tarballs for v3.0.x, v3.1.x and master are affected
>
>
> Cheers,
>
>
> Gilles
>
>
> On 10/23/2017 5:47 AM, George Bosilca wrote:
>
>> Did we include by mistake the PMIX config header
>> (opal/mca/pmix/pmix2x/pmix/src/include/pmix_config.h) in the 3.0 release
>> ?
>>
>>   George.
>>
>>
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
>>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Open MPI3.0

2017-10-22 Thread George Bosilca
Did we include by mistake the PMIX config header
(opal/mca/pmix/pmix2x/pmix/src/include/pmix_config.h) in the 3.0 release ?

  George.
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Jenkins nowhere land again

2017-10-03 Thread George Bosilca
We have an unused mac that we can add to the pool. I'll be more than happy
to help set it up.

  George.



On Tue, Oct 3, 2017 at 5:43 PM, Barrett, Brian via devel <
devel@lists.open-mpi.org> wrote:

> My MacOS box is back up and jobs are progressing again. The queue got kind
> of long, so it might be an hour or so before it catches up. I have some
> thoughts on monitoring so we get emails in case this happens and my team’s
> Product Manager found an unused Amazon-owned Mac Mini we’ll add to the pool
> so that I won’t have to drive home if this happens again.
>
> Brian
>
> On Oct 3, 2017, at 13:40, "r...@open-mpi.org"  wrote:
>
> I’m not sure either - I have the patch to fix the loop_spawn test problem,
> but can’t get it into the repo.
>
>
> On Oct 3, 2017, at 1:22 PM, Barrett, Brian via devel <
> devel@lists.open-mpi.org> wrote:
>
>
> I’m not sure entirely what we want to do.  It looks like both Nathan and
> I’s OS X servers died on the same day.  It looks like mine might be a
> larger failure than just Jenkins, because I can’t log into the machine
> remotely.  It’s going to be a couple hours before I can get home.  Nathan,
> do you know what happened to your machine?
>
>
> The only options for the OMPI builder are to either wait until Nathan or I
> get home and get our servers running again or to not test OS X (which has
> its own problems).  I don’t have a strong preference here, but I also don’t
> want to make the decision unilaterally.
>
>
> Brian
>
>
>
> On Oct 3, 2017, at 1:14 PM, r...@open-mpi.org wrote:
>
>
> We are caught between two infrastructure failures:
>
>
> Mellanox can’t pull down a complete PR
>
>
> OMPI is hanging on the OS-X server
>
>
> Can someone put us out of our misery?
>
> Ralph
>
>
> ___
>
> devel mailing list
>
> devel@lists.open-mpi.org
>
> https://lists.open-mpi.org/mailman/listinfo/devel
>
>
> ___
>
> devel mailing list
>
> devel@lists.open-mpi.org
>
> https://lists.open-mpi.org/mailman/listinfo/devel
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Stale PRs

2017-08-31 Thread George Bosilca
Ralph,

I updated the TCP-related pending PR. It offers a better solution than what
we have today; unfortunately it is not perfect, as it would require additions
to the configure. Waiting for reviews.

  George.


On Thu, Aug 31, 2017 at 10:12 AM, r...@open-mpi.org  wrote:

> Thanks to those who made a first pass at these old PRs. The oldest one is
> now dated Dec 2015 - nearly a two-year old change for large messages over
> the TCP BTL, waiting for someone to commit.
>
>
> > On Aug 30, 2017, at 7:34 AM, r...@open-mpi.org wrote:
> >
> > Hey folks
> >
> > This is getting ridiculous - we have PRs sitting on GitHub that are more
> than a year old! If they haven’t been committed in all that time, they
> can’t possibly be worth anything now.
> >
> > Would people _please_ start paying attention to their PRs? Either close
> them, or update/commit them.
> >
> > Ralph
> >
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] PMIX visibility

2017-07-24 Thread George Bosilca
The last PMIX import broke the master on all platforms that support
visibility. I have pushed a patch that solves __most__ of the issues (that
I could find). I say most because there is a big one left that requires a
significant change in the PMIX design.

This problem arises from the use of the pmix_setenv symbol in one of the MCA
components (a totally legit operation). Except that in PMIX, pmix_setenv
is defined in opal/mca/pmix/pmix2x/pmix/include/pmix_common.h, which is one
of those headers that is self-contained and does not include
config_bottom.h, and thus has no access to PMIX_EXPORT.

Here are 3 possible solutions:
1. don't use pmix_setenv in any of the MCA components
2. create a new header that provides support for all util functions
(similar to OPAL) and that supports PMIX_EXPORT
3. make pmix_common.h not self-contained in order to provide access to
PMIX_EXPORT.

Any of these approaches requires changes to PMIX (and a push upstream).
Meanwhile, the trunk seems to be broken on all platforms that support
visibility.
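
For context, the pattern these headers rely on is roughly the following (a
generic sketch of an export/visibility macro, not the literal PMIX_EXPORT
definition):

/* When the library is built with -fvisibility=hidden, only declarations
 * carrying the export macro remain visible to other DSOs (e.g. the MCA
 * components). */
#if defined(__GNUC__) && (__GNUC__ >= 4)
#  define LIB_EXPORT __attribute__((visibility("default")))
#else
#  define LIB_EXPORT
#endif

/* A utility symbol meant to be called from a component must be declared
 * with the macro, otherwise it stays hidden at link/load time: */
LIB_EXPORT void lib_setenv_sketch(const char *name, const char *value,
                                  char ***env);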

  George.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] orterun busted

2017-06-23 Thread George Bosilca
Ralph,

I got consistent segfaults during the infrastructure teardown in
orterun (I noticed them on OSX). After digging a little bit, it turns out
that the opal_buffer_t class has been cleaned up in orte_finalize before
orte_proc_info_finalize is called, leading to the destructors being called
on randomly initialized memory. If I change the order of the teardown to
move orte_proc_info_finalize before orte_finalize, things work better, but I
still get a very annoying warning about a "Bad file descriptor in select".

Any better fix ?

George.

PS: Here is the patch I am currently using to get rid of the segfaults

diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c
index 85aba0a0f3..506b931d35 100644
--- a/orte/tools/orterun/orterun.c
+++ b/orte/tools/orterun/orterun.c
@@ -222,10 +222,10 @@ int orterun(int argc, char *argv[])
  DONE:
 /* cleanup and leave */
 orte_submit_finalize();
-orte_finalize();
-orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
 /* cleanup the process info */
 orte_proc_info_finalize();
+orte_finalize();
+orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);

 if (orte_debug_flag) {
 fprintf(stderr, "exiting with status %d\n", orte_exit_status);
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Master warnings

2017-06-13 Thread George Bosilca
e9d533e62ecb should fix these warnings. They are harmless, as we cannot
reach the context needed for them to have an impact, because collective
communications with 0 bytes are trimmed out in the MPI layer.
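
The warnings all follow the same shape, roughly the sketch below (with a
hypothetical helper standing in for the real span/gap computation): gap is
only written on the non-empty path, and the compiler cannot see that the
zero-count path is never taken from these callers.

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: the gap is only computed for non-empty data. */
static size_t span_sketch(size_t elem_size, size_t count, ptrdiff_t *gap)
{
    if (0 == count) return 0;   /* gap left untouched on this path */
    *gap = 0;                   /* a real datatype may have a non-zero lb */
    return elem_size * count;
}

static char *alloc_buffer_sketch(size_t count)
{
    ptrdiff_t gap;              /* flagged as "may be used uninitialized" */
    size_t span = span_sketch(sizeof(double), count, &gap);
    char *free_buf = malloc(span);
    return free_buf - gap;      /* harmless here, since count is never 0 */
}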

Thanks for reporting.
  George.


On Tue, Jun 13, 2017 at 12:43 PM, r...@open-mpi.org  wrote:

> Configured master with no options given to it (i.e., out-of-the-box) - the
> key being that it defaults to --disable-debug - and drowned in warnings:
>
> *base/coll_base_scatter.c:* In function ‘
> *ompi_coll_base_scatter_intra_binomial*’:
> *base/coll_base_scatter.c:90:28:* *warning: *‘*sgap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  ptmp = tempbuf *-* sgap;
> *^*
> *base/coll_base_scatter.c:117:24:* *warning: *‘*rgap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  ptmp = tempbuf *-* rgap;
> *^*
> *base/coll_base_alltoall.c:* In function ‘
> *mca_coll_base_alltoall_intra_basic_inplace*’:
> *base/coll_base_alltoall.c:69:35:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  tmp_buffer = allocated_buffer *-* gap;
>*^*
> *base/coll_base_alltoall.c:* In function ‘
> *ompi_coll_base_alltoall_intra_bruck*’:
> *base/coll_base_alltoall.c:230:26:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  tmpbuf = tmpbuf_free *-* gap;
>   *^*
> *base/coll_base_allgather.c:* In function ‘
> *ompi_coll_base_allgather_intra_bruck*’:
> *base/coll_base_allgather.c:179:30:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  shift_buf = free_buf *-* gap;
>   *^*
> *base/coll_base_gather.c:* In function ‘
> *ompi_coll_base_gather_intra_binomial*’:
> *base/coll_base_gather.c:91:28:* *warning: *‘*rgap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  ptmp = tempbuf *-* rgap;
> *^*
> *base/coll_base_gather.c:114:24:* *warning: *‘*sgap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  ptmp = tempbuf *-* sgap;
> *^*
> *base/coll_base_reduce_scatter.c:* In function ‘
> *ompi_coll_base_reduce_scatter_intra_nonoverlapping*’:
> *base/coll_base_reduce_scatter.c:83:36:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  tmprbuf = tmprbuf_free *-* gap;
> *^*
> *base/coll_base_reduce.c:* In function ‘*ompi_coll_base_reduce_generic*’:
> *base/coll_base_reduce.c:124:34:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  inbuf[0] = inbuf_free[0] *-* gap;
>   *^*
> *base/coll_base_allreduce.c:* In function ‘
> *ompi_coll_base_allreduce_intra_recursivedoubling*’:
> *base/coll_base_allreduce.c:159:34:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  inplacebuf = inplacebuf_free *-* gap;
>   *^*
> *base/coll_base_reduce_scatter.c:* In function ‘
> *ompi_coll_base_reduce_scatter_intra_basic_recursivehalving*’:
> *base/coll_base_reduce_scatter.c:178:30:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  recv_buf = recv_buf_free *-* gap;
>   *^*
> *base/coll_base_reduce.c:* In function ‘
> *ompi_coll_base_reduce_intra_in_order_binary*’:
> *base/coll_base_reduce.c:549:34:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  tmpbuf = tmpbuf_free *-* gap;
>   *^*
> *base/coll_base_reduce.c:* In function ‘
> *ompi_coll_base_reduce_intra_basic_linear*’:
> *base/coll_base_reduce.c:653:34:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  pml_buffer = free_buffer *-* gap;
>   *^*
> *base/coll_base_reduce_scatter.c:* In function ‘
> *ompi_coll_base_reduce_scatter_intra_ring*’:
> *base/coll_base_reduce_scatter.c:513:30:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  accumbuf = accumbuf_free *-* gap;
>
> *coll_basic_allgather.c:* In function ‘*mca_coll_basic_allgather_inter*’:
> *coll_basic_allgather.c:112:30:* *warning: *‘*gap*’ may be used
> uninitialized in this function [*-Wmaybe-uninitialized*]
>  tmpbuf = tmpbuf_free *-* gap;
>   *^*
> *coll_basic_exscan.c:* In function ‘*mca_coll_basic_exscan_intra*’:
> *coll_basic_exscan.c:92:33:* *warning: *‘*gap*’ may be used uninitialized
> in this function [*-Wmaybe-uninitialized*]
>  reduce_buffer = free_buffer *-* gap;
>  *

Re: [OMPI devel] ompi_info "developer warning"

2017-06-05 Thread George Bosilca
I do care a little, as the default size for most terminals is still 80 chars.
I would prefer your second choice, where we replace "disabled" with "-", to
avoid losing information on the useful part of the message.

George.


On Mon, Jun 5, 2017 at 9:45 AM,  wrote:

> George,
>
>
>
> it seems today the limit is more something like max 24 + max 56.
>
> we can keep the 80 character limit (i have zero opinion on that) and move
> to
>
> max 32 + max 48 for example.
>
> an other option is to replace "(disabled) " with something more compact
>
> "(-) " or even "- "
>
>
>
> Cheers,
>
>
>
> Gilles
>
> - Original Message -
>
> So we are finally getting rid of the 80 chars per line limit?
>
>   George.
>
>
>
> On Sun, Jun 4, 2017 at 11:23 PM, r...@open-mpi.org 
> wrote:
>
>> Really? Sigh - frustrating. I’ll change itas it gets irritating to keep
>> get this warning.
>>
>> Frankly, I find I’m constantly doing --all because otherwise I have no
>> earthly idea how to find what I’m looking for anymore...
>>
>>
>> > On Jun 4, 2017, at 7:25 PM, Gilles Gouaillardet 
>> wrote:
>> >
>> > Ralph,
>> >
>> >
>> > in your environment, pml/monitoring is disabled.
>> >
>> > so instead of displaying "MCA pml monitoring", ompi_info --all displays
>> >
>> > "MCA (disabled) pml monitoring" which is larger than 24 characters.
>> >
>> >
>> > fwiw, you can observe the same behavior with
>> >
>> > OMPI_MCA_sharedfp=^lockedfile ompi_info --all
>> >
>> >
>> > one option is to bump centerpoint (opal/runtime/opal_info_support.c)
>> from 24 to something larger,
>> > an other option is to mark disabled components with a shorter string,
>> for example
>> > "MCA (-) pml monitoring"
>> >
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On 6/3/2017 5:26 AM, r...@open-mpi.org wrote:
>> >> I keep seeing this when I run ompi_info --all:
>> >>
>> >> 
>> **
>> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
>> >> *** will appear poorly in the prettyprint output.
>> >> ***
>> >> ***   Value:  "MCA (disabled) pml monitoring"
>> >> ***   Max length: 24
>> >> 
>> **
>> >> 
>> **
>> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
>> >> *** will appear poorly in the prettyprint output.
>> >> ***
>> >> ***   Value:  "MCA (disabled) pml monitoring"
>> >> ***   Max length: 24
>> >> 
>> **
>> >>
>> >> Anyone know what this is about???
>> >> Ralph
>> >>
>> >>
>> >>
>> >> ___
>> >> devel mailing list
>> >> devel@lists.open-mpi.org
>> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> >
>> > ___
>> > devel mailing list
>> > devel@lists.open-mpi.org
>> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] ompi_info "developer warning"

2017-06-05 Thread George Bosilca
So we are finally getting rid of the 80 chars per line limit?

  George.



On Sun, Jun 4, 2017 at 11:23 PM, r...@open-mpi.org  wrote:

> Really? Sigh - frustrating. I’ll change itas it gets irritating to keep
> get this warning.
>
> Frankly, I find I’m constantly doing --all because otherwise I have no
> earthly idea how to find what I’m looking for anymore...
>
>
> > On Jun 4, 2017, at 7:25 PM, Gilles Gouaillardet 
> wrote:
> >
> > Ralph,
> >
> >
> > in your environment, pml/monitoring is disabled.
> >
> > so instead of displaying "MCA pml monitoring", ompi_info --all displays
> >
> > "MCA (disabled) pml monitoring" which is larger than 24 characters.
> >
> >
> > fwiw, you can observe the same behavior with
> >
> > OMPI_MCA_sharedfp=^lockedfile ompi_info --all
> >
> >
> > one option is to bump centerpoint (opal/runtime/opal_info_support.c)
> from 24 to something larger,
> > an other option is to mark disabled components with a shorter string,
> for example
> > "MCA (-) pml monitoring"
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On 6/3/2017 5:26 AM, r...@open-mpi.org wrote:
> >> I keep seeing this when I run ompi_info --all:
> >>
> >> 
> **
> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
> >> *** will appear poorly in the prettyprint output.
> >> ***
> >> ***   Value:  "MCA (disabled) pml monitoring"
> >> ***   Max length: 24
> >> 
> **
> >> 
> **
> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
> >> *** will appear poorly in the prettyprint output.
> >> ***
> >> ***   Value:  "MCA (disabled) pml monitoring"
> >> ***   Max length: 24
> >> 
> **
> >>
> >> Anyone know what this is about???
> >> Ralph
> >>
> >>
> >>
> >> ___
> >> devel mailing list
> >> devel@lists.open-mpi.org
> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] about ompi_datatype_is_valid

2017-06-01 Thread George Bosilca
You have to pass it an allocated datatype, and it tells you if the pointer
object is a valid MPI datatype for communications (aka it has a
corresponding type with a well defined size, extent and alignment).

There is no construct in C able to tell you if a random number is a valid C
"object".

  George.
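
PS: Schematically, the reason the cast segfaults instead of returning false
is that the check has to dereference whatever pointer it is given
(hypothetical names below, not the real ompi_datatype internals):

#include <stdint.h>

#define DT_FLAG_COMMITTED 0x0001

typedef struct {
    uint32_t flags;
    /* ... sizes, extents, alignment, ... */
} datatype_sketch_t;

static int datatype_is_valid_sketch(const datatype_sketch_t *type)
{
    /* Reads a field of the object, so 'type' must point at real storage. */
    return (type->flags & DT_FLAG_COMMITTED) != 0;
}

/* Calling datatype_is_valid_sketch((datatype_sketch_t *)-1) therefore
 * dereferences an arbitrary address and crashes, which is what the
 * reported "Address not mapped" signal shows. */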


On Thu, Jun 1, 2017 at 5:42 PM, Dahai Guo  wrote:

> Hi,
>
> if I insert following lines somewhere openmpi, such
> as ompi/mpi/c/iscatter.c
>
>   printf(" --- in MPI_Iscatter\n");
> //MPI_Datatype dt00 = (MPI_Datatype) MPI_INT;
> *MPI_Datatype dt00 = (MPI_Datatype) -1;*
> if(*!ompi_datatype_is_valid(dt00)* ) {
>   printf(" --- dt00 is NOT valid \n");
> }
>
> The attached test code will give the errors:
>
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0xf
> [ 0] [0x3fff9d480478]
> ...
>
> Is it a bug in the function *ompi_datatype_is_valid(..) *? or I miss
> something?
>
> Dahai
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] PMIX busted

2017-05-31 Thread George Bosilca
After removing all leftover files and redoing the autogen, things went back
to normal. Sorry for the noise.

  George.



On Wed, May 31, 2017 at 10:06 AM, r...@open-mpi.org  wrote:

> No - I just rebuilt it myself, and I don’t see any relevant MTT build
> failures. Did you rerun autogen?
>
>
> > On May 31, 2017, at 7:02 AM, George Bosilca  wrote:
> >
> > I have problems compiling the current master. Anyone else has similar
> issues ?
> >
> >   George.
> >
> >
> >   CC   base/ptl_base_frame.lo
> > In file included from /Users/bosilca/unstable/ompi/
> trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/thread_usage.h:31:0,
> >  from /Users/bosilca/unstable/ompi/
> trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/mutex.h:32,
> >  from /Users/bosilca/unstable/ompi/
> trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/threads.h:37,
> >  from /Users/bosilca/unstable/ompi/
> trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client_ops.h:18,
> >  from ../../../../../../../../../../
> opal/mca/pmix/pmix2x/pmix/src/mca/ptl/base/ptl_base_frame.c:45:
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/
> pmix2x/pmix/src/atomics/sys/atomic.h:80:34: warning:
> "PMIX_C_GCC_INLINE_ASSEMBLY" is not defined [-Wundef]
> >  #define PMIX_GCC_INLINE_ASSEMBLY PMIX_C_GCC_INLINE_ASSEMBLY
> >   ^
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/
> pmix2x/pmix/src/atomics/sys/atomic.h:115:6: note: in expansion of macro
> 'PMIX_GCC_INLINE_ASSEMBLY'
> >  #if !PMIX_GCC_INLINE_ASSEMBLY
> >   ^
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/
> pmix2x/pmix/src/atomics/sys/atomic.h:153:7: warning:
> "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
> >  #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_SYNC
> >^
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/
> pmix2x/pmix/src/atomics/sys/atomic.h:155:7: warning:
> "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
> >  #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_GCC
> >^
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] PMIX busted

2017-05-31 Thread George Bosilca
I have problems compiling the current master. Is anyone else seeing similar
issues?

  George.


  CC   base/ptl_base_frame.lo
In file included from
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/thread_usage.h:31:0,
 from
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/mutex.h:32,
 from
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/threads.h:37,
 from
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client_ops.h:18,
 from
../../../../../../../../../../opal/mca/pmix/pmix2x/pmix/src/mca/ptl/base/ptl_base_frame.c:45:
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:80:34:
warning: "PMIX_C_GCC_INLINE_ASSEMBLY" is not defined [-Wundef]
 #define PMIX_GCC_INLINE_ASSEMBLY PMIX_C_GCC_INLINE_ASSEMBLY
  ^
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:115:6:
note: in expansion of macro 'PMIX_GCC_INLINE_ASSEMBLY'
 #if !PMIX_GCC_INLINE_ASSEMBLY
  ^
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:153:7:
warning: "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
 #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_SYNC
   ^
/Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:155:7:
warning: "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
 #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_GCC
   ^
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] NetPIPE performance curves

2017-05-09 Thread George Bosilca
Dave,

I think I know the reason, or at least part of the reason, for these
spikes. As an example, when we select between the different protocols to
use to exchange the message between peers, we only use predefined lengths,
and we completely disregard buffer alignment.

I was planning to address this issue by selecting the small, eager, and
pipeline fragment sizes based also on the alignment of the remaining buffer,
in such a way that we minimize the non-aligned transactions across the PCI
bus. Unfortunately, my schedule for the next month looks grim, so I don't
think I will have the opportunity to play with the code.

  George.
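
To illustrate the idea (a sketch only, not the actual OB1 code), the length
of the first fragment can be chosen so that the remainder of the buffer
starts on an aligned boundary, which keeps the subsequent pipeline fragments
aligned:

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: pick the first (eager) fragment length so that the
 * rest of the buffer starts on an 'align'-byte boundary (align > 0). */
static size_t first_fragment_length(const void *buf, size_t eager_limit,
                                    size_t align)
{
    uintptr_t addr  = (uintptr_t) buf;
    size_t    shift = (size_t) ((align - (addr % align)) % align);

    if (eager_limit <= shift) {
        return eager_limit;   /* fragment too small to fix the alignment */
    }
    /* largest length <= eager_limit such that buf + length is aligned */
    return eager_limit - ((eager_limit - shift) % align);
}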



On Thu, May 4, 2017 at 4:22 PM, Dave Turner  wrote:

>  I don't see anything in NetPIPE itself that could cause this as non
> factors
> of 8 are still aligned to 8 bytes.  Can you think of anything in OpenMPI
> that
> would result in the message being treated differently when the message size
> is not a factor of 8?  The sends/recvs all use a data type of MPI_BYTE
> in NetPIPE.
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] about MPI_ANY_SOURCE in MPI_Sendrecv_replace

2017-05-09 Thread George Bosilca
PR#3500 (https://github.com/open-mpi/ompi/pull/3500) should fix the
problem. It is not optimal, but it is simple and works in all cases.

  George.


On Tue, May 9, 2017 at 2:39 PM, George Bosilca  wrote:

> Please go ahead and open an issue, I will attach the PR once I have the
> core ready. A little later today I think.
>
>   George.
>
>
> On May 9, 2017, at 14:32 , Dahai Guo  wrote:
>
> Hi, George:
>
> any progress on it? an issue should be opened in github? or you already
> opened one?
>
> Dahai
>
> On Fri, May 5, 2017 at 1:27 PM, George Bosilca 
> wrote:
>
>> Indeed, our current implementation of the MPI_Sendrecv_replace prohibits
>> the use of MPI_ANY_SOURCE. Will work a patch later today.
>>
>>   George.
>>
>>
>> On Fri, May 5, 2017 at 11:49 AM, Dahai Guo  wrote:
>>
>>> The following code causes memory fault problem. The initial check shows
>>> that it seemed caused by *ompi_comm_peer_lookup* with MPI_ANY_SOURCE,
>>> which somehow messed up the allocated  temporary buffer used in SendRecv.
>>>
>>> any idea?
>>>
>>> Dahai
>>>
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>>
>>> int main(int argc, char *argv[]) {
>>>
>>>int  local_rank;
>>>int  numtask, myrank;
>>>int  count;
>>>
>>>MPI_Status   status;
>>>
>>>long long   *msg_sizes_vec;
>>>long long   *mpi_buf;
>>>long longhost_buf[4];
>>>
>>>int  send_tag;
>>>int  recv_tag;
>>>int  malloc_size;
>>>int  dest;
>>>
>>> MPI_Init(&argc,&argv);
>>>
>>> MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>> fprintf(stdout,"my RanK is %d\n",myrank);
>>>
>>> MPI_Comm_size(MPI_COMM_WORLD, &numtask);
>>> fprintf(stdout,"Num Task is %d\n",numtask);
>>>
>>> malloc_size=32;
>>> count = malloc_size / sizeof(long long);
>>> dest = (myrank+1)%2;
>>> fprintf(stdout,"my dest is %d\n",dest);
>>>
>>>
>>> host_buf[0] = 100 + myrank;
>>> host_buf[1] = 200 + myrank;
>>> host_buf[2] = 300 + myrank;
>>> host_buf[3] = 400 + myrank;
>>>
>>> fprintf(stdout,"BEFORE %lld %lld %lld %lld
>>> \n",host_buf[0],host_buf[1],host_buf[2],host_buf[3]);
>>> fflush(stdout);
>>> fprintf(stdout,"Doing sendrecv_replace with host buffer\n");
>>> fflush(stdout);
>>>
>>> MPI_Sendrecv_replace ( host_buf,
>>>  count,
>>>  MPI_LONG_LONG,
>>>   dest,
>>> myrank,
>>>   MPI_ANY_SOURCE,
>>>   dest,
>>> MPI_COMM_WORLD,
>>>&status);
>>>
>>> fprintf(stdout,"Back from doing sendrecv_replace with host
>>> buffer\n");
>>> fprintf(stdout,"AFTER %lld %lld %lld %lld
>>> \n",host_buf[0],host_buf[1],host_buf[2],host_buf[3]);
>>> fflush(stdout);
>>>
>>>
>>> MPI_Finalize();
>>> exit(0);
>>> }
>>>
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] about MPI_ANY_SOURCE in MPI_Sendrecv_replace

2017-05-09 Thread George Bosilca
Please go ahead and open an issue, I will attach the PR once I have the core 
ready. A little later today I think.

  George.


> On May 9, 2017, at 14:32 , Dahai Guo  wrote:
> 
> Hi, George:
> 
> any progress on it? an issue should be opened in github? or you already 
> opened one?
> 
> Dahai
> 
> On Fri, May 5, 2017 at 1:27 PM, George Bosilca  <mailto:bosi...@icl.utk.edu>> wrote:
> Indeed, our current implementation of the MPI_Sendrecv_replace prohibits the 
> use of MPI_ANY_SOURCE. Will work a patch later today.
> 
>   George.
> 
> 
> On Fri, May 5, 2017 at 11:49 AM, Dahai Guo  <mailto:dahai@gmail.com>> wrote:
> The following code causes memory fault problem. The initial check shows that 
> it seemed caused by ompi_comm_peer_lookup with MPI_ANY_SOURCE, which somehow 
> messed up the allocated  temporary buffer used in SendRecv.
> 
> any idea?
> 
> Dahai
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> int main(int argc, char *argv[]) {
> 
>int  local_rank;
>int  numtask, myrank;
>int  count;
> 
>MPI_Status   status;
> 
>long long   *msg_sizes_vec;
>long long   *mpi_buf;
>long longhost_buf[4];
> 
>int  send_tag;
>int  recv_tag;
>int  malloc_size;
>int  dest;
> 
> MPI_Init(&argc,&argv);
> 
> MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> fprintf(stdout,"my RanK is %d\n",myrank);
> 
> MPI_Comm_size(MPI_COMM_WORLD, &numtask);
> fprintf(stdout,"Num Task is %d\n",numtask);
> 
> malloc_size=32;
> count = malloc_size / sizeof(long long);
> dest = (myrank+1)%2;
> fprintf(stdout,"my dest is %d\n",dest);
> 
> 
> host_buf[0] = 100 + myrank;
> host_buf[1] = 200 + myrank;
> host_buf[2] = 300 + myrank;
> host_buf[3] = 400 + myrank;
> 
> fprintf(stdout,"BEFORE %lld %lld %lld %lld 
> \n",host_buf[0],host_buf[1],host_buf[2],host_buf[3]);
> fflush(stdout);
> fprintf(stdout,"Doing sendrecv_replace with host buffer\n");
> fflush(stdout);
> 
> MPI_Sendrecv_replace ( host_buf,
>  count,
>  MPI_LONG_LONG,
>   dest,
> myrank,
>   MPI_ANY_SOURCE,
>   dest,
> MPI_COMM_WORLD,
>&status);
> 
> fprintf(stdout,"Back from doing sendrecv_replace with host buffer\n");
> fprintf(stdout,"AFTER %lld %lld %lld %lld 
> \n",host_buf[0],host_buf[1],host_buf[2],host_buf[3]);
> fflush(stdout);
> 
> 
> MPI_Finalize();
> exit(0);
> }
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-05 Thread George Bosilca
If we rebranch from master for every "major" release, it makes sense to
rename the branch. In the long term, renaming seems like the way to go, and
thus the pain of altering everything that depends on the naming will exist
at some point. I'm in favor of doing it ASAP (but I have no stake in the
game, as UTK does not have an MTT).

  George.



On Fri, May 5, 2017 at 1:53 PM, Barrett, Brian via devel <
devel@lists.open-mpi.org> wrote:

> Hi everyone -
>
> We’ve been having discussions among the release managers about the choice
> of naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).
> Because the current plan is that each “major” release (in the sense of the
> three release points from master per year, not necessarily in increasing
> the major number of the release number) is to rebranch off of master,
> there’s a feeling that we should have named the branch v3.0.x, and then
> named the next one 3.1.x, and so on.  If that’s the case, we should
> consider renaming the branch and all the things that depend on the branch
> (web site, which Jeff has already half-done; MTT testing; etc.).  The
> disadvantage is that renaming will require everyone who’s configured MTT to
> update their test configs.
>
> The first question is should we rename the branch?  While there would be
> some ugly, there’s nothing that really breaks long term if we don’t.  Jeff
> has stronger feelings than I have here.
>
> If we are going to rename the branch from v3.x to v3.0.x, my proposal
> would be that we do it next Saturday evening (May 13th).  I’d create a new
> branch from the current state of v3.x and then delete the old branch.  We’d
> try to push all the PRs Friday so that there were no outstanding PRs that
> would have to be reopened.  We’d then bug everyone to update their nightly
> testing to pull from a different URL and update their MTT configs.  After a
> week or two, we’d stop having tarballs available at both v3.x and v3.0.x on
> the Open MPI web page.
>
> Thoughts?
>
> Brian
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] about MPI_ANY_SOURCE in MPI_Sendrecv_replace

2017-05-05 Thread George Bosilca
Indeed, our current implementation of MPI_Sendrecv_replace prohibits the
use of MPI_ANY_SOURCE. I will work on a patch later today.

  George.
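
Until that lands, a possible user-side workaround (a sketch only, reusing
the variable names from the test program quoted below) is to emulate the
"replace" behavior with an explicit temporary receive buffer, which avoids
the code path that breaks with MPI_ANY_SOURCE:

/* Sketch of a workaround: plain MPI_Sendrecv into a temporary buffer,
 * then copy back (needs <string.h> for memcpy). */
long long tmp_buf[4];
MPI_Sendrecv(host_buf, count, MPI_LONG_LONG, dest, myrank,
             tmp_buf,  count, MPI_LONG_LONG, MPI_ANY_SOURCE, dest,
             MPI_COMM_WORLD, &status);
memcpy(host_buf, tmp_buf, count * sizeof(long long));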


On Fri, May 5, 2017 at 11:49 AM, Dahai Guo  wrote:

> The following code causes memory fault problem. The initial check shows
> that it seemed caused by *ompi_comm_peer_lookup* with MPI_ANY_SOURCE,
> which somehow messed up the allocated  temporary buffer used in SendRecv.
>
> any idea?
>
> Dahai
>
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> int main(int argc, char *argv[]) {
>
>int  local_rank;
>int  numtask, myrank;
>int  count;
>
>MPI_Status   status;
>
>long long   *msg_sizes_vec;
>long long   *mpi_buf;
>long longhost_buf[4];
>
>int  send_tag;
>int  recv_tag;
>int  malloc_size;
>int  dest;
>
> MPI_Init(&argc,&argv);
>
> MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> fprintf(stdout,"my RanK is %d\n",myrank);
>
> MPI_Comm_size(MPI_COMM_WORLD, &numtask);
> fprintf(stdout,"Num Task is %d\n",numtask);
>
> malloc_size=32;
> count = malloc_size / sizeof(long long);
> dest = (myrank+1)%2;
> fprintf(stdout,"my dest is %d\n",dest);
>
>
> host_buf[0] = 100 + myrank;
> host_buf[1] = 200 + myrank;
> host_buf[2] = 300 + myrank;
> host_buf[3] = 400 + myrank;
>
> fprintf(stdout,"BEFORE %lld %lld %lld %lld \n",host_buf[0],host_buf[1],
> host_buf[2],host_buf[3]);
> fflush(stdout);
> fprintf(stdout,"Doing sendrecv_replace with host buffer\n");
> fflush(stdout);
>
> MPI_Sendrecv_replace ( host_buf,
>  count,
>  MPI_LONG_LONG,
>   dest,
> myrank,
>   MPI_ANY_SOURCE,
>   dest,
> MPI_COMM_WORLD,
>&status);
>
> fprintf(stdout,"Back from doing sendrecv_replace with host buffer\n");
> fprintf(stdout,"AFTER %lld %lld %lld %lld \n",host_buf[0],host_buf[1],
> host_buf[2],host_buf[3]);
> fflush(stdout);
>
>
> MPI_Finalize();
> exit(0);
> }
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] count = -1 for reduce

2017-05-05 Thread George Bosilca
On Fri, May 5, 2017 at 10:41 AM, Josh Hursey  wrote:

> To Dahai's last point - The second MPI_Reduce will fail with this error:
> *** An error occurred in MPI_Reduce
> *** reported by process [2212691969,0]
> *** on communicator MPI_COMM_WORLD
> *** MPI_ERR_ARG: invalid argument of some other kind
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
>
> That's because both the send and recv buffers are equal to NULL (and thus
> equal to each other) and the root process will fail at this check even
> though the count==0:
>  * https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/
> reduce.c#L97-L99
>
> So Dahai is suggesting moving this code block:
>  * https://github.com/open-mpi/ompi/blob/master/ompi/mpi/
> c/reduce.c#L123-L129
> To the top of the function so that if the count==0 then we return success
> without checking any of the parameters.
>
>
> In my mind this raises a philosophical question about the MPI Standard. If
> the user is passing bad arguments, but the count = 0 should we return
> MPI_SUCCESS (since there is nothing to do) or MPI_ERR_ARG (since there was
> a bad argument, but harmless since no action is to be taken). The standard
> says that a high quality implementation should make a best effort in
> checking these parameters and returning an error. But is it in line with
> the spirit of the standard to disregard this checking when we know that the
> operation is a no-op.
>
> It should be noted that the whole count==0 case is outside the scope of
> the standard anyway and is only there because some benchmarks erroneously
> make calls to reduce with count==0 (per the comment in the file). So maybe
> at the end of the day I've had too little sleep and too much coffee - and
> this is nothing to worry about :)
>

We had a similar discussion on the MPI Forum mailing list about the
validity of the datatype passed as an argument when the count is zero. We
converged (in the MPI Forum sense) that the datatype should be correct,
even when the count is zero. While the discussion is slightly different,
the spirit is the same: which arguments should be correct, and when.

But the sendbuf == recvbuf == NULL case when count is 0 should be allowed;
not by hoisting the count == 0 early return, but by adding an exception to
the MPI_IN_PLACE check.
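
A sketch of what that exception could look like (illustrative only, not the
literal reduce.c code; the names are the ones used in the checks linked
above):

/* Aliasing the send and receive buffers at the root remains an error,
 * except for the harmless no-op case where count == 0 and both are NULL. */
if (ompi_comm_rank(comm) == root &&
    MPI_IN_PLACE != sendbuf &&
    sendbuf == recvbuf &&
    !(0 == count && NULL == sendbuf)) {
    err = MPI_ERR_ARG;
}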


> There is another related question about the consistency of parameter
> checking between reduce.c and allreduce.c in the case where count==1:
>   https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/
> allreduce.c#L86-L90
> Should reduce.c also have a similar check? It seems like the check at the
> root rank in MPI_Reduce should match that of Allreduce, right?
>

:+1:

  George.



>
>
> On Fri, May 5, 2017 at 8:39 AM, Dahai Guo  wrote:
>
>> Thanks, George. It works!  In addition, the following code would also
>> cause a problem.  checking if count ==0 should be moved to the beginning of
>> the code ompi/mpi/c/reduce.c and ireduce.c, or fix it in other way.
>>
>> Dahai
>>
>>
>> #include 
>> #include 
>> #include 
>>
>> int main(int argc, char** argv)
>> {
>> int r[1], s[1];
>> MPI_Init(&argc,&argv);
>>
>> s[0] = 1;
>> r[0] = -1;
>> MPI_Reduce(s,r,0,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
>> printf("%d\n",r[0]);
>> MPI_Reduce(NULL,NULL,0,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
>> MPI_Finalize();
>> }
>> ~
>>
>>
>> On Thu, May 4, 2017 at 9:18 PM, George Bosilca 
>> wrote:
>>
>>> I was able to reproduce it (with the correct version of OMPI, aka. the
>>> v2.x branch). The problem seems to be that we are lacking a part of
>>> the fe68f230991 commit, that remove a free on a statically allocated array.
>>> Here is the corresponding patch:
>>>
>>> diff --git a/ompi/errhandler/errhandler_predefined.c
>>> b/ompi/errhandler/errhandler_predefined.c
>>> index 4d50611c12..54ac63553c 100644
>>> --- a/ompi/errhandler/errhandler_predefined.c
>>> +++ b/ompi/errhandler/errhandler_predefined.c
>>> @@ -15,6 +15,7 @@
>>>   * Copyright (c) 2010-2011 Oak Ridge National Labs.  All rights
>>> reserved.
>>>   * Copyright (c) 2012  Los Alamos National Security, LLC.
>>>   * All rights reserved.
>>> + * Copyright (c) 2016  Intel, Inc.  All rights reserved.
>>>   * $COPYRIGHT$
>>>   *
>>>   * Additional copyrights may follow
>>> @@ -181,6 +182,7 @@ static void backend_fatal_aggregate(char *type,
>>>  const char* const unknown_error_code = "Error co

Re: [OMPI devel] count = -1 for reduce

2017-05-04 Thread George Bosilca
I was able to reproduce it (with the correct version of OMPI, aka the v2.x
branch). The problem seems to be that we are lacking a part of the
fe68f230991 commit, which removes a free() on a statically allocated array.
Here is the corresponding patch:

diff --git a/ompi/errhandler/errhandler_predefined.c
b/ompi/errhandler/errhandler_predefined.c
index 4d50611c12..54ac63553c 100644
--- a/ompi/errhandler/errhandler_predefined.c
+++ b/ompi/errhandler/errhandler_predefined.c
@@ -15,6 +15,7 @@
  * Copyright (c) 2010-2011 Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2012  Los Alamos National Security, LLC.
  * All rights reserved.
+ * Copyright (c) 2016  Intel, Inc.  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -181,6 +182,7 @@ static void backend_fatal_aggregate(char *type,
 const char* const unknown_error_code = "Error code: %d (no associated
error message)";
 const char* const unknown_error = "Unknown error";
 const char* const unknown_prefix = "[?:?]";
+bool generated = false;

 // these do not own what they point to; they're
 // here to avoid repeating expressions such as
@@ -211,6 +213,8 @@ static void backend_fatal_aggregate(char *type,
 err_msg = NULL;
 opal_output(0, "%s", "Could not write to err_msg");
 opal_output(0, unknown_error_code, *error_code);
+} else {
+generated = true;
 }
 }
 }
@@ -256,7 +260,9 @@ static void backend_fatal_aggregate(char *type,
 }

 free(prefix);
-free(err_msg);
+if (generated) {
+free(err_msg);
+}
 }

 /*

  George.



On Thu, May 4, 2017 at 10:03 PM, Jeff Squyres (jsquyres)  wrote:

> Can you get a stack trace?
>
> > On May 4, 2017, at 6:44 PM, Dahai Guo  wrote:
> >
> > Hi, George:
> >
> > attached is the ompi_info.  I built it on Power8 arch. The configure is
> also simple.
> >
> > ../configure --prefix=${installdir} \
> > --enable-orterun-prefix-by-default
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:45 PM, George Bosilca 
> wrote:
> > Dahai,
> >
> > You are right the segfault is unexpected. I can't replicate this on my
> mac. What architecture are you seeing this issue ? How was your OMPI
> compiled ?
> >
> > Please post the output of ompi_info.
> >
> > Thanks,
> > George.
> >
> >
> >
> > On Thu, May 4, 2017 at 5:42 PM, Dahai Guo  wrote:
> > Those messages are what I like to see. But, there are some other error
> messages and core dump I don't like, as I attached in my previous email.  I
> think something might be wrong with errhandler in openmpi.  Similar thing
> happened for Bcast, etc
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:32 PM, Nathan Hjelm  wrote:
> > By default MPI errors are fatal and abort. The error message says it all:
> >
> > *** An error occurred in MPI_Reduce
> > *** reported by process [3645440001,0]
> > *** on communicator MPI_COMM_WORLD
> > *** MPI_ERR_COUNT: invalid count argument
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> >
> > If you want different behavior you have to change the default error
> handler on the communicator using MPI_Comm_set_errhandler. You can set it
> to MPI_ERRORS_RETURN and check the error code or you can create your own
> function. See MPI 3.1 Chapter 8.
> >
> > -Nathan
> >
> > On May 04, 2017, at 02:58 PM, Dahai Guo  wrote:
> >
> >> Hi,
> >>
> >> Using opemi 2.1,  the following code resulted in the core dump,
> although only a simple error msg was expected.  Any idea what is wrong?  It
> seemed related the errhandler somewhere.
> >>
> >>
> >> D.G.
> >>
> >>
> >>  *** An error occurred in MPI_Reduce
> >>  *** reported by process [3645440001,0]
> >>  *** on communicator MPI_COMM_WORLD
> >>  *** MPI_ERR_COUNT: invalid count argument
> >>  *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> >>  ***and potentially your MPI job)
> >> ..
> >>
> >> [1,1]:1000151c-1000151e rw-p  00:00 0
> >> [1,1]:1000151e-10001525 rw-p  00:00 0
> >> [1,1]:10001525-10001527 rw-p  00:00 0
> >> [1,1]:10001527-1000152e rw-p  00:00 0
> >> [1,1]:1000152e-10001530 rw-p  00:00 0
> >> [1,1]:10001530-10001551 rw-p  00:00 0
> >> [1,1]:1000155

Re: [OMPI devel] count = -1 for reduce

2017-05-04 Thread George Bosilca
Dahai,

You are right, the segfault is unexpected. I can't replicate this on my
Mac. On what architecture are you seeing this issue? How was your OMPI
compiled?

Please post the output of ompi_info.

Thanks,
George.



On Thu, May 4, 2017 at 5:42 PM, Dahai Guo  wrote:

> Those messages are what I like to see. But, there are some other error
> messages and core dump I don't like, as I attached in my previous email.  I
> think something might be wrong with errhandler in openmpi.  Similar thing
> happened for Bcast, etc
>
> Dahai
>
> On Thu, May 4, 2017 at 4:32 PM, Nathan Hjelm  wrote:
>
>> By default MPI errors are fatal and abort. The error message says it all:
>>
>> *** An error occurred in MPI_Reduce
>> *** reported by process [3645440001,0]
>> *** on communicator MPI_COMM_WORLD
>> *** MPI_ERR_COUNT: invalid count argument
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>>
>> If you want different behavior you have to change the default error
>> handler on the communicator using MPI_Comm_set_errhandler. You can set it
>> to MPI_ERRORS_RETURN and check the error code or you can create your own
>> function. See MPI 3.1 Chapter 8.
>>
>> -Nathan
>>
>> On May 04, 2017, at 02:58 PM, Dahai Guo  wrote:
>>
>> Hi,
>>
>> Using opemi 2.1,  the following code resulted in the core dump, although
>> only a simple error msg was expected.  Any idea what is wrong?  It seemed
>> related the errhandler somewhere.
>>
>>
>> D.G.
>>
>>
>>  *** An error occurred in MPI_Reduce
>>  *** reported by process [3645440001,0]
>>  *** on communicator MPI_COMM_WORLD
>>  *** MPI_ERR_COUNT: invalid count argument
>>  *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>  ***and potentially your MPI job)
>> ..
>>
>> [1,1]:1000151c-1000151e rw-p  00:00 0
>> [1,1]:1000151e-10001525 rw-p  00:00 0
>> [1,1]:10001525-10001527 rw-p  00:00 0
>> [1,1]:10001527-1000152e rw-p  00:00 0
>> [1,1]:1000152e-10001530 rw-p  00:00 0
>> [1,1]:10001530-10001551 rw-p  00:00 0
>> [1,1]:10001551-10001553 rw-p  00:00 0
>> [1,1]:10001553-10001574 rw-p  00:00 0
>> [1,1]:10001574-10001576 rw-p  00:00 0
>> [1,1]:10001576-10001597 rw-p  00:00 0
>> [1,1]:10001597-10001599 rw-p  00:00 0
>> [1,1]:10001599-100015ba rw-p  00:00 0
>> [1,1]:100015ba-100015bc rw-p  00:00 0
>> [1,1]:100015bc-100015dd rw-p  00:00 0
>> [1,1]:100015dd-100015df rw-p  00:00 0
>> [1,1]:100015df-10001600 rw-p  00:00 0
>> [1,1]:10001600-10001602 rw-p  00:00 0
>> [1,1]:10001602-10001623 rw-p  00:00 0
>> [1,1]:10001623-10001625 rw-p  00:00 0
>> [1,1]:10001625-10001646 rw-p  00:00 0
>> [1,1]:10001646-10001647 rw-p  00:00 0
>> [1,1]:3fffd463-3fffd46c rw-p  00:00
>> 0  [stack]
>> 
>> --
>>
>> #include 
>> #include 
>> #include 
>> int main(int argc, char** argv)
>> {
>>
>> int r[1], s[1];
>> MPI_Init(&argc,&argv);
>>
>> s[0] = 1;
>> r[0] = -1;
>> MPI_Reduce(s,r,*-1*,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
>> printf("%d\n",r[0]);
>> MPI_Finalize();
>> }
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] NetPIPE performance curves

2017-05-03 Thread George Bosilca
Dave,

I specifically forced the OB1 PML with the OpenIB BTL. I have "--mca pml
ob1 --mca btl openib,self" on my mpirun.

Originally, I assumed that the pipeline protocol was not kicking in as
expected, and that the large cost you are seeing was due to pinning the
entire buffer for the communication. Thus, I tried to alter the MCA
parameters driving the pipeline protocol, but failed to see any major
benefit (compared with the stock version).

Here is what I used:
mpirun --map-by node --mca pml ob1 --mca btl openib,self --mca
btl_openib_get_limit $((1024*1024)) --mca btl_openib_put_limit
$((1024*1024)) ./NPmpi --nocache --start 100

  George.



On Wed, May 3, 2017 at 4:27 PM, Dave Turner  wrote:

> George,
>
> Our local cluster runs Gentoo which I think prevents us from
> using MXM and we do not use UCX.  It's a pretty standard build
> of 2.0.1 (ompi_info -a for Beocat is attached).
>
> I've also attached the ompi_info -a dump for Comet which is
> running 1.8.4.  A grep shows nothing about MXM or UCX.
>
> Are you testing with MXM or UCX that would be giving you
> the different results?
>
> Dave
>
> On Wed, May 3, 2017 at 1:00 PM,  wrote:
>
>> Send devel mailing list submissions to
>> devel@lists.open-mpi.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> or, via email, send a message with subject or body 'help' to
>> devel-requ...@lists.open-mpi.org
>>
>> You can reach the person managing the list at
>> devel-ow...@lists.open-mpi.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of devel digest..."
>>
>>
>> Today's Topics:
>>
>>1. NetPIPE performance curves (Dave Turner)
>>2. Re: NetPIPE performance curves (George Bosilca)
>>3. remote spawn - have no children (Justin Cinkelj)
>>4. Re: remote spawn - have no children (r...@open-mpi.org)
>>5. Re: remote spawn - have no children (r...@open-mpi.org)
>>6. Re: remote spawn - have no children (Justin Cinkelj)
>>7. Re: remote spawn - have no children (r...@open-mpi.org)
>>
>>
>> --
>>
>> Message: 1
>> Date: Tue, 2 May 2017 15:40:59 -0500
>> From: Dave Turner 
>> To: Open MPI Developers 
>> Subject: [OMPI devel] NetPIPE performance curves
>> Message-ID:
>> > ail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>>
>> I've used my NetPIPE communication benchmark (
>> http://netpipe.cs.ksu.edu)
>>
>> to measure the performance of OpenMPI and other implementations on
>> Comet at SDSC (FDR IB, graph attached, same results measured elsewhere
>> too).
>> The uni-directional performance is good at 50 Gbps, the bi-directional
>> performance
>> is double that at 97 Gbps, and the aggregate bandwidth from measuring
>> 24 bi-directional ping-pongs across the link between 2 nodes is a little
>> lower
>> than I'd like to see but still respectable, and similar for MVAPICH.  All
>> these
>> were measured by reusing the same source and destination buffers
>> each time.
>>
>> When I measure using the --nocache flag where the data comes
>> from a new buffer in main memory each time, and is therefore also
>> not already registered with the IB card, and likewise gets put into a
>> new buffer in main memory, I see a loss in performance of at least
>> 20%.  Could someone please give me a short description of whether
>> this is due to data being copied into a memory buffer that is already
>> registered with the IB card, or whether this is the cost of registering
>> the new memory with the IB card for its first use?
>>  I also see huge performance losses in this case when the message
>> size is not a factor of 8 bytes (factors of 8 are the tops of the spikes).
>> I've seen this in the past when there was a memory copy involved and
>> the copy routine switched to a byte-by-byte copy for non factors of 8.
>> While I don't know how many apps fall into this worst case scenario
>> that the --nocache measurements represent, I could certainly see large
>> bioinformatics runs being affected as the message lengths are not
>> going to be factors of 8 bytes.
>>
>>  Dave Turner
>>
>> --
>> Work: davetur...@ksu.edu (785) 532-7791
>>  2219 Engineering Hall, Manha

Re: [OMPI devel] NetPIPE performance curves

2017-05-02 Thread George Bosilca
David,

Are you using the OB1 PML or one of our IB-enabled MTLs (UCX or MXM) ? I
have access to similar cards, and I can't replicate your results. I do see
a performance loss, but nowhere near what you have seen (it is going down
to 47Gb instead of 50Gb).

George.


On Tue, May 2, 2017 at 4:40 PM, Dave Turner  wrote:

>
> I've used my NetPIPE communication benchmark (
> http://netpipe.cs.ksu.edu)
> to measure the performance of OpenMPI and other implementations on
> Comet at SDSC (FDR IB, graph attached, same results measured elsewhere
> too).
> The uni-directional performance is good at 50 Gbps, the bi-directional
> performance
> is double that at 97 Gbps, and the aggregate bandwidth from measuring
> 24 bi-directional ping-pongs across the link between 2 nodes is a little
> lower
> than I'd like to see but still respectable, and similar for MVAPICH.  All
> these
> were measured by reusing the same source and destination buffers
> each time.
>
> When I measure using the --nocache flag where the data comes
> from a new buffer in main memory each time, and is therefore also
> not already registered with the IB card, and likewise gets put into a
> new buffer in main memory, I see a loss in performance of at least
> 20%.  Could someone please give me a short description of whether
> this is due to data being copied into a memory buffer that is already
> registered with the IB card, or whether this is the cost of registering
> the new memory with the IB card for its first use?
>  I also see huge performance losses in this case when the message
> size is not a factor of 8 bytes (factors of 8 are the tops of the spikes).
> I've seen this in the past when there was a memory copy involved and
> the copy routine switched to a byte-by-byte copy for non factors of 8.
> While I don't know how many apps fall into this worst case scenario
> that the --nocache measurements represent, I could certainly see large
> bioinformatics runs being affected as the message lengths are not
> going to be factors of 8 bytes.
>
>  Dave Turner
>
> --
> Work: davetur...@ksu.edu (785) 532-7791
>  2219 Engineering Hall, Manhattan KS  66506
> Home:drdavetur...@gmail.com
>   cell: (785) 770-5929
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] MPI_Type_Dup specification

2017-05-02 Thread George Bosilca
In my interpretation of the text you mention, I have not considered the
type name an intrinsic property of the datatype (unlike ub, lb, size and
many others). Thus, I took the liberty of altering it in a meaningful way
for debugging purposes. This should not affect users who set the name
themselves, as our default will simply be overwritten.

  George.


PS: the text is obviously wrong when it claims that the "[type] yields the
same net result when fully decoded with the functions in Section 4.1.13".
If that was the case we wouldn't have had a need for MPI_COMBINER_DUP.
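
From the user's side this is easy to see (a small sketch, error checks
omitted): whatever name the implementation attaches to the duplicate is
only a default, and can be overwritten at will.

MPI_Datatype dup_type;
char name[MPI_MAX_OBJECT_NAME];
int  len;

MPI_Type_dup(MPI_INT, &dup_type);
MPI_Type_get_name(dup_type, name, &len);     /* may report "Dup MPI_INT" */
MPI_Type_set_name(dup_type, "my_int_copy");  /* the user-chosen name wins */
MPI_Type_free(&dup_type);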

On Tue, May 2, 2017 at 4:31 AM, Aboorva Devarajan  wrote:

> This particular test from MPICH fails : https://github.com/pmodels/
> mpich/blob/master/test/mpi/f77/datatype/typesnamef.f
>
> $ mpirun -n 1 ./typesnamef
>
> (type2) Expected length 0, got  17
> (type2) Datatype name is not all blank
> Found  2  errors
>
> The test case expects the duplicated datatype's name to be blank soon
> after duplication.
>
> I could see in open-mpi we are actually duplicating the old datatype name
> in MPI_Type_Dup and pre-pending it with string "Dup"
>
> https://github.com/open-mpi/ompi/blob/872cf44c28203fcb21838b0705d5b9
> c85c3e1407/ompi/datatype/ompi_datatype_create.c#L111
>
>
> According to the MPI Standard v3.1 On Page 111:
>
> "MPI_TYPE_DUP is a type constructor which duplicates the existing
> oldtype with associated key values. For each key value, the respective
> copy callback function determines the attribute value associated with this
> key in the new communicator; one particular action that a copy callback may
> take is to delete the attribute from the new datatype. Returns in newtype a
> new datatype with exactly the same properties as oldtype and any copied
> cached information, see Section 6.7.4. The new datatype has identical upper
> bound and lower bound and yields the same net result when fully decoded
> with the functions in Section 4.1.13. The newtype has the same committed
> state as the old oldtype.
>
> *"Returns in newtype a new datatype with exactly the same properties as
> oldtype"*
>
> Any information on how this spec must be interpreted? Should we consider
> datatype name as a property?
>
>
> - Aboorva
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] TCP BTL's multi-link behavior

2017-04-26 Thread George Bosilca
I wouldn't put too much faith in my memory either. What I recall is that
the multi-link code was written mainly for the PS4, where the supervisor
would limit the bandwidth per socket to about 60% of the hardware
capabilities. Thus, by using multiple links (in fact, sockets between a set
of peers) we could aggregate the bandwidth.

I do not recall all the details, but I think the code was supposed to
increase the latency and decrease the bandwidth for the case where multiple
TCP modules were using the same interface. It would certainly be
interesting to rewrite this code; it is 10 years old.

  George.




On Wed, Apr 26, 2017 at 3:24 PM, Barrett, Brian via devel <
devel@lists.open-mpi.org> wrote:

> George -
>
> Do you remember why you adjusted both the latency and bandwidth for
> secondary links when using multi-link support with the TCP BTL [1]?  I
> think I understand why, but your memory of 10 years ago is hopefully more
> accurate than my reverse engineering ;).
>
> Thanks,
>
> Brian
>
>
>
> [1] https://github.com/open-mpi/ompi/blame/master/opal/mca/
> btl/tcp/btl_tcp_component.c#L497
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Segfault during a free in reduce_scatter using basic component

2017-03-28 Thread George Bosilca
Emmanuel,

I tried with both 2.x and master (they are only syntactically different
with regard to reduce_scatter) and I can't reproduce your issue. I run the
OSU test with the following command line:

mpirun -n 97 --mca coll basic,libnbc,self --mca pml ob1
./osu_reduce_scatter -m 524288: -i 1 -x 0

I used the IB, TCP, vader and self BTLs.

  George.





On Tue, Mar 28, 2017 at 6:21 AM, Howard Pritchard 
wrote:

> Hello Emmanuel,
>
> Which version of Open MPI are you using?
>
> Howard
>
>
> 2017-03-28 3:38 GMT-06:00 BRELLE, EMMANUEL :
>
>> Hi,
>>
>> We are working  on a portals4 components and we have found a bug
>> (causing a segmentation fault ) which must be  related to the coll/basic
>> component.
>> Due to a lack of time, we cannot investigate further but this seems to be
>> caused by a “free(disps);“ (around line 300 in coll_basic_reduce_scatter)
>> in some specific situations. In our case it  happens on a
>> osu_reduce_scatter (from the OSU microbenchmarks) with at least 97 procs
>> for sizes bigger than 512Ko
>>
>> Step to reproduce :
>> export OMPI_MCA_mtl=^portals4
>> export OMPI_MCA_btl=^portals4
>> export OMPI_MCA_coll=basic,libnbc,self,tuned
>> export OMPI_MCA_osc=^portals4
>> export OMPI_MCA_pml=ob1
>> mpirun -n 97 osu_reduce_scatter -m 524288:
>>
>> ( reducing the number of iterations with –i 1 –x 0 should keep the bug)
>> Our git branch is based on the v2.x branch and the files differ almost
>> only on portals4 parts.
>>
>> Could someone confirm this bug ?
>>
>> Emmanuel BRELLE
>>
>>
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] PVAR definition - Variable type regarding the PVAR class

2017-03-22 Thread George Bosilca
Clement,

I would continue to use the SIZE_T atomics but define the PVARs' type
accordingly (UNSIGNED_LONG or UNSIGNED_LONG_LONG, depending on which of
their sizes matches that of size_t).

  George.
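
A sketch of what I mean (assuming the usual mca_base_var_type_t enum names;
the registration call itself stays whatever you use today):

mca_base_var_type_t pvar_type;

if (sizeof(size_t) == sizeof(unsigned long long)) {
    pvar_type = MCA_BASE_VAR_TYPE_UNSIGNED_LONG_LONG;
} else {
    pvar_type = MCA_BASE_VAR_TYPE_UNSIGNED_LONG;
}
/* Register the PVARs with pvar_type, while the counters themselves remain
 * size_t and keep using the opal_atomic_add_*() functions. */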


On Wed, Mar 22, 2017 at 6:45 AM, Clement FOYER 
wrote:

> Hi everyone,
>
> I'm facing an issue with the typing of the performance variables I try to
> register.
>
> They are basically counters of type MCA_BASE_PVAR_CLASS_AGGREGATE,
> MCA_BASE_PVAR_CLASS_COUNTER, or MCA_BASE_PVAR_CLASS_SIZE depending on the
> case. In order to silence some warnings when using opal_atomic_add_*()
> functions they were switched from uint64_t to int64_t. But doing so, we
> loose some consistency in the code. It was then decided to move from
> int64_t to size_t, in order to keep consistency across every functions of
> our component's API.
>
> The problem the following : I can't register properly my variables
> anymore, because of the restrictions over the types as defined in the
> MPI3.1 standard (Sec. 14.3.7, p.580), I can't define these variable as
> MCA_BASE_VAR_TYPE_SIZE_T, as they are not explicitly authorized. However,
> the SIZE_T type is not even allowed as type for performance variables, as
> it is not defined in the list given Section 14.3.5 p.571. So should this
> type never to be used for defining performance variables, and thus it
> should not be given as part of the enum mca_base_var_type_t type, or can it
> be considered as an alias for unsigned long or unsigned long long, and thus
> it should be possible to use them in the same cases unsigned long int and
> unsigned long long int are ?
>
> Thenk you in advance for your answer,
>
> Clement FOYER
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI, ssh and limits

2017-03-03 Thread George Bosilca
Isn't this supposed to be part of cluster 101?

I would rather add it to our FAQ, maybe in a slightly more generic way (not
only focused on 'ulimit -c'). Otherwise we will be bound to define what is
forwarded and what is not, and potentially create chaos for knowledgeable
users (who know how to deal with these issues).

George


On Mar 3, 2017 3:05 AM, "Gilles Gouaillardet"  wrote:

Folks,


this is a follow-up on https://www.mail-archive.com/u
s...@lists.open-mpi.org//msg30715.html


on my cluster, the core file size is 0 by default, but it can be set to
unlimited by any user.

i think this is a pretty common default.


$ ulimit -c
0
$ bash -c 'ulimit -c'
0
$ mpirun -np 1 bash -c 'ulimit -c'
0

$ mpirun -np 1 --host n1 bash -c 'ulimit -c'
0

$ ssh n1
[n1 ~]$ ulimit -c
0
[n1 ~]$ bash -c 'ulimit -c'
0

*but*

$ ssh motomachi-n1 bash -c 'ulimit -c'
unlimited


now if i manually set the core file size to unlimited

$ ulimit -c unlimited
$ ulimit -c
unlimited
$ bash -c 'ulimit -c'
unlimited
$ mpirun -np 1 bash -c 'ulimit -c'
unlimited


*but*

$ mpirun -np 1 --host n1 bash -c 'ulimit -c'
0


fun fact

$ ssh n1 bash -c 'ulimit -c; bash -c "ulimit -c"'
unlimited
0


bottom line, MPI tasks that run on the same node mpirun was invoked on
inherit

the core file size limit from mpirun, whereas tasks that run on the other
node

use the default core file size limit.


a manual workaround is

mpirun --mca opal_set_max_sys_limits core:unlimited ...


i guess we should do something about that, but what

- just document it

- mpirun forwards all/some limits to all the spawned tasks regardless where
they run

- mpirun forwards all/some limits to all the spawned tasks regardless where
they run

  but only if they are 0 or unlimited

- something else



thoughts anyone ?


Gilles


___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Segfault on MPI init

2017-02-10 Thread George Bosilca
To complete this thread, the problem is now solved. Some .so files were
lingering around from a previous installation, causing the startup problems.

  George.


> On Feb 10, 2017, at 05:38 , Cyril Bordage  wrote:
> 
> Thank you for your answer.
> I am running the git master version (last tested was cad4c03).
> 
> FYI, Clément Foyer is talking with George Bosilca about this problem.
> 
> 
> Cyril.
> 
> Le 08/02/2017 à 16:46, Jeff Squyres (jsquyres) a écrit :
>> What version of Open MPI are you running?
>> 
>> The error is indicating that Open MPI is trying to start a user-level helper 
>> daemon on the remote node, and the daemon is seg faulting (which is unusual).
>> 
>> One thing to be aware of:
>> 
>> https://www.open-mpi.org/faq/?category=building#install-overwrite
>> 
>> 
>> 
>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage  wrote:
>>> 
>>> Hello,
>>> 
>>> I cannot run the a program with MPI when I compile it myself.
>>> On some nodes I have the following error:
>>> 
>>> [mimi012:17730] *** Process received signal ***
>>> [mimi012:17730] Signal: Segmentation fault (11)
>>> [mimi012:17730] Signal code: Address not mapped (1)
>>> [mimi012:17730] Failing at address: 0xf8
>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x766c0500]
>>> [mimi012:17730] [ 1]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7781fcb9]
>>> [mimi012:17730] [ 2]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7197fbcd]
>>> [mimi012:17730] [ 3]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x71981e34]
>>> [mimi012:17730] [ 4]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7197bb1d]
>>> [mimi012:17730] [ 5]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7782323c]
>>> [mimi012:17730] [ 6]
>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x777c534c]
>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x766b8851]
>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7640694d]
>>> [mimi012:17730] *** End of error message ***
>>> --
>>> ORTE has lost communication with its daemon located on node:
>>> 
>>> hostname:  mimi012
>>> 
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --
>>> 
>>> 
>>> The error does not appear with the official MPI installed in the
>>> platform. I asked the admins about their compilation options but there
>>> is nothing particular.
>>> 
>>> Moreover it appears only for some node lists. Still, the nodes seem to
>>> be fine since it works with the official version of MPI of the platform.
>>> 
>>> To be sure it is not a network problem I tried to use "-mca btl
>>> tcp,sm,self" or "-mca btl openib,sm,self" with no change.
>>> 
>>> Do you have any idea where this error may come from?
>>> 
>>> Thank you.
>>> 
>>> 
>>> Cyril Bordage.
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> 
>> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Disable progress engine

2017-01-24 Thread George Bosilca
If you do not explicitly enable asynchronous progress (which, by the way,
is only supported by some of the BTLs), there will be little asynchronous
progress in Open MPI. Indeed, as long as the library is not in an MPI call,
all progress is stalled, no matching is done, and the only possible
progress is what happens outside the OMPI framework (such as network-level
RMA).

The ORTE progress thread only helps during the setup of the runtime; it has
no impact on MPI communication. The fact that OPAL supports threads simply
means that the library has been compiled in a thread-safe way. This thread
safety is then enabled at runtime, based on the arguments provided during
MPI initialization (MPI_Init / MPI_Init_thread).

  George.
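
As a small illustration of that last point (a sketch in the application's
main()), the thread level is requested at initialization time and the
library reports what it actually provides:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    /* the library was not initialized with full thread support */
}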


On Tue, Jan 24, 2017 at 5:45 AM, Cyril Bordage 
wrote:

> Hello,
>
> I would like to see how the overlapping between communications and
> computations is done in some applications. And I would like to be able
> to prevent it.
> In this purpose, is it possible to disable the progress engine and have
> the application nearly as if there was no asynchronous communications?
>
> Since I am not familiar with ompi terminology, I would like to know what
> the thread support section "OPAL support: yes, OMPI progress: no, ORTE
> progress: yes" in impi_info means.
>
> Thank you.
>
>
> Cyril.
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] MPI_Bcast algorithm

2017-01-19 Thread George Bosilca
Emin,

Before going into all this effort to implement a new function, let me
describe what is already there and what you can use to achieve something
similar. It will not be exactly what you describe, because changing a
particular collective algorithm dynamically on a communicator is not as
safe as it sounds. However, you can reconfigure the collective selection
before creating a new communicator.

For this you will have to set coll_tuned_use_dynamic_rules to 1. Assuming
that you want to play with broadcast, you need to force the MCA parameter
coll_tuned_bcast_algorithm to a new value (by calling the corresponding MCA
function). At the same time you can also define the parameters that
regulate the broadcast algorithms, such as the segment size
(coll_tuned_bcast_algorithm_segmentsize). The next communicator creation
call will inherit these new values.

  George.

PS: feel free to ping me privately if you want more info.
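
For illustration, here is a sketch (error checks omitted) that assumes the
coll_tuned parameters are exposed and writable through the MPI_T
control-variable interface in your build; the internal mca_base_var API is
the other option:

int idx, count, threads, algo = 5;   /* one of the tuned bcast algorithm ids */
MPI_T_cvar_handle handle;
MPI_Comm newcomm;

MPI_T_init_thread(MPI_THREAD_SINGLE, &threads);
/* coll_tuned_use_dynamic_rules must already be set to 1 */
MPI_T_cvar_get_index("coll_tuned_bcast_algorithm", &idx);
MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
MPI_T_cvar_write(handle, &algo);
MPI_T_cvar_handle_free(&handle);

MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);  /* inherits the new selection */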

On Thu, Jan 19, 2017 at 1:27 PM, Emin Nuriyev 
wrote:

> Hi,
>
> I want to create broadcast function which allow me to select algorithm in
> application layer.
> For example :
>
> MPI_Bcast_alg(void *buffer, int count, MPI_Datatype datatype, int root,
> MPI_Comm comm, int alg)
>
> Same arguments as MPI_Bcast() and plus  algorithm's id.
>
> *coll_tune_bcast_decision.c* file contain function which allows us to
> select function.
>
> int ompi_coll_tuned_bcast_intra_do_this(void *buf, int count,
> struct ompi_datatype_t *dtype,
> int root,
> struct ompi_communicator_t *comm,
> mca_coll_base_module_t *module,
> int algorithm, int faninout, int
> segsize)
>
> I've got some information about MCA. But yest, it is not clear for me how
> to implement it ? How can I do it ?
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] OMPI v1.10.6

2017-01-18 Thread George Bosilca
https://github.com/open-mpi/ompi/issues/2750

  George.



On Wed, Jan 18, 2017 at 12:57 PM, r...@open-mpi.org  wrote:

> Last call for v1.10.6 changes - we still have a few pending for review,
> but none marked as critical. If you want them included, please push for a
> review _now_
>
> Thanks
> Ralph
>
> > On Jan 12, 2017, at 1:54 PM, r...@open-mpi.org wrote:
> >
> > Hi folks
> >
> > It looks like we may have motivation to release 1.10.6 in the near
> future. Please check to see if you have anything that should be included,
> or is pending review.
> >
> > Thanks
> > Ralph
> >
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Wtime is 0.0

2016-12-21 Thread George Bosilca
Jan,

You can use MPI_Type_match_size (
https://www.open-mpi.org/doc/v2.0/man3/MPI_Type_match_size.3.php) to find
the predefined MPI datatype that matches a given type class and byte size.

  George.
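
For example (a short sketch), to obtain on the C side the type matching an
8-byte Fortran REAL:

MPI_Datatype real8;
MPI_Type_match_size(MPI_TYPECLASS_REAL, 8, &real8);
/* real8 now names the predefined type whose size is 8 bytes, suitable for
 * buffers declared as REAL*8 / DOUBLE PRECISION on the Fortran side. */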



On Wed, Dec 21, 2016 at 3:16 AM, 🐋 Jan Hegewald 
wrote:

> Hi Jeff et al,
>
> > On 20 Dec 2016, at 20:23, Jeff Squyres (jsquyres) 
> wrote:
> >
> > Fair enough.
> >
> > But be aware that if you are building MPICH with double-width Fortran
> types and single-width C types, you might want to verify that MPICH is
> actually working properly.
>
> yes. This episode certainly did bring the possible side effects of the
> "-fdefault-real-8" on my list. But for now I just want this Wtime to work (:
>
> >
> > Open MPI is refusing to configure because for each Fortran type X, it
> looks for a corresponding C type Y.  If it can't find a correspondence,
> then it fails/aborts configure (i.e., it lets a human figure it out --
> usually by ensuring that there are equivalent flags for the C/C++ and
> Fortran compilers).  At least in Open MPI, having a C/Fortran type
> equivalence is necessary for reduction operations (because we do them in
> C).  I don't know if MPICH has this restriction or not.
> >
> > In your case, Open MPI is failing to find a basic (single-width) C type
> that corresponds to the (double-width) Fortran type COMPLEX.  This may not
> matter for your specific application, but we tend to take an all-or-nothing
> approach to configuration/building (i.e., we won't knowingly build a
> half-functional Open MPI).
> >
> > I hope that our rationale for this design choice at least makes sense.
> >
> > Also, if you are compiling your application with double-width Fortran
> types but are compiling your application with single-width Fortran types
> (this is how the thread started), that's quite dangerous -- your MPI
> doesn't agree with your application on the size of Fortran types, and all
> types of unpredictable hilarity can/will ensure.
>
> Yes, this may lead to trouble. I am glad I found this -fdefault-real-8 in
> our setup!
>
> >
> > That's exactly why you were getting a 0 from MPI_WTIME() in Open MPI --
> Open MPI was returning an 8 byte DOUBLE PRECISION, but your application was
> looking for a 16 byte DOUBLE PRECISION.
> >
> > I can't tell from your replies, but I'm guessing you compiled MPICH with
> double-width Fortran types and single-width C types.  In this case, you
> should get correct values back from MPI_WTIME (because both MPI and your
> application use 16-byte DOUBLE PRECISIONS).  What is an open question is
> how MPICH treats reductions on Fortran types (i.e., whether they are done
> in C or Fortran), and/or whether it matters for your application.
>
> I installed the default homebrew mpich. Most HPC systems use a vendor
> tweaked MPI anyway. How can I always be sure of the floating point widths
> it has been compiled with?
>
> Cheers,
> Jan Hegewald
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] LD_PRELOAD a C-coded shared object with a FORTRAN application

2016-12-12 Thread George Bosilca
Indeed, this is the best solution. If you really want a clean, portable
solution, take a look at any of the files in the ompi/mpi/fortran/mpif-h
directory to see how we define the four different versions of the Fortran
interface.

  George.
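
If you prefer to keep the interposition library in C anyway, here is a
bare-bones sketch of the idea (an illustration only, not how the generated
bindings in mpif-h are written; the four mangled names and the forwarding
to the C binding are assumptions to be checked against your compiler):

#include <mpi.h>

/* common body: do the tool's bookkeeping, then forward to the C binding
 * (passing NULL/NULL for argc/argv is permitted by the MPI standard) */
static void interposed_init(MPI_Fint *ierr)
{
    /* ... tool-specific work goes here ... */
    *ierr = (MPI_Fint)MPI_Init(NULL, NULL);
}

/* cover the usual Fortran name-mangling flavours by hand */
void MPI_INIT(MPI_Fint *ierr)   { interposed_init(ierr); }
void mpi_init(MPI_Fint *ierr)   { interposed_init(ierr); }
void mpi_init_(MPI_Fint *ierr)  { interposed_init(ierr); }
void mpi_init__(MPI_Fint *ierr) { interposed_init(ierr); }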

On Mon, Dec 12, 2016 at 10:42 AM, Clement FOYER 
wrote:

> Thank you all for your answers.
>
> I stayed with the C version, with the FORTRAN symbols added as it worked
> with the tests I was willing to start. Nevertheless, in order to keep a
> more proper/portable solution, is it possible to use the same tools as in
> ompi/mpi/fortran/mpif-h/init_f.c in order to generate the mangled symbols
> (i.e. using #pragma weak or OMPI_GENERATE_F77_BINDINGS ) ?
>
> Thank you.
>
> Clément FOYER
>
>
>
> On 12/12/2016 04:21 PM, Jeff Squyres (jsquyres) wrote:
>
>> If your Fortran compiler is new enough (and it *probably* is...?), you
>> can use the BIND(C) notation to ease C / Fortran interoperability issues.
>>
>>
>> On Dec 12, 2016, at 5:37 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Clement,
>>>
>>> Ideally, your LD_PRELOAD'able library should be written in Fortran so
>>> you do not even run into this kind of issues (name mangling + parameter
>>> types)
>>>
>>> If you really want to write it in C, you have to do it all manually
>>>
>>> SUBROUTINE MPI_INIT(ierror)
>>> INTEGER IERROR
>>>
>>> can become
>>>
>>> void mpi_init_(MPI_Fint * ierror)
>>>
>>> Note mangling is compiler dependent.
>>> For most compilers, this is the function name with all lower cases,
>>> followed by one or two underscores.
>>>
>>> You will also have to convert all parameters
>>> INTEGER comm
>>> will be replaced (modulo the typos) with
>>> MPI_Comm c_comm;
>>> MPI_Fint *comm;
>>> c_comm = MPI_Comm_f2c(*comm);
>>>
>>> And so on, that is why Fortran wrapper is preferred,
>>> plus there might be over caveats with Fortean 2008
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Monday, December 12, 2016, Clement FOYER 
>>> wrote:
>>> Hello everyone,
>>>
>>> I have been trying to redirect MPI_Init and MPI_Finalize calls from a
>>> FORTRAN application (the CG benchmark from NAS Parallel Benchmarks). It
>>> appears that in the fortran application the MPI_Init function signature is
>>> "mpi_init_", whereas in my shared object it is MPI_Init. How is the f-to-c
>>> binding done in Open-MPI? How can I change the Makefile.am (or add a
>>> configure.m4) in order to check the way this name mapping is done by the
>>> compiler, and how to add the proper symbols so that my shared object could
>>> be used also with FORTRAN programs ?
>>>
>>> Thank you in advance,
>>>
>>> Clément FOYER
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Current progress threads status in Open MPI

2016-11-22 Thread George Bosilca
Christoph,

This is work in progress. Right now, only the TCP BTL has an integrated
progress thread, but we are working on a more general solution that will
handle all BTLs (and possibly some of the MTLs). If you want more info, or
want to volunteer for beta-testing, please ping me offline.

Thanks,
  George.



On Thu, Nov 17, 2016 at 3:37 AM, Christoph Niethammer 
wrote:

> Hello,
>
> I was wondering, what is the current status of progress threads in Open
> MPI.
> As far as I know it was on the agenda for 1.10.x to be re-enabled after
> its removal in 1.8.x.
>
> Now we have Open MPI 2.0.x. How to enable/disable it as the old configure
> options are not recognized any more:
>   configure: WARNING: unrecognized options: --enable-progress-threads,
> --enable-multi-threads
> This is weird to me, as I see a lot of [ test "$enable_progress_threads" =
> "yes" ] in the configure scripts.
>
> My nighly 2.0.x mtt build shows ORTE progress enabled by default. But what
> about "OMPI progress"?
>   ompi_info --all --all | grep -i "Thread support"
>   Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI
> progress: no, ORTE progress: yes, Event lib: yes)
>
> Best regards
> Christoph Niethammer
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] MPI_Win_lock semantic

2016-11-21 Thread George Bosilca
Gilles,

I looked at the test and I think the current behavior is indeed correct.
What matters for an exclusive lock is that all operations in an epoch
(everything surrounded by lock/unlock) are atomically applied to the
destination (and are not interleaved with other updates). As Nathan stated,
MPI_Win_lock might be implemented as non-blocking, in which case it is
totally legit for process 2 to acquire the lock first and update the
array before process 0 accesses it. Thus the test will fail.

The test will never deadlock, because even if the MPI_Win_lock is
implemented as a blocking operation (which is also legit), the send and
receive match correctly with the lock/unlock.

Moreover, I think the behavior described by the comments can only be
implemented by enforcing an order between the only conceptually meaningful
operations: unlock/send/recv.

if (me == 0) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
    MPI_Get(a, len, MPI_DOUBLE, 1, 0, len, MPI_DOUBLE, win);
    MPI_Win_unlock(1, win);
    MPI_Send(NULL, 0, MPI_BYTE, 2, 1001, MPI_COMM_WORLD);
}
if (me == 2) {
    /* this should block till 0 releases the lock. */
    MPI_Recv(NULL, 0, MPI_BYTE, 0, 1001, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
    MPI_Put(a, len, MPI_DOUBLE, 1, 0, len, MPI_DOUBLE, win);
    MPI_Win_unlock(1, win);
}

However, if we relax the code a little and want to ensure the atomicity of
the operations, then we need to change the check to make sure that either
no elements of the array have been altered or all of them have been altered
(set to zero by process 2).

  George.
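
A rough sketch of such a relaxed check (assuming, as in the test, that rank
1's buffer initially holds the 10*1+i pattern and that rank 2 overwrites it
with zeros):

#include <stddef.h>

/* returns non-zero if the epoch was applied atomically: either every
 * element still holds rank 1's original pattern, or every element has
 * already been overwritten by rank 2 -- but never a mix of the two */
static int epoch_was_atomic(const double *a, size_t len)
{
    size_t i, untouched = 0, overwritten = 0;

    for (i = 0; i < len; i++) {
        if (a[i] == (double)(10 * 1 + i)) untouched++;
        else if (a[i] == 0.0)             overwritten++;
    }
    return (untouched == len) || (overwritten == len);
}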



On Mon, Nov 21, 2016 at 8:57 PM, Nathan Hjelm  wrote:

> To be safe I would call MPI_Get then MPI_Win_flush. That lock will always
> be acquired before the MPI_Win_flush call returns. As long as it is more
> than 0 bytes. We always short-circuit 0-byte operations in both osc/rdma
> and osc/pt2pt.
>
> -Nathan
>
> > On Nov 21, 2016, at 8:54 PM, Gilles Gouaillardet 
> wrote:
> >
> > Thanks Nathan,
> >
> >
> > any thoughts about my modified version of the test ?
> >
> > do i need to MPI_Win_flush() after the first MPI_Get() in order to
> ensure the lock was acquired ?
> >
> > (and hence the program will either success or hang, but never fail)
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> >
> > On 11/22/2016 12:29 PM, Nathan Hjelm wrote:
> >> MPI_Win_lock does not have to be blocking. In osc/rdma it is blocking
> in most cases but not others (lock all with on-demand is non-blocking) but
> in osc/pt2pt is is almost always non-blocking (it has to be blocking for
> proc self). If you really want to ensure the lock is acquired you can call
> MPI_Win_flush. I think this should work even if you have not started any
> RMA operations inside the epoch.
> >>
> >> -Nathan
> >>
> >>> On Nov 21, 2016, at 7:53 PM, Gilles Gouaillardet 
> wrote:
> >>>
> >>> Nathan,
> >>>
> >>>
> >>> we briefly discussed the test_lock1 test from the onesided test suite
> using osc/pt2pt
> >>>
> >>> https://github.com/open-mpi/ompi-tests/blob/master/
> onesided/test_lock1.c#L57-L70
> >>>
> >>>
> >>> task 0 does
> >>>
> >>> MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank=1,...);
> >>>
> >>> MPI_Send(...,dest=2,...)
> >>>
> >>>
> >>> and task 2 does
> >>>
> >>> MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank=1,...);
> >>>
> >>> MPI_Recv(...,source=0,...)
> >>>
> >>>
> >>> hoping to guarantee task 0 will acquire the lock first.
> >>>
> >>>
> >>> once in a while, the test fails when task 2 acquires the lock first
> >>>
> >>> /* MPI_Win_lock() only sends a lock request, and return without owning
> the lock */
> >>>
> >>> so if task 1 is running on a loaded server, and even if task 2
> requests the lock *after* task 0,
> >>>
> >>> lock request from task 2 can be processed first, and hence task 2 is
> not guaranteed to acquire the lock *before* task 0.
> >>>
> >>>
> >>> can you please confirm MPI_Win_lock() behaves as it is supposed to ?
> >>>
> >>> if yes, is there a way for task 0 to block until it acquires the lock ?
> >>>
> >>>
> >>> i modified the test, and inserted in task 0 a MPI_Get of 1 MPI_Double
> *before* MPI_Send.
> >>>
> >>> see my patch below (note i increased the message length)
> >>>
> >>>
> >>> my expectation is that the test would either success (e.g. task 0 gets
> the lock first) or hang
> >>>
> >>> (if task 1 gets the lock first)
> >>>
> >>>
> >>>
> >>> surprisingly, the test never hangs (so far ...) but once in a while,
> it fails (!), which makes me very confused
> >>>
> >>>
> >>> Any thoughts ?
> >>>
> >>>
> >>> Cheers,
> >>>
> >>>
> >>> Gilles
> >>>
> >>>
> >>>
> >>> diff --git a/onesided/test_lock1.c b/onesided/test_lock1.c
> >>> index c549093..9fa3f8d 100644
> >>> --- a/onesided/test_lock1.c
> >>> +++ b/onesided/test_lock1.c
> >>> @@ -20,7 +20,7 @@ int
> >>> test_lock1(void)
> >>> {
> >>> double *a = NULL;
> >>> -size_t len = 10;
> >>> +size_t len = 100;
> >>> MPI_Win win;
> >>> int     i;
> >>>
> >>> @@ -56,6

Re: [OMPI devel] MPI_Win_lock semantic

2016-11-21 Thread George Bosilca
Why is MPI_Win_flush required to ensure the lock is acquired ? According to
the standard MPI_Win_flush "completes all outstanding RMA operations
initiated by the calling process to the target rank on the specified
window", which can be read as being a noop if no pending operations exists.

  George.



On Mon, Nov 21, 2016 at 8:29 PM, Nathan Hjelm  wrote:

> MPI_Win_lock does not have to be blocking. In osc/rdma it is blocking in
> most cases but not others (lock all with on-demand is non-blocking) but in
> osc/pt2pt is is almost always non-blocking (it has to be blocking for proc
> self). If you really want to ensure the lock is acquired you can call
> MPI_Win_flush. I think this should work even if you have not started any
> RMA operations inside the epoch.
>
> -Nathan
>
> > On Nov 21, 2016, at 7:53 PM, Gilles Gouaillardet 
> wrote:
> >
> > Nathan,
> >
> >
> > we briefly discussed the test_lock1 test from the onesided test suite
> using osc/pt2pt
> >
> > https://github.com/open-mpi/ompi-tests/blob/master/
> onesided/test_lock1.c#L57-L70
> >
> >
> > task 0 does
> >
> > MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank=1,...);
> >
> > MPI_Send(...,dest=2,...)
> >
> >
> > and task 2 does
> >
> > MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank=1,...);
> >
> > MPI_Recv(...,source=0,...)
> >
> >
> > hoping to guarantee task 0 will acquire the lock first.
> >
> >
> > once in a while, the test fails when task 2 acquires the lock first
> >
> > /* MPI_Win_lock() only sends a lock request, and return without owning
> the lock */
> >
> > so if task 1 is running on a loaded server, and even if task 2 requests
> the lock *after* task 0,
> >
> > lock request from task 2 can be processed first, and hence task 2 is not
> guaranteed to acquire the lock *before* task 0.
> >
> >
> > can you please confirm MPI_Win_lock() behaves as it is supposed to ?
> >
> > if yes, is there a way for task 0 to block until it acquires the lock ?
> >
> >
> > i modified the test, and inserted in task 0 a MPI_Get of 1 MPI_Double
> *before* MPI_Send.
> >
> > see my patch below (note i increased the message length)
> >
> >
> > my expectation is that the test would either success (e.g. task 0 gets
> the lock first) or hang
> >
> > (if task 1 gets the lock first)
> >
> >
> >
> > surprisingly, the test never hangs (so far ...) but once in a while, it
> fails (!), which makes me very confused
> >
> >
> > Any thoughts ?
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> >
> >
> > diff --git a/onesided/test_lock1.c b/onesided/test_lock1.c
> > index c549093..9fa3f8d 100644
> > --- a/onesided/test_lock1.c
> > +++ b/onesided/test_lock1.c
> > @@ -20,7 +20,7 @@ int
> > test_lock1(void)
> > {
> > double *a = NULL;
> > -size_t len = 10;
> > +size_t len = 100;
> > MPI_Win win;
> > int     i;
> >
> > @@ -56,6 +56,7 @@ test_lock1(void)
> >  */
> > if (me == 0) {
> >MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
> > +   MPI_Get(a,1,MPI_DOUBLE,1,0,1,MPI_DOUBLE,win);
> > MPI_Send(NULL, 0, MPI_BYTE, 2, 1001, MPI_COMM_WORLD);
> >MPI_Get(a,len,MPI_DOUBLE,1,0,len,MPI_DOUBLE,win);
> > MPI_Win_unlock(1, win);
> > @@ -76,6 +77,7 @@ test_lock1(void)
> > /* make sure 0 got the data from 1 */
> >for (i = 0; i < len; i++) {
> >if (a[i] != (double)(10*1+i)) {
> > +if (0 == nfail) fprintf(stderr, "at index %d, expected
> %lf but got %lf\n", i, (double)10*1+i, a[i]);
> >nfail++;
> >}
> >}
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Removing from opal_hashtable while iterating over the elements

2016-11-18 Thread George Bosilca
Absolutely, if you keep the pointer to the previous or next element, it is
safe to remove an element. If you are in the process of completely emptying
the hashtable you can just keep removing the head element.

George
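
A rough sketch of the "keep removing the head" pattern, assuming the
uint32-keyed accessors from opal/class/opal_hash_table.h (the exact names
and signatures should be double-checked there):

#include "opal/constants.h"
#include "opal/class/opal_hash_table.h"

static void drain_table(opal_hash_table_t *table)
{
    uint32_t key;
    void *value, *node;

    /* as long as a head element exists, fetch its key and remove it */
    while (OPAL_SUCCESS ==
           opal_hash_table_get_first_key_uint32(table, &key, &value, &node)) {
        /* free 'value' here if the table owns it */
        opal_hash_table_remove_value_uint32(table, key);
    }
}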

On Nov 18, 2016 6:51 AM, "Clement FOYER"  wrote:

> Hi everyone,
>
> I was wondering if it was possible to remove an element while iterating
> over the elements of a hashtable. As saw that it wasn't while using the
> OPAL_HASHTABLE_FOREACH macro, and I suppose it's because of the possible
> loss of the current next element. But how about if the element have already
> been iterated over? If I save a pointer to the previous element, it is safe
> to remove it from the hashtable?
>
> Thank you,
>
> Clément FOYER
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread George Bosilca
Yes, my understanding is that occasional contributors will not have to
sign the contributor agreement, but instead will have to provide a signed
patch.

  George.


On Wed, Oct 12, 2016 at 9:29 AM, Pavel Shamis 
wrote:

> Does it mean that contributors don't have to sign contributor agreement ?
>
> On Tue, Oct 11, 2016 at 2:35 PM, Geoffrey Paulsen 
> wrote:
>
>> We have been discussing new Bylaws for the Open MPI Community.  The
>> primary motivator is to allow non-members to commit code.  Details in the
>> proposal (link below).
>>
>> Old Bylaws / Procedures:  https://github.com/open-mpi/om
>> pi/wiki/Admistrative-rules
>>
>> New Bylaws proposal: https://github.com/open-mpi/om
>> pi/wiki/Proposed-New-Bylaws
>>
>> Open MPI members will be voting on October 25th.  Please voice any
>> comments or concerns.
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] use of OBJ_NEW and related calls

2016-10-10 Thread George Bosilca
These macros are defined in opal/class/opal_object.h. We are using them all
over the OMPI code base, including OPAL, ORTE, OSHMEM and OMPI. These calls
are indeed somewhat similar to those of an object-oriented language; the
intent was to have a thread-safe way to refcount objects and keep them
around for as long as they are needed.

  George.
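
A small illustrative sketch of the typical usage pattern (the class name
and fields below are made up for the example):

#include "opal/class/opal_object.h"

/* a refcounted "class": the first member must be the opal_object_t super */
struct my_item_t {
    opal_object_t super;
    int payload;
};
typedef struct my_item_t my_item_t;

static void my_item_construct(my_item_t *item) { item->payload = 0; }
static void my_item_destruct(my_item_t *item)  { (void)item; /* free resources */ }

OBJ_CLASS_INSTANCE(my_item_t, opal_object_t,
                   my_item_construct, my_item_destruct);

static void example(void)
{
    my_item_t *item = OBJ_NEW(my_item_t);  /* allocate + construct, refcount = 1 */
    OBJ_RETAIN(item);                      /* another owner takes a reference   */
    OBJ_RELEASE(item);                     /* drop one reference                */
    OBJ_RELEASE(item);                     /* refcount hits 0: destruct + free  */

    my_item_t on_stack;
    OBJ_CONSTRUCT(&on_stack, my_item_t);   /* construct in place (no malloc)    */
    OBJ_DESTRUCT(&on_stack);               /* run the destructor, no free       */
}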


On Mon, Oct 10, 2016 at 4:18 PM, Emani, Murali  wrote:

> Hi,
>
> Could someone help me in understanding where the functions OBJ_NEW/
> OBJ_CONSTRUCT/ OBJ_DESTRUCT are defined in the source code. Are these
> specific to OpenMPI code base?
> Is the assumption correct that these calls are wrappers to create new
> objects, initialize and destroy, similar to any object oriented language.
>
> —
> Murali
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-23 Thread George Bosilca
RFC applied via 93fa94f9.


On Fri, Sep 23, 2016 at 7:13 AM, George Bosilca  wrote:

> It turns out the OMPI behavior today was divergent from what is written in
> the README. We already explicitly state that
>
>   - If specified, the "btl_tcp_if_exclude" parameter must include the
> loopback device ("lo" on many Linux platforms), or Open MPI will
> not be able to route MPI messages using the TCP BTL.  For example:
> "mpirun --mca btl_tcp_if_exclude lo,eth1 ..."
>
> So, with this patch we are now README compliant !
>
>   George.
>
>
>
> On Fri, Sep 23, 2016 at 7:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> George,
>>
>> OK then,
>> I recommend we explicitly state in the README that loopback interface can
>> no more be omitted from btl_tcp_if_exclude when running on multiple nodes
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Thursday, September 22, 2016, George Bosilca 
>> wrote:
>>
>>> Thanks for clarifying, I now understand what your objection/suggestion
>>> was. We all misconfigured OMPI at least once, but that allowed us to learn
>>> how to do it right.
>>>
>>> Instead of adding extra protections for corner-cases, maybe we should
>>> fix our exclusivity flag so that the scenario you describe would not happen.
>>>
>>>   George.
>>>
>>> PS: "btl_tcp_if_exclude = ^ib0" qualifies as a honest mistake. I
>>> wouldn't dare proposing a new MCA param to prevent this ...
>>>
>>>
>>> On Wed, Sep 21, 2016 at 10:54 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> ok, i was not clear
>>>>
>>>> by "let's consider the case where "lo" is *not* excluded via the
>>>> btl_tcp_if_exclude MCA param" i really meant
>>>> "let's consider the case where the value of the btl_tcp_if_exclude MCA
>>>> param has been forced to a list of network/interfaces that do not
>>>> contain any reference (e.g. name nor subnet) to the loopback
>>>> interface"
>>>> /* in a previous example, i did mpirun --mca btl_tcp_if_exclude ^ib0 */
>>>>
>>>> my concern is that openmpi-mca-params.conf contains
>>>> btl_tcp_if_exclude = ^ib0
>>>>
>>>> then hiccups will start when Open MPI is updated, and i expect some
>>>> complains.
>>>> of course we can reply, doc should have been read and advices
>>>> followed, so one cannot complain just because he has been lucky so
>>>> far.
>>>> or we can do things a bit differently so we do not run into this case
>>>>
>>>> /* if btl/self is excluded, the app will not start and it is trivial
>>>> to append to the error message a note asking to ensure btl/self was
>>>> not excluded.
>>>> in this case, i do not think we have a mechanism to issue a warning
>>>> message (e.g. "ensure lo is excluded") when hiccups occur. */
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Thu, Sep 22, 2016 at 9:54 AM, George Bosilca 
>>>> wrote:
>>>> > On Wednesday, September 21, 2016, Gilles Gouaillardet
>>>> >  wrote:
>>>> >>
>>>> >> George,
>>>> >>
>>>> >> let's consider the case where "lo" is *not* excluded via the
>>>> >> btl_tcp_if_exclude MCA param
>>>> >> (if i understand correctly, the following is also true if "lo" is
>>>> >> included via the btl_tcp_if_include MCA param)
>>>> >>
>>>> >> currently, and because of/thanks to the test that is done "deep
>>>> inside"
>>>> >> 1) on a disconnected laptop, mpirun --mca btl tcp,self ... fails with
>>>> >> 2 tasks or more because tasks cannot reach each other
>>>> >> 2) on a (connected) cluster, "lo" is never used and mpirun --mca btl
>>>> >> tcp,self ... does not hang when tasks are running on two nodes or
>>>> more
>>>> >>
>>>> >> with your proposal :
>>>> >> 3) on a disconnected laptop, mpirun --mca btl tcp,self ... works with
>>>> >> any number of taks, because "lo" is used by btl/tcp
>>>> >> 4) on a (connected) cluster, "lo" is used and

Re: [OMPI devel] OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-23 Thread George Bosilca
It turns out the OMPI behavior today was divergent from what is written in
the README. We already explicitly state that

  - If specified, the "btl_tcp_if_exclude" parameter must include the
loopback device ("lo" on many Linux platforms), or Open MPI will
not be able to route MPI messages using the TCP BTL.  For example:
"mpirun --mca btl_tcp_if_exclude lo,eth1 ..."

So, with this patch we are now README compliant !

  George.



On Fri, Sep 23, 2016 at 7:03 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> OK then,
> I recommend we explicitly state in the README that loopback interface can
> no more be omitted from btl_tcp_if_exclude when running on multiple nodes
>
> Cheers,
>
> Gilles
>
>
> On Thursday, September 22, 2016, George Bosilca 
> wrote:
>
>> Thanks for clarifying, I now understand what your objection/suggestion
>> was. We all misconfigured OMPI at least once, but that allowed us to learn
>> how to do it right.
>>
>> Instead of adding extra protections for corner-cases, maybe we should fix
>> our exclusivity flag so that the scenario you describe would not happen.
>>
>>   George.
>>
>> PS: "btl_tcp_if_exclude = ^ib0" qualifies as a honest mistake. I
>> wouldn't dare proposing a new MCA param to prevent this ...
>>
>>
>> On Wed, Sep 21, 2016 at 10:54 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> ok, i was not clear
>>>
>>> by "let's consider the case where "lo" is *not* excluded via the
>>> btl_tcp_if_exclude MCA param" i really meant
>>> "let's consider the case where the value of the btl_tcp_if_exclude MCA
>>> param has been forced to a list of network/interfaces that do not
>>> contain any reference (e.g. name nor subnet) to the loopback
>>> interface"
>>> /* in a previous example, i did mpirun --mca btl_tcp_if_exclude ^ib0 */
>>>
>>> my concern is that openmpi-mca-params.conf contains
>>> btl_tcp_if_exclude = ^ib0
>>>
>>> then hiccups will start when Open MPI is updated, and i expect some
>>> complains.
>>> of course we can reply, doc should have been read and advices
>>> followed, so one cannot complain just because he has been lucky so
>>> far.
>>> or we can do things a bit differently so we do not run into this case
>>>
>>> /* if btl/self is excluded, the app will not start and it is trivial
>>> to append to the error message a note asking to ensure btl/self was
>>> not excluded.
>>> in this case, i do not think we have a mechanism to issue a warning
>>> message (e.g. "ensure lo is excluded") when hiccups occur. */
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Sep 22, 2016 at 9:54 AM, George Bosilca 
>>> wrote:
>>> > On Wednesday, September 21, 2016, Gilles Gouaillardet
>>> >  wrote:
>>> >>
>>> >> George,
>>> >>
>>> >> let's consider the case where "lo" is *not* excluded via the
>>> >> btl_tcp_if_exclude MCA param
>>> >> (if i understand correctly, the following is also true if "lo" is
>>> >> included via the btl_tcp_if_include MCA param)
>>> >>
>>> >> currently, and because of/thanks to the test that is done "deep
>>> inside"
>>> >> 1) on a disconnected laptop, mpirun --mca btl tcp,self ... fails with
>>> >> 2 tasks or more because tasks cannot reach each other
>>> >> 2) on a (connected) cluster, "lo" is never used and mpirun --mca btl
>>> >> tcp,self ... does not hang when tasks are running on two nodes or more
>>> >>
>>> >> with your proposal :
>>> >> 3) on a disconnected laptop, mpirun --mca btl tcp,self ... works with
>>> >> any number of taks, because "lo" is used by btl/tcp
>>> >> 4) on a (connected) cluster, "lo" is used and mpirun --mca btl
>>> >> tcp,self ... will very likely hang when tasks are running on two nodes
>>> >> or more
>>> >>
>>> >> am i right so far ?
>>> >
>>> >
>>> > No, you are missing the fact that thanks to our if_exclude (which
>>> contains
>>> > by default 127.0.0.0/24) we will never use lo (not even with my
>>> patch).
>>> > Thus, local interfaces will remain out

Re: [OMPI devel] OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-22 Thread George Bosilca
Thanks for clarifying; I now understand what your objection/suggestion was.
We all misconfigured OMPI at least once, but that allowed us to learn how
to do it right.

Instead of adding extra protections for corner-cases, maybe we should fix
our exclusivity flag so that the scenario you describe would not happen.

  George.

PS: "btl_tcp_if_exclude = ^ib0" qualifies as an honest mistake. I wouldn't
dare propose a new MCA param to prevent this ...


On Wed, Sep 21, 2016 at 10:54 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> ok, i was not clear
>
> by "let's consider the case where "lo" is *not* excluded via the
> btl_tcp_if_exclude MCA param" i really meant
> "let's consider the case where the value of the btl_tcp_if_exclude MCA
> param has been forced to a list of network/interfaces that do not
> contain any reference (e.g. name nor subnet) to the loopback
> interface"
> /* in a previous example, i did mpirun --mca btl_tcp_if_exclude ^ib0 */
>
> my concern is that openmpi-mca-params.conf contains
> btl_tcp_if_exclude = ^ib0
>
> then hiccups will start when Open MPI is updated, and i expect some
> complains.
> of course we can reply, doc should have been read and advices
> followed, so one cannot complain just because he has been lucky so
> far.
> or we can do things a bit differently so we do not run into this case
>
> /* if btl/self is excluded, the app will not start and it is trivial
> to append to the error message a note asking to ensure btl/self was
> not excluded.
> in this case, i do not think we have a mechanism to issue a warning
> message (e.g. "ensure lo is excluded") when hiccups occur. */
>
> Cheers,
>
> Gilles
>
> On Thu, Sep 22, 2016 at 9:54 AM, George Bosilca 
> wrote:
> > On Wednesday, September 21, 2016, Gilles Gouaillardet
> >  wrote:
> >>
> >> George,
> >>
> >> let's consider the case where "lo" is *not* excluded via the
> >> btl_tcp_if_exclude MCA param
> >> (if i understand correctly, the following is also true if "lo" is
> >> included via the btl_tcp_if_include MCA param)
> >>
> >> currently, and because of/thanks to the test that is done "deep inside"
> >> 1) on a disconnected laptop, mpirun --mca btl tcp,self ... fails with
> >> 2 tasks or more because tasks cannot reach each other
> >> 2) on a (connected) cluster, "lo" is never used and mpirun --mca btl
> >> tcp,self ... does not hang when tasks are running on two nodes or more
> >>
> >> with your proposal :
> >> 3) on a disconnected laptop, mpirun --mca btl tcp,self ... works with
> >> any number of taks, because "lo" is used by btl/tcp
> >> 4) on a (connected) cluster, "lo" is used and mpirun --mca btl
> >> tcp,self ... will very likely hang when tasks are running on two nodes
> >> or more
> >>
> >> am i right so far ?
> >
> >
> > No, you are missing the fact that thanks to our if_exclude (which
> contains
> > by default 127.0.0.0/24) we will never use lo (not even with my patch).
> > Thus, local interfaces will remain out of reach for most users, with the
> > exception of those that manually force the inclusion of lo via
> if_include.
> >
> > On a cluster where a user explicitly enable lo, there will be some
> hiccups
> > during startup. However, as Paul states we explicitly discourage people
> of
> > doing that in the README. Second, the connection over lo will eventually
> > timeout, and lo it will be dropped and all pending communications will be
> > redirected through another TCP interface.
> >
> > Cheers,
> > George.
> >
> >
> >>
> >> my concern is 4)
> >> as Paul pointed out, we can consider this is not an issue since this
> >> is a user/admin mistake, and we do not care whether this is an honest
> >> one or not. that being said, this is not very friendly since something
> >> that is working fine today will (likely) start hanging when your patch
> >> is merged.
> >>
> >> my suggestion differs since it is basically 2) and 3), which can be
> >> seen as the best of both worlds
> >>
> >> makes sense ?
> >>
> >> as a side note, there were some discussions about automatically adding
> >> the self btl,
> >> and even offering a user friendly alternative to --mca btl xxx
> >> (for example --networks shm,infiniband. today Open MPI does not
> >> provide any alternative to btl/self. also infiniband can be 

Re: [OMPI devel] OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
On Wednesday, September 21, 2016, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> let's consider the case where "lo" is *not* excluded via the
> btl_tcp_if_exclude MCA param
> (if i understand correctly, the following is also true if "lo" is
> included via the btl_tcp_if_include MCA param)
>
> currently, and because of/thanks to the test that is done "deep inside"
> 1) on a disconnected laptop, mpirun --mca btl tcp,self ... fails with
> 2 tasks or more because tasks cannot reach each other
> 2) on a (connected) cluster, "lo" is never used and mpirun --mca btl
> tcp,self ... does not hang when tasks are running on two nodes or more
>
> with your proposal :
> 3) on a disconnected laptop, mpirun --mca btl tcp,self ... works with
> any number of taks, because "lo" is used by btl/tcp
> 4) on a (connected) cluster, "lo" is used and mpirun --mca btl
> tcp,self ... will very likely hang when tasks are running on two nodes
> or more
>
> am i right so far ?


No, you are missing the fact that thanks to our if_exclude (which contains
by default 127.0.0.0/24) we will never use lo (not even with my patch).
Thus, local interfaces will remain out of reach for most users, with the
exception of those that manually force the inclusion of lo via if_include.

On a cluster where a user explicitly enables lo, there will be some hiccups
during startup. However, as Paul states, we explicitly discourage people from
doing that in the README. Second, the connection over lo will eventually
time out, lo will be dropped, and all pending communications will be
redirected through another TCP interface.

Cheers,
George.


> my concern is 4)
> as Paul pointed out, we can consider this is not an issue since this
> is a user/admin mistake, and we do not care whether this is an honest
> one or not. that being said, this is not very friendly since something
> that is working fine today will (likely) start hanging when your patch
> is merged.
>
> my suggestion differs since it is basically 2) and 3), which can be
> seen as the best of both worlds
>
> makes sense ?
>
> as a side note, there were some discussions about automatically adding
> the self btl,
> and even offering a user friendly alternative to --mca btl xxx
> (for example --networks shm,infiniband. today Open MPI does not
> provide any alternative to btl/self. also infiniband can be used via
> btl/openib, mtl/mxm or libfabric, which makes it painful to
> blacklist). i cannot remember the outcome of the discussion (if any).
>
> Cheers,
>
> Gilles
>
> On Thu, Sep 22, 2016 at 4:57 AM, George Bosilca  > wrote:
> > Gilles,
> >
> > I don't understand how your proposal is any different than what we have
> > today. I quote "If [locality flag is set], then we could keep a hard
> coded
> > test so 127.x.y.z address (and IPv6 equivalent) are never used (even if
> > included or not excluded) for inter node communication". We already have
> a
> > hardcoded test to prevent 127.x.y.z addresses from being used. In fact we
> > have 2 tests, one because this address range is part of our default
> > if_exclude, and then a second test (that only does something useful in
> case
> > you manually added lo* to if_include) deep inside the IP matching logic.
> >
> >   George.
> >
> >
> > On Wed, Sep 21, 2016 at 12:36 PM, Gilles Gouaillardet
> > > wrote:
> >>
> >> George,
> >>
> >> i got that, and i consider my suggestion as an improvement to your
> >> proposal.
> >>
> >> if i want to exclude ib0, i might want to
> >> mpirun --mca btl_tcp_if_exclude ib0 ...
> >>
> >> to me, this is an honest mistake, but with your proposal, i would be
> >> screwed when
> >> running on more than one node because i should have
> >> mpirun --mca btl_tcp_if_exclude ib0,lo ...
> >>
> >> and if this parameter is set by the admin in the system-wide config,
> >> then this configuration must be adapted by the admin, and that could
> >> generate some confusion.
> >>
> >> my suggestion simply adds a "safety net" to your proposal
> >>
> >> for the sake of completion, i do not really care whether there should
> >> be a safety net or not if localhost is explicitly included via the the
> >> btl_tcp_if_include MCA parameter
> >>
> >> a different and safe/friendly proposal is to add a new
> >> btl_tcp_if_exclude_localhost MCA param, which is true by default, so
> >> you would simply force it to false if you want to MPI_Comm_spa

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
On Wed, Sep 21, 2016 at 11:23 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> > I would have agreed with you if the current code was doing a better
> decision of what is local and what not. But it is not, it simply remove all
> 127.x.x.x interfaces (opal/util/net.c:222). Thus, the only thing the
> current code does, is preventing a power-user from using the loopback
> (despite being explicitly enabled via the corresponding MCA parameters).
>
> Fair enough.
>
> Should we have a keyword that can be used in the
> btl_tcp_if_include/exclude (e.g., "local") that removes all local-only
> interfaces?  I.E., all 127.x.x.x/8 interfaces *and* all local-only
> interfaces (e.g., bridging interfaces to local VMs and the like)?
>
> We could then replace the default "127.0.0.0/8" value in
> btl_tcp_if_exclude with this token, and therefore actually exclude the
> VM-only interfaces (which have caused some users problems in the past).


I thought about having a more global naming scheme when writing the RFC,
but then I decided I was only interested in minimizing the scope and impact
of the patch (allowing developers to debug non-vader/sm processes on a
non-internet connected machine).

  George.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
Gilles,

I don't understand how your proposal is any different than what we have
today. I quote "If [locality flag is set], then we could keep a hard coded
test so 127.x.y.z address (and IPv6 equivalent) are never used (even if
included or not excluded) for inter node communication". We already have a
hardcoded test to prevent 127.x.y.z addresses from being used. In fact we
have 2 tests, one because this address range is part of our default
if_exclude, and then a second test (that only does something useful in case
you manually added lo* to if_include) deep inside the IP matching logic.

  George.


On Wed, Sep 21, 2016 at 12:36 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> i got that, and i consider my suggestion as an improvement to your
> proposal.
>
> if i want to exclude ib0, i might want to
> mpirun --mca btl_tcp_if_exclude ib0 ...
>
> to me, this is an honest mistake, but with your proposal, i would be
> screwed when
> running on more than one node because i should have
> mpirun --mca btl_tcp_if_exclude ib0,lo ...
>
> and if this parameter is set by the admin in the system-wide config,
> then this configuration must be adapted by the admin, and that could
> generate some confusion.
>
> my suggestion simply adds a "safety net" to your proposal
>
> for the sake of completion, i do not really care whether there should
> be a safety net or not if localhost is explicitly included via the the
> btl_tcp_if_include MCA parameter
>
> a different and safe/friendly proposal is to add a new
> btl_tcp_if_exclude_localhost MCA param, which is true by default, so
> you would simply force it to false if you want to MPI_Comm_spawn or
> use the tcp btl on your disconnected laptop.
>
> as a side note, this reminds me that the openib/btl is used by default
> for intra node communication between two tasks from different jobs (sm
> nor vader cannot be used yet, and btl/openib has a higher exclusivity
> than btl/tcp). my first impression is that i am not so comfortable
> with that, and we could add yet an other MCA parameter so btl/openib
> disqualifies itself for intra node communications.
>
>
> Cheers,
>
> Gilles
>
> On Thu, Sep 22, 2016 at 12:56 AM, George Bosilca 
> wrote:
> > My proposal is not about adding new ways of deciding what is local and
> what
> > not. I proposed to use the corresponding MCA parameters to allow the
> user to
> > decide. More specifically, I want to be able to change the exclude and
> > include MCA to enable TCP over local addresses.
> >
> > George
> >
> >
> > On Sep 21, 2016 4:32 PM, "Gilles Gouaillardet"
> >  wrote:
> >>
> >> George,
> >>
> >> Is proc locality already set at that time ?
> >>
> >> If yes, then we could keep a hard coded test so 127.x.y.z address (and
> >> IPv6 equivalent) are never used (even if included or not excluded) for
> inter
> >> node communication
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> "Jeff Squyres (jsquyres)"  wrote:
> >> >On Sep 21, 2016, at 10:56 AM, George Bosilca 
> wrote:
> >> >>
> >> >> No, because 127.x.x.x is by default part of the exclude, so it will
> >> >> never get into the modex. The problem today, is that even if you
> manually
> >> >> remove it from the exclude and add it to the include, it will not
> work,
> >> >> because of the hardcoded checks. Once we remove those checks, things
> will
> >> >> work the way we expect, interfaces are removed because they don't
> match the
> >> >> provided addresses.
> >> >
> >> >Gotcha.
> >> >
> >> >> I would have agreed with you if the current code was doing a better
> >> >> decision of what is local and what not. But it is not, it simply
> remove all
> >> >> 127.x.x.x interfaces (opal/util/net.c:222). Thus, the only thing the
> current
> >> >> code does, is preventing a power-user from using the loopback
> (despite being
> >> >> explicitly enabled via the corresponding MCA parameters).
> >> >
> >> >Fair enough.
> >> >
> >> >Should we have a keyword that can be used in the
> >> > btl_tcp_if_include/exclude (e.g., "local") that removes all local-only
> >> > interfaces?  I.E., all 127.x.x.x/8 interfaces *and* all local-only
> >> > interfaces (e.g., bridging interfaces to local VMs and the like)?
> >> >
> >

Re: [OMPI devel] OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
My proposal is not about adding new ways of deciding what is local and what
is not. I proposed to use the corresponding MCA parameters to allow the user
to decide. More specifically, I want to be able to change the exclude and
include MCA parameters to enable TCP over local addresses.

George

On Sep 21, 2016 4:32 PM, "Gilles Gouaillardet" <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> Is proc locality already set at that time ?
>
> If yes, then we could keep a hard coded test so 127.x.y.z address (and
> IPv6 equivalent) are never used (even if included or not excluded) for
> inter node communication
>
> Cheers,
>
> Gilles
>
> "Jeff Squyres (jsquyres)"  wrote:
> >On Sep 21, 2016, at 10:56 AM, George Bosilca  wrote:
> >>
> >> No, because 127.x.x.x is by default part of the exclude, so it will
> never get into the modex. The problem today, is that even if you manually
> remove it from the exclude and add it to the include, it will not work,
> because of the hardcoded checks. Once we remove those checks, things will
> work the way we expect, interfaces are removed because they don't match the
> provided addresses.
> >
> >Gotcha.
> >
> >> I would have agreed with you if the current code was doing a better
> decision of what is local and what not. But it is not, it simply remove all
> 127.x.x.x interfaces (opal/util/net.c:222). Thus, the only thing the
> current code does, is preventing a power-user from using the loopback
> (despite being explicitly enabled via the corresponding MCA parameters).
> >
> >Fair enough.
> >
> >Should we have a keyword that can be used in the
> btl_tcp_if_include/exclude (e.g., "local") that removes all local-only
> interfaces?  I.E., all 127.x.x.x/8 interfaces *and* all local-only
> interfaces (e.g., bridging interfaces to local VMs and the like)?
> >
> >We could then replace the default "127.0.0.0/8" value in
> btl_tcp_if_exclude with this token, and therefore actually exclude the
> VM-only interfaces (which have caused some users problems in the past).
> >
> >--
> >Jeff Squyres
> >jsquy...@cisco.com
> >For corporate legal information go to: http://www.cisco.com/web/
> about/doing_business/legal/cri/
> >
> >___
> >devel mailing list
> >devel@lists.open-mpi.org
> >https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
On Wed, Sep 21, 2016 at 10:41 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> What will happen when you run this in a TCP-based networked environment?
>
> I.e., won't the TCP BTL then publish the 127.x.x.x address in the modex,
> and then other peers will think "oh, that's on the same subnet as me, so
> therefore I should be able to communicate with that endpoint over my
> 127.x.x.x address", right?
>

No, because 127.x.x.x is by default part of the exclude, so it will never
get into the modex. The problem today is that even if you manually remove
it from the exclude and add it to the include, it will not work, because of
the hardcoded checks. Once we remove those checks, things will work the way
we expect: interfaces are removed because they don't match the provided
addresses.


> I agree that it's a bit weird that we are excluding the loopback
> interfaces inside the logic, but the problem is that we really only want
> loopback IP addresses to communicate with peers on the same server. That
> kind of restriction is not well exposed through the if_include/exclude MCA
> vars.
>
> We basically have the same problem with IP bridge interfaces (e.g., to VMs
> on the same server).  Meaning: yes, if you just compare IP address+subnet,
> two peers may be ruled to be "reachable".  But in reality, they may *not*
> be reachable (especially when you start talking about the private subnets
> of 127.x.x.x/8, 10.x.x.x/8, 192.168.x.x/24, ...etc.).
>

I would have agreed with you if the current code was making a better
decision about what is local and what is not. But it is not: it simply
removes all 127.x.x.x interfaces (opal/util/net.c:222). Thus, the only
thing the current code does is prevent a power-user from using the loopback
(despite it being explicitly enabled via the corresponding MCA parameters).

  George.



>
>
>
> > On Sep 21, 2016, at 7:59 AM, George Bosilca  wrote:
> >
> > The current code in the TCP BTL prevents local execution on a laptop not
> exposing a public IP address, by unconditionally disqualifying all
> interfaces with local addresses. This is not done based on MCA parameters
> but instead is done deep inside the IP matching logic, independent of what
> the user specified in the corresponding MCA parameters (if_include and/or
> if_exclude).
> >
> > Instead, I propose we exclude the local interface only via the exclude
> MCA (both IPv4 and IPv6 local addresses are already in the default
> if_exclude), and remove all the code that prevents local addresses. I
> propose the following patch (local addresses are accepted via the second if
> because opal_net_samenetwork returns true).
> >
> > If no complaints by Friday morning, I will push the code.
> >
> >   Thanks,
> > George.
> >
> >
> >
> > diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c b/opal/mca/btl/tcp/btl_tcp_
> proc.c
> > index a727a43..f7decc4 100644
> > --- a/opal/mca/btl/tcp/btl_tcp_proc.c
> > +++ b/opal/mca/btl/tcp/btl_tcp_proc.c
> > @@ -541,9 +541,9 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t*
> btl_proc,
> >  }
> >
> >
> > -for(i=0; i<proc_data->num_local_interfaces; ++i) {
> > +for( i = 0; i < proc_data->num_local_interfaces; ++i ) {
> >  mca_btl_tcp_interface_t* local_interface =
> proc_data->local_interfaces[i];
> > -for(j=0; j<proc_data->num_peer_interfaces; ++j) {
> > +for( j = 0; j < proc_data->num_peer_interfaces; ++j ) {
> >
> >  /*  initially, assume no connection is possible */
> >  proc_data->weights[i][j] = CQ_NO_CONNECTION;
> > @@ -552,19 +552,8 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t*
> btl_proc,
> >  if(NULL != proc_data->local_interfaces[i]->ipv4_address &&
> > NULL != peer_interfaces[j]->ipv4_address) {
> >
> > -/*  check for loopback */
> > -if ((opal_net_islocalhost((struct sockaddr
> *)local_interface->ipv4_address) &&
> > - !opal_net_islocalhost((struct sockaddr
> *)peer_interfaces[j]->ipv4_address)) ||
> > -(opal_net_islocalhost((struct sockaddr
> *)peer_interfaces[j]->ipv4_address) &&
> > - !opal_net_islocalhost((struct sockaddr
> *)local_interface->ipv4_address)) ||
> > -(opal_net_islocalhost((struct sockaddr
> *)local_interface->ipv4_address) &&
> > - !opal_ifislocal(proc_hostname))) {
> > -
> > -/* No connection is possible on these interfaces */
> > -
> > -

[OMPI devel] RFC: Reenabling the TCP BTL over local interfaces (when specifically requested)

2016-09-21 Thread George Bosilca
The current code in the TCP BTL prevents local execution on a laptop not
exposing a public IP address, by unconditionally disqualifying all
interfaces with local addresses. This is not done based on MCA parameters
but instead is done deep inside the IP matching logic, independent of what
the user specified in the corresponding MCA parameters (if_include and/or
if_exclude).

Instead, I propose we exclude the local interface only via the exclude MCA
(both IPv4 and IPv6 local addresses are already in the default if_exclude),
and remove all the code that prevents local addresses. I propose the
following patch (local addresses are accepted via the second if because
opal_net_samenetwork returns true).

If no complaints by Friday morning, I will push the code.

  Thanks,
George.



diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c
b/opal/mca/btl/tcp/btl_tcp_proc.c
index a727a43..f7decc4 100644
--- a/opal/mca/btl/tcp/btl_tcp_proc.c
+++ b/opal/mca/btl/tcp/btl_tcp_proc.c
@@ -541,9 +541,9 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t*
btl_proc,
 }


-for(i=0; i<proc_data->num_local_interfaces; ++i) {
+for( i = 0; i < proc_data->num_local_interfaces; ++i ) {
 mca_btl_tcp_interface_t* local_interface =
proc_data->local_interfaces[i];
-for(j=0; j<proc_data->num_peer_interfaces; ++j) {
+for( j = 0; j < proc_data->num_peer_interfaces; ++j ) {

 /*  initially, assume no connection is possible */
 proc_data->weights[i][j] = CQ_NO_CONNECTION;
@@ -552,19 +552,8 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t*
btl_proc,
 if(NULL != proc_data->local_interfaces[i]->ipv4_address &&
NULL != peer_interfaces[j]->ipv4_address) {

-/*  check for loopback */
-if ((opal_net_islocalhost((struct sockaddr
*)local_interface->ipv4_address) &&
- !opal_net_islocalhost((struct sockaddr
*)peer_interfaces[j]->ipv4_address)) ||
-(opal_net_islocalhost((struct sockaddr
*)peer_interfaces[j]->ipv4_address) &&
- !opal_net_islocalhost((struct sockaddr
*)local_interface->ipv4_address)) ||
-(opal_net_islocalhost((struct sockaddr
*)local_interface->ipv4_address) &&
- !opal_ifislocal(proc_hostname))) {
-
-/* No connection is possible on these interfaces */
-
-/*  check for RFC1918 */
-} else if(opal_net_addr_isipv4public((struct sockaddr*)
local_interface->ipv4_address) &&
-  opal_net_addr_isipv4public((struct sockaddr*)
peer_interfaces[j]->ipv4_address)) {
+if(opal_net_addr_isipv4public((struct sockaddr*)
local_interface->ipv4_address) &&
+   opal_net_addr_isipv4public((struct sockaddr*)
peer_interfaces[j]->ipv4_address)) {
 if(opal_net_samenetwork((struct sockaddr*)
local_interface->ipv4_address,
 (struct sockaddr*)
peer_interfaces[j]->ipv4_address,

 local_interface->ipv4_netmask)) {
@@ -574,17 +563,16 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t*
btl_proc,
 }
 proc_data->best_addr[i][j] =
peer_interfaces[j]->ipv4_endpoint_addr;
 continue;
+}
+if(opal_net_samenetwork((struct sockaddr*)
local_interface->ipv4_address,
+(struct sockaddr*)
peer_interfaces[j]->ipv4_address,
+local_interface->ipv4_netmask)) {
+proc_data->weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
 } else {
-if(opal_net_samenetwork((struct sockaddr*)
local_interface->ipv4_address,
-(struct sockaddr*)
peer_interfaces[j]->ipv4_address,
-
 local_interface->ipv4_netmask)) {
-proc_data->weights[i][j] = CQ_PRIVATE_SAME_NETWORK;
-} else {
-proc_data->weights[i][j] =
CQ_PRIVATE_DIFFERENT_NETWORK;
-}
-proc_data->best_addr[i][j] =
peer_interfaces[j]->ipv4_endpoint_addr;
-continue;
+proc_data->weights[i][j] =
CQ_PRIVATE_DIFFERENT_NETWORK;
 }
+proc_data->best_addr[i][j] =
peer_interfaces[j]->ipv4_endpoint_addr;
+continue;
 }

 /* check state of ipv6 address pair - ipv6 is always public,
@@ -593,19 +581,9 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t*
btl_proc,
 if(NULL != local_interface->ipv6_address &&
NULL != peer_interfaces[j]->ipv6_address) {

-/*  check for loopback */
-if ((opal_net_islocalhost((struct sockaddr
*)local_interface->ipv6_address) &&
- !opal_net_islocalhost((struct sockaddr
*)peer_interfaces[j]->ipv6_address)) ||
-(opal_net_i

Re: [OMPI devel] Deadlock in sync_wait_mt(): Proposed patch

2016-09-21 Thread George Bosilca
Nice catch. Keeping the first check only works because the signaling field
prevents us from releasing the condition too early. I added some comments
around the code (131fe42d).

  George.
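
For readers following along, this is the general shape of the pattern being
discussed, as a stand-alone sketch rather than the actual
opal/threads/wait_sync.c code (the real implementation decrements the count
atomically and relies on a signaling field instead of taking the lock on
the update path):

#include <pthread.h>

/* fields are assumed to be initialized elsewhere */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  condition;
    volatile int    count;
    int             status;
} wait_sync_sketch_t;

static int sync_wait_sketch(wait_sync_sketch_t *sync)
{
    if (sync->count <= 0)                 /* fast path: already completed */
        return sync->status;

    pthread_mutex_lock(&sync->lock);
    while (sync->count > 0) {             /* re-check while holding the lock */
        pthread_cond_wait(&sync->condition, &sync->lock);
    }
    pthread_mutex_unlock(&sync->lock);
    return sync->status;
}

static void sync_update_sketch(wait_sync_sketch_t *sync)
{
    pthread_mutex_lock(&sync->lock);      /* signal while holding the lock, so */
    if (--sync->count <= 0) {             /* no wakeup can be lost between the */
        pthread_cond_signal(&sync->condition); /* unlocked test and cond_wait */
    }
    pthread_mutex_unlock(&sync->lock);
}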

On Wed, Sep 21, 2016 at 5:33 AM, Nathan Hjelm  wrote:

> Yeah, that looks like a bug to me. We need to keep the check before the
> lock but otherwise this is fine and should be fixed in 2.0.2.
>
> -Nathan
>
> > On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL 
> wrote:
> >
> > I encountered a deadlock in sync_wait_mt().
> >
> > After investigations, it appears that a first thread executing
> wait_sync_update() decrements sync->count just after a second thread in
> sync_wait_mt() made the test :
> >
> > if(sync->count <= 0)
> > return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> >
> > After that, there is a narrow window in which the first thread may call
> pthread_cond_signal() before the second thread calls pthread_cond_wait().
> >
> > If I protect this test by the sync->lock, this window is closed and the
> problem does not reproduce.
> >
> > To easily reproduce the problem, just add a call to usleep(100) before the
> call to pthread_mutex_lock(&sync->lock);
> >
> > So my proposed patch is:
> >
> > diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c
> > index c9b9137..2f90965 100644
> > --- a/opal/threads/wait_sync.c
> > +++ b/opal/threads/wait_sync.c
> > @@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL;
> >
> > int sync_wait_mt(ompi_wait_sync_t *sync)
> > {
> > -if(sync->count <= 0)
> > -return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> > -
> >  /* lock so nobody can signal us during the list updating */
> >  pthread_mutex_lock(&sync->lock);
> >
> > +if(sync->count <= 0) {
> > +pthread_mutex_unlock(&sync->lock);
> > +return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> > +}
> > +
> >  /* Insert sync on the list of pending synchronization constructs */
> >  OPAL_THREAD_LOCK(&wait_sync_lock);
> >  if( NULL == wait_sync_list ) {
> >
> > For performance reasons, it is also possible to leave the first test
> call. So if the request is terminated, we do not spend time to take and
> free the lock.
> >
> >
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Sample of merging ompi and ompi-release

2016-09-19 Thread George Bosilca
:+1:

  George.


On Mon, Sep 19, 2016 at 6:56 PM, Jeff Squyres (jsquyres)  wrote:

> (we can discuss all of this on the Webex tomorrow)
>
> Here's a sample repo where I merged ompi and ompi-release:
>
> https://github.com/open-mpi/ompi-all-the-branches
>
> Please compare it to:
>
> https://github.com/open-mpi/ompi
> and https://github.com/open-mpi/ompi-release
>
> It's current to as of within the last hour or so.
>
> Feel free to make dummy commits / pull requests on this rep.  It's a
> sandbox repo that will eventually be deleted; it's safe to make whatever
> changes you want on this repo.
>
> Notes:
>
> - All current OMPI developers have been given the same push/merge access
> on master
> - Force pushes are disabled on *all* branches
> - On release branches:
> - Pull requests cannot be merged without at least 1 review
>   *** ^^^ This is a new Github feature
> - Only the gatekeeper team can merge PRs
>
> If no one sees any problem with this sandbox repo, I can merge all the
> ompi-release branches to the ompi repo as soon as tomorrow, and config the
> same settings.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/
> about/doing_business/legal/cri/
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [MPI_Tools] How to branch components

2016-09-08 Thread George Bosilca
On Thu, Sep 8, 2016 at 10:17 AM, Clément  wrote:

> Hi every one,
>
> I'm currently working on a monitoring component for OpenMPI. After
> browsing the MPI standard, in order to know what the MPI_Tools interface
> looks like, and how it's working, I had a look at the code I recieved and
> something surprised me. I can't find anywhere any reference to the sessions
> (MPI_T_pvar_session).
>

This being an MPI-level type, it is defined in mpi.h. However, as our MPI
header file is auto-generated, you have to look in mpi.h.in instead. Here
is the definition:

./ompi/include/mpi.h.in:344:typedef struct mca_base_pvar_session_t
*MPI_T_pvar_session;
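
For reference, the calling sequence a tool goes through with these types
looks roughly like the sketch below (the pvar index 0 and the unsigned long
long buffer are placeholders; a real tool discovers the index, datatype and
count via MPI_T_pvar_get_num()/MPI_T_pvar_get_info()):

#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int provided, count;
    unsigned long long value[8];      /* assumes count <= 8 for brevity */
    MPI_T_pvar_session session;
    MPI_T_pvar_handle  handle;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_pvar_session_create(&session);
    MPI_T_pvar_handle_alloc(session, 0 /* placeholder index */, NULL,
                            &handle, &count);

    MPI_T_pvar_start(session, handle);
    /* ... run the code being monitored ... */
    MPI_T_pvar_stop(session, handle);
    MPI_T_pvar_read(session, handle, value);
    printf("first element of pvar 0: %llu\n", value[0]);

    MPI_T_pvar_handle_free(session, &handle);
    MPI_T_pvar_session_free(&session);
    MPI_T_finalize();
    return 0;
}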


>
> How is it dealt with in the ompi engine? Are the call-back functions
> simply called when necessary, trusting the calling layers for that, or
> should it be managed manually? How to get the reference to the object on
> which the pvar is linked? And to the session (for a per-session  monitoring
> for example, in parallel with a global monitopring)? If one object is
> monitored in two sessions, both actives at the same time, are the call-back
> functions called twice?
>
> Probably, it should be possible to answer to most of this questions in one
> simple one : is there a programmation reference for the MPI_Tools
> implementation somewhere? If so, where would it be?
>

Everything you ever wanted to know about our PVAR implementation is
in opal/mca/base/mca_base_pvar.

  George.




>
> Thank you in advance.
>
> Clément FOYER
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Hanging tests

2016-09-06 Thread George Bosilca
I can make MPI_Issend_rtoa deadlock with vader and sm.

  George.


On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org  wrote:

> FWIW: those tests hang for me with TCP (I don’t have openib on my
> cluster). I’ll check it with your change as well
>
>
> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet  wrote:
>
> Ralph,
>
>
> this looks like an other hang :-(
>
>
> i ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores
> per socket) with infiniband,
>
> and i always observe the same hang at the same place.
>
>
> surprisingly, i do not get any hang if i blacklist the openib btl
>
>
> the patch below can be used to avoid the hang with infiniband or for
> debugging purpose
>
> the hang occurs in communicator 6, and if i skip tests on communicator 2,
> no hang happens.
>
> the hang occurs on an intercomm :
>
> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm
>
> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm
>
> task 0 MPI_Issend to task 1, and task 1 MPI_Irecv from task 0, and then
> both hang in MPI_Wait()
>
> surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling
> that the hang only occurs with the openib btl,
>
> since vader should be used here.
>
>
> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c
> b/intel_tests/src/MPI_Issend_rtoa_c.c
> index 8b26f84..b9a704b 100644
> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>
>  for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
>   comm_count++) {
>  comm_index = MPITEST_get_comm_index(comm_count);
>  comm_type = MPITEST_get_comm_type(comm_count);
> +if (2 == comm_count) continue;
>
>  /*
> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>   * left sub-communicator
>   */
>
> +if (6 == comm_count && 12 == length_count &&
> MPITEST_current_rank < 2) {
> +/* insert a breakpoint here */
> +}
>   * Reset a bunch of variables that will be set when we get our
>
>
>
> as a side note, which is very unlikely related to this issue, i noticed
> the following program works fine,
>
> though it is reasonable to expect a hang.
>
> the root cause is MPI_Send uses the eager protocol, and though
> communicators used by MPI_Send and MPI_Recv
>
> are different, they have the same (recycled) CID.
>
> fwiw, the tests also completes with mpich.
>
>
> if not already done, should we provide an option not to recycle CIDs ?
>
> or flush unexpected/unmatched messages when a communicator is free'd ?
>
>
> Cheers,
>
>
> Gilles
>
>
> #include <stdio.h>
> #include <mpi.h>
>
> /* send a message (eager mode) in a communicator, and then
>  * receive it in an other communicator, but with the same CID
>  */
> int main(int argc, char *argv[]) {
> int rank, size;
> int b;
> MPI_Comm comm;
>
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> MPI_Comm_size(MPI_COMM_WORLD, &size);
> if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>
> MPI_Comm_dup(MPI_COMM_WORLD, &comm);
> if (0 == rank) {
> b = 0xdeadbeef;
> MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
> }
> MPI_Comm_free(&comm);
>
> MPI_Comm_dup(MPI_COMM_WORLD, &comm);
> if (1 == rank) {
> b = 0;
> MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
> if (0xdeadbeef != b) MPI_Abort(MPI_COMM_WORLD, 2);
> }
> MPI_Comm_free(&comm);
>
> MPI_Finalize();
>
> return 0;
> }
>
>
> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>
> ok,  will double check tomorrow this was the very same hang i fixed
> earlier
>
> Cheers,
>
> Gilles
>
> On Monday, September 5, 2016, r...@open-mpi.org  wrote:
>
>> I was just looking at the overnight MTT report, and these were present
>> going back a long ways in both branches. They are in the Intel test suite.
>>
>> If you have already addressed them, then thanks!
>>
>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>> >
>> > Ralph,
>> >
>> > I fixed a hang earlier today in master, and the PR for v2.x is at
>> https://github.com/open-mpi/ompi-release/pull/1368
>> >
>> > Can you please make sure you are running the latest master ?
>> >
>> > Which testsuite do these tests come from ?
>> > I will have a look tomorrow if the hang is still there
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > r...@open-mpi.org wrote:
>> >> Hey folks
>> >>
>> >> All of the tests that involve either ISsend_ator, SSend_ator,
>> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone know
>> what these tests do, and why we never seem to pass them?
>> >>
>> >> Do we care?
>> >> Ralph
>> >>
>> >> ___
>> >> devel mailing list
>> >> devel@lists.open-mpi.org
>> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/dev

Re: [OMPI devel] Question about Open MPI bindings

2016-09-05 Thread George Bosilca
Indeed. As indicated on the other thread if I add the novm and hetero and
specify both the --bind-to and --map-by I get the expected behavior.

Thanks,
  George.


On Mon, Sep 5, 2016 at 2:14 PM, r...@open-mpi.org  wrote:

> I didn’t define the default behaviors - I just implemented what everyone
> said they wanted, as eventually captured in a Google spreadsheet Jeff
> posted (and was available and discussed for weeks before implemented). So
> the defaults are:
>
> * if np <= 2, we map-by core bind-to core
>
> * if np > 2, we map-by socket bind-to socket
>
> In your case, you chose to specify a change in the binding pattern, but
> you left the mapping pattern to be the default. With 3 procs, that means
> you mapped by socket, and bound to core. ORTE did exactly what you told it
> to do (intentionally or not).
>
> If you want the behavior you describe, then you simply tell ORTE to
> “--map-by core --bind-to core”
>
> On Sep 5, 2016, at 11:05 AM, George Bosilca  wrote:
>
> On Sat, Sep 3, 2016 at 10:34 AM, r...@open-mpi.org 
> wrote:
>
>> Interesting - well, it looks like ORTE is working correctly. The map is
>> what you would expect, and so is planned binding.
>>
>> What this tells us is that we are indeed binding (so far as ORTE is
>> concerned) to the correct places. Rank 0 is being bound to 0,8, and that is
>> what the OS reports. Rank 1 is bound to 4,12, and rank 2 is bound to 1,9.
>> All of this matches what the OS reported.
>>
>> So it looks like it is report-bindings that is messed up for some reason.
>>
>
> Ralph,
>
> I have a hard time agreeing with you here. The binding you find correct
> is, from a performance point of view, terrible. Why would anybody want a
> process to be bound to 2 cores on different sockets ?
>
> Please help me with the following exercise. How do I bind each process to
> a single core, allocated in a round robin fashion (such as rank 0 on core
> 0, rank 1 on core 1 and rank 2 on core 2) ?
>
>   George.
>
>
>
>>
>>
>> On Sep 3, 2016, at 7:14 AM, George Bosilca  wrote:
>>
>> $mpirun -np 3 --tag-output --bind-to core --report-bindings
>> --display-devel-map --mca rmaps_base_verbose 10 true
>>
>> [dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
>>> [dancer.icl.utk.edu:17451]  Mapper: ppr Priority: 90
>>> [dancer.icl.utk.edu:17451]  Mapper: seq Priority: 60
>>> [dancer.icl.utk.edu:17451]  Mapper: resilient Priority: 40
>>> [dancer.icl.utk.edu:17451]  Mapper: mindist Priority: 20
>>> [dancer.icl.utk.edu:17451]  Mapper: round_robin Priority: 10
>>> [dancer.icl.utk.edu:17451]  Mapper: staged Priority: 5
>>> [dancer.icl.utk.edu:17451]  Mapper: rank_file Priority: 0
>>> [dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
>>> [dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job
>>> [41198,1]
>>> [dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user -
>>> using bysocket
>>> [dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr
>>> mapper PPR NULL policy PPR NOTSET
>>> [dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq
>>> mapper
>>> [dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial
>>> map of job [41198,1] - no fault groups
>>> [dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using
>>> mindist mapper
>>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
>>> [dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
>>> [dancer.icl.utk.edu:17451] node: arc00 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc01 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc02 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc03 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc04 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc05 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc06 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc07 daemon: NULL
>>> [dancer.icl.utk.edu:17451] node: arc08 daemon: NULL
>>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for
>>> job [41198,1] slots 180 num_procs 3
>>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on
>>> node arc00
>>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: calculated nprocs 20
>>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: assigning nprocs 20
>>> [dancer.icl.utk.edu:17451] mca:rmaps:base: computing vpids by slot for
>>> job [4119

Re: [OMPI devel] OMPI devel] Question about Open MPI bindings

2016-09-05 Thread George Bosilca
Thanks for all these suggestions. I could get the expected bindings by 1)
removing the novm setting and 2) adding hetero-nodes. This is far from an
ideal setting, as now I have to make my own machinefile for every single run,
or spawn daemons on all the machines on the cluster.

Wouldn't it be useful to make the daemon check the number of slots provided
in the machine file against the local cores (and, if they do not match, force
hetero-nodes automatically)?

  George.

PS: Is there an MCA parameter for "hetero-nodes" ?


On Sat, Sep 3, 2016 at 8:07 PM, r...@open-mpi.org  wrote:

> Ah, indeed - if the node where mpirun is executing doesn’t match the
> compute nodes, then you must remove that --novm option. Otherwise, we have
> no way of knowing what the compute node topology looks like.
>
>
> On Sep 3, 2016, at 4:13 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> George,
>
> If i understand correctly, you are running mpirun on dancer, which has
> 2 sockets, 4 cores per socket and 2 hwthreads per core,
> and orted are running on arc[00-08], though the tasks only run on arc00,
> which has
> 2 sockets, 10 cores per socket and 2 hwthreads per core
>
> to me, it looks like OpenMPI assumes all nodes are similar to dancer,
> which is incorrect.
>
> Can you try again with the --hetero-nodes option ?
> (iirc, that should not be needed because nodes should have different
> "hwloc signatures", and OpenMPI is supposed to handle that automatically
> and correctly)
>
> That could be a side effect of your MCA params, can you try to remove them
> and
> mpirun --host arc00 --bind-to core -np 2 --report-bindings grep
> Cpus_allowed_list /proc/self/status
> And one more test plus the --hetero-nodes option ?
>
> Bottom line, you might have to set yet another MCA param equivalent to
> the --hetero-nodes option.
>
> Cheers,
>
> Gilles
>
> r...@open-mpi.org wrote:
> Interesting - well, it looks like ORTE is working correctly. The map is
> what you would expect, and so is planned binding.
>
> What this tells us is that we are indeed binding (so far as ORTE is
> concerned) to the correct places. Rank 0 is being bound to 0,8, and that is
> what the OS reports. Rank 1 is bound to 4,12, and rank 2 is bound to 1,9.
> All of this matches what the OS reported.
>
> So it looks like it is report-bindings that is messed up for some reason.
>
>
> On Sep 3, 2016, at 7:14 AM, George Bosilca  wrote:
>
> $mpirun -np 3 --tag-output --bind-to core --report-bindings
> --display-devel-map --mca rmaps_base_verbose 10 true
>
> [dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
>> [dancer.icl.utk.edu:17451]  Mapper: ppr Priority: 90
>> [dancer.icl.utk.edu:17451]  Mapper: seq Priority: 60
>> [dancer.icl.utk.edu:17451]  Mapper: resilient Priority: 40
>> [dancer.icl.utk.edu:17451]  Mapper: mindist Priority: 20
>> [dancer.icl.utk.edu:17451]  Mapper: round_robin Priority: 10
>> [dancer.icl.utk.edu:17451]  Mapper: staged Priority: 5
>> [dancer.icl.utk.edu:17451]  Mapper: rank_file Priority: 0
>> [dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job
>> [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user -
>> using bysocket
>> [dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr
>> mapper PPR NULL policy PPR NOTSET
>> [dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq
>> mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial
>> map of job [41198,1] - no fault groups
>> [dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using
>> mindist mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
>> [dancer.icl.utk.edu:17451] node: arc00 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc01 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc02 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc03 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc04 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc05 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc06 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc07 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc08 daemon: NULL
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for
>> job [41198,1] slots 180 num_procs 3
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on node
>> arc00
>> [dancer.icl.utk.edu:17451] mca:rmap

Re: [OMPI devel] Question about Open MPI bindings

2016-09-05 Thread George Bosilca
On Sat, Sep 3, 2016 at 10:34 AM, r...@open-mpi.org  wrote:

> Interesting - well, it looks like ORTE is working correctly. The map is
> what you would expect, and so is planned binding.
>
> What this tells us is that we are indeed binding (so far as ORTE is
> concerned) to the correct places. Rank 0 is being bound to 0,8, and that is
> what the OS reports. Rank 1 is bound to 4,12, and rank 2 is bound to 1,9.
> All of this matches what the OS reported.
>
> So it looks like it is report-bindings that is messed up for some reason.
>

Ralph,

I have a hard time agreeing with you here. The binding you find correct is,
from a performance point of view, terrible. Why would anybody want a
process to be bound to 2 cores on different sockets ?

Please help me with the following exercise. How do I bind each process to a
single core, allocated in a round robin fashion (such as rank 0 on core 0,
rank 1 on core 1 and rank 2 on core 2) ?

  George.



>
>
> On Sep 3, 2016, at 7:14 AM, George Bosilca  wrote:
>
> $mpirun -np 3 --tag-output --bind-to core --report-bindings
> --display-devel-map --mca rmaps_base_verbose 10 true
>
> [dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
>> [dancer.icl.utk.edu:17451]  Mapper: ppr Priority: 90
>> [dancer.icl.utk.edu:17451]  Mapper: seq Priority: 60
>> [dancer.icl.utk.edu:17451]  Mapper: resilient Priority: 40
>> [dancer.icl.utk.edu:17451]  Mapper: mindist Priority: 20
>> [dancer.icl.utk.edu:17451]  Mapper: round_robin Priority: 10
>> [dancer.icl.utk.edu:17451]  Mapper: staged Priority: 5
>> [dancer.icl.utk.edu:17451]  Mapper: rank_file Priority: 0
>> [dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job
>> [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user -
>> using bysocket
>> [dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr
>> mapper PPR NULL policy PPR NOTSET
>> [dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq
>> mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial
>> map of job [41198,1] - no fault groups
>> [dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using
>> mindist mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
>> [dancer.icl.utk.edu:17451] node: arc00 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc01 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc02 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc03 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc04 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc05 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc06 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc07 daemon: NULL
>> [dancer.icl.utk.edu:17451] node: arc08 daemon: NULL
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for
>> job [41198,1] slots 180 num_procs 3
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on node
>> arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: calculated nprocs 20
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: assigning nprocs 20
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: computing vpids by slot for
>> job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 0 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 1 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 2 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps: compute bindings for job [41198,1]
>> with policy CORE[4008]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: node arc00 has 3
>> procs on it
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
>> [[41198,1],0]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
>> [[41198,1],1]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
>> [[41198,1],2]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] bind_depth: 5 map_depth 1
>> [dancer.icl.utk.edu:17451] mca:rmaps: bind downward for job [41198,1]
>> with bindings CORE
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],0] BITMAP 0,8
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],0][arc00]
>> TO socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..
>> ]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[4

Re: [OMPI devel] Question about Open MPI bindings

2016-09-03 Thread George Bosilca
$mpirun -np 3 --tag-output --bind-to core --report-bindings
--display-devel-map --mca rmaps_base_verbose 10 true

[dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
> [dancer.icl.utk.edu:17451]  Mapper: ppr Priority: 90
> [dancer.icl.utk.edu:17451]  Mapper: seq Priority: 60
> [dancer.icl.utk.edu:17451]  Mapper: resilient Priority: 40
> [dancer.icl.utk.edu:17451]  Mapper: mindist Priority: 20
> [dancer.icl.utk.edu:17451]  Mapper: round_robin Priority: 10
> [dancer.icl.utk.edu:17451]  Mapper: staged Priority: 5
> [dancer.icl.utk.edu:17451]  Mapper: rank_file Priority: 0
> [dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
> [dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job
> [41198,1]
> [dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user - using
> bysocket
> [dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr
> mapper PPR NULL policy PPR NOTSET
> [dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq
> mapper
> [dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial
> map of job [41198,1] - no fault groups
> [dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using
> mindist mapper
> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
> [dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
> [dancer.icl.utk.edu:17451] node: arc00 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc01 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc02 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc03 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc04 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc05 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc06 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc07 daemon: NULL
> [dancer.icl.utk.edu:17451] node: arc08 daemon: NULL
> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for
> job [41198,1] slots 180 num_procs 3
> [dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on node
> arc00
> [dancer.icl.utk.edu:17451] mca:rmaps:rr: calculated nprocs 20
> [dancer.icl.utk.edu:17451] mca:rmaps:rr: assigning nprocs 20
> [dancer.icl.utk.edu:17451] mca:rmaps:base: computing vpids by slot for
> job [41198,1]
> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 0 to node arc00
> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 1 to node arc00
> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 2 to node arc00
> [dancer.icl.utk.edu:17451] mca:rmaps: compute bindings for job [41198,1]
> with policy CORE[4008]
> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: node arc00 has 3
> procs on it
> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
> [[41198,1],0]
> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
> [[41198,1],1]
> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
> [[41198,1],2]
> [dancer.icl.utk.edu:17451] [[41198,0],0] bind_depth: 5 map_depth 1
> [dancer.icl.utk.edu:17451] mca:rmaps: bind downward for job [41198,1]
> with bindings CORE
> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],0] BITMAP 0,8
> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],0][arc00]
> TO socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..
> ]
> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],1] BITMAP 4,12
> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],1][arc00]
> TO socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..
> ]
> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],2] BITMAP 1,9
> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],2][arc00]
> TO socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..
> ]
> [1,0]:[arc00:07612] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..
> ][../../../../../../../../../..]
> [1,1]:[arc00:07612] MCW rank 1 bound to socket 0[core 4[hwt 0]],
> socket 1[core 12[hwt 0]]: [../../../../B./../../../../.
> .][../../B./../../../../../../..]
> [1,2]:[arc00:07612] MCW rank 2 bound to socket 0[core 1[hwt 0]],
> socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.
> ][../../../../../../../../../..]


On Sat, Sep 3, 2016 at 9:44 AM, r...@open-mpi.org  wrote:

> Okay, can you add --display-devel-map --mca rmaps_base_verbose 10 to your
> cmd line?
>
> It sounds like there is something about that topo that is bothering the
> mapper
>
> On Sep 2, 2016, at 9:31 PM, George Bosilca  wrote:
>
> Thanks Gilles, that's 

Re: [OMPI devel] Question about Open MPI bindings

2016-09-02 Thread George Bosilca
Thanks Gilles, that's a very useful trick. The bindings reported by ORTE
are in sync with the ones reported by the OS.

$ mpirun -np 2 --tag-output --bind-to core --report-bindings grep
Cpus_allowed_list /proc/self/status
[1,0]:[arc00:90813] MCW rank 0 bound to socket 0[core 0[hwt 0]],
socket 0[core 4[hwt 0]]:
[B./../../../B./../../../../..][../../../../../../../../../..]
[1,1]:[arc00:90813] MCW rank 1 bound to socket 1[core 10[hwt 0]],
socket 1[core 14[hwt 0]]:
[../../../../../../../../../..][B./../../../B./../../../../..]
[1,0]:Cpus_allowed_list:0,8
[1,1]:Cpus_allowed_list:1,9

George.
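For completeness, the same cross-check can be done from inside the
application itself. The sketch below is Linux-only (sched_getaffinity) and
the output formatting is arbitrary; it simply prints the programmatic
equivalent of Cpus_allowed_list for each rank.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* Print the OS-level affinity mask of each rank (Linux-only). */
int main(int argc, char *argv[])
{
    int rank, cpu;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    if (0 == sched_getaffinity(0, sizeof(mask), &mask)) {
        printf("rank %d allowed cpus:", rank);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                printf(" %d", cpu);
            }
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}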



On Sat, Sep 3, 2016 at 12:27 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> I cannot help much with this i am afraid
>
> My best bet would be to rebuild OpenMPI with --enable-debug and an
> external recent hwloc (iirc hwloc v2 cannot be used in Open MPI yet)
>
> You might also want to try
> mpirun --tag-output --bind-to xxx --report-bindings grep Cpus_allowed_list
> /proc/self/status
>
> So you can confirm both openmpi and /proc/self/status report the same thing
>
> Hope this helps a bit ...
>
> Gilles
>
>
> George Bosilca  wrote:
> While investigating the ongoing issue with the OMPI messaging layer, I ran
> into some trouble with process binding. I read the documentation, but I
> still find this puzzling.
>
> Disclaimer: all experiments were done with current master (9c496f7)
> compiled in optimized mode. The hardware: a single node 20 core
> Xeon E5-2650 v3 (hwloc-ls is at the end of this email).
>
> First and foremost, trying to bind to NUMA nodes was a sure way to get a
> segfault:
>
> $ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
> --
> No objects of the specified type were found on at least one node:
>
>   Type: NUMANode
>   Node: arc00
>
> The map cannot be done as specified.
> --
> [dancer:32162] *** Process received signal ***
> [dancer:32162] Signal: Segmentation fault (11)
> [dancer:32162] Signal code: Address not mapped (1)
> [dancer:32162] Failing at address: 0x3c
> [dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
> [dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
> [dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
> [dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
> [dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/
> lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
> [dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(orte_state_base_check_all_complete+
> 0x324)[0x7f9075bedca4]
> [dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/
> lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+
> 0x53c)[0x7f90758eafec]
> [dancer:32162] [ 7] mpirun[0x401251]
> [dancer:32162] [ 8] mpirun[0x400e24]
> [dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
> [dancer:32162] [10] mpirun[0x400d49]
> [dancer:32162] *** End of error message ***
> Segmentation fault
>
> As you can see in the hwloc output below, there are 2 NUMA nodes on the
> node and HWLOC correctly identifies them, which makes the OMPI error message
> confusing. In any case, we should not segfault but report a more meaningful
> error message.
>
> Binding to slot (I got this from the man page for 2.0) is apparently not
> supported anymore. Reminder: We should update the manpage accordingly.
>
> Trying to bind to core looks better, the application at least starts.
> Unfortunately the reported bindings (or at least my understanding of these
> bindings) are troubling. Assuming that the way we report the bindings is
> correct, why are my processes assigned to 2 cores far apart each ?
>
> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
> [arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
> 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core
> 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
> Maybe because I only used the binding option. Adding the mapping to the
> mix (--map-by option) seem hopeless, the binding remains unchanged for 2
> processes.
>
> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
> [arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
> 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [

Re: [OMPI devel] Question about Open MPI bindings

2016-09-02 Thread George Bosilca
On Sat, Sep 3, 2016 at 12:18 AM, r...@open-mpi.org  wrote:

> I’ll dig more later, but just checking offhand, I can’t replicate this on
> my box, so it may be something in hwloc for that box (or maybe you have
> some MCA params set somewhere?):
>

Yes, I have 2 MCA parameters set (orte_default_hostfile and
state_novm_select), but I don't think they are expected to affect the
bindings. Or are they ?

  George.



> $ mpirun -n 2 --bind-to core --report-bindings hostname
> [rhc001:83938] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
> [rhc001:83938] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
>
> $ mpirun -n 2 --bind-to numa --report-bindings hostname
> [rhc001:83927] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket
> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]],
> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt
> 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core
> 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]:
> [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
> [rhc001:83927] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket
> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]],
> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt
> 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core
> 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]:
> [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
>
>
> $ mpirun -n 2 --bind-to socket --report-bindings hostname
> [rhc001:83965] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket
> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]],
> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt
> 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core
> 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]:
> [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
> [rhc001:83965] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket
> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]],
> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt
> 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core
> 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]:
> [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../..]
>
>
> I have seen the segfault when something fails early in the setup procedure
> - planned to fix that this weekend.
>
>
> On Sep 2, 2016, at 9:09 PM, George Bosilca  wrote:
>
> While investigating the ongoing issue with the OMPI messaging layer, I ran
> into some trouble with process binding. I read the documentation, but I
> still find this puzzling.
>
> Disclaimer: all experiments were done with current master (9c496f7)
> compiled in optimized mode. The hardware: a single node 20 core
> Xeon E5-2650 v3 (hwloc-ls is at the end of this email).
>
> First and foremost, trying to bind to NUMA nodes was a sure way to get a
> segfault:
>
> $ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
> --
> No objects of the specified type were found on at least one node:
>
>   Type: NUMANode
>   Node: arc00
>
> The map cannot be done as specified.
> --
> [dancer:32162] *** Process received signal ***
> [dancer:32162] Signal: Segmentation fault (11)
> [dancer:32162] Signal code: Address not mapped (1)
> [dancer:32162] Failing at address: 0x3c
> [dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
> [dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
> [dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
> [dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
> [dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/
> lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
> [dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/
> lib/libopen-rte.so.0(orte_state_base_check_all_complete+
> 0x324)[0x7f9075bedca4]
> [dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/
> lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+
> 0x53c)[0x7f90758eafec]
> [dancer:32162] [ 7] mpirun[0x401251]
> [dancer:32162] [ 8] mpirun[0x

[OMPI devel] Question about Open MPI bindings

2016-09-02 Thread George Bosilca
While investigating the ongoing issue with the OMPI messaging layer, I ran
into some trouble with process binding. I read the documentation, but I still
find this puzzling.

Disclaimer: all experiments were done with current master (9c496f7)
compiled in optimized mode. The hardware: a single node 20 core
Xeon E5-2650 v3 (hwloc-ls is at the end of this email).

First and foremost, trying to bind to NUMA nodes was a sure way to get a
segfault:

$ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
--
No objects of the specified type were found on at least one node:

  Type: NUMANode
  Node: arc00

The map cannot be done as specified.
--
[dancer:32162] *** Process received signal ***
[dancer:32162] Signal: Segmentation fault (11)
[dancer:32162] Signal code: Address not mapped (1)
[dancer:32162] Failing at address: 0x3c
[dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
[dancer:32162] [ 1]
/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
[dancer:32162] [ 2]
/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
[dancer:32162] [ 3]
/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
[dancer:32162] [ 4]
/home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
[dancer:32162] [ 5]
/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
[dancer:32162] [ 6]
/home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
[dancer:32162] [ 7] mpirun[0x401251]
[dancer:32162] [ 8] mpirun[0x400e24]
[dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
[dancer:32162] [10] mpirun[0x400d49]
[dancer:32162] *** End of error message ***
Segmentation fault

As you can see in the hwloc output below, there are 2 NUMA nodes on the
node and HWLOC correctly identifies them, which makes the OMPI error message
confusing. In any case, we should not segfault but report a more meaningful
error message.

Binding to slot (I got this from the man page for 2.0) is apparently not
supported anymore. Reminder: We should update the manpage accordingly.

Trying to bind to core looks better; the application at least starts.
Unfortunately, the reported bindings (or at least my understanding of these
bindings) are troubling. Assuming that the way we report the bindings is
correct, why is each of my processes assigned to 2 cores far apart from each
other?

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
[arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core
9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]

Maybe because I only used the binding option. Adding the mapping to the mix
(the --map-by option) seems hopeless; the binding remains unchanged for 2
processes.

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
[arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core
9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]

At this point I really wondered what was going on. To clarify, I tried to
launch 3 processes on the node. Bummer! The reported binding shows that
one of my processes got assigned to cores on different sockets.

$ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
[arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core
9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
[arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core
12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]

Why is rank 1 on core 4 and rank 2 on core 1 ? Maybe specifying the mapping
will help. Will I get a more sensible binding (as suggested by our online
documentation and the man pages) ?

$ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core
--report-bindings true
[arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core
8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core
9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
[arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core
10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]

There is a difference. T

Re: [OMPI devel] [2.0.1rc1] ppc64 atomics (still) broken w/ xlc-12.1

2016-08-27 Thread George Bosilca
v2.x: https://github.com/open-mpi/ompi-release/pull/1344
master: https://github.com/open-mpi/ompi/commit/a6d515b

Thanks,
  George.


On Sat, Aug 27, 2016 at 12:45 PM, George Bosilca 
wrote:

> Paul,
>
> Sorry for the half-fix. I'll submit a patch and PRs to the releases asap.
>
>   George.
>
>
> On Sat, Aug 27, 2016 at 4:14 AM, Paul Hargrove  wrote:
>
>> I didn't get to test 2.0.1rc1 with xlc-12.1 until just now because I need
>> a CRYPTOCard for access (== not fully automated like my other tests).
>>
> It appears that the problem I reported in 2.0.0rc2 and thought to be
> fixed by pr1140 <https://github.com/open-mpi/ompi-release/pull/1140> was
> never /fully/ fixed.
>> The commit in that PR includes only ONE of the TWO patch hunks in my
>> original email (URL in the PR's initial comment).
>> So, opal_atomic_ll_32() was fixed but opal_atomic_ll_64() was not.
>>
>> The same half-fixed state exists on master as well, but is masked by the
>> default use of "__sync builtin atomics".
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] [2.0.1rc1] ppc64 atomics (still) broken w/ xlc-12.1

2016-08-27 Thread George Bosilca
Paul,

Sorry for the half-fix. I'll submit a patch and PRs to the releases asap.

  George.


On Sat, Aug 27, 2016 at 4:14 AM, Paul Hargrove  wrote:

> I didn't get to test 2.0.1rc1 with xlc-12.1 until just now because I need
> a CRYPTOCard for access (== not fully automated like my other tests).
>
> It appears that the problem I reported in 2.0.0rc2 and thought to be
> fixed by pr1140  was
> never /fully/ fixed.
> The commit in that PR includes only ONE of the TWO patch hunks in my
> original email (URL in the PR's initial comment).
> So, opal_atomic_ll_32() was fixed but opal_atomic_ll_64() was not.
>
> The same half-fixed state exists on master as well, but is masked by the
> default use of "__sync builtin atomics".
>
> -Paul
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Performance analysis proposal

2016-08-26 Thread George Bosilca
As your name appears inside (not on commits though), I thought you knew about
that repo, so I was confused about why you wanted another one. ;)

  George.


On Fri, Aug 26, 2016 at 10:36 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> Cool.  Let me know if you need anything else.
>
>
> > On Aug 26, 2016, at 10:34 AM, Artem Polyakov  wrote:
> >
> > Sufficient. Probably I missed it. No need to do anything.
> >
> > 2016-08-26 21:31 GMT+07:00 Jeff Squyres (jsquyres) :
> > Just curious: is https://github.com/open-mpi/2016-summer-perf-testing
> not sufficient?
> >
> >
> >
> > > On Aug 26, 2016, at 10:28 AM, Artem Polyakov 
> wrote:
> > >
> > > I'd prefer it to be created.
> > >
> > > 2016-08-26 20:59 GMT+07:00 Jeff Squyres (jsquyres)  >:
> > > Sorry for jumping in so late.
> > >
> > > Honestly, there's no problem with making a repo in the open-mpi github
> org.  It's just as trivial to make one there as anywhere else.
> > >
> > > Let me know if you want one.
> > >
> > >
> > > > On Aug 26, 2016, at 8:46 AM, Artem Polyakov 
> wrote:
> > > >
> > > > I've marked the first week.
> > > >
> > > > 2016-08-26 19:26 GMT+07:00 George Bosilca :
> > > > Let's go regular for a period and then adapt.
> > > >
> > > > For everybody interested in the performance discussion, I set up a
> doodle for next week. The dates themselves are not important; we need a
> regular timeslot. Please answer with the idea that we do 4 weeks in a row
> and then assess the situation and decide if we need to continue, decrease the
> frequency or declare the problem solved. Here is the participation link:
> http://doodle.com/poll/w4fkb9gr3h2q5p6v
> > > >
> > > >   George.
> > > >
> > > >
> > > > On Fri, Aug 26, 2016 at 6:39 AM, Artem Polyakov 
> wrote:
> > > > This is a good idea. We also have data with SM/Vader to discuss.
> I'll send them later this week.
> > > >
> > > > Do you think of regular calls or per agreement?
> > > >
> > > > On Friday, August 26, 2016, George Bosilca wrote:
> > > > We are serious about this. However, we not only have to define a set
> of meaningful tests (which we don't have yet) but also decide the
> conditions in which they are executed, and more critically what additional
> information we need to make them reproducible, understandable and
> comparable.
> > > >
> > > > We started discussion on these topics during the developers meeting
> few weeks ago, but we barely define what we think will be necessary for
> trivial tests such as single threaded bandwidth. It might be worth having a
> regular phone call (in addition to the Tuesday morning) to make progress.
> > > >
> > > >   George.
> > > >
> > > >
> > > > On Thu, Aug 25, 2016 at 9:37 PM, Artem Polyakov 
> wrote:
> > > > If we are serious about this problem I don't see why we can't create
> a repo for this data and keep the history of all measurements.
> > > >
> > > > Is there any chance that we will not came up with well defined set
> of tests and drop the ball here?
> > > >
> > > > On Friday, August 26, 2016, George Bosilca wrote:
> > > >
> > > > Arm repo is a good location until we converge to a well-defined set
> of tests.
> > > >
> > > >   George.
> > > >
> > > >
> > > > On Thu, Aug 25, 2016 at 1:44 PM, Artem Polyakov 
> wrote:
> > > > That's a good question. I have results myself and I don't know where
> to place them.
> > > > I think that Arm's repo is not a right place to collect the data.
> > > >
> > > > Jeff, can we create the repo in open mpi organization on github or
> do we have something appropriate already?
> > > >
> > > > On Thursday, August 25, 2016, Christoph Niethammer wrote:
> > > >
> > > > Hi Artem,
> > > >
> > > > Thanks for the links. I tested now with 1.10.3, 2.0.0+sm/vader
> performance regression patch under
> > > > https://github.com/hjelmn/ompi/commit/4079eec9749e47dddc6acc9c0847b3
> 091601919f.patch
> > > > and master. I will do the 2.0.1rc in the next days as well.
> > > >
> > > > Is it possible to add me to the results repository at github or
> should I

Re: [OMPI devel] Performance analysis proposal

2016-08-26 Thread George Bosilca
Let's go regular for a period and then adapt.

For everybody interested in the performance discussion, I set up a doodle
for next week. The dates themselves are not important; we need a regular
timeslot. Please answer with the idea that we do 4 weeks in a row and then
assess the situation and decide if we need to continue, decrease the
frequency or declare the problem solved. Here is the participation link:
http://doodle.com/poll/w4fkb9gr3h2q5p6v

  George.


On Fri, Aug 26, 2016 at 6:39 AM, Artem Polyakov  wrote:

> This is a good idea. We also have data with SM/Vader to discuss. I'll send
> them later this week.
>
> Do you think of regular calls or per agreement?
>
> On Friday, August 26, 2016, George Bosilca wrote:
>
>> We are serious about this. However, we not only have to define a set of
>> meaningful tests (which we don't have yet) but also decide the conditions
>> in which they are executed, and more critically what additional information
>> we need to make them reproducible, understandable and comparable.
>>
>> We started discussion on these topics during the developers meeting few
>> weeks ago, but we barely define what we think will be necessary for trivial
>> tests such as single threaded bandwidth. It might be worth having a regular
>> phone call (in addition to the Tuesday morning) to make progress.
>>
>>   George.
>>
>>
>> On Thu, Aug 25, 2016 at 9:37 PM, Artem Polyakov 
>> wrote:
>>
>>> If we are serious about this problem I don't see why we can't create a
>>> repo for this data and keep the history of all measurements.
>>>
>>> Is there any chance that we will not came up with well defined set of
>>> tests and drop the ball here?
>>>
>>> On Friday, August 26, 2016, George Bosilca wrote:
>>>
>>> Arm repo is a good location until we converge to a well-defined set of
>>>> tests.
>>>>
>>>>   George.
>>>>
>>>>
>>>> On Thu, Aug 25, 2016 at 1:44 PM, Artem Polyakov 
>>>> wrote:
>>>>
>>>>> That's a good question. I have results myself and I don't know where
>>>>> to place them.
>>>>> I think that Arm's repo is not a right place to collect the data.
>>>>>
>>>>> Jeff, can we create the repo in open mpi organization on github or do
>>>>> we have something appropriate already?
>>>>>
>>>>> On Thursday, August 25, 2016, Christoph Niethammer wrote:
>>>>>
>>>>> Hi Artem,
>>>>>>
>>>>>> Thanks for the links. I tested now with 1.10.3, 2.0.0+sm/vader
>>>>>> performance regression patch under
>>>>>> https://github.com/hjelmn/ompi/commit/4079eec9749e47dddc6acc
>>>>>> 9c0847b3091601919f.patch
>>>>>> and master. I will do the 2.0.1rc in the next days as well.
>>>>>>
>>>>>> Is it possible to add me to the results repository at github or
>>>>>> should I fork and request you to pull?
>>>>>>
>>>>>> Best
>>>>>> Christoph
>>>>>>
>>>>>>
>>>>>> - Original Message -
>>>>>> From: "Artem Polyakov" 
>>>>>> To: "Open MPI Developers" 
>>>>>> Sent: Tuesday, August 23, 2016 5:13:30 PM
>>>>>> Subject: Re: [OMPI devel] Performance analysis proposal
>>>>>>
>>>>>> Hi, Christoph
>>>>>>
>>>>>> Please, check https://github.com/open-mpi/om
>>>>>> pi/wiki/Request-refactoring-test for the testing methodology and
>>>>>> https://github.com/open-mpi/2016-summer-perf-testing
>>>>>> for examples and launch scripts.
>>>>>>
>>>>>> 2016-08-23 21:20 GMT+07:00 Christoph Niethammer < nietham...@hlrs.de
>>>>>> > :
>>>>>>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I just came over this and would like to contribute some results from
>>>>>> our system(s).
>>>>>> Are there any specific configure options you want to see enabled
>>>>>> beside --enable-mpi-thread-multiple?
>>>>>> How to commit results?
>>>>>>
>>>>>> Best
>>>>>> Christoph Niethammer
>>>>>>
>>>

Re: [OMPI devel] Performance analysis proposal

2016-08-26 Thread George Bosilca
We are serious about this. However, we not only have to define a set of
meaningful tests (which we don't have yet) but also decide the conditions
in which they are executed, and more critically what additional information
we need to make them reproducible, understandable and comparable.

We started discussing these topics during the developers meeting a few
weeks ago, but we barely defined what we think will be necessary for trivial
tests such as single-threaded bandwidth. It might be worth having a regular
phone call (in addition to the Tuesday morning one) to make progress.

  George.
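As a strawman for the trivial single-threaded bandwidth case, the kind of
kernel that would need to be standardized looks roughly like the sketch
below; the message size, iteration count and reporting format are
placeholders, i.e., exactly the parameters that still have to be agreed on.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Strawman single-threaded bandwidth kernel (2 ranks, ping-pong). */
#define NITERS   1000
#define MSG_SIZE (1 << 20)   /* 1 MiB per message, arbitrary */

int main(int argc, char *argv[])
{
    int rank, i;
    char *buf;
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(MSG_SIZE);
    memset(buf, 0, MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);     /* start the timed loop together */
    start = MPI_Wtime();
    for (i = 0; i < NITERS; i++) {
        if (0 == rank) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (1 == rank) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    elapsed = MPI_Wtime() - start;

    if (0 == rank) {
        printf("avg bandwidth: %.2f MB/s\n",
               2.0 * NITERS * MSG_SIZE / elapsed / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}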


On Thu, Aug 25, 2016 at 9:37 PM, Artem Polyakov  wrote:

> If we are serious about this problem I don't see why we can't create a
> repo for this data and keep the history of all measurements.
>
> Is there any chance that we will not came up with well defined set of
> tests and drop the ball here?
>
> On Friday, August 26, 2016, George Bosilca wrote:
>
> Arm repo is a good location until we converge to a well-defined set of
>> tests.
>>
>>   George.
>>
>>
>> On Thu, Aug 25, 2016 at 1:44 PM, Artem Polyakov 
>> wrote:
>>
>>> That's a good question. I have results myself and I don't know where to
>>> place them.
>>> I think that Arm's repo is not a right place to collect the data.
>>>
>>> Jeff, can we create the repo in open mpi organization on github or do we
>>> have something appropriate already?
>>>
> On Thursday, August 25, 2016, Christoph Niethammer wrote:
>>>
>>> Hi Artem,
>>>>
>>>> Thanks for the links. I tested now with 1.10.3, 2.0.0+sm/vader
>>>> performance regression patch under
>>>> https://github.com/hjelmn/ompi/commit/4079eec9749e47dddc6acc
>>>> 9c0847b3091601919f.patch
>>>> and master. I will do the 2.0.1rc in the next days as well.
>>>>
>>>> Is it possible to add me to the results repository at github or should
>>>> I fork and request you to pull?
>>>>
>>>> Best
>>>> Christoph
>>>>
>>>>
>>>> - Original Message -
>>>> From: "Artem Polyakov" 
>>>> To: "Open MPI Developers" 
>>>> Sent: Tuesday, August 23, 2016 5:13:30 PM
>>>> Subject: Re: [OMPI devel] Performance analysis proposal
>>>>
>>>> Hi, Christoph
>>>>
>>>> Please, check https://github.com/open-mpi/om
>>>> pi/wiki/Request-refactoring-test for the testing methodology and
>>>> https://github.com/open-mpi/2016-summer-perf-testing
>>>> for examples and launch scripts.
>>>>
>>>> 2016-08-23 21:20 GMT+07:00 Christoph Niethammer < nietham...@hlrs.de >
>>>> :
>>>>
>>>>
>>>> Hello,
>>>>
>>>> I just came over this and would like to contribute some results from
>>>> our system(s).
>>>> Are there any specific configure options you want to see enabled beside
>>>> --enable-mpi-thread-multiple?
>>>> How to commit results?
>>>>
>>>> Best
>>>> Christoph Niethammer
>>>>
>>>>
>>>>
>>>> - Original Message -
>>>> From: "Arm Patinyasakdikul (apatinya)" < apati...@cisco.com >
>>>> To: "Open MPI Developers" < devel@lists.open-mpi.org >
>>>> Sent: Friday, July 29, 2016 8:41:06 PM
>>>> Subject: Re: [OMPI devel] Performance analysis proposal
>>>>
>>>> Hey Artem, all,
>>>>
>>>> Thank you for the benchmark prototype. I have created the discussion
>>>> page here : https://github.com/open-mpi/20
>>>> 16-summer-perf-testing/issues/1 .
>>>>
>>>>
>>>> * There, I have single threaded and multithreaded performance posted.
>>>> * The script I used is now in the repo (also in the discussion page)
>>>> * Result with openib will come up probably next week. I have to access
>>>> UTK machine for that.
>>>> * I did some test and yes, I have seen some openib hang in
>>>> multithreaded case.
>>>> Thank you,
>>>> Arm
>>>>
>>>> From: devel < devel-boun...@lists.open-mpi.org > on behalf of Artem
>>>> Polyakov < artpo...@gmail.com >
>>>> Reply-To: Open MPI Developers < devel@lists.open-mpi.org >
>>>> Date: Thursday, July 28, 2016 at 10:42 PM
>>>>

Re: [OMPI devel] Performance analysis proposal

2016-08-25 Thread George Bosilca
Arm's repo is a good location until we converge to a well-defined set of
tests.

  George.


On Thu, Aug 25, 2016 at 1:44 PM, Artem Polyakov  wrote:

> That's a good question. I have results myself and I don't know where to
> place them.
> I think that Arm's repo is not a right place to collect the data.
>
> Jeff, can we create the repo in open mpi organization on github or do we
> have something appropriate already?
>
> On Thursday, August 25, 2016, Christoph Niethammer wrote:
>
> Hi Artem,
>>
>> Thanks for the links. I tested now with 1.10.3, 2.0.0+sm/vader
>> performance regression patch under
>> https://github.com/hjelmn/ompi/commit/4079eec9749e47dddc6acc
>> 9c0847b3091601919f.patch
>> and master. I will do the 2.0.1rc in the next days as well.
>>
>> Is it possible to add me to the results repository at github or should I
>> fork and request you to pull?
>>
>> Best
>> Christoph
>>
>>
>> - Original Message -
>> From: "Artem Polyakov" 
>> To: "Open MPI Developers" 
>> Sent: Tuesday, August 23, 2016 5:13:30 PM
>> Subject: Re: [OMPI devel] Performance analysis proposal
>>
>> Hi, Christoph
>>
>> Please, check https://github.com/open-mpi/om
>> pi/wiki/Request-refactoring-test for the testing methodology and
>> https://github.com/open-mpi/2016-summer-perf-testing
>> for examples and launch scripts.
>>
>> 2016-08-23 21:20 GMT+07:00 Christoph Niethammer < nietham...@hlrs.de > :
>>
>>
>> Hello,
>>
>> I just came over this and would like to contribute some results from our
>> system(s).
>> Are there any specific configure options you want to see enabled beside
>> --enable-mpi-thread-multiple?
>> How to commit results?
>>
>> Best
>> Christoph Niethammer
>>
>>
>>
>> - Original Message -
>> From: "Arm Patinyasakdikul (apatinya)" < apati...@cisco.com >
>> To: "Open MPI Developers" < devel@lists.open-mpi.org >
>> Sent: Friday, July 29, 2016 8:41:06 PM
>> Subject: Re: [OMPI devel] Performance analysis proposal
>>
>> Hey Artem, all,
>>
>> Thank you for the benchmark prototype. I have created the discussion page
>> here : https://github.com/open-mpi/2016-summer-perf-testing/issues/1 .
>>
>>
>> * There, I have single threaded and multithreaded performance posted.
>> * The script I used is now in the repo (also in the discussion page)
>> * Result with openib will come up probably next week. I have to access
>> UTK machine for that.
>> * I did some test and yes, I have seen some openib hang in multithreaded
>> case.
>> Thank you,
>> Arm
>>
>> From: devel < devel-boun...@lists.open-mpi.org > on behalf of Artem
>> Polyakov < artpo...@gmail.com >
>> Reply-To: Open MPI Developers < devel@lists.open-mpi.org >
>> Date: Thursday, July 28, 2016 at 10:42 PM
>> To: Open MPI Developers < devel@lists.open-mpi.org >
>> Subject: Re: [OMPI devel] Performance analysis proposal
>>
>> Thank you, Arm!
>>
>> Good to have vader results (I haven't tried it myself yet). Few
>> comments/questions:
>> 1. I guess we also want to have 1-threaded performance for the "baseline"
>> reference.
>> 2. Have you tried to run with openib, as I mentioned on the call I had
>> some problems with it and I'm curious if you can reproduce in your
>> environment.
>>
>> Github issue sounds good for me!
>>
>> 2016-07-29 12:30 GMT+07:00 Arm Patinyasakdikul (apatinya) <
>> apati...@cisco.com > :
>>
>>
>> I added some result to https://github.com/open-mpi/20
>> 16-summer-perf-testing
>>
>> The result shows much better performance from 2.0.0 and master over
>> 1.10.3 for vader. The test ran with Artem’s version of benchmark on OB1,
>> single node, bind to socket.
>>
>> We should have a place to discuss/comment/collaborate on results. Should
>> I open an issue on that repo? So we can have github style
>> commenting/referencing.
>>
>>
>> Arm
>>
>>
>>
>>
>> On 7/28/16, 3:02 PM, "devel on behalf of Jeff Squyres (jsquyres)" <
>> devel-boun...@lists.open-mpi.org on behalf of jsquy...@cisco.com > wrote:
>>
>> >On Jul 28, 2016, at 6:28 AM, Artem Polyakov < artpo...@gmail.com >
>> wrote:
>> >>
>> >> Jeff and others,
>> >>
>> >> 1. The benchmark was updated to support shared memory case.
>> >> 2. The wiki was updated with the benchmark description:
>> https://github.com/open-mpi/ompi/wiki/Request-refactoring-te
>> st#benchmark-prototype
>> >
>> >Sweet -- thanks!
>> >
>> >> Let me know if we want to put this prototype to some general place. I
>> think it makes sense.
>> >
>> >I just created:
>> >
>> > https://github.com/open-mpi/2016-summer-perf-testing
>> >
>> >Want to put it there?
>> >
>> >Arm just ran a bunch of tests today and will be committing a bunch of
>> results in there shortly.
>> >
>> >--
>> >Jeff Squyres
>> > jsquy...@cisco.com
>> >For corporate legal information go to: http://www.cisco.com/web/about
>> /doing_business/legal/cri/
>> >
>> >___
>> >devel mailing list
>> > devel@lists.open-mpi.org
>> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> 

Re: [OMPI devel] Coll/sync component missing???

2016-08-22 Thread George Bosilca
Gilles,

You should also remove the reduce_scatter from the sync (it has an implicit
synchronization).

  George.


On Mon, Aug 22, 2016 at 10:02 AM, r...@open-mpi.org  wrote:

> If you see ways to improve it, you are welcome to do so.
>
> On Aug 22, 2016, at 12:30 AM, Gilles Gouaillardet 
> wrote:
>
> Folks,
>
>
> I was reviewing the sources of the coll/sync module, and
>
>
> 1) I noticed the same pattern is used in *every* source:
>
> if (s->in_operation) {
>     return s->c_coll.coll_xxx(...);
> } else {
>     COLL_SYNC(s, s->c_coll.coll_xxx(...));
> }
>
> Is there any rationale for not moving the if (s->in_operation) test into
> the COLL_SYNC macro?
>
>
> 2) I could not find a rationale for using s->in_operation:
> - if a barrier must be performed, the barrier of the underlying module
> (e.g. coll/tuned) is invoked directly, so coll/sync never re-enters itself
> - with MPI_THREAD_MULTIPLE, it is the end user's responsibility to ensure
> that two threads never simultaneously invoke a collective operation on the
> *same* communicator
>   (and s->in_operation is a per-communicator boolean), so I do not see how
> s->in_operation can be true in a valid MPI program.
>
>
> Though the first point can be seen as a "matter of style", I am pretty
> curious about the second one.
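
[On the second point, a minimal, self-contained sketch of the MPI threading
rule being referenced: two threads may run collectives concurrently only on
different communicators, which is why a per-communicator in_operation flag
should never be observed as true on entry in a valid program. The function
name, communicator arguments, and OpenMP usage below are purely illustrative
assumptions, not code from OMPI or from this thread.]

#include <mpi.h>

/* Sketch only: assumes MPI_Init_thread() granted MPI_THREAD_MULTIPLE and the
 * file is built with OpenMP support.  The rule: two threads of the same
 * process must not run collectives concurrently on the SAME communicator. */
void concurrent_collectives(MPI_Comm comm_a, MPI_Comm comm_b)
{
    int x = 1, y = 1;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        MPI_Allreduce(MPI_IN_PLACE, &x, 1, MPI_INT, MPI_SUM, comm_a); /* legal */

        #pragma omp section
        MPI_Allreduce(MPI_IN_PLACE, &y, 1, MPI_INT, MPI_SUM, comm_b); /* legal
           only because comm_b != comm_a; reusing comm_a here would be an
           erroneous program, so a per-communicator in_operation flag cannot
           legitimately be true when a wrapper is entered. */
    }
}
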
>
>
> Cheers,
>
>
> Gilles
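
[On the first point, a minimal sketch of what folding the in_operation test
into the macro could look like. Only in_operation, c_coll and COLL_SYNC come
from the snippet quoted above; the struct, the counter and the threshold are
invented placeholders, not the actual coll/sync source. They also let the
sketch show the "barrier every N collectives" behavior discussed later in the
thread.]

#include <stdbool.h>

/* Placeholder state -- illustrative names only, not OMPI's. */
struct sync_module_sketch {
    bool in_operation;          /* true while a wrapped collective runs */
    int  calls_since_barrier;   /* collectives seen since the last barrier */
    int  barrier_every;         /* inject a barrier every N collectives */
    struct {
        int (*coll_barrier)(void *ctx);  /* stand-in for the underlying
                                            module's barrier entry point */
        void *barrier_ctx;
        /* ... coll_bcast, coll_allreduce, ... */
    } c_coll;
};

/* With the reentrance test inside the macro, every wrapper in the component
 * collapses to a single line: COLL_SYNC(s, s->c_coll.coll_xxx(...)); */
#define COLL_SYNC(s, op)                                                  \
    do {                                                                  \
        if ((s)->in_operation) {         /* nested call: pass through */  \
            return (op);                                                  \
        }                                                                 \
        (s)->in_operation = true;                                         \
        if (++(s)->calls_since_barrier >= (s)->barrier_every) {           \
            (s)->calls_since_barrier = 0;                                 \
            (s)->c_coll.coll_barrier((s)->c_coll.barrier_ctx);            \
        }                                                                 \
        int rc_ = (op);                                                   \
        (s)->in_operation = false;                                        \
        return rc_;                                                       \
    } while (0)
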
>
> On 8/21/2016 3:44 AM, George Bosilca wrote:
>
> Ralph,
>
> Bringing back the coll/sync component is a cheap shot at hiding a real issue
> behind a smoke curtain. As Nathan described in his email, Open MPI's lack of
> flow control on eager messages is the real culprit here, and the loop around
> any one-to-many collective (bcast and scatter*) was only helping to
> exacerbate the issue. However, a loop around a small MPI_Send will also end
> in memory exhaustion, an issue that would not be easily circumvented by
> adding synchronizations deep inside the library.
>
>   George.
>
>
> On Sat, Aug 20, 2016 at 12:30 AM, r...@open-mpi.org 
> wrote:
>
>> I can not provide the user report as it is a proprietary problem.
>> However, it consists of a large loop of calls to MPI_Bcast that crashes due
>> to unexpected messages. We have been looking at instituting flow control,
>> but that has way too widespread an impact. The coll/sync component would be
>> a simple solution.
>>
>> I honestly don’t believe the issue I was resolving was due to a bug - it
>> was a simple problem of one proc running slow and creating an overload of
>> unexpected messages that eventually consumed too much memory. Rather, I
>> think you solved a different problem - by the time you arrived at LANL, the
>> app I was working with had already modified their code to no longer create
>> the problem (essentially refactoring the algorithm to avoid the massive
>> loop over allreduce).
>>
>> I have no issue supporting it as it takes near-zero effort to maintain,
>> and this is a fairly common problem with legacy codes that don’t want to
>> refactor their algorithms.
>>
>>
>> > On Aug 19, 2016, at 8:48 PM, Nathan Hjelm  wrote:
>> >
>> >> On Aug 19, 2016, at 4:24 PM, r...@open-mpi.org wrote:
>> >>
>> >> Hi folks
>> >>
>> >> I had a question arise regarding a problem being seen by an OMPI user
>> - has to do with the old bugaboo I originally dealt with back in my LANL
>> days. The problem is with an app that repeatedly hammers on a collective,
>> and gets overwhelmed by unexpected messages when one of the procs falls
>> behind.
>> >
>> > I did some investigation on roadrunner several years ago and determined
>> that the user code issue coll/sync was attempting to fix was due to a bug
>> in ob1/cksum (really can’t remember). coll/sync was simply masking a
>> live-lock problem. I committed a workaround for the bug in r26575 (
>> https://github.com/open-mpi/ompi/commit/59e529cf1dfe986e40d
>> 14ec4d2a2e5ef0cea5e35) and tested it with the user code. After this
>> change the user code ran fine without coll/sync. Since lanl no longer had
>> any users of coll/sync we stopped supporting it.
>> >
>> >> I solved this back then by introducing the “sync” component in
>> ompi/mca/coll, which injected a barrier operation every N collectives. You
>> could even “tune” it by doing the injection for only specific collectives.
>> >>
>> >> However, I can no longer find that component in the code base - I find
>> it in the 1.6 series, but someone removed it during the 1.7 series.
>> >>
>> >> Can someone tell me why this was done??? Is there any reason not to
>> >> bring it back? It solves a very real, not uncommon, problem.
>> >> Ralph

Re: [OMPI devel] Coll/sync component missing???

2016-08-20 Thread George Bosilca
Ralph,

Bringing back the coll/sync component is a cheap shot at hiding a real issue
behind a smoke curtain. As Nathan described in his email, Open MPI's lack of
flow control on eager messages is the real culprit here, and the loop around
any one-to-many collective (bcast and scatter*) was only helping to
exacerbate the issue. However, a loop around a small MPI_Send will also end
in memory exhaustion, an issue that would not be easily circumvented by
adding synchronizations deep inside the library.

  George.
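
[To make the failure mode in this thread concrete, a minimal sketch of the
kind of user code being described: a tight loop of small broadcasts with no
flow control, where a slow non-root rank accumulates unexpected eager
messages. The hand-inserted periodic MPI_Barrier bounds that backlog, which is
essentially what coll/sync automated by injecting a barrier every N
collectives. The loop count and interval are arbitrary illustration values,
not taken from the user report.]

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int iterations = 1000000;   /* arbitrary: "large loop of MPI_Bcast" */
    const int sync_every = 1000;      /* arbitrary throttling interval */
    int payload = 0;

    for (int i = 0; i < iterations; i++) {
        /* Senders return as soon as the eager sends are queued; without any
         * flow control a slow receiver piles these up as unexpected messages. */
        MPI_Bcast(&payload, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if ((i + 1) % sync_every == 0) {
            MPI_Barrier(MPI_COMM_WORLD);  /* keep fast ranks from running ahead */
        }
    }

    MPI_Finalize();
    return 0;
}
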


On Sat, Aug 20, 2016 at 12:30 AM, r...@open-mpi.org  wrote:

> I can not provide the user report as it is a proprietary problem. However,
> it consists of a large loop of calls to MPI_Bcast that crashes due to
> unexpected messages. We have been looking at instituting flow control, but
> that has way too widespread an impact. The coll/sync component would be a
> simple solution.
>
> I honestly don’t believe the issue I was resolving was due to a bug - it
> was a simple problem of one proc running slow and creating an overload of
> unexpected messages that eventually consumed too much memory. Rather, I
> think you solved a different problem - by the time you arrived at LANL, the
> app I was working with had already modified their code to no longer create
> the problem (essentially refactoring the algorithm to avoid the massive
> loop over allreduce).
>
> I have no issue supporting it as it takes near-zero effort to maintain,
> and this is a fairly common problem with legacy codes that don’t want to
> refactor their algorithms.
>
>
> > On Aug 19, 2016, at 8:48 PM, Nathan Hjelm  wrote:
> >
> >> On Aug 19, 2016, at 4:24 PM, r...@open-mpi.org wrote:
> >>
> >> Hi folks
> >>
> >> I had a question arise regarding a problem being seen by an OMPI user -
> has to do with the old bugaboo I originally dealt with back in my LANL
> days. The problem is with an app that repeatedly hammers on a collective,
> and gets overwhelmed by unexpected messages when one of the procs falls
> behind.
> >
> > I did some investigation on roadrunner several years ago and determined
> that the user code issue coll/sync was attempting to fix was due to a bug
> in ob1/cksum (really can’t remember). coll/sync was simply masking a
> live-lock problem. I committed a workaround for the bug in r26575 (
> https://github.com/open-mpi/ompi/commit/59e529cf1dfe986e40d14ec4d2a2e5
> ef0cea5e35) and tested it with the user code. After this change the user
> code ran fine without coll/sync. Since lanl no longer had any users of
> coll/sync we stopped supporting it.
> >
> >> I solved this back then by introducing the “sync” component in
> ompi/mca/coll, which injected a barrier operation every N collectives. You
> could even “tune” it by doing the injection for only specific collectives.
> >>
> >> However, I can no longer find that component in the code base - I find
> it in the 1.6 series, but someone removed it during the 1.7 series.
> >>
> >> Can someone tell me why this was done??? Is there any reason not to
> bring it back? It solves a very real, not uncommon, problem.
> >> Ralph
> >
> > This was discussed during one (or several) tel-cons years ago. We agreed
> to kill it and bring it back if there is 1) a use case, and 2) someone is
> willing to support it. See https://github.com/open-mpi/ompi/commit/
> 5451ee46bd6fcdec002b333474dec919475d2d62 .
> >
> > Can you link the user email?
> >
> > -Nathan
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
