[OMPI devel] Open MPI 4.1.6rc3 posted

2023-09-25 Thread Barrett, Brian via devel
Open MPI 4.1.6rc3 is posted at https://www.open-mpi.org/software/ompi/v4.1/.  
Assuming MTT looks good tonight and there are no other regressions found, we’re 
going to ship 4.1.6 later this week.

Brian


[OMPI devel] Open MPI 4.1.5rc4 release candidate posted

2023-02-17 Thread Barrett, Brian via devel
The next (and hopefully last) release candidate for the 4.1.5 release is posted 
at https://www.open-mpi.org/software/ompi/v4.1/.

Significant changes since the 2nd release candidate (we skipped the 3rd release 
candidate for logistical reasons) are:

  * Fixed a crash in the rdma osc component
  * Fixed a compilation issue in the OFI MTL when Open MPI is compiled with 
CUDA support
  * Updated PMIx to v3.2.4 (which includes a number of bug fixes not called out 
here)
  * UCC and UCX bug fixes.

We would welcome any testing and feedback; barring any negative feedback we 
intend to ship 4.1.5 late next week.

Thanks,

Brian & Jeff



[OMPI devel] Open MPI 4.1.5rc2 posted

2022-12-10 Thread Barrett, Brian via devel
Open MPI 4.1.5rc2 is now available at 
https://www.open-mpi.org/software/ompi/v4.1/.  Barring any negative feedback, 
we're intending to release the 4.1.5 at the end of this week, so please give 
things a go.

Thanks,

Brian & Jeff



Re: [OMPI devel] Open MPI v5.0.0rc9 is available for testing

2022-11-17 Thread Barrett, Brian via devel
--enable-mca-no-build=io-romio341 should still work.  Or just 
--disable-io-romio.
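
For example, either of the following should do it (a minimal sketch; the
install prefix is just a placeholder):

  ./configure --prefix=$HOME/ompi --enable-mca-no-build=io-romio341
  # or
  ./configure --prefix=$HOME/ompi --disable-io-romio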

No comment on the RHEL7 part; that's pretty old, but I don't think we've 
officially said it is too old.  It's probably worth filing a ticket so that we 
can run it to ground before the 5.0 release.  Oddly, CI continues to pass on 
RHEL7 in AWS.  I'm not sure what we've done to cause that, but that's also 
worth investigating.

Brian

On 11/14/22, 10:32 PM, "devel on behalf of Gilles Gouaillardet via devel" wrote:




Folks,


I tried to build on a RHEL7-like system, and it fails at make time because
ROMIO requires stdatomic.h

(it seems this file is only available from GCC 4.9)


Are we supposed to be able to build Open MPI 5 with GCC 4.8 (e.g. stock
RHEL7 compiler)?


--enable-mca-no-build=io-romio314 cannot help since ROMIO has been moved
to 3rd-party

Is there another workaround? If not, should we add something like
--enable-3rd-party-no-build=romio?


Cheers,


Gilles

On 10/22/2022 6:23 AM, Austen W Lauria via devel wrote:
> Open MPI v5.0.0rc9 is now available for testing at
> https://www.open-mpi.org/software/ompi/v5.0/.
>
> Please test, and send feedback either via the user mailing lists, or
> create an issue at https://github.com/open-mpi/ompi/issues/.
>
> See https://docs.open-mpi.org/en/v5.0.x/news/news-v5.0.x.html for a list
> of changes since rc8.
>
> Thank you,
> v5.0 Release Managers




[OMPI devel] Open MPI Submodule changes in main branch

2022-09-15 Thread Barrett, Brian via devel
Hi all -

tl;dr: Please run "git submodule update --init --recursive" the next time you 
update your checkout of the main branch.
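
For example (a minimal sketch; the clone URL and directory are just the usual
defaults):

  # fresh clone: pull all submodules up front
  git clone --recursive https://github.com/open-mpi/ompi.git
  cd ompi

  # existing checkout of main: update, then pull in the new submodule
  git checkout main && git pull
  git submodule update --init --recursive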

This morning, we merged a change that adds another submodule to the main 
branch.  This submodule (named oac, because we suck at naming) will contain a 
set of Autoconf macros that are shared between Open MPI, OpenPMIx, and PRRTE.  
Ralph and I have been hand-synchronizing macros, and it was getting too much 
for us to continue doing so.

In the short term, the only change is another submodule to make sure you 
update.  We're creating compatibility macros so that all existing macros in the 
OMPI tree will work without more changes.  Hopefully, this will help avoid some 
debugging issues that some of us have had as we're trying to keep things sane 
across the three projects.

Thanks,

Brian



[OMPI devel] Component configure change proposal

2022-09-06 Thread Barrett, Brian via devel
Hi all -



I filed https://github.com/open-mpi/ompi/pull/10769 this morning, proposing to 
remove two capabilities from OMPI's configure script:



  1.  The ability to take an OMPI tarball, drop in a component source tree, and 
then run configure to build OMPI with that component.
  2.  The ability to use full configure scripts (instead of the configure.m4 
stubs that most components use today).



Note that both capabilities have been broken for at least 3 months in main, 
although possibly only partially, in a way that may still produce a working 
build (but with many shell errors during configure).  Since we don’t have good 
tests and have provably broken this path recently, we should remove it unless 
there’s a clear user today.  So this is really a call for any users of this 
functionality to identify themselves.  If I don’t hear back by Sept 19, I will 
consider silence to be assent.



Brian


[OMPI devel] Open MPI 4.1.4rc1 posted

2022-04-27 Thread Barrett, Brian via devel
Open MPI 4.1.4rc1 has been posted.  The significant difference from v4.1.3 is 
the addition of the UCC collectives component.  Ideally, we would like to 
release 4.1.4 by the end of the month, although the end of next week might be 
more likely.

Brian & Jeff



[OMPI devel] Master tree removed from nightly tarball archive

2022-04-18 Thread Barrett, Brian via devel
All -

A week late, but I removed the (now 3 weeks out of date) master tree from the 
archive of nightly tarballs.  Theoretically, this means that MTT tests against 
master (instead of main) will now start failing due to download failures.  
Please check your MTT runs and update any configs to test main instead of 
master.

Thanks,

Brian



Re: [OMPI devel] Script-based wrapper compilers

2022-04-05 Thread Barrett, Brian via devel
Thanks for the note.  I originally wrote the script wrapper compilers to make 
it easy to build the wrappers in a cross-compile environment.  With OMPI 5.0, 
mpicc shouldn't even have a dependency on libopen-pal.so, so we've gotten to 
the point where the right answer is to do what Fujitsu does.  But I ended up 
fixing the script wrapper compilers anyway, so they'll live on at least through 
the 5.0 series.

Brian

On 4/4/22, 9:29 PM, "devel on behalf of t-kawashima--- via devel" wrote:




Jeff, Gilles,

I missed this mail thread.

As Gilles said, Fujitsu MPI provides both native and cross compilers for 
Fujitsu AArch64 machines (including Fugaku).  However, we don't use the 
script-based wrapper compilers.  We prepare AArch64 binaries and an x86_64 
opal_wrapper command.  Installing the AArch64 libraries together with the 
x86_64 opal_wrapper and writing the wrapper-data.txt files allows 
cross-compiling AArch64 MPI programs on x86_64.

Thanks,
Takahiro Kawashima,
Fujitsu

> Jeff,
>
> Cross compilation is the recommended way on Fugaku.
> In all fairness, even if Fujitsu MPI is based on Open MPI, they built the
> MPI wrappers (that invoke the cross compilers) on top of opal (read: not the
> scripts).
>
> Cheers,
>
> Gilles
>
> On Fri, Mar 25, 2022 at 1:06 AM Jeff Squyres (jsquyres) wrote:
>
> > Gilles --
> >
> > Do you know if anyone is actually cross compiling?  I agree that this is
> > in the "nice to have" category, but it is costing Brian time -- if no 
one
> > is using this functionality, it's not worth the time.  If people are 
using
> > this functionality, then it's potentially worth the time.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > 
> > From: devel  on behalf of Gilles
> > Gouaillardet via devel 
> > Sent: Wednesday, March 23, 2022 10:28 PM
> > To: Open MPI Developers
> > Cc: Gilles Gouaillardet
> > Subject: Re: [OMPI devel] Script-based wrapper compilers
> >
> > Brian,
> >
> > My 0.02 US$
> >
> > Script based wrapper compilers are very useful when cross compiling,
> > so ideally, they should be maintained.
> >
> > Cheers,
> >
> > Gilles
> >
> > On Thu, Mar 24, 2022 at 11:18 AM Barrett, Brian via devel
> > <devel@lists.open-mpi.org> wrote:
> > Does anyone still use the script based wrapper compilers?  I have been
> > working on fixing a number of static library compile issues caused by us
> > historically not having been great about tracking library dependencies and
> > the OMPI/PMIx/PRRTE split.  Part of this is some fairly significant
> > modifications to the wrapper compilers (here's the PMIx version:
> > https://github.com/openpmix/openpmix/commit/e15de4b52f2d331297bbca31beb54b5a377557bc).
> > It would be easiest to just remove the script based wrapper compilers, but
> > I'll update them if someone uses them.
> >
> > Thanks,
> >
> > Brian



[OMPI devel] Open MPI 4.1.3rc2 posted

2022-03-30 Thread Barrett, Brian via devel
The second release candidate of Open MPI 4.1.3, a bug fix release, has been 
posted on the Open MPI web site.  This RC fixes a potential crash when using 
the smcuda transport mechanism and cleans up some documentation.  Assuming 
there are no major bugs found, we plan on releasing tomorrow, March 31.

https://www.open-mpi.org/software/ompi/v4.1/

Thanks!

Brian & Jeff



[OMPI devel] Script-based wrapper compilers

2022-03-23 Thread Barrett, Brian via devel
Does anyone still use the script based wrapper compilers?  I have been working 
on fixing a number of static library compile issues caused by us historically 
not having been great about tracking library dependencies and the 
OMPI/PMIx/PRRTE split.  Part of this is some fairly significant modifications 
to the wrapper compilers (here's the PMIx version: 
https://github.com/openpmix/openpmix/commit/e15de4b52f2d331297bbca31beb54b5a377557bc).
  It would be easiest to just remove the script based wrapper compilers, but 
I'll update them if someone uses them.

Thanks,

Brian



[OMPI devel] Open MPI 4.1.3rc1 posted

2022-03-17 Thread Barrett, Brian via devel
The first release candidate of Open MPI 4.1.3 (4.1.3rc1), a bug fix release, 
has been posted on the Open MPI web site.  Assuming there are no regressions 
from 4.1.2 found in the next couple of days, we plan on releasing 4.1.3 at the 
end of next week.  Please file issues on GitHub for any new issues you find in 
testing.

https://www.open-mpi.org/software/ompi/v4.1/

Thanks!

Brian & Jeff



Re: [OMPI devel] Announcing Open MPI v5.0.0rc2

2022-01-01 Thread Barrett, Brian via devel
Marco -

There are some patches that haven't made it to the 5.0 branch to make this 
behavior better.  I didn't get a chance to back port them before the holiday 
break, but they will be in the next RC.  That said, the issue below is a 
warning, not an error, so you should still end up with a build that works (with 
an included PMIx).  The issue is that the pmix.pc pkg-config file can't be 
found, so we have trouble guessing which libraries are dependencies of PMIx, 
which is a potential problem in complicated builds with static libraries.

Brian


From: devel  on behalf of Marco Atzeri via 
devel 
Sent: Wednesday, December 22, 2021 9:09 AM
To: devel@lists.open-mpi.org
Cc: Marco Atzeri
Subject: RE: [EXTERNAL] [OMPI devel] Announcing Open MPI v5.0.0rc2




On 18.10.2021 20:39, Austen W Lauria via devel wrote:
> The second release candidate for the Open MPI v5.0.0 release is posted
> at: https://www.open-mpi.org/software/ompi/v5.0/
> 


Question:
Is there an easy way to configure and build the 3rd-party packages included
in the package source for a simple build?


configure: = done with 3rd-party/openpmix configure =
checking for pmix.h... no
configure: Looking for pc file for pmix
Package pmix was not found in the pkg-config search path.
Perhaps you should add the directory containing `pmix.pc'
to the PKG_CONFIG_PATH environment variable
Package 'pmix', required by 'virtual:world', not found
configure: WARNING: Could not find viable pmix.pc


Regards
Marco

Cygwin package maintainer


Re: [OMPI devel] External PMIx/PRRTE and "make dist"

2021-11-12 Thread Barrett, Brian via devel
That error message is coming from autogen.  If you specify "--no-3rdparty 
pmix", you can avoid trying to run autogen in the pmix directory.  Otherwise, 
you still have to have the pmix source to run autogen.
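
For example (a sketch; the flag spelling is as above, and the external PMIx 
prefix is taken from your mail):

  # skip autogen for the bundled PMIx, then point configure at the external one
  ./autogen.pl --no-3rdparty pmix
  ./configure --with-pmix=/usr/local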

Brian

On 11/12/21, 1:41 PM, "Heinz, Michael William" wrote:




Brian, just a heads up - I still see

=== Submodule: 3rd-party/openpmix
==> ERROR: Missing

The submodule "3rd-party/openpmix" is missing.

Perhaps you forgot to "git clone --recursive ...", or you need to
"git submodule update --init --recursive"...?

Even though I specified --with-pmix=/usr/local.

-----Original Message-----
    From: devel  On Behalf Of Barrett, Brian 
via devel
Sent: Friday, November 12, 2021 3:35 PM
To: Open MPI Developers 
Cc: Barrett, Brian 
Subject: [OMPI devel] External PMIx/PRRTE and "make dist"

Just a quick heads up that I just committed 
https://github.com/open-mpi/ompi/pull/9649, which changes Open MPI's behavior 
around PMIX/PRRTE and external builds.  
Previously, the configure script for the internally packaged PMIX and PRRTE 
were always run.  Now, if the user specifies 
--with-{pmix,prrte}={external,[path]}, Open MPI's configure will not run the 
sub-configure for the package that the user has asked to be an external 
dependency.  This has the side-effect of breaking "make dist" in those 
situations.  So, going forward, if you add --with-pmix=external or 
--with-prrte=external on master (and likely soon 5.0), you will *not* be able 
to successfully run "make dist" in that build tree.  You can run Open MPI's 
configure with no pmix/prrte arguments if you need to run "make dist".  Given 
the general split in use cases between where you'd want to link against an 
external PMIX/PRRTE and where you would want to build a distribution tarball, 
this is not anticipated to be a problem in practice.

Thanks,

Brian




[OMPI devel] External PMIx/PRRTE and "make dist"

2021-11-12 Thread Barrett, Brian via devel
Just a quick heads up that I just committed 
https://github.com/open-mpi/ompi/pull/9649, which changes Open MPI's behavior 
around PMIX/PRRTE and external builds.  Previously, the configure script for 
the internally packaged PMIX and PRRTE were always run.  Now, if the user 
specifies --with-{pmix,prrte}={external,[path]}, Open MPI's configure will not 
run the sub-configure for the package that the user has asked to be an external 
dependency.  This has the side-effect of breaking "make dist" in those 
situations.  So, going forward, if you add --with-pmix=external or 
--with-prrte=external on master (and likely soon 5.0), you will *not* be able 
to successfully run "make dist" in that build tree.  You can run Open MPI's 
configure with no pmix/prrte arguments if you need to run "make dist".  Given 
the general split in use cases between where you'd want to link against an 
external PMIX/PRRTE and where you would want to build a distribution tarball, 
this is not anticipated to be a problem in practice.
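
For example (a minimal sketch of the two cases; any other arguments are 
omitted):

  # external PMIx/PRRTE: fine for building and installing, but "make dist"
  # will not work in this tree
  ./configure --with-pmix=external --with-prrte=external

  # building a distribution tarball: let configure run the bundled packages
  ./configure
  make dist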

Thanks,

Brian



Re: [OMPI devel] Question regarding the completion of btl_flush

2021-09-29 Thread Barrett, Brian via devel
Thanks, George.  I think we’re on the same page.  I’d love for Nathan to jump 
in here, since I’m guessing he has opinions on this subject.  Once we reach 
consensus, Wei or I will submit a PR to clarify the BTL documentation.

Brian

On 9/29/21, 7:40 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:


Brian,

My comment was mainly about the BTL code.  MPI_Win_fence does not require 
remote completion: the call only guarantees that all outbound operations have 
been locally completed, and that all inbound operations from other sources on 
the process are also complete.  I agree with you on the Win_flush 
implementation we have; it only guarantees the first part, and assumes the 
barrier will drain the network of all pending messages.

You're right: the current implementation assumes that the MPI_Barrier, having 
more synchronizing behavior and requiring more messages to be exchanged between 
the participants, increases the likelihood that, even with overtaking, all 
pending messages have reached their destination.

  George.


On Tue, Sep 28, 2021 at 10:36 PM Barrett, Brian <bbarr...@amazon.com> wrote:
George –

Is your comment about the code path referring to the BTL code or the OSC RDMA 
code?  The OSC code seems to expect remote completion, at least for the fence 
operation.  Fence is implemented as a btl flush followed by a window-wide 
barrier.  There’s no ordering specified between the RDMA operations completed 
by the flush and the send messages in the collective, so overtaking is 
possible.  Given that the BTL and the UCX PML (or OFI MTL or whatever) are 
likely using different QPs, ordering of the packets is doubtful.

Like you, we saw that many BTLs appear to only guarantee local completion with 
flush().  So the question is which one is broken (and then we’ll have to figure 
out how to fix…).

Brian

On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:


Based on my high-level understanding of the code path and according to the UCX 
implementation of the flush, the required level of completion is local.

  George.


On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel 
<devel@lists.open-mpi.org> wrote:
Dear All,

I have a question regarding the completion semantics of btl_flush,

In opal/mca/btl/btl.h,

https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146

the comment about btl_flush says:


* This function returns when all outstanding RDMA (put, get, atomic) operations

* that were started prior to the flush call have completed.

However, it is not clear to me what “complete” actually means.  E.g., does it 
mean local completion (the action on the RDMA initiator side has completed), or 
does it mean “remote completion” (the action on the RDMA remote side has 
completed)?  We are interested in this because for many RDMA BTLs, “local 
completion” does not equal “remote completion”.

From the way btl_flush is used in osc/rdma’s fence operation (which is a call 
to flush followed by an MPI_Barrier), we think that btl_flush should mean 
remote completion, but want to get clarification from the community.

Sincerely,

Wei Zhang



Re: [OMPI devel] Question regarding the completion of btl_flush

2021-09-28 Thread Barrett, Brian via devel
George –

Is your comment about the code path referring to the BTL code or the OSC RDMA 
code?  The OSC code seems to expect remote completion, at least for the fence 
operation.  Fence is implemented as a btl flush followed by a window-wide 
barrier.  There’s no ordering specified between the RDMA operations completed 
by the flush and the send messages in the collective, so overtaking is 
possible.  Given that the BTL and the UCX PML (or OFI MTL or whatever) are 
likely using different QPs, ordering of the packets is doubtful.

Like you, we saw that many BTLs appear to only guarantee local completion with 
flush().  So the question is which one is broken (and then we’ll have to figure 
out how to fix…).

Brian

On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:


Based on my high-level understanding of the code path and according to the UCX 
implementation of the flush, the required level of completion is local.

  George.


On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel 
<devel@lists.open-mpi.org> wrote:
Dear All,

I have a question regarding the completion semantics of btl_flush,

In opal/mca/btl/btl.h,

https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146

the comment about btl_flush says:


* This function returns when all outstanding RDMA (put, get, atomic) operations

* that were started prior to the flush call have completed.

However, it is not clear to me what “complete” actually means.  E.g., does it 
mean local completion (the action on the RDMA initiator side has completed), or 
does it mean “remote completion” (the action on the RDMA remote side has 
completed)?  We are interested in this because for many RDMA BTLs, “local 
completion” does not equal “remote completion”.

From the way btl_flush is used in osc/rdma’s fence operation (which is a call 
to flush followed by an MPI_Barrier), we think that btl_flush should mean 
remote completion, but want to get clarification from the community.

Sincerely,

Wei Zhang



[OMPI devel] Open MPI 4.1.2rc1 posted

2021-09-24 Thread Barrett, Brian via devel
The first release candidate for Open MPI 4.1.2 has been posted at 
https://www.open-mpi.org/software/ompi/v4.1/.  Barring any major issues, we 
anticipate a short release candidate period before an official 4.1.2 release.

Major changes from 4.1.1 include:

- Fix oshmem_shmem_finalize() when main() returns non-zero value.
- Fix wrong affinity under LSF with the membind option.
- Fix count==0 cases in MPI_REDUCE and MPI_IREDUCE.
- Fix ssh launching on Bourne-flavored shells when the user has "set
  -u" set in their shell startup files.
- Correctly process 0 slots with the mpirun --host option.
- Ensure to unlink and rebind socket when the Open MPI session
  directory already exists.
- Fix a segv in mpirun --disable-dissable-map.
- Fix a potential hang in the memory hook handling.
- Slight performance improvement in MPI_WAITALL when running in
  MPI_THREAD_MULTIPLE.
- Fix hcoll datatype mapping.
- Correct some operations modifying MPI_Status.MPI_ERROR when it is
  disallowed by the MPI standard.
- UCX updates:
  - Fix datatype reference count issues.
  - Detach dynamic window memory when freeing a window.
  - Fix memory leak in datatype handling.
- Fix various atomic operations issues.
- mpirun: try to set the curses winsize to the pty of the spawned
  task.  Thanks to Stack Overflow user @Seriously for reporting the
  issue.
- PMIx updates:
  - Fix compatibility with external PMIx v4.x installations.
  - Fix handling of PMIx v3.x compiler/linker flags.  Thanks to Erik
Schnetter for reporting the issue.
  - Skip SLURM-provided PMIx detection when appropriate.  Thanks to
Alexander Grund for reporting the issue.
- Fix handling by C++ compilers when they #include the STL "<version>"
  header file, which ends up including Open MPI's text VERSION file
  (which is not C code).  Thanks to @srpgilles for reporting the
  issue.
- Fix MPI_Op support for MPI_LONG.
- Make the MPI C++ bindings library (libmpi_cxx) explicitly depend on
  the OPAL internal library (libopen-pal).  Thanks to Ye Luo for
  reporting the issue.
- Fix configure handling of "--with-libevent=/usr".
- Fix memory leak when opening Lustre files.  Thanks to Bert Wesarg
  for submitting the fix.
- Fix MPI_SENDRECV_REPLACE to correctly process datatype errors.
  Thanks to Lisandro Dalcin for reporting the issue.
- Fix MPI_SENDRECV_REPLACE to correctly handle large data.  Thanks
  Jakub Benda for reporting this issue and suggesting a fix.
- Add workaround for TCP "dropped connection" errors to drastically
  reduce the possibility of this happening.
- OMPIO updates:
  - Fix handling when AMODE is not set.  Thanks to Rainer Keller for
reporting the issue and supplying the fix.
  - Fix FBTL "posix" component linking issue.  Thanks for Honggang Li
for reporting the issue.
  - Fixed segv with MPI_FILE_GET_BYTE_OFFSET on 0-sized file view.
    Thanks to GitHub user @shanedsnyder for submitting the issue.
- OFI updates:
  - Add support for Libfabric memhooks monitoring.
  - Ensure that Cisco usNIC devices are never selected by the OFI
MTL.
  - Fix buffer overflow in OFI networking setup.  Thanks to Alexander
Grund for reporting the issue and supplying the fix.
- Fix SSEND on tag matching networks.
- Fix error handling in several MPI collectives.
- Fix the ordering of MPI_COMM_SPLIT_TYPE.  Thanks to Wolfgang
  Bangerth for raising the issue.
- No longer install the orted-mpir library (it's an internal / Libtool
  convenience library).  Thanks to Andrew Hesford for the fix.
- PSM2 updates:
  - Allow advanced users to disable PSM2 version checking.
  - Fix to allow non-default installation locations of psm2.h.



[OMPI devel] Configure --help results

2021-02-01 Thread Barrett, Brian via devel
Josh posted https://github.com/open-mpi/ompi/pull/8432 last week to propagate 
arguments from PMIx and PRRTE into the top level configure --help, and it 
probably deserves more discussion.

There are three patches, the last of which renames arguments so you could (for 
example) specify different --with-slurm arguments to PMIx, PRRTE, and Open MPI.  
I would strongly encourage us not to bring in that patch.  Ralph and I 
discussed that handling when we revamped the configure system last fall and the 
platform file was the only place we could think of where anything sane would 
happen in that case.  For the one-in-a-million situation, customers can always 
build external PMIx and PRRTE to achieve the same outcome.

The rest of this email focuses on what we want to have happen with configure 
--help.  By default, Autoconf’s generated configure --help provides the long 
format of help, with all arguments and descriptions, for the top level 
configure only.  The Autoconf way of handling sub-configures is to expect the 
user to know to specify “./configure --help=recursive” to see all the arguments 
for sub-configure scripts.  This does work correctly in Open MPI, but isn’t 
what users expect.  I spent some time looking, and there does not appear to be 
a sane way to make ./configure --help=recursive the default.  So that leaves us 
with three options:


  1.  Document (possibly at the end of the default --help output, definitely in 
the README) that you need to run --help=recursive to see the options for PMIx 
and PRRTE (see the example after this list).
  2.  Add “dummy” help options for the parameters from PMIx and PRRTE we think 
are worth exporting.  This is likely prone to bit rot.
  3.  Josh’s script to create a dummy help option for each argument in PMIx and 
PRRTE not in the top level configure.
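
As a reminder, the recursive help already works today; it's just not what users 
type by default (a minimal sketch, run from the top of a source tree):

  # default: top-level Open MPI options only
  ./configure --help

  # also show the options of the PMIx and PRRTE (and other) sub-configures
  ./configure --help=recursive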

While (3) bothers me from a code complexity standpoint, it is probably the 
easiest on our customers.  I do worry about setting expectations when those 
arguments don’t do anything because an external PMIx/PRRTE is being used, but 
that is probably worth the tradeoff.

Anyone else have thoughts or better ideas?

Brian


[OMPI devel] UCX and older hardware

2020-10-21 Thread Barrett, Brian via devel
UCX folks -

As part of https://github.com/open-mpi/ompi/issues/7968, a README item was 
added about the segfault in UCX for older IB hardware.  That note said the 
issue would be fixed in UCX 1.10.  Aurelien later added a note saying it was 
fixed in UCX 1.9.0-rc3.  Which version should be referenced in the README: 1.9 
or 1.10?  We are trying to get the documentation set for Open MPI 4.1 and 
master.

Thanks,

Brian and Jeff



Re: [OMPI devel] Which compiler versions to test?

2020-10-19 Thread Barrett, Brian via devel
Sorry, I keep forgetting to reply.  I think your list is a reasonable starting 
point.  I think AWS is still a little thin in its OS/compiler testing; we 
definitely have room to improve there.  We probably won't pick up the older 
Intel compilers, for various reasons.  We are running recent Intel Compilers 
with TCP and EFA (but obviously not usnic).  I think your list is pretty 
reasonable for compilers, and we should probably pick up more here along the 
way.  If I were you and had to prune, I'd probably prune the GCC 9 / Clang 9...

Brian

-----Original Message-----
From: devel  on behalf of "Jeff Squyres (jsquyres) via devel" 
Reply-To: Open MPI Developers 
Date: Thursday, October 8, 2020 at 9:38 AM
To: Open MPI Developers List 
Cc: Jeff Squyres 
Subject: [EXTERNAL] [OMPI devel] Which compiler versions to test?




Open question to the Open MPI dev community...

Over time, the size of my MTT cluster has been growing smaller (due to 
hardware failure, power budget restrictions, etc.).  This means that I have far 
fewer CPU cycles available for testing various compilers and configure CLI 
options than I used to.

What compilers does the community think are worthwhile to test these days?  
I generally have access to gcc/gfortran, clang, and some versions of the Intel 
compiler suite.

master, 4.0.x, and 4.1.x branches
- gcc 4.8.5 (i.e., the default gcc on RHEL 7.x)
- gcc 9.latest
- gcc 10.latest
- clang 9.0.latest
- clang 10.0.latest
- Intel 2017
- Intel 2019

(I don't have Intel 2018 or Intel 2020)

Is this sufficient?  Or is it worthwhile to test other versions of these 
compilers?

--
Jeff Squyres
jsquy...@cisco.com




[OMPI devel] Open MPI 3rd party packaging changes

2020-10-01 Thread Barrett, Brian via devel
All -

Only 6 months after I promised the code would be done, the changes we discussed 
in February around 3rd party packages (Libevent, HWLOC, PMIx, and PRRTE) are 
merged to master.  With these changes, Open MPI will prefer an external version 
of any of those packages if a "new enough" version is found already installed 
on the system.  New enough is no longer defined as "newer than the built-in 
package", but is defined as "as new as the oldest version we know works".  This 
may cause unexpected changes from using the built-in versions to the system 
versions for some packages, particularly libevent.
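
If you want to pin the behavior one way or the other, the usual configure 
arguments still apply (a sketch; the external/path forms are the ones already 
in use on this list, while "internal" for forcing the bundled copy is my 
assumption, so check ./configure --help for the exact spelling):

  # prefer pre-installed packages explicitly
  ./configure --with-libevent=/usr --with-hwloc=/usr --with-pmix=external --with-prrte=external

  # or force the bundled copies
  ./configure --with-pmix=internal --with-prrte=internal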

If you have Libevent and hwloc pre-installed, your total build times will go 
down a couple percent, so there's that :).

Brian



[OMPI devel] Libevent changes

2020-07-06 Thread Barrett, Brian via devel
https://github.com/open-mpi/ompi/pull/6784 went in while I was on vacation.  
I'm a little confused; I thought we no longer patched libevent locally?  This 
is certainly going to be a problem as we move to external dependencies; we 
won't have a way of pulling in this change (whether using the bundled libevent 
or not).  I don't think we should have this patch locally in master, because 
we're going to lose it in the next couple of weeks with the configure changes 
I'm hopefully near completing.

Brian



[OMPI devel] Open MPI release update

2020-06-15 Thread Barrett, Brian via devel
Greetings -

As you may know, Open MPI 5.0 is going to include an ambitious improvement in 
Open MPI's runtime system along with a number of performance improvements, and 
was targeted to ship this summer.  While we are still going to make those 
improvements to our runtime system, it is taking us longer than we anticipated 
and we want to make sure we get everything right before releasing Open MPI 5.0 
later this year.

To get many of the performance improvements to the MPI layer in your hands 
sooner, the Open MPI development team recently made the decision to start a 4.1 
release series.  Open MPI 4.1.0 will be based off of Open MPI 4.0.4, but with a 
number of significant performance improvements backported.  This includes 
improved collective routine performance, support in the OFI BTL for one-sided 
operations over the Libfabric software stack, MPI I/O improvements, and general 
performance improvements.  Nightly tarballs of the 4.1 branch are available at 
https://www.open-mpi.org/nightly/v4.1.x/ and we plan on releasing Open MPI 
4.1.0 in July of 2020.

Thank you,

The Open MPI Development Team



[OMPI devel] Jenkins / Web server outage tonight (Pacific Time)

2020-04-29 Thread Barrett, Brian via devel
Hi all -

As part of supporting Pandoc man pages, I'm going to update the base images 
used to run tests in AWS.  I usually screw this up once or twice, so expect 
Jenkins to be offline for an hour or two this evening, starting at 7:00pm 
Pacific Time.  At the same time, I am going to do some maintenance on the 
server in general, so the web site may be unavailable for 10-15 minutes while 
things reboot.

Brian 



Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-20 Thread Barrett, Brian via devel
Ok, that makes total sense.  I'm leaning towards us fixing this in the OFI MTL 
rather than making everyone load the topology.  I agree with you that it 
probably doesn't matter, but let's not create a corner case.  I'm also going to 
follow up with the dev who wrote this code, but my guess is that we should add 
a note in the header docs somewhere.  We'll take that action item.

Brian

From: Ralph Castain 
Date: Friday, March 20, 2020 at 9:46 AM
To: "Barrett, Brian" 
Cc: OpenMPI Devel 
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc


If you call "hwloc_topology_load", then hwloc merrily does its discovery and 
slams many-core systems. If you call "opal_hwloc_get_topology", then that is 
fine - it checks if we already have it, tries to get it from PMIx (using shared 
mem for hwloc 2.x), and only does the discovery if no other method is available.

IIRC, we might have decided to let those who needed the topology call 
"opal_hwloc_get_topology" to ensure the topo was available so that we don't 
load it unless someone actually needs it. However, I get the sense we wound up 
always needing the topology, so it was kind of a moot point.


Given that all we do is set up a shmem link (since hwloc 2 is now widely 
available), it shouldn't matter.  However, if you want to stick with the "only 
get it if needed" approach, then just add a call to "opal_hwloc_get_topology" 
prior to using the topology and close that PR as "unneeded".




On Mar 20, 2020, at 9:35 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

But that does raise the question: should we call get_topology() for belt and 
suspenders in OFI?  Or will that cause the concerns you raised at the start of 
this thread?

Brian

From: Ralph Castain <r...@open-mpi.org>
Date: Friday, March 20, 2020 at 9:31 AM
To: OpenMPI Devel <devel@lists.open-mpi.org>
Cc: "Barrett, Brian" <bbarr...@amazon.com>
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc


https://github.com/open-mpi/ompi/pull/7547 fixes it and has an explanation as 
to why it wasn't catching us elsewhere in the MPI code




On Mar 20, 2020, at 9:22 AM, Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:

Odd - the topology object gets filled in during init, well before the fence (as 
it doesn't need the fence, being a purely local op). Let me take a look




On Mar 20, 2020, at 9:15 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

PMIx folks -

When using mpirun for launching, it looks like opal_hwloc_topology isn't filled 
in at the point where we need the information (mtl_ofi_component_init()).  This 
would end up being before the modex fence, since the goal is to figure out 
which address the process should publish.  I'm not sure that makes a difference 
here, but wanted to figure out if this was expected and, if so, if we had 
options for getting the right data from PMIx early enough in the process.  
Sorry, this is part of the runtime changes I haven't been following closely 
enough.

Brian

-----Original Message-----
From: devel <devel-boun...@lists.open-mpi.org> on behalf of Ralph Castain via 
devel <devel@lists.open-mpi.org>
Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
Date: Wednesday, March 18, 2020 at 2:08 PM
To: "Zhang, William" <wilzh...@amazon.com>
Cc: Ralph Castain <r...@open-mpi.org>, OpenMPI Devel <devel@lists.open-mpi.org>
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc




  Excellent - thanks! Now if only the OpenMP people would be so 
reasonable...sigh.




On Mar 18, 2020, at 10:26 AM, Zhang, William <wilzh...@amazon.com> wrote:

Hello,

We're getting the topology info using the opal_hwloc_topology object; we won't 
be doing our own discovery.

William

On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:




 Hey folks

 I saw the referenced "new feature" on the v5 feature spreadsheet and wanted to 
ask a quick question. Is the OFI MTL going to be doing its own hwloc topology 
discovery for this feature? Or is it going to access the topology info via PMIx 
and the OPAL hwloc abstraction?

 I ask because we know that having every proc do its own topology discovery is 
a major problem on large-core systems (e.g., KNL or Power9). If OFI is going to 
do an hwloc discovery operation, then we need to ensure this doesn't happen 
unless specifically requested by a user willing to pay that price (and it was 
significant).

 Can someone from Amazon (as the item is assigned to them) please clarify?
 Ralph



Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-20 Thread Barrett, Brian via devel
But that does raise the question: should we call get_topology() for belt and 
suspenders in OFI?  Or will that cause the concerns you raised at the start of 
this thread?

Brian

From: Ralph Castain 
Date: Friday, March 20, 2020 at 9:31 AM
To: OpenMPI Devel 
Cc: "Barrett, Brian" 
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc


https://github.com/open-mpi/ompi/pull/7547 fixes it and has an explanation as 
to why it wasn't catching us elsewhere in the MPI code



On Mar 20, 2020, at 9:22 AM, Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:

Odd - the topology object gets filled in during init, well before the fence (as 
it doesn't need the fence, being a purely local op). Let me take a look



On Mar 20, 2020, at 9:15 AM, Barrett, Brian <bbarr...@amazon.com> wrote:

PMIx folks -

When using mpirun for launching, it looks like opal_hwloc_topology isn't filled 
in at the point where we need the information (mtl_ofi_component_init()).  This 
would end up being before the modex fence, since the goal is to figure out 
which address the process should publish.  I'm not sure that makes a difference 
here, but wanted to figure out if this was expected and, if so, if we had 
options for getting the right data from PMIx early enough in the process.  
Sorry, this is part of the runtime changes I haven't been following closely 
enough.

Brian

-----Original Message-----
From: devel <devel-boun...@lists.open-mpi.org> on behalf of Ralph Castain via 
devel <devel@lists.open-mpi.org>
Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
Date: Wednesday, March 18, 2020 at 2:08 PM
To: "Zhang, William" <wilzh...@amazon.com>
Cc: Ralph Castain <r...@open-mpi.org>, OpenMPI Devel <devel@lists.open-mpi.org>
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc




  Excellent - thanks! Now if only the OpenMP people would be so 
reasonable...sigh.



On Mar 18, 2020, at 10:26 AM, Zhang, William <wilzh...@amazon.com> wrote:

Hello,

We're getting the topology info using the opal_hwloc_topology object; we won't 
be doing our own discovery.

William

On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:




 Hey folks

 I saw the referenced "new feature" on the v5 feature spreadsheet and wanted to 
ask a quick question. Is the OFI MTL going to be doing its own hwloc topology 
discovery for this feature? Or is it going to access the topology info via PMIx 
and the OPAL hwloc abstraction?

 I ask because we know that having every proc do its own topology discovery is 
a major problem on large-core systems (e.g., KNL or Power9). If OFI is going to 
do an hwloc discovery operation, then we need to ensure this doesn't happen 
unless specifically requested by a user willing to pay that price (and it was 
significant).

 Can someone from Amazon (as the item is assigned to them) please clarify?
 Ralph










Re: [OMPI devel] Add multi nic support for ofi MTL using hwloc

2020-03-20 Thread Barrett, Brian via devel
PMIx folks -

When using mpirun for launching, it looks like opal_hwloc_topology isn't filled 
in at the point where we need the information (mtl_ofi_component_init()).  This 
would end up being before the modex fence, since the goal is to figure out 
which address the process should publish.  I'm not sure that makes a difference 
here, but wanted to figure out if this was expected and, if so, if we had 
options for getting the right data from PMIx early enough in the process.  
Sorry, this is part of the runtime changes I haven't been following closely 
enough.

Brian

-----Original Message-----
From: devel  on behalf of Ralph Castain via 
devel 
Reply-To: Open MPI Developers 
Date: Wednesday, March 18, 2020 at 2:08 PM
To: "Zhang, William" 
Cc: Ralph Castain , OpenMPI Devel 
Subject: RE: [EXTERNAL] [OMPI devel] Add multi nic support for ofi MTL using 
hwloc




Excellent - thanks! Now if only the OpenMP people would be so 
reasonable...sigh.


> On Mar 18, 2020, at 10:26 AM, Zhang, William  wrote:
>
> Hello,
>
> We're getting the topology info using the opal_hwloc_topology object; we
> won't be doing our own discovery.
>
> William
>
> On 3/17/20, 11:54 PM, "devel on behalf of Ralph Castain via devel" wrote:
>
>
>
>
>Hey folks
>
> I saw the referenced "new feature" on the v5 feature spreadsheet and
> wanted to ask a quick question. Is the OFI MTL going to be doing its own hwloc
> topology discovery for this feature? Or is it going to access the topology info
> via PMIx and the OPAL hwloc abstraction?
>
> I ask because we know that having every proc do its own topology
> discovery is a major problem on large-core systems (e.g., KNL or Power9). If
> OFI is going to do an hwloc discovery operation, then we need to ensure this
> doesn't happen unless specifically requested by a user willing to pay that
> price (and it was significant).
>
> Can someone from Amazon (as the item is assigned to them) please
> clarify?
> Ralph
>
>
>
>






Re: [OMPI devel] Reachable framework integration

2020-01-02 Thread Barrett, Brian via devel
Ralph -

Are you looking for the "best" single connection between two hosts, or the best 
set of pairings, or even "just any pairing that works"?  The TCP BTL code is 
complicated because it's looking for the best overall set of pairings, to 
maximize the number (and quality) of links between the two ranks.  This 
(theoretically) maximizes throughput.  The usage model for "I just want one 
connection that moves some bits" is probably a lot simpler.  I don't think you 
need to build the entire graph in that case.

Some of the complexity was also that we were trying to reuse as much code as 
possible between TCP and usnic, while making minimum changes to usnic.  Maybe 
not the best option, but AWS doesn't have a way of testing on usnic, so...

Brian

From: devel  on behalf of Ralph Castain via 
devel 
Reply-To: Open MPI Developers 
Date: Thursday, January 2, 2020 at 12:41 PM
To: George Bosilca 
Cc: Ralph Castain , OpenMPI Devel 
Subject: Re: [OMPI devel] Reachable framework integration

Hmmm...pretty complex code in there. Looks like it has to be "replicated" for 
reuse as functions are passing in btl_tcp specific structs. Is it worth 
developing an abstracted version of such functions as 
mca_btl_tcp_proc_create_interface_graph and 
mca_btl_tcp_proc_store_matched_interfaces?




On Jan 2, 2020, at 9:35 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

Ralph,

I think the first use is still pending reviews (more precisely my review) at 
https://github.com/open-mpi/ompi/pull/7134.

  George.


On Wed, Jan 1, 2020 at 9:53 PM Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:
Hey folks

I can't find where the opal/reachable framework is being used in OMPI. I would 
like to utilize it in the PRRTE oob/tcp component, but need some guidance on 
how to do so, or pointers to an example.

Ralph




Re: [OMPI devel] Reachable framework integration

2020-01-02 Thread Barrett, Brian via devel
George, this is a great reminder to please review that code.  William's being 
way too polite, and we know this helps with some of our problems :).

Brian

From: devel  on behalf of George Bosilca via 
devel 
Reply-To: Open MPI Developers 
Date: Thursday, January 2, 2020 at 9:48 AM
To: Open MPI Developers 
Cc: George Bosilca 
Subject: Re: [OMPI devel] Reachable framework integration

Ralph,

I think the first use is still pending reviews (more precisely my review) at 
https://github.com/open-mpi/ompi/pull/7134.

  George.


On Wed, Jan 1, 2020 at 9:53 PM Ralph Castain via devel 
<devel@lists.open-mpi.org> wrote:
Hey folks

I can't find where the opal/reachable framework is being used in OMPI. I would 
like to utilize it in the PRRTE oob/tcp component, but need some guidance on 
how to do so, or pointers to an example.

Ralph



[OMPI devel] Open MPI 3.0.5 and 3.1.5 release plans

2019-11-05 Thread Barrett, Brian via devel
We have posted release candidates for both Open MPI 3.0.5 and 3.1.5, with plans 
to release on Friday, Nov 8, if we do not hear any new issues compared to the 
previous releases.  Please give the releases a test, if you are so inclined.

  https://www.open-mpi.org/software/ompi/v3.0/
  https://www.open-mpi.org/software/ompi/v3.1/

3.1.5 Release notes are below.  3.0.5 is very similar.

- Fix OMPIO issue limiting file reads/writes to 2GB.  Thanks to
  Richard Warren for reporting the issue.
- At run time, automatically disable Linux cross-memory attach (CMA)
  for vader BTL (shared memory) copies when running in user namespaces
  (i.e., containers).  Many thanks to Adrian Reber for raising the
  issue and providing the fix.
- Sending very large MPI messages using the ofi MTL will fail with
  some of the underlying Libfabric transports (e.g., PSM2 with
  messages >=4GB, verbs with messages >=2GB).  Prior version of Open
  MPI failed silently; this version of Open MPI invokes the
  appropriate MPI error handler upon failure.  See
  https://github.com/open-mpi/ompi/issues/7058 for more details.
  Thanks to Emmanuel Thomé for raising the issue.
- Fix case where 0-extent datatypes might be eliminated during
  optimization.  Thanks to Github user @tjahns for raising the issue.
- Ensure that the MPIR_Breakpoint symbol is not optimized out on
  problematic platforms.
- Fix MPI one-sided 32 bit atomic support.
- Fix OMPIO offset calculations with SEEK_END and SEEK_CUR in
  MPI_FILE_GET_POSITION.  Thanks to Wei-keng Liao for raising the
  issue.
- Add "naive" regx component that will never fail, no matter how
  esoteric the hostnames are.
- Fix corner case for datatype extent computations.  Thanks to David
  Dickenson for raising the issue.
- Allow individual jobs to set their map/rank/bind policies when
  running LSF.  Thanks to Nick R. Papior for assistance in solving the
  issue.
- Fix MPI buffered sends with the "cm" PML.
- Properly propagate errors to avoid deadlocks in MPI one-sided operations.
- Update to PMIx v2.2.3.
- Fix data corruption in non-contiguous MPI accumulates over UCX.
- Fix ssh-based tree-based spawning at scale.  Many thanks to Github
  user @zrss for the report and diagnosis.
- Fix the Open MPI RPM spec file to not abort when grep fails.  Thanks
  to Daniel Letai for bringing this to our attention.
- Handle new SLURM CLI options (SLURM 19 deprecated some options that
  Open MPI was using).  Thanks to Jordan Hayes for the report and the
  initial fix.
- OMPI: fix division by zero with an empty file view.
- Also handle shmat()/shmdt() memory patching with OS-bypass networks.
- Add support for unwinding info to all files that are present in the
  stack starting from MPI_Init, which is helpful with parallel
  debuggers.  Thanks to James Clark for the report and initial fix.
- Fixed inadvertent use of bitwise operators in the MPI C++ bindings
  header files.  Thanks to Bert Wesarg for the report and the fix.

Thanks,

Brian & Jeff



[OMPI devel] Open MPI 3.0.4 and 3.1.4 Now Available

2019-04-15 Thread Barrett, Brian via devel
The Open MPI Team, representing a consortium of research, academic, and 
industry partners, is pleased to announce the release of Open MPI 3.0.4 and 
3.1.4.

Both 3.0.4 and 3.1.4 are bug fix releases with largely the same set of bug 
fixes.  All users of both series are encouraged to upgrade when possible.

Version 3.0.4 can be downloaded from the main Open MPI web site: 
https://www.open-mpi.org/software/ompi/v3.0/
Version 3.1.4 can be downloaded from the main Open MPI web site: 
https://www.open-mpi.org/software/ompi/v3.1/


Major Changes in both releases:

- Fix compile error when configured with --enable-mpi-java and
  --with-devel-headers.  Thanks to @g-raffy for reporting the issue.
- Fix possible floating point rounding and division issues in OMPIO
  which led to crashes and/or data corruption with very large data.
  Thanks to Axel Huebl and René Widera for identifying the issue,
  supplying and testing the fix (** also appeared: v3.0.4).
- Use static_cast<> in mpi.h where appropriate.  Thanks to @shadow-fx
  for identifying the issue.
- Fix datatype issue with RMA accumulate.  Thanks to Jeff Hammond for
  raising the issue.
- Fix RMA accumulate of non-predefined datatypes with predefined
  operators.  Thanks to Jeff Hammond for raising the issue.
- Fix race condition when closing open file descriptors when launching
  MPI processes.  Thanks to Jason Williams for identifying the issue and
  supplying the fix.
- Fix Valgrind warnings for some MPI_TYPE_CREATE_* functions.  Thanks
  to Risto Toijala for identifying the issue and supplying the fix.
- Fix MPI_TYPE_CREATE_F90_{REAL,COMPLEX} for r=38 and r=308.
- Fix assembly issues with old versions of gcc (<6.0.0) that affected
  the stability of shared memory communications (e.g., with the vader
  BTL).
- Fix the OFI MTL handling of MPI_ANY_SOURCE.
- Fix noisy errors in the openib BTL with regards to
  ibv_exp_query_device().  Thanks to Angel Beltre and others who
  reported the issue.

Changes only in v3.1:

- Only use hugepages with appropriate permissions.  Thanks to Hunter
  Easterday for the fix.
- Fix support for external PMIx v3.1.x.
- Fix MPI_Allreduce crashes with some cases in the coll/spacc module.

Thanks,

Your Open MPI release team

[OMPI devel] 3.0.4rc1/3.1.4rc1 posted

2019-03-27 Thread Barrett, Brian via devel
Both 3.0.4rc1 and 3.1.4rc1 are posted in the usual places.  Given the late 
release cycle and the small set of changes, we are intending to release on 
Friday, unless someone reports a blocking issue.  So have a go, give feedback, 
be happy.

3.0.4rc1:
  download: https://www.open-mpi.org/software/ompi/v3.0/
  changelog: https://raw.githubusercontent.com/open-mpi/ompi/v3.0.x/NEWS

3.1.4rc1:
  download: https://www.open-mpi.org/software/ompi/v3.1/
  changelog: https://raw.githubusercontent.com/open-mpi/ompi/v3.1.x/NEWS


Thanks,

Your 3.0 and 3.1 maintainers


[OMPI devel] Open MPI Jenkins down

2018-11-20 Thread Barrett, Brian via devel
All -

A recent security patch on the Open MPI Jenkins setup appears to have caused 
some fairly significant instability.  Something in our build 
configuration is causing Jenkins to become unresponsive to any scheduling 
changes (including completing jobs or starting new testing instances).  The 
problem doesn’t occur every time, but has happened three times in the last week.

It appears at least part of the problem is the EC2 plug-in.  Unfortunately, I 
will not have time to investigate further until this weekend at the earliest.  
I have disabled the Open MPI pull request builder job so that other tests can 
run to completion.  I’m going to start trying to eliminate causes this weekend, 
but in the meantime, please exercise care when merging.

Brian

Re: [OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-16 Thread Barrett, Brian via devel
Gilles -

Look at the output of Chris’s libtool link line; you can see it’s explicitly 
adding a dependency on libopen-pal.so to the test binary.  Once it does that, 
it’s game over: the OS linking system will, rightly, complain about us changing 
the c:r:a in the libtool version system in a way that isn’t backwards 
compatible.

Unfortunately, I don’t have a good idea of what to do now.  We already did the 
damage on the 3.x series.  Our backwards compatibility testing (as lame as it 
is) just links libmpi, so it’s all good.  But if anyone uses libtool, we’ll 
have a problem, because we install the .la files that allow libtool to see the 
dependency of libmpi on libopen-pal, and it gets too excited.

We’ll need to talk about how we think about this change in the future.

Brian

> On Nov 14, 2018, at 6:07 PM, Gilles Gouaillardet 
>  wrote:
> 
> Chris,
> 
> I am a bit puzzled at your logs.
> 
> As far as I understand,
> 
> ldd libhhgttg.so.1
> 
> reports that libopen-rte.so.40 and libopen-pal.so.40 are both
> dependencies, but that does not say anything on
> who is depending on them. They could be directly needed by
> libhhgttg.so.1 (I hope / do not think it is the case),
> or indirectly by libmpi.so.40 (I'd rather bet on that).
> 
> In the latter case, having libhhgttg.so.1 point to another
> libmpi.so.40 that depends on newer opal/orte libraries should just
> work.
> 
> You might want to run strings libhhgttg.so.1 and look for libmpi.so.40
> (I found it) and libopen-pal.so.40 (I did not find it) or
> libopen-rte.so.40 (I did not find it either).
> 
> 
> Note if you
> gcc -shared -o libhhgttg.so.1 libhhgttg.c -lmpi -lopen-rte -lopen-pal
> then your lib will explicitly depend on the "internal" MPI libraries
> and you will face the same issue that your end user.
> You should not need to do that (I assume you do not explicitly call
> internal opal/orte subroutines), and hence avoid doing it.
> That being said, keep in mind that some build systems might do that
> for you under the hood (I have seen that, but I cannot remember which
> one), and that would be a bad thing, at least from an Open MPI point
> of view.
> 
> 
> Cheers,
> 
> Gilles
> On Wed, Nov 14, 2018 at 6:46 PM Christopher Samuel  
> wrote:
>> 
>> On 15/11/18 2:16 am, Barrett, Brian via devel wrote:
>> 
>>> In practice, this should not be a problem. The wrapper compilers (and
>>> our instructions for linking when not using the wrapper compilers)
>>> only link against libmpi.so (or a set of libraries if using Fortran),
>>> as libmpi.so contains the public interface. libmpi.so has a
>>> dependency on libopen-pal.so so the loader will load the version of
>>> libopen-pal.so that matches the version of Open MPI used to build
>>> libmpi.so However, if someone explicitly links against libopen-pal.so
>>> you end up where we are today.
>> 
>> Unfortunately that's not the case, just creating a shared library
>> that only links in libmpi.so will create dependencies on the private
>> libraries too in the final shared library. :-(
>> 
>> Here's a toy example to illustrate that.
>> 
>> [csamuel@farnarkle2 libtool]$ cat hhgttg.c
>> int answer(void)
>> {
>>return(42);
>> }
>> 
>> [csamuel@farnarkle2 libtool]$ gcc hhgttg.c -c -o hhgttg.o
>> 
>> [csamuel@farnarkle2 libtool]$ gcc -shared -Wl,-soname,libhhgttg.so.1 -o
>> libhhgttg.so.1 hhgttg.o -lmpi
>> 
>> [csamuel@farnarkle2 libtool]$ ldd libhhgttg.so.1
>>linux-vdso.so.1 =>  (0x7ffc625b3000)
>>libmpi.so.40 =>
>> /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libmpi.so.40
>> (0x7f018a582000)
>>libc.so.6 => /lib64/libc.so.6 (0x7f018a09e000)
>>libopen-rte.so.40 =>
>> /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-rte.so.40
>> (0x7f018a4b5000)
>>libopen-pal.so.40 =>
>> /apps/skylake/software/compiler/gcc/6.4.0/openmpi/3.0.0/lib/libopen-pal.so.40
>> (0x7f0189fde000)
>>libdl.so.2 => /lib64/libdl.so.2 (0x7f0189dda000)
>>librt.so.1 => /lib64/librt.so.1 (0x7f0189bd2000)
>>libutil.so.1 => /lib64/libutil.so.1 (0x7f01899cf000)
>>libm.so.6 => /lib64/libm.so.6 (0x7f01896cd000)
>>libpthread.so.0 => /lib64/libpthread.so.0 (0x7f01894b1000)
>>libz.so.1 => /lib64/libz.so.1 (0x7f018929b000)
>>libhwloc.so.5 => /lib64/libhwloc.so.5 (0x7f018905e000)
>>/lib64/ld-linux-x86-64.so.2 (0x7f018a46b000)
>>libnuma.so.1 => /lib64/libnuma.so.1 (0x7f

Re: [OMPI devel] Open-MPI backwards compatibility and library version changes

2018-11-14 Thread Barrett, Brian via devel
Chris -

When we look at ABI stability for Open MPI releases, we look only at the MPI 
and SHMEM interfaces, not the internal interfaces used by Open MPI internally.  
libopen-pal.so is an internal library, and we do not guarantee ABI stability 
across minor releases.  In 3.0.3, there was a backwards incompatible change in 
libopen-pal.so, which is why the shared library version numbers were increased 
in a way that prevented loading a new version of libopen-pal.so when the 
application was linked against an earlier version of the library.

In practice, this should not be a problem.  The wrapper compilers (and our 
instructions for linking when not using the wrapper compilers) only link 
against libmpi.so (or a set of libraries if using Fortran), as libmpi.so 
contains the public interface.  libmpi.so has a dependency on libopen-pal.so, 
so the loader will load the version of libopen-pal.so that matches the version 
of Open MPI used to build libmpi.so.  However, if someone explicitly links 
against libopen-pal.so, you end up where we are today.
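
(A quick way to see the supported link line is to ask the wrapper; a sketch,
and the exact output will vary with the install:

  mpicc --showme:link
  # typically something like: -L<prefix>/lib ... -lmpi

If a build system adds -lopen-rte or -lopen-pal on top of what the wrapper
reports, it is linking against the internal libraries and will hit exactly this
problem on the next internal ABI change.)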

There’s probably a bug in HDF5’s mechanism for linking against Open MPI, since 
it pulled in a dependency on libopen-pal.so.  However, there may be some things 
we can do in the future to better handle this scenario.  Unfortunately, most of 
the Open MPI developers (myself included) are at the SC’18 conference this 
week, so it will take us some time to investigate further.
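
(For anyone who wants to check their own libraries in the meantime: ldd prints
the full transitive closure, so libopen-pal.so showing up there is not by
itself a problem.  A sketch of the check that matters, assuming GNU binutils
and the toy library name from earlier in the thread:

  readelf -d libhhgttg.so.1 | grep NEEDED   # direct DT_NEEDED entries only
  ldd libhhgttg.so.1                        # everything, including transitive deps

If libopen-pal.so.40 appears in the readelf output, the library was linked
against it explicitly and will break when the internal version changes.)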

Brian

> On Nov 14, 2018, at 5:20 AM, Christopher Samuel  wrote:
> 
> Hi folks,
> 
> Just resub'd after a long time to ask a question about binary/backwards 
> compatibility.
> 
> We got bitten when upgrading from 3.0.0 to 3.0.3 which we assumed would be 
> binary compatible and so (after some testing to confirm it was) replaced our 
> existing 3.0.0 install with the 3.0.3 one (because we're using hierarchical 
> namespaces in Lmod it meant we avoided needed to recompile everything we'd 
> already built over the last 12 months with 3.0.0).
> 
> However, once we'd done that we heard from a user that their code would no 
> longer run because it couldn't find libopen-pal.so.40 and saw that instead 
> 3.0.3 had libopen-pal.so.42.
> 
> Initially we thought this was some odd build system problem, but then on 
> digging further we realised that they were linking against libraries that in 
> turn were built against OpenMPI (HDF5) and that those had embedded the 
> libopen-pal.so.40 names.
> 
> Of course our testing hadn't found that because we weren't linking against 
> anything like those for our MPI tests. :-(
> 
> But I was really surprised to see that these version numbers were changing, I 
> thought the idea was to keep things backwardly compatible within these series?
> 
> Now fortunately our reason for doing the forced upgrade (we found our 3.0.0 
> didn't work with our upgrade to Slurm 18.08.3) was us missing one combination 
> out of our testing whilst fault-finding and having gotten it going we've been 
> able to drop back to the original 3.0.0 & fixed it for them.
> 
> But is this something that you folks have come across before?
> 
> All the best,
> Chris
> -- 
>  Christopher Samuel OzGrav Senior Data Science Support
>  ARC Centre of Excellence for Gravitational Wave Discovery
>  http://www.ozgrav.org/  http://twitter.com/ozgrav
> 
> 
> 

[OMPI devel] 3.0.3/3.1.3 delay

2018-10-26 Thread Barrett, Brian via devel
Due to the shared memory one-sided issue (which was reverted today), I’m going 
to hold off on releasing 3.0.3 and 3.1.3 until at least Monday.  Would like to 
see the weekend’s MTT runs run clean before we release.

Brian

[OMPI devel] Open MPI 3.0.3rc2 and 3.1.3rc2 released

2018-10-23 Thread Barrett, Brian via devel
I have posted Open MPI 3.0.3rc2 as well as 3.1.3rc2 this afternoon.  Both are 
minor changes over rc1: the SIGCHLD and opal_fifo race patches are the primary 
changes.

Assuming no negative feedback, I’ll release both 3.0.3 and 3.1.3 on Friday.

Brian

[OMPI devel] Patcher on MacOS

2018-09-28 Thread Barrett, Brian via devel
Is there any practical reason to have the memory patcher component enabled for 
MacOS?  As far as I know, we don’t have any transports which require memory 
hooks on MacOS, and with the recent deprecation of the syscall interface, it 
emits a couple of warnings.  It would be nice to crush said warnings and the 
easiest way would be to not build the component.
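
(In the meantime, anyone annoyed by the warnings should be able to leave the
component out at configure time; a hedged example, untested on MacOS, using the
usual framework-component spelling:

  ./configure --enable-mca-no-build=memory-patcher ...
)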

Thoughts?

Brian

[OMPI devel] Mac OS X 10.4.x users?

2018-09-28 Thread Barrett, Brian via devel
All -

In trying to clean up some warnings, I noticed one (around pack/unpack in 
net/if.h) that is due to a workaround of a bug in MacOS X 10.4.x and earlier.  
The simple way to remove the warning would be to remove the workaround, which 
would break the next major version of Open MPI on 10.4.x and earlier on 64 bit 
systems.  10.5.x was released 11 years ago and didn’t drop support for any 64 
bit systems.  I posted a PR which removes support for 10.4.x and earlier 
(through the README) and removes the warning generated workaround 
(https://github.com/open-mpi/ompi/pull/5803).

Does anyone object to breaking 10.4.x and earlier?

Brian

[OMPI devel] v3.1.2rc1 is posted

2018-08-15 Thread Barrett, Brian via devel
The first release candidate for the 3.1.2 release is posted at 
https://www.open-mpi.org/software/ompi/v3.1/

Major changes include fixing the race condition in vader (the same one that 
caused v2.1.5rc1 to be posted today) as well as:

- Assorted Portals 4.0 bug fixes.
- Fix for possible data corruption in MPI_BSEND.
- Move shared memory file for vader btl into /dev/shm on Linux.
- Fix for MPI_ISCATTER/MPI_ISCATTERV Fortran interfaces with MPI_IN_PLACE.
- Upgrade PMIx to v2.1.3.
- Numerous One-sided bug fixes.
- Fix for race condition in uGNI BTL.
- Improve handling of large number of interfaces with TCP BTL.
- Numerous UCX bug fixes.


Our goal is to release 3.1.2 around the same time as 2.1.5 (hopefully end of 
this week), so any testing is appreciated.




Re: [OMPI devel] v3.1.1rc2 posted

2018-07-02 Thread Barrett, Brian via devel
Thanks for the report.

Yes, we normally pull the release candidate from the web site when we do the 
final release.  The tarballs are still available on 
https://download.open-mpi.org/ if you guess the tarball name (which is the same 
as the final release name, but with rc? in the version), should someone ever 
need to find an rc tarball.

Brian

On Jul 2, 2018, at 8:47 AM, Vallee, Geoffroy R. <valle...@ornl.gov> wrote:

Hi,

I do not see a 3.1.1rc2 but instead a final 3.1.1; is that normal? Anyway, I 
tested the 3.1.1 tarball on 8 summit nodes with netpipe and imb. I did not see 
any problems, and the performance numbers look good.

Thanks




From: Barrett, Brian via devel <devel@lists.open-mpi.org>
Date: July 1, 2018 at 6:31:26 PM EDT
To: Open MPI Developers <devel@lists.open-mpi.org>
Cc: Barrett, Brian <bbarr...@amazon.com>
Subject: [OMPI devel] v3.1.1rc2 posted


v3.1.1rc2 is posted at the usual place: 
https://www.open-mpi.org/software/ompi/v3.1/

Primary changes are some important UCX bug fixes and a forward compatibility 
fix in PMIx.

We’re targeting a release on Friday, please test and send results before then.

Thanks,

Brian

[OMPI devel] v3.1.1rc2 posted

2018-07-01 Thread Barrett, Brian via devel
v3.1.1rc2 is posted at the usual place: 
https://www.open-mpi.org/software/ompi/v3.1/

Primary changes are some important UCX bug fixes and a forward compatibility 
fix in PMIx.

We’re targeting a release on Friday, please test and send results before then.

Thanks,

Brian

[OMPI devel] Open MPI 3.1.1rc1 posted

2018-06-14 Thread Barrett, Brian via devel
The first release candidate for Open MPI 3.1.1 is posted at 
https://www.open-mpi.org/software/ompi/v3.1/.  We’re a bit behind on getting it 
out the door, so appreciate any testing feedback you have.

Brian

Re: [OMPI devel] OMPI OS X CI offline

2018-05-25 Thread Barrett, Brian via devel
> 
> On May 25, 2018, at 12:44 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Looks like "piper" OS X OMPI CI machine is offline.  I just marked it offline 
> in Jenkins and will bot:ompi:retest all the PR's that are obviously stuck.

Which wasn’t going to do anything, because the issue was that we had zero os-x 
builders available.  We now have two os-x builders available again (sorry, I 
had a power outage), so things should be flowing.

Brian


[OMPI devel] Open MPI 3.1.0 Release Update

2018-05-03 Thread Barrett, Brian via devel
It appears that we have resolved the outstanding issues with 3.1.0.  I’ve 
posted 3.1.0rc6 at https://www.open-mpi.org/software/ompi/v3.1/.  Please give 
it a go and let me know what you find.  Barring anyone posting a blocking 
issue, I intend to post 3.1.0 (final) tomorrow morning Pacific Time.

Brian

Re: [OMPI devel] Open MPI 3.1.0rc4 posted

2018-04-17 Thread Barrett, Brian via devel
Do we honestly care for 3.1.0?  I mean, we went 6 months without it working and 
no one cared.  We can’t fix all bugs, and I’m a little concerned about making 
changes right before release.

Brian

> On Apr 17, 2018, at 7:49 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Brian,
> 
> https://github.com/open-mpi/ompi/pull/5081 fixes support for external PMIx 
> v2.0
> 
> Support for external PMIx v1 is broken (same in master) and extra dev would 
> be required to fix it.
> 
> The easiest path, if acceptable, is to simply drop support for PMIx v1
> 
> Cheers,
> 
> Gilles
> 
> 
> 
> "Barrett, Brian via devel" <devel@lists.open-mpi.org> wrote:
>> In what we hope is the last RC for the 3.1.0 series, I’ve posted 3.1.0rc4 at:
>> 
>>   https://www.open-mpi.org/software/ompi/v3.1/
>> 
>> Please give it a try and provide feedback asap; goal is to release end of 
>> the week if we don’t find any major issues.
>> 
>> Brian

[OMPI devel] Open MPI 3.1.0rc4 posted

2018-04-17 Thread Barrett, Brian via devel
In what we hope is the last RC for the 3.1.0 series, I’ve posted 3.1.0rc4 at:

https://www.open-mpi.org/software/ompi/v3.1/

Please give it a try and provide feedback asap; goal is to release end of the 
week if we don’t find any major issues.

Brian

Re: [OMPI devel] Github CI stalls: ARM and/or SLES

2018-03-28 Thread Barrett, Brian via devel
To keep things moving, I removed ARM from the pull request checker until LANL 
and ARM can get their builders back online.

You should be able to restart your build request builds and have them complete 
now.

Brian

> On Mar 28, 2018, at 9:12 AM, Barrett, Brian via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> The ARM builders are all down; it was ARM that caused the problems.
> 
> Brian
> 
>> On Mar 28, 2018, at 6:48 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>> wrote:
>> 
>> Several PR's from last night appear to be stalled in the community CI.  I 
>> can't tell if they're stalled in ARM or SLES builds -- everything *appears* 
>> to be done, but Jenkins doesn't think that they are done.
>> 
>> For example, on https://github.com/open-mpi/ompi/pull/4983:
>> 
>> * Corresponds to 
>> https://jenkins.open-mpi.org/jenkins/job/open-mpi.pull_request/2336/
>> * open-mpi.build.platforms is blinking
>> * Corresponds to 
>> https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.platforms/2457/
>> * ARMv8 icon is light blue.  What does that mean?
>> * Clicking through the ARMv8 icon and looking at its console output, it says 
>> it's done:
>> 
>> ...
>> Process  1 exiting
>> --> skipping usempif08 examples
>> --> All done!
>> Finished: SUCCESS
>> 
>> * Status bar in the top right says "Build has been executing for 8 hr 52 min 
>> on Suse SLES 12sp2 (sir-rfxr92nq)", but the sles12sp2 icon is dark blue.
>> * Clicking through sles12sp2 icon and looking at its console output, it says 
>> it's done, too:
>> 
>> ...
>> Process  1 exiting
>> --> skipping usempif08 examples
>> --> All done!
>> Finished: SUCCESS
>> 
>> According to 
>> https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.platforms/, there's 
>> 5 builds stuck like this.  Curiously, they're not contiguous -- i.e., 2456 
>> and 2457 are stalled, but then 2458 finished successfully.  Then 2459-2461 
>> are also stalled.
>> 
>> This is unfortunately the limit of my Jenkins knowledge.  :-(
>> 
>> How to figure out where Jenkins is stuck?
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 


Re: [OMPI devel] Github CI stalls: ARM and/or SLES

2018-03-28 Thread Barrett, Brian via devel
The ARM builders are all down; it was ARM that caused the problems.

Brian

> On Mar 28, 2018, at 6:48 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Several PR's from last night appear to be stalled in the community CI.  I 
> can't tell if they're stalled in ARM or SLES builds -- everything *appears* 
> to be done, but Jenkins doesn't think that they are done.
> 
> For example, on https://github.com/open-mpi/ompi/pull/4983:
> 
> * Corresponds to 
> https://jenkins.open-mpi.org/jenkins/job/open-mpi.pull_request/2336/
> * open-mpi.build.platforms is blinking
> * Corresponds to 
> https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.platforms/2457/
> * ARMv8 icon is light blue.  What does that mean?
>  * Clicking through the ARMv8 icon and looking at its console output, it says 
> it's done:
> 
> ...
> Process  1 exiting
> --> skipping usempif08 examples
> --> All done!
> Finished: SUCCESS
> 
> * Status bar in the top right says "Build has been executing for 8 hr 52 min 
> on Suse SLES 12sp2 (sir-rfxr92nq)", but the sles12sp2 icon is dark blue.
>  * Clicking through sles12sp2 icon and looking at its console output, it says 
> it's done, too:
> 
> ...
> Process  1 exiting
> --> skipping usempif08 examples
> --> All done!
> Finished: SUCCESS
> 
> According to 
> https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.platforms/, there's 5 
> builds stuck like this.  Curiously, they're not contiguous -- i.e., 2456 and 
> 2457 are stalled, but then 2458 finished successfully.  Then 2459-2461 are 
> also stalled.
> 
> This is unfortunately the limit of my Jenkins knowledge.  :-(
> 
> How to figure out where Jenkins is stuck?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 


Re: [OMPI devel] Upcoming nightly tarball URL changes

2018-03-21 Thread Barrett, Brian via devel
Yeah, it’s failing for exactly the reason I thought it would (and why we didn’t 
support md5sum files in the first place).

Josh and I updated the MTT client yesterday and I pushed the change today.  
Please update your client and it should work.
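
For anyone verifying tarballs by hand rather than through MTT, a rough sketch
of the intended flow, assuming the tarball has already been downloaded into the
current directory (file names here are taken from the messages below):

  wget -q https://download.open-mpi.org/nightly/open-mpi/master/latest_snapshot.txt
  wget -q https://download.open-mpi.org/nightly/open-mpi/master/md5sums.txt
  grep "openmpi-$(cat latest_snapshot.txt).tar.gz" md5sums.txt | md5sum -c -

The grep matters because md5sums.txt now lists several snapshots at once.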

Brian

On Mar 20, 2018, at 10:35 PM, Boris Karasev wrote:

This issue occurs again:

Running command: /bin/md5sum 
/scratch/20180321_050534_23183_145104/sources/mpi_get__ompi-nightly-master/tarballs/openmpi-master-201803200242-9eb426e.tar.gz
e418ee1ea52345ba7ba72c4d4b7cd1ac 
/scratch/20180321_050534_23183_145104/sources/mpi_get__ompi-nightly-master/tarballs/openmpi-master-201803200242-9eb426e.tar.gz
*** Child process stdout closed
*** Child process now dead
Command complete, exit status: 0
*** ERROR: md5sum from checksum file does not match actual ( != 
e418ee1ea52345ba7ba72c4d4b7cd1ac)*** ERROR: Module aborted: 
MTT::MPI::Get::OMPI_Snapshot:Get: *** ERROR: md5sum from checksum file does not 
match actual ( != e418ee1ea52345ba7ba72c4d4b7cd1ac) at /hpc/newhome/mtt/svn/mt

Currently, the files contain those commits:
https://download.open-mpi.org/nightly/open-mpi/master/latest_snapshot.txt:
master-201803200242-9eb426e

https://download.open-mpi.org/nightly/open-mpi/master/[md5sums.txt|sha1sums.txt]:
master-201803081852-70c59f7
master-201803140349-5f58e7b
master-201803180329-a2d1419
master-201803110243-50d07e9
master-201803170305-bf3dd8a
master-201803082122-0f345c0
master-201803160306-e08e580


--

Best regards, Boris Karasev.


Re: [OMPI devel] Upcoming nightly tarball URL changes

2018-03-16 Thread Barrett, Brian via devel
Eventual consistency for the win.  It looks like I forgot to set a short cache 
time for the CDN that fronts the artifact repository.  So the previous day’s 
file was returned.  I fixed that and flushed the cache on the CDN, so it should 
work now.

Brian

On Mar 15, 2018, at 10:41 PM, Boris Karasev wrote:

Hello,
I have an issue with md5sums.txt/sha1sums.txt files for master. They are 
received but they looks outdated:
…
Value returning:
Running command: /bin/md5sum 
/scratch/20180316_063515_26360_144526/sources/mpi_get__ompi-nightly-master/tarballs/openmpi-master-201803160306-e08e580.tar.gz
3b05d9b122995c63c74dcc290e654624  
/scratch/20180316_063515_26360_144526/sources/mpi_get__ompi-nightly-master/tarballs/openmpi-master-201803160306-e08e580.tar.gz
*** Child process stdout closed
*** Child process now dead
Command complete, exit status: 0
*** ERROR: md5sum from checksum file does not match actual ( != 
3b05d9b122995c63c74dcc290e654624)*** ERROR: Module aborted: 
MTT::MPI::Get::OMPI_Snapshot:Get: *** ERROR: md5sum from checksum file does not 
match actual ( != 3b05d9b122995c63c74dcc290e654624) at 
/hpc/newhome/mtt/svn/mtt.git/trunk/lib/MTT/Messages.pm li
ne 133.

I checked the file: 
https://download.open-mpi.org/nightly/open-mpi/master/latest_snapshot.txt
On that moment it contains this: "master-201803160306-e08e580"
But the files md5sums.txt/sha1sums.txt did not contained the checksum for 
"master-201803160306-e08e580".
The last checksum contained in the files is intended for 
"openmpi-master-201803081852-70c59f7".

Thanks.


--

Best regards, Boris Karasev.


[OMPI devel] Open MPI 3.1.0rc3 Posted

2018-03-14 Thread Barrett, Brian via devel
Open MPI 3.1.0rc3 is now available at 
https://www.open-mpi.org/software/ompi/v3.1/.  Assuming that there are no new 
negative bug reports, the plan is to release 3.1.0 next Monday.  Changes since 
3.1.0rc2 include:

- Fixes to parallel debugger attach functionality
- Fixes to the MPI I/O interface
- Fix for an out of order message problem with multi-threaded applications
- Fix for a potential hang when sending extremely large messages in shared 
memory

Thanks,

Brian


[OMPI devel] Open MPI 3.0.1rc4 Posted

2018-03-14 Thread Barrett, Brian via devel
Open MPI 3.0.1rc4 is now available at 
https://www.open-mpi.org/software/ompi/v3.0/.  Assuming that there are no new 
negative bug reports, the plan is to release 3.0.1 next Monday.  Changes since 
3.0.1rc3 include:

 - Fixes to parallel debugger attach functionality
 - Fixes to the MPI I/O interface
 - Fix for an out of order message problem with multi-threaded applications
 - Fix for a potential hang when sending extremely large messages in shared 
memory

Thanks,

Brian & Howard


Re: [OMPI devel] Upcoming nightly tarball URL changes

2018-03-08 Thread Barrett, Brian via devel
Sorry about that; we screwed up understanding the “API” between MTT and the 
Open MPI download path.  We made a change in the nightly tarball builder and 
things should be working now.

Brian

> On Mar 8, 2018, at 2:31 AM, Christoph Niethammer  wrote:
> 
> Hello,
> 
> After the change of the nighly tarball URL I have issues with mtt getting the 
> checksum files:
> 
> ...
> Running command: wget --no-check-certificate -nv
>   https://download.open-mpi.org/nightly/open-mpi/v2.x/md5sums.txt
> md5sums.txt
> https://download.open-mpi.org/nightly/open-mpi/v2.x/md5sums.txt:
>   2018-03-07 23:16:19 ERROR 403: Forbidden.
> ...
> 
> Fetching the tarballs itself works fine.
> Anything else I have to change in the setup?
> 
> 
> Best
> 
> Christoph Niethammer
> 
> 
> - Original Message -
> From: "Open MPI Developers" 
> To: "Open MPI Developers" 
> Cc: "Barrett, Brian" 
> Sent: Monday, February 26, 2018 11:09:42 PM
> Subject: [OMPI devel] Upcoming nightly tarball URL changes
> 
> On Sunday, March 18th, the Open MPI team is going to make a change in where 
> nightly tarballs are stored that will likely impact MTT test configuration.  
> If you have an automated system (including MTT) that fetches nightly 
> tarballs, you will likely need to make a change in the next two weeks to 
> avoid breakage.
> 
> Over the last year, we’ve been working behind the scenes to improve some of 
> the workflows around building and releasing tarballs (both nightly and 
> release), including moving them out of our web tree and into Amazon S3 (and 
> the Amazon CloudFront CDN for downloads).  It’s time to make the next step, 
> moving the nightly tarballs out of the web tree.
> 
> As of December, the nightly tarball builder uploads the build results to S3, 
> which can be accessed from:
> 
>   https://download.open-mpi.org/nightly/open-mpi/<branch>/<tarball name>
> 
> So to get the latest 3.0 nightly tarball version, you’d download 
> https://download.open-mpi.org/nightly/open-mpi/v3.0.x/latest_snapshot.txt.  
> The build artifact tree under https://download.open-mpi.org/nightly/open-mpi/ 
> matches the tree under https://www.open-mpi.org/nightly/, so scripts should 
> work with only the change in root of the tree.
> 
> On Sunday, March 18th, we’ll stop mirroring tarballs to www.open-mpi.org and 
> rewrite the web pages to direct users to download.open-mpi.org/ for 
> downloading nightly tarballs of Open MPI.  We will add redirects from the old 
> tarballs and latest_snapshot.txt files to the new location, but not all 
> clients follow redirects by default.  So we’re asking everyone to proactively 
> update their MTT scripts.  It should just be updating lines like:
> 
>  ompi_snapshot_url = https://www.open-mpi.org/nightly/master
> 
> to read:
> 
>  ompi_snapshot_url = https://download.open-mpi.org/nightly/open-mpi/master
> 
> 
> Thanks,
> 
> Brian

Re: [OMPI devel] cannot push directly to master anymore

2018-01-31 Thread Barrett, Brian via devel

> On Jan 31, 2018, at 8:33 AM, r...@open-mpi.org wrote:
> 
> 
> 
>> On Jan 31, 2018, at 7:36 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> On Jan 31, 2018, at 10:14 AM, Gilles Gouaillardet 
>>  wrote:
>>> 
>>> I tried to push some trivial commits directly to the master branch and
>>> was surprised that is no more allowed.
>>> 
>>> The error message is not crystal clear, but I guess the root cause is
>>> the two newly required checks (Commit email checker and
>>> Signed-off-by-checker) were not performed.
>> 
>> That is probably my fault; I was testing something and didn't mean to leave 
>> that enabled.  Oops -- sorry.  :-(
>> 
>> That being said -- is it a terrible thing to require a PR to ensure that we 
>> get a valid email address (e.g., not a "root@localhost") and that we have a 
>> proper signed-off-by line?
> 
>> 
>>> /* note if the commit is trivial, then it is possible to add the following 
>>> line
>>> [skip ci]
>>> into the commit message, so Jenkins will not check the PR. */
>> 
>> We've had some discussions about this on the Tuesday calls -- the point was 
>> made that if you allow skipping CI for "trivial" commits, it starts you down 
>> the slippery slope of precisely defining what "trivial" means.  Indeed, I 
>> know that I have been guilty of making a "trivial" change that ended up 
>> breaking something.
>> 
>> FWIW, I have stopped using the "[skip ci]" stuff -- even if I made docs-only 
>> changes.  I.e., just *always* go through CI.  That way there's never any 
>> question, and never any possibility of a human mistake (e.g., accidentally 
>> marking "[skip ci]" on a PR that really should have had CI).
> 
> If CI takes 30 min, then not a problem - when CI takes 6 hours (as it 
> sometimes does), then that’s a different story.

If CI takes more than about 30 minutes, something’s broke.  Unfortunately, 
Jenkins makes that particular problem hard to deal with (monitoring for job 
length).  I have some thoughts on how to make it better for the OMPI Jenkins, 
but am short on implementation time.  So if we have any volunteers to help 
here… :)

I have no objections to PR-only for master.  The other option is pre-commit 
hooks, but that’s kind of ugly.  We allow “rebase and merge” if people don’t 
like the merge commits on trunk.  Also makes updating NEWS/README items easier 
when we eventually get that workflow built (because you can change the message 
later with a PR, but not a commit).
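
(For the curious, the local-hook variant is only a few lines; a minimal sketch
of a client-side commit-msg hook, not something we would ship or enforce:

  #!/bin/sh
  # .git/hooks/commit-msg -- reject commits that lack a Signed-off-by line
  grep -q '^Signed-off-by: ' "$1" || {
      echo "commit rejected: missing Signed-off-by (try git commit -s)" >&2
      exit 1
  }

The ugly part is distribution, since every developer has to install it by hand,
which is why the PR-time checks are the more attractive option.)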

We can also experiment with adding GitHub messages to the PR when CI completes, 
if people would like an async notification...

Brian

[OMPI devel] Open MPI 3.1.0 pre-release available

2018-01-23 Thread Barrett, Brian via devel
The Open MPI team is pleased to announce the first pre-release of the Open MPI 
3.1 series, available at:

  https://www.open-mpi.org/software/ompi/v3.1/

RC1 has two known issues:

  - We did not complete work to support hwloc 2.x, even when hwloc is built as 
an external library.  This may or may not be complete before 3.1.0 is shipped.
  - 3.1.0 is shipping with a pre-release version of PMIx 2.1.  We will finish 
the update to PMIx 2.1 before 3.1.0 is released.

We look forward to any other issues you may find in testing.

Thanks,

Brian


[OMPI devel] Open MPI 3.0.1rc2 available for testing

2018-01-23 Thread Barrett, Brian via devel
I’ve posted the first public release candidate of Open MPI 3.0.1 this evening.  
It can be downloaded for testing from:

  https://www.open-mpi.org/software/ompi/v3.0/

We appreciate any testing you can do in preparation for a release in the next 
week or two.


Thanks,

Brian & Howard

[OMPI devel] 32-bit builder in OMPI Jenkins

2018-01-09 Thread Barrett, Brian via devel
All -

Just an FYI, as we previously discussed, there’s now a 32 bit builder in the 
OMPI Community Jenkins CI tests.  The test passes on the current branches, so 
this shouldn’t impact anyone until such time as you break 32 bit x86 builds :). 
 Sorry this took so long (I believe the original issue that caused this test to 
be added was from October), but we had to fix 32 bit builds in all the branches 
(especially around the example programs) first.

Brian

[OMPI devel] 3.1.0 NEWS updates

2017-12-11 Thread Barrett, Brian via devel
All -

We’re preparing to start the 3.1.0 release process.  There have been a number 
of updates since the v3.0.x branch was created and we haven’t necessarily been 
great at updating the NEWS file.  I took a stab at the update, can everyone 
have a look and see what I missed?

  https://github.com/open-mpi/ompi/pull/4609

Thanks,

Brian

[OMPI devel] 3.0.1 NEWS updates

2017-12-11 Thread Barrett, Brian via devel
All -

We’re preparing to start the 3.0.1 release process.  There have been a number 
of updates since 3.0.0 and we haven’t necessarily been great at updating the 
NEWS file.  I took a stab at the update, can everyone have a look and see what 
I missed?

  https://github.com/open-mpi/ompi/pull/4607/files

Thanks,

Brian

Re: [OMPI devel] OSC module change

2017-11-30 Thread Barrett, Brian via devel
One day, I should really go remember how all that code I wrote many moons ago 
works… :).  ompi_win_t has to have the group pointer so that the MPI layer can 
implement MPI_WIN_GET_GROUP.  I should have remembered that, rather than 
suggesting there was work to do.  Sorry about that.
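
(And since the group is there anyway, it covers the rank translation George
described; a hedged sketch, with win and local_rank standing in for whatever
the monitoring code actually has in hand:

  MPI_Group win_group, world_group;
  int world_rank;

  MPI_Win_get_group(win, &win_group);            /* group backing the window */
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);
  MPI_Group_translate_ranks(win_group, 1, &local_rank,
                            world_group, &world_rank);
  MPI_Group_free(&win_group);
  MPI_Group_free(&world_group);

Done once per rank at window creation, that avoids both the communicator
requirement and a per-operation hash lookup.)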

Brian

> On Nov 30, 2017, at 9:48 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> Woo hoo!  Thanks for doing that.  :-)
> 
> 
>> On Nov 30, 2017, at 12:43 PM, Clement FOYER <clement.fo...@gmail.com> wrote:
>> 
>> Hi devels,
>> 
>> In fact the communicator's group was already retained in the window 
>> structure. So everything was already in place. I pushed the last 
>> modifications, and everything seems ready to be merged in PR#4527.
>> 
>> Jeff, the fixup commits are squashed :)
>> 
>> Clément
>> 
>> On 11/30/2017 12:00 AM, Barrett, Brian via devel wrote:
>>> The group is the easiest way to do the mapping from rank in window to 
>>> ompi_proc_t, so it’s safe to say every window will have one (also, as a way 
>>> of holding a reference to the ompi_proc_t).  So I think it’s safe to say 
>>> that every OSC module has a group handle somewhere (directly or through the 
>>> communicator).
>>> 
>>> Remember that in some implementations of the MTL, a communicator ID is a 
>>> precious resource.  I don’t know where Portals 4 falls right now, but in 
>>> various of the 64 bit tag matching implementations, it’s been as low as 4k 
>>> communicators.  There’s no need for a cid if all you hold is a group 
>>> reference.  Plus, a communicator has a bunch of other state (collective 
>>> modules handles, etc.) that aren’t necessarily needed by a window.
>>> 
>>> Brian
>>> 
>>>> On Nov 29, 2017, at 5:57 AM, Clement FOYER <clement.fo...@gmail.com> wrote:
>>>> 
>>>> Hi Brian,
>>>> 
>>>> Even if I see your point, I don't think a user request to free the 
>>>> communicator should necessarily lead to the communicator being deleted; it 
>>>> is only released from one hold and left available to be disposed of by the 
>>>> library. I see no objection to having the library keep a hold on these 
>>>> communicators, since the user gave a handle to the actual object.
>>>> 
>>>> I do agree with the point of asking whether we want to keep only 
>>>> information relevant to all OSC components. Nevertheless, what would the 
>>>> difference be between holding the complete communicator and holding only 
>>>> the group? Is the group the smallest part common to every component?
>>>> 
>>>> Clément
>>>> 
>>>> On 11/28/2017 07:46 PM, Barrett, Brian via devel wrote:
>>>>> The following is perfectly legal:
>>>>> 
>>>>> MPI_Comm_dup(some_comm, &tmp_comm);
>>>>> MPI_Win_create(…., tmp_comm, &win);
>>>>> MPI_Comm_free(&tmp_comm);
>>>>> 
>>>>> 
>>>>> 
>>>>> So I don’t think stashing away a communicator is the solution.  Is a 
>>>>> group sufficient?  I think any rational reading of the standard would 
>>>>> lead to windows needing to hold a group reference for the life of the 
>>>>> window.  I’d be ok putting a group pointer in the base window, if that 
>>>>> would work?
>>>>> 
>>>>> Brian
>>>>> 
>>>>>> On Nov 28, 2017, at 10:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>> 
>>>>>> Hi Brian,
>>>>>> 
>>>>>> Let me first start with explaining why we need the communicator. We need 
>>>>>> to translate local to global rank (aka. rank in your MPI_COMM_WORLD), so 
>>>>>> that the communication map we provide make sense. The only way today is 
>>>>>> to go back to a communicator and then basically translate a rank between 
>>>>>> this communicator and MPI_COMM_WORLD. We could use the gid, but then we 
>>>>>> have a hash table lookup for every operation.
>>>>>> 
>>>>>> While a communicator is not needed internally by an OSC, in MPI world 
>>>>>> all windows start with a communicator. This is the reason why I was 
>>>>>> proposing the change, not to force a window to create or hold a 
>>>>>> communicator, but simply because the existence of a communicator linked 
>>>>>> to the window is more or less enforced by the MPI standard.

Re: [OMPI devel] OSC module change

2017-11-29 Thread Barrett, Brian via devel
The group is the easiest way to do the mapping from rank in window to 
ompi_proc_t, so it’s safe to say every window will have one (also, as a way of 
holding a reference to the ompi_proc_t).  So I think it’s safe to say that 
every OSC module has a group handle somewhere (directly or through the 
communicator).

Remember that in some implementations of the MTL, a communicator ID is a 
precious resource.  I don’t know where Portals 4 falls right now, but in 
various of the 64 bit tag matching implementations, it’s been as low as 4k 
communicators.  There’s no need for a cid if all you hold is a group reference. 
 Plus, a communicator has a bunch of other state (collective modules handles, 
etc.) that aren’t necessarily needed by a window.

Brian

On Nov 29, 2017, at 5:57 AM, Clement FOYER <clement.fo...@gmail.com> wrote:


Hi Brian,

Even if I see your point, I don't think a user request to free the communicator 
should necessarily lead to the communicator being deleted; it is only released 
from one hold and left available to be disposed of by the library. I see no 
objection to having the library keep a hold on these communicators, since the 
user gave a handle to the actual object.

I do agree with the point of asking whether we want to keep only information 
relevant to all OSC components. Nevertheless, what would the difference be 
between holding the complete communicator and holding only the group? Is the 
group the smallest part common to every component?

Clément

On 11/28/2017 07:46 PM, Barrett, Brian via devel wrote:
The following is perfectly legal:

MPI_Comm_dup(some_comm, &tmp_comm);
MPI_Win_create(…., tmp_comm, &win);
MPI_Comm_free(&tmp_comm);



So I don’t think stashing away a communicator is the solution.  Is a group 
sufficient?  I think any rational reading of the standard would lead to windows 
needing to hold a group reference for the life of the window.  I’d be ok 
putting a group pointer in the base window, if that would work?

Brian

On Nov 28, 2017, at 10:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

Hi Brian,

Let me first start with explaining why we need the communicator. We need to 
translate local to global rank (aka. rank in your MPI_COMM_WORLD), so that the 
communication map we provide makes sense. The only way today is to go back to a 
communicator and then basically translate a rank between this communicator and 
MPI_COMM_WORLD. We could use the gid, but then we have a hash table lookup for 
every operation.

While a communicator is not needed internally by an OSC, in MPI world all 
windows start with a communicator. This is the reason why I was proposing the 
change, not to force a window to create or hold a communicator, but simply 
because the existence of a communicator linked to the window is more or less 
enforced by the MPI standard.

  George.



On Tue, Nov 28, 2017 at 1:02 PM, Barrett, Brian via devel <devel@lists.open-mpi.org> wrote:
The objection I have to this is that it forces an implementation where every 
one-sided component is backed by a communicator.  While that’s the case today, 
it’s certainly not required.  If you look at Portal 4, for example, there’s one 
collective call outside of initialization, and that’s a barrier in MPI_FENCE.  
The SM component is the same way and given some of the use cases for shared 
memory allocation using the SM component, it’s very possible that we’ll be 
faced with a situation where creating a communicator per SM region is too 
expensive in terms of overall communicator count.

I guess a different question would be what you need the communicator for.  It 
shouldn’t have any useful semantic meaning, so why isn’t a silent 
implementation detail for the monitoring component?

Brian


On Nov 28, 2017, at 8:45 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

Devels,

We would like to change the definition of the OSC module to move the 
communicator one level up from the different module structures into the base 
OSC module. The reason for this, as well as a lengthy discussion on other 
possible solutions can be found in https://github.com/open-mpi/ompi/pull/4527.

We need to take a decision on this asap, to prepare the PR for the 3.1. Please 
comment asap.

  George.


Re: [OMPI devel] OSC module change

2017-11-28 Thread Barrett, Brian via devel
The following is perfectly legal:

MPI_Comm_dup(some_comm, &tmp_comm);
MPI_Win_create(…., tmp_comm, &win);
MPI_Comm_free(&tmp_comm);



So I don’t think stashing away a communicator is the solution.  Is a group 
sufficient?  I think any rational reading of the standard would lead to windows 
needing to hold a group reference for the life of the window.  I’d be ok 
putting a group pointer in the base window, if that would work?

Brian

On Nov 28, 2017, at 10:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

Hi Brian,

Let me first start with explaining why we need the communicator. We need to 
translate local to global rank (aka. rank in your MPI_COMM_WORLD), so that the 
communication map we provide make sense. The only way today is to go back to a 
communicator and then basically translate a rank between this communicator and 
MPI_COMM_WORLD. We could use the gid, but then we have a hash table lookup for 
every operation.

While a communicator is not needed internally by an OSC, in MPI world all 
windows start with a communicator. This is the reason why I was proposing the 
change, not to force a window to create or hold a communicator, but simply 
because the existence of a communicator linked to the window is more of less 
enforced by the MPI standard.

  George.



On Tue, Nov 28, 2017 at 1:02 PM, Barrett, Brian via devel <devel@lists.open-mpi.org> wrote:
The objection I have to this is that it forces an implementation where every 
one-sided component is backed by a communicator.  While that’s the case today, 
it’s certainly not required.  If you look at Portal 4, for example, there’s one 
collective call outside of initialization, and that’s a barrier in MPI_FENCE.  
The SM component is the same way and given some of the use cases for shared 
memory allocation using the SM component, it’s very possible that we’ll be 
faced with a situation where creating a communicator per SM region is too 
expensive in terms of overall communicator count.

I guess a different question would be what you need the communicator for.  It 
shouldn’t have any useful semantic meaning, so why isn’t a silent 
implementation detail for the monitoring component?

Brian


On Nov 28, 2017, at 8:45 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

Devels,

We would like to change the definition of the OSC module to move the 
communicator one level up from the different module structures into the base 
OSC module. The reason for this, as well as a lengthy discussion on other 
possible solutions can be found in https://github.com/open-mpi/ompi/pull/4527.

We need to take a decision on this asap, to prepare the PR for the 3.1. Please 
comment asap.

  George.


Re: [OMPI devel] OSC module change

2017-11-28 Thread Barrett, Brian via devel
The objection I have to this is that it forces an implementation where every 
one-sided component is backed by a communicator.  While that’s the case today, 
it’s certainly not required.  If you look at Portal 4, for example, there’s one 
collective call outside of initialization, and that’s a barrier in MPI_FENCE.  
The SM component is the same way and given some of the use cases for shared 
memory allocation using the SM component, it’s very possible that we’ll be 
faced with a situation where creating a communicator per SM region is too 
expensive in terms of overall communicator count.

I guess a different question would be what you need the communicator for.  It 
shouldn’t have any useful semantic meaning, so why isn’t a silent 
implementation detail for the monitoring component?

Brian


On Nov 28, 2017, at 8:45 AM, George Bosilca wrote:

Devels,

We would like to change the definition of the OSC module to move the 
communicator one level up from the different module structures into the base 
OSC module. The reason for this, as well as a lengthy discussion on other 
possible solutions can be found in https://github.com/open-mpi/ompi/pull/4527.

We need to take a decision on this asap, to prepare the PR for the 3.1. Please 
comment asap.

  George.


[OMPI devel] 3.0 / 3.1 Pull Requests caught up?

2017-11-01 Thread Barrett, Brian via devel
All -

I think we’re finally caught up on pull requests for the 3.0 and 3.1 release.  
There are a small number of PRs against both branches waiting for code reviews, 
but otherwise everything has been committed.  If I somehow missed something, 
first please make sure it has a “Target: ” label on it, and second 
please let me know.

We ran into some Jenkins difficulties with the ompio patch series for 3.1, but 
merged those in this morning.  I’ll probably cut 3.0.1 and 3.1.0 rc tarballs 
tomorrow morning off of tonight’s nightly builds, assuming MTT looks reasonable.

Brian

[OMPI devel] Jenkins jobs offline

2017-10-27 Thread Barrett, Brian via devel
The ARM builder for CI tests is offline, meaning that all CI jobs will fail 
until Pasha can get it back online.  It looks like the agent didn’t call back 
after Jenkins restarted this morning.

Brian

[OMPI devel] Jenkins/MTT/Trac outage Sunday morning

2017-10-27 Thread Barrett, Brian via devel
All -

There will be a 1-2 hour outage of the server that hosts Open MPI’s Jenkins / 
MTT / Trac servers on Sunday morning (likely starting around 8:30am PDT).  In 
addition to the usual security updates, I’m going to repartition the root 
volume a little bit in an attempt to mitigate the blast damage of some of our 
services misbehaving (like last week’s Jenkins outage).  If this is going to 
cause significant issues for anyone, please let me know.

Brian

[OMPI devel] HWLOC / rmaps ppr build failure

2017-10-04 Thread Barrett, Brian via devel
It looks like a change in either HWLOC or the rmaps ppr component is causing 
Cisco build failures on master for the last couple of days:

  https://mtt.open-mpi.org/index.php?do_redir=2486

rmaps_ppr.c:665:17: error: ‘HWLOC_OBJ_NUMANODE’ undeclared (first use in this 
function); did you mean ‘HWLOC_OBJ_NODE’?
 level = HWLOC_OBJ_NUMANODE;
 ^~
 HWLOC_OBJ_NODE
rmaps_ppr.c:665:17: note: each undeclared identifier is reported only once for 
each function it appears in

Can someone take a look?
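
(One guess while someone digs in: HWLOC_OBJ_NUMANODE only exists as of hwloc
1.11, so if the failing builds use an older external hwloc, a compatibility
shim along these lines may be all that is needed; a sketch, untested, version
cutoff to be double-checked:

  #if HWLOC_API_VERSION < 0x00010b00
  /* hwloc releases before 1.11 only know the old object name */
  #define HWLOC_OBJ_NUMANODE HWLOC_OBJ_NODE
  #endif
)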

[OMPI devel] Cuda build break

2017-10-04 Thread Barrett, Brian via devel
All -

It looks like nVidia’s MTT started failing on 9/26, due to not finding Cuda.  
There’s a suspicious commit given the error message in the hwloc cuda changes.  
Jeff and Brice, it’s your patch, can you dig into the build failures?

Brian

Re: [OMPI devel] Jenkins nowhere land again

2017-10-03 Thread Barrett, Brian via devel
My MacOS box is back up and jobs are progressing again. The queue got kind of 
long, so it might be an hour or so before it catches up. I have some thoughts 
on monitoring so we get emails in case this happens and my team's Product 
Manager found an unused Amazon-owned Mac Mini we'll add to the pool so that I 
won't have to drive home if this happens again.

Brian

On Oct 3, 2017, at 13:40, "r...@open-mpi.org" <r...@open-mpi.org> wrote:

I'm not sure either - I have the patch to fix the loop_spawn test problem, but 
can't get it into the repo.


On Oct 3, 2017, at 1:22 PM, Barrett, Brian via devel <devel@lists.open-mpi.org> wrote:

I'm not entirely sure what we want to do.  It looks like both Nathan's and my 
OS X servers died on the same day.  It looks like mine might be a larger failure 
than just Jenkins, because I can't log into the machine remotely.  It's going 
to be a couple hours before I can get home.  Nathan, do you know what happened 
to your machine?

The only options for the OMPI builder are to either wait until Nathan or I get 
home and get our servers running again or to not test OS X (which has its own 
problems).  I don't have a strong preference here, but I also don't want to 
make the decision unilaterally.

Brian


On Oct 3, 2017, at 1:14 PM, r...@open-mpi.org wrote:

We are caught between two infrastructure failures:

Mellanox can't pull down a complete PR

OMPI is hanging on the OS-X server

Can someone put us out of our misery?
Ralph


Re: [OMPI devel] Jenkins nowhere land again

2017-10-03 Thread Barrett, Brian via devel
I’m not entirely sure what we want to do.  It looks like both Nathan’s and my 
OS X servers died on the same day.  It looks like mine might be a larger failure 
than just Jenkins, because I can’t log into the machine remotely.  It’s going 
to be a couple hours before I can get home.  Nathan, do you know what happened 
to your machine?

The only options for the OMPI builder are to either wait until Nathan or I get 
home and get our servers running again or to not test OS X (which has its own 
problems).  I don’t have a strong preference here, but I also don’t want to 
make the decision unilaterally.

Brian


> On Oct 3, 2017, at 1:14 PM, r...@open-mpi.org wrote:
> 
> We are caught between two infrastructure failures:
> 
> Mellanox can’t pull down a complete PR
> 
> OMPI is hanging on the OS-X server
> 
> Can someone put us out of our misery?
> Ralph
> 

[OMPI devel] Open MPI 3.0.0rc5 available for testing

2017-09-07 Thread Barrett, Brian via devel
Hi all -

Open MPI 3.0.0rc5 is available for testing at the usual place:

  https://www.open-mpi.org/software/ompi/v3.0/

There are four major changes since the last RC:

  * Components built as DSOs now link against the library associated with their 
project, making it significantly easier to work with Open MPI from environments 
like mpi4py.
  * Added a C++ wrapper compiler for oshmem
  * Some dynamic process management cleanups
  * Finalized the README and NEWS files

Barring any unexpected bugs found in this release, we are planning on spinning 
the final 3.0.0 release from this RC in the next couple of days.  Please let us 
know if you find any issues.

Thank you,

Brian and Howard


[OMPI devel] Open MPI 3.1 Feature List

2017-09-05 Thread Barrett, Brian via devel
All -

With 3.0 (finally) starting to wrap up, we’re starting discussion of the 3.1 
release.  As a reminder, we are targeting 2017 for the release, are going to 
cut the release from master, and are not going to have a feature whitelist for 
the release.  We are currently looking at a timeline for cutting the 3.1 branch 
from master.  It will be after 2.0.x is wrapped (possibly one more bugfix 
release) and 3.0.0 has released.  That said, we are looking for feedback on 
features that your organization plans to contribute for 3.1 that are not 
already in master.  What are the features and what is the timeline for 
submission to master?  If you have something not in master that needs to be, 
please comment on timelines before next Tuesday’s con-call.

Thanks,

The 3.1 release managers

[OMPI devel] Open MPI 3.0.0rc4 available

2017-08-29 Thread Barrett, Brian via devel
The fourth release candidate for Open MPI 3.0.0 is now available for download.  
Changes since rc2 include:

* Better handling of OPAL_PREFIX for PMIx
* Update to hwloc 1.11.7
* Sync with PMIx 2.0.1
* Revert atomics behavior to that of the 2.x series
* usnic, openib, and portals4 bug fixes
* Add README notes about older versions of the XL compiler
* Use UCX by default if found in usual locations

We will be releasing rc5 later this week, which we hope will be the basis of 
the 3.0.0 release early next week.  So the rc4 release is the last chance to 
get bugs addressed.  Please give it a go on your systems.  Open MPI 3.0.0rc4 
can be downloaded from:

https://www.open-mpi.org/software/ompi/v3.0/


Thanks,

The Open MPI Team


[OMPI devel] v3.0.0 blocker issues

2017-08-01 Thread Barrett, Brian via devel
Here’s the full list: 
https://github.com/open-mpi/ompi/issues?q=is%3Aissue%20is%3Aopen%20label%3A%22Target%3A%203.0.x%22%20label%3A%22blocker%22

There’s obviously a bunch of XLC issues in there, and IBM’s working on the 
right documentation / configure checks so that we feel comfortable releasing 
with at least documented XLC support.

However, there’s a number of issues outside of XLC that, if nothing else, could 
use some updates on the tickets.  We can’t release 3.0.0 with a bunch of 
blocker bugs, so any time you can spend knocking down the bug list would be 
appreciated.

Brian

Re: [OMPI devel] Issue/PR tagging

2017-08-01 Thread Barrett, Brian via devel
I finally updated https://github.com/open-mpi/ompi/wiki/SubmittingPullRequests.

Holler if there are questions…

Brian

On Jul 28, 2017, at 10:30 AM, Artem Polyakov <artpo...@gmail.com> wrote:

Brian,

Have you had a chance to put this on the wiki? If so - can you send the link - 
I can't find it.

2017-07-19 16:47 GMT-07:00 Barrett, Brian via devel <devel@lists.open-mpi.org>:
I’ll update the wiki (and figure out where on our wiki to put more general 
information), but the basics are:

- If you find a bug, file an issue.  Add Target:v??? labels for any branch it 
  impacts.  If we decide later not to fix the issue on a branch, we’ll remove 
  the label.
- Open/find an issue for any PR going to release branches.  That issue can 
  (possibly should, if the issue impacts multiple branches) have multiple 
  Target:v??? labels.
- If a PR is for a release branch (i.e., its immediate target to merge to is a 
  release branch), please add a Target:v??? label and reference the issue.
- If a PR is for master, it can reference an issue (if there’s an issue 
  associated with it), but should not have a Target:v??? label.
- If an issue is fixed in master, but not merged into branches, don’t close the 
  issue.

I think that’s about it.  There’s some workflows we want to build to automate 
enforcing many of these things, but for now, it’s just hints to help the RMs 
not lose track of issues.

Brian

> On Jul 19, 2017, at 12:18 PM, r...@open-mpi.org wrote:
>
> Hey folks
>
> I know we made some decisions last week about how to tag issues and PRs to 
> make things easier to track for release branches, but the wiki notes don’t 
> cover what we actually decided to do. Can someone briefly summarize? I 
> honestly have forgotten if we tag issues, or tag PRs
>
> Ralph
>



--
Best regards, Artem Y. Polyakov


[OMPI devel] Open MPI 3.0.0 Release Candidate 2 available

2017-07-29 Thread Barrett, Brian via devel
Just in time for your weekend, the Open MPI team is releasing Open MPI 
3.0.0rc2, the second release candidate of the 3.0.0 series.  New in this 
release candidate is PMIx 2.0 integration, so we would appreciate any testing 
around run-time environments.  There have also been a round of bug fixes and 
README updates (including finding more compilers that are old and broken in 
obscure ways), so please check the README if you run into compilation problems. 
 If you don’t find anything, please open an issue against the 3.0.0 milestone 
on GitHub.

Open MPI 3.0.0rc2 was cut at the same git commit as Wednesday night’s tarball.  
This means that any commits made yesterday are not in 3.0.0rc2 (but will be in 
later RCs and likely 3.0.0 final).

Tarballs can be downloaded from:

   https://www.open-mpi.org/software/ompi/v3.0/


Thanks,

Brian & Howard

Re: [OMPI devel] Issue/PR tagging

2017-07-19 Thread Barrett, Brian via devel
I’ll update the wiki (and figure out where on our wiki to put more general 
information), but the basics are:

If you find a bug, file an issue.  Add Target:v??? labels for any branch it 
impacts.  If we decide later not to fix the issue on a branch, we’ll remove the 
label
Open/find an issue for any PR going to release branches.  That issue can 
(possibly should, if the issue impacts multiple branches) have multiple 
Target:v??? labels
If a PR is for a release branch (i.e., its immediate target to merge to is a 
release branch), please add a Target:v??? label and reference the issue
If a PR is for master, it can reference an issue (if there’s an issue 
associated with it), but should not have a Target:v??? label
If an issue is fixed in master, but not merged into branches, don’t close the 
issue

I think that’s about it.  There’s some workflows we want to build to automate 
enforcing many of these things, but for now, it’s just hints to help the RMs 
not lose track of issues.
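
As a purely illustrative aside (not project policy; the titles, labels, and branch names below are invented), the workflow above might look something like this with the GitHub CLI:

# Hypothetical example only -- titles, bodies, and labels are made up.
# File an issue for the bug and tag the release branch(es) it impacts:
gh issue create --title "osc/rdma: crash during lock_all" \
                --body "Seen on master and the 3.0 branch" \
                --label "Target:v3.0.x"

# Open the backport PR against the release branch and reference that issue:
gh pr create --base v3.0.x --title "osc/rdma: fix crash (backport)" \
             --body "Refs the issue above" --label "Target:v3.0.x"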

Brian

> On Jul 19, 2017, at 12:18 PM, r...@open-mpi.org wrote:
> 
> Hey folks
> 
> I know we made some decisions last week about how to tag issues and PRs to 
> make things easier to track for release branches, but the wiki notes don’t 
> cover what we actually decided to do. Can someone briefly summarize? I 
> honestly have forgotten if we tag issues, or tag PRs
> 
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


[OMPI devel] Yoda SPML and master/v3.0.0

2017-07-13 Thread Barrett, Brian via devel
Mellanox developers -

The btl sm header leak in the yoda spml brought up questions about the status 
of the yoda spml.  My understanding was that Mellanox was going to remove it 
after the decision that we didn’t require supporting btl transports and 
Mellanox no longer wanting to support yoda.  But it looks like yoda is still in 
master and v3.0.x.  Can we remove it?  Is it possible to get a patch in the 
next couple of days from Mellanox?

Thanks,

Brian

[OMPI devel] v3.0.x / v3.x branch mixup

2017-07-07 Thread Barrett, Brian via devel
Hi all -

Earlier this week, we discovered that a couple of pull requests had been posted 
against the deprecated v3.x branch (instead of the active v3.0.x branch).  
Worse, Github allowed me to merge those requests, despite Github reporting that 
nobody had permissions to write to the branch (oddly, Howard can not push to 
the branch).  After some auditing, it appears that there is only one commit in 
the v3.x branch that is not in the v3.0.x branch:

  
https://github.com/open-mpi/ompi/commit/8a4487900831f9b0dbed1f00cb9cc30921988bc2

Boris, can you file a PR against v3.0.x if this patch is still desired for the 
v3.0.0 release?

Apologies to all; we’ll do better for the fall release.

Brian



[OMPI devel] 3.0.0 review reminder

2017-06-30 Thread Barrett, Brian via devel
There are a number of outstanding PRs on the 3.0.0 branch which are awaiting 
review.  The door for these is rapidly closing, so if you have some time, 
please go review the PRs and take appropriate action.  

Brian


[OMPI devel] Open MPI 3.0.0 first release candidate posted

2017-06-28 Thread Barrett, Brian via devel
The first release candidate of Open MPI 3.0.0 is now available 
(https://www.open-mpi.org/software/ompi/v3.0/).  We expect to have at least one 
more release candidate, as there are still outstanding MPI-layer issues to be 
resolved (particularly around one-sided).  We are posting 3.0.0rc1 to get 
feedback on run-time stability, as one of the big features of Open MPI 3.0 is 
the update to the PMIx 2 runtime environment.  We would appreciate any and all 
testing you can do around run-time behaviors.

Thank you,

Brian & Howard


Re: [OMPI devel] Abstraction violation!

2017-06-22 Thread Barrett, Brian via devel
Thanks, Nathan.

There’s no mpi.h available on the PR builder hosts, so something works out.  
Haven’t thought through that path, however.

Brian

> On Jun 22, 2017, at 6:04 PM, Nathan Hjelm <hje...@me.com> wrote:
> 
> I have a fix I am working on. Will open a PR tomorrow morning.
> 
> -Nathan
> 
>> On Jun 22, 2017, at 6:11 PM, r...@open-mpi.org wrote:
>> 
>> Here’s something even weirder. You cannot build that file unless mpi.h 
>> already exists, which it won’t until you build the MPI layer. So apparently 
>> what is happening is that we somehow pickup a pre-existing version of mpi.h 
>> and use that to build the file?
>> 
>> Checking around, I find that all my available machines have an mpi.h 
>> somewhere in the default path because we always install _something_. I 
>> wonder if our master would fail in a distro that didn’t have an MPI 
>> installed...
>> 
>>> On Jun 22, 2017, at 5:02 PM, r...@open-mpi.org wrote:
>>> 
>>> It apparently did come in that way. We just never test -no-ompi and so it 
>>> wasn’t discovered until a downstream project tried to update. Then...boom.
>>> 
>>> 
>>>> On Jun 22, 2017, at 4:07 PM, Barrett, Brian via devel 
>>>> <devel@lists.open-mpi.org> wrote:
>>>> 
>>>> I’m confused; looking at history, there’s never been a time when 
>>>> opal/util/info.c hasn’t included mpi.h.  That seems odd, but so does info 
>>>> being in opal.
>>>> 
>>>> Brian
>>>> 
>>>>> On Jun 22, 2017, at 3:46 PM, r...@open-mpi.org wrote:
>>>>> 
>>>>> I don’t understand what someone was thinking, but you CANNOT #include 
>>>>> “mpi.h” in opal/util/info.c. It has broken pretty much every downstream 
>>>>> project.
>>>>> 
>>>>> Please fix this!
>>>>> Ralph
>>>>> 
>>>>> ___
>>>>> devel mailing list
>>>>> devel@lists.open-mpi.org
>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>> 
>>>> ___
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>> 
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Abstraction violation!

2017-06-22 Thread Barrett, Brian via devel
I’m confused; looking at history, there’s never been a time when 
opal/util/info.c hasn’t included mpi.h.  That seems odd, but so does info being 
in opal.

Brian

> On Jun 22, 2017, at 3:46 PM, r...@open-mpi.org wrote:
> 
> I don’t understand what someone was thinking, but you CANNOT #include “mpi.h” 
> in opal/util/info.c. It has broken pretty much every downstream project.
> 
> Please fix this!
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Mellanox Jenkins

2017-06-22 Thread Barrett, Brian via devel
As a fellow Jenkins maintainer, thanks for all the work :).

Brian

On Jun 22, 2017, at 7:35 AM, Joshua Ladd 
<jladd.m...@gmail.com> wrote:

Update - Mellanox Jenkins is back to normal. All previously failing PRs have 
been retrigged. Thanks for your patience.

Best,

Josh Ladd

On Wed, Jun 21, 2017 at 8:25 PM, Artem Polyakov 
<artpo...@gmail.com> wrote:
Brian, I'm going to push for the fix tonight. If won't work - we will do as you 
advised.

2017-06-21 17:23 GMT-07:00 Barrett, Brian via devel 
<devel@lists.open-mpi.org>:
In the meantime, is it possible to disable the jobs that listen for pull 
requests on Open MPI’s repos?  I’m trying to get people out of the habit of 
ignoring CI results, so no results are better than failed results :/.

Brian

> On Jun 21, 2017, at 1:49 PM, Jeff Squyres (jsquyres) 
> <jsquy...@cisco.com> wrote:
>
> Thanks Josh.
>
>> On Jun 21, 2017, at 2:18 PM, Joshua Ladd 
>> <jladd.m...@gmail.com> wrote:
>>
>> OMPI Developers,
>>
>> We are aware of the issue currently affecting the Mellanox Jenkins servers. 
>> The issue is being addressed and we hope it will be resolved soon. We 
>> apologize for the inconvenience and thank you for your patience.
>>
>> Best,
>>
>> Josh Ladd
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com<mailto:jsquy...@cisco.com>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel




--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] Mellanox Jenkins

2017-06-21 Thread Barrett, Brian via devel
In the meantime, is it possible to disable the jobs that listen for pull 
requests on Open MPI’s repos?  I’m trying to get people out of the habit of 
ignoring CI results, so no results are better than failed results :/.

Brian

> On Jun 21, 2017, at 1:49 PM, Jeff Squyres (jsquyres) wrote:
> 
> Thanks Josh.
> 
>> On Jun 21, 2017, at 2:18 PM, Joshua Ladd  wrote:
>> 
>> OMPI Developers,
>> 
>> We are aware of the issue currently affecting the Mellanox Jenkins servers. 
>> The issue is being addressed and we hope it will be resolved soon. We 
>> apologize for the inconvenience and thank you for your patience.
>> 
>> Best,
>> 
>> Josh Ladd
>> 
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] SLURM 17.02 support

2017-06-19 Thread Barrett, Brian via devel
By the way, there was a change between 2.x and 3.0.x:

2.x:

Hello, world, I am 0 of 1, (Open MPI v2.1.2a1, package: Open MPI 
bbarrett@ip-172-31-64-10 Distribution, ident: 2.1.2a1, repo rev: 
v2.1.1-59-gdc049e4, Unreleased developer copy, 148)
Hello, world, I am 0 of 1, (Open MPI v2.1.2a1, package: Open MPI 
bbarrett@ip-172-31-64-10 Distribution, ident: 2.1.2a1, repo rev: 
v2.1.1-59-gdc049e4, Unreleased developer copy, 148)


3.0.x:

% srun  -n 2 ./hello_c
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[ip-172-31-64-100:72545] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[ip-172-31-64-100:72546] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
srun: error: ip-172-31-64-100: tasks 0-1: Exited with exit code 1

Don’t think it really matters, since v2.x probably wasn’t what the customer 
wanted.
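
For reference, a minimal sketch of a build that picks up SLURM's PMI support (the install prefix and the PMI location are assumptions; adjust for your system):

# Sketch only -- paths are assumptions, not a tested recipe.
./configure --prefix=/opt/openmpi --with-pmi=/usr
make -j 8 && make install

# With PMI support compiled in, srun can launch the job directly:
salloc -N 1 -n 2
srun -n 2 ./hello_c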

Brian

On Jun 19, 2017, at 7:18 AM, Howard Pritchard wrote:

Hi Ralph

I think the alternative you mention below should suffice.

Howard

r...@open-mpi.org wrote on Mon., Jun. 19, 2017 at 07:24:
So what you guys want is for me to detect that no opal/pmix framework 
components could run, detect that we are in a slurm job, and so print out an 
error message saying “hey dummy - you didn’t configure us with slurm pmi 
support”?

It means embedding slurm job detection code in the heart of ORTE (as opposed to 
in a component), which bothers me a bit.

As an alternative, what if I print out a generic “you didn’t configure us with 
pmi support for this environment” instead of the “pmix select failed” message? 
I can mention how to configure the support in a general way, but it avoids 
having to embed slurm detection into ORTE outside of a component.

> On Jun 16, 2017, at 8:39 AM, Jeff Squyres (jsquyres) wrote:
>
> +1 on the error message.
>
>
>
>> On Jun 16, 2017, at 10:06 AM, Howard Pritchard wrote:
>>
>> Hi Ralph
>>
>> I think a helpful  error message would suffice.
>>
>> Howard
>>
>> r...@open-mpi.org wrote on Tue., Jun. 13, 2017 at 11:15:
>> Hey folks
>>
>> Brian brought this up today on the call, so I spent a little time 
>> investigating. After installing SLURM 17.02 (with just --prefix as config 
>> args), I configured OMPI with just --prefix config args. Getting an 
>> allocation and then executing “srun ./hello” failed, as expected.
>>
>> However, configuring OMPI --with-pmi= resolved the problem. 
>> SLURM continues to default to PMI-1, and so we pick that option up and use 
>> it. Everything works fine.
>>
>> FWIW: I also went back and checked using SLURM 15.08 and got the identical 
>> behavior.
>>
>> So the issue is: we don’t pick up PMI support by default, and never have due 
>> to the SLURM license issue. Thus, we have always required that the user 
>> explicitly configure --with-pmi so they take responsibility for the license. 
>> This is an acknowledged way of avoiding having GPL pull OMPI under its 
>> umbrella as it is the user, and not the OMPI community, that is making the 
>> link.
>>
>> I’m not sure there is anything we can or should do about this, other than 
>> perhaps providing a nicer error message. Thoughts?
>> Ralph
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel



Re: [OMPI devel] Master MTT results

2017-06-01 Thread Barrett, Brian via devel
Also, we merged the RTE changes into 3.0.x this morning, so we should see how 
that works tonight on MTT.  Thanks for all the work, Ralph!

Brian

> On Jun 1, 2017, at 8:34 AM, r...@open-mpi.org wrote:
> 
> Hey folks
> 
> I scanned the nightly MTT results from last night on master, and the RTE 
> looks pretty solid. However, there are a LOT of onesided segfaults occurring, 
> and I know that will eat up people’s disk space.
> 
> Just wanted to ensure folks were aware of the problem
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Time to remove Travis?

2017-06-01 Thread Barrett, Brian via devel
+1

On Jun 1, 2017, at 7:36 AM, Howard Pritchard wrote:

I vote for removal too.

Howard
r...@open-mpi.org wrote on Thu., Jun. 1, 2017 at 08:10:
I’d vote to remove it - it’s too unreliable anyway

> On Jun 1, 2017, at 6:30 AM, Jeff Squyres (jsquyres) wrote:
>
> Is it time to remove Travis?
>
> I believe that the Open MPI PRB now covers all the modern platforms that 
> Travis covers, and we have people actively maintaining all of the machines / 
> configurations being used for CI.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-31 Thread Barrett, Brian via devel

> On May 31, 2017, at 7:52 AM, r...@open-mpi.org wrote:
> 
>> On May 31, 2017, at 7:48 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>> wrote:
>> 
>> On May 30, 2017, at 11:37 PM, Barrett, Brian via devel 
>> <devel@lists.open-mpi.org> wrote:
>>> 
>>> We have now created a v3.0.x branch based on today’s v3.x branch.  I’ve 
>>> reset all outstanding v3.x PRs to the v3.0.x branch.  No one has 
>>> permissions to pull into the v3.x branch, although I’ve left it in place 
>>> for a couple of weeks so that people can slowly update their local git 
>>> repositories.  
>> 
>> A thought on this point...
>> 
>> I'm kinda in favor of ripping off the band aid and deleting the 
>> old/stale/now-unwritable v3.x branch in order to force everyone to update to 
>> the new branch name ASAP.
>> 
>> Thoughts?
> 
> FWIW: Brian very kindly already re-pointed all the existing PRs to the new 
> branch.

Yes, I should have noted that in my original email.  That solves the existing 
PR problem, but everyone still has a bit of work to do if they had local 
changes to a branch based on v3.x.  In theory, it shouldn’t be much work to 
clean all that up.  But theory and practice don’t always match when using git 
:).
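
For anyone who did have local work based on the old branch, a rough sketch of one way to move it over (branch and commit names are placeholders):

# Sketch only -- branch and commit names are placeholders.
git fetch origin
git checkout -b my-fix-v3.0.x origin/v3.0.x
git cherry-pick <your commits>
# or: git rebase --onto origin/v3.0.x origin/v3.x my-fix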

Brian

Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-30 Thread Barrett, Brian via devel
We have now created a v3.0.x branch based on today’s v3.x branch.  I’ve reset 
all outstanding v3.x PRs to the v3.0.x branch.  No one has permissions to pull 
into the v3.x branch, although I’ve left it in place for a couple of weeks so 
that people can slowly update their local git repositories.  The nightly 
tarballs are still building, but everything else should be setup.  Yell when 
you figure out what I missed.

Brian

On May 30, 2017, at 2:44 PM, Barrett, Brian 
<bbarr...@amazon.com<mailto:bbarr...@amazon.com>> wrote:

For various reasons, the rename didn’t happen on Saturday.  It will, instead, 
happen tonight at 6:30pm PDT.  Sorry for the delay!

Brian

On May 23, 2017, at 9:38 PM, Barrett, Brian 
<bbarr...@amazon.com<mailto:bbarr...@amazon.com>> wrote:

All -

Per the discussion on today’s telecon, we’re going to rename the branch 
Saturday (5/27) morning (Pacific time).  We’ll branch v3.0.x from v3.x and 
update all the nightly builds and web pages.  I’m going to push through all the 
PRs on 3.x which are currently outstanding, but please be careful about pushing 
together complex PRs on 3.x for the rest of the week.  If something is 
submitted before Saturday and doesn’t make it due to reviews, you’ll have to 
resubmit.

Brian

On May 5, 2017, at 4:21 PM, r...@open-mpi.org<mailto:r...@open-mpi.org> wrote:

+1 Go for it :-)

On May 5, 2017, at 2:34 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:

To be clear, we’d do the move all at once on Saturday morning.  Things that 
would change:

1) nightly tarballs would rename from openmpi-v3.x--.tar.gz to 
openmpi-v3.0.x--.tar.gz
2) nightly tarballs would build from v3.0.x, not v3.x branch
3) PRs would need to be filed against v3.0.x
4) Both https://www.open-mpi.org/nightly/v3.x/ and 
https://www.open-mpi.org/nightly/v3.0.x/ would work for searching for new 
nightly tarballs

At some point in the future (say, two weeks), (4) would change, and only 
https://www.open-mpi.org/nightly/v3.0.x/ would work.  Otherwise, we need to 
have a coordinated name switch, which seems way harder than it needs to be.  
MTT, for example, requires a configured directory for nightlies, but as long as 
the latest_tarball.txt is formatted correctly, everything else works fine.

Brian

On May 5, 2017, at 2:26 PM, Paul Hargrove 
<phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>> wrote:

As a maintainer of non-MTT scripts that need to know the layout of the 
directories containing nightly and RC tarballs, I also think that all the changes 
should be done soon (and all together, not spread over months).

-Paul

On Fri, May 5, 2017 at 2:16 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
If we rebranch from master for every "major" release it makes sense to rename 
the branch. In the long term renaming seems like the way to go, and thus the 
pain of altering everything that depends on the naming will exist at some 
point. I'm in favor of doing it asap (but I have no stakes in the game as UTK 
does not have an MTT).

  George.



On Fri, May 5, 2017 at 1:53 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:
Hi everyone -

We’ve been having discussions among the release managers about the choice of 
naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).  Because 
the current plan is that each “major” release (in the sense of the three 
release points from master per year, not necessarily in increasing the major 
number of the release number) is to rebranch off of master, there’s a feeling 
that we should have named the branch v3.0.x, and then named the next one 3.1.x, 
and so on.  If that’s the case, we should consider renaming the branch and all 
the things that depend on the branch (web site, which Jeff has already 
half-done; MTT testing; etc.).  The disadvantage is that renaming will require 
everyone who’s configured MTT to update their test configs.

The first question is should we rename the branch?  While there would be some 
ugly, there’s nothing that really breaks long term if we don’t.  Jeff has 
stronger feelings than I have here.

If we are going to rename the branch from v3.x to v3.0.x, my proposal would be 
that we do it next Saturday evening (May 13th).  I’d create a new branch from 
the current state of v3.x and then delete the old branch.  We’d try to push all 
the PRs Friday so that there were no outstanding PRs that would have to be 
reopened.  We’d then bug everyone to update their nightly testing to pull from 
a different URL and update their MTT configs.  After a week or two, we’d stop 
having tarballs available at both v3.x and v3.0.x on the Open MPI web page.

Thoughts?

Brian

Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-30 Thread Barrett, Brian via devel
For various reasons, the rename didn’t happen on Saturday.  It will, instead, 
happen tonight at 6:30pm PDT.  Sorry for the delay!

Brian

On May 23, 2017, at 9:38 PM, Barrett, Brian 
<bbarr...@amazon.com<mailto:bbarr...@amazon.com>> wrote:

All -

Per the discussion on today’s telecon, we’re going to rename the branch 
Saturday (5/27) morning (Pacific time).  We’ll branch v3.0.x from v3.x and 
update all the nightly builds and web pages.  I’m going to push through all the 
PRs on 3.x which are currently outstanding, but please be careful about pushing 
together complex PRs on 3.x for the rest of the week.  If something is 
submitted before Saturday and doesn’t make it due to reviews, you’ll have to 
resubmit.

Brian

On May 5, 2017, at 4:21 PM, r...@open-mpi.org<mailto:r...@open-mpi.org> wrote:

+1 Go for it :-)

On May 5, 2017, at 2:34 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:

To be clear, we’d do the move all at once on Saturday morning.  Things that 
would change:

1) nightly tarballs would rename from openmpi-v3.x--.tar.gz to 
openmpi-v3.0.x--.tar.gz
2) nightly tarballs would build from v3.0.x, not v3.x branch
3) PRs would need to be filed against v3.0.x
4) Both https://www.open-mpi.org/nightly/v3.x/ and 
https://www.open-mpi.org/nightly/v3.0.x/ would work for searching for new 
nightly tarballs

At some point in the future (say, two weeks), (4) would change, and only 
https://www.open-mpi.org/nightly/v3.0.x/ would work.  Otherwise, we need to 
have a coordinated name switch, which seems way harder than it needs to be.  
MTT, for example, requires a configured directory for nightlies, but as long as 
the latest_tarball.txt is formatted correctly, everything else works fine.

Brian

On May 5, 2017, at 2:26 PM, Paul Hargrove 
<phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>> wrote:

As a maintainer of non-MTT scripts that need to know the layout of the 
directories containing nightly and RC tarballs, I also think that all the changes 
should be done soon (and all together, not spread over months).

-Paul

On Fri, May 5, 2017 at 2:16 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
If we rebranch from master for every "major" release it makes sense to rename 
the branch. In the long term renaming seems like the way to go, and thus the 
pain of altering everything that depends on the naming will exist at some 
point. I'm in favor of doing it asap (but I have no stakes in the game as UTK 
does not have an MTT).

  George.



On Fri, May 5, 2017 at 1:53 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:
Hi everyone -

We’ve been having discussions among the release managers about the choice of 
naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).  Because 
the current plan is that each “major” release (in the sense of the three 
release points from master per year, not necessarily in increasing the major 
number of the release number) is to rebranch off of master, there’s a feeling 
that we should have named the branch v3.0.x, and then named the next one 3.1.x, 
and so on.  If that’s the case, we should consider renaming the branch and all 
the things that depend on the branch (web site, which Jeff has already 
half-done; MTT testing; etc.).  The disadvantage is that renaming will require 
everyone who’s configured MTT to update their test configs.

The first question is should we rename the branch?  While there would be some 
ugly, there’s nothing that really breaks long term if we don’t.  Jeff has 
stronger feelings than I have here.

If we are going to rename the branch from v3.x to v3.0.x, my proposal would be 
that we do it next Saturday evening (May 13th).  I’d create a new branch from 
the current state of v3.x and then delete the old branch.  We’d try to push all 
the PRs Friday so that there were no outstanding PRs that would have to be 
reopened.  We’d then bug everyone to update their nightly testing to pull from 
a different URL and update their MTT configs.  After a week or two, we’d stop 
having tarballs available at both v3.x and v3.0.x on the Open MPI web page.

Thoughts?

Brian



--
Paul H. Hargrove  
phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-23 Thread Barrett, Brian via devel
All -

Per the discussion on today’s telecon, we’re going to rename the branch 
Saturday (5/27) morning (Pacific time).  We’ll branch v3.0.x from v3.x and 
update all the nightly builds and web pages.  I’m going to push through all the 
PRs on 3.x which are currently outstanding, but please be careful about pushing 
together complex PRs on 3.x for the rest of the week.  If something is 
submitted before Saturday and doesn’t make it due to reviews, you’ll have to 
resubmit.

Brian

On May 5, 2017, at 4:21 PM, r...@open-mpi.org<mailto:r...@open-mpi.org> wrote:

+1 Go for it :-)

On May 5, 2017, at 2:34 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:

To be clear, we’d do the move all at once on Saturday morning.  Things that 
would change:

1) nightly tarballs would rename from openmpi-v3.x--.tar.gz to 
openmpi-v3.0.x--.tar.gz
2) nightly tarballs would build from v3.0.x, not v3.x branch
3) PRs would need to be filed against v3.0.x
4) Both https://www.open-mpi.org/nightly/v3.x/ and 
https://www.open-mpi.org/nightly/v3.0.x/ would work for searching for new 
nightly tarballs

At some point in the future (say, two weeks), (4) would change, and only 
https://www.open-mpi.org/nightly/v3.0.x/ would work.  Otherwise, we need to 
have a coordinated name switch, which seems way harder than it needs to be.  
MTT, for example, requires a configured directory for nightlies, but as long as 
the latest_tarball.txt is formatted correctly, everything else works fine.

Brian

On May 5, 2017, at 2:26 PM, Paul Hargrove 
<phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>> wrote:

As a maintainer of non-MTT scripts that need to know the layout of the 
directories containing nightly and RC tarballs, I also think that all the changes 
should be done soon (and all together, not spread over months).

-Paul

On Fri, May 5, 2017 at 2:16 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
If we rebranch from master for every "major" release it makes sense to rename 
the branch. In the long term renaming seems like the way to go, and thus the 
pain of altering everything that depends on the naming will exist at some 
point. I'm in favor of doing it asap (but I have no stakes in the game as UTK 
does not have an MTT).

  George.



On Fri, May 5, 2017 at 1:53 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:
Hi everyone -

We’ve been having discussions among the release managers about the choice of 
naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).  Because 
the current plan is that each “major” release (in the sense of the three 
release points from master per year, not necessarily in increasing the major 
number of the release number) is to rebranch off of master, there’s a feeling 
that we should have named the branch v3.0.x, and then named the next one 3.1.x, 
and so on.  If that’s the case, we should consider renaming the branch and all 
the things that depend on the branch (web site, which Jeff has already 
half-done; MTT testing; etc.).  The disadvantage is that renaming will require 
everyone who’s configured MTT to update their test configs.

The first question is should we rename the branch?  While there would be some 
ugly, there’s nothing that really breaks long term if we don’t.  Jeff has 
stronger feelings than I have here.

If we are going to rename the branch from v3.x to v3.0.x, my proposal would be 
that we do it next Saturday evening (May 13th).  I’d create a new branch from 
the current state of v3.x and then delete the old branch.  We’d try to push all 
the PRs Friday so that there were no outstanding PRs that would have to be 
reopened.  We’d then bug everyone to update their nightly testing to pull from 
a different URL and update their MTT configs.  After a week or two, we’d stop 
having tarballs available at both v3.x and v3.0.x on the Open MPI web page.

Thoughts?

Brian



--
Paul H. Hargrove  
phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-05 Thread Barrett, Brian via devel
To be clear, we’d do the move all at once on Saturday morning.  Things that 
would change:

1) nightly tarballs would rename from openmpi-v3.x--.tar.gz to 
openmpi-v3.0.x--.tar.gz
2) nightly tarballs would build from v3.0.x, not v3.x branch
3) PRs would need to be filed against v3.0.x
4) Both https://www.open-mpi.org/nightly/v3.x/ and 
https://www.open-mpi.org/nightly/v3.0.x/ would work for searching for new 
nightly tarballs

At some point in the future (say, two weeks), (4) would change, and only 
https://www.open-mpi.org/nightly/v3.0.x/ would work.  Otherwise, we need to 
have a coordinated name switch, which seems way harder than it needs to be.  
MTT, for example, requires a configured directory for nightlies, but as long as 
the latest_tarball.txt is formatted correctly, everything else works fine.

Brian

On May 5, 2017, at 2:26 PM, Paul Hargrove 
<phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>> wrote:

As a maintainer of non-MTT scripts that need to know the layout of the 
directories containing nightly and RC tarballs, I also think that all the changes 
should be done soon (and all together, not spread over months).

-Paul

On Fri, May 5, 2017 at 2:16 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
If we rebranch from master for every "major" release it makes sense to rename 
the branch. In the long term renaming seems like the way to go, and thus the 
pain of altering everything that depends on the naming will exist at some 
point. I'm in favor of doing it asap (but I have no stakes in the game as UTK 
does not have an MTT).

  George.



On Fri, May 5, 2017 at 1:53 PM, Barrett, Brian via devel 
<devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>> wrote:
Hi everyone -

We’ve been having discussions among the release managers about the choice of 
naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).  Because 
the current plan is that each “major” release (in the sense of the three 
release points from master per year, not necessarily in increasing the major 
number of the release number) is to rebranch off of master, there’s a feeling 
that we should have named the branch v3.0.x, and then named the next one 3.1.x, 
and so on.  If that’s the case, we should consider renaming the branch and all 
the things that depend on the branch (web site, which Jeff has already 
half-done; MTT testing; etc.).  The disadvantage is that renaming will require 
everyone who’s configured MTT to update their test configs.

The first question is should we rename the branch?  While there would be some 
ugly, there’s nothing that really breaks long term if we don’t.  Jeff has 
stronger feelings than I have here.

If we are going to rename the branch from v3.x to v3.0.x, my proposal would be 
that we do it next Saturday evening (May 13th).  I’d create a new branch from 
the current state of v3.x and then delete the old branch.  We’d try to push all 
the PRs Friday so that there were no outstanding PRs that would have to be 
reopened.  We’d then bug everyone to update their nightly testing to pull from 
a different URL and update their MTT configs.  After a week or two, we’d stop 
having tarballs available at both v3.x and v3.0.x on the Open MPI web page.

Thoughts?

Brian
___
devel mailing list
devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


___
devel mailing list
devel@lists.open-mpi.org<mailto:devel@lists.open-mpi.org>
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel



--
Paul H. Hargrove  
phhargr...@lbl.gov<mailto:phhargr...@lbl.gov>
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

[OMPI devel] Open MPI 3.x branch naming

2017-05-05 Thread Barrett, Brian via devel
Hi everyone -

We’ve been having discussions among the release managers about the choice of 
naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).  Because 
the current plan is that each “major” release (in the sense of the three 
release points from master per year, not necessarily in increasing the major 
number of the release number) is to rebranch off of master, there’s a feeling 
that we should have named the branch v3.0.x, and then named the next one 3.1.x, 
and so on.  If that’s the case, we should consider renaming the branch and all 
the things that depend on the branch (web site, which Jeff has already 
half-done; MTT testing; etc.).  The disadvantage is that renaming will require 
everyone who’s configured MTT to update their test configs.

The first question is should we rename the branch?  While there would be some 
ugly, there’s nothing that really breaks long term if we don’t.  Jeff has 
stronger feelings than I have here.

If we are going to rename the branch from v3.x to v3.0.x, my proposal would be 
that we do it next Saturday evening (May 13th).  I’d create a new branch from 
the current state of v3.x and then delete the old branch.  We’d try to push all 
the PRs Friday so that there were no outstanding PRs that would have to be 
reopened.  We’d then bug everyone to update their nightly testing to pull from 
a different URL and update their MTT configs.  After a week or two, we’d stop 
having tarballs available at both v3.x and v3.0.x on the Open MPI web page.

Thoughts?

Brian

[OMPI devel] 3.0 merge backlog

2017-05-05 Thread Barrett, Brian via devel
Hi all -

Howard’s out of office this week and I got swamped with a couple of internal 
issues, so we’ve been behind in getting merges pulled into 3.x.  I merged a 
batch this morning and am going to let Jenkins / MTT catch up with testing.  
Assuming testing looks good, I’ll do another batch Saturday and Sunday and we 
should be caught up by Monday morning.  Sorry for the backlog.

Brian

Re: [OMPI devel] [open-mpi/ompi] ompi/opal: add support for HDR and NDR link speeds (#3434)

2017-04-28 Thread Barrett, Brian via devel
Just a reminder from my email last night...  The Pull Request bots are going to 
be a little messed up until Howard and I finish the great Java 8 update of 2017 
(hopefully another couple hours).


Brian



From: Jeff Squyres 
Sent: Friday, April 28, 2017 7:26 AM
To: open-mpi/ompi
Cc: Subscribed
Subject: Re: [open-mpi/ompi] ompi/opal: add support for HDR and NDR link speeds 
(#3434)


bot:retest


[OMPI devel] TCP BTL's multi-link behavior

2017-04-26 Thread Barrett, Brian via devel
George -

Do you remember why you adjusted both the latency and bandwidth for secondary 
links when using multi-link support with the TCP BTL [1]?  I think I understand 
why, but your memory of 10 years ago is hopefully more accurate than my reverse 
engineering ;).
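
(For anyone wanting to poke at that code path, a sketch of how a multi-link setup gets exercised at run time; the interface names and link count are assumptions for the example.)

# Sketch only -- interface names are assumptions.
mpirun -n 2 --mca btl tcp,self \
       --mca btl_tcp_if_include eth0,eth1 \
       --mca btl_tcp_links 2 ./ring_c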

Thanks,

Brian



[1] 
https://github.com/open-mpi/ompi/blame/master/opal/mca/btl/tcp/btl_tcp_component.c#L497


Re: [OMPI devel] external hwloc causing libevent problems?

2017-04-07 Thread Barrett, Brian via devel
Yeah, that makes sense.  It’s also an easy enough workaround that I’m 
comfortable with that.
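
To spell out the two configurations being compared below (a sketch; assumes hwloc and libevent are both installed under /usr):

# Sketch of the scenario Gilles describes; /usr is an assumption.
./configure --with-hwloc=external   # external hwloc, internal libevent; no -I/usr/include added
./configure --with-hwloc=/usr       # adds -I/usr/include, so the system event.h can shadow the internal libevent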

Brian

> On Apr 6, 2017, at 5:38 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Brian,
> 
> there used to be two hwloc.h : the one from the hwloc libraries, and
> the one from the OMPI MCA framework.
> as a side effect, we could not --with-hwloc=external, but we had to
> --with-hwloc=/usr, which leads to add
> -I/usr/include to the CPPFLAGS.
> so you might end up having
> CPPFLAGS=-I/usr/include -I/.../internal_libevent/include ...
> which means that if libevent is also installed in /usr, you do not use
> the internal libevent as you expected.
> 
> Jeff avoided this scenario by having the option to rename the
> framework header file (e.g. opal/mca/hwloc/hwloc.h was moved to
> hwloc-internal.h), so --with-hwloc=external works now, and there is no
> more -I/usr/include in the CPPFLAGS
> 
> I am not sure we can run into this issue with libevent or pmix, since
> their base components do not need to #include
> the external pmix.h or event.h.
> 
> note this is not 100% bulletproof yet
> if you have both hwloc and libevent in /opt/include and /opt/lib, you
> have to --with-hwloc=/opt and you will likely
> use the external libevent too
> 
> 
> Cheers,
> 
> Gilles
> 
> On Thu, Apr 6, 2017 at 3:59 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> It could be that a fairly recent change was made to fix that conflict. I 
>> believe Jeff and Gilles modified the internal hwloc header name(s) to ensure 
>> we got the right one of those. However, that didn’t get done for libevent 
>> and/or pmix, so the conflict likely still resides there.
>> 
>>> On Apr 5, 2017, at 11:56 AM, Barrett, Brian via devel <devel@lists.open-mpi.org> wrote:
>>> 
>>> Right; that’s what I built.  I had both libevent and hwloc installed in 
>>> /usr.  I configured with —with-hwloc=external.  It built against the 
>>> external hwloc and the internal libevent.  So there must be some slight 
>>> variation on the theme I’m missing.
>>> 
>>> Brian
>>> 
>>>> On Apr 5, 2017, at 11:36 AM, r...@open-mpi.org wrote:
>>>> 
>>>> Not quite the problem I mentioned. The problem arises if you want external 
>>>> hwloc, but internal libevent - and both have external versions in (say) 
>>>> /usr. If you point hwloc there, then the -I and -L flags will cause us to 
>>>> pull in the /usr libevent versions instead of the internal ones - and 
>>>> havoc ensues.
>>>> 
>>>> 
>>>>> On Apr 5, 2017, at 11:30 AM, Barrett, Brian via devel 
>>>>> <devel@lists.open-mpi.org> wrote:
>>>>> 
>>>>> All -
>>>>> 
>>>>> On the telecon yesterday, there was discussion of external hwloc causing 
>>>>> problems if libevent was also installed in the same location.  Does 
>>>>> anyone have details on exactly what the failure mode is?  I tried what I 
>>>>> think is the issue (./configure —with-hwloc=external with libevent 
>>>>> installed in /usr/ as well) and everything built / works fine.  I checked 
>>>>> the library paths and include and everything looks normal, so I must have 
>>>>> the wrong scenario.  Hints?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Brian
>>>>> ___
>>>>> devel mailing list
>>>>> devel@lists.open-mpi.org
>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>> 
>>>> ___
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>> 
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

  1   2   >