[MTT devel] MTT docs now online

2018-07-12 Thread r...@open-mpi.org
We finally got the bugs out, and the documentation for the Python MTT 
implementation is now online at https://open-mpi.github.io/mtt.
It also picked up the Perl stuff, but we’ll just ignore that little detail :-)

Thanks to Akshaya Jagannadharao for all the hard work to make that happen!

Ralph

___
mtt-devel mailing list
mtt-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/mtt-devel

Re: [OMPI devel] Odd warning in OMPI v3.0.x

2018-07-06 Thread r...@open-mpi.org
OK, I’ll fix it.

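For reference, a minimal sketch of the distinction Nathan points out below. The 
variable names are illustrative (not the actual ompi_mpi_init.c code), and the 
fragment assumes it is compiled inside an OMPI v3.x.x source tree:

#include "opal/sys/atomic.h"

static volatile int32_t state = 0;

static void example(void)
{
    int32_t expected = 0;   /* value we believe "state" currently holds */
    int32_t desired  = 1;   /* value we want to install */

    /* In v3.x.x, opal_atomic_cmpset_32() takes the expected value BY VALUE
     * (an int32_t); passing &expected here is what triggers the
     * -Wint-conversion warning. Master's opal_atomic_compare_exchange_strong_32()
     * takes a pointer to the expected value instead. */
    (void) opal_atomic_cmpset_32(&state, expected, desired);
}
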
> On Jul 6, 2018, at 3:09 PM, Nathan Hjelm via devel  
> wrote:
> 
> Looks like a bug to me. The second argument should be a value in v3.x.x.
> 
> -Nathan
> 
>> On Jul 6, 2018, at 4:00 PM, r...@open-mpi.org wrote:
>> 
>> I’m seeing this when building the v3.0.x branch:
>> 
>> runtime/ompi_mpi_init.c:395:49: warning: passing argument 2 of 
>> ‘opal_atomic_cmpset_32’ makes integer from pointer without a cast 
>> [-Wint-conversion]
>> if (!opal_atomic_cmpset_32(&ompi_mpi_state, &expected, desired)) {
>> ^
>> In file included from ../opal/include/opal/sys/atomic.h:159:0,
>> from ../opal/threads/thread_usage.h:30,
>> from ../opal/class/opal_object.h:126,
>> from ../opal/class/opal_list.h:73,
>> from runtime/ompi_mpi_init.c:43:
>> ../opal/include/opal/sys/x86_64/atomic.h:85:19: note: expected ‘int32_t {aka 
>> int}’ but argument is of type ‘int32_t * {aka int *}’
>> static inline int opal_atomic_cmpset_32( volatile int32_t *addr,
>>   ^
>> 
>> 
>> I have a feeling this isn’t correct - yes?
>> Ralph
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Odd warning in OMPI v3.0.x

2018-07-06 Thread r...@open-mpi.org
I’m seeing this when building the v3.0.x branch:

runtime/ompi_mpi_init.c:395:49: warning: passing argument 2 of 
‘opal_atomic_cmpset_32’ makes integer from pointer without a cast 
[-Wint-conversion]
 if (!opal_atomic_cmpset_32(&ompi_mpi_state, &expected, desired)) {
 ^
In file included from ../opal/include/opal/sys/atomic.h:159:0,
 from ../opal/threads/thread_usage.h:30,
 from ../opal/class/opal_object.h:126,
 from ../opal/class/opal_list.h:73,
 from runtime/ompi_mpi_init.c:43:
../opal/include/opal/sys/x86_64/atomic.h:85:19: note: expected ‘int32_t {aka 
int}’ but argument is of type ‘int32_t * {aka int *}’
 static inline int opal_atomic_cmpset_32( volatile int32_t *addr,
   ^


I have a feeling this isn’t correct - yes?
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Fwd: [pmix] Release candidates available for testing

2018-07-01 Thread r...@open-mpi.org
FYI - v3.0.0 will go into master for the OMPI v4 branch; v2.1.2 should go into 
updates for the OMPI v3.1 and v3.0 branches.

Ralph


> Begin forwarded message:
> 
> From: "r...@open-mpi.org" 
> Subject: [pmix] Release candidates available for testing
> Date: June 29, 2018 at 8:54:01 AM PDT
> To: pmix 
> Reply-To: p...@googlegroups.com
> 
> Hello folks
> 
> Release candidates for v2.1.2 and v3.0.0 have been posted: 
> https://github.com/pmix/pmix/releases/
> 
> Please test them!
> Ralph
> 
> v3.0.0
> --
> This is the start of a new release series based on the PMIx v3 standard.
> 
>  NOTE: This release implements the complete PMIX v3.0 Standard
>  and therefore includes a number of new APIs and features. These
>  can be tracked by their RFC's on the community website:
>  https://pmix.org/pmix-standard.
> 
>   • Added blocking forms of several existing APIs:
>   • PMIx_Log
>   • PMIx_Allocation_request
>   • PMIx_Job_control
>   • PMIx_Process_monitor
>   • Added support for getting/validating security credentials
>   • PMIx_Get_credential, PMIx_Validate_credential
>   • Extended support for debuggers/tools
>   • Added IO forwarding support allowing tools to request
>   forwarding of output from specific application procs,
>   and to forward their input to specified target procs
>   • Extended tool attributes to support synchronization
>   during startup of applications. This includes the
>   ability to modify an application's environment
>   (including support for LD_PRELOAD) and define an
>   alternate fork/exec agent
>   • Added ability for a tool to switch server connections
>   so it can first connect to a system-level server to
>   launch a starter program, and then reconnect to that
>  starter for debugging purposes
>   • Extended network support to collect network inventory by
>   either rolling it up from individual nodes or by direct
>   query of fabric managers. Added an API by which the
>   host can inject any rolled up inventory into the local
>   PMIx server. Applications and/or the host RM can access
>   the inventory via the PMIx_Query function.
>   • Added the ability for applications and/or tools to register
>   files and directories for cleanup upon their termination
>   • Added support for inter-library coordination within a process
>   • Extended PMIx_Log support by adding plugin support for new
>   channels, including local/remote syslog and email. Added
>   attributes to query available channels and to tag and
>   format output.
>   • Fix several memory and file descriptor leaks
> 
> 
> 
> v2.1.2
> 
> This is a bug fix release in the v2.1 series:
> 
>   • Added PMIX_VERSION_RELEASE string to pmix_version.h
>   • Added PMIX_SPAWNED and PMIX_PARENT_ID keys to all procs
>   started via PMIx_Spawn
>   • Fixed faulty compares in PMI/PMI2 tests
>   • Fixed bug in direct modex for data on remote node
>   • Correctly transfer all cached job info to the client's
>   shared memory region upon first connection
>   • Fix potential deadlock in PMIx_server_init in an error case
>   • Fix uninitialized variable
>   • Fix several memory and file descriptor leaks
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "pmix" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to pmix+unsubscr...@googlegroups.com.
> To post to this group, send email to p...@googlegroups.com.
> Visit this group at https://groups.google.com/group/pmix.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/pmix/94222748-5185-442F-8CDA-45CABE5E14A5%40open-mpi.org.
> For more options, visit https://groups.google.com/d/optout.

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI: Undefined reference to pthread_atfork

2018-06-22 Thread r...@open-mpi.org
OMPI 2.1.3??? Is there any way you could update to something more recent?

> On Jun 22, 2018, at 12:28 PM, lille stor  wrote:
> 
> Hi,
> 
>  
> When compiling a C++ source file named test.cpp that needs a shared library 
> named libUtils.so (which in turn needs the Open MPI shared library, hence the 
> parameter -Wl,-rpath-link,/home/dummy/openmpi/build/lib) as follows:
> 
> g++ test.cpp -lUtils -Wl,-rpath-link,/home/dummy/openmpi/build/lib
> 
> an error is thrown: /home/dummy/openmpi/build/lib/libopen-pal.so.20: undefined 
> reference to pthread_atfork.
> 
> I passed -pthread and -lpthread (before and after -lUtils) to g++ but none of 
> these solved the error.
> 
>  
> Environment where this error is thrown:
> 
> OS: Ubuntu 14.04
> Compiler: g++ 4.9
> MPI: Open MPI 2.1.3
>  
> Thank you for your help,
> 
> L.
> 
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] New binding option

2018-06-21 Thread r...@open-mpi.org


> On Jun 21, 2018, at 7:37 AM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> On Jun 21, 2018, at 10:26 AM, r...@open-mpi.org wrote:
>> 
>>>> Alternatively, processes can be assigned to processors based on
>>>> their local rank on a node using the \fI--bind-to cpuset:ordered\fP option
>>>> with an associated \fI--cpu-list "0,2,5"\fP. This directs that the first
>>>> rank on a node be bound to cpu0, the second rank on the node be bound
>>>> to cpu2, and the third rank on the node be bound to cpu5. Note that an
>>>> error will result if more processes are assigned to a node than cpus
>>>> are provided.
>>> 
>>> Question about this: do the CPUs in the list correspond to the Linux 
>>> virtual processor IDs?  E.g., do they correspond to what one would pass to 
>>> numactl(1)?
>> 
>> I didn’t change the meaning of the list - it is still the local cpu ID per 
>> hwloc
>> 
>>> Also, a minor quibble: it might be a little confusing to have --bind-to 
>>> cpuset, and then have to specify a CPU list (not a CPU set).  Should it be 
>>> --cpuset-list or --cpuset?
>> 
>> Your PR is welcome! Historically, that option has always been --cpu-list and 
>> I didn’t change it
> 
> Oh, I see!  I didn't realize / forgot / whatever that --cpu-list is an 
> existing option.
> 
> Let me change my question, then: should "--bind-to cpuset" be changed to 
> "--bind-to cpulist"?  (Or even "cpu-list" to exactly match the existing 
> "--cpu-list" CLI option)  This would be for two reasons:
> 
> 1. Make the terminology agree between the two options.
> 2. Don't use the term "cpuset" because that has a specific meaning in Linux 
> (that isn't tied to hwloc's logical processor IDs)
> 
> (Yes, I'm happy to do a PR to do this)

I don’t think it really matters to the person who requested it, and I don’t 
have any feelings about it - so feel free! Just remember to change the help 
line in schizo_ompi.c and the orterun.1in man page.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] New binding option

2018-06-21 Thread r...@open-mpi.org


> On Jun 21, 2018, at 6:47 AM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> On Jun 21, 2018, at 9:41 AM, r...@open-mpi.org wrote:
>> 
>> Alternatively, processes can be assigned to processors based on
>> their local rank on a node using the \fI--bind-to cpuset:ordered\fP option
>> with an associated \fI--cpu-list "0,2,5"\fP. This directs that the first
>> rank on a node be bound to cpu0, the second rank on the node be bound
>> to cpu2, and the third rank on the node be bound to cpu5. Note that an
>> error will result if more processes are assigned to a node than cpus
>> are provided.
> 
> Question about this: do the CPUs in the list correspond to the Linux virtual 
> processor IDs?  E.g., do they correspond to what one would pass to numactl(1)?

I didn’t change the meaning of the list - it is still the local cpu ID per hwloc

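(A side note, not something the change itself requires: hwloc’s lstopo can show 
the logical PU numbering that the cpu-list refers to, e.g.

    lstopo --only pu

which prints one "PU L#<logical> (P#<physical>)" line per hardware thread.)
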
> 
> Also, a minor quibble: it might be a little confusing to have --bind-to 
> cpuset, and then have to specify a CPU list (not a CPU set).  Should it be 
> --cpuset-list or --cpuset?

Your PR is welcome! Historically, that option has always been --cpu-list and I 
didn’t change it

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] New binding option

2018-06-21 Thread r...@open-mpi.org
Hello all

I have added a new binding option to OMPI master:

Alternatively, processes can be assigned to processors based on
their local rank on a node using the \fI--bind-to cpuset:ordered\fP option
with an associated \fI--cpu-list "0,2,5"\fP. This directs that the first
rank on a node be bound to cpu0, the second rank on the node be bound
to cpu2, and the third rank on the node be bound to cpu5. Note that an
error will result if more processes are assigned to a node than cpus
are provided.
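
As a concrete (hypothetical) invocation - the executable name is just a 
placeholder - this would bind local ranks 0, 1, and 2 to hwloc logical cpus 
0, 2, and 5 respectively:

    mpirun -np 3 --bind-to cpuset:ordered --cpu-list "0,2,5" ./a.out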

Lightly tested at this point, so please let me know if you encounter any issues.

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] ARM failure on PR to master

2018-06-10 Thread r...@open-mpi.org
Now moved to https://github.com/open-mpi/ompi/pull/5258 - same error


> On Jun 8, 2018, at 9:04 PM, r...@open-mpi.org wrote:
> 
> Can someone who knows/cares about ARM perhaps take a look at PR 
> https://github.com/open-mpi/ompi/pull/5247? I’m hitting an error in the ARM 
> CI tests that I can’t understand:
> 
> --> Running example: hello_c
> --
> Failed to create a completion queue (CQ):
> 
> Hostname: juno001
> Requested CQE: 16384
> Error:Cannot allocate memory
> 
> Check the CQE attribute.
> --
> --
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
> 
> Hostname: juno001
> --
> vsetenv PMIX_SERVER_TMPDIR failed
> vsetenv PMIX_SERVER_TMPDIR failed
> vsetenv PMIX_SERVER_TMPDIR failed
> vsetenv PMIX_SERVER_TMPDIR failed
> 
> I get the UD error - that has been around for years since nobody seems to 
> care about or maintain the ud/oob component. What I don’t understand is why 
> setting an envar would fail solely in the ARM environment.
> 
> Could someone maybe at least provide a hint as to what is going on?
> 
> Thanks
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] ARM failure on PR to master

2018-06-08 Thread r...@open-mpi.org
Can someone who knows/cares about ARM perhaps take a look at PR 
https://github.com/open-mpi/ompi/pull/5247? I’m hitting an error in the ARM 
CI tests that I can’t understand:

--> Running example: hello_c
--
Failed to create a completion queue (CQ):

Hostname: juno001
Requested CQE: 16384
Error:Cannot allocate memory

Check the CQE attribute.
--
--
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: juno001
--
vsetenv PMIX_SERVER_TMPDIR failed
vsetenv PMIX_SERVER_TMPDIR failed
vsetenv PMIX_SERVER_TMPDIR failed
vsetenv PMIX_SERVER_TMPDIR failed

I get the UD error - that has been around for years since nobody seems to care 
about or maintain the ud/oob component. What I don’t understand is why setting 
an envar would fail solely in the ARM environment.

Could someone maybe at least provide a hint as to what is going on?

Thanks
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] PRRTE+OMPI status

2018-06-07 Thread r...@open-mpi.org
Hi folks

I now have it so that you can run MTT using OMPI against PRRTE. Current results 
look promising:

+-------------+-----------------+-------------+----------+------+------+----------+------+---------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                           |
+-------------+-----------------+-------------+----------+------+------+----------+------+---------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:01    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:36    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:00    | 2    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 05:08    | 389  | 2    | 1        |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 01:55    | 3    | 4    | 1        |      | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 00:00    |      | 1    |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:00    | 11   | 8    |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+---------------------------------------------------------------------------+

We hit a few errors at the end that might be related to a remaining memory leak 
issue that Artem and Boris are addressing. The other failures will need to be 
investigated as time permits.

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Remove prun tool from OMPI?

2018-06-06 Thread r...@open-mpi.org
I have renamed prun for now - will do the update in a bit


> On Jun 5, 2018, at 12:20 PM, Thomas Naughton  wrote:
> 
> 
>> On Tue, 5 Jun 2018, r...@open-mpi.org wrote:
> 
>> 
>> 
>>> On Jun 5, 2018, at 11:59 AM, Thomas Naughton  wrote:
>>> Hi Ralph,
>>>> All it means is that PRRTE users must be careful to have PRRTE before OMPI 
>>>> in their path values. Otherwise, they get the wrong “prun” and it fails. I 
>>>> suppose I could update the “prun” in OMPI to match the one in PRRTE, if 
>>>> that helps - there isn’t anything incompatible between ORTE and PRRTE. 
>>>> Would that make sense?
>>> Yes, if updating "OMPI prun" with latest "PRRTE prun" works ok, that
>>> seems like a reasonable way to keep DVM for OMPI usage.
>>> I agree that it does seem likely that users could easily get the wrong
>>> 'prun' but this may be something that falls out in future (based on
>>> discussion on call today).
>>> I guess the main point of interest would be to have some method for
>>> launching the DVM scenario with OMPI.  Another option could be to rename
>>> the binary in OMPI?
>> 
>> Yeah, that’s what the OHPC folks did in their distro - they renamed it to 
>> “ompi-prun”. If that works for you, then perhaps the best path forward is to 
>> do the rename and update it as well.
> 
> 
> Sounds good to me -- seems like a good way to avoid confusion.
> 
> And having the 'ompi-prun' be in sync with (prrte) prun will make sure
> things run properly, i.e., easy to drop in new snapshot of the tool when
> updating PRRTE snapshots in OMPI.  (Or however done in future)
> 
> Thanks, Ralph!
> --tjn
> 
> 
> _
>  Thomas Naughton  naught...@ornl.gov
>  Research Associate   (865) 576-4184
> 
>> 
>>> Thanks,
>>> --tjn
>>> _
>>> Thomas Naughton  naught...@ornl.gov
>>> Research Associate   (865) 576-4184
>>> On Tue, 5 Jun 2018, r...@open-mpi.org wrote:
>>>> I know we were headed that way - it might still work when run against the 
>>>> current ORTE. I can check that and see. If so, then I guess it might be 
>>>> advisable to retain it.
>>>> All it means is that PRRTE users must be careful to have PRRTE before OMPI 
>>>> in their path values. Otherwise, they get the wrong “prun” and it fails. I 
>>>> suppose I could update the “prun” in OMPI to match the one in PRRTE, if 
>>>> that helps - there isn’t anything incompatible between ORTE and PRRTE. 
>>>> Would that make sense?
>>>> FWIW: Got a similar complaint from the OpenHPC folks - I gather they also 
>>>> have a “prun”’ in their distribution that they use as an abstraction over 
>>>> all the RM launchers. I’m less concerned about that one, though.
>>>>> On Jun 5, 2018, at 9:55 AM, Thomas Naughton  wrote:
>>>>> Hi Ralph,
>>>>> Is the 'prun' tool required to launch the DVM?
>>>>> I know that at some point things shifted to use 'prun' and didn't require
>>>>> the URI on command-line, but I've not tested in few months.
>>>>> Thanks,
>>>>> --tjn
>>>>> _
>>>>> Thomas Naughton  naught...@ornl.gov
>>>>> Research Associate   (865) 576-4184
>>>>> On Tue, 5 Jun 2018, r...@open-mpi.org wrote:
>>>>>> Hey folks
>>>>>> Does anyone have heartburn if I remove the “prun” tool from ORTE? I 
>>>>>> don’t believe anyone is using it, and it doesn’t look like it even works.
>>>>>> I ask because the name conflicts with PRRTE and can cause problems when 
>>>>>> running OMPI against PRRTE
>>>>>> Ralph
>>>>>> ___
>>>>>> devel mailing list
>>>>>> devel@lists.open-mpi.org
>>>>>> https://lists.open-mpi.org/mailman/listinfo/devel
>>>>> ___
>>>>> devel mailing list
>>>>> devel@lists.open-mpi.org
>>>>> https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Remove prun tool from OMPI?

2018-06-05 Thread r...@open-mpi.org


> On Jun 5, 2018, at 11:59 AM, Thomas Naughton  wrote:
> 
> Hi Ralph,
> 
>> All it means is that PRRTE users must be careful to have PRRTE before OMPI 
>> in their path values. Otherwise, they get the wrong “prun” and it fails. I 
>> suppose I could update the “prun” in OMPI to match the one in PRRTE, if that 
>> helps - there isn’t anything incompatible between ORTE and PRRTE. Would that 
>> make sense?
> 
> 
> Yes, if updating "OMPI prun" with latest "PRRTE prun" works ok, that
> seems like a reasonable way to keep DVM for OMPI usage.
> 
> I agree that it does seem likely that users could easily get the wrong
> 'prun' but this may be something that falls out in future (based on
> discussion on call today).
> 
> I guess the main point of interest would be to have some method for
> launching the DVM scenario with OMPI.  Another option could be to rename
> the binary in OMPI?

Yeah, that’s what the OHPC folks did in their distro - they renamed it to 
“ompi-prun”. If that works for you, then perhaps the best path forward is to do 
the rename and update it as well.


> 
> Thanks,
> --tjn
> 
> _
>  Thomas Naughton  naught...@ornl.gov
>  Research Associate   (865) 576-4184
> 
> 
> On Tue, 5 Jun 2018, r...@open-mpi.org wrote:
> 
>> I know we were headed that way - it might still work when run against the 
>> current ORTE. I can check that and see. If so, then I guess it might be 
>> advisable to retain it.
>> 
>> All it means is that PRRTE users must be careful to have PRRTE before OMPI 
>> in their path values. Otherwise, they get the wrong “prun” and it fails. I 
>> suppose I could update the “prun” in OMPI to match the one in PRRTE, if that 
>> helps - there isn’t anything incompatible between ORTE and PRRTE. Would that 
>> make sense?
>> 
>> 
>> FWIW: Got a similar complaint from the OpenHPC folks - I gather they also 
>> have a “prun”’ in their distribution that they use as an abstraction over 
>> all the RM launchers. I’m less concerned about that one, though.
>> 
>> 
>>> On Jun 5, 2018, at 9:55 AM, Thomas Naughton  wrote:
>>> Hi Ralph,
>>> Is the 'prun' tool required to launch the DVM?
>>> I know that at some point things shifted to use 'prun' and didn't require
>>> the URI on command-line, but I've not tested in few months.
>>> Thanks,
>>> --tjn
>>> _
>>> Thomas Naughton  naught...@ornl.gov
>>> Research Associate   (865) 576-4184
>>> On Tue, 5 Jun 2018, r...@open-mpi.org wrote:
>>>> Hey folks
>>>> Does anyone have heartburn if I remove the “prun” tool from ORTE? I don’t 
>>>> believe anyone is using it, and it doesn’t look like it even works.
>>>> I ask because the name conflicts with PRRTE and can cause problems when 
>>>> running OMPI against PRRTE
>>>> Ralph
>>>> ___
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://lists.open-mpi.org/mailman/listinfo/devel
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/devel
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Remove prun tool from OMPI?

2018-06-05 Thread r...@open-mpi.org
I know we were headed that way - it might still work when run against the 
current ORTE. I can check that and see. If so, then I guess it might be 
advisable to retain it.

All it means is that PRRTE users must be careful to have PRRTE before OMPI in 
their path values. Otherwise, they get the wrong “prun” and it fails. I suppose 
I could update the “prun” in OMPI to match the one in PRRTE, if that helps - 
there isn’t anything incompatible between ORTE and PRRTE. Would that make sense?


FWIW: Got a similar complaint from the OpenHPC folks - I gather they also have 
a “prun”’ in their distribution that they use as an abstraction over all the RM 
launchers. I’m less concerned about that one, though.


> On Jun 5, 2018, at 9:55 AM, Thomas Naughton  wrote:
> 
> Hi Ralph,
> 
> Is the 'prun' tool required to launch the DVM?
> 
> I know that at some point things shifted to use 'prun' and didn't require
> the URI on command-line, but I've not tested in few months.
> 
> Thanks,
> --tjn
> 
> _
>  Thomas Naughton  naught...@ornl.gov
>  Research Associate   (865) 576-4184
> 
> 
> On Tue, 5 Jun 2018, r...@open-mpi.org wrote:
> 
>> Hey folks
>> 
>> Does anyone have heartburn if I remove the “prun” tool from ORTE? I don’t 
>> believe anyone is using it, and it doesn’t look like it even works.
>> 
>> I ask because the name conflicts with PRRTE and can cause problems when 
>> running OMPI against PRRTE
>> 
>> Ralph
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Remove prun tool from OMPI?

2018-06-05 Thread r...@open-mpi.org
Hey folks

Does anyone have heartburn if I remove the “prun” tool from ORTE? I don’t 
believe anyone is using it, and it doesn’t look like it even works.

I ask because the name conflicts with PRRTE and can cause problems when running 
OMPI against PRRTE

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Master broken

2018-06-03 Thread r...@open-mpi.org
Here are more problems with a different version of libfabric:

btl_ofi_component.c: In function ‘validate_info’:
btl_ofi_component.c:64:23: error: ‘FI_MR_VIRT_ADDR’ undeclared (first use in 
this function)
  (mr_mode & ~(FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY)) == 
0)) {
   ^~~
btl_ofi_component.c:64:23: note: each undeclared identifier is reported only 
once for each function it appears in
btl_ofi_component.c:64:41: error: ‘FI_MR_ALLOCATED’ undeclared (first use in 
this function)
  (mr_mode & ~(FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY)) == 
0)) {
 ^~~
btl_ofi_component.c:64:59: error: ‘FI_MR_PROV_KEY’ undeclared (first use in 
this function)
  (mr_mode & ~(FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY)) == 
0)) {
   ^~
btl_ofi_component.c: In function ‘mca_btl_ofi_init_device’:
btl_ofi_component.c:410:42: error: ‘FI_MR_VIRT_ADDR’ undeclared (first use in 
this function)
 ofi_info->domain_attr->mr_mode & FI_MR_VIRT_ADDR) {
  ^~~
In file included from ../../../../opal/threads/thread_usage.h:31:0,
 from ../../../../opal/class/opal_object.h:126,
 from ../../../../opal/util/output.h:70,
 from ../../../../opal/include/opal/types.h:43,
 from ../../../../opal/mca/btl/btl.h:119,
 from btl_ofi_component.c:27:
btl_ofi_component.c: In function ‘mca_btl_ofi_component_progress’:
btl_ofi_component.c:557:63: error: ‘FI_EINTR’ undeclared (first use in this 
function)
 } else if (OPAL_UNLIKELY(ret != -FI_EAGAIN && ret != -FI_EINTR)) {
   ^

What the heck version was this tested against???

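(Sketch only, not the actual fix: since the FI_MR_* mr_mode bits are 
compile-time defines in newer libfabric releases, a guard like the one below 
would at least turn the undeclared-identifier spew into a single clear message 
when building against an older libfabric.)

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Fail loudly if the installed libfabric headers predate the mr_mode bits
 * that btl/ofi uses. */
#if !defined(FI_MR_VIRT_ADDR) || !defined(FI_MR_ALLOCATED) || !defined(FI_MR_PROV_KEY)
#error "btl/ofi needs a newer libfabric (FI_MR_* mr_mode bits not found)"
#endif
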

> On Jun 3, 2018, at 7:32 AM, r...@open-mpi.org wrote:
> 
> On my system, which has libfabric installed (but maybe an older version than 
> expected?):
> 
> btl_ofi_component.c: In function ‘mca_btl_ofi_component_progress’:
> btl_ofi_component.c:557:63: error: ‘FI_EINTR’ undeclared (first use in this 
> function)
>  } else if (OPAL_UNLIKELY(ret != -FI_EAGAIN && ret != -FI_EINTR)) {
> 
> Can someone please fix this?
> Ralph
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Master broken

2018-06-03 Thread r...@open-mpi.org
On my system, which has libfabric installed (but maybe an older version than 
expected?):

btl_ofi_component.c: In function ‘mca_btl_ofi_component_progress’:
btl_ofi_component.c:557:63: error: ‘FI_EINTR’ undeclared (first use in this 
function)
 } else if (OPAL_UNLIKELY(ret != -FI_EAGAIN && ret != -FI_EINTR)) {

Can someone please fix this?
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Master warnings?

2018-06-02 Thread r...@open-mpi.org
No problem - I just commented because earlier in the week it had built clean, 
so I was surprised to get the flood.

This was with gcc 6.3.0, so not that old

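(For anyone else hitting this, the workaround Gilles describes below amounts to 
setting the configure cache variable, e.g.

    opal_cv___attribute__error=0 ./configure ...

until the attribute handling is fixed.)
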

> On Jun 2, 2018, at 7:19 AM, Nathan Hjelm  wrote:
> 
> Should have it fixed today or tomorrow. Guess I didn't have a sufficiently 
> old gcc to catch this during testing.
> 
> -Nathan
> 
> On Jun 2, 2018, at 1:09 AM, gil...@rist.or.jp  
> wrote:
> 
>> Hi Ralph,
>> 
>>  
>> see my last comment in https://github.com/open-mpi/ompi/pull/5210 
>> 
>>  
>> long story short, this is just a warning you can ignore.
>> 
>> If you are running on a CentOS 7 box
>> 
>> with the default GNU compiler, you can
>> 
>> opal_cv___attribute__error=0 configure ...
>> 
>> in order to get rid of these.
>> 
>>  
>> Cheers,
>> 
>>  
>> Gilles
>> 
>> - Original Message -
>> 
>> Geez guys - what happened?
>>  
>> In file included from monitoring_prof.c:47:0:
>> ../../../../ompi/include/mpi.h:423:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Comm_errhandler_fn was removed in 
>> MPI-3.0; use MPI_Comm_errhandler_function instead");
>>   ^
>> ../../../../ompi/include/mpi.h:425:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_File_errhandler_fn was removed in 
>> MPI-3.0; use MPI_File_errhandler_function instead");
>>   ^
>> ../../../../ompi/include/mpi.h:427:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Win_errhandler_fn was removed in 
>> MPI-3.0; use MPI_Win_errhandler_function instead");
>>   ^
>> ../../../../ompi/include/mpi.h:429:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Handler_function was removed in 
>> MPI-3.0; use MPI_Win_errhandler_function instead");
>>   ^
>> ../../../../ompi/include/mpi.h:1042:29:warning: ‘__error__’ attribute 
>> ignored [-Wattributes]
>>  OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_lb 
>> __mpi_interface_removed__("MPI_LB was removed in MPI-3.0");
>> ^~
>> ../../../../ompi/include/mpi.h:1043:29:warning: ‘__error__’ attribute 
>> ignored [-Wattributes]
>>  OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_ub 
>> __mpi_interface_removed__("MPI_UB was removed in MPI-3.0");
>> ^~
>> In file included from monitoring_test.c:65:0:
>> ../../ompi/include/mpi.h:423:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Comm_errhandler_fn was removed in 
>> MPI-3.0; use MPI_Comm_errhandler_function instead");
>>   ^
>> ../../ompi/include/mpi.h:425:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_File_errhandler_fn was removed in 
>> MPI-3.0; use MPI_File_errhandler_function instead");
>>   ^
>> ../../ompi/include/mpi.h:427:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Win_errhandler_fn was removed in 
>> MPI-3.0; use MPI_Win_errhandler_function instead");
>>   ^
>> ../../ompi/include/mpi.h:429:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Handler_function was removed in 
>> MPI-3.0; use MPI_Win_errhandler_function instead");
>>   ^
>> ../../ompi/include/mpi.h:1042:29:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>  OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_lb 
>> __mpi_interface_removed__("MPI_LB was removed in MPI-3.0");
>> ^~
>> ../../ompi/include/mpi.h:1043:29:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>  OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_ub 
>> __mpi_interface_removed__("MPI_UB was removed in MPI-3.0");
>> ^~
>> In file included from check_monitoring.c:21:0:
>> ../../ompi/include/mpi.h:423:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Comm_errhandler_fn was removed in 
>> MPI-3.0; use MPI_Comm_errhandler_function instead");
>>   ^
>> ../../ompi/include/mpi.h:425:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_File_errhandler_fn was removed in 
>> MPI-3.0; use MPI_File_errhandler_function instead");
>>   ^
>> ../../ompi/include/mpi.h:427:9:warning: ‘__error__’ attribute ignored 
>> [-Wattributes]
>>   __mpi_interface_removed__("MPI_Win_errhandler_fn was removed in 
>> MPI-3.0; use MPI_Win_errhandler_function 

[OMPI devel] Master warnings?

2018-06-01 Thread r...@open-mpi.org
Geez guys - what happened?

In file included from monitoring_prof.c:47:0:
../../../../ompi/include/mpi.h:423:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Comm_errhandler_fn was removed in 
MPI-3.0; use MPI_Comm_errhandler_function instead");
 ^
../../../../ompi/include/mpi.h:425:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_File_errhandler_fn was removed in 
MPI-3.0; use MPI_File_errhandler_function instead");
 ^
../../../../ompi/include/mpi.h:427:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Win_errhandler_fn was removed in 
MPI-3.0; use MPI_Win_errhandler_function instead");
 ^
../../../../ompi/include/mpi.h:429:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Handler_function was removed in 
MPI-3.0; use MPI_Win_errhandler_function instead");
 ^
../../../../ompi/include/mpi.h:1042:29: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_lb 
__mpi_interface_removed__("MPI_LB was removed in MPI-3.0");
 ^~
../../../../ompi/include/mpi.h:1043:29: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_ub 
__mpi_interface_removed__("MPI_UB was removed in MPI-3.0");
 ^~
In file included from monitoring_test.c:65:0:
../../ompi/include/mpi.h:423:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Comm_errhandler_fn was removed in 
MPI-3.0; use MPI_Comm_errhandler_function instead");
 ^
../../ompi/include/mpi.h:425:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_File_errhandler_fn was removed in 
MPI-3.0; use MPI_File_errhandler_function instead");
 ^
../../ompi/include/mpi.h:427:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Win_errhandler_fn was removed in 
MPI-3.0; use MPI_Win_errhandler_function instead");
 ^
../../ompi/include/mpi.h:429:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Handler_function was removed in 
MPI-3.0; use MPI_Win_errhandler_function instead");
 ^
../../ompi/include/mpi.h:1042:29: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_lb 
__mpi_interface_removed__("MPI_LB was removed in MPI-3.0");
 ^~
../../ompi/include/mpi.h:1043:29: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_ub 
__mpi_interface_removed__("MPI_UB was removed in MPI-3.0");
 ^~
In file included from check_monitoring.c:21:0:
../../ompi/include/mpi.h:423:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Comm_errhandler_fn was removed in 
MPI-3.0; use MPI_Comm_errhandler_function instead");
 ^
../../ompi/include/mpi.h:425:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_File_errhandler_fn was removed in 
MPI-3.0; use MPI_File_errhandler_function instead");
 ^
../../ompi/include/mpi.h:427:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Win_errhandler_fn was removed in 
MPI-3.0; use MPI_Win_errhandler_function instead");
 ^
../../ompi/include/mpi.h:429:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 __mpi_interface_removed__("MPI_Handler_function was removed in 
MPI-3.0; use MPI_Win_errhandler_function instead");
 ^
../../ompi/include/mpi.h:1042:29: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_lb 
__mpi_interface_removed__("MPI_LB was removed in MPI-3.0");
 ^~
../../ompi/include/mpi.h:1043:29: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 OMPI_DECLSPEC extern struct ompi_predefined_datatype_t ompi_mpi_ub 
__mpi_interface_removed__("MPI_UB was removed in MPI-3.0");
 ^~
In file included from example_reduce_count.c:12:0:
../../ompi/include/mpi.h:423:9: warning: ‘__error__’ attribute ignored 
[-Wattributes]
 

[OMPI devel] Some disturbing warnings on master today

2018-05-30 Thread r...@open-mpi.org
In file included from /usr/include/stdio.h:411:0,
 from ../../opal/util/malloc.h:24,
 from ../../opal/include/opal_config_bottom.h:331,
 from ../../opal/include/opal_config.h:2919,
 from ../../opal/util/argv.h:33,
 from info.c:41:
info.c: In function 'opal_info_dup_mode':
info.c:209:36: warning: '%s' directive writing up to 36 bytes into a region of 
size 31 [-Wformat-overflow=]
  sprintf(savedkey, "__IN_%s", iterator->ie_key);
^
info.c:209:18: note: '__builtin___sprintf_chk' output between 6 and 42 bytes 
into a destination of size 36
  sprintf(savedkey, "__IN_%s", iterator->ie_key);
  ^


fcoll_dynamic_gen2_file_write_all.c: In function 'shuffle_init':
fcoll_dynamic_gen2_file_write_all.c:1165:39: warning: initialization makes 
integer from pointer without a cast [-Wint-conversion]
 ptrdiff_t send_mem_address  = NULL;
   ^~~~

This was from building on a Mac with gcc 7.3.0
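
A minimal sketch of one way to quiet these two warnings - not necessarily the 
actual fix, and OPAL_MAX_INFO_KEY (assumed here to be 36) is my assumption 
about the right sizing macro:

    /* "__IN_" (5 bytes) + key (up to OPAL_MAX_INFO_KEY, assumed 36) + NUL = 42,
     * which matches the 6..42 byte range the compiler reports above */
    char savedkey[OPAL_MAX_INFO_KEY + 6];
    snprintf(savedkey, sizeof(savedkey), "__IN_%s", iterator->ie_key);

    /* and for the fcoll warning, initialize the integer with 0, not NULL */
    ptrdiff_t send_mem_address = 0;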

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Running on Kubernetes

2018-05-28 Thread r...@open-mpi.org
One suggestion: this approach requires that the job be executed using “mpirun”. 
Another approach would be to integrate PMIx into Kubernetes, thus allowing any 
job to call MPI_Init regardless of how it was started. The advantage would be 
that it enables the use of MPI by workflow-based applications that really 
aren’t supported by mpirun and require their own application manager.

See https://pmix.org for more info.

Ralph


> On May 24, 2018, at 9:02 PM, Rong Ou <rong...@gmail.com> wrote:
> 
> Hi guys,
> 
> Thanks for all the suggestions! It's been a while but we finally got it 
> approved for open sourcing. I've submitted a proposal to kubeflow: 
> https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md.
>  In this version we've managed to not use ssh, relying on `kubectl exec` 
> instead. It's still pretty "ghetto", but at least we've managed to train some 
> tensorflow models with it. :) Please take a look and let me know what you 
> think.
> 
> Thanks,
> 
> Rong
> 
> On Fri, Mar 16, 2018 at 11:38 AM r...@open-mpi.org wrote:
> I haven’t really spent any time with Kubernetes, but it seems to me you could 
> just write a Kubernetes plm (and maybe an odls) component and bypass the ssh 
> stuff completely given that you say there is a launcher API.
> 
> > On Mar 16, 2018, at 11:02 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > 
> > On Mar 16, 2018, at 10:01 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> >> 
> >> By default, Open MPI uses the rsh PLM in order to start a job.
> > 
> > To clarify one thing here: the name of our plugin is "rsh" for historical 
> > reasons, but it defaults to looking to looking for "ssh" first.  If it 
> > finds ssh, it uses it.  Otherwise, it tries to find rsh and use that.
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> > 
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> > https://lists.open-mpi.org/mailman/listinfo/devel 
> > <https://lists.open-mpi.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://lists.open-mpi.org/mailman/listinfo/devel 
> <https://lists.open-mpi.org/mailman/listinfo/devel>___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-22 Thread r...@open-mpi.org
FWIW: I just tested this on today’s OMPI master and it is working there. Could 
just be something that didn’t get moved to a release branch.


> On May 21, 2018, at 8:43 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> Hi Ralph,
> 
> Thanks for that. That would also explain why it works with OMPI 1.10.7. In 
> which case, I’ll just suggest they continue using 1.10.7 for now.
> 
> I just went back over the doMPI R code, and it looks like it’s using 
> MPI_Comm_spawn to create it’s “cluster” of MPI worker processes but then 
> using MPI_Comm_disconnect when closing the cluster. I think the idea is that 
> they can then create and destroy clusters several times within the same R 
> script. But of course, that won’t work here when you can’t disconnect 
> processes.
> 
> Cheers,
> Ben
> 
> 
> 
>> On 22 May 2018, at 1:09 pm, r...@open-mpi.org wrote:
>> 
>> Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, 
>> including OMPI master - the precise reasons differ across the various 
>> releases. From what I can tell, the problem is on the OMPI side (as opposed 
>> to PMIx). I’ll try to file a few issues in the next few days that point to 
>> the problems (the precise problem differs across the various releases).
>> 
>> Comm_spawn is okay, FWIW
>> 
>> Ralph
>> 
>> 
>>> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>> 
>>> Hi,
>>> 
>>> Moving this over to the devel list... I’m not sure if it's is a problem 
>>> with PMIx or with OMPI’s integration with that. It looks like wait_cbfunc 
>>> callback enqueued as part of the PMIX_PTL_SEND_RECV at 
>>> pmix_client_connect.c:329 is never called, and so the main thread is never 
>>> woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for 
>>> PMIx v2.1.1.) But I haven’t worked out why that callback is not being 
>>> called yet… looking at the output, I think that it’s expecting a message 
>>> back from the PMIx server that it’s never getting.
>>> 
>>> [raijin7:05505] pmix: disconnect called
>>> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to 
>>> server
>>> [raijin7:05505] posting recv on tag 119
>>> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
>>> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 
>>> 1746468864:0 tag 119 with NON-NULL msg
>>> [raijin7:05505] ptl:base:send_handler SENDING MSG
>>> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 
>>> 1746468865:0
>>> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
>>> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
>>> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
>>> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
>>> [raijin7:05505] ptl:base:send_handler MSG SENT
>>> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 
>>> BYTES FOR TAG 119 ON PEER SOCKET 27
>>> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post 
>>> msg
>>> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on 
>>> socket 27
>>> [raijin7:05493] checking msg on tag 119 for tag 0
>>> [raijin7:05493] checking msg on tag 119 for tag 4294967295
>>> [raijin7:05505] pmix: disconnect completed
>>> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
>>> [raijin7:05493] SWITCHYARD for 1746468865:0:27
>>> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
>>> [raijin7:05493] recvd CONNECT from peer 1746468865:0
>>> [raijin7:05493] get_tracker called with 32 procs
>>> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
>>> 
>>> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of 
>>> the MPI processes (i.e. the original one along with the dynamically 
>>> launched ones) look to be waiting on the same pthread_cond_wait in the 
>>> backtrace below, while the mpirun is just in the standard event loops 
>>> (event_base_loop, oob_tcp_listener, opal_progress_threads, 
>>> ptl_base_listener, and pmix_progress_threads).
>>> 
>>> That said, I’m not sure why get_tracker is reporting 32 procs — there’s 
>>> only 16 running here (i.e. 1 original + 15 spawned).
>>> 
>>> Or should I post this over in the PMIx list instead?
>>>

Re: [OMPI devel] About supporting HWLOC 2.0.x

2018-05-22 Thread r...@open-mpi.org
Arg - just remembered. I should have noted in my comment that I started with 
that PR and did make a few further adjustments, though not much.

> On May 22, 2018, at 8:49 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Geoffroy -- check out https://github.com/open-mpi/ompi/pull/4677.
> 
> If all those issues are now moot, great.  I really haven't followed up much 
> since I made the initial PR; I'm happy to have someone else take it over...
> 
> 
>> On May 22, 2018, at 11:46 AM, Vallee, Geoffroy R.  wrote:
>> 
>> Hi,
>> 
>> HWLOC 2.0.x support was brought up during the call. FYI, I am currently 
>> using (and still testing) hwloc 2.0.1 as an external library with master and 
>> I did not face any major problem; I only had to fix minor things, mainly for 
>> putting the HWLOC topology in a shared memory segment. Let me know if you 
>> want me to help with the effort of supporting HWLOC 2.0.x.
>> 
>> Thanks,
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] About supporting HWLOC 2.0.x

2018-05-22 Thread r...@open-mpi.org
I’ve been running with hwloc 2.0.1 for quite some time now without problems, 
including use of the shared memory segment. It would be interesting to hear 
what changes you had to make.

That said, there is a significant issue in ORTE when trying to map-by NUMA, 
as hwloc 2.0.1 no longer associates cpus with NUMA regions - so you’ll get an 
error when you try it. Unfortunately, that is the default mapping policy when 
#procs > 2.


> On May 22, 2018, at 8:46 AM, Vallee, Geoffroy R.  wrote:
> 
> Hi,
> 
> HWLOC 2.0.x support was brought up during the call. FYI, I am currently using 
> (and still testing) hwloc 2.0.1 as an external library with master and I did 
> not face any major problem; I only had to fix minor things, mainly for 
> putting the HWLOC topology in a shared memory segment. Let me know if you 
> want me to help with the effort of supporting HWLOC 2.0.x.
> 
> Thanks,
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread r...@open-mpi.org
Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, 
including OMPI master - the precise reasons differ across the various releases. 
From what I can tell, the problem is on the OMPI side (as opposed to PMIx). 
I’ll try to file a few issues in the next few days that point to the problems 
(the precise problem differs across the various releases).

Comm_spawn is okay, FWIW

Ralph

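For anyone trying to reproduce this outside of R, a minimal parent-side sketch 
of the spawn-then-disconnect pattern that Rmpi/doMPI uses (the worker executable 
name and the count of 15 are placeholders mirroring the report below):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm workers;

    MPI_Init(&argc, &argv);

    /* spawn the worker "cluster" */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 15, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    /* ... exchange work over the intercommunicator ... */

    /* the workers make the matching call on MPI_Comm_get_parent();
     * this is where the hang shows up on the affected releases */
    MPI_Comm_disconnect(&workers);

    MPI_Finalize();
    return 0;
}
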

> On May 21, 2018, at 8:00 PM, Ben Menadue  wrote:
> 
> Hi,
> 
> Moving this over to the devel list... I’m not sure if it's is a problem with 
> PMIx or with OMPI’s integration with that. It looks like wait_cbfunc callback 
> enqueued as part of the PMIX_PTL_SEND_RECV at pmix_client_connect.c:329 is 
> never called, and so the main thread is never woken from the PMIX_WAIT_THREAD 
> at pmix_client_connect.c:232. (This is for PMIx v2.1.1.) But I haven’t worked 
> out why that callback is not being called yet… looking at the output, I think 
> that it’s expecting a message back from the PMIx server that it’s never 
> getting.
> 
> [raijin7:05505] pmix: disconnect called
> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to 
> server
> [raijin7:05505] posting recv on tag 119
> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 
> 1746468864:0 tag 119 with NON-NULL msg
> [raijin7:05505] ptl:base:send_handler SENDING MSG
> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 
> 1746468865:0
> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
> [raijin7:05505] ptl:base:send_handler MSG SENT
> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES 
> FOR TAG 119 ON PEER SOCKET 27
> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post 
> msg
> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 
> 27
> [raijin7:05493] checking msg on tag 119 for tag 0
> [raijin7:05493] checking msg on tag 119 for tag 4294967295
> [raijin7:05505] pmix: disconnect completed
> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
> [raijin7:05493] SWITCHYARD for 1746468865:0:27
> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
> [raijin7:05493] recvd CONNECT from peer 1746468865:0
> [raijin7:05493] get_tracker called with 32 procs
> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
> 
> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of the 
> MPI processes (i.e. the original one along with the dynamically launched 
> ones) look to be waiting on the same pthread_cond_wait in the backtrace 
> below, while the mpirun is just in the standard event loops (event_base_loop, 
> oob_tcp_listener, opal_progress_threads, ptl_base_listener, and 
> pmix_progress_threads).
> 
> That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 
> 16 running here (i.e. 1 original + 15 spawned).
> 
> Or should I post this over in the PMIx list instead?
> 
> Cheers,
> Ben
> 
> 
>> On 17 May 2018, at 9:59 am, Ben Menadue wrote:
>> 
>> Hi,
>> 
>> I’m trying to debug a user’s program that uses dynamic process management 
>> through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of 
>> the processes is in
>> 
>> #0  0x7ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from 
>> /lib64/libpthread.so.0
>> #1  0x7ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<optimized out>, info=<optimized out>, ninfo=0) at 
>> ../../src/client/pmix_client_connect.c:232
>> #2  0x7ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at 
>> ext2x_client.c:1432
>> #3  0x7ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at 
>> ../../../../../ompi/dpm/dpm.c:596
>> #4  0x7ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at 
>> pcomm_disconnect.c:67
>> #5  0x7ff71a7466b9 in mpi_comm_disconnect () from 
>> /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>> 
>> This is using 3.1.0 against an external install of PMIx 2.1.1. But I see 
>> exactly the same issue with 3.0.1 using its internal PMIx. It looks similar 
>> to issue #4542, but the corresponding patch in PR#4549 doesn’t seem to help 
>> (it just hangs in PMIx_fence instead of PMIx_disconnect).
>> 
>> Attached is the offending R script, it hangs in the “closeCluster” call. Has 
>> anyone seen this issue? I’m not sure what approach to take to debug it, but 
>> I have builds of the MPI libraries with --enable-debug available if needed.
>> 
>> Cheers,
>> Ben
>> 
>> 
>> ___
>> users mailing list
>> us...@lists.open-mpi.org 
>> https://lists.open-mpi.org/mailman/listinfo/users
> 
> 

Re: [OMPI devel] Open MPI 3.1.0rc4 posted

2018-04-17 Thread r...@open-mpi.org
I’ll let you decide about 3.1.0. FWIW: I think Gilles’ fix should work for 
external PMIx v1.2.5 as well.


> On Apr 17, 2018, at 7:56 AM, Barrett, Brian via devel 
>  wrote:
> 
> Do we honestly care for 3.1.0?  I mean, we went 6 months without it working 
> and no one cared.  We can’t fix all bugs, and I’m a little concerned about 
> making changes right before release.
> 
> Brian
> 
>> On Apr 17, 2018, at 7:49 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>> Brian,
>> 
>> https://github.com/open-mpi/ompi/pull/5081 fixes support for external PMIx 
>> v2.0
>> 
>> Support for external PMIx v1 is broken (same in master) and extra dev would 
>> be required to fix it.
>> 
>> The easiest path, if acceptable, is to simply drop support for PMIx v1
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> "Barrett, Brian via devel"  wrote:
>>> In what we hope is the last RC for the 3.1.0 series, I’ve posted 3.1.0rc4 
>>> at:
>>> 
>>>  https://www.open-mpi.org/software/ompi/v3.1/
>>> 
>>> Please give it a try and provide feedback asap; goal is to release end of 
>>> the week if we don’t find any major issues.
>>> 
>>> Brian
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/devel
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Running on Kubernetes

2018-03-16 Thread r...@open-mpi.org
I haven’t really spent any time with Kubernetes, but it seems to me you could 
just write a Kubernetes plm (and maybe an odls) component and bypass the ssh 
stuff completely given that you say there is a launcher API.

> On Mar 16, 2018, at 11:02 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Mar 16, 2018, at 10:01 AM, Gilles Gouaillardet 
>  wrote:
>> 
>> By default, Open MPI uses the rsh PLM in order to start a job.
> 
> To clarify one thing here: the name of our plugin is "rsh" for historical 
> reasons, but it defaults to looking to looking for "ssh" first.  If it finds 
> ssh, it uses it.  Otherwise, it tries to find rsh and use that.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Fabric manager interactions: request for comments

2018-02-05 Thread r...@open-mpi.org
Hello all

The PMIx community is starting work on the next phase of defining support for 
network interactions, looking specifically at things we might want to obtain 
and/or control via the fabric manager. A very preliminary draft is shown here:

https://pmix.org/home/pmix-standard/fabric-manager-roles-and-expectations/ 


We would welcome any comments/suggestions regarding information you might find 
useful to obtain about the network, or controls you would like to set.

Thanks in advance
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] cannot push directly to master anymore

2018-01-31 Thread r...@open-mpi.org


> On Jan 31, 2018, at 8:41 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> On Jan 31, 2018, at 11:33 AM, r...@open-mpi.org wrote:
>> 
>> If CI takes 30 min, then not a problem - when CI takes 6 hours (as it 
>> sometimes does), then that’s a different story.
> 
> Fair point; that's why I experimented with (and accidentally left enabled) 
> only having the 2 pretty-much-immediate CI checks (email checker and 
> signed-off-by checker).
> 
> We have definitely seen unreliable CI hang for hours (or days... or even get 
> abandoned when a CI server is reset).  So it's understandable that sometimes 
> people merge before waiting for CI to complete.
> 
> But I think the central question here is: do we want to leave it set as it is 
> right now:
> 
> 1. you *must* make a PR
> 2. the email-checker and signed-off-by-checker CI *must* pass on that PR
> 
> This still allows you to merge early (i.e., before other CI completes).  
> That's a different issue, and is probably ok the way that it is currently 
> handled (i.e., individual developer's discretion -- usually let all the CI 
> finish, but merge early when the situation warrants it).

I personally have no objections

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] cannot push directly to master anymore

2018-01-31 Thread r...@open-mpi.org


> On Jan 31, 2018, at 7:36 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Jan 31, 2018, at 10:14 AM, Gilles Gouaillardet 
>  wrote:
>> 
>> I tried to push some trivial commits directly to the master branch and
>> was surprised that it is no longer allowed.
>> 
>> The error message is not crystal clear, but I guess the root cause is
>> the two newly required checks (Commit email checker and
>> Signed-off-by-checker) were not performed.
> 
> That is probably my fault; I was testing something and didn't mean to leave 
> that enabled.  Oops -- sorry.  :-(
> 
> That being said -- is it a terrible thing to require a PR to ensure that we 
> get a valid email address (e.g., not a "root@localhost") and that we have a 
> proper signed-off-by line?

> 
>> /* note if the commit is trivial, then it is possible to add the following 
>> line
>> [skip ci]
>> into the commit message, so Jenkins will not check the PR. */
> 
> We've had some discussions about this on the Tuesday calls -- the point was 
> made that if you allow skipping CI for "trivial" commits, it starts you down 
> the slippery slope of precisely defining what "trivial" means.  Indeed, I 
> know that I have been guilty of making a "trivial" change that ended up 
> breaking something.
> 
> FWIW, I have stopped using the "[skip ci]" stuff -- even if I made docs-only 
> changes.  I.e., just *always* go through CI.  That way there's never any 
> question, and never any possibility of a human mistake (e.g., accidentally 
> marking "[skip ci]" on a PR that really should have had CI).

If CI takes 30 min, then not a problem - when CI takes 6 hours (as it sometimes 
does), then that’s a different story.


> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] hwloc2 and cuda and non-default cudatoolkit install location

2017-12-20 Thread r...@open-mpi.org
FWIW: what we do in PMIx (where we also have some overlapping options) is to 
add a new --enable-pmix-foo option in OMPI and then have the configury in the 
corresponding OMPI component convert it for use inside the embedded PMIx 
itself. It isn’t a big deal - you just have to add a little code to save the OMPI 
settings where they overlap, reset those, and then check for the pmix-specific 
values to re-enable the ones that are specified.

Frankly, I prefer that to modifying the non-embedded options - after all, we 
hope to remove the embedded versions in the near future anyway.


> On Dec 20, 2017, at 1:45 PM, Brice Goglin  wrote:
> 
> Le 20/12/2017 à 22:01, Howard Pritchard a écrit :
>> I can think of several ways to fix it.  Easiest would be to modify the
>> opal/mca/hwloc/hwloc2a/configure.m4
>> to not set --enable-cuda if --with-cuda is evaluated to something other than 
>> yes.
>> 
>> Optionally, I could fix the hwloc configury to use a --with-cuda argument 
>> rather than an --enable-cuda configury argument.  Would 
>> such a configury argument change be traumatic for the hwloc community?
>> I think it would be weird to have both an --enable-cuda and a --with-cuda 
>> configury argument for hwloc.
>> 
> 
> Hello
> 
> hwloc currently has mostly --enable-foo configure options, and only a few 
> --with-foo. We rely on pkg-config and variables for setting dependency paths.
> 
> OMPI seems to use --enable for enabling features, and --with for enabling 
> dependencies and setting dependency paths. If that's the official recommended 
> way to choose between --enable and --with, maybe hwloc should just replace 
> many --enable-foo with --with-foo ? But I tend to think we should support 
> both to ease the transition?
> 
> Brice
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] hwloc 2 thing

2017-12-13 Thread r...@open-mpi.org
I’m confused - what does this have to do with hwloc???


> On Dec 13, 2017, at 7:57 PM, saisilpa b <saisilp...@yahoo.com> wrote:
> 
> Hi all, 
> 
> I need one help.. 
> 
> I am using the openmpi library for my project, which is a very old version and 
> uses commands like orterun and orted. 
> 
> I have written a script that reads its input from a text file, which has 
> 22 lakh (2.2 million) lines. The script has to read the lines one by one, 
> generate output, and write it to a file. The process is taking quite a long time. 
> 
> If I try to add multiple hosts to distribute this program, each input line is 
> read by all the hosts and the same output is generated by every host. I am 
> getting duplicate output, and it takes additional time, which is not what I 
> want. Can you please let us know if there is any way we can split the work 
> between the hosts? 
>  
> 
> Thanks for your help. 
> 
> Best regards,  
> Silpa
> 
> Sent from Yahoo Mail on Android 
> <https://overview.mail.yahoo.com/mobile/?.src=Android>
> On Sat, Jul 22, 2017 at 6:28 PM, r...@open-mpi.org
> <r...@open-mpi.org> wrote:
> You’ll have to be a little clearer than that - what “issues” are you talking 
> about?
> 
>> On Jul 21, 2017, at 10:06 PM, saisilpa b via devel <devel@lists.open-mpi.org 
>> <mailto:devel@lists.open-mpi.org>> wrote:
>> 
>> Hi ,
>>  
>> Can some one provide the configuration to build the openmpi libraries to 
>> avoid the issues on ld libraries while running.
>>  
>> thanks,
>> silpa
>>  
>>  
>> 
>> 
>> On Friday, 21 July 2017 8:52 AM, "r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>" <r...@open-mpi.org <mailto:r...@open-mpi.org>> 
>> wrote:
>> 
>> 
>> Yes - I have a PR that’s just about cleared that will remove the hwloc2 install. It 
>> needs to be redone
>> 
>>> On Jul 20, 2017, at 8:18 PM, Howard Pritchard <hpprit...@gmail.com 
>>> <mailto:hpprit...@gmail.com>> wrote:
>>> 
>>> Hi Folks,
>>> 
>>> I'm noticing that if I pull a recent version of master with hwloc 2 support 
>>> into my local repo, my autogen.pl run fails unless I do the following:
>>> 
>>> mkdir $PWD/opal/mca/hwloc/hwloc2x/hwloc/include/private/autogen
>>> 
>>> where PWD is the top level of my work area.
>>> 
>>> I did a
>>> 
>>> git clean -df
>>> 
>>> but that did not help.
>>> 
>>> Is anyone else seeing this?
>>> 
>>> Just curious,
>>> 
>>> Howard
>>> 
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Enable issue tracker for ompi-www repo?

2017-11-04 Thread r...@open-mpi.org
Hi Chris

It was just an oversight - I have turned on the issue tracker, so feel free to 
post, or a PR is also welcome

Ralph


> On Nov 4, 2017, at 5:03 AM, Gilles Gouaillardet 
>  wrote:
> 
> Chris,
> 
> feel free to issue a PR, or fully describe the issue so a developer
> can update the FAQ accordingly.
> 
> Cheers,
> 
> Gilles
> 
> On Sat, Nov 4, 2017 at 4:44 PM, Chris Samuel  wrote:
>> Hi folks,
>> 
>> I was looking to file an issue against the website for the FAQ about XRC
>> support (given it was disabled in issue #4087) but it doesn't appear to be
>> enabled.   Is that just an oversight or is there a different way preferred?
>> 
>> All the best,
>> Chris
>> --
>> Christopher SamuelSenior Systems Administrator
>> Melbourne Bioinformatics - The University of Melbourne
>> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Cuda build break

2017-10-04 Thread r...@open-mpi.org
Fix is here: https://github.com/open-mpi/ompi/pull/4301 
<https://github.com/open-mpi/ompi/pull/4301>

> On Oct 4, 2017, at 11:19 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> Thanks Ralph.
> 
>> On Oct 4, 2017, at 2:07 PM, r...@open-mpi.org wrote:
>> 
>> I’ll fix
>> 
>>> On Oct 4, 2017, at 10:57 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:
>>> 
>>> See my last comment on #4257 :
>>> 
>>> https://github.com/open-mpi/ompi/pull/4257#issuecomment-332900393
>>> 
>>> We should completely disable CUDA in hwloc. It is breaking the build, but 
>>> more importantly, it creates an extra dependency on the CUDA runtime that 
>>> Open MPI doesn't have, even when compiled with --with-cuda (we load symbols 
>>> dynamically).
>>> 
>>> On 10/04/2017 10:42 AM, Barrett, Brian via devel wrote:
>>>> All -
>>>> 
>>>> It looks like nVidia’s MTT started failing on 9/26, due to not finding 
>>>> Cuda.  There’s a suspicious commit given the error message in the hwloc 
>>>> cuda changes.  Jeff and Brice, it’s your patch, can you dig into the build 
>>>> failures?
>>>> 
>>>> Brian
>>>> ___
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://lists.open-mpi.org/mailman/listinfo/devel
>>> 
>>> ---
>>> This email message is for the sole use of the intended recipient(s) and may 
>>> contain
>>> confidential information.  Any unauthorized review, use, disclosure or 
>>> distribution
>>> is prohibited.  If you are not the intended recipient, please contact the 
>>> sender by
>>> reply email and destroy all copies of the original message.
>>> ---
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/devel
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] HWLOC / rmaps ppr build failure

2017-10-04 Thread r...@open-mpi.org
Thanks! Fix is here: https://github.com/open-mpi/ompi/pull/4301 


> On Oct 4, 2017, at 11:10 AM, Brice Goglin  wrote:
> 
> Looks like you're using a hwloc < 1.11. If you want to support this old
> API while using the 1.11 names, you can add this to OMPI after #include
> 
> #if HWLOC_API_VERSION < 0x00010b00
> #define HWLOC_OBJ_NUMANODE HWLOC_OBJ_NODE
> #define HWLOC_OBJ_PACKAGE HWLOC_OBJ_SOCKET
> #endif
> 
> Brice
> 
> 
> 
> 
> Le 04/10/2017 19:54, Barrett, Brian via devel a écrit :
>> It looks like a change in either HWLOC or the rmaps ppr component is causing 
>> Cisco build failures on master for the last couple of days:
>> 
>>  https://mtt.open-mpi.org/index.php?do_redir=2486
>> 
>> rmaps_ppr.c:665:17: error: ‘HWLOC_OBJ_NUMANODE’ undeclared (first use in 
>> this function); did you mean ‘HWLOC_OBJ_NODE’?
>> level = HWLOC_OBJ_NUMANODE;
>> ^~
>> HWLOC_OBJ_NODE
>> rmaps_ppr.c:665:17: note: each undeclared identifier is reported only once 
>> for each function it
>> 
>> Can someone take a look?
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
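
As an aside, a self-contained version of the compatibility guard Brice suggests
above might look like the sketch below (the pick_level() helper is purely
illustrative; the actual fix for master went in via the PR linked at the top of
this message):

/* Sketch of the hwloc v1.x name shim described above: HWLOC_OBJ_NUMANODE and
 * HWLOC_OBJ_PACKAGE only exist as of hwloc 1.11, while older releases call
 * them HWLOC_OBJ_NODE and HWLOC_OBJ_SOCKET. */
#include <hwloc.h>

#if HWLOC_API_VERSION < 0x00010b00
#define HWLOC_OBJ_NUMANODE HWLOC_OBJ_NODE
#define HWLOC_OBJ_PACKAGE  HWLOC_OBJ_SOCKET
#endif

/* With the shim in place, code like the failing line in rmaps_ppr.c can use
 * the 1.11+ names no matter which hwloc the build picked up: */
static hwloc_obj_type_t pick_level(int by_numa)
{
    return by_numa ? HWLOC_OBJ_NUMANODE : HWLOC_OBJ_PACKAGE;
}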

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Cuda build break

2017-10-04 Thread r...@open-mpi.org
I’ll fix

> On Oct 4, 2017, at 10:57 AM, Sylvain Jeaugey  wrote:
> 
> See my last comment on #4257 :
> 
> https://github.com/open-mpi/ompi/pull/4257#issuecomment-332900393
> 
> We should completely disable CUDA in hwloc. It is breaking the build, but 
> more importantly, it creates an extra dependency on the CUDA runtime that 
> Open MPI doesn't have, even when compiled with --with-cuda (we load symbols 
> dynamically).
> 
> On 10/04/2017 10:42 AM, Barrett, Brian via devel wrote:
>> All -
>> 
>> It looks like nVidia’s MTT started failing on 9/26, due to not finding Cuda. 
>>  There’s a suspicious commit given the error message in the hwloc cuda 
>> changes.  Jeff and Brice, it’s your patch, can you dig into the build 
>> failures?
>> 
>> Brian
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ---
> This email message is for the sole use of the intended recipient(s) and may 
> contain
> confidential information.  Any unauthorized review, use, disclosure or 
> distribution
> is prohibited.  If you are not the intended recipient, please contact the 
> sender by
> reply email and destroy all copies of the original message.
> ---
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] HWLOC / rmaps ppr build failure

2017-10-04 Thread r...@open-mpi.org
Hmmm...I suspect this is a hwloc v2 vs v1 issue. I’ll fix it

> On Oct 4, 2017, at 10:54 AM, Barrett, Brian via devel 
>  wrote:
> 
> It looks like a change in either HWLOC or the rmaps ppr component is causing 
> Cisco build failures on master for the last couple of days:
> 
>  https://mtt.open-mpi.org/index.php?do_redir=2486
> 
> rmaps_ppr.c:665:17: error: ‘HWLOC_OBJ_NUMANODE’ undeclared (first use in this 
> function); did you mean ‘HWLOC_OBJ_NODE’?
> level = HWLOC_OBJ_NUMANODE;
> ^~
> HWLOC_OBJ_NODE
> rmaps_ppr.c:665:17: note: each undeclared identifier is reported only once 
> for each function it
> 
> Can someone take a look?
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Jenkins nowhere land again

2017-10-03 Thread r...@open-mpi.org
I’m not sure either - I have the patch to fix the loop_spawn test problem, but 
can’t get it into the repo.


> On Oct 3, 2017, at 1:22 PM, Barrett, Brian via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> I’m not sure entirely what we want to do.  It looks like both Nathan’s and my 
> OS X servers died on the same day.  It looks like mine might be a larger 
> failure than just Jenkins, because I can’t log into the machine remotely.  
> It’s going to be a couple hours before I can get home.  Nathan, do you know 
> what happened to your machine?
> 
> The only options for the OMPI builder are to either wait until Nathan or I 
> get home and get our servers running again or to not test OS X (which has its 
> own problems).  I don’t have a strong preference here, but I also don’t want 
> to make the decision unilaterally.
> 
> Brian
> 
> 
>> On Oct 3, 2017, at 1:14 PM, r...@open-mpi.org wrote:
>> 
>> We are caught between two infrastructure failures:
>> 
>> Mellanox can’t pull down a complete PR
>> 
>> OMPI is hanging on the OS-X server
>> 
>> Can someone put us out of our misery?
>> Ralph
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Jenkins nowhere land again

2017-10-03 Thread r...@open-mpi.org
We are caught between two infrastructure failures:

Mellanox can’t pull down a complete PR

OMPI is hanging on the OS-X server

Can someone put us out of our misery?
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Map by socket broken in 3.0.0?

2017-10-03 Thread r...@open-mpi.org
Found the bug - see https://github.com/open-mpi/ompi/pull/4291 


Will PR for the next 3.0.x release

> On Oct 2, 2017, at 9:55 PM, Ben Menadue  wrote:
> 
> Hi,
> 
> I am having trouble using map by socket on remote nodes.
> 
> Running on the same node as mpirun works fine (except for that spurious 
> debugging line):
> 
> $ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
> [raijin7:22248] SETTING BINDING TO CORE
>  Data for JOB [11140,1] offset 0 Total slots allocated 16
> 
>     JOB MAP   
> 
>  Data for node: raijin7   Num slots: 16   Max slots: 0Num procs: 4
>   Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 
> 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 
> 0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
>   Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 
> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 
> 0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
>   Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 
> 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 
> 1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
>   Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 
> 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 
> 1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]
> 
>  =
> But the same on a remote node fails in a rather odd fashion:
> 
> $ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
> [raijin7:22291] SETTING BINDING TO CORE
> [r1:10565] SETTING BINDING TO CORE
>  Data for JOB [10879,1] offset 0 Total slots allocated 32
> 
>     JOB MAP   
> 
>  Data for node: r1Num slots: 16   Max slots: 0Num procs: 4
>   Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
>   Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
>   Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
>   Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A
> 
>  =
> --
> The request to bind processes could not be completed due to
> an internal error - the locale of the following process was
> not set by the mapper code:
> 
>   Process:  [[10879,1],2]
> 
> Please contact the OMPI developers for assistance. Meantime,
> you will still be able to run your application without binding
> by specifying "--bind-to none" on your command line.
> --
> --
> ORTE has lost communication with a remote daemon.
> 
>   HNP daemon   : [[10879,0],0] on node raijin7
>   Remote daemon: [[10879,0],1] on node r1
> 
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --
> 
> On the other hand, mapping by node works fine...
> 
> > mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
> [raijin7:22668] SETTING BINDING TO CORE
> [r1:10777] SETTING BINDING TO CORE
>  Data for JOB [9696,1] offset 0 Total slots allocated 32
> 
>     JOB MAP   
> 
>  Data for node: r1Num slots: 16   Max slots: 0Num procs: 4
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A
> 
>  =
>  Data for JOB [9696,1] offset 0 Total slots allocated 32
> 
>     JOB MAP   
> 
>  Data for node: r1Num slots: 16   Max slots: 0Num procs: 4
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 
> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], 
> socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 
> 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], 
> socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
>   Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 
> 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 

[OMPI devel] ORTE DVM update

2017-09-18 Thread r...@open-mpi.org
Hi all

The DVM on master is working again. You will need to use the new “prun” tool 
instead of “orterun” to submit your jobs - note that “prun” automatically finds 
the DVM, and so there is no longer any need to have orte-dvm report its URI, 
nor does prun take the “-hnp” argument.

The “orte-ps” and “orte-top” tools will not work until they have been updated. 
I’ll get to them before we branch for v3.1.

Also, not all options are supported yet by prun - e.g., “map-by” and friends. 
Work in progress.
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Stale PRs

2017-09-06 Thread r...@open-mpi.org
Okay, the list at least is getting smaller!

Jeff: you now have the two oldest PRs sitting out there for more than a year 
now. One states it is being held because the Forum can’t decide something, so 
maybe it should be shelved?

The other is about clock_gettime, which I believe we resolved since then and 
probably is no longer even relevant (and has lots of conflicts as a result)

Ralph

> On Aug 31, 2017, at 11:15 AM, r...@open-mpi.org wrote:
> 
> Thanks George - wasn’t picking on you, just citing the oldest one on the 
> list. Once that goes in, I’ll be poking the next :-)
> 
>> On Aug 31, 2017, at 11:10 AM, George Bosilca <bosi...@icl.utk.edu 
>> <mailto:bosi...@icl.utk.edu>> wrote:
>> 
>> Ralph,
>> 
>> I updated the TCP-related pending PR. It offers a better solution than what 
>> we have today, unfortunately not perfect as it would require additions to 
>> the configure. Waiting for reviews.
>> 
>>   George.
>> 
>> 
>> On Thu, Aug 31, 2017 at 10:12 AM, r...@open-mpi.org 
>> <mailto:r...@open-mpi.org> <r...@open-mpi.org <mailto:r...@open-mpi.org>> 
>> wrote:
>> Thanks to those who made a first pass at these old PRs. The oldest one is 
>> now dated Dec 2015 - nearly a two-year old change for large messages over 
>> the TCP BTL, waiting for someone to commit.
>> 
>> 
>> > On Aug 30, 2017, at 7:34 AM, r...@open-mpi.org <mailto:r...@open-mpi.org> 
>> > wrote:
>> >
>> > Hey folks
>> >
>> > This is getting ridiculous - we have PRs sitting on GitHub that are more 
>> > than a year old! If they haven’t been committed in all that time, they 
>> > can’t possibly be worth anything now.
>> >
>> > Would people _please_ start paying attention to their PRs? Either close 
>> > them, or update/commit them.
>> >
>> > Ralph
>> >
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://lists.open-mpi.org/mailman/listinfo/devel 
>> <https://lists.open-mpi.org/mailman/listinfo/devel>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI 3.1 Feature List

2017-09-05 Thread r...@open-mpi.org
We currently have PMIx v2.1.0beta in OMPI master. This includes cross-version 
support - i.e., OMPI v3.1 would be able to run against an RM using any PMIx 
version. At the moment, the shared memory (or dstore) support isn’t working 
across versions, but I’d consider that a “bug” that will hopefully be addressed 
prior to release. If not, then we can release with dstore disabled by default 
along with a note that dstore should be disabled on the RM as well if it isn’t 
using 2.1.

Will keep you updated.
Ralph

> On Sep 5, 2017, at 9:19 AM, Barrett, Brian via devel 
>  wrote:
> 
> All -
> 
> With 3.0 (finally) starting to wrap up, we’re starting discussion of the 3.1 
> release.  As a reminder, we are targeting 2017 for the release, are going to 
> cut the release from master, and are not going to have a feature whitelist 
> for the release.  We are currently looking at a timeline for cutting the 3.1 
> branch from master.  It will be after 2.0.x is wrapped (possibly one more 
> bugfix release) and 3.0.0 has released.  That said, we are looking for 
> feedback on features that your organization plans to contribute for 3.1 that 
> are not already in master.  What are the features and what is the timeline 
> for submission to master?  If you have something not in master that needs to 
> be, please comment on timelines before next Tuesday’s con-call.
> 
> Thanks,
> 
> The 3.1 release managers
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] configure --with paths don't allow for absolute path specification

2017-09-02 Thread r...@open-mpi.org
Okay, so this has nothing to do with the internal pmix or the pmi-1/2 headers 
it provides, which is what confused me. You are building the SLURM pmi support 
for OMPI, which does indeed use the slurm-provided headers and pmi libraries.

Someone can take a look at that as it should check first in the given 
directory, and then in the include subdir. However, you might also then need to 
specify --with-pmi-libdir to find the libraries.


> On Sep 2, 2017, at 9:37 AM, Phil K <pkde...@yahoo.com> wrote:
> 
> The issue is getting through the OMPI configure without error, which you 
> _cannot_ do when using --with-pmi=/usr/include/slurm if pmi.h and pmi2.h are 
> installed *only* in /usr/include/slurm.
> 
> 
> On Saturday, September 2, 2017 9:55 AM, "r...@open-mpi.org" 
> <r...@open-mpi.org> wrote:
> 
> 
> I’m honestly confused by this as I don’t understand what you are trying to 
> accomplish. Neither OMPI nor PMIx uses those headers. PMIx provides them just 
> as a convenience for anyone wanting to compile a PMI based code, and so that 
> we could internally write functions that translate from PMI to the equivalent 
> PMIx calls.
> 
> So you can compile your code with -any- PMI header you want - so long as you 
> then link your code to a PMIx library, it doesn’t matter if that header 
> differs somewhat from the one we use. All that matters is that any function 
> call you use matches the one we wrote against. We took ours directly from the 
> MPICH official ones.
> 
> Ralph
> 
>> On Sep 1, 2017, at 10:08 PM, Phil K via devel <devel@lists.open-mpi.org 
>> <mailto:devel@lists.open-mpi.org>> wrote:
>> 
>> I just wanted to share a workaround I came up with for this openmpi 
>> configure issue.
>> 
>> When specifying header paths in configure, openmpi adds an /include subpath 
>> to the --with-pmi specifier
>> (and others).  This is documented very clearly.  Recently, in switching over 
>> to internal pmix, I wanted to rip
>> out the pmix-provided pmi.h and pmi2.h development headers and use those 
>> supplied by slurm since openmpi
>> will link to the slurm-provided pmi libraries and I like to match headers 
>> and libraries properly.  (Yes the headers
>> are similar but they are not identical).
>> 
>> On my distro, the pmix pmi.h and pmi2.h headers were in /usr/include, which 
>> openmpi finds with:
>> 
>> --with-pmi=/usr
>> 
>> After removing the pmix development headers, I am left with the slurm 
>> headers, which are in /usr/include/slurm.
>> Unfortunately the configure item:
>> 
>> --with-pmi=/usr/include/slurm
>> 
>> fails to locate the pmi.h/pmi2.h slurm headers due to the addition of that 
>> /include subpath, i.e. they are not
>> in /usr/include/slurm/include.  There is no way to specify an absolute path 
>> to a header directory.  So here's
>> what I did:
>> 
>> (unpack tarball to /path/to/openmpi-2.1.1)
>> cd /path/to/openmpi-2.1.1
>> mkdir slurm
>> ln -s /usr/include/slurm /path/to/openmpi-2.1.1/slurm/include
>> 
>> then configure as follows:
>> 
>> ./configure --with-pmi=/path/to/openmpi-2.1.1/slurm
>> 
>> The configure adds the /include subpath and finds the slurm pmi/pmi2 headers 
>> through my symlink.
>> 
>> Cumbersome, but it works.
>> 
>> Phil
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> 
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] configure --with paths don't allow for absolute path specification

2017-09-02 Thread r...@open-mpi.org
I’m honestly confused by this as I don’t understand what you are trying to 
accomplish. Neither OMPI nor PMIx uses those headers. PMIx provides them just 
as a convenience for anyone wanting to compile a PMI based code, and so that we 
could internally write functions that translate from PMI to the equivalent PMIx 
calls.

So you can compile your code with -any- PMI header you want - so long as you 
then link your code to a PMIx library, it doesn’t matter if that header differs 
somewhat from the one we use. All that matters is that any function call you 
use matches the one we wrote against. We took ours directly from the MPICH 
official ones.

Ralph

> On Sep 1, 2017, at 10:08 PM, Phil K via devel  
> wrote:
> 
> I just wanted to share a workaround I came up with for this openmpi configure 
> issue.
> 
> When specifying header paths in configure, openmpi adds an /include subpath 
> to the --with-pmi specifier
> (and others).  This is documented very clearly.  Recently, in switching over 
> to internal pmix, I wanted to rip
> out the pmix-provided pmi.h and pmi2.h development headers and use those 
> supplied by slurm since openmpi
> will link to the slurm-provided pmi libraries and I like to match headers and 
> libraries properly.  (Yes the headers
> are similar but they are not identical).
> 
> On my distro, the pmix pmi.h and pmi2.h headers were in /usr/include, which 
> openmpi finds with:
> 
> --with-pmi=/usr
> 
> After removing the pmix development headers, I am left with the slurm headers, 
> which are in /usr/include/slurm.
> Unfortunately the configure item:
> 
> --with-pmi=/usr/include/slurm
> 
> fails to locate the pmi.h/pmi2.h slurm headers due to the addition of that 
> /include subpath, i.e. they are not
> in /usr/include/slurm/include.  There is no way to specify an absolute path 
> to a header directory.  So here's
> what I did:
> 
> (unpack tarball to /path/to/openmpi-2.1.1)
> cd /path/to/openmpi-2.1.1
> mkdir slurm
> ln -s /usr/include/slurm /path/to/openmpi-2.1.1/slurm/include
> 
> then configure as follows:
> 
> ./configure --with-pmi=/path/to/openmpi-2.1.1/slurm
> 
> The configure adds the /include subpath and finds the slurm pmi/pmi2 headers 
> through my symlink.
> 
> Cumbersome, but it works.
> 
> Phil
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Stale PRs

2017-08-31 Thread r...@open-mpi.org
Thanks George - wasn’t picking on you, just citing the oldest one on the list. 
Once that goes in, I’ll be poking the next :-)

> On Aug 31, 2017, at 11:10 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> Ralph,
> 
> I updated the TCP-related pending PR. It offers a better solution than what 
> we have today, unfortunately not perfect as it would require additions to the 
> configure. Waiting for reviews.
> 
>   George.
> 
> 
> On Thu, Aug 31, 2017 at 10:12 AM, r...@open-mpi.org 
> <mailto:r...@open-mpi.org> <r...@open-mpi.org <mailto:r...@open-mpi.org>> 
> wrote:
> Thanks to those who made a first pass at these old PRs. The oldest one is now 
> dated Dec 2015 - nearly a two-year old change for large messages over the TCP 
> BTL, waiting for someone to commit.
> 
> 
> > On Aug 30, 2017, at 7:34 AM, r...@open-mpi.org <mailto:r...@open-mpi.org> 
> > wrote:
> >
> > Hey folks
> >
> > This is getting ridiculous - we have PRs sitting on GitHub that are more 
> > than a year old! If they haven’t been committed in all that time, they 
> > can’t possibly be worth anything now.
> >
> > Would people _please_ start paying attention to their PRs? Either close 
> > them, or update/commit them.
> >
> > Ralph
> >
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://lists.open-mpi.org/mailman/listinfo/devel 
> <https://lists.open-mpi.org/mailman/listinfo/devel>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Stale PRs

2017-08-31 Thread r...@open-mpi.org
Thanks to those who made a first pass at these old PRs. The oldest one is now 
dated Dec 2015 - nearly a two-year old change for large messages over the TCP 
BTL, waiting for someone to commit.


> On Aug 30, 2017, at 7:34 AM, r...@open-mpi.org wrote:
> 
> Hey folks
> 
> This is getting ridiculous - we have PRs sitting on GitHub that are more than 
> a year old! If they haven’t been committed in all that time, they can’t 
> possibly be worth anything now.
> 
> Would people _please_ start paying attention to their PRs? Either close them, 
> or update/commit them.
> 
> Ralph
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] [2.1.2rc3] libevent SEGV on FreeBSD/amd64

2017-08-30 Thread r...@open-mpi.org
Yeah, that caught my eye too as that is impossibly large. We only have a 
handful of active queues - looks to me like there is some kind of alignment 
issue.

Paul - has this configuration worked with prior versions of OMPI? Or is this 
something new?

Ralph
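
For reference, the call site in the backtrace below corresponds to perfectly
ordinary libevent 2.x usage (the copy embedded in OMPI just namespaces the
symbols, hence opal_libevent2022_event_assign). A minimal sketch, with names
taken from the stock libevent API:

#include <event2/event.h>
#include <stdio.h>

/* event_assign() copies values straight out of the event_base - the failing
 * line is ev->ev_pri = base->nactivequeues / 2 - so a nonsensical
 * nactivequeues value suggests the base object (or its layout) is off, rather
 * than the event being assigned. */
static void cb(evutil_socket_t fd, short what, void *arg)
{
    (void)fd; (void)what; (void)arg;
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct event ev;

    if (NULL == base) {
        fprintf(stderr, "event_base_new failed\n");
        return 1;
    }
    if (0 != event_assign(&ev, base, -1, EV_READ, cb, NULL)) {
        fprintf(stderr, "event_assign failed\n");
    }
    event_base_free(base);
    return 0;
}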

> On Aug 30, 2017, at 4:17 PM, Larry Baker  wrote:
> 
> Paul,
> 
>> (gdb) print base->nactivequeues
> 
> 
> seems like an extraordinarily large number to me.  I don't know what the 
> implications of the --enable-debug clang option are.  Any chance the 
> SEGFAULT is a debugging trap when an uninitialized value is encountered?
> 
> The other thought I had is an alignment trap if, for example, nactivequeues 
> is a 64-bit int but is not 64-bit aligned.  As far as I can tell, 
> nactivequeues is a plain int.  But, what that is on FreeBSD/amd64, I do not 
> know.
> 
> Should there be more information in dmesg or a system log file with the trap 
> code so you can identify whether it is an instruction fetch (VERY unlikely), 
> an operand fetch, or a store that caused the trap?
> 
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov 
> 
> 
> 
>> On 30 Aug 2017, at 3:17:05 PM, Paul Hargrove wrote:
>> 
>> I am testing the 2.1.2rc3 tarball on FreeBSD-11.1, configured with
>>--prefix=[...] --enable-debug CC=clang CXX=clang++ --disable-mpi-fortran 
>> --with-hwloc=/usr/local
>> 
>> The CC/CXX setting are to use the system default compilers (rather than 
>> gcc/g++ in /usr/local/bin).
>> The --with-hwloc is to avoid issue #3992 
>>  (though I have not determined 
>> if that impacts this RC).
>> 
>> When running ring_c I get a SEGV from orterun, for which a gdb backtrace is 
>> given below.
>> The one surprising thing (highlighted) in the backtrace is that both the RHS 
>> and LHS of the assignment appear to be valid memory locations.
>> So, if the backtrace is accurate then I am at a loss as to why a SEGV occurs.
>> 
>> -Paul
>> 
>> 
>> Program terminated with signal 11, Segmentation fault.
>> [...]
>> #0  opal_libevent2022_event_assign (ev=0x8065482c0, base=<optimized out>, fd=<optimized out>,
>> events=2, callback=<optimized out>, arg=0x0)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/event/libevent2022/libevent/event.c:1779
>> 1779ev->ev_pri = base->nactivequeues / 2;
>> (gdb) print base->nactivequeues
>> $3 = 106201992
>> (gdb) print ev->ev_pri
>> $4 = 0 '\0'
>> (gdb) where
>> #0  opal_libevent2022_event_assign (ev=0x8065482c0, base=<optimized out>, fd=<optimized out>,
>> events=2, callback=<optimized out>, arg=0x0)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/event/libevent2022/libevent/event.c:1779
>> #1  0x0008062e1fd2 in pmix_start_progress_thread ()
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/util/progress_threads.c:83
>> #2  0x0008063047e4 in PMIx_server_init (module=0x806545be8, 
>> info=0x802e16a00, ninfo=2)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c:310
>> #3  0x0008062c12f6 in pmix1_server_init (module=0x800b106a0, 
>> info=0x7fffe290)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/opal/mca/pmix/pmix112/pmix1_server_south.c:140
>> #4  0x000800889f43 in pmix_server_init ()
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/orted/pmix/pmix_server.c:261
>> #5  0x000803e22d87 in rte_init ()
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/mca/ess/hnp/ess_hnp_module.c:666
>> #6  0x00080084a45e in orte_init (pargc=0x7fffe988, 
>> pargv=0x7fffe980, flags=4)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/runtime/orte_init.c:226
>> #7  0x004046a4 in orterun (argc=7, argv=0x7fffea18)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/tools/orterun/orterun.c:831
>> #8  0x00403bc2 in main (argc=7, argv=0x7fffea18)
>> at 
>> /home/phargrov/OMPI/openmpi-2.1.2rc3-freebsd11-amd64/openmpi-2.1.2rc3/orte/tools/orterun/main.c:13
>> 
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov 
>> 
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org 
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> 

[OMPI devel] Stale PRs

2017-08-30 Thread r...@open-mpi.org
Hey folks

This is getting ridiculous - we have PRs sitting on GitHub that are more than a 
year old! If they haven’t been committed in all that time, they can’t possibly 
be worth anything now.

Would people _please_ start paying attention to their PRs? Either close them, 
or update/commit them.

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [MTT devel] trying again to ppost results to "IU" using pyclient

2017-08-15 Thread r...@open-mpi.org
I have not been able to do so, sadly.

> On Aug 15, 2017, at 2:31 PM, Howard Pritchard  wrote:
> 
> HI Folks,
> 
> Thanks Josh.  That doesn't seem to help much though.  Exactly which URL 
> should we be submitting a POST request to get a serial
> number, and what should the data payload format be?  Should it be SERIAL,1 or 
> serial,serial?  The perl code seems to imply
> the former.
> 
> I've tried using a totally generic setup, outside of any LANL firewalls, and 
> am pretty sure now that whatever we have coded up in
> IUDatabase.py doesn't work.
> 
> Has anyone actually gotten the python client to submit results to the IU 
> database?  If so, could someone post their ini
> file to this list?
> 
> Thanks,
> 
> Howard
> 
> 
> 
> 
> 2017-08-14 16:17 GMT-06:00 Howard Pritchard  >:
> Hi Josh,
> 
> Okay, I changed the URL to submit and now get an interesting error back in 
> the raw output:
> 
> Absolute URI not allowed if server is not a proxy.
> Also, what should be the format of the POST request to get a serial #?  I'm 
> suspicious that
> 
> this area of the pyclient is buggy.
> 
> Howard
> 
> 
> 
> 
> 2017-08-14 15:50 GMT-06:00 Josh Hursey  >:
> The CherryPy server was not running... It looks like it died on 6/9 (probably 
> when we upgraded the server since I had to rebuild its virtualenv). It's 
> running now, and I have setup a monitor to let me know if it goes down again.
> 
> The URL that you need to use for the CherryPy API is:
>   https://mtt.open-mpi.org/submit/cpy/api 
> 
> 
> I would give that one a try. If that doesn't work then try the PHP one here:
>   https://mtt.open-mpi.org/submit/ 
> 
> 
> On Mon, Aug 14, 2017 at 3:39 PM, Howard Pritchard  > wrote:
> HI Folks,
> 
> Well actually I dug into this some more. 
> 
> The problem happens when the pyclient tries to get a serial number from the 
> "IUdatabase" via a post to
> 
> https://mtt.open-mpi.org/serial 
> 
> This doesn't work.  mtt-open-mpi.org  comes back 
> with a 404 error.
> 
> Any ideas which URL I should be using to get a serial number back from the 
> database server?
> 
> Howard
> 
> 
> 2017-08-09 12:44 GMT-06:00 Van Dresser, Daniel N 
> >:
> Hi Howard,
> 
>  
> 
> Ricky might be able to answer this when he returns from holiday next week.  
> If not, Josh might know.
> 
>  
> 
> If you end up debugging this further on your own, I would recommend using the 
> docker images for postgres and cherrypy.
> 
>  
> 
> Instructions for building and testing those images: 
> https://github.com/open-mpi/mtt/blob/master/server/docker/README.docker.txt 
> 
> Note the instructions in that readme need to be run from the mtt/server 
> directory.
> 
>  
> 
> Thanks,
> 
>  
> 
>   -- Noah
> 
>  
> 
> 
> From: mtt-devel [mailto:mtt-devel-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
> Sent: Wednesday, August 09, 2017 3:06 AM
> To: Development list for the MPI Testing Tool  >
> Subject: [MTT devel] trying again to ppost results to "IU" using pyclient
> 
>  
> 
> Hi Folks,
> 
>  
> 
> Finally I got a chance to return to using the pyclient (with python3 - which 
> is another story and subject of an upcoming PR).
> 
>  
> 
> Things seem to work till I try to post to "IU" database, in which case with 
> --verbose I get
> 
> this request/response output:
> 
>  
> 
> <<< Payload (End  ) -->>
> 
> /opt/hi-master/Python-3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py:858:
>  InsecureRequestWarning: Unverified HTTPS request is being made. Adding 
> certificate verification is strongly advised. See: 
> https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 
> 
>   InsecureRequestWarning)
> 
> /opt/hi-master/Python-3.6.2/lib/python3.6/site-packages/urllib3/connectionpool.py:858:
>  InsecureRequestWarning: Unverified HTTPS request is being made. Adding 
> certificate verification is strongly advised. See: 
> https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings 
> 
>   InsecureRequestWarning)
> 
> <<< Response -->>
> 
> Result: 400: text/html; charset=UTF-8
> 
> {'Date': 'Wed, 09 Aug 2017 09:57:16 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 
> 'Content-Length': '54', 'Connection': 

Re: [OMPI devel] Verbosity for "make check"

2017-08-08 Thread r...@open-mpi.org
Okay, I’ll update that PR accordingly

> On Aug 8, 2017, at 10:51 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Per our discussion on the webex today about getting verbosity out of running 
> "make check" (e.g., to see what the heck is going on in 
> https://github.com/open-mpi/ompi/pull/4028).
> 
> I checked the Automake docs: they (strongly) discourage the use of the serial 
> test suite.
> 
> Instead, they suggest setting VERBOSE=1 (not V=1).  This will cause the 
> stdout/stderr of any failed test to be output after all the tests in a single 
> dir are run.  E.g.,
> 
>make check VERBOSE=1
> 
> will do the trick.  Output from multiple failed tests is clearly delineated 
> from each other, so you can tell which output is which.
> 
> (this is actually better than the old serial tester, which will output all 
> output from all tests after each test -- even the successful ones)
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] PMIX visibility

2017-07-25 Thread r...@open-mpi.org
George - I believe this PR fixes the problems. At least, it now runs on OSX for 
me:

https://github.com/open-mpi/ompi/pull/3957 
<https://github.com/open-mpi/ompi/pull/3957>


> On Jul 25, 2017, at 5:27 AM, r...@open-mpi.org wrote:
> 
> Ouch - sorry about that. pmix_setenv is actually defined down in the code 
> base, so let me investigate why it got into pmix_common.
> 
>> On Jul 24, 2017, at 10:26 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>> The last PMIX import broke the master on all platforms that support 
>> visibility. I have pushed a patch that solves __most__ of the issues (that I 
>> could find). I say most because there is a big one left that requires a 
>> significant change in PMIX design.
>> 
>> This problem arise from the use of the pmix_setenv symbol in one of the MCA 
>> components (a totally legit operation). Except that in PMIX the pmix_setenv 
>> is defined in opal/mca/pmix/pmix2x/pmix/include/pmix_common.h, which is one 
>> of these headers that is self-contained and does not include the 
>> config_bottom.h, and thus has no access to the PMIX_EXPORT.
>> 
>> Here are 3 possible solutions:
>> 1. don't use pmix_setenv in any of the MCA components
>> 2. create a new header that provides support for all util functions (similar 
>> to OPAL) and that supports PMIX_EXPORT
>> 3. make pmix_common.h not self-contained in order to provide access to 
>> PMIX_EXPORT.
>> 
>> Any of these approaches requires changes to PMIX (and a push upstream). 
>> Meanwhile the trunk seems to be broken on all platforms that support 
>> visibility.
>> 
>>  George.
>> 
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
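
For anyone following along, the mechanism George describes boils down to the
usual ELF symbol-visibility pattern. A rough sketch (the macro and prototype
names here are illustrative, not the literal PMIx declarations):

/* When a library is compiled with -fvisibility=hidden, only symbols that are
 * explicitly marked "default" remain visible to other DSOs, such as MCA
 * components loaded at runtime. */
#if defined(__GNUC__) && (__GNUC__ >= 4)
#define EXAMPLE_EXPORT __attribute__((visibility("default")))
#else
#define EXAMPLE_EXPORT
#endif

/* A prototype in a public header therefore has to carry the export macro;
 * without it the symbol stays hidden and a component calling it fails to
 * resolve at load time: */
EXAMPLE_EXPORT void example_setenv_helper(const char *name, const char *value);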

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] PMIX visibility

2017-07-25 Thread r...@open-mpi.org
Ouch - sorry about that. pmix_setenv is actually defined down in the code base, 
so let me investigate why it got into pmix_common.

> On Jul 24, 2017, at 10:26 PM, George Bosilca  wrote:
> 
> The last PMIX import broke the master on all platforms that support 
> visibility. I have pushed a patch that solves __most__ of the issues (that I 
> could find). I say most because there is a big one left that requires a 
> significant change in PMIX design.
> 
> This problem arise from the use of the pmix_setenv symbol in one of the MCA 
> components (a totally legit operation). Except that in PMIX the pmix_setenv 
> is defined in opal/mca/pmix/pmix2x/pmix/include/pmix_common.h, which is one 
> of these headers that is self-contained and does not include the 
> config_bottom.h, and thus has no access to the PMIX_EXPORT.
> 
> Here are 3 possible solutions:
> 1. don't use pmix_setenv in any of the MCA components
> 2. create a new header that provides support for all util functions (similar 
> to OPAL) and that supports PMIX_EXPORT
> 3. make pmix_common.h not self-contained in order to provide access to 
> PMIX_EXPORT.
> 
> Any of these approaches requires changes to PMIX (and a push upstream). 
> Meanwhile the trunk seems to be broken on all platforms that support 
> visibility.
> 
>   George.
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] hwloc 2 thing

2017-07-22 Thread r...@open-mpi.org
You’ll have to be a little clearer than that - what “issues” are you talking 
about?

> On Jul 21, 2017, at 10:06 PM, saisilpa b via devel <devel@lists.open-mpi.org> 
> wrote:
> 
> Hi ,
>  
> Can some one provide the configuration to build the openmpi libraries to 
> avoid the issues on ld libraries while running.
>  
> thanks,
> silpa
>  
>  
> 
> 
> On Friday, 21 July 2017 8:52 AM, "r...@open-mpi.org" <r...@open-mpi.org> 
> wrote:
> 
> 
> Yes - I have a PR that’s just about cleared that will remove the hwloc2 install. It needs 
> to be redone
> 
>> On Jul 20, 2017, at 8:18 PM, Howard Pritchard <hpprit...@gmail.com 
>> <mailto:hpprit...@gmail.com>> wrote:
>> 
>> Hi Folks,
>> 
>> I'm noticing that if I pull a recent version of master with hwloc 2 support 
>> into my local repo, my autogen.pl run fails unless I do the following:
>> 
>> mkdir $PWD/opal/mca/hwloc/hwloc2x/hwloc/include/private/autogen
>> 
>> where PWD is the top level of my work area.
>> 
>> I did a
>> 
>> git clean -df
>> 
>> but that did not help.
>> 
>> Is anyone else seeing this?
>> 
>> Just curious,
>> 
>> Howard
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] hwloc 2 thing

2017-07-20 Thread r...@open-mpi.org
Yes - I have a PR that’s just about cleared that will remove the hwloc2 install. It needs 
to be redone

> On Jul 20, 2017, at 8:18 PM, Howard Pritchard  wrote:
> 
> Hi Folks,
> 
> I'm noticing that if I pull a recent version of master with hwloc 2 support 
> into my local repo, my autogen.pl run fails unless I do the following:
> 
> mkdir $PWD/opal/mca/hwloc/hwloc2x/hwloc/include/private/autogen
> 
> where PWD is the top level of my work area.
> 
> I did a
> 
> git clean -df
> 
> but that did not help.
> 
> Is anyone else seeing this?
> 
> Just curious,
> 
> Howard
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] LD_LIBRARY_PATH and environment variables not getting set in remote hosts

2017-07-20 Thread r...@open-mpi.org
You must be kidding - 1.2.8??? We wouldn’t even know where to begin to advise 
you on something that old - I’m actually rather surprised it even compiled on a 
new Linux.


> On Jul 20, 2017, at 4:22 AM, saisilpa b via devel  
> wrote:
> 
> HI Gilles,
> 
> Thanks for your immediate response.
> 
> I am using OpenMPI 1.2.8.tar.bz2 for my project and use the orterun and orted 
> binaries from a distribution perspective.
> 
> I built the openmpi binaries (orterun, orted) with the option 
> --enable-orterun-prefix-by-default, but orted is reporting an error that it is 
> not able to find the file in the particular location while running.
> 
> we use the below command:
> orterun  --hostfile ~/hosts -np processes Binary  .config
> 
> if the host file configured to one host then it is working fine.
> 
> if we configure to multiple hosts then we are getting the below error from 
> remote nodes.
> 
> unable to load shared libraries: (this is specific to a dependent library of 
> the application)
> 
> Note that: The library paths and environment variables are configured in the 
> application config file and it is invoked by .bashrc.
> 
> I believe the patch updated for suse Linux is : 3.0.101-0.47.96
> 
> Please let me know if you require any specific details to help on this issue. 
> Thanks.
> 
> Silpa
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Thursday, 20 July 2017 3:59 PM, Gilles Gouaillardet 
>  wrote:
> 
> 
> Hi,
> 
> you meant Open MPI 1.8.2, right ?
> 
> 
> as far as i am concerned, i always configure Open MPI with
> --enable-mpirun-prefix-by-default, so i do not need to set
> LD_LIBRARY_PATH in my .bashrc
> 
> if you want us to investigate this issue, please post the full error message
> - is the issue reported by mpirun ? orted ? or the MPI application ?
> - should the missing library provided by Open MPI ? the system ? or
> the application/a 3rd party library ?
> - what is the one patch that caused your app to stop working ?
> 
> Cheers,
> 
> Gilles
> 
> On Thu, Jul 20, 2017 at 6:49 PM, saisilpa b via devel
> > wrote:
> > Hi All,
> >
> > I am Silpakala and using OpenMPI 1.2.8 for my project. We are using orterun
> > and orted binaries to invoke the program exectable from multiple hosts and
> > was working successfully. There was one patch applied in Suse Linux, after
> > that the program is not working.
> >
> > We have multiple hosts are configured in NFS. The LD_LIBRARY_PATH and
> > environment variables are configured in application configuration file and
> > it is getting invoked from  .bashrc.
> >
> > After the patch installation in Suse Linux, we are getting the error as
> > "error while loading shared libraries: cannot open shared libraries: no such
> > file or directory"  when the executable try to invoke from remote hosts.
> >
> > Can you please let us know any solution for the same. Much appreciated for
> > your resonse.
> >
> > Thanks, Silpa
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org 
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> > 
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Issue/PR tagging

2017-07-19 Thread r...@open-mpi.org
Okay - thanks!

> On Jul 19, 2017, at 4:47 PM, Barrett, Brian via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> I’ll update the wiki (and figure out where on our wiki to put more general 
> information), but the basics are:
> 
> If you find a bug, file an issue.  Add Target:v??? labels for any branch it 
> impacts.  If we decide later not to fix the issue on a branch, we’ll remove 
> the label
> Open/find an issue for any PR going to release branches.  That issue can 
> (possibly should, if the issue impacts multiple branches) have multiple 
> Target:v??? labels
> If a PR is for a release branch (ie, it’s immediate target to merge to is a 
> release branch), please add a Target:v??? label and reference the issue
> If a PR is for master, it can reference an issue (if there’s an issue 
> associated with it), but should not have a Target:v??? label
> If an issue is fixed in master, but not merged into branches, don’t close the 
> issue
> 
> I think that’s about it.  There’s some workflows we want to build to automate 
> enforcing many of these things, but for now, it’s just hints to help the RMs 
> not lose track of issues.
> 
> Brian
> 
>> On Jul 19, 2017, at 12:18 PM, r...@open-mpi.org wrote:
>> 
>> Hey folks
>> 
>> I know we made some decisions last week about how to tag issues and PRs to 
>> make things easier to track for release branches, but the wiki notes don’t 
>> cover what we actually decided to do. Can someone briefly summarize? I 
>> honestly have forgotten if we tag issues, or tag PRs
>> 
>> Ralph
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Issue/PR tagging

2017-07-19 Thread r...@open-mpi.org
Hey folks

I know we made some decisions last week about how to tag issues and PRs to make 
things easier to track for release branches, but the wiki notes don’t cover 
what we actually decided to do. Can someone briefly summarize? I honestly have 
forgotten if we tag issues, or tag PRs

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI 3.0.0 first release candidate posted

2017-06-29 Thread r...@open-mpi.org
I tracked down a possible source of the oob/tcp error - this should address it, 
I think: https://github.com/open-mpi/ompi/pull/3794 


> On Jun 29, 2017, at 3:14 PM, Howard Pritchard  wrote:
> 
> Hi Brian,
> 
> I tested this rc using both srun native launch and mpirun on the following 
> systems:
> - LANL CTS-1 systems (haswell + Intel OPA/PSM2)
> - LANL network testbed system (haswell  + connectX5/UCX and OB1)
> - LANL Cray XC
> 
> I am finding some problems with mpirun on the network testbed system.  
> 
> For example, for spawn_with_env_vars from IBM tests:
> 
> *** Error in `mpirun': corrupted double-linked list: 0x006e75b0 ***
> 
> === Backtrace: =
> 
> /usr/lib64/libc.so.6(+0x7bea2)[0x76597ea2]
> 
> /usr/lib64/libc.so.6(+0x7cec6)[0x76598ec6]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(opal_proc_table_remove_all+0x91)[0x77855851]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5e09)[0x73cc0e09]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5952)[0x73cc0952]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x77b94032]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7788592d]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x75b04e4d]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x77b43bf9]
> 
> mpirun[0x4014f1]
> 
> mpirun[0x401018]
> 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7653db15]
> 
> mpirun[0x400f29]
> 
> 
> and another like
> 
> [hpp@hi-master dynamic (master *)]$mpirun -np 1 ./spawn_with_env_vars
> 
> Spawning...
> 
> Spawned
> 
> Child got foo and baz env variables -- yay!
> 
> *** Error in `mpirun': corrupted double-linked list: 0x006eb350 ***
> 
> === Backtrace: =
> 
> /usr/lib64/libc.so.6(+0x7b184)[0x76597184]
> 
> /usr/lib64/libc.so.6(+0x7d1ec)[0x765991ec]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x57a2)[0x732297a2]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x5a87)[0x73229a87]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x77b94032]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7788592d]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x75b04e4d]
> 
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x77b43bf9]
> 
> mpirun[0x4014f1]
> 
> mpirun[0x401018]
> 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7653db15]
> 
> mpirun[0x400f29]
> 
> It doesn't happen on every run though.
> 
> I'll do some more investigating, but probably not till next week.
> 
> Howard
> 
> 
> 2017-06-28 11:50 GMT-06:00 Barrett, Brian via devel  >:
> The first release candidate of Open MPI 3.0.0 is now available 
> (https://www.open-mpi.org/software/ompi/v3.0/ 
> ).  We expect to have at least 
> one more release candidate, as there are still outstanding MPI-layer issues 
> to be resolved (particularly around one-sided).  We are posting 3.0.0rc1 to 
> get feedback on run-time stability, as one of the big features of Open MPI 
> 3.0 is the update to the PMIx 2 runtime environment.  We would appreciate any 
> and all testing you can do,  around run-time behaviors.
> 
> Thank you,
> 
> Brian & Howard
> ___
> devel mailing list
> devel@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] SLURM 17.02 support

2017-06-27 Thread r...@open-mpi.org
Okay, I added the warning here: https://github.com/open-mpi/ompi/pull/3778 
<https://github.com/open-mpi/ompi/pull/3778>

This is what it looks like for SLURM (slightly different error message for 
ALPS):

$ srun -n 1 ./mpi_spin
--
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[rhc001:189810] Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to guarantee that all 
other processes were killed!
srun: error: rhc001: task 0: Exited with exit code 1
$
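
For completeness, the two configure paths the message refers to look roughly like 
this; install prefixes, version numbers, and the srun MPI plugin name are 
placeholders and depend on the local installation:

# SLURM 16.05 or later: build SLURM itself against PMIx, then launch with it.
# (run in the SLURM source tree)
./configure --prefix=/opt/slurm --with-pmix=/opt/pmix && make install
srun --mpi=pmix -n 2 ./mpi_hello

# Earlier SLURM versions: build Open MPI against SLURM's PMI-1/PMI-2 library.
# (run in the Open MPI source tree)
./configure --prefix=/opt/openmpi --with-pmi=/opt/slurm && make install
srun -n 2 ./mpi_hello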


> On Jun 19, 2017, at 9:35 PM, Barrett, Brian via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> By the way, there was a change between 2.x and 3.0.x:
> 
> 2.x:
> 
> Hello, world, I am 0 of 1, (Open MPI v2.1.2a1, package: Open MPI 
> bbarrett@ip-172-31-64-10 Distribution, ident: 2.1.2a1, repo rev: 
> v2.1.1-59-gdc049e4, Unreleased developer copy, 148)
> Hello, world, I am 0 of 1, (Open MPI v2.1.2a1, package: Open MPI 
> bbarrett@ip-172-31-64-10 Distribution, ident: 2.1.2a1, repo rev: 
> v2.1.1-59-gdc049e4, Unreleased developer copy, 148)
> 
> 
> 3.0.x:
> 
> % srun  -n 2 ./hello_c
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [ip-172-31-64-100:72545] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [ip-172-31-64-100:72546] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> srun: error: ip-172-31-64-100: tasks 0-1: Exited with exit code 1
> 
> Don’t think it really matters, since v2.x probably wasn’t what the customer 
> wanted.
> 
> Brian
> 
>> On Jun 19, 2017, at 7:18 AM, Howard Pritchard <hpprit...@gmail.com 
>> <mailto:hpprit...@gmail.com>> wrote:
>> 
>> Hi Ralph
>> 
>> I think the alternative you mention below should suffice.
>> 
>> Howard
>> 
>> r...@open-mpi.org <mailto:r...@open-mpi.org> <r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>> schrieb am Mo. 19. Juni 2017 um 07:24:
>> So what you guys want is for me to detect that no opal/pmix framework 
>> components could run, detect that we are in a slurm job, and so print out an 
>> error message saying “hey dummy - you didn’t configure us with slurm pmi 
>> support”?
>> 
>> It means embedding slurm job detection code in the heart of ORTE (as opposed 
>> to in a component), which bothers me a bit.
>> 
>> As an alternative, what if I print out a generic “you didn’t configure us 
>> with pmi support for this environment” instead of the “pmix select failed” 
>> message? I can mention how to configure the support in a general way, but it 
>> avoids having to embed slurm detection into ORTE outside of a component.
>> 
>> > On Jun 16, 2017, at 8:39 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com 
>> > <mailto:jsquy...@cisco.com>> wrote:
>> >
>> > +1 on the error message.
>> >
>> >
>> >
>> >> On Jun 16, 2017, at 10:06 AM, Howard Pritchard <hpprit...@gmail.com 
>> >> <mailto:hpprit...@gmail.com>> wrote:
>> >>
>> >> Hi Ralph
>> >>
>> >> I think a helpful  error message would suffice.
>> >>
>> >> Howard
>> >>
>> >> r...@open-mpi.org <mai

[OMPI devel] PMIx Working Groups: Call for participants

2017-06-26 Thread r...@open-mpi.org
Hello all

There are two new PMIx working groups starting up to work on new APIs and 
attributes to support application/tool interactions with the system management 
stack in the following areas:

1. tiered storage support - prepositioning of files/binaries/libraries, 
directed hot/warm/cold storage strategies, file system aware scheduling 
algorithms, etc. See 
https://github.com/pmix/publications/blob/master/PMIx-TieredStorage.pdf 
, 
slides 15-21 for some ideas.

2. network support - we already have defined network APIs for launch support 
(see https://github.com/pmix/RFCs/blob/master/RFC0012.md 
). This working group will 
investigate additional definitions to support requests for obtaining 
information on fabric topology and status, traffic reports, registering for 
network-related events, requesting changes in QoS, etc.

The goal of these efforts is to deliver an RFC to the PMIx community that will 
be incorporated into the PMIx v3.0 version of the standard, expected out in 
Q42017. Implementations will be included in the corresponding releases of the 
PMIx convenience library and the PMIx reference server.

All are welcome. If you would like to participate in either of these, please 
drop me a note, or join the PMIx developer’s mailing list 
(https://groups.google.com/forum/#!forum/pmix 
) and indicate your interest 
there.

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] orterun busted

2017-06-23 Thread r...@open-mpi.org
Odd - I guess my machine is just consistently lucky, as was the CI’s when this 
went thru. The problem field is actually stale - we haven’t used it in years - 
so I simply removed it from orte_process_info.

https://github.com/open-mpi/ompi/pull/3741 


Should fix the problem.

> On Jun 23, 2017, at 3:38 AM, George Bosilca  wrote:
> 
> Ralph,
> 
> I got consistent segfaults during the infrastructure tearing down in the 
> orterun (I noticed them on a OSX). After digging a little bit it turns out 
> that the opal_buffet_t class has been cleaned-up in orte_finalize before 
> orte_proc_info_finalize is called, leading to calling the destructors into a 
> randomly initialized memory. If I change the order of the teardown to move 
> orte_proc_info_finalize before orte_finalize things work better, but I still 
> get a very annoying warning about a "Bad file descriptor in select".
> 
> Any better fix ?
> 
> George.
> 
> PS: Here is the patch I am currently using to get rid of the segfaults
> 
> diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c
> index 85aba0a0f3..506b931d35 100644
> --- a/orte/tools/orterun/orterun.c
> +++ b/orte/tools/orterun/orterun.c
> @@ -222,10 +222,10 @@ int orterun(int argc, char *argv[])
>   DONE:
>  /* cleanup and leave */
>  orte_submit_finalize();
> -orte_finalize();
> -orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
>  /* cleanup the process info */
>  orte_proc_info_finalize();
> +orte_finalize();
> +orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
> 
>  if (orte_debug_flag) {
>  fprintf(stderr, "exiting with status %d\n", orte_exit_status);
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Abstraction violation!

2017-06-22 Thread r...@open-mpi.org
Here’s something even weirder. You cannot build that file unless mpi.h already 
exists, which it won’t until you build the MPI layer. So apparently what is 
happening is that we somehow pickup a pre-existing version of mpi.h and use 
that to build the file?

Checking around, I find that all my available machines have an mpi.h somewhere 
in the default path because we always install _something_. I wonder if our 
master would fail in a distro that didn’t have an MPI installed...

> On Jun 22, 2017, at 5:02 PM, r...@open-mpi.org wrote:
> 
> It apparently did come in that way. We just never test -no-ompi and so it 
> wasn’t discovered until a downstream project tried to update. Then...boom.
> 
> 
>> On Jun 22, 2017, at 4:07 PM, Barrett, Brian via devel 
>> <devel@lists.open-mpi.org> wrote:
>> 
>> I’m confused; looking at history, there’s never been a time when 
>> opal/util/info.c hasn’t included mpi.h.  That seems odd, but so does info 
>> being in opal.
>> 
>> Brian
>> 
>>> On Jun 22, 2017, at 3:46 PM, r...@open-mpi.org wrote:
>>> 
>>> I don’t understand what someone was thinking, but you CANNOT #include 
>>> “mpi.h” in opal/util/info.c. It has broken pretty much every downstream 
>>> project.
>>> 
>>> Please fix this!
>>> Ralph
>>> 
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Abstraction violation!

2017-06-22 Thread r...@open-mpi.org
I don’t understand what someone was thinking, but you CANNOT #include “mpi.h” 
in opal/util/info.c. It has broken pretty much every downstream project.

Please fix this!
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-20 Thread r...@open-mpi.org
I updated orte-clean in master, and for v3.0, so it cleans up both current 
and legacy session directory files as well as any PMIx artifacts. I don’t see 
any files named OMPI_*.sm, though that might be something from v2.x? I don’t 
recall us ever making files of that name before - anything we make should be 
under the session directory, not directly in /tmp.

> On May 9, 2017, at 2:10 AM, Christoph Niethammer <nietham...@hlrs.de> wrote:
> 
> Hi,
> 
> I am using Open MPI 2.1.0. 
> 
> Best
> Christoph
> 
> - Original Message -
> From: "Ralph Castain" <r...@open-mpi.org>
> To: "Open MPI Developers" <devel@lists.open-mpi.org>
> Sent: Monday, May 8, 2017 6:28:42 PM
> Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O 
> files in /tmp
> 
> What version of OMPI are you using?
> 
>> On May 8, 2017, at 8:56 AM, Christoph Niethammer <nietham...@hlrs.de> wrote:
>> 
>> Hello
>> 
>> According to the manpage "...orte-clean attempts to clean up any processes 
>> and files left over from Open MPI jobs that were run in the past as well as 
>> any currently running jobs. This includes OMPI infrastructure and helper 
>> commands, any processes that were spawned as part of the job, and any 
>> temporary files...".
>> 
>> If I now have a program which calls MPI_File_open, MPI_File_write and 
>> MPI_Abort() in order, I get left over files /tmp/OMPI_*.sm.
>> Running orte-clean does not remove them.
>> 
>> Is this a bug or a feature?
>> 
>> Best
>> Christoph Niethammer
>> 
>> --
>> 
>> Christoph Niethammer
>> High Performance Computing Center Stuttgart (HLRS)
>> Nobelstrasse 19
>> 70569 Stuttgart
>> 
>> Tel: ++49(0)711-685-87203
>> email: nietham...@hlrs.de
>> http://www.hlrs.de/people/niethammer
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] SLURM 17.02 support

2017-06-19 Thread r...@open-mpi.org
So what you guys want is for me to detect that no opal/pmix framework 
components could run, detect that we are in a slurm job, and so print out an 
error message saying “hey dummy - you didn’t configure us with slurm pmi 
support”?

It means embedding slurm job detection code in the heart of ORTE (as opposed to 
in a component), which bothers me a bit.

As an alternative, what if I print out a generic “you didn’t configure us with 
pmi support for this environment” instead of the “pmix select failed” message? 
I can mention how to configure the support in a general way, but it avoids 
having to embed slurm detection into ORTE outside of a component.

> On Jun 16, 2017, at 8:39 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> +1 on the error message.
> 
> 
> 
>> On Jun 16, 2017, at 10:06 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>> 
>> Hi Ralph
>> 
>> I think a helpful  error message would suffice.
>> 
>> Howard
>> 
>> r...@open-mpi.org <r...@open-mpi.org> schrieb am Di. 13. Juni 2017 um 11:15:
>> Hey folks
>> 
>> Brian brought this up today on the call, so I spent a little time 
>> investigating. After installing SLURM 17.02 (with just --prefix as config 
>> args), I configured OMPI with just --prefix config args. Getting an 
>> allocation and then executing “srun ./hello” failed, as expected.
>> 
>> However, configuring OMPI --with-pmi= resolved the problem. 
>> SLURM continues to default to PMI-1, and so we pick that option up and use 
>> it. Everything works fine.
>> 
>> FWIW: I also went back and checked using SLURM 15.08 and got the identical 
>> behavior.
>> 
>> So the issue is: we don’t pick up PMI support by default, and never have due 
>> to the SLURM license issue. Thus, we have always required that the user 
>> explicitly configure --with-pmi so they take responsibility for the license. 
>> This is an acknowledged way of avoiding having GPL pull OMPI under its 
>> umbrella as it is the user, and not the OMPI community, that is making the 
>> link.
>> 
>> I’m not sure there is anything we can or should do about this, other than 
>> perhaps providing a nicer error message. Thoughts?
>> Ralph
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Coverity strangeness

2017-06-16 Thread r...@open-mpi.org
Good suggestion - mail sent. Will report back here.

> On Jun 15, 2017, at 10:24 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
> 
> You can email scan-ad...@coverity.com <mailto:scan-ad...@coverity.com> to 
> report bugs and/or ask what's going on.
> Brice
> 
> 
> 
> 
> On 16/06/2017 07:12, Gilles Gouaillardet wrote:
>> Ralph, 
>> 
>> 
>> my 0.02 US$ 
>> 
>> 
>> i noted the error message mentions 'holding lock 
>> "pmix_mutex_t.m_lock_pthread"', but it does not explicitly mentions 
>> 
>> 'pmix_global_lock' (!) 
>> 
>> at line 446, PMIX_WAIT_THREAD() does release 'cb.lock', which has the same 
>> type than 'pmix_global_lock', but is not the very same lock. 
>> 
>> so maybe coverity is being mislead by PMIX_WAIT_THREAD(), and hence the 
>> false positive 
>> 
>> 
>> if you have contacts at coverity, it would be interesting to report this 
>> false positive 
>> 
>> 
>> 
>> Cheers, 
>> 
>> 
>> Gilles 
>> 
>> 
>> On 6/16/2017 12:02 PM, r...@open-mpi.org <mailto:r...@open-mpi.org> wrote: 
>>> I’m trying to understand some recent coverity warnings, and I confess I’m a 
>>> little stumped - so I figured I’d ask out there and see if anyone has a 
>>> suggestion. This is in the PMIx repo, but it is reported as well in OMPI 
>>> (down in opal/mca/pmix/pmix2x/pmix). The warnings all take the following 
>>> form: 
>>> 
>>> 
>>>  
>>> *** CID 145810:  Concurrent data access violations  (MISSING_LOCK) 
>>> /src/client/pmix_client.c: 451 in PMIx_Init() 
>>> 445 /* wait for the data to return */ 
>>> 446 PMIX_WAIT_THREAD(); 
>>> 447 rc = cb.status; 
>>> 448 PMIX_DESTRUCT(); 
>>> 449 
>>> 450 if (PMIX_SUCCESS == rc) { 
>>>>>> CID 145810:  Concurrent data access violations  (MISSING_LOCK) 
>>>>>> Accessing "pmix_globals.init_cntr" without holding lock 
>>>>>> "pmix_mutex_t.m_lock_pthread". Elsewhere, "pmix_globals_t.init_cntr" is 
>>>>>> accessed with "pmix_mutex_t.m_lock_pthread" held 10 out of 11 times. 
>>> 451 pmix_globals.init_cntr++; 
>>> 452 } else { 
>>> 453 PMIX_RELEASE_THREAD(_global_lock); 
>>> 454 return rc; 
>>> 455 } 
>>> 456 PMIX_RELEASE_THREAD(_global_lock); 
>>> 
>>> Now the odd thing is that the lock is in fact being held - it gets released 
>>> 5 lines lower down. However, the lock was taken nearly 100 lines above this 
>>> point. 
>>> 
>>> I’m therefore inclined to think that the lock somehow “slid” outside of 
>>> Coverity’s analysis window and it therefore thought (erroneously) that the 
>>> lock isn’t being held. Has anyone else seen such behavior? 
>>> 
>>> Ralph 
>>> 
>>> ___ 
>>> devel mailing list 
>>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel> 
>> 
>> ___ 
>> devel mailing list 
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Coverity strangeness

2017-06-15 Thread r...@open-mpi.org
I’m trying to understand some recent coverity warnings, and I confess I’m a 
little stumped - so I figured I’d ask out there and see if anyone has a 
suggestion. This is in the PMIx repo, but it is reported as well in OMPI (down 
in opal/mca/pmix/pmix2x/pmix). The warnings all take the following form:


*** CID 145810:  Concurrent data access violations  (MISSING_LOCK)
/src/client/pmix_client.c: 451 in PMIx_Init()
445 /* wait for the data to return */
446 PMIX_WAIT_THREAD();
447 rc = cb.status;
448 PMIX_DESTRUCT();
449 
450 if (PMIX_SUCCESS == rc) {
>>>CID 145810:  Concurrent data access violations  (MISSING_LOCK)
>>>Accessing "pmix_globals.init_cntr" without holding lock 
>>> "pmix_mutex_t.m_lock_pthread". Elsewhere, "pmix_globals_t.init_cntr" is 
>>> accessed with "pmix_mutex_t.m_lock_pthread" held 10 out of 11 times.
451 pmix_globals.init_cntr++;
452 } else {
453 PMIX_RELEASE_THREAD(_global_lock);
454 return rc;
455 }
456 PMIX_RELEASE_THREAD(_global_lock);

Now the odd thing is that the lock is in fact being held - it gets released 5 
lines lower down. However, the lock was taken nearly 100 lines above this point.

I’m therefore inclined to think that the lock somehow “slid” outside of 
Coverity’s analysis window and it therefore thought (erroneously) that the lock 
isn’t being held. Has anyone else seen such behavior?

Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] SLURM 17.02 support

2017-06-13 Thread r...@open-mpi.org
Hey folks

Brian brought this up today on the call, so I spent a little time 
investigating. After installing SLURM 17.02 (with just --prefix as config 
args), I configured OMPI with just --prefix config args. Getting an allocation 
and then executing “srun ./hello” failed, as expected.

However, configuring OMPI --with-pmi= resolved the problem. 
SLURM continues to default to PMI-1, and so we pick that option up and use it. 
Everything works fine.

FWIW: I also went back and checked using SLURM 15.08 and got the identical 
behavior.

So the issue is: we don’t pick up PMI support by default, and never have due to 
the SLURM license issue. Thus, we have always required that the user explicitly 
configure --with-pmi so they take responsibility for the license. This is an 
acknowledged way of avoiding having GPL pull OMPI under its umbrella as it is 
the user, and not the OMPI community, that is making the link.

I’m not sure there is anything we can or should do about this, other than 
perhaps providing a nicer error message. Thoughts?
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] ompi_info "developer warning"

2017-06-05 Thread r...@open-mpi.org
Fine with me - I don’t care so long as we get rid of the annoying “warning”

> On Jun 5, 2017, at 6:51 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> I do care a little as the default size for most terminal is still 80 chars. I 
> would prefer your second choice where we replace "disabled" by "-" to  losing 
> information on the useful part of the message.
> 
> George.
>  
> 
> On Mon, Jun 5, 2017 at 9:45 AM, <gil...@rist.or.jp 
> <mailto:gil...@rist.or.jp>> wrote:
> George,
> 
>  
> it seems today the limit is more something like max 24 + max 56.
> 
> we can keep the 80 character limit (i have zero opinion on that) and move to
> 
> max 32 + max 48 for example.
> 
> an other option is to replace "(disabled) " with something more compact
> 
> "(-) " or even "- "
> 
>  
> Cheers,
> 
>  
> Gilles
> 
> ----- Original Message -
> 
> So we are finally getting rid of the 80 chars per line limit?
>  
>   George.
>  
>  
> 
> On Sun, Jun 4, 2017 at 11:23 PM, r...@open-mpi.org <mailto:r...@open-mpi.org> 
> <r...@open-mpi.org <mailto:r...@open-mpi.org>> wrote:
> Really? Sigh - frustrating. I’ll change it, as it gets irritating to keep getting 
> this warning.
> 
> Frankly, I find I’m constantly doing --all because otherwise I have no 
> earthly idea how to find what I’m looking for anymore...
> 
> 
> > On Jun 4, 2017, at 7:25 PM, Gilles Gouaillardet <gil...@rist.or.jp 
> > <mailto:gil...@rist.or.jp>> wrote:
> >
> > Ralph,
> >
> >
> > in your environment, pml/monitoring is disabled.
> >
> > so instead of displaying "MCA pml monitoring", ompi_info --all displays
> >
> > "MCA (disabled) pml monitoring" which is larger than 24 characters.
> >
> >
> > fwiw, you can observe the same behavior with
> >
> > OMPI_MCA_sharedfp=^lockedfile ompi_info --all
> >
> >
> > one option is to bump centerpoint (opal/runtime/opal_info_support.c) from 
> > 24 to something larger,
> > an other option is to mark disabled components with a shorter string, for 
> > example
> > "MCA (-) pml monitoring"
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On 6/3/2017 5:26 AM, r...@open-mpi.org <mailto:r...@open-mpi.org> wrote:
> >> I keep seeing this when I run ompi_info --all:
> >>
> >> **
> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
> >> *** will appear poorly in the prettyprint output.
> >> ***
> >> ***   Value:  "MCA (disabled) pml monitoring"
> >> ***   Max length: 24
> >> **
> >> **
> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
> >> *** will appear poorly in the prettyprint output.
> >> ***
> >> ***   Value:  "MCA (disabled) pml monitoring"
> >> ***   Max length: 24
> >> **
> >>
> >> Anyone know what this is about???
> >> Ralph
> >>
> >>
> >>
> >> ___
> >> devel mailing list
> >> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel  
> >> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel  
> > <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] ompi_info "developer warning"

2017-06-05 Thread r...@open-mpi.org
I added the change to https://github.com/open-mpi/ompi/pull/3651 
<https://github.com/open-mpi/ompi/pull/3651>. We’ll just have to hope that 
people intuitively understand that “-“ means “disabled”.

> On Jun 5, 2017, at 7:01 AM, r...@open-mpi.org wrote:
> 
> Fine with me - I don’t care so long as we get rid of the annoying “warning”
> 
>> On Jun 5, 2017, at 6:51 AM, George Bosilca <bosi...@icl.utk.edu 
>> <mailto:bosi...@icl.utk.edu>> wrote:
>> 
>> I do care a little as the default size for most terminal is still 80 chars. 
>> I would prefer your second choice where we replace "disabled" by "-" to  
>> losing information on the useful part of the message.
>> 
>> George.
>>  
>> 
>> On Mon, Jun 5, 2017 at 9:45 AM, <gil...@rist.or.jp 
>> <mailto:gil...@rist.or.jp>> wrote:
>> George,
>> 
>>  
>> it seems today the limit is more something like max 24 + max 56.
>> 
>> we can keep the 80 character limit (i have zero opinion on that) and move to
>> 
>> max 32 + max 48 for example.
>> 
>> an other option is to replace "(disabled) " with something more compact
>> 
>> "(-) " or even "- "
>> 
>>  
>> Cheers,
>> 
>>  
>> Gilles
>> 
>> - Original Message -
>> 
>> So we are finally getting rid of the 80 chars per line limit?
>>  
>>   George.
>>  
>>  
>> 
>> On Sun, Jun 4, 2017 at 11:23 PM, r...@open-mpi.org 
>> <mailto:r...@open-mpi.org> <r...@open-mpi.org <mailto:r...@open-mpi.org>> 
>> wrote:
>> Really? Sigh - frustrating. I’ll change it, as it gets irritating to keep getting 
>> this warning.
>> 
>> Frankly, I find I’m constantly doing --all because otherwise I have no 
>> earthly idea how to find what I’m looking for anymore...
>> 
>> 
>> > On Jun 4, 2017, at 7:25 PM, Gilles Gouaillardet <gil...@rist.or.jp 
>> > <mailto:gil...@rist.or.jp>> wrote:
>> >
>> > Ralph,
>> >
>> >
>> > in your environment, pml/monitoring is disabled.
>> >
>> > so instead of displaying "MCA pml monitoring", ompi_info --all displays
>> >
>> > "MCA (disabled) pml monitoring" which is larger than 24 characters.
>> >
>> >
>> > fwiw, you can observe the same behavior with
>> >
>> > OMPI_MCA_sharedfp=^lockedfile ompi_info --all
>> >
>> >
>> > one option is to bump centerpoint (opal/runtime/opal_info_support.c) from 
>> > 24 to something larger,
>> > an other option is to mark disabled components with a shorter string, for 
>> > example
>> > "MCA (-) pml monitoring"
>> >
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On 6/3/2017 5:26 AM, r...@open-mpi.org <mailto:r...@open-mpi.org> wrote:
>> >> I keep seeing this when I run ompi_info --all:
>> >>
>> >> **
>> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
>> >> *** will appear poorly in the prettyprint output.
>> >> ***
>> >> ***   Value:  "MCA (disabled) pml monitoring"
>> >> ***   Max length: 24
>> >> **
>> >> **
>> >> *** DEVELOPER WARNING: A field in ompi_info output is too long and
>> >> *** will appear poorly in the prettyprint output.
>> >> ***
>> >> ***   Value:  "MCA (disabled) pml monitoring"
>> >> ***   Max length: 24
>> >> **
>> >>
>> >> Anyone know what this is about???
>> >> Ralph
>> >>
>> >>
>> >>
>> >> ___
>> >> devel mailing list
>> >> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
>> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel  
>> >> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>> >
>> > ___
>> > devel mailing list
>> > devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
>> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel  
>> > <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] ompi_info "developer warning"

2017-06-04 Thread r...@open-mpi.org
Really? Sigh - frustrating. I’ll change it, as it gets irritating to keep getting 
this warning.

Frankly, I find I’m constantly doing --all because otherwise I have no earthly 
idea how to find what I’m looking for anymore...


> On Jun 4, 2017, at 7:25 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> 
> in your environment, pml/monitoring is disabled.
> 
> so instead of displaying "MCA pml monitoring", ompi_info --all displays
> 
> "MCA (disabled) pml monitoring" which is larger than 24 characters.
> 
> 
> fwiw, you can observe the same behavior with
> 
> OMPI_MCA_sharedfp=^lockedfile ompi_info --all
> 
> 
> one option is to bump centerpoint (opal/runtime/opal_info_support.c) from 24 
> to something larger,
> an other option is to mark disabled components with a shorter string, for 
> example
> "MCA (-) pml monitoring"
> 
> 
> Cheers,
> 
> Gilles
> 
> On 6/3/2017 5:26 AM, r...@open-mpi.org wrote:
>> I keep seeing this when I run ompi_info --all:
>> 
>> **
>> *** DEVELOPER WARNING: A field in ompi_info output is too long and
>> *** will appear poorly in the prettyprint output.
>> ***
>> ***   Value:  "MCA (disabled) pml monitoring"
>> ***   Max length: 24
>> **
>> **
>> *** DEVELOPER WARNING: A field in ompi_info output is too long and
>> *** will appear poorly in the prettyprint output.
>> ***
>> ***   Value:  "MCA (disabled) pml monitoring"
>> ***   Max length: 24
>> **
>> 
>> Anyone know what this is about???
>> Ralph
>> 
>> 
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] ompi_info "developer warning"

2017-06-02 Thread r...@open-mpi.org
I keep seeing this when I run ompi_info --all:

**
*** DEVELOPER WARNING: A field in ompi_info output is too long and
*** will appear poorly in the prettyprint output.
***
***   Value:  "MCA (disabled) pml monitoring"
***   Max length: 24
**
**
*** DEVELOPER WARNING: A field in ompi_info output is too long and
*** will appear poorly in the prettyprint output.
***
***   Value:  "MCA (disabled) pml monitoring"
***   Max length: 24
**

Anyone know what this is about???
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Master MTT results

2017-06-01 Thread r...@open-mpi.org
Hey folks

I scanned the nightly MTT results from last night on master, and the RTE looks 
pretty solid. However, there are a LOT of one-sided segfaults occurring, and I 
know that will eat up people’s disk space.

Just wanted to ensure folks were aware of the problem
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Time to remove Travis?

2017-06-01 Thread r...@open-mpi.org
I’d vote to remove it - it’s too unreliable anyway

> On Jun 1, 2017, at 6:30 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Is it time to remove Travis?
> 
> I believe that the Open MPI PRB now covers all the modern platforms that 
> Travis covers, and we have people actively maintaining all of the machines / 
> configurations being used for CI.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] mapper issue with heterogeneous topologies

2017-05-31 Thread r...@open-mpi.org
I don’t believe we check topologies prior to making that decision - this is why 
we provide map-by options. Seems to me that this oddball setup has a simple 
solution - all he has to do is set a mapping policy for that environment. Can 
even be done in the default mca param file.

I wouldn’t modify the code for these corner cases as it is just as likely to 
introduce errors
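
For the record, the workaround can be applied per job or made the default for that 
environment through the system-wide MCA parameter file; a sketch only, where the 
file location depends on the install prefix:

# Per-job:
mpirun --map-by socket -np 3 --host loki:2,exin hello_1_mpi

# Default for the installation (add to <prefix>/etc/openmpi-mca-params.conf):
rmaps_base_mapping_policy = socket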

> On May 31, 2017, at 5:46 PM, Gilles Gouaillardet  wrote:
> 
> Hi Ralph,
> 
> 
> this is a follow-up on Siegmar's post that started at 
> https://www.mail-archive.com/users@lists.open-mpi.org/msg31177.html
> 
> 
>> mpiexec -np 3 --host loki:2,exin hello_1_mpi
>> --
>> There are not enough slots available in the system to satisfy the 3 slots
>> that were requested by the application:
>>   hello_1_mpi
>> 
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --
> 
> 
> loki is a physical machine with 2 NUMA, 2 sockets, ...
> 
> *but* exin is a virtual machine with *no* NUMA, 2 sockets, ...
> 
> 
> my guess is that mpirun is able to find some NUMA objects on 'loki', so it 
> uses the default mapping policy
> 
> (aka --map-by numa). Unfortunately, exin has no NUMA objects, and mpirun fails 
> with an error message
> 
> that is hard to interpret.
> 
> 
> as a workaround, it is possible to
> 
> mpirun --map-by socket
> 
> 
> So if I understand and remember correctly, mpirun should make the decision to 
> map by numa *after* it receives the topology from exin and not before.
> 
> does that make sense ?
> 
> can you please take care of that ?
> 
> 
> fwiw, i ran
> 
> lstopo --of xml > /tmp/topo.xml
> 
> on two nodes, and manually remove the NUMANode and Bridge objects from the 
> topology of the second node, and then
> 
> mpirun --mca hwloc_base_topo_file /tmp/topo.xml --host n0:2,n1 -np 3 
> hostname
> 
> in order to reproduce the issue.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-31 Thread r...@open-mpi.org

> On May 31, 2017, at 7:48 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On May 30, 2017, at 11:37 PM, Barrett, Brian via devel 
>  wrote:
>> 
>> We have now created a v3.0.x branch based on today’s v3.x branch.  I’ve 
>> reset all outstanding v3.x PRs to the v3.0.x branch.  No one has permissions 
>> to pull into the v3.x branch, although I’ve left it in place for a couple of 
>> weeks so that people can slowly update their local git repositories.  
> 
> A thought on this point...
> 
> I'm kinda in favor of ripping off the band aid and deleting the 
> old/stale/now-unwritable v3.x branch in order to force everyone to update to 
> the new branch name ASAP.
> 
> Thoughts?

FWIW: Brian very kindly already re-pointed all the existing PRs to the new 
branch.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] PMIX busted

2017-05-31 Thread r...@open-mpi.org
Sorry for the hassle...

> On May 31, 2017, at 7:31 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> After removing all leftover files and redoing the autogen things went back to 
> normal. Sorry for the noise.
> 
>   George.
> 
> 
> 
> On Wed, May 31, 2017 at 10:06 AM, r...@open-mpi.org 
> <mailto:r...@open-mpi.org> <r...@open-mpi.org <mailto:r...@open-mpi.org>> 
> wrote:
> No - I just rebuilt it myself, and I don’t see any relevant MTT build 
> failures. Did you rerun autogen?
> 
> 
> > On May 31, 2017, at 7:02 AM, George Bosilca <bosi...@icl.utk.edu 
> > <mailto:bosi...@icl.utk.edu>> wrote:
> >
> > I have problems compiling the current master. Anyone else has similar 
> > issues ?
> >
> >   George.
> >
> >
> >   CC   base/ptl_base_frame.lo
> > In file included from 
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/thread_usage.h:31:0,
> >  from 
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/mutex.h:32,
> >  from 
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/threads.h:37,
> >  from 
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client_ops.h:18,
> >  from 
> > ../../../../../../../../../../opal/mca/pmix/pmix2x/pmix/src/mca/ptl/base/ptl_base_frame.c:45:
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:80:34:
> >  warning: "PMIX_C_GCC_INLINE_ASSEMBLY" is not defined [-Wundef]
> >  #define PMIX_GCC_INLINE_ASSEMBLY PMIX_C_GCC_INLINE_ASSEMBLY
> >   ^
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:115:6:
> >  note: in expansion of macro 'PMIX_GCC_INLINE_ASSEMBLY'
> >  #if !PMIX_GCC_INLINE_ASSEMBLY
> >   ^
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:153:7:
> >  warning: "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
> >  #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_SYNC
> >^
> > /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:155:7:
> >  warning: "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
> >  #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_GCC
> >^
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> > <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] PMIX busted

2017-05-31 Thread r...@open-mpi.org
No - I just rebuilt it myself, and I don’t see any relevant MTT build failures. 
Did you rerun autogen?


> On May 31, 2017, at 7:02 AM, George Bosilca  wrote:
> 
> I have problems compiling the current master. Anyone else has similar issues ?
> 
>   George.
> 
> 
>   CC   base/ptl_base_frame.lo
> In file included from 
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/thread_usage.h:31:0,
>  from 
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/mutex.h:32,
>  from 
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/threads/threads.h:37,
>  from 
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client_ops.h:18,
>  from 
> ../../../../../../../../../../opal/mca/pmix/pmix2x/pmix/src/mca/ptl/base/ptl_base_frame.c:45:
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:80:34:
>  warning: "PMIX_C_GCC_INLINE_ASSEMBLY" is not defined [-Wundef]
>  #define PMIX_GCC_INLINE_ASSEMBLY PMIX_C_GCC_INLINE_ASSEMBLY
>   ^
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:115:6:
>  note: in expansion of macro 'PMIX_GCC_INLINE_ASSEMBLY'
>  #if !PMIX_GCC_INLINE_ASSEMBLY
>   ^
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:153:7:
>  warning: "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
>  #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_SYNC
>^
> /Users/bosilca/unstable/ompi/trunk/ompi/opal/mca/pmix/pmix2x/pmix/src/atomics/sys/atomic.h:155:7:
>  warning: "PMIX_ASSEMBLY_BUILTIN" is not defined [-Wundef]
>  #elif PMIX_ASSEMBLY_BUILTIN == PMIX_BUILTIN_GCC
>^
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Please turn off MTT on v1.10

2017-05-30 Thread r...@open-mpi.org
The v1.10 series is closed and no new commits will be made to that branch. So 
please turn off any MTT runs you have scheduled for that branch - this will 
allow people to commit tests that will not run on the v1.10 series.

Thanks
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


[OMPI devel] Stale PRs

2017-05-26 Thread r...@open-mpi.org
Hey folks

We’re seeing a number of stale PRs hanging around again - these are PRs that 
were submitted against master (in some cases, months ago) that cleared CI and 
were never committed. Could people please take a look at their PRs and either 
commit them or delete them?

We are trying to get 3.0 out the door in the next few weeks. If it’s a bug you 
fixed on master, then NOW is the time to complete your work!
Ralph
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Updating the v1.10.7 tag

2017-05-19 Thread r...@open-mpi.org
I also tested with my multiple clones before committing the new tag, following 
the Git documentation. In no case did I encounter a problem.

I agree that someone force pushing tags will cause a problem, but (as has been 
noted multiple times now) we don’t allow that in our repo as it would -always- 
cause a problem, regardless of this kerfluffle. So this is a non-issue

I also agree that following the instructions will resolve any future issues. 
There are other ways of also getting there, but those are simple enough, and 
(per your other note) an “rm -rf” is the ultimate solution, albeit 
unnecessarily dramatic.

And I also agree with Brian that the documentation may not mirror what people 
see in practice. The problem is that Git is so atomistic that you can create as 
big a mudpie as you want - sadly, people don’t bother to read nor follow the 
docs, and so you can get weird behavior and attribute it to Git.

Anyway, enough - far more electrons burned on this than it is worth.


> On May 19, 2017, at 10:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> I tested everything I said in my email with a GitHub repo+fork and multiple 
> clones this morning.  Please feel free to test and correct me!  There seem to 
> be two possible problems:
> 
> 1. Propagating the wrong tag value.  Fortunately, GitHub saves us from 
> several cases where this can happen, unless someone force pushes tags.  Which 
> nobody should do.  Ever.  :-)
> 
> 2. Keeping the wrong tag value and analyzing history.  It's hopefully 
> unlikely that we have a "GitHub fork created when the bad tag was present" 
> situation.  It's also probably unlikely that people will need to look closely 
> at the v1.10 branch in detail in the future, since that release series is now 
> effectively done.
> 
> More specifically: hopefully everyone does the "git tag -d ..." instructions 
> and this becomes a moot point.
> 
> 
> 
>> On May 19, 2017, at 11:25 AM, r...@open-mpi.org wrote:
>> 
>> I would only point out that the panic tone of these statements appears 
>> unwarranted based on all available documentation. I’m not convinced this 
>> analysis is correct as it seems to contradict the documentation.
>> 
>> Nevertheless, there is certainly no harm in executing the recommended steps, 
>> and it is a good idea to do it.
>> 
>> 
>>> On May 19, 2017, at 8:03 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>> wrote:
>>> 
>>> On May 19, 2017, at 5:06 AM, r...@open-mpi.org wrote:
>>>> 
>>>> $ git tag -d v1.10.7
>>>> $ git pull   (or whatever your favorite update command is)
>>> 
>>> *
>>> *** Everybody needs to do this, regardless of whether you have checked out 
>>> the git tag or not ***
>>> *
>>> 
>>> SHORT VERSION
>>> =
>>> 
>>> - Ralph changed the v1.10.7 tag on the ompi GitHub repo to point to the 
>>> correct location.  It's done, don't bother saying, "you shouldn't have done 
>>> that!".  It's done.  Everyone **NEEDS** to update their local repos to get 
>>> the new/correct tag.
>>> 
>>> - Note that Git automatically fetches tags the *first* time they are seen; 
>>> it doesn't matter if you've checked out that tag or not.  So even if you 
>>> haven't checked out v1.10.7, you *NEEED* to do the above 
>>> procedure.
>>> 
>>> - Additionally, if you have propagated the "incorrect" tag elsewhere (e.g., 
>>> into other local repos, or your GitHub fork), you need to chase it down and 
>>> delete / re-fetch the tags there, too.  Do it now.
>>> 
>>> MORE DETAIL
>>> ===
>>> 
>>> By default, git will fetch any new tag that it sees upstream.  You don't 
>>> have to check out that tag -- just a "git fetch" will pull down any new 
>>> tags that it sees upstream.
>>> 
>>> If the tag changes upstream, but you already have that tag, git won't fetch 
>>> the new/changed upstream tag.  Hence, you can *think* you have the right 
>>> tag value, but you really don't.  It's kinda a form of silent data 
>>> corruption.  Hence, the "git tag -d ..." instructions above -- it deletes 
>>> your local tag and then you do another fetch, so it re-obtains the tag from 
>>> upstream.
>>> 
>>> The danger is if anyone pushes tags to our repos.  

Re: [OMPI devel] Updating the v1.10.7 tag

2017-05-19 Thread r...@open-mpi.org
I would only point out that the panic tone of these statements appears 
unwarranted based on all available documentation. I’m not convinced this 
analysis is correct as it seems to contradict the documentation.

Nevertheless, there is certainly no harm in executing the recommended steps, 
and it is a good idea to do it.


> On May 19, 2017, at 8:03 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> On May 19, 2017, at 5:06 AM, r...@open-mpi.org wrote:
>> 
>> $ git tag -d v1.10.7
>> $ git pull   (or whatever your favorite update command is)
> 
> *
> *** Everybody needs to do this, regardless of whether you have checked out 
> the git tag or not ***
> *
> 
> SHORT VERSION
> =
> 
> - Ralph changed the v1.10.7 tag on the ompi GitHub repo to point to the 
> correct location.  It's done, don't bother saying, "you shouldn't have done 
> that!".  It's done.  Everyone **NEEDS** to update their local repos to get 
> the new/correct tag.
> 
> - Note that Git automatically fetches tags the *first* time they are seen; it 
> doesn't matter if you've checked out that tag or not.  So even if you haven't 
> checked out v1.10.7, you *NEEED* to do the above procedure.
> 
> - Additionally, if you have propagated the "incorrect" tag elsewhere (e.g., 
> into other local repos, or your GitHub fork), you need to chase it down and 
> delete / re-fetch the tags there, too.  Do it now.
> 
> MORE DETAIL
> ===
> 
> By default, git will fetch any new tag that it sees upstream.  You don't have 
> to check out that tag -- just a "git fetch" will pull down any new tags that 
> it sees upstream.
> 
> If the tag changes upstream, but you already have that tag, git won't fetch 
> the new/changed upstream tag.  Hence, you can *think* you have the right tag 
> value, but you really don't.  It's kinda a form of silent data corruption.  
> Hence, the "git tag -d ..." instructions above -- it deletes your local tag 
> and then you do another fetch, so it re-obtains the tag from upstream.
> 
> The danger is if anyone pushes tags to our repos.  If the pusher has the 
> *old* tag, they could/will re-push the old tag.  Fortunately, GitHub seems 
> to disallow overwriting tags by default -- if you have the tag FOO value X 
> and try to "git push --tags" when there is already a tag FOO with value Y 
> upstream, it'll abort.  But GitHub does allow "git push --tags --force", 
> which will overwrite the upstream FOO with X.  This is a danger.
> 
> Note that this doesn't apply just to release managers with access to the 
> release branches -- since we allow direct pushing to master, any of us can 
> "git push --tags" (and/or --force).
> 
> Meaning: Git tags are just *another* reason not to --force push to the ompi 
> repo.  Don't ever, Ever, EVER --force push anything to the public main ompi 
> repo.  
> 
> A secondary, lesser danger is that most people don't update tags in their 
> forks.  If they get the old/wrong tag in their fork, it'll likely never be 
> updated.  The wrong tag existed for about a week or so, so hopefully no one 
> created a fork in that time (and therefore has the wrong tag).  But forks 
> with wrong tags aren't usually a *problem* (because who looks at tags in 
> forks?), but it is weird that a fork has one value of the tag and the main 
> repo has a different one.
> 
> I think the main fear from all of this is the silent, unintentional 
> propagation of the old / incorrect tag -- in 2 years, when Future Bob is 
> looking back at the git history to try to figure out some tangled issue, will 
> they have the right tag?  Will Future Bob have confidence in the git history 
> data?  ...etc.
> 
> Meaning: everyone go do the "git tag -d ..." procedure.  Stop reading; go do 
> it now.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Updating the v1.10.7 tag

2017-05-19 Thread r...@open-mpi.org
Hi folks

I apparently inadvertently tagged the wrong hash the other night when tagging 
v1.10.7. I have corrected it, but if you updated your clone _and_ checked out 
the v1.10.7 tag in the interim, you might need to manually delete the tag on 
your clone and re-pull.

It’s trivial to do:

$ git tag -d v1.10.7
$ git pull   (or whatever your favorite update command is)

Again, you may not need to do anything (I didn’t, but one person did - however, 
they had manually checked out the tag before it was fixed).

Sorry for the inconvenience.
Ralph
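
A quick way to confirm the refreshed tag, and to clean up any fork that may have 
picked up the old one; the remote name below is a placeholder:

# Refresh the local tag and check which commit it now points at:
git tag -d v1.10.7
git fetch origin --tags
git rev-parse v1.10.7^{commit}

# If the old tag was pushed to a fork, delete it there and push the corrected one:
git push <your-fork> :refs/tags/v1.10.7
git push <your-fork> v1.10.7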

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Combining Binaries for Launch

2017-05-15 Thread r...@open-mpi.org
So long as both binaries use the same OMPI version, I can’t see why there would 
be an issue. It sounds like you are thinking of running an MPI process on the 
GPU itself (instead of using an offload library)? People have done that before 
- IIRC, the only issue is trying to launch a process onto the GPU when the GPU 
doesn’t have a globally visible TCP address. I wrote a backend spawn capability 
to resolve that problem and it should still work, though I am not aware of it 
being exercised recently.

> On May 15, 2017, at 8:02 AM, Kumar, Amit  wrote:
> 
> Dear Open MPI,
>  
> I would like to gain a better understanding for running two different 
> binaries on two different types of nodes(GPU nodes and Non GPUnodes) as a 
> single job.
>  
> I have run two different binaries with mpirun command and that works fine for 
> us. But My question is: if I have a binary-1 that uses Intel MKL, and is 
> compiled with (OpenMPI-wrapped-around-gcc-compiler), and then another 
> binary-2 that uses Intel MKL and compiled with OpenMPI-wrapped-around-gcc, 
> should they have any MPI communication or launch issues? What ABI 
> compatibilities should I be aware of when launching tasks that need to 
> communicate over Open MPI? Or does this question have no relevance?
>  
> Thank you,
> Amit
>  
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> 
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Socket buffer sizes

2017-05-15 Thread r...@open-mpi.org
Thanks - already done, as you say
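
For anyone stuck on an older release that still forces the buffer sizes, the workaround Håkon describes below can be applied straight from the command line - a sketch, using the MCA parameters named in his report (the host names and benchmark binary are just placeholders):

$ mpirun --mca btl tcp,self --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 \
      -np 2 -H node1,node2 ./osu_bibw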

> On May 15, 2017, at 7:32 AM, Håkon Bugge  wrote:
> 
> Dear Open MPIers,
> 
> 
> Automatic tuning of socket buffers has been in the linux kernel since 
> 2.4.17/2.6.7. That is some time ago. I remember, at the time, that we removed 
> the default setsockopt() for SO_SNDBUF and SO_RCVBUF in Scali MPI.
> 
> Today, running Open MPI 1.10.2 using the TCP BTL, on a 10Gbit/s Ethernet, I 
> get:
> 
> # OSU MPI Bi-Directional Bandwidth Test
> # Size Bi-Bandwidth (MB/s)
> 1 1.72
> 2 4.58
> 4 9.06
> 8 18.43
> 16   35.68
> 32   68.47
> 64  135.20
> 128 259.30
> 256 450.59
> 512 703.55
> 1024    935.58
> 2048   1020.04
> 4096   1191.23
> 8192   1192.13
> 16384  1155.97
> 32768  1181.74
> 
> and by strace I see that Open MPI sets said buffer sizes:
> 
> setsockopt(12, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
> setsockopt(12, SOL_SOCKET, SO_RCVBUF, [131072], 4) = 0
> 
> Now, by adding “—mca btl_tcp_rcvbuf 0 —mca btl_tcp_sndbuf 0” to the above 
> command line, I get:
> 
> # OSU MPI Bi-Directional Bandwidth Test
> # Size Bi-Bandwidth (MB/s)
> 1 1.60
> 2 4.56
> 4 9.03
> 8 11.66
> 16   35.54
> 32   68.36
> 64  133.70
> 128 247.69
> 256 466.75
> 512 885.40
> 1024   1557.51
> 2048   2115.40
> 4096   2226.65
> 8192   2288.82
> 16384  2318.11
> 32768  2334.19
> 
> (and strace shows no setsockopt for SO_{RCV,SND}BUF)
> 
> Roughly, the performance doubles.
> 
> Just a humble suggestion to remove the setting of the socket buffer sizes if 
> not already done in newer versions.
> 
> 
> Thxs, Håkon
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Quick help with OMPI_COMM_WORLD_LOCAL_RANK

2017-05-12 Thread r...@open-mpi.org
If you configure with --enable-debug, then you can set the following mca params 
on your cmd line:

--mca plm_base_verbose 5  will show you the details of the launch
--mca odls_base_verbose 5 will show you the details of the fork/exec
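
A complete invocation might look something like this (a sketch only - substitute your own hostfile and binary from the command you posted):

$ mpirun --mca plm_base_verbose 5 --mca odls_base_verbose 5 \
      -np 4 -hostfile ./host_cpu ./binary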


> On May 12, 2017, at 10:30 AM, Kumar, Amit  wrote:
> 
>  
> >>>That’s a pretty ancient release, but a quick glance at the source code 
> >>>indicates that you should always see it when launched via mpirun, and 
> >>>never when launched via srun
>  
> Thank you for your response “rhc”. I will look more into the launch scripts 
> and see if I messed up in spelling it.  I always wondered if there is 
> MPI_DEBUG flag that I can define to some value and get more insight during 
> the launch process?
>  
> Thank you,
> Amit
>  
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> 
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Quick help with OMPI_COMM_WORLD_LOCAL_RANK

2017-05-12 Thread r...@open-mpi.org
That’s a pretty ancient release, but a quick glance at the source code 
indicates that you should always see it when launched via mpirun, and never 
when launched via srun
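
A quick way to see exactly what mpirun is exporting, without involving your application at all (a minimal sketch):

$ mpirun -np 2 sh -c 'env | grep ^OMPI_COMM_WORLD'

If OMPI_COMM_WORLD_LOCAL_RANK shows up there but not in your job, something in the launch path is stripping the environment.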


> On May 12, 2017, at 9:22 AM, Kumar, Amit  wrote:
> 
> Dear OpenMPI,
>  
> Under what circumstances I would find that OMPI_COMM_WORLD_LOCAL_RANK is not 
> set? For some reason our install of openmpi-1.6.5 with SLURM 16.05.08 with 
> PMI support is not setting OMPI_COMM_WORLD_LOCAL_RANK.
>  
> I need openmpi-1.6.5 because I have some NVIDIA binaries that are only 
> available at that version.
>  
> I have tried using: mpirun --mca btl self,openib -np 4 -hostfile ./host_cpu 
> binary : -np 2 -hostfile ./host_gpu binary
>  
> Also I have tried running srun -l -n 6 --multi-prog ./myrun.conf
>  
> In both cases I find that OMPI_COMM_WORLD_LOCAL_RANK is NOT SET.
>  
> Any help with this will be a great great help for us….
>  
> Thank you,
> Amit
> ___
> devel mailing list
> devel@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> 
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] OMPI v1.10.7rc1 ready for evaluation

2017-05-12 Thread r...@open-mpi.org
Hi folks

We want/need to release a final version of the 1.10 series that will contain 
all remaining cleanups. Please take a gander at it.

https://www.open-mpi.org/software/ompi/v1.10/ 


Changes:

1.10.7
--
- Fix bug in TCP BTL that impacted performance on 10GbE networks
  by not adjusting the TCP send/recv buffer sizes and using system
  default values
- Add missing MPI_AINT_ADD and MPI_AINT_DIFF functions
- Fix a bug in the OMPI internal timer code that affected MPI_Wtime
  and caused performance to be cpu freq dependent
- Improve performance of the MPI_Comm_create algorithm
- Fix platform detection on FreeBSD
- Fix a bug in the handling of MPI_TYPE_CREATE_DARRAY in MPI_(R)(GET_)ACCUMULATE
- Fix openib memory registration limit calculation
- Add missing MPI_T_PVAR_SESSION_NULL in mpi.h
- Fix "make distcheck" when using external hwloc and/or libevent packages
- Add latest ConnectX-5 vendor part id to OpenIB device params
- Fix race condition in the UCX PML
- Fix signal handling for rsh launcher
- Miscellaneous Fortran fixes
- Fix typo bugs in wrapper compiler script

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-05-08 Thread r...@open-mpi.org
What version of OMPI are you using?

> On May 8, 2017, at 8:56 AM, Christoph Niethammer  wrote:
> 
> Hello
> 
> According to the manpage "...orte-clean attempts to clean up any processes 
> and files left over from Open MPI jobs that were run in the past as well as 
> any currently running jobs. This includes OMPI infrastructure and helper 
> commands, any processes that were spawned as part of the job, and any 
> temporary files...".
> 
> If I now have a program which calls MPI_File_open, MPI_File_write and 
> MPI_Abort() in order, I get left over files /tmp/OMPI_*.sm.
> Running orte-clean does not remove them.
> 
> Is this a bug or a feature?
> 
> Best
> Christoph Niethammer
> 
> --
> 
> Christoph Niethammer
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstrasse 19
> 70569 Stuttgart
> 
> Tel: ++49(0)711-685-87203
> email: nietham...@hlrs.de
> http://www.hlrs.de/people/niethammer
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Open MPI 3.x branch naming

2017-05-05 Thread r...@open-mpi.org
+1 Go for it :-)

> On May 5, 2017, at 2:34 PM, Barrett, Brian via devel wrote:
> 
> To be clear, we’d do the move all at once on Saturday morning.  Things that 
> would change:
> 
> 1) nightly tarballs would rename from openmpi-v3.x--.tar.gz 
> to openmpi-v3.0.x--.tar.gz
> 2) nightly tarballs would build from v3.0.x, not v3.x branch
> 3) PRs would need to be filed against v3.0.x
> 4) Both https://www.open-mpi.org/nightly/v3.x/ 
>  and 
> https://www.open-mpi.org/nightly/v3.0.x/ 
>  would work for searching for new 
> nightly tarballs
> 
> At some point in the future (say, two weeks), (4) would change, and only 
> https://www.open-mpi.org/nightly/v3.0.x/ 
>  would work.  Otherwise, we need to 
> have a coordinated name switch, which seems way harder than it needs to be.  
> MTT, for example, requires a configured directory for nightlies, but as long 
> as the latest_tarball.txt is formatted correctly, everything else works fine.
> 
> Brian
> 
>> On May 5, 2017, at 2:26 PM, Paul Hargrove wrote:
>> 
>> As a maintainer of non-MTT scripts that need to know the layout of the 
>> directories containing nighty and RC tarball, I also think that all the 
>> changes should be done soon (and all together, not spread over months).
>> 
>> -Paul
>> 
>> On Fri, May 5, 2017 at 2:16 PM, George Bosilca wrote:
>> If we rebranch from master for every "major" release it makes sense to 
>> rename the branch. In the long term renaming seems like the way to go, and 
>> thus the pain of altering everything that depends on the naming will exist 
>> at some point. I'am in favor of doing it asap (but I have no stakes in the 
>> game as UTK does not have an MTT).
>> 
>>   George.
>> 
>> 
>> 
>> On Fri, May 5, 2017 at 1:53 PM, Barrett, Brian via devel wrote:
>> Hi everyone -
>> 
>> We’ve been having discussions among the release managers about the choice of 
>> naming the branch for Open MPI 3.0.0 as v3.x (as opposed to v3.0.x).  
>> Because the current plan is that each “major” release (in the sense of the 
>> three release points from master per year, not necessarily in increasing the 
>> major number of the release number) is to rebranch off of master, there’s a 
>> feeling that we should have named the branch v3.0.x, and then named the next 
>> one 3.1.x, and so on.  If that’s the case, we should consider renaming the 
>> branch and all the things that depend on the branch (web site, which Jeff 
>> has already half-done; MTT testing; etc.).  The disadvantage is that 
>> renaming will require everyone who’s configured MTT to update their test 
>> configs.
>> 
>> The first question is should we rename the branch?  While there would be 
>> some ugly, there’s nothing that really breaks long term if we don’t.  Jeff 
>> has stronger feelings than I have here.
>> 
>> If we are going to rename the branch from v3.x to v3.0.x, my proposal would 
>> be that we do it next Saturday evening (May 13th).  I’d create a new branch 
>> from the current state of v3.x and then delete the old branch.  We’d try to 
>> push all the PRs Friday so that there were no outstanding PRs that would 
>> have to be reopened.  We’d then bug everyone to update their nightly testing 
>> to pull from a different URL and update their MTT configs.  After a week or 
>> two, we’d stop having tarballs available at both v3.x and v3.0.x on the Open 
>> MPI web page.
>> 
>> Thoughts?
>> 
>> Brian
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org 
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>> 
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org 
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
>> 
>> 
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov 
>> 
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org 
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org

Re: [OMPI devel] v3 branch - Problem with LSF

2017-05-05 Thread r...@open-mpi.org
I would suggest not bringing it over in isolation - we planned to do an update 
that contains a lot of related changes, including the PMIx update. Probably 
need to do that pretty soon given the June target.


> On May 5, 2017, at 3:04 PM, Vallee, Geoffroy R.  wrote:
> 
> Hi,
> 
> I am running some tests on a PPC platform that is using LSF and I see the 
> following problem every time I launch a job that runs on 2 nodes or more:
> 
> [crest1:49998] *** Process received signal ***
> [crest1:49998] Signal: Segmentation fault (11)
> [crest1:49998] Signal code: Address not mapped (1)
> [crest1:49998] Failing at address: 0x10061636d2d
> [crest1:49998] [ 0] [0x10050478]
> [crest1:49998] [ 1] 
> /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(+0x0)[0x109c]
> [crest1:49998] [ 2] 
> /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so(straddr_isIPv4+0x44)[0x10e31b64]
> [crest1:49998] [ 3] 
> /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_array2LIST+0x114)[0x10be79b4]
> [crest1:49998] [ 4] 
> /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_constructList+0xfc)[0x10becdbc]
> [crest1:49998] [ 5] 
> /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_launch+0x184)[0x10bed9c4]
> [crest1:49998] [ 6] 
> /ccs/home/gvh/install/crest/ompi3_llvm/lib/openmpi/mca_plm_lsf.so(+0x2660)[0x10992660]
> [crest1:49998] [ 7] 
> /ccs/home/gvh/install/crest/ompi3_llvm/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x940)[0x101f7730]
> [crest1:49998] [ 8] 
> /ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x100013e4]
> [crest1:49998] [ 9] 
> /ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x1f10]
> [crest1:49998] [10] /lib64/power8/libc.so.6(+0x24580)[0x104f4580]
> [crest1:49998] [11] 
> /lib64/power8/libc.so.6(__libc_start_main+0xc4)[0x104f4774]
> [crest1:49998] *** End of error message ***
> 
> I do not experience that problem with master and the only difference about 
> the LSF support between master and the v3 branch is:
> 
> https://github.com/open-mpi/ompi/commit/92c996487c589ef8558a087ce2a9923dacdf0b99
>  
> 
> 
> If I can confirm that this change fixes the problem with the v3 branch, would 
> you guys accept to bring it into the v3 branch?
> 
> Thanks,
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] remote spawn - have no children

2017-05-03 Thread r...@open-mpi.org
Everything operates via the state machine - events trigger moving the job from 
one state to the next, with each state being tied to a callback function that 
implements that state. If you set state_base_verbose=5, you’ll see when and 
where each state gets executed.
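
For example, something along these lines should print each state transition as it fires (a sketch - any small test program will do):

$ mpirun --mca state_base_verbose 5 --mca plm_base_verbose 5 -np 1 ./mpi_hello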

By default, the launch_app state goes to a function in the plm/base:

https://github.com/open-mpi/ompi/blob/master/orte/mca/plm/base/plm_base_launch_support.c#L477
 

I suspect the problem is that your plm component isn’t activating the next step 
upon completion of launch_daemons.


> On May 3, 2017, at 8:15 AM, Justin Cinkelj <justin.cink...@xlab.si> wrote:
> 
> So "remote spawn" and children refer to orted daemons only, and I was looking 
> into wrong modules.
> 
> Which module(s) are then responsible to send command to orted to start mpi 
> application?
> Which event names should I search for?
> 
> Thank you,
> Justin
> 
> - Original Message -
>> From: r...@open-mpi.org
>> To: "OpenMPI Devel" <devel@lists.open-mpi.org>
>> Sent: Wednesday, May 3, 2017 3:29:16 PM
>> Subject: Re: [OMPI devel] remote spawn - have no children
>> 
>> I should have looked more closely as you already have the routed verbose
>> output there. Everything in fact looks correct. The node with mpirun has 1
>> child, which is the daemon on the other node. The vpid=1 daemon on node 250
>> doesn’t have any children as there aren’t any more daemons in the system.
>> 
>> Note that the output has nothing to do with spawning your mpi_hello - it is
>> solely describing the startup of the daemons.
>> 
>> 
>>> On May 3, 2017, at 6:26 AM, r...@open-mpi.org wrote:
>>> 
>>> The orte routed framework does that for you - there is an API for that
>>> purpose.
>>> 
>>> 
>>>> On May 3, 2017, at 12:17 AM, Justin Cinkelj <justin.cink...@xlab.si>
>>>> wrote:
>>>> 
>>>> Important detail first: I get this message from significantly modified
>>>> Open MPI code, so problem exists solely due to my mistake.
>>>> 
>>>> Orterun on 192.168.122.90 starts orted on remote node 192.168.122.91, then
>>>> orted figures out it has nothing to do.
>>>> If I request to start workers on the same 192.168.122.90 IP, the mpi_hello
>>>> is started.
>>>> 
>>>> Partial log:
>>>> /usr/bin/mpirun -np 1 ... mpi_hello
>>>> #
>>>> [osv:00252] [[50738,0],0] plm:base:setup_job
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm creating map
>>>> [osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
>>>> [osv:00252] [[50738,0],0] using dash_host
>>>> [osv:00252] [[50738,0],0] checking node 192.168.122.91
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon
>>>> [[50738,0],1] to node 192.168.122.91
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial 0 found child 1
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial find children of rank 0
>>>> [osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
>>>> [osv:00252] [[50738,0],0] routed:binomial find children computing tree
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial find children returning found
>>>> value 0
>>>> [osv:00252] [[50738,0],0]: parent 0 num_children 1
>>>> [osv:00252] [[50738,0],0]:  child 1
>>>> [osv:00252] [[50738,0],0] plm:osvrest: launching vm
>>>> #
>>>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
>>>> [osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
>>>> [osv:00250] [[50738,0],1] routed:binomial find children of rank 0
>>>> [osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
>>>> [osv:00250] [[50738,0],1] routed:binomial find children computing tree
>>>> [osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
>>>> [osv:00250] [[50738,0],1] routed:binomial find children returning found
>>>> value 0

Re: [OMPI devel] remote spawn - have no children

2017-05-03 Thread r...@open-mpi.org
I should have looked more closely as you already have the routed verbose output 
there. Everything in fact looks correct. The node with mpirun has 1 child, 
which is the daemon on the other node. The vpid=1 daemon on node 250 doesn’t 
have any children as there aren’t any more daemons in the system.

Note that the output has nothing to do with spawning your mpi_hello - it is 
solely describing the startup of the daemons.


> On May 3, 2017, at 6:26 AM, r...@open-mpi.org wrote:
> 
> The orte routed framework does that for you - there is an API for that 
> purpose.
> 
> 
>> On May 3, 2017, at 12:17 AM, Justin Cinkelj <justin.cink...@xlab.si> wrote:
>> 
>> Important detail first: I get this message from significantly modified Open 
>> MPI code, so problem exists solely due to my mistake.
>> 
>> Orterun on 192.168.122.90 starts orted on remote node 192.168.122.91, then 
>> orted figures out it has nothing to do.
>> If I request to start workers on the same 192.168.122.90 IP, the mpi_hello 
>> is started.
>> 
>> Partial log:
>> /usr/bin/mpirun -np 1 ... mpi_hello
>> #
>> [osv:00252] [[50738,0],0] plm:base:setup_job
>> [osv:00252] [[50738,0],0] plm:base:setup_vm
>> [osv:00252] [[50738,0],0] plm:base:setup_vm creating map
>> [osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
>> [osv:00252] [[50738,0],0] using dash_host
>> [osv:00252] [[50738,0],0] checking node 192.168.122.91
>> [osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
>> [osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon 
>> [[50738,0],1] to node 192.168.122.91
>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
>> [osv:00252] [[50738,0],0] routed:binomial 0 found child 1
>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
>> [osv:00252] [[50738,0],0] routed:binomial find children of rank 0
>> [osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
>> [osv:00252] [[50738,0],0] routed:binomial find children computing tree
>> [osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
>> [osv:00252] [[50738,0],0] routed:binomial find children returning found 
>> value 0
>> [osv:00252] [[50738,0],0]: parent 0 num_children 1
>> [osv:00252] [[50738,0],0]:  child 1
>> [osv:00252] [[50738,0],0] plm:osvrest: launching vm
>> #
>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
>> [osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
>> [osv:00250] [[50738,0],1] routed:binomial find children of rank 0
>> [osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
>> [osv:00250] [[50738,0],1] routed:binomial find children computing tree
>> [osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
>> [osv:00250] [[50738,0],1] routed:binomial find children returning found 
>> value 0
>> [osv:00250] [[50738,0],1]: parent 0 num_children 0
>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn - have no children!
>> 
>> In the plm mca module remote_spawn() function (my plm is based on 
>> orte/mca/plm/rsh/), the coll.targets list has zero length. My question is, 
>> which module(s) are responsible for filling in the coll.targets? Then I will 
>> turn on the correct mca xzy_base_verbose level, and hopefully narrow down my 
>> problem. I have quite a problem guessing/finding out what various xyz 
>> strings mean :)
>> 
>> Thank you, Justin
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] remote spawn - have no children

2017-05-03 Thread r...@open-mpi.org
The orte routed framework does that for you - there is an API for that purpose.


> On May 3, 2017, at 12:17 AM, Justin Cinkelj  wrote:
> 
> Important detail first: I get this message from significantly modified Open 
> MPI code, so problem exists solely due to my mistake.
> 
> Orterun on 192.168.122.90 starts orted on remote node 192.168.122.91, then 
> orted figures out it has nothing to do.
> If I request to start workers on the same 192.168.122.90 IP, the mpi_hello is 
> started.
> 
> Partial log:
> /usr/bin/mpirun -np 1 ... mpi_hello
> #
> [osv:00252] [[50738,0],0] plm:base:setup_job
> [osv:00252] [[50738,0],0] plm:base:setup_vm
> [osv:00252] [[50738,0],0] plm:base:setup_vm creating map
> [osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
> [osv:00252] [[50738,0],0] using dash_host
> [osv:00252] [[50738,0],0] checking node 192.168.122.91
> [osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
> [osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon 
> [[50738,0],1] to node 192.168.122.91
> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
> [osv:00252] [[50738,0],0] routed:binomial 0 found child 1
> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
> [osv:00252] [[50738,0],0] routed:binomial find children of rank 0
> [osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
> [osv:00252] [[50738,0],0] routed:binomial find children computing tree
> [osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
> [osv:00252] [[50738,0],0] routed:binomial find children returning found value 0
> [osv:00252] [[50738,0],0]: parent 0 num_children 1
> [osv:00252] [[50738,0],0]:  child 1
> [osv:00252] [[50738,0],0] plm:osvrest: launching vm
> #
> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
> [osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
> [osv:00250] [[50738,0],1] routed:binomial find children of rank 0
> [osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
> [osv:00250] [[50738,0],1] routed:binomial find children computing tree
> [osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
> [osv:00250] [[50738,0],1] routed:binomial find children returning found value 0
> [osv:00250] [[50738,0],1]: parent 0 num_children 0
> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn - have no children!
> 
> In the plm mca module remote_spawn() function (my plm is based on 
> orte/mca/plm/rsh/), the coll.targets list has zero length. My question is, 
> which module(s) are responsible for filling in the coll.targets? Then I will 
> turn on the correct mca xzy_base_verbose level, and hopefully narrow down my 
> problem. I have quite a problem guessing/finding out what various xyz strings 
> mean :)
> 
> Thank you, Justin
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] openib oob module

2017-04-21 Thread r...@open-mpi.org
I’m not familiar with the openib code, but this looks to me like it may be 
caused by a change in the openib code itself. Have you looked to see what the 
diff might be between the two versions?
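
If it helps, one quick way to eyeball the difference is to diff the openib tree between the two release tarballs (a sketch - in both series the BTL still lives under ompi/mca/btl):

$ diff -ru openmpi-1.6.5/ompi/mca/btl/openib openmpi-1.10.2/ompi/mca/btl/openib | less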

> On Apr 21, 2017, at 6:45 AM, Shiqing Fan <shiqing@huawei.com> wrote:
> 
> I've tried this out, and got the same problem as I sent before. 
> 
> With the same configuration and command line, 1.6.5 works for me, but the 1.10 
> series does not.
> 
> Could it also be an IB configuration issue? (ib_write/read_bw/lat work fine 
> across the two nodes)
> 
> Error output below:
> 
> [[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to: 
> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status 
> number 12 for wr_id 2318d80 opcode 32767  vendor error 129 qp_idx 0
> 
> --
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>The total number of times that the sender wishes the receiver to
>retry timeout, packet sequence, etc. errors before posting a
>completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 20).  The actual timeout value used is calculated as:
> 
> 4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> 
> Below is some information about the host that raised the error and the
> peer to which it was connected:
> 
>  Local host:   host1
>  Local device: mlx4_0
>  Peer host:192.168.2.22
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --
> 
> -Original Message-
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles 
> Gouaillardet
> Sent: Friday, April 21, 2017 9:41 AM
> To: devel@lists.open-mpi.org
> Subject: Re: [OMPI devel] openib oob module
> 
> Folks,
> 
> 
> fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works for me 
> on a mlx4 cluster (Mellanox QDR)
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 4/21/2017 1:31 AM, r...@open-mpi.org wrote:
>> I’m not seeing any problem inside the OOB - the problem appears to be 
>> in the info being given to it:
>> 
>> [host1:16244] 1 more process has sent help message 
>> help-mpi-btl-openib.txt / default subnet prefix
>> [host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>> all help / error messages
>> [[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to: 
>> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR 
>> status number 12 for wr_id 112db80 opcode 32767  vendor error 129 qp_idx 0
>> 
>> I’ve been searching, and I don’t see that help message anywhere in 
>> your output - not sure what happened to it. I do see this in your 
>> output - don’t know what it means:
>> 
>> [host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb] 
>> !!!!!
>> 
>> 
>>> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing@huawei.com 
>>> <mailto:shiqing@huawei.com>> wrote:
>>> 
>>> Forgot to enable oob verbose in my last test. Here is the updated 
>>> output file.
>>> Thanks,
>>> Shiqing
>>> *From:*devel [mailto:devel-boun...@lists.open-mpi.org]*On Behalf 
>>> Of*r...@open-mpi.org <mailto:r...@open-mpi.org>
>>> *Sent:*Thursday, April 20, 2017 4:29 PM
>>> *To:*OpenMPI Devel
>>> *Subject:*Re: [OMPI devel] openib oob module
>>> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI. 
>>> Should be able to restore it. I honestly don’t recall the bug, though :-(
>>> If you want to try reviving it, you can add some debug in there (plus 
>>> turn on the OOB verbosity) and I’m happy to help you figure it out.
>>> Ralph
>>> 
>>>On Apr 20, 2017, at 7:13 AM, Shiqing Fan <

Re: [OMPI devel] openib oob module

2017-04-20 Thread r...@open-mpi.org
I’m not seeing any problem inside the OOB - the problem appears to be in the 
info being given to it:

[host1:16244] 1 more process has sent help message help-mpi-btl-openib.txt / 
default subnet prefix
[host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help 
/ error messages
[[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to: 
192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status number 
12 for wr_id 112db80 opcode 32767  vendor error 129 qp_idx 0

I’ve been searching, and I don’t see that help message anywhere in your output 
- not sure what happened to it. I do see this in your output - don’t know what 
it means:

[host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb] 
!


> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing@huawei.com> wrote:
> 
> Forgot to enable oob verbose in my last test. Here is the updated output file.
>  
> Thanks,
> Shiqing
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
> r...@open-mpi.org
> Sent: Thursday, April 20, 2017 4:29 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] openib oob module
>  
> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI. Should be 
> able to restore it. I honestly don’t recall the bug, though :-(
>  
> If you want to try reviving it, you can add some debug in there (plus turn on 
> the OOB verbosity) and I’m happy to help you figure it out.
> Ralph
>  
> On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing@huawei.com 
> <mailto:shiqing@huawei.com>> wrote:
>  
> Hi Ralph,
>  
> Yes, it’s been a long time. Hope you all are doing well (I believe so :) ).
>  
> I’m working on a virtualization project, and need to run Open MPI on a 
> unikernel OS (most of OFED is missing/unsupported).
>  
> Actually I’m only focusing on 1.10.2, which still has oob in ompi. Probably 
> it might be possible to make oob work there? Or even for the 1.10 branch (as 
> Gilles mentioned)?
> Do you have any clue about the bug in oob back then?
>  
> Regards,
> Shiqing
>  
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org 
> <mailto:devel-boun...@lists.open-mpi.org>] On Behalf Of r...@open-mpi.org 
> <mailto:r...@open-mpi.org>
> Sent: Thursday, April 20, 2017 3:49 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] openib oob module
>  
> Hi Shiqing!
>  
> Been a long time - hope you are doing well.
>  
> I see no way to bring the oob module back now that the BTLs are in the OPAL 
> layer - this is why it was removed as the oob is in ORTE, and thus not 
> accessible from OPAL.
> Ralph
>  
> On Apr 20, 2017, at 6:02 AM, Shiqing Fan <shiqing@huawei.com 
> <mailto:shiqing@huawei.com>> wrote:
>  
> Dear all,
>  
> I noticed that the openib oob module was removed a long time ago, 
> because it wasn’t working anymore and nobody seemed to need it.
> But for some special operating system, where the rdmacm, udcm or ibcm kernel 
> support is missing, oob may still be necessary.
>  
> I’m curious if it’s possible to bring this module back? How difficult would 
> it be to fix the bug in order to make it work again in 1.10 branch or later? 
> Thanks a lot.
>  
> Best Regards,
> Shiqing
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] openib oob module

2017-04-20 Thread r...@open-mpi.org
Yeah, I forgot that the 1.10 series still had the BTLs in OMPI. Should be able 
to restore it. I honestly don’t recall the bug, though :-(

If you want to try reviving it, you can add some debug in there (plus turn on 
the OOB verbosity) and I’m happy to help you figure it out.
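
Once the oob cpc is restored, something along these lines is roughly what I'd start with (a sketch - parameter names as I remember them from the 1.10 series, and ./a.out is just any test binary):

$ mpirun --mca btl openib,self --mca btl_openib_cpc_include oob \
      --mca oob_base_verbose 10 --mca btl_base_verbose 30 -np 2 ./a.out
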
Ralph

> On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing@huawei.com> wrote:
> 
> Hi Ralph,
>  
> Yes, it’s been a long time. Hope you all are doing well (I believe so :) ).
>  
> I’m working on a virtualization project, and need to run Open MPI on a 
> unikernel OS (most of OFED is missing/unsupported).
>  
> Actually I’m only focusing on 1.10.2, which still has oob in ompi. Probably 
> it might be possible to make oob work there? Or even for the 1.10 branch (as 
> Gilles mentioned)?
> Do you have any clue about the bug in oob back then?
>  
> Regards,
> Shiqing
>  
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
> r...@open-mpi.org
> Sent: Thursday, April 20, 2017 3:49 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] openib oob module
>  
> Hi Shiqing!
>  
> Been a long time - hope you are doing well.
>  
> I see no way to bring the oob module back now that the BTLs are in the OPAL 
> layer - this is why it was removed as the oob is in ORTE, and thus not 
> accessible from OPAL.
> Ralph
>  
> On Apr 20, 2017, at 6:02 AM, Shiqing Fan <shiqing@huawei.com 
> <mailto:shiqing@huawei.com>> wrote:
>  
> Dear all,
>  
> I noticed that the openib oob module was removed a long time ago, 
> because it wasn’t working anymore and nobody seemed to need it.
> But for some special operating system, where the rdmacm, udcm or ibcm kernel 
> support is missing, oob may still be necessary.
>  
> I’m curious if it’s possible to bring this module back? How difficult would 
> it be to fix the bug in order to make it work again in 1.10 branch or later? 
> Thanks a lot.
>  
> Best Regards,
> Shiqing
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] openib oob module

2017-04-20 Thread r...@open-mpi.org
Hi Shiqing!

Been a long time - hope you are doing well.

I see no way to bring the oob module back now that the BTLs are in the OPAL 
layer - this is why it was removed as the oob is in ORTE, and thus not 
accessible from OPAL.
Ralph

> On Apr 20, 2017, at 6:02 AM, Shiqing Fan  wrote:
> 
> Dear all,
>  
> I noticed that the openib oob module was removed a long time ago, 
> because it wasn’t working anymore and nobody seemed to need it.
> But for some special operating system, where the rdmacm, udcm or ibcm kernel 
> support is missing, oob may still be necessary.
>  
> I’m curious if it’s possible to bring this module back? How difficult would 
> it be to fix the bug in order to make it work again in 1.10 branch or later? 
> Thanks a lot.
>  
> Best Regards,
> Shiqing
> ___
> devel mailing list
> devel@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel 
> 
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Program which runs wih 1.8.3, fails with 2.0.2

2017-04-19 Thread r...@open-mpi.org
Fully expected - if ORTE can’t start one or more daemons, then the MPI job 
itself will never be executed.

There was an SGE integration issue in the 2.0 series - I fixed it, but IIRC it 
didn’t quite make the 2.0.2 release. In fact, I just checked and it did indeed 
miss that release.

You have three choices:

1. you could apply the patch to the 2.0.2 source code yourself - it is at 
https://github.com/open-mpi/ompi/pull/3162 


2. download a copy of the latest nightly 2.0.3 tarball - hasn’t been officially 
released yet, but includes the patch

3. upgrade to the nightly 2.1.1 tarball - expected to be officially released 
soon and also includes the patch

Hopefully, one of those options will fix the problem
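
For option 1, one way to do it (a sketch - it assumes GitHub's auto-generated .patch URL for the PR and that the patch applies cleanly to the 2.0.2 tarball; it may need a little fuzz):

$ curl -LO https://github.com/open-mpi/ompi/pull/3162.patch
$ cd openmpi-2.0.2
$ patch -p1 < ../3162.patch
$ ./configure ... && make install
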
Ralph


> On Apr 19, 2017, at 4:57 PM, Kevin Buckley wrote:
> 
> On 19 April 2017 at 18:35, Kevin Buckley wrote:
> 
>> If I compile against 2.0.2 the same command works at the command line
>> but not in the "SGE" job submission, where I see a complaint about
>> 
>> =
>> Host key verification failed.
>> --
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>  blah, blah, blah ...
>> =
> 
> Just to add that if I add in some basic debugging
> 
> --mca btl_base_verbose 30
> 
> then when running at the command line, I get a swathe of info
> from the MCA, however within the SGE environment, I still only
> get the "ORTE was unable .." message ?
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
