[OMPI devel] OpenMPI 1.10.0: Arch Linux PkgSrc build suggests configure script patch

2015-09-21 Thread Kevin Buckley
Watcha,

we recently updated the OpenMPI installation on our School's Arch Linux
machines, where OpenMPI is built as a PkgSrc package, to 1.10.0.

While running through the build, we were told that PkgSrc wasn't too keen on
the use of == within a single "if test" construct, so I needed to apply
the following patch:

--- configure.orig  2015-08-24 23:33:14.0 +
+++ configure
@@ -60570,8 +60570,8 @@ _ACEOF
 $as_echo "$MPI_OFFSET_DATATYPE" >&6; }


-if test "$ompi_fortran_happy" == "1" && \
-   test "$OMPI_WANT_FORTRAN_BINDINGS" == "1"; then
+if test "$ompi_fortran_happy" = "1" && \
+   test "$OMPI_WANT_FORTRAN_BINDINGS" = "1"; then

 # Get the kind value for Fortran MPI_INTEGER_KIND (corresponding
 # to whatever is the same size as a F77 INTEGER -- for the


I seem to recall that this is "good practice", and indeed I can see that
other "if test" stanzas in the configure script have already been fixed to match,
so perhaps this one has just slipped through the net and/or not been
reported by anyone else yet.
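
For the record, == inside "test" is a bash/ksh extension, whereas a single =
is what POSIX specifies, so the patched form also works under stricter /bin/sh
implementations. A small standalone sketch of the difference (only the
operator matters; the variables are just the ones from the hunk above):

#!/bin/sh
# POSIX-conformant string comparison inside test: a single '='.
if test "$ompi_fortran_happy" = "1" && \
   test "$OMPI_WANT_FORTRAN_BINDINGS" = "1"; then
    echo "Fortran bindings are wanted and the Fortran compiler is happy"
fi

# By contrast, 'test "$ompi_fortran_happy" == "1"' relies on a bash/ksh
# extension; a strict POSIX shell such as dash may reject it with an
# "unexpected operator" error.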

--
Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand


Re: [OMPI devel] OpenMPI 1.10.0: Arch Linux PkgSrc build suggests configure script patch

2015-09-21 Thread Ralph Castain
Yeah, we tried to catch all of those, but obviously we must have missed this one.
I’ll add it to the mix.

Thanks!
Ralph

> On Sep 20, 2015, at 9:01 PM, Kevin Buckley wrote:
> 
> Watcha,
> 
> we recently updated the OpenMPI installation on our School's Arch Linux
> machines, where OpenMPI is built as a PkgSrc package, to 1.10.0.
> 
> While running through the build, we were told that PkgSrc wasn't too keen on
> the use of == within a single "if test" construct, so I needed to apply
> the following patch:
> 
> --- configure.orig  2015-08-24 23:33:14.0 +
> +++ configure
> @@ -60570,8 +60570,8 @@ _ACEOF
> $as_echo "$MPI_OFFSET_DATATYPE" >&6; }
> 
> 
> -if test "$ompi_fortran_happy" == "1" && \
> -   test "$OMPI_WANT_FORTRAN_BINDINGS" == "1"; then
> +if test "$ompi_fortran_happy" = "1" && \
> +   test "$OMPI_WANT_FORTRAN_BINDINGS" = "1"; then
> 
> # Get the kind value for Fortran MPI_INTEGER_KIND (corresponding
> # to whatever is the same size as a F77 INTEGER -- for the
> 
> 
> I seem to recall that this is "good practice", and indeed I can see that
> other "if test" stanzas in the configure script have already been fixed to match,
> so perhaps this one has just slipped through the net and/or not been
> reported by anyone else yet.
> 
> --
> Kevin M. Buckley
> 
> eScience Consultant
> School of Engineering and Computer Science
> Victoria University of Wellington
> New Zealand
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/18090.php



Re: [OMPI devel] PMIX vs Solaris

2015-09-21 Thread Ralph Castain
?? Just so this old fossilized brain gets this right: you are saying there was 
a 64s pause before the hwloc error appeared, and then another 64s pause after 
the second server_setup_fork message appeared?

If that’s true, then I’m chasing the wrong problem - it sounds like something 
is messed up in the mpirun startup. Did you have more than one node in the 
allocation by chance? I’m wondering if we are getting held up by something in 
the daemon launch/callback area.



> On Sep 20, 2015, at 4:08 PM, Paul Hargrove  wrote:
> 
> Ralph,
> 
> Still failing with that patch, but with the addition of a fairly long pause 
> (64s) before the first error message appears, and again after the second 
> "server setup_fork" (64s again)
> 
> New output is attached.
> 
> -Paul
> 
> On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain wrote:
> Argh - found a typo in the output line. Could you please try the attached 
> patch and do it again? This might fix it, but if not it will provide me with 
> some idea of the returned error.
> 
> Thanks
> Ralph
> 
> 
>> On Sep 20, 2015, at 12:40 PM, Paul Hargrove wrote:
>> 
>> Yes, it is definitely at 10.
>> Another attempt is attached.
>> -Paul
>> 
>> On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain wrote:
>> Paul - can you please confirm that you gave mpirun a level of 10 for the 
>> pmix_base_verbose param? This output isn’t what I would have expected from 
>> that level - it looks more like the verbosity was set to 5, and so the error 
>> number isn’t printed.
>> 
>> Thanks
>> Ralph
>> 
>> 
>>> On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>> 
>>> Paul,
>>> 
>>> I do not remember it like that ...
>>> 
>>> At that time, the issue in ompi was that the global errno was used instead
>>> of the per-thread errno.
>>> Though the man pages tell us -mt should be used for multithreaded apps, you
>>> tried -D_REENTRANT on all your platforms, and it was enough to get the
>>> expected result.
>>> 
>>> I just wanted to check that the pmix1xx (sub)configure did correctly pass the
>>> -D_REENTRANT flag, and it does, so this is very likely a new and unrelated
>>> error.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On Sunday, September 20, 2015, Paul Hargrove wrote:
>>> Gilles,
>>> 
>>> Yes every $CC invocation in opal/mca/pmix/pmix1xx includes "-D_REENTRANT".
>>> However, they don't include "-mt".
>>> I believe we concluded (when we had problems previously) that "-mt" was the
>>> proper flag (at compile and link) for multi-threaded builds with the Studio
>>> compilers.
>>> 
>>> -Paul
>>> 
>>> On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet wrote:
>>> Paul,
>>> 
>>> Can you please double check pmix1xx is compiled with -D_REENTRANT ?
>>> We ran into similar issues in the past, and they only occurred with Solaris 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> On Sunday, September 20, 2015, Paul Hargrove wrote:
>>> Ralph,
>>> The output from the requested run is attached.
>>> -Paul
>>> 
>>> On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain wrote:
>>> Ah, okay - that makes more sense. I’ll have to let Brice see if he can 
>>> figure out how to silence the hwloc error message as I can’t find where it 
>>> came from. The other errors are real and are the reason why the job was 
>>> terminated.
>>> 
>>> The problem is that we are trying to establish a communication between the 
>>> app and the daemon via unix domain socket, and we failed to do so. The 
>>> error tells me that we were able to create and connect to the socket, but 
>>> failed when the daemon tried to do a blocking send to the app.
>>> 
>>> Can you rerun it with -mca pmix_base_verbose 10? It will tell us the value 
>>> of the error number that was returned
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
 On Sep 19, 2015, at 9:37 PM, Paul Hargrove wrote:
 
 Ralph,
 
 No it did not run.
 The complete output (which I really should have included in the first 
 place) is below.
 
 -Paul
 
 $ mpirun -mca btl sm,self -np 2 examples/ring_c
 Error opening /devices/pci@0,0:reg: Permission denied
 [pcp-d-3:26054] PMIX ERROR: ERROR in file 
 /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
  at line 181
 [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file 
 /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
  at line 463
 --
 It looks like MPI_INIT failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during MPI_INIT; some of which are due to configuration or environment
 pro
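
For reference, the verbose rerun requested above might look something like
this (a sketch only: the command and the pmix_base_verbose parameter are the
ones quoted in the thread, and the log file name is arbitrary):

# Same failing invocation, with PMIx verbosity raised to 10 so the error
# number from the failed blocking send gets printed; capture both streams.
mpirun -mca pmix_base_verbose 10 -mca btl sm,self -np 2 \
    examples/ring_c 2>&1 | tee pmix-verbose.log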

Re: [OMPI devel] PMIX vs Solaris

2015-09-21 Thread Paul Hargrove
Ralph,

Just as you say:
The first 64s pause was before the hwloc error message appeared.
The second was after the second server_setup_fork message appeared, and before
whatever line came after that.

I don't know if stdio buffering may be "distorting" the placement of the
pause relative to the lines of output.
However, prior to your patch the entire failed mpirun took around 1s.
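
One way I could pin down where the 64s gaps fall, independent of stdio
buffering, would be to timestamp each output line as it arrives; a rough
sketch (it only records when each line reaches the pipe, so it can't undo
any buffering inside mpirun itself, but it should make the long gaps visible):

mpirun -mca btl sm,self -np 2 examples/ring_c 2>&1 |
  while IFS= read -r line; do
    # Prefix every line with a wall-clock timestamp before echoing it.
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
  done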

No allocation.
No resource manager.
Just a single workstation.

-Paul

On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain  wrote:

> ?? Just so this old fossilized brain gets this right: you are saying there
> was a 64s pause before the hwloc error appeared, and then another 64s pause
> after the second server_setup_fork message appeared?
>
> If that’s true, then I’m chasing the wrong problem - it sounds like
> something is messed up in the mpirun startup. Did you have more than one
> node in the allocation by chance? I’m wondering if we are getting held up
> by something in the daemon launch/callback area.
>
>
>
> On Sep 20, 2015, at 4:08 PM, Paul Hargrove  wrote:
>
> Ralph,
>
> Still failing with that patch, but with the addition of a fairly long
> pause (64s) before the first error message appears, and again after the
> second "server setup_fork" (64s again)
>
> New output is attached.
>
> -Paul
>
> On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain  wrote:
>
>> Argh - found a typo in the output line. Could you please try the attached
>> patch and do it again? This might fix it, but if not it will provide me
>> with some idea of the returned error.
>>
>> Thanks
>> Ralph
>>
>>
>> On Sep 20, 2015, at 12:40 PM, Paul Hargrove  wrote:
>>
>> Yes, it is definitely at 10.
>> Another attempt is attached.
>> -Paul
>>
>> On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain  wrote:
>>
>>> Paul - can you please confirm that you gave mpirun a level of 10 for the
>>> pmix_base_verbose param? This output isn’t what I would have expected from
>>> that level - it looks more like the verbosity was set to 5, and so the
>>> error number isn’t printed.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Paul,
>>>
>>> I do not remember it like that ...
>>>
>>> At that time, the issue in ompi was that the global errno was used
>>> instead of the per-thread errno.
>>> Though the man pages tell us -mt should be used for multithreaded apps,
>>> you tried -D_REENTRANT on all your platforms, and it was enough to get the
>>> expected result.
>>>
>>> I just wanted to check that the pmix1xx (sub)configure did correctly pass the
>>> -D_REENTRANT flag, and it does, so this is very likely a new and unrelated
>>> error.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Sunday, September 20, 2015, Paul Hargrove  wrote:
>>>
 Gilles,

 Yes every $CC invocation in opal/mca/pmix/pmix1xx includes
 "-D_REENTRANT".
 However, they don't include "-mt".
 I believe we concluded (when we had problems previously) that "-mt" was
 the proper flag (at compile and link) for multi-threaded builds with the Studio
 compilers.

 -Paul

 On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet <
 gilles.gouaillar...@gmail.com> wrote:

> Paul,
>
> Can you please double check pmix1xx is compiled with -D_REENTRANT ?
> We ran into similar issues in the past, and they only occurred with
> Solaris
>
> Cheers,
>
> Gilles
>
>
> On Sunday, September 20, 2015, Paul Hargrove 
> wrote:
>
>> Ralph,
>> The output from the requested run is attached.
>> -Paul
>>
>> On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain 
>> wrote:
>>
>>> Ah, okay - that makes more sense. I’ll have to let Brice see if he
>>> can figure out how to silence the hwloc error message as I can’t find 
>>> where
>>> it came from. The other errors are real and are the reason why the job 
>>> was
>>> terminated.
>>>
>>> The problem is that we are trying to establish a communication
>>> between the app and the daemon via unix domain socket, and we failed to 
>>> do
>>> so. The error tells me that we were able to create and connect to the
>>> socket, but failed when the daemon tried to do a blocking send to the 
>>> app.
>>>
>>> Can you rerun it with -mca pmix_base_verbose 10? It will tell us the
>>> value of the error number that was returned
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Sep 19, 2015, at 9:37 PM, Paul Hargrove 
>>> wrote:
>>>
>>> Ralph,
>>>
>>> No it did not run.
>>> The complete output (which I really should have included in the
>>> first place) is below.
>>>
>>> -Paul
>>>
>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>>> Error opening /devices/pci@0,0:reg: Permission denied
>>> [pcp-d-3:26054] PMIX ERROR: ERROR in file
>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e