FYI:

Things look fine today with last night's master tarball.

I hope Brice has a way to eliminate the hwloc warning, since I am sure I am
not the only one with scripts that will notice "Error" in the output.

-Paul
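
Until the message is silenced at the source, a script that scans output for "Error" could skip this one known line. The following is a rough sketch, assuming the message always has the form quoted later in this thread ("Error opening /devices/pci@...: Permission denied"). The function name is hypothetical and the matching rule is an assumption, not a documented hwloc format.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper for a log-checking tool: returns true when `line`
 * contains the literal text "Error" for a reason other than the hwloc
 * message quoted in this thread. The match is case-sensitive. */
bool is_real_error_line(const char *line)
{
    /* Assumed shape of the benign hwloc message (taken from the thread). */
    static const char benign_prefix[] = "Error opening /devices/pci@";
    static const char benign_reason[] = ": Permission denied";

    if (strstr(line, "Error") == NULL) {
        return false;   /* nothing a script would flag */
    }
    if (strncmp(line, benign_prefix, sizeof(benign_prefix) - 1) == 0 &&
        strstr(line, benign_reason) != NULL) {
        return false;   /* the known hwloc line: ignore it */
    }
    return true;        /* any other "Error" line is still reported */
}
```

A checker would call is_real_error_line() on each line of mpirun output and fail only when it returns true.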

On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Aha! Thanks - just what the doctor ordered!
>
>
> On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
> Ralph,
>
> the root cause is
> getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...)
> fails with errno ENOPROTOOPT on Solaris 11.2
>
> the attached patch is a proof of concept and works for me:
> /* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
>
> Cheers,
>
> Gilles
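
A minimal sketch of the approach Gilles describes, not the attached patch itself: query SO_RCVTIMEO first, and if getsockopt() fails with ENOPROTOOPT, skip both setting and restoring the timeout and fall through to a plain blocking recv(). The helper name recv_with_optional_timeout and its timeout_sec parameter are hypothetical and do not come from pmix_client.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper (not the real PMIx code): receive with a temporary
 * timeout where the platform supports SO_RCVTIMEO on this socket. */
ssize_t recv_with_optional_timeout(int sd, void *buf, size_t len,
                                   int timeout_sec)
{
    struct timeval saved, tv;
    socklen_t optlen = sizeof(saved);
    bool have_timeo = true;

    /* Save the current timeout so it can be restored afterwards. */
    if (getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &saved, &optlen) != 0) {
        if (errno != ENOPROTOOPT) {
            return -1;          /* unexpected failure: report it */
        }
        have_timeo = false;     /* unsupported: do not set or restore */
    }

    if (have_timeo) {
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
            return -1;
        }
    }

    ssize_t n = recv(sd, buf, len, 0);
    int recv_errno = errno;     /* keep recv()'s errno across the restore */

    if (have_timeo) {
        (void) setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &saved, sizeof(saved));
    }
    errno = recv_errno;
    return n;
}
```

On a platform where the option is unavailable, the call simply blocks until data arrives or the peer closes the socket.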
>
> On 9/21/2015 2:16 PM, Paul Hargrove wrote:
>
> Ralph,
>
> Just as you say:
> The first 64s pause was before the hwloc error message appeared.
> The second was after the second server_setup_fork appears, and before
> whatever line came after that.
>
> I don't know if stdio buffering may be "distorting" the placement of the
> pause relative to the lines of output.
> However, prior to your patch the entire failed mpirun was around 1s.
>
> No allocation.
> No resource manager.
> Just a single workstation.
>
> -Paul
>
> On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> ?? Just so this old fossilized brain gets this right: you are saying
>> there was a 64s pause before the hwloc error appeared, and then another 64s
>> pause after the second server_setup_fork message appeared?
>>
>> If that’s true, then I’m chasing the wrong problem - it sounds like
>> something is messed up in the mpirun startup. Did you have more than one
>> node in the allocation by chance? I’m wondering if we are getting held up
>> by something in the daemon launch/callback area.
>>
>>
>>
>> On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> Still failing with that patch, but with the addition of a fairly long
>> pause (64s) before the first error message appears, and again after the
>> second "server setup_fork" (64s again)
>>
>> New output is attached.
>>
>> -Paul
>>
>> On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Argh - found a typo in the output line. Could you please try the
>>> attached patch and do it again? This might fix it, but if not it will
>>> provide me with some idea of the returned error.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> Yes, it is definitely at 10.
>>> Another attempt is attached.
>>> -Paul
>>>
>>> On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Paul - can you please confirm that you gave mpirun a level of 10 for
>>>> the pmix_base_verbose param? This output isn’t what I would have expected
>>>> from that level - it looks more like the verbosity was set to 5, and so the
>>>> error number isn’t printed.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>> On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>> Paul,
>>>>
>>>> I do not remember it like that ...
>>>>
>>>> at that time, the issue in ompi was that the global errno was used
>>>> instead of the per-thread errno.
>>>> though the man page says -mt should be used for multithreaded apps,
>>>> you tried -D_REENTRANT on all your platforms, and it was enough to get the
>>>> expected result.
>>>>
>>>> I just wanted to check that the pmix1xx (sub)configure correctly passes
>>>> the -D_REENTRANT flag, and it does. So this is very likely a new and
>>>> unrelated error.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
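
To illustrate the distinction Gilles draws between a global errno and a per-thread errno, here is a small sketch using POSIX threads. It records the address of errno in a second thread and compares it with the address seen by the caller; with a thread-safe errno the two differ. The function names are hypothetical, and which compiler flag selects the per-thread variant with the Studio compilers (-mt or -D_REENTRANT) is taken from this thread, not verified here.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Worker: store the address of this thread's errno as an integer. */
static void *record_errno_address(void *out)
{
    *(uintptr_t *) out = (uintptr_t) &errno;
    return NULL;
}

/* Hypothetical check: returns 1 if a second thread sees a different errno
 * object than the caller (per-thread errno), 0 if both share one global
 * errno, and -1 if the thread could not be created or joined. */
int errno_is_per_thread(void)
{
    pthread_t tid;
    uintptr_t worker_addr = 0;

    if (pthread_create(&tid, NULL, record_errno_address, &worker_addr) != 0) {
        return -1;
    }
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    return worker_addr != (uintptr_t) &errno;
}
```

If this returned 0 in a multithreaded build, an error code set by a system call in one thread could be overwritten by another thread before it is read, which is the failure mode described above.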
>>>>
>>>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>>> Gilles,
>>>>>
>>>>> Yes every $CC invocation in opal/mca/pmix/pmix1xx includes
>>>>> "-D_REENTRANT".
>>>>> However, they don't include "-mt".
>>>>> I believe we concluded (when we had problems previously) that "-mt"
>>>>> was the proper flag (at compile and link) for multi-threaded with the
>>>>> Studio compilers.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet <
>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>>> Paul,
>>>>>>
>>>>>> Can you please double check pmix1xx is compiled with -D_REENTRANT?
>>>>>> We ran into similar issues in the past, and they only occurred with
>>>>>> Solaris.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>> The output from the requested run is attached.
>>>>>>> -Paul
>>>>>>>
>>>>>>> On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ah, okay - that makes more sense. I’ll have to let Brice see if he
>>>>>>>> can figure out how to silence the hwloc error message as I can’t find
>>>>>>>> where it came from. The other errors are real and are the reason why
>>>>>>>> the job was terminated.
>>>>>>>>
>>>>>>>> The problem is that we are trying to establish communication
>>>>>>>> between the app and the daemon via a Unix domain socket, and we failed
>>>>>>>> to do so. The error tells me that we were able to create and connect
>>>>>>>> to the socket, but failed when the daemon tried to do a blocking send
>>>>>>>> to the app.
>>>>>>>>
>>>>>>>> Can you rerun it with -mca pmix_base_verbose 10? It will tell us
>>>>>>>> the value of the error number that was returned
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Ralph,
>>>>>>>>
>>>>>>>> No it did not run.
>>>>>>>> The complete output (which I really should have included in the
>>>>>>>> first place) is below.
>>>>>>>>
>>>>>>>> -Paul
>>>>>>>>
>>>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>>>> [pcp-d-3:26054] PMIX ERROR: ERROR in file
>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>>>> at line 181
>>>>>>>> [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file
>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>> at line 463
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>>>> developer):
>>>>>>>>
>>>>>>>>   ompi_mpi_init: ompi_rte_init failed
>>>>>>>>   --> Returned "(null)" (-43) instead of "Success" (0)
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> *** An error occurred in MPI_Init
>>>>>>>> *** on a NULL communicator
>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>> ***    and potentially your MPI job)
>>>>>>>> [pcp-d-3:26054] Local abort before MPI_INIT completed completed
>>>>>>>> successfully, but am not able to aggregate error messages, and not able to
>>>>>>>> guarantee that all other processes were killed!
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>>>
>>>>>>>>   Process name: [[11371,1],0]
>>>>>>>>   Exit code:    1
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Paul, can you clarify something for me? The error in this case indicates
>>>>>>>>> that the client wasn’t able to reach the daemon - this should have resulted
>>>>>>>>> in termination of the job. Did the job actually run?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I'm on travel right now, but it should be an easy fix when I
>>>>>>>>> return. Sorry for the annoyance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <
>>>>>>>>> phhargr...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Any suggestion how I (as a non-root user) can avoid seeing this
>>>>>>>>>> hwloc error message on every run?
>>>>>>>>>>
>>>>>>>>>> -Paul
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <
>>>>>>>>>> gil...@rist.or.jp> wrote:
>>>>>>>>>>
>>>>>>>>>>> Paul,
>>>>>>>>>>>
>>>>>>>>>>> IIRC, the "Permission denied" is coming from hwloc that cannot
>>>>>>>>>>> collect all the info it would like.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>>>>>>>>>>>
>>>>>>>>>>> Tried tonight's master tarball on Solaris 11.2 on x86-64 with the
>>>>>>>>>>> Studio Compilers (default ILP32 output) and saw the following result:
>>>>>>>>>>>
>>>>>>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>>>>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>>>>>>> [pcp-d-4:00492] PMIX ERROR: ERROR in file
>>>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>>>>>>> at line 181
>>>>>>>>>>> [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file
>>>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>>>>> at line 463
>>>>>>>>>>>
>>>>>>>>>>> I don't know if the Permission denied error is related to the
>>>>>>>>>>> subsequent PMIX errors, but any message that says "UNREACHABLE" is
>>>>>>>>>>> clearly worth reporting.
>>>>>>>>>>>
>>>>>>>>>>> -Paul
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>> Link to this post: 
>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>> <typescript>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> <typescript>
>>
>>
>>
>>
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
>
>
> <pmix_client.diff>
>
>
>
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
