FYI: Things look fine today with last night's master tarball.
I hope Brice has a way to eliminate the hwloc warning, since I am sure I am
not the only one with scripts that will notice "Error" in the output.

-Paul

On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Aha! Thanks - just what the doctor ordered!
>
>
> On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
> Ralph,
>
> the root cause is
> getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...)
> fails with errno ENOPROTOOPT on Solaris 11.2
>
> the attached patch is a proof of concept and works for me:
> /* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
>
> Cheers,
>
> Gilles
>
> On 9/21/2015 2:16 PM, Paul Hargrove wrote:
>
> Ralph,
>
> Just as you say:
> The first 64s pause was before the hwloc error message appeared.
> The second was after the second server_setup_fork appears, and before
> whatever line came after that.
>
> I don't know if stdio buffering may be "distorting" the placement of the
> pause relative to the lines of output.
> However, prior to your patch the entire failed mpirun was around 1s.
>
> No allocation.
> No resource manager.
> Just a single workstation.
>
> -Paul
>
> On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> ?? Just so this old fossilized brain gets this right: you are saying
>> there was a 64s pause before the hwloc error appeared, and then another 64s
>> pause after the second server_setup_fork message appeared?
>>
>> If that’s true, then I’m chasing the wrong problem - it sounds like
>> something is messed up in the mpirun startup. Did you have more than one
>> node in the allocation by chance? I’m wondering if we are getting held up
>> by something in the daemon launch/callback area.
>>
>> On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> Still failing with that patch, but with the addition of a fairly long
>> pause (64s) before the first error message appears, and again after the
>> second "server setup_fork" (64s again)
>>
>> New output is attached.
>>
>> -Paul
>>
>> On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Argh - found a typo in the output line. Could you please try the
>>> attached patch and do it again? This might fix it, but if not it will
>>> provide me with some idea of the returned error.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> Yes, it is definitely at 10.
>>> Another attempt is attached.
>>> -Paul
>>>
>>> On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Paul - can you please confirm that you gave mpirun a level of 10 for
>>>> the pmix_base_verbose param? This output isn’t what I would have expected
>>>> from that level - it looks more like the verbosity was set to 5, and so the
>>>> error number isn’t printed.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>> On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>> Paul,
>>>>
>>>> I do not remember it like that ...
>>>>
>>>> at that time, the issue in ompi was that the global errno was used
>>>> instead of the per-thread errno.
>>>> though the man page says -mt should be used for multithreaded apps,
>>>> you tried -D_REENTRANT on all your platforms, and it was enough to get the
>>>> expected result.
>>>>
>>>> I just wanted to check pmix1xx (sub)configure did correctly pass the
>>>> -D_REENTRANT flag, and it does.
>>>> so this is very likely a new and unrelated error
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>>> Gilles,
>>>>>
>>>>> Yes every $CC invocation in opal/mca/pmix/pmix1xx includes
>>>>> "-D_REENTRANT".
>>>>> However, they don't include "-mt".
>>>>> I believe we concluded (when we had problems previously) that "-mt"
>>>>> was the proper flag (at compile and link) for multi-threaded with the
>>>>> Studio compilers.
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet <
>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>>> Paul,
>>>>>>
>>>>>> Can you please double check pmix1xx is compiled with -D_REENTRANT ?
>>>>>> We ran into similar issues in the past, and they only occurred with
>>>>>> Solaris
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>> The output from the requested run is attached.
>>>>>>> -Paul
>>>>>>>
>>>>>>> On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ah, okay - that makes more sense. I’ll have to let Brice see if he
>>>>>>>> can figure out how to silence the hwloc error message as I can’t find
>>>>>>>> where it came from. The other errors are real and are the reason why
>>>>>>>> the job was terminated.
>>>>>>>>
>>>>>>>> The problem is that we are trying to establish a communication
>>>>>>>> between the app and the daemon via unix domain socket, and we failed
>>>>>>>> to do so. The error tells me that we were able to create and connect
>>>>>>>> to the socket, but failed when the daemon tried to do a blocking send
>>>>>>>> to the app.
>>>>>>>>
>>>>>>>> Can you rerun it with -mca pmix_base_verbose 10?
>>>>>>>> It will tell us the value of the error number that was returned
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Ralph,
>>>>>>>>
>>>>>>>> No it did not run.
>>>>>>>> The complete output (which I really should have included in the
>>>>>>>> first place) is below.
>>>>>>>>
>>>>>>>> -Paul
>>>>>>>>
>>>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>>>> [pcp-d-3:26054] PMIX ERROR: ERROR in file
>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>>>> at line 181
>>>>>>>> [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file
>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>> at line 463
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>>>>> problems.
>>>>>>>> This failure appears to be an internal failure; here's some
>>>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>>>> developer):
>>>>>>>>
>>>>>>>> ompi_mpi_init: ompi_rte_init failed
>>>>>>>> --> Returned "(null)" (-43) instead of "Success" (0)
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> *** An error occurred in MPI_Init
>>>>>>>> *** on a NULL communicator
>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>>> *** and potentially your MPI job)
>>>>>>>> [pcp-d-3:26054] Local abort before MPI_INIT completed completed
>>>>>>>> successfully, but am not able to aggregate error messages, and not
>>>>>>>> able to guarantee that all other processes were killed!
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job terminated normally, but 1 process returned
>>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun detected that one or more processes exited with non-zero
>>>>>>>> status, thus causing
>>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>>>
>>>>>>>> Process name: [[11371,1],0]
>>>>>>>> Exit code: 1
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Paul, can you clarify something for me? The error in this case
>>>>>>>>> indicates that the client wasn’t able to reach the daemon - this
>>>>>>>>> should have resulted in termination of the job. Did the job
>>>>>>>>> actually run?
>>>>>>>>>
>>>>>>>>> On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I'm on travel right now, but it should be an easy fix when I
>>>>>>>>> return. Sorry for the annoyance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <
>>>>>>>>> phhargr...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Any suggestion how I (as a non-root user) can avoid seeing this
>>>>>>>>>> hwloc error message on every run?
>>>>>>>>>>
>>>>>>>>>> -Paul
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <
>>>>>>>>>> gil...@rist.or.jp> wrote:
>>>>>>>>>>
>>>>>>>>>>> Paul,
>>>>>>>>>>>
>>>>>>>>>>> IIRC, the "Permission denied" is coming from hwloc that cannot
>>>>>>>>>>> collect all the info it would like.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>>>>>>>>>>>
>>>>>>>>>>> Tried tonight's master tarball on Solaris 11.2 on x86-64 with
>>>>>>>>>>> the Studio Compilers (default ILP32 output) and saw the following
>>>>>>>>>>> result
>>>>>>>>>>>
>>>>>>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>>>>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>>>>>>> [pcp-d-4:00492] PMIX ERROR: ERROR in file
>>>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>>>>>>> at line 181
>>>>>>>>>>> [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file
>>>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>>>>> at line 463
>>>>>>>>>>>
>>>>>>>>>>> I don't know if the Permission denied error is related to the
>>>>>>>>>>> subsequent PMIX errors, but any message that says "UNREACHABLE" is
>>>>>>>>>>> clearly worth reporting.
>>>>>>>>>>>
>>>>>>>>>>> -Paul
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Paul H. Hargrove phhargr...@lbl.gov
>>>>>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>>>>>> Computer Science Department Tel: +1-510-495-2352
>>>>>>>>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>> Link to this post:
>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php

--
Paul H. Hargrove phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
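
The workaround Gilles describes in the quoted thread (do not set or restore SO_RCVTIMEO when getsockopt() fails with ENOPROTOOPT) could look roughly like the following. This is only a minimal sketch in C and is not the attached pmix_client.diff: the helper names save_rcvtimeo and recv_with_optional_timeout are hypothetical, and the only calls assumed are the standard POSIX getsockopt(), setsockopt() and recv().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper (not from the PMIx sources): read the current
 * receive timeout of socket sd into *saved.
 * Returns 1 if the timeout was read, 0 if the platform reports
 * ENOPROTOOPT (the Solaris 11.2 case from the thread), -1 otherwise. */
static int save_rcvtimeo(int sd, struct timeval *saved)
{
    socklen_t len = sizeof(*saved);

    assert(saved != NULL);
    if (getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, saved, &len) == 0) {
        return 1;
    }
    return (errno == ENOPROTOOPT) ? 0 : -1;
}

/* Hypothetical helper: receive up to size bytes, bounding the wait to
 * timeout_sec seconds only when SO_RCVTIMEO is usable. If the option is
 * unsupported, it is neither set nor restored and a plain recv() is used. */
static ssize_t recv_with_optional_timeout(int sd, void *buf, size_t size,
                                          int timeout_sec)
{
    struct timeval saved, tv;
    ssize_t rc;
    int have_timeo = save_rcvtimeo(sd, &saved);

    if (have_timeo < 0) {
        return -1;                      /* unexpected getsockopt() failure */
    }
    if (have_timeo == 1) {
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
            return -1;
        }
    }

    rc = recv(sd, buf, size, 0);

    if (have_timeo == 1) {
        /* Restore the caller's timeout without clobbering recv()'s errno. */
        int recv_errno = errno;
        (void) setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &saved, sizeof(saved));
        errno = recv_errno;
    }
    return rc;
}
```

A caller would use recv_with_optional_timeout(sd, buf, sizeof(buf), 2) in place of a bare recv(). On a platform without SO_RCVTIMEO the call can block indefinitely, which is the trade-off this sketch accepts in order to keep the connection working.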