Ralph,

the root cause is that
getsockopt(..., SOL_SOCKET, SO_RCVTIMEO, ...)
fails with errno ENOPROTOOPT on Solaris 11.2.

the attached patch is a proof of concept and works for me:
/* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
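
For readers without the attachment, the check can be sketched as a small standalone helper. This is only an illustration of the idea, not the patch itself: the function name try_set_recv_timeout and its parameters are hypothetical, and treating ENOPROTOOPT as "option not supported, carry on without a timeout" is the assumption the proof of concept makes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper (not part of PMIx): save the current SO_RCVTIMEO
 * value of sd in *saved and install a new timeout of `seconds`.
 * Returns 0 on success and -1 on an unexpected error.
 * *applied is true only if a timeout was installed, so the caller knows
 * whether *saved has to be restored afterwards. */
int try_set_recv_timeout(int sd, long seconds,
                         struct timeval *saved, bool *applied)
{
    socklen_t sz = sizeof(*saved);
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    assert(saved != NULL && applied != NULL);
    *applied = false;

    if (0 != getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, saved, &sz)) {
        /* Assumption: ENOPROTOOPT means the option is not supported for
         * this socket, so continue without a timeout instead of failing. */
        return (ENOPROTOOPT == errno) ? 0 : -1;
    }
    if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv))) {
        return -1;
    }
    *applied = true;
    return 0;
}
```

A caller would restore the saved value with setsockopt() after the receive only when *applied is true.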

Cheers,

Gilles

On 9/21/2015 2:16 PM, Paul Hargrove wrote:
Ralph,

Just as you say:
The first 64s pause was before the hwloc error message appeared.
The second was after the second server_setup_fork appears, and before whatever line came after that.

I don't know if stdio buffering may be "distorting" the placement of the pause relative to the lines of output.
However, prior to your patch the entire failed mpirun was around 1s.

No allocation.
No resource manager.
Just a single workstation.

-Paul

On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:

    ?? Just so this old fossilized brain gets this right: you are
    saying there was a 64s pause before the hwloc error appeared, and
    then another 64s pause after the second server_setup_fork message
    appeared?

    If that’s true, then I’m chasing the wrong problem - it sounds
    like something is messed up in the mpirun startup. Did you have
    more than one node in the allocation by chance? I’m wondering if
    we are getting held up by something in the daemon launch/callback
    area.



    On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

    Ralph,

    Still failing with that patch, but with the addition of a fairly
    long pause (64s) before the first error message appears, and
    again after the second "server_setup_fork" (64s again).

    New output is attached.

    -Paul

    On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

        Argh - found a typo in the output line. Could you please try
        the attached patch and do it again? This might fix it, but if
        not it will provide me with some idea of the returned error.

        Thanks
        Ralph


        On Sep 20, 2015, at 12:40 PM, Paul Hargrove
        <phhargr...@lbl.gov> wrote:

        Yes, it is definitely at 10.
        Another attempt is attached.
        -Paul

        On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain
        <r...@open-mpi.org> wrote:

            Paul - can you please confirm that you gave mpirun a
            level of 10 for the pmix_base_verbose param? This output
            isn’t what I would have expected from that level - it
            looks more like the verbosity was set to 5, and so the
            error number isn’t printed.

            Thanks
            Ralph


            On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet
            <gilles.gouaillar...@gmail.com> wrote:

            Paul,

            I do not remember it like that ...

            at that time, the issue in ompi was that the global
            errno was used instead of the per-thread errno.
            though the man page says -mt should be used for
            multithreaded apps, you tried -D_REENTRANT on all your
            platforms, and it was enough to get the expected result.

            I just wanted to check that the pmix1xx (sub)configure did
            correctly pass the -D_REENTRANT flag, and it does. so
            this is very likely a new and unrelated error.

            Cheers,

            Gilles

            On Sunday, September 20, 2015, Paul Hargrove
            <phhargr...@lbl.gov> wrote:

                Gilles,

                Yes every $CC invocation in opal/mca/pmix/pmix1xx
                includes "-D_REENTRANT".
                However, they don't include "-mt".
                I believe we concluded (when we had problems
                previously) that "-mt" was the proper flag (at
                compile and link) for multi-threaded with the
                Studio compilers.

                -Paul

                On Sat, Sep 19, 2015 at 11:29 PM, Gilles
                Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

                    Paul,

                    Can you please double check pmix1xx is compiled
                    with -D_REENTRANT ?
                    We ran into similar issues in the past, and
                    they only occurred with Solaris

                    Cheers,

                    Gilles


                    On Sunday, September 20, 2015, Paul Hargrove
                    <phhargr...@lbl.gov> wrote:

                        Ralph,
                        The output from the requested run is attached.
                        -Paul

                        On Sat, Sep 19, 2015 at 9:46 PM, Ralph
                        Castain <r...@open-mpi.org> wrote:

                            Ah, okay - that makes more sense. I’ll
                            have to let Brice see if he can figure
                            out how to silence the hwloc error
                            message as I can’t find where it came
                            from. The other errors are real and are
                            the reason why the job was terminated.

                            The problem is that we are trying to
                            establish a communication between the
                            app and the daemon via unix domain
                            socket, and we failed to do so. The
                            error tells me that we were able to
                            create and connect to the socket, but
                            failed when the daemon tried to do a
                            blocking send to the app.

                            Can you rerun it with -mca
                            pmix_base_verbose 10? It will tell us
                            the value of the error number that was
                            returned

                            Thanks
                            Ralph


                            On Sep 19, 2015, at 9:37 PM, Paul
                            Hargrove <phhargr...@lbl.gov> wrote:

                            Ralph,

                            No it did not run.
                            The complete output (which I really
                            should have included in the first
                            place) is below.

                            -Paul

                            $ mpirun -mca btl sm,self -np 2
                            examples/ring_c'
                            Error opening /devices/pci@0,0:reg:
                            Permission denied
                            [pcp-d-3:26054] PMIX ERROR: ERROR in
                            file
                            
/export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
                            at line 181
                            [pcp-d-3:26053] PMIX ERROR:
                            UNREACHABLE in file
                            
/export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
                            at line 463
                            
--------------------------------------------------------------------------
                            It looks like MPI_INIT failed for some
                            reason; your parallel process is
                            likely to abort. There are many
                            reasons that a parallel process can
                            fail during MPI_INIT; some of which
                            are due to configuration or environment
                            problems. This failure appears to be
                            an internal failure; here's some
                            additional information (which may only
                            be relevant to an Open MPI
                            developer):

                            ompi_mpi_init: ompi_rte_init failed
                            --> Returned "(null)" (-43) instead of
                            "Success" (0)
                            
--------------------------------------------------------------------------
                            *** An error occurred in MPI_Init
                            *** on a NULL communicator
                            *** MPI_ERRORS_ARE_FATAL (processes in
                            this communicator will now abort,
                            ***    and potentially your MPI job)
                            [pcp-d-3:26054] Local abort before
                            MPI_INIT completed completed
                            successfully, but am not able to
                            aggregate error messages, and not able
                            to guarantee that all other processes
                            were killed!
                            
-------------------------------------------------------
                            Primary job  terminated normally, but
                            1 process returned
                            a non-zero exit code.. Per
                            user-direction, the job has been aborted.
                            
-------------------------------------------------------
                            
--------------------------------------------------------------------------
                            mpirun detected that one or more
                            processes exited with non-zero status,
                            thus causing
                            the job to be terminated. The first
                            process to do so was:

                            Process name: [[11371,1],0]
                            Exit code:    1
                            
--------------------------------------------------------------------------

                            On Sat, Sep 19, 2015 at 8:50 PM, Ralph
                            Castain <r...@open-mpi.org> wrote:

                                Paul, can you clarify something
                                for me? The error in this case
                                indicates that the client wasn’t
                                able to reach the daemon - this
                                should have resulted in
                                termination of the job. Did the
                                job actually run?


                                On Sep 18, 2015, at 2:50 AM,
                                Ralph Castain <r...@open-mpi.org>
                                wrote:

                                I'm on travel right now, but it
                                should be an easy fix when I
                                return. Sorry for the annoyance


                                On Thu, Sep 17, 2015 at 11:13 PM,
                                Paul Hargrove <phhargr...@lbl.gov> wrote:

                                    Any suggestion how I (as a
                                    non-root user) can avoid
                                    seeing this hwloc error
                                    message on every run?

                                    -Paul

                                    On Thu, Sep 17, 2015 at 11:00 PM,
                                    Gilles Gouaillardet <gil...@rist.or.jp> wrote:

                                        Paul,

                                        IIRC, the "Permission
                                        denied" is coming from
                                        hwloc that cannot collect
                                        all the info it would like.

                                        Cheers,

                                        Gilles

                                        On 9/18/2015 2:34 PM,
                                        Paul Hargrove wrote:
                                        Tried tonight's master
                                        tarball on Solaris 11.2
                                        on x86-64 with the
                                        Studio Compilers
                                         (default ILP32 output)
                                        and saw the following
                                        result

                                        $ mpirun -mca btl
                                        sm,self -np 2
                                        examples/ring_c'
                                        Error opening
                                        /devices/pci@0,0:reg:
                                        Permission denied
                                        [pcp-d-4:00492] PMIX
                                        ERROR: ERROR in file
                                        
/export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
                                        at line 181
                                        [pcp-d-4:00491] PMIX
                                        ERROR: UNREACHABLE in
                                        file
                                        
/export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
                                        at line 463

                                        I don't know if the
                                        Permission denied error
                                        is related to the
                                        subsequent PMIX errors,
                                        but any message that
                                        says "UNREACHABLE" is
                                        clearly worth reporting.

                                        -Paul

                                        --
                                        Paul H. Hargrove phhargr...@lbl.gov
                                        Computer Languages & Systems Software (CLaSS) Group
                                        Computer Science Department           Tel: +1-510-495-2352
                                        Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


                                        
                                        _______________________________________________
                                        devel mailing list
                                        de...@open-mpi.org
                                        Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
                                        Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/18074.php


                                        

diff --git a/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c b/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
index 61f617a..fcd08de 100644
--- a/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
+++ b/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
@@ -809,6 +809,7 @@ static pmix_status_t recv_connect_ack(int sd)
     pmix_status_t rc;
     struct timeval tv, save;
     pmix_socklen_t sz;
+    bool sockopt = true;

     pmix_output_verbose(2, pmix_globals.debug_output,
                         "pmix: RECV CONNECT ACK FROM SERVER");
@@ -816,14 +817,20 @@ static pmix_status_t recv_connect_ack(int sd)
     /* get the current timeout value so we can reset to it */
     sz = sizeof(save);
     if (0 != getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, (void*)&save, &sz)) {
-        return PMIX_ERR_UNREACH;
-    }
-
-    /* set a timeout on the blocking recv so we don't hang */
-    tv.tv_sec  = 2;
-    tv.tv_usec = 0;
-    if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv))) {
-        return PMIX_ERR_UNREACH;
+        if (ENOPROTOOPT == errno) {
+            sockopt = false;
+        } else {
+            return PMIX_ERR_UNREACH;
+        }
+    } else {
+        /* set a timeout on the blocking recv so we don't hang */
+        tv.tv_sec  = 2;
+        tv.tv_usec = 0;
+        if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv))) {
+            pmix_output_verbose(2, pmix_globals.debug_output,
+                                "pmix: recv_connect_ack could not setsockopt SO_RCVTIMEO");
+            return PMIX_ERR_UNREACH;
+        }
     }

     /* receive the status reply */
@@ -855,9 +862,11 @@ static pmix_status_t recv_connect_ack(int sd)
         return rc;
     }

-    /* return the socket to normal */
-    if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &save, sz)) {
-        return PMIX_ERR_UNREACH;
+    if (sockopt) {
+        /* return the socket to normal */
+        if (0 != setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &save, sz)) {
+            return PMIX_ERR_UNREACH;
+        }
     }

     return PMIX_SUCCESS;
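
One thing to keep in mind with the proof of concept above: on the ENOPROTOOPT path the blocking receive runs without any timeout. A possible way to keep a timeout on platforms where SO_RCVTIMEO is unavailable is to wait with poll() before calling recv(). The following is a minimal sketch under that assumption, not part of the patch; the function name recv_with_timeout and the -2 return value for a timeout are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of PMIx): wait up to timeout_ms for sd
 * to become readable, then read up to len bytes into buf.
 * Returns the number of bytes read, 0 if the peer closed the connection,
 * -1 on error, and -2 if nothing arrived before the timeout. */
ssize_t recv_with_timeout(int sd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = sd, .events = POLLIN, .revents = 0 };
    int rc;

    assert(buf != NULL);

    /* Retry if a signal interrupts the wait. Note that each retry
     * restarts the full timeout; a stricter version would track the
     * remaining time. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && EINTR == errno);

    if (rc < 0) {
        return -1;   /* poll itself failed */
    }
    if (0 == rc) {
        return -2;   /* timed out, no socket option needed */
    }
    return recv(sd, buf, len, 0);
}
```

Because the timeout lives in the poll() call rather than in a socket option, nothing has to be saved or restored on the socket afterwards.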
