Paul and Brice,

the error message is printed by libpciaccess when hwloc invokes pci_system_init()

on Solaris, the nexus device is readable and writable by root only :
crw-------   1 root     sys      182, 253 Sep 28 10:55 /devices/pci@0,0:reg

the relevant code from libpciaccess :

    snprintf(nexus_path, sizeof(nexus_path), "/devices%s", nexus_name);
    if ((fd = open(nexus_path, O_RDWR | O_CLOEXEC)) >= 0) {
[...]
    } else {
        (void) fprintf(stderr, "Error opening %s: %s\n",
                       nexus_path, strerror(errno));
[...]
    }

i noted some TODO comments in the code to handle this.
since this piece of code is deep inside libpciaccess, i guess a fix is not trivial.
unless libpciaccess is modified (for example, so that it does not fprintf if a given environment variable is set),
hwloc would have to "emulate" pieces of libpciaccess to get the devices path, check the permissions,
and invoke pci_system_init only if everything is ok.
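
A minimal sketch of the "check the permissions first" idea, under stated assumptions: the helper name nexus_is_accessible is hypothetical (it is not part of hwloc or libpciaccess), and the caller is assumed to already know the nexus path, which is the part that would have to be emulated from libpciaccess.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper, not part of hwloc or libpciaccess.
 * Returns true when the calling process may open `nexus_path`
 * read-write, which is the mode requested by the open() call
 * in the libpciaccess excerpt (O_RDWR). */
static bool nexus_is_accessible(const char *nexus_path)
{
    /* access() returns 0 only if both read and write are permitted */
    return access(nexus_path, R_OK | W_OK) == 0;
}
```

A caller would skip pci_system_init when this returns false. Note that access() checks against the real user ID, and that a successful check does not guarantee the later open() succeeds, so this only avoids the common non-root case.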


another, simpler (but arguable ...) option is not to probe the PCI bus on Solaris unless running as root.
i made PR #136 https://github.com/open-mpi/hwloc/pull/136 to implement this
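
For illustration only, the "root only on Solaris" rule can be written as a small predicate. This is a sketch of the idea and not the content of PR #136; the function name should_probe_pci and its parameters are assumptions made for this example.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical predicate, not taken from hwloc or from PR #136.
 * On Solaris, probe the PCI bus only when the effective user is root;
 * on other systems, always probe. */
static bool should_probe_pci(bool on_solaris, uid_t euid)
{
    return !on_solaris || euid == 0;
}
```

A caller on Solaris would evaluate should_probe_pci(true, geteuid()) and skip pci_system_init when it returns false. The obvious drawback is that a non-root user who has been granted access to the device node is skipped as well.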

Cheers,

Gilles

On 9/26/2015 9:24 AM, Paul Hargrove wrote:
FYI:

Things look fine today with last night's master tarball.

I hope Brice has a way to eliminate the hwloc warning, since I am sure I am not the only one with scripts that will notice "Error" in the output.

-Paul

On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain <r...@open-mpi.org> wrote:

    Aha! Thanks - just what the doctor ordered!


    On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Ralph,

    the root cause is that
    getsockopt(..., SOL_SOCKET, SO_RCVTIMEO, ...)
    fails with errno ENOPROTOOPT on Solaris 11.2

    the attached patch is a proof of concept and works for me :
    /* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
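
The attached patch itself is not reproduced in this thread. As a rough sketch of the approach described in that comment, and assuming a hypothetical helper name (save_recv_timeout is not a PMIx function), the save step could tolerate ENOPROTOOPT like this:

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper, not the actual PMIx patch.
 * Tries to read the current receive timeout of socket `sd` into `saved`.
 * Returns 0 with *supported = true when the timeout was read,
 * returns 0 with *supported = false when the option is reported as
 * unsupported (ENOPROTOOPT), and returns -1 on any other error. */
static int save_recv_timeout(int sd, struct timeval *saved, bool *supported)
{
    socklen_t len = sizeof(*saved);

    *supported = false;
    if (getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, saved, &len) == 0) {
        *supported = true;
        return 0;
    }
    if (errno == ENOPROTOOPT) {
        /* not fatal: the caller simply skips set and restore */
        return 0;
    }
    return -1;
}
```

The caller would set and later restore SO_RCVTIMEO only when *supported is true, and treat a -1 return as a real error.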

    Cheers,

    Gilles

    On 9/21/2015 2:16 PM, Paul Hargrove wrote:
    Ralph,

    Just as you say:
    The first 64s pause was before the hwloc error message appeared.
    The second was after the second server_setup_fork appears, and
    before whatever line came after that.

    I don't know if stdio buffering may be "distorting" the placement
    of the pause relative to the lines of output.
    However, prior to your patch the entire failed mpirun was around 1s.

    No allocation.
    No resource manager.
    Just a single workstation.

    -Paul

    On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:

        ?? Just so this old fossilized brain gets this right: you
        are saying there was a 64s pause before the hwloc error
        appeared, and then another 64s pause after the second
        server_setup_fork message appeared?

        If that’s true, then I’m chasing the wrong problem - it
        sounds like something is messed up in the mpirun startup.
        Did you have more than one node in the allocation by chance?
        I’m wondering if we are getting held up by something in the
        daemon launch/callback area.



        On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

        Ralph,

        Still failing with that patch, but with the addition of a
        fairly long pause (64s) before the first error message
        appears, and again after the second "server setup_fork"
        (64s again)

        New output is attached.

        -Paul

        On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

            Argh - found a typo in the output line. Could you
            please try the attached patch and do it again? This
            might fix it, but if not it will provide me with some
            idea of the returned error.

            Thanks
            Ralph


            On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

            Yes, it is definitely at 10.
            Another attempt is attached.
            -Paul

            On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:

                Paul - can you please confirm that you gave mpirun
                a level of 10 for the pmix_base_verbose param?
                This output isn’t what I would have expected from
                that level - it looks more like the verbosity was
                set to 5, and so the error number isn’t printed.

                Thanks
                Ralph


                On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet
                <gilles.gouaillar...@gmail.com> wrote:

                Paul,

                I do not remember it like that ...

                at that time, the issue in ompi was that the
                global errno was used instead of the per-thread
                errno.
                though the man page says -mt should be used for
                multithreaded apps, you tried -D_REENTRANT on all
                your platforms, and it was enough to get the
                expected result.

                I just wanted to check pmix1xx (sub)configure did
                correctly pass the -D_REENTRANT flag, and it
                does. so this is very likely a new and unrelated
                error

                Cheers,

                Gilles

                On Sunday, September 20, 2015, Paul Hargrove
                <phhargr...@lbl.gov> wrote:

                    Gilles,

                    Yes every $CC invocation
                    in opal/mca/pmix/pmix1xx includes "-D_REENTRANT".
                    However, they don't include "-mt".
                    I believe we concluded (when we had problems
                    previously) that "-mt" was the proper flag
                    (at compile and link) for multi-threaded with
                    the Studio compilers.

                    -Paul

                    On Sat, Sep 19, 2015 at 11:29 PM, Gilles
                    Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

                        Paul,

                        Can you please double check pmix1xx is
                        compiled with -D_REENTRANT ?
                        We ran into similar issues in the past,
                        and they only occurred with Solaris

                        Cheers,

                        Gilles


                        On Sunday, September 20, 2015, Paul
                        Hargrove <phhargr...@lbl.gov> wrote:

                            Ralph,
                            The output from the requested run is
                            attached.
                            -Paul

                            On Sat, Sep 19, 2015 at 9:46 PM,
                            Ralph Castain <r...@open-mpi.org> wrote:

                                Ah, okay - that makes more sense.
                                I’ll have to let Brice see if he
                                can figure out how to silence the
                                hwloc error message as I can’t
                                find where it came from. The
                                other errors are real and are the
                                reason why the job was terminated.

                                The problem is that we are trying
                                to establish a communication
                                between the app and the daemon
                                via unix domain socket, and we
                                failed to do so. The error tells
                                me that we were able to create
                                and connect to the socket, but
                                failed when the daemon tried to
                                do a blocking send to the app.

                                Can you rerun it with -mca
                                pmix_base_verbose 10? It will
                                tell us the value of the error
                                number that was returned

                                Thanks
                                Ralph


                                On Sep 19, 2015, at 9:37 PM,
                                Paul Hargrove <phhargr...@lbl.gov> wrote:

                                Ralph,

                                No it did not run.
                                The complete output (which I
                                really should have included in
                                the first place) is below.

                                -Paul

                                $ mpirun -mca btl sm,self -np 2 examples/ring_c'
                                Error opening /devices/pci@0,0:reg: Permission denied
                                [pcp-d-3:26054] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
                                [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463
                                --------------------------------------------------------------------------
                                It looks like MPI_INIT failed for some reason; your parallel process is
                                likely to abort. There are many reasons that a parallel process can
                                fail during MPI_INIT; some of which are due to configuration or environment
                                problems. This failure appears to be an internal failure; here's some
                                additional information (which may only be relevant to an Open MPI
                                developer):

                                ompi_mpi_init: ompi_rte_init failed
                                --> Returned "(null)" (-43) instead of "Success" (0)
                                --------------------------------------------------------------------------
                                *** An error occurred in MPI_Init
                                *** on a NULL communicator
                                *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                                ***    and potentially your MPI job)
                                [pcp-d-3:26054] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
                                -------------------------------------------------------
                                Primary job  terminated normally, but 1 process returned
                                a non-zero exit code.. Per user-direction, the job has been aborted.
                                -------------------------------------------------------
                                --------------------------------------------------------------------------
                                mpirun detected that one or more processes exited with non-zero status, thus causing
                                the job to be terminated. The first process to do so was:

                                Process name: [[11371,1],0]
                                Exit code:    1
                                --------------------------------------------------------------------------

                                On Sat, Sep 19, 2015 at 8:50 PM,
                                Ralph Castain <r...@open-mpi.org> wrote:

                                    Paul, can you clarify
                                    something for me? The error
                                    in this case indicates that
                                    the client wasn’t able to
                                    reach the daemon - this
                                    should have resulted in
                                    termination of the job. Did
                                    the job actually run?


                                    On Sep 18, 2015, at 2:50
                                    AM, Ralph Castain
                                    <r...@open-mpi.org> wrote:

                                    I'm on travel right now,
                                    but it should be an easy
                                    fix when I return. Sorry
                                    for the annoyance


                                    On Thu, Sep 17, 2015 at
                                    11:13 PM, Paul Hargrove
                                    <phhargr...@lbl.gov> wrote:

                                        Any suggestion how I
                                        (as a non-root user)
                                        can avoid seeing this
                                        hwloc error message on
                                        every run?

                                        -Paul

                                        On Thu, Sep 17, 2015 at
                                        11:00 PM, Gilles Gouaillardet
                                        <gil...@rist.or.jp> wrote:

                                            Paul,

                                            IIRC, the
                                            "Permission denied"
                                            is coming from
                                            hwloc that cannot
                                            collect all the
                                            info it would like.

                                            Cheers,

                                            Gilles

                                            On 9/18/2015 2:34 PM,
                                            Paul Hargrove wrote:
                                            Tried tonight's
                                            master tarball on
                                            Solaris 11.2 on
                                            x86-64 with the
                                            Studio Compilers
                                             (default ILP32
                                            output) and saw
                                            the following result

                                            $ mpirun -mca btl sm,self -np 2 examples/ring_c'
                                            Error opening /devices/pci@0,0:reg: Permission denied
                                            [pcp-d-4:00492] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
                                            [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463

                                            I don't know if
                                            the Permission
                                            denied error is
                                            related to the
                                            subsequent PMIX
                                            errors, but any
                                            message that says
                                            "UNREACHABLE" is
                                            clearly worth
                                            reporting.

                                            -Paul

                                            --
                                            Paul H. Hargrove phhargr...@lbl.gov
                                            Computer Languages & Systems Software (CLaSS) Group
                                            Computer Science Department               Tel: +1-510-495-2352
                                            Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900


                                            _______________________________________________
                                            devel mailing list
                                            de...@open-mpi.org
                                            Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
                                            Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/18074.php


                                            