Eric,
I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and this
error should not happen in the first place. So at this stage, I would
not worry too much about that crash (to me, it is undefined behavior
anyway).
Cheers,
Gilles
On Friday, September 16, 2016, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca> wrote:
Hi,
I know the pull request has not (yet) been merged, but here is a
somewhat "different" output from a single sequential test
(automatically) launched without mpirun last night:
[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop comm
Unfortunately, I didn't get any core dump (???). Is the line
[lorien:172218] Signal code: Invalid permissions (2)
curious, or not?
as usual, here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt
Will PR #1376 prevent or fix this too?
Thanks again!
Eric
On 15/09/16 09:32 AM, Eric Chamberland wrote:
Hi Gilles,
On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
Eric,
a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,
it might be easier for you to update it and run mpirun -np 1 ./a.out
instead of ./a.out
basically, increasing verbosity runs some extra code, which includes
sprintf. So yes, it is possible to crash an app by increasing verbosity,
by running into a bug that is hidden under normal operation.
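(For illustration only: a minimal C sketch, not the actual Open MPI code, of how a bug can stay hidden until verbosity is raised. The sprintf below only runs at a high verbosity level, and deliberately overflows a buffer that is too small for some inputs.)

    #include <stdio.h>

    static int verbose_level = 0;              /* normally 0: the buggy path never runs */

    static void log_peer(const char *peer)
    {
        char buf[16];                          /* deliberately too small for long names */
        if (verbose_level >= 5) {
            sprintf(buf, "peer: %s", peer);    /* overflows for long peers -> may crash */
            fprintf(stderr, "%s\n", buf);
        }
    }

    int main(void)
    {
        verbose_level = 5;                     /* e.g. set via OMPI_MCA_plm_base_verbose=5 */
        log_peer("a-rather-long-peer-name-that-does-not-fit");
        return 0;
    }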
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out
Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before
starting) :/
I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/ I will work
around this by changing the "pwd" to a path outside the erased
directories... So as of tonight I should be able to retrieve core files
even after I relaunch the process.
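(As a side note, here is a minimal C sketch of the same trick done from inside the test program instead of the launch script. CORE_DIR is a hypothetical persistent path, and this assumes the hard limit allows unlimited core files.)

    #include <sys/resource.h>
    #include <unistd.h>

    #define CORE_DIR "/home/cmpbib/cores"      /* hypothetical directory that is never erased */

    static void keep_cores(void)
    {
        struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
        setrlimit(RLIMIT_CORE, &rl);           /* same effect as "ulimit -c unlimited" */
        if (chdir(CORE_DIR) != 0) {            /* default core files land in the cwd */
            /* fall back: keep the current directory */
        }
    }

    int main(void)
    {
        keep_cores();
        /* ... rest of the test ... */
        return 0;
    }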
Thanks for all the support!
Eric
Cheers,
Gilles
On 9/15/2016 2:58 AM, Eric Chamberland wrote:
Ok,
one test segfaulted, *but* I can't tell if it is the *same* bug because
this time there has been a segfault:
stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
stdout:
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127)
      instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127)
      instead of "Success" (0)
--------------------------------------------------------------------------
openmpi content of $TMP:
/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib    250 Sep 14 13:34 .
drwxrwxrwt  356 root   root 61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib  45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib      0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552
egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
Maybe it aborted (instead of giving the other message) while reporting
the error, because of export OMPI_MCA_plm_base_verbose=5?
Thanks,
Eric
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
Eric,
do you mean you have a unique $TMP per a.out?
or a unique $TMP per "batch" of runs?
in the first case, my understanding is that conflicts cannot happen ...
once you hit the bug, can you please post the output of the failed
a.out, and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict
Cheers,
Gilles
On Wednesday, September 14, 2016, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca> wrote:
Lucky!
Since each run has a specific TMP, I still have it on disk.
for the faulty run, the TMP variable was:
TMP=/tmp/tmp.wOv5dkNaSI
and in $TMP I have:
openmpi-sessions-40031@lorien_0
and in this subdirectory I have a bunch of empty dirs:
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
1841
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...
If I do:
lsof |grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
nothing...
What else may I check?
Eric
On 14/09/16 08:47 AM, Joshua Ladd wrote:
Hi, Eric
I **think** this might be related to the following:
https://github.com/pmix/master/pull/145
I'm wondering if you can look into the /tmp directory and see if you
have a bunch of stale usock files.
Best,
Josh
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
<gil...@rist.or.jp> wrote:
Eric,
can you please provide more information on how your tests are launched?
do you
mpirun -np 1 ./a.out
or do you simply
./a.out
do you use a batch manager? if yes, which one?
do you run one test per job? or multiple tests per job?
how are these tests launched?
does the test that crashes use MPI_Comm_spawn?
i am surprised by the process name [[9325,5754],0], which suggests
MPI_Comm_spawn was called 5753 times (!)
can you also run
hostname
on the 'lorien' host?
if you configured Open MPI with --enable-debug, can you
export OMPI_MCA_plm_base_verbose=5
then run one test and post the logs?
from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
produce job family 5576 (but you get 9325).
the discrepancy could be explained by the use of a batch manager and/or
a full hostname i am unaware of.
orte_plm_base_set_hnp_name() generates a 16-bit job family from the
(32-bit hash of the) hostname and the mpirun (32-bit?) pid.
so strictly speaking, it is possible that two jobs launched on the same
node are assigned the same 16-bit job family.
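(To make the point concrete, a toy C sketch with a made-up mixing function, not ORTE's actual hash, showing how two different pids on the same node can fold to the same 16-bit job family.)

    #include <stdint.h>
    #include <stdio.h>

    /* toy stand-in for the jobfam computation: fold hostname hash and pid into 16 bits */
    static uint16_t toy_jobfam(uint32_t hostname_hash, uint32_t pid)
    {
        return (uint16_t)((hostname_hash ^ pid) & 0xffff);
    }

    int main(void)
    {
        uint32_t hash  = 1366255883u;      /* the "nodename hash" printed for lorien */
        uint32_t pid_a = 142766u;
        uint32_t pid_b = pid_a + 0x10000u; /* differs only in bits above the low 16 */

        printf("pid %u -> jobfam %u\n", (unsigned)pid_a, (unsigned)toy_jobfam(hash, pid_a));
        printf("pid %u -> jobfam %u\n", (unsigned)pid_b, (unsigned)toy_jobfam(hash, pid_b)); /* same value */
        return 0;
    }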
the easiest way to detect this could be to
- edit orte/mca/plm/base/plm_base_jobid.c and replace
      OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                           "plm:base:set_hnp_name: final jobfam %lu",
                           (unsigned long)jobfam));
  with
      OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                           "plm:base:set_hnp_name: final jobfam %lu",
                           (unsigned long)jobfam));
- configure Open MPI with --enable-debug and rebuild
and then
export OMPI_MCA_plm_base_verbose=4
and run your tests.
when the problem occurs, you will be able to check which pids produced
the faulty jobfam, and that could hint at a conflict.
Cheers,
Gilles
On 9/14/2016 12:35 AM, Eric Chamberland wrote:
Hi,
It is the third time this has happened in the last 10 days.
While running nightly tests (~2200), we have one or two tests that fail
at the very beginning with this strange error:
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
But I can't reproduce the problem right now... i.e., if I launch this
test alone "by hand", it is successful... the same test was successful
yesterday...
Is there some kind of "race condition" that can happen on the creation
of "tmp" files if many tests run together on the same node? (we are
oversubscribing even sequential runs...)
Here are the build logs:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
Thanks,
Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel