I just realized I screwed up my test, and I was missing some relevant info... So on one hand, I fixed a bug in singleton mode, but on the other hand, I cannot tell whether a collision was involved in this issue.

Cheers,

Gilles
Joshua Ladd <jladd.m...@gmail.com> wrote:

Great catch, Gilles! Not much of a surprise though.

Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P

Josh

On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

A bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch

The bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying the patch does not fit your test workflow, it might be easier for you to update your tests and run "mpirun -np 1 ./a.out" instead of "./a.out".

Basically, increasing verbosity runs some extra code, which includes sprintf. So yes, it is possible to crash an app by increasing verbosity and thereby running into a bug that is hidden under normal operation. My intuition suggests this is quite unlikely... If you can get a core file and a backtrace, we will soon find out.

Cheers,

Gilles
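(To illustrate the failure mode Gilles describes: the following standalone sketch, which is demonstration code and NOT the actual orted source, shows how an sprintf that only executes on a verbose code path can overflow a heap buffer, so that a later, unrelated realloc() is what actually aborts, with the same "realloc(): invalid next size" message seen in Eric's stderr below.)

    /*
     * Standalone demonstration (NOT the actual orted code): an sprintf
     * that only runs on a verbose code path overflows an 8-byte heap
     * buffer, and a later, unrelated realloc() is what actually aborts.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        (void) argv;
        int verbose = (argc > 1);      /* stand-in for ..._verbose=5 */
        char *name  = malloc(8);       /* too small for the message  */
        char *other = malloc(16);      /* innocent heap neighbor     */

        strcpy(name, "ok");
        if (verbose) {
            /* writes ~40 bytes into an 8-byte buffer, silently
             * trashing the allocator metadata of the next chunk */
            sprintf(name, "plm:base:set_hnp_name: final jobfam %lu",
                    (unsigned long) 53310);
        }

        /* fine under normal operation; with the verbose path taken,
         * glibc typically aborts here: "realloc(): invalid next size" */
        other = realloc(other, 64);

        printf("%s %p\n", name, (void *) other);
        free(name);
        free(other);
        return 0;
    }

(Run it with any argument to take the verbose path; without one it exits cleanly, which is exactly why such a bug stays hidden under normal operation.)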
On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted, *but* I can't tell whether it is the *same* bug, since the failure mode was a segfault this time:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

stdout:

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------

openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib     250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310

(this is the faulty test)

full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt

config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log

ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt

Maybe it aborted (instead of printing the other message) while reporting the error, because of "export OMPI_MCA_plm_base_verbose=5"?

Thanks,

Eric

On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

Do you mean you have a unique $TMP per a.out? Or a unique $TMP per "batch" of runs?

In the first case, my understanding is that conflicts cannot happen...

Once you hit the bug, can you please post the output of the failed a.out, and run

egrep 'jobfam|stop'

on all your logs, so we might spot a conflict.

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

Lucky!

Since each run has a specific TMP, I still have it on disc.

For the faulty run, the TMP variable was:

TMP=/tmp/tmp.wOv5dkNaSI

and in $TMP I have:

openmpi-sessions-40031@lorien_0

and in this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
1841

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...

If I do:

lsof | grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.

nothing...

What else may I check?

Eric

On 14/09/16 08:47 AM, Joshua Ladd wrote:

Hi, Eric

I **think** this might be related to the following:
https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.

Best,

Josh
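(To make that check concrete: a minimal sketch follows, a hypothetical standalone helper that is not part of PMIx or Open MPI, which scans a directory for pmix-<pid> usock files whose owning process no longer exists.)

    /*
     * Hypothetical standalone helper (not part of PMIx or Open MPI):
     * scan a directory for pmix-<pid> usock files and report the ones
     * whose owning process no longer exists, i.e. likely stale sockets.
     */
    #include <dirent.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        const char *dir = (argc > 1) ? argv[1] : "/tmp";
        DIR *d = opendir(dir);
        if (d == NULL) { perror(dir); return 1; }

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            int pid;
            /* usock files are named pmix-<pid of the owning daemon> */
            if (sscanf(e->d_name, "pmix-%d", &pid) != 1) continue;
            /* signal 0 probes for existence without delivering one */
            if (kill((pid_t) pid, 0) == -1 && errno == ESRCH) {
                printf("stale: %s/%s (pid %d is gone)\n",
                       dir, e->d_name, pid);
            }
        }
        closedir(d);
        return 0;
    }

(Pointed at a per-run $TMP such as /tmp/tmp.wOv5dkNaSI above, anything it reports belongs to a daemon that has already exited.)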
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

Can you please provide more information on how your tests are launched?

Do you

mpirun -np 1 ./a.out

or do you simply

./a.out

Do you use a batch manager? If yes, which one?

Do you run one test per job, or multiple tests per job?

How are these tests launched?

Does the test that crashes use MPI_Comm_spawn?

I am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)

Can you also run

hostname

on the 'lorien' host?

If you configured Open MPI with --enable-debug, can you

export OMPI_MCA_plm_base_verbose=5

then run one test and post the logs?

From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325). The discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.

orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid. So, strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family.

The easiest way to detect this could be to edit orte/mca/plm/base/plm_base_jobid.c and replace

    OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                         "plm:base:set_hnp_name: final jobfam %lu",
                         (unsigned long)jobfam));

with

    OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                         "plm:base:set_hnp_name: final jobfam %lu",
                         (unsigned long)jobfam));

then configure Open MPI with --enable-debug, rebuild, and

export OMPI_MCA_plm_base_verbose=4

and run your tests.

When the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.

Cheers,

Gilles
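(To illustrate why such a conflict is possible: the sketch below uses a toy hash, NOT the actual orte_plm_base_set_hnp_name() implementation, to fold a 32-bit hostname hash and a pid into a 16-bit job family, and shows two different pids on the same host landing on the same value.)

    /*
     * Toy sketch (hypothetical hash, NOT the actual
     * orte_plm_base_set_hnp_name() implementation): fold a 32-bit
     * hostname hash and a pid into a 16-bit job family, and show two
     * different pids on the same host landing on the same value.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* toy stand-in for the 32-bit hostname hash */
    static uint32_t hash32(const char *s)
    {
        uint32_t h = 5381;
        while (*s != '\0') h = h * 33u + (uint8_t) *s++;
        return h;
    }

    /* keeping only 16 bits is where collisions come from */
    static uint16_t jobfam(const char *host, uint32_t pid)
    {
        return (uint16_t) ((hash32(host) ^ pid) & 0xffffu);
    }

    int main(void)
    {
        const char *host = "lorien";
        uint32_t pid1 = 190552;
        uint32_t pid2 = pid1 ^ (1u << 16);  /* 256088: differs above bit 15 */

        printf("jobfam(%s, %u) = %u\n", host, pid1, jobfam(host, pid1));
        printf("jobfam(%s, %u) = %u\n", host, pid2, jobfam(host, pid2));
        return 0;
    }

(Both lines print the same 16-bit value. With only 65536 possible job families, roughly 300 jobs sharing a node already give a better-than-even chance of at least one collision, by the birthday bound. Note also that the jobid recorded in contact.txt above, 3493724160, is exactly 53310 << 16: the job family sits in the upper 16 bits of the jobid.)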
On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this has happened in the last 10 days.

While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... i.e., if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...

Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)

Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt

Thanks,

Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel