Eric,

I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and that
error should not happen in the first place. So at this stage, I would not
worry too much about the crash (to me, it is undefined behavior anyway).

Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

> Hi,
>
> I know the pull request has not (yet) been merged, but here is a somewhat
> "different" output from a single sequential test (automatically) laucnhed
> without mpirun last night:
>
> [lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path
> NULL
> [lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash
> 1366255883
> [lorien:172229] plm:base:set_hnp_name: final jobfam 39075
> [lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:172229] [[39075,0],0] plm:base:receive start comm
> [lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
> [lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a
> dynamic spawn
> [lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received
> unexpected process identifier [[41545,0],0] from [[39075,0],0]
> [lorien:172218] *** Process received signal ***
> [lorien:172218] Signal: Segmentation fault (11)
> [lorien:172218] Signal code: Invalid permissions (2)
> [lorien:172218] Failing at address: 0x2d07e00
> [lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop
> comm
>
>
> Unfortunately, I didn't get any core dump (???)  The line:
>
> [lorien:172218] Signal code: Invalid permissions (2)
>
> is it curious or not?
>
> as usual, here are the build logs:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt
>
> Will PR #1376 prevent or fix this too?
>
> Thanks again!
>
> Eric
>
>
>
> On 15/09/16 09:32 AM, Eric Chamberland wrote:
>
>> Hi Gilles,
>>
>> On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
>>
>>> Eric,
>>>
>>>
>>> a bug has been identified, and a patch is available at
>>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>>
>>>
>>>
>>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
>>> ./a.out), so if applying a patch does not fit your test workflow,
>>>
>>> it might be easier for you to update it and run mpirun -np 1 ./a.out
>>> instead of ./a.out
>>>
>>>
>>> basically, increasing verbosity runs some extra code, which includes
>>> sprintf.
>>> so yes, it is possible to crash an app by increasing verbosity and
>>> running into a bug that is hidden under normal operation.
>>> my intuition suggests this is quite unlikely ... if you can get a core
>>> file and a backtrace, we will soon find out
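>>>
>>> (a minimal sketch of what I mean, in plain C rather than the actual Open
>>> MPI code; the function and variable names here are made up: a logging
>>> path that is only reached when verbosity is raised, and that corrupts
>>> memory if its buffer is too small)
>>>
>>>     /* illustration only, not taken from Open MPI: the overflow is
>>>      * invisible at normal verbosity because the sprintf never runs */
>>>     #include <stdio.h>
>>>
>>>     static int verbose_level = 0;       /* e.g. raised via an MCA parameter */
>>>
>>>     static void log_peer(const char *peer_name)
>>>     {
>>>         char buf[16];                   /* too small for long peer names */
>>>         if (verbose_level >= 5) {
>>>             sprintf(buf, "peer = %s", peer_name);  /* overflows here */
>>>             fprintf(stderr, "%s\n", buf);
>>>         }
>>>     }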
>>>
>> Damn! I did get one, but it got erased last night when the automatic
>> process started again... (which erases all directories before starting) :/
>>
>> I would like to put core files in a user-specific directory, but it
>> seems that has to be a system-wide configuration... :/  I will work around
>> this by changing the "pwd" to a path outside the erased directory...
>>
>> So as of tonight I should be able to retrieve core files even after I
>> relaunch the process...
>>
>> Thanks for all the support!
>>
>> Eric
>>
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>>
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>>
>>>> Ok,
>>>>
>>>> one test segfaulted, *but* I can't tell if it is the *same* bug because
>>>> this time there was a segfault:
>>>>
>>>> stderr:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>>
>>>>
>>>>
>>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
>>>> path NULL
>>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename
>>>> hash 1366255883
>>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770
>>>> ***
>>>> ...
>>>> ...
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
>>>> daemon on the local node in file ess_singleton_module.c at line 573
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
>>>> daemon on the local node in file ess_singleton_module.c at line 163
>>>> *** An error occurred in MPI_Init_thread
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> ***    and potentially your MPI job)
>>>> [lorien:190306] Local abort before MPI_INIT completed completed
>>>> successfully, but am not able to aggregate error messages, and not
>>>> able to guarantee that all other processes were killed!
>>>>
>>>> stdout:
>>>>
>>>> ------------------------------------------------------------
>>>> --------------
>>>>
>>>>
>>>> It looks like orte_init failed for some reason; your parallel process is
>>>> likely to abort.  There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems.  This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>>   orte_ess_init failed
>>>>   --> Returned value Unable to start a daemon on the local node (-127)
>>>> instead of ORTE_SUCCESS
>>>> ------------------------------------------------------------
>>>> --------------
>>>>
>>>>
>>>> ------------------------------------------------------------
>>>> --------------
>>>>
>>>>
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort.  There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or
>>>> environment
>>>> problems.  This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>>   ompi_mpi_init: ompi_rte_init failed
>>>>   --> Returned "Unable to start a daemon on the local node" (-127)
>>>> instead of "Success" (0)
>>>> ------------------------------------------------------------
>>>> --------------
>>>>
>>>>
>>>>
>>>> openmpi content of $TMP:
>>>>
>>>> /tmp/tmp.GoQXICeyJl> ls -la
>>>> total 1500
>>>> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
>>>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>>>> ...
>>>> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34
>>>> openmpi-sessions-40031@lorien_0
>>>> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>>>>
>>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0>
>>>> find . -type f
>>>> ./53310/contact.txt
>>>>
>>>> cat 53310/contact.txt
>>>> 3493724160.0;usock;tcp://132.203.7.36:54605
>>>> 190552
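>>>>
>>>> (side note, under the assumption that the first contact.txt field is the
>>>> packed 32-bit jobid with the job family in its upper 16 bits: 3493724160
>>>> is exactly 53310 << 16, so this session directory matches the jobfam
>>>> reported in the log above; a tiny check of that arithmetic in C:)
>>>>
>>>>     /* the jobid-packing interpretation is an assumption; the
>>>>      * arithmetic itself is just 53310 * 65536 */
>>>>     #include <assert.h>
>>>>     #include <stdint.h>
>>>>
>>>>     int main(void)
>>>>     {
>>>>         uint32_t jobfam = 53310;
>>>>         assert((jobfam << 16) == 3493724160u);
>>>>         return 0;
>>>>     }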
>>>>
>>>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
>>>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552]
>>>> plm:base:set_hnp_name: final jobfam 53310
>>>>
>>>> (this is the faulty test)
>>>> full egrep:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>>>
>>>>
>>>>
>>>> config.log:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>>>
>>>>
>>>>
>>>> ompi_info:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>>>
>>>>
>>>>
>>>> Maybe it aborted (instead of giving the other message) while reporting
>>>> the error, because of export OMPI_MCA_plm_base_verbose=5?
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>>
>>>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>>>
>>>>> Eric,
>>>>>
>>>>> do you mean you have a unique $TMP per a.out?
>>>>> or a unique $TMP per "batch" of runs?
>>>>>
>>>>> in the first case, my understanding is that conflicts cannot happen ...
>>>>>
>>>>> once you hit the bug, can you please post the output of the
>>>>> failed a.out,
>>>>> and run
>>>>> egrep 'jobfam|stop'
>>>>> on all your logs, so we might spot a conflict
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Wednesday, September 14, 2016, Eric Chamberland
>>>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>>
>>>>>     Lucky!
>>>>>
>>>>>     Since each run has a specific TMP, I still have it on disk.
>>>>>
>>>>>     for the faulty run, the TMP variable was:
>>>>>
>>>>>     TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>>     and into $TMP I have:
>>>>>
>>>>>     openmpi-sessions-40031@lorien_0
>>>>>
>>>>>     and into this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
>>>>>     ls -la |wc -l
>>>>>     1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
>>>>>     ls -la |more
>>>>>     total 68
>>>>>     drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>>     drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>>>     drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>>>     drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>>>     drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>>>     drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>>>     drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>>>     ...
>>>>>
>>>>>     If I do:
>>>>>
>>>>>     lsof |grep "openmpi-sessions-40031"
>>>>>     lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>>>>>     /run/user/1000/gvfs
>>>>>           Output information may be incomplete.
>>>>>     lsof: WARNING: can't stat() tracefs file system
>>>>>     /sys/kernel/debug/tracing
>>>>>           Output information may be incomplete.
>>>>>
>>>>>     nothing...
>>>>>
>>>>>     What else may I check?
>>>>>
>>>>>     Eric
>>>>>
>>>>>
>>>>>     On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>
>>>>>         Hi, Eric
>>>>>
>>>>>         I **think** this might be related to the following:
>>>>>
>>>>>         https://github.com/pmix/master/pull/145
>>>>>
>>>>>         I'm wondering if you can look into the /tmp directory and see
>>>>>         if you have a bunch of stale usock files.
>>>>>
>>>>>         Best,
>>>>>
>>>>>         Josh
>>>>>
>>>>>
>>>>>         On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
>>>>>         <gil...@rist.or.jp> wrote:
>>>>>
>>>>>             Eric,
>>>>>
>>>>>
>>>>>             can you please provide more information on how your tests
>>>>>         are launched ?
>>>>>
>>>>>             do you
>>>>>
>>>>>             mpirun -np 1 ./a.out
>>>>>
>>>>>             or do you simply
>>>>>
>>>>>             ./a.out
>>>>>
>>>>>
>>>>>             do you use a batch manager ? if yes, which one ?
>>>>>
>>>>>             do you run one test per job ? or multiple tests per job ?
>>>>>
>>>>>             how are these tests launched ?
>>>>>
>>>>>
>>>>>             do the test that crashes use MPI_Comm_spawn ?
>>>>>
>>>>>             i am surprised by the process name [[9325,5754],0], which
>>>>>             suggests that MPI_Comm_spawn was called 5753 times (!)
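>>>>>
>>>>>             (for reference, a rough sketch of how such a name breaks
>>>>>             down; this is a simplification, not the actual ORTE headers)
>>>>>
>>>>>                 #include <stdint.h>
>>>>>
>>>>>                 /* simplified view of a printed name [[9325,5754],0]:
>>>>>                  * [[job family, local job number], vpid]; the local job
>>>>>                  * number grows with each spawned job, hence the 5753
>>>>>                  * estimate above */
>>>>>                 typedef struct {
>>>>>                     uint16_t job_family;  /* 9325 */
>>>>>                     uint16_t local_job;   /* 5754 */
>>>>>                     uint32_t vpid;        /* 0    */
>>>>>                 } process_name_sketch_t;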
>>>>>
>>>>>
>>>>>             can you also run
>>>>>
>>>>>             hostname
>>>>>
>>>>>             on the 'lorien' host ?
>>>>>
>>>>>             if you configure'd Open MPI with --enable-debug, can you
>>>>>
>>>>>             export OMPI_MCA_plm_base_verbose=5
>>>>>
>>>>>             then run one test and post the logs ?
>>>>>
>>>>>
>>>>>             from orte_plm_base_set_hnp_name(), "lorien" and pid 142766
>>>>>         should
>>>>>             produce job family 5576 (but you get 9325)
>>>>>
>>>>>             the discrepancy could be explained by the use of a batch
>>>>> manager
>>>>>             and/or a full hostname i am unaware of.
>>>>>
>>>>>
>>>>>             orte_plm_base_set_hnp_name() generates a 16-bit job family
>>>>>             from the (32-bit hash of the) hostname and the mpirun
>>>>>             (32-bit?) pid.
>>>>>
>>>>>             so strictly speaking, it is possible two jobs launched on
>>>>>             the same node are assigned the same 16-bit job family.
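>>>>>
>>>>>             (a minimal sketch of that pigeonhole argument, using a
>>>>>             made-up reduction rather than the real code in
>>>>>             plm_base_jobid.c: any fold of a 32-bit hash and a pid down
>>>>>             to 16 bits can collide for two pids on the same host)
>>>>>
>>>>>                 #include <stdint.h>
>>>>>
>>>>>                 /* illustration only: the exact mixing in Open MPI
>>>>>                  * differs, but the 16-bit result is what matters */
>>>>>                 static uint16_t jobfam_sketch(uint32_t hostname_hash,
>>>>>                                               uint32_t pid)
>>>>>                 {
>>>>>                     return (uint16_t)((hostname_hash ^ pid) & 0xFFFFu);
>>>>>                 }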
>>>>>
>>>>>
>>>>>             the easiest way to detect this could be to
>>>>>
>>>>>             - edit orte/mca/plm/base/plm_base_jobid.c
>>>>>
>>>>>             and replace
>>>>>
>>>>>                 OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>                                      "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                                      (unsigned long)jobfam));
>>>>>
>>>>>             with
>>>>>
>>>>>                 OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>                                      "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                                      (unsigned long)jobfam));
>>>>>
>>>>>             configure Open MPI with --enable-debug and rebuild
>>>>>
>>>>>             and then
>>>>>
>>>>>             export OMPI_MCA_plm_base_verbose=4
>>>>>
>>>>>             and run your tests.
>>>>>
>>>>>
>>>>>             when the problem occurs, you will be able to check which
>>>>>             pids produced the faulty jobfam, and that could hint at a
>>>>>             conflict.
>>>>>
>>>>>
>>>>>             Cheers,
>>>>>
>>>>>
>>>>>             Gilles
>>>>>
>>>>>
>>>>>
>>>>>             On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>
>>>>>                 Hi,
>>>>>
>>>>>                 It is the third time this has happened in the last 10
>>>>>                 days.
>>>>>
>>>>>                 While running nightly tests (~2200), we have one or two
>>>>>                 tests that fail at the very beginning with this strange
>>>>>                 error:
>>>>>
>>>>>                 [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>>>                 received unexpected process identifier [[9325,0],0] from
>>>>>                 [[5590,0],0]
>>>>>
>>>>>                 But I can't reproduce the problem right now... i.e., if I
>>>>>                 launch this test alone "by hand", it is successful... the
>>>>>                 same test was successful yesterday...
>>>>>
>>>>>                 Is there some kind of "race condition" that can happen
>>>>>                 on the creation of "tmp" files if many tests run together
>>>>>                 on the same node? (we are oversubscribing even sequential
>>>>>                 runs...)
>>>>>
>>>>>                 Here are the build logs:
>>>>>
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>
>>>>>                 Thanks,
>>>>>
>>>>>                 Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
