Eric,

I expect the PR will fix this bug. The crash occurs after the unexpected process identifier error, and that error should not happen in the first place. So at this stage, I would not worry too much about the crash (to me, it is undefined behavior anyway).
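For what it is worth, below is a minimal, purely illustrative sketch of why a 16 bits job family derived from the hostname hash and the launcher pid can collide when many singletons run on the same node. It is NOT the actual orte_plm_base_set_hnp_name() implementation (the real hash is different); toy_jobfam() and its xor-and-mask reduction are hypothetical and made up for this example, and the constants are simply the nodename hash and pid taken from your logs.

    /* Purely illustrative sketch -- NOT the actual orte_plm_base_set_hnp_name()
     * implementation (the real hash differs). It only shows why reducing
     * (hostname hash, pid) to a 16 bits job family can collide when many
     * singletons run on the same node. */
    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical 16 bits reduction, for illustration only */
    static uint16_t toy_jobfam(uint32_t nodename_hash, uint32_t pid)
    {
        return (uint16_t)((nodename_hash ^ pid) & 0xFFFFu);
    }

    int main(void)
    {
        uint32_t hash = 1366255883u;     /* nodename hash seen in your logs */
        uint32_t pid1 = 190552u;         /* pid seen in your logs */
        uint32_t pid2 = pid1 + 0x10000u; /* a different pid with the same low 16 bits */

        /* both pids map to the same job family in this toy reduction */
        printf("pid %u -> jobfam %u\n", pid1, (unsigned)toy_jobfam(hash, pid1));
        printf("pid %u -> jobfam %u\n", pid2, (unsigned)toy_jobfam(hash, pid2));
        return 0;
    }

With only 65536 possible job families, an occasional collision between tests that run (or have left session files behind) on the same node seems plausible, and that would be consistent with the intermittent "unexpected process identifier" errors.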
Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
> Hi,
>
> I know the pull request has not (yet) been merged, but here is a somewhat
> "different" output from a single sequential test (automatically) launched
> without mpirun last night:
>
> [lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> [lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
> [lorien:172229] plm:base:set_hnp_name: final jobfam 39075
> [lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:172229] [[39075,0],0] plm:base:receive start comm
> [lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
> [lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
> [lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
> [lorien:172218] *** Process received signal ***
> [lorien:172218] Signal: Segmentation fault (11)
> [lorien:172218] Signal code: Invalid permissions (2)
> [lorien:172218] Failing at address: 0x2d07e00
> [lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop comm
>
> Unfortunately, I didn't get any coredump (???). Is the line:
>
> [lorien:172218] Signal code: Invalid permissions (2)
>
> curious or not?
>
> As usual, here are the build logs:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt
>
> Will the PR #1376 prevent or fix this too?
>
> Thanks again!
>
> Eric
>
> On 15/09/16 09:32 AM, Eric Chamberland wrote:
>> Hi Gilles,
>>
>> On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
>>> Eric,
>>>
>>> a bug has been identified, and a patch is available at
>>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>>
>>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out),
>>> so if applying a patch does not fit your test workflow, it might be easier
>>> for you to update it and run mpirun -np 1 ./a.out instead of ./a.out
>>>
>>> basically, increasing verbosity runs some extra code, which includes sprintf.
>>> so yes, it is possible to crash an app by increasing verbosity by running
>>> into a bug that is hidden under normal operation.
>>> my intuition suggests this is quite unlikely ... if you can get a core
>>> file and a backtrace, we will soon find out
>>>
>> Damn! I did get one, but it got erased last night when the automatic
>> process started again... (which erases all directories before starting) :/
>>
>> I would like to put core files in a user-specific directory, but it
>> seems it has to be a system-wide configuration... :/ I will work around this
>> by changing the "pwd" to a path outside the erased directory...
>>
>> So as of tonight I should be able to retrieve core files even after I
>> relaunch the process..
>>
>> Thanks for all the support!
>>
>> Eric
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>>> Ok,
>>>>
>>>> one test segfaulted *but* I can't tell if it is the *same* bug, because
>>>> there has been a segfault:
>>>>
>>>> stderr:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>>
>>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>>>> ...
>>>> ...
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>>> *** An error occurred in MPI_Init_thread
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> *** and potentially your MPI job)
>>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>
>>>> stdout:
>>>>
>>>> --------------------------------------------------------------------------
>>>> It looks like orte_init failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>>   orte_ess_init failed
>>>>   --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>>   ompi_mpi_init: ompi_rte_init failed
>>>>   --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>>
>>>> openmpi content of $TMP:
>>>>
>>>> /tmp/tmp.GoQXICeyJl> ls -la
>>>> total 1500
>>>> drwx------ 3 cmpbib bib 250 Sep 14 13:34 .
>>>> drwxrwxrwt 356 root root 61440 Sep 14 13:45 ..
>>>> ...
>>>> drwx------ 1848 cmpbib bib 45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>>>> srw-rw-r-- 1 cmpbib bib 0 Sep 14 12:24 pmix-190552
>>>>
>>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>>>> ./53310/contact.txt
>>>>
>>>> cat 53310/contact.txt
>>>> 3493724160.0;usock;tcp://132.203.7.36:54605
>>>> 190552
>>>>
>>>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
>>>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>>
>>>> (this is the faulty test)
>>>>
>>>> full egrep:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>>>
>>>> config.log:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>>>
>>>> ompi_info:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>>>
>>>> Maybe it aborted (instead of giving the other message) while reporting the
>>>> error because of export OMPI_MCA_plm_base_verbose=5 ?
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>>>> Eric,
>>>>>
>>>>> do you mean you have a unique $TMP per a.out ?
>>>>> or a unique $TMP per "batch" of runs ?
>>>>>
>>>>> in the first case, my understanding is that conflicts cannot happen ...
>>>>>
>>>>> once you hit the bug, can you please post the output of the failed a.out,
>>>>> and run
>>>>> egrep 'jobfam|stop'
>>>>> on all your logs, so we might spot a conflict
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Wednesday, September 14, 2016, Eric Chamberland
>>>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>>
>>>>> Lucky!
>>>>>
>>>>> Since each run has a specific TMP, I still have it on disc.
>>>>>
>>>>> for the faulty run, the TMP variable was:
>>>>>
>>>>> TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>> and into $TMP I have:
>>>>>
>>>>> openmpi-sessions-40031@lorien_0
>>>>>
>>>>> and into this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc -l
>>>>> 1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |more
>>>>> total 68
>>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>> drwx------ 3 cmpbib bib 231 Sep 13 03:50 ..
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:10 10015
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:05 10049
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 03:15 10052
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10059
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:22 10110
>>>>> drwx------ 2 cmpbib bib 6 Sep 13 02:41 10114
>>>>> ...
>>>>>
>>>>> If I do:
>>>>>
>>>>> lsof |grep "openmpi-sessions-40031"
>>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>>>   Output information may be incomplete.
>>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>>>   Output information may be incomplete.
>>>>>
>>>>> nothing...
>>>>>
>>>>> What else may I check?
>>>>>
>>>>> Eric
>>>>>
>>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>
>>>>> Hi, Eric
>>>>>
>>>>> I **think** this might be related to the following:
>>>>>
>>>>> https://github.com/pmix/master/pull/145
>>>>>
>>>>> I'm wondering if you can look into the /tmp directory and see if you
>>>>> have a bunch of stale usock files.
>>>>>
>>>>> Best,
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>
>>>>> Eric,
>>>>>
>>>>> can you please provide more information on how your tests are launched ?
>>>>>
>>>>> do you
>>>>>
>>>>> mpirun -np 1 ./a.out
>>>>>
>>>>> or do you simply
>>>>>
>>>>> ./a.out
>>>>>
>>>>> do you use a batch manager ? if yes, which one ?
>>>>>
>>>>> do you run one test per job ? or multiple tests per job ?
>>>>>
>>>>> how are these tests launched ?
>>>>>
>>>>> does the test that crashes use MPI_Comm_spawn ?
>>>>>
>>>>> i am surprised by the process name [[9325,5754],0], which suggests
>>>>> that MPI_Comm_spawn was called 5753 times (!)
>>>>>
>>>>> can you also run
>>>>>
>>>>> hostname
>>>>>
>>>>> on the 'lorien' host ?
>>>>>
>>>>> if you configure'd Open MPI with --enable-debug, can you
>>>>>
>>>>> export OMPI_MCA_plm_base_verbose=5
>>>>>
>>>>> then run one test and post the logs ?
>>>>>
>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>>>> produce job family 5576 (but you get 9325)
>>>>>
>>>>> the discrepancy could be explained by the use of a batch manager
>>>>> and/or a full hostname i am unaware of.
>>>>>
>>>>> orte_plm_base_set_hnp_name() generates a 16 bits job family from the
>>>>> (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>>>>
>>>>> so strictly speaking, it is possible two jobs launched on the same
>>>>> node are assigned the same 16 bits job family.
>>>>>
>>>>> the easiest way to detect this could be to
>>>>>
>>>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>>>
>>>>> and replace
>>>>>
>>>>>     OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                          (unsigned long)jobfam));
>>>>>
>>>>> with
>>>>>
>>>>>     OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>                          "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                          (unsigned long)jobfam));
>>>>>
>>>>> configure Open MPI with --enable-debug and rebuild
>>>>>
>>>>> and then
>>>>>
>>>>> export OMPI_MCA_plm_base_verbose=4
>>>>>
>>>>> and run your tests.
>>>>>
>>>>> when the problem occurs, you will be able to check which pids
>>>>> produced the faulty jobfam, and that could hint at a conflict.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It is the third time this has happened in the last 10 days.
>>>>>
>>>>> While running nightly tests (~2200), we have one or two tests
>>>>> that fail at the very beginning with this strange error:
>>>>>
>>>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>>>>> received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>>
>>>>> But I can't reproduce the problem right now... i.e. if I launch
>>>>> this test alone "by hand", it is successful... the same test was
>>>>> successful yesterday...
>>>>>
>>>>> Is there some kind of "race condition" that can happen on the
>>>>> creation of "tmp" files if many tests run together on the same
>>>>> node? (we are oversubscribing even sequential runs...)
>>>>>
>>>>> Here are the build logs:
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel