I just realized I screwed up my test, and I was missing some relevant info... So on one hand, I fixed a bug in singleton mode, but on the other hand, I cannot tell whether a collision was involved in this issue.

Cheers,

Gilles
Joshua Ladd <jladd.m...@gmail.com> wrote:

Great catch, Gilles! Not much of a surprise though.

Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P

Josh

On Thursday, September 15, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

A bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch

The bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying the patch does not fit your test workflow, it might be easier for you to update your tests and run "mpirun -np 1 ./a.out" instead of "./a.out".

Basically, increasing verbosity runs some extra code, which includes sprintf. So yes, it is possible to crash an app by increasing verbosity and thereby running into a bug that is hidden under normal operation. My intuition suggests this is quite unlikely... If you can get a core file and a backtrace, we will soon find out.

Cheers,

Gilles
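(To illustrate the failure mode Gilles describes: the following standalone sketch, which is demonstration code and NOT the actual orted source, shows how an sprintf that only executes on a verbose code path can overflow a heap buffer, so that a later, unrelated realloc() is what actually aborts, with the same "realloc(): invalid next size" message seen in Eric's stderr below.)

    /*
     * Standalone demonstration (NOT the actual orted code): an sprintf
     * that only runs on a verbose code path overflows an 8-byte heap
     * buffer, and a later, unrelated realloc() is what actually aborts.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        (void) argv;
        int verbose = (argc > 1);      /* stand-in for ..._verbose=5 */
        char *name  = malloc(8);       /* too small for the message  */
        char *other = malloc(16);      /* innocent heap neighbor     */

        strcpy(name, "ok");
        if (verbose) {
            /* writes ~40 bytes into an 8-byte buffer, silently
             * trashing the allocator metadata of the next chunk */
            sprintf(name, "plm:base:set_hnp_name: final jobfam %lu",
                    (unsigned long) 53310);
        }

        /* fine under normal operation; with the verbose path taken,
         * glibc typically aborts here: "realloc(): invalid next size" */
        other = realloc(other, 64);

        printf("%s %p\n", name, (void *) other);
        free(name);
        free(other);
        return 0;
    }

(Run it with any argument to take the verbose path; without one it exits cleanly, which is exactly why such a bug stays hidden under normal operation.)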
On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted, *but* I can't tell whether it is the *same* bug, since the failure mode was a segfault this time:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

stdout:

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------

openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx------    3 cmpbib bib     250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310

(this is the faulty test)

full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt

config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log

ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt

Maybe it aborted (instead of printing the other message) while reporting the error, because of "export OMPI_MCA_plm_base_verbose=5"?

Thanks,

Eric

On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

Do you mean you have a unique $TMP per a.out? Or a unique $TMP per "batch" of runs?

In the first case, my understanding is that conflicts cannot happen...

Once you hit the bug, can you please post the output of the failed a.out, and run

egrep 'jobfam|stop'

on all your logs, so we might spot a conflict.

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

Lucky!

Since each run has a specific TMP, I still have it on disc.

For the faulty run, the TMP variable was:

TMP=/tmp/tmp.wOv5dkNaSI

and in $TMP I have:

openmpi-sessions-40031@lorien_0

and in this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
1841

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
total 68
drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
...

If I do:

lsof | grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.

nothing...

What else may I check?

Eric

On 14/09/16 08:47 AM, Joshua Ladd wrote:

Hi, Eric

I **think** this might be related to the following:
https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.

Best,

Josh
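(To make that check concrete: a minimal sketch follows, a hypothetical standalone helper that is not part of PMIx or Open MPI, which scans a directory for pmix-<pid> usock files whose owning process no longer exists.)

    /*
     * Hypothetical standalone helper (not part of PMIx or Open MPI):
     * scan a directory for pmix-<pid> usock files and report the ones
     * whose owning process no longer exists, i.e. likely stale sockets.
     */
    #include <dirent.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        const char *dir = (argc > 1) ? argv[1] : "/tmp";
        DIR *d = opendir(dir);
        if (d == NULL) { perror(dir); return 1; }

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            int pid;
            /* usock files are named pmix-<pid of the owning daemon> */
            if (sscanf(e->d_name, "pmix-%d", &pid) != 1) continue;
            /* signal 0 probes for existence without delivering one */
            if (kill((pid_t) pid, 0) == -1 && errno == ESRCH) {
                printf("stale: %s/%s (pid %d is gone)\n",
                       dir, e->d_name, pid);
            }
        }
        closedir(d);
        return 0;
    }

(Pointed at a per-run $TMP such as /tmp/tmp.wOv5dkNaSI above, anything it reports belongs to a daemon that has already exited.)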
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

Can you please provide more information on how your tests are launched?

Do you

mpirun -np 1 ./a.out

or do you simply

./a.out

Do you use a batch manager? If yes, which one?

Do you run one test per job, or multiple tests per job?

How are these tests launched?

Does the test that crashes use MPI_Comm_spawn?

I am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)

Can you also run

hostname

on the 'lorien' host?

If you configured Open MPI with --enable-debug, can you

export OMPI_MCA_plm_base_verbose=5

then run one test and post the logs?

From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325). The discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.

orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid. So, strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family.

The easiest way to detect this could be to edit orte/mca/plm/base/plm_base_jobid.c and replace

    OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                         "plm:base:set_hnp_name: final jobfam %lu",
                         (unsigned long)jobfam));

with

    OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                         "plm:base:set_hnp_name: final jobfam %lu",
                         (unsigned long)jobfam));

then configure Open MPI with --enable-debug, rebuild, and

export OMPI_MCA_plm_base_verbose=4

and run your tests.

When the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.

Cheers,

Gilles
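(To illustrate why such a conflict is possible: the sketch below uses a toy hash, NOT the actual orte_plm_base_set_hnp_name() implementation, to fold a 32-bit hostname hash and a pid into a 16-bit job family, and shows two different pids on the same host landing on the same value.)

    /*
     * Toy sketch (hypothetical hash, NOT the actual
     * orte_plm_base_set_hnp_name() implementation): fold a 32-bit
     * hostname hash and a pid into a 16-bit job family, and show two
     * different pids on the same host landing on the same value.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* toy stand-in for the 32-bit hostname hash */
    static uint32_t hash32(const char *s)
    {
        uint32_t h = 5381;
        while (*s != '\0') h = h * 33u + (uint8_t) *s++;
        return h;
    }

    /* keeping only 16 bits is where collisions come from */
    static uint16_t jobfam(const char *host, uint32_t pid)
    {
        return (uint16_t) ((hash32(host) ^ pid) & 0xffffu);
    }

    int main(void)
    {
        const char *host = "lorien";
        uint32_t pid1 = 190552;
        uint32_t pid2 = pid1 ^ (1u << 16);  /* 256088: differs above bit 15 */

        printf("jobfam(%s, %u) = %u\n", host, pid1, jobfam(host, pid1));
        printf("jobfam(%s, %u) = %u\n", host, pid2, jobfam(host, pid2));
        return 0;
    }

(Both lines print the same 16-bit value. With only 65536 possible job families, roughly 300 jobs sharing a node already give a better-than-even chance of at least one collision, by the birthday bound. Note also that the jobid recorded in contact.txt above, 3493724160, is exactly 53310 << 16: the job family sits in the upper 16 bits of the jobid.)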
On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this has happened in the last 10 days.

While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... i.e., if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...

Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)

Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt

Thanks,

Eric
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel