Just in the FWIW category: the HNP used to send the singleton's name down the pipe at startup, which bypassed the code you identified. Now we are pushing the name into the environment as a PMIx envar and having the PMIx component pick it up. It's a roundabout way of getting it, and that's what is causing the trouble.
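For illustration, here is a minimal standalone sketch of the two paths involved. This is not the actual Open MPI source: the OPAL_MCA_PREFIX value and the hash are assumptions (the hash only mimics a djb2-style OPAL_HASH_STR), and strtoul() stands in for opal_convert_string_to_jobid(). It just shows why setting the orte_launch envar, as Gilles proposes below, keeps a singleton off the hash-fallback path that produced the clashing identifiers:

    #define _XOPEN_SOURCE 700
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define OPAL_MCA_PREFIX "OMPI_MCA_"      /* assumed value of the OMPI envar prefix */

    /* djb2-style string hash, standing in for OPAL_HASH_STR (assumption) */
    static void hash_str(const char *s, uint32_t *out)
    {
        uint32_t h = 5381;
        while (*s) {
            h = (h << 5) + h + (uint32_t)(unsigned char)*s++;
        }
        *out = h;
    }

    int main(void)
    {
        const char *nspace = "611123200";    /* e.g. jobfam 9325 -> (9325 << 16), as a string */
        uint32_t jobid;

        /* the proposed fix in ess/singleton: advertise an OMPI RTE launch */
        putenv(OPAL_MCA_PREFIX "orte_launch=1");

        /* what pmix1_client_init() then decides */
        if (NULL != getenv(OPAL_MCA_PREFIX "orte_launch")) {
            /* native launch: the nspace already encodes the jobid, so parse it
             * (strtoul is a stand-in for opal_convert_string_to_jobid) */
            jobid = (uint32_t)strtoul(nspace, NULL, 10);
        } else {
            /* foreign launch: fall back to hashing the nspace -- the path that
             * produced the clashing [[9325,...]] / [[5590,...]] identifiers */
            hash_str(nspace, &jobid);
            jobid &= ~(uint32_t)0x8000;      /* same mask as the quoted snippet */
        }
        printf("jobid = %u (jobfam %u)\n", jobid, jobid >> 16);
        return 0;
    }

With the envar set, this prints the real job family (9325); without it, you get whatever the hash happens to produce, which is how two unrelated jobs can end up presenting the same name.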
> On Sep 14, 2016, at 8:26 PM, r...@open-mpi.org wrote:
>
> Ah... I take that back. We changed this, and now we _do_ indeed go down that code path. Not good.
>
> So yes, we need that putenv so it gets the jobid from the HNP that was launched, like it used to do. You want to throw that in?
>
> Thanks
> Ralph
>
>> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
>>
>> Nah, something isn't right here. The singleton doesn't go thru that code line, or it isn't supposed to do so. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons are working over there.
>>
>>> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Ralph,
>>>
>>> i think i just found the root cause :-)
>>>
>>> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c:
>>>
>>>     /* store our jobid and rank */
>>>     if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>>>         /* if we were launched by the OMPI RTE, then
>>>          * the jobid is in a special format - so get it */
>>>         mca_pmix_pmix112_component.native_launch = true;
>>>         opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
>>>     } else {
>>>         /* we were launched by someone else, so make the
>>>          * jobid just be the hash of the nspace */
>>>         OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>>>         /* keep it from being negative */
>>>         pname.jobid &= ~(0x8000);
>>>     }
>>>
>>> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment, so we end up using a jobid that is a hash of the namespace.
>>>
>>> the initial error message was
>>>
>>>     [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> and with little surprise, ((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated by OPAL_HASH_STR.
>>>
>>> i am now thinking the right fix is to simply
>>>
>>>     putenv(OPAL_MCA_PREFIX"orte_launch=1");
>>>
>>> in ess/singleton.
>>>
>>> makes sense?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>>> Ok,
>>>>
>>>> one test segfaulted *but* I can't tell if it is the *same* bug, because there has been a segfault:
>>>>
>>>> stderr:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>>
>>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>>>> ...
>>>> ...
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>>> *** An error occurred in MPI_Init_thread
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> *** and potentially your MPI job)
>>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>
>>>> stdout:
>>>>
>>>> --------------------------------------------------------------------------
>>>> It looks like orte_init failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>>   orte_ess_init failed
>>>>   --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>>   ompi_mpi_init: ompi_rte_init failed
>>>>   --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>>
>>>> openmpi content of $TMP:
>>>>
>>>> /tmp/tmp.GoQXICeyJl> ls -la
>>>> total 1500
>>>> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
>>>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>>>> ...
>>>> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>>>> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>>>>
>>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>>>> ./53310/contact.txt
>>>>
>>>> cat 53310/contact.txt
>>>> 3493724160.0;usock;tcp://132.203.7.36:54605
>>>> 190552
>>>>
>>>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
>>>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>>
>>>> (this is the faulty test)
>>>>
>>>> full egrep:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>>>
>>>> config.log:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>>>
>>>> ompi_info:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>>>
>>>> Maybe it aborted (instead of giving the other message) while reporting the error because of "export OMPI_MCA_plm_base_verbose=5"?
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>>>> Eric,
>>>>>
>>>>> do you mean you have a unique $TMP per a.out?
>>>>> or a unique $TMP per "batch" of runs?
>>>>>
>>>>> in the first case, my understanding is that conflicts cannot happen ...
>>>>>
>>>>> once you hit the bug, can you please post the output of the failed a.out, and run
>>>>>
>>>>>     egrep 'jobfam|stop'
>>>>>
>>>>> on all your logs, so we might spot a conflict.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>>
>>>>> Lucky!
>>>>>
>>>>> Since each run has a specific TMP, I still have it on disc.
>>>>>
>>>>> for the faulty run, the TMP variable was:
>>>>>
>>>>>     TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>> and into $TMP I have:
>>>>>
>>>>>     openmpi-sessions-40031@lorien_0
>>>>>
>>>>> and in this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>>> 1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>>> total 68
>>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>>> ...
>>>>>
>>>>> If I do:
>>>>>
>>>>>     lsof | grep "openmpi-sessions-40031"
>>>>>
>>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>>>       Output information may be incomplete.
>>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>>>       Output information may be incomplete.
>>>>>
>>>>> nothing...
>>>>>
>>>>> What else may I check?
>>>>>
>>>>> Eric
>>>>>
>>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>
>>>>> Hi, Eric
>>>>>
>>>>> I **think** this might be related to the following:
>>>>>
>>>>> https://github.com/pmix/master/pull/145
>>>>>
>>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>>
>>>>> Best,
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>
>>>>> Eric,
>>>>>
>>>>> can you please provide more information on how your tests are launched?
>>>>>
>>>>> do you
>>>>>
>>>>>     mpirun -np 1 ./a.out
>>>>>
>>>>> or do you simply
>>>>>
>>>>>     ./a.out
>>>>>
>>>>> do you use a batch manager? if yes, which one?
>>>>>
>>>>> do you run one test per job, or multiple tests per job?
>>>>>
>>>>> how are these tests launched?
>>>>>
>>>>> does the test that crashes use MPI_Comm_spawn?
>>>>>
>>>>> i am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>>>>
>>>>> can you also run
>>>>>
>>>>>     hostname
>>>>>
>>>>> on the 'lorien' host?
>>>>>
>>>>> if you configured Open MPI with --enable-debug, can you
>>>>>
>>>>>     export OMPI_MCA_plm_base_verbose=5
>>>>>
>>>>> then run one test and post the logs?
>>>>>
>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325).
>>>>>
>>>>> the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of.
>>>>>
>>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>>>
>>>>> so strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family (a back-of-the-envelope estimate of how quickly that becomes likely is sketched at the end of this message).
>>>>>
>>>>> the easiest way to detect this could be to
>>>>>
>>>>> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>>>
>>>>>       OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>                            "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                            (unsigned long)jobfam));
>>>>>
>>>>>   with
>>>>>
>>>>>       OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>                            "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                            (unsigned long)jobfam));
>>>>>
>>>>> - configure Open MPI with --enable-debug and rebuild
>>>>>
>>>>> - and then
>>>>>
>>>>>       export OMPI_MCA_plm_base_verbose=4
>>>>>
>>>>>   and run your tests.
>>>>>
>>>>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It is the third time this has happened in the last 10 days.
>>>>>
>>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>>
>>>>>     [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>>
>>>>> But I can't reproduce the problem right now... i.e. if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...
>>>>>
>>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>>>
>>>>> Here are the build logs:
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric
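One footnote on the 16-bit job family space Gilles mentions above: assuming set_hnp_name() spreads distinct (hostname, pid) pairs roughly uniformly over the 65536 possible jobfam values (an assumption made only for this estimate; the real mixing lives in orte/mca/plm/base/plm_base_jobid.c), the birthday bound says a clash becomes more likely than not after only a few hundred jobs on the same node, well within a ~2200-test nightly run:

    #include <stdio.h>

    /* Birthday-bound estimate for 16-bit jobfam collisions, assuming the job
     * family is roughly uniform over 2^16 values for distinct (hostname, pid)
     * pairs.  This is an illustration, not the actual set_hnp_name() logic. */
    int main(void)
    {
        const double space = 65536.0;        /* 2^16 possible job families */
        double p_all_distinct = 1.0;

        for (int jobs = 2; jobs <= 2200; jobs++) {
            /* probability that the n-th job misses every jobfam already used */
            p_all_distinct *= (space - (double)(jobs - 1)) / space;
            if (jobs == 50 || jobs == 302 || jobs == 1000 || jobs == 2200) {
                printf("%4d jobs -> P(at least one jobfam clash) = %.3f\n",
                       jobs, 1.0 - p_all_distinct);
            }
        }
        return 0;
    }

On its own a jobfam clash should be survivable; combined with the hash fallback sketched earlier, though, two unrelated runs can end up presenting the same process identifier, which is what the usock_peer_recv_connect_ack message is complaining about.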