Ralph,
I fixed master at https://github.com/open-mpi/ompi/commit/11ebf3ab23bdaeb0ec96818c119364c6d837cd3b
and opened a PR for v2.x at https://github.com/open-mpi/ompi-release/pull/1376

Cheers,

Gilles

On 9/15/2016 12:26 PM, r...@open-mpi.org wrote:
Ah... I take that back. We changed this, and now we _do_ indeed go down that code path. Not good. So yes, we need that putenv so it gets the jobid from the HNP that was launched, like it used to do. You want to throw that in?

Thanks,

Ralph

On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:

Nah, something isn't right here. The singleton doesn't go through that code path, or at least it isn't supposed to. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons are working over there.

On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Ralph,

I think I just found the root cause :-)

From pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c:

    /* store our jobid and rank */
    if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
        /* if we were launched by the OMPI RTE, then
         * the jobid is in a special format - so get it */
        mca_pmix_pmix112_component.native_launch = true;
        opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
    } else {
        /* we were launched by someone else, so make the
         * jobid just be the hash of the nspace */
        OPAL_HASH_STR(my_proc.nspace, pname.jobid);
        /* keep it from being negative */
        pname.jobid &= ~(0x8000);
    }

In this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment, so we end up using a jobid that is a hash of the namespace.

The initial error message was

    [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

and, with little surprise, ((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated by OPAL_HASH_STR.

I am now thinking the right fix is to simply

    putenv(OPAL_MCA_PREFIX"orte_launch=1");

in ess/singleton.

Makes sense?
Cheers,

Gilles

On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok, one test segfaulted, *but* I can't tell whether it is the *same* bug, because there has been a segfault:

stderr: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

    [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
    [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
    [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
    [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
    [lorien:190552] [[53310,0],0] plm:base:receive start comm
    *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
    ...
    ...
    [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
    [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
    *** An error occurred in MPI_Init_thread
    *** on a NULL communicator
    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    *** and potentially your MPI job)
    [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

stdout:

    --------------------------------------------------------------------------
    It looks like orte_init failed for some reason; your parallel process is
    likely to abort. There are many reasons that a parallel process can
    fail during orte_init; some of which are due to configuration or
    environment problems.
    This failure appears to be an internal failure; here's some additional
    information (which may only be relevant to an Open MPI developer):

      orte_ess_init failed
      --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    It looks like MPI_INIT failed for some reason; your parallel process is
    likely to abort. There are many reasons that a parallel process can
    fail during MPI_INIT; some of which are due to configuration or
    environment problems. This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      ompi_mpi_init: ompi_rte_init failed
      --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
    --------------------------------------------------------------------------

Content of $TMP:

    /tmp/tmp.GoQXICeyJl> ls -la
    total 1500
    drwx------    3 cmpbib bib    250 Sep 14 13:34 .
    drwxrwxrwt  356 root   root 61440 Sep 14 13:45 ..
    ...
    drwx------ 1848 cmpbib bib  45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
    srw-rw-r--    1 cmpbib bib      0 Sep 14 12:24 pmix-190552

    cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
    ./53310/contact.txt

    cat 53310/contact.txt
    3493724160.0;usock;tcp://132.203.7.36:54605
    190552

    egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
    dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310

(this is the faulty test)

full egrep: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
config.log: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
ompi_info: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt

Maybe it aborted (instead of giving the other message) while reporting the error, because of export OMPI_MCA_plm_base_verbose=5?

Thanks,

Eric

On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

Do you mean you have a unique $TMP per a.out, or a unique $TMP per "batch" of runs? In the first case, my understanding is that conflicts cannot happen...

Once you hit the bug, can you please post the output of the failed a.out, and run egrep 'jobfam|stop' on all your logs, so we might spot a conflict?

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

Lucky! Since each run has a specific $TMP, I still have it on disc.

For the faulty run, the TMP variable was:

    TMP=/tmp/tmp.wOv5dkNaSI

and in $TMP I have:

    openmpi-sessions-40031@lorien_0

and in this subdirectory I have a bunch of empty dirs:

    cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
    1841
    cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
    total 68
    drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
    drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
    drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
    drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
    drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
    drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
    drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
    drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
    ...

If I do:

    lsof | grep "openmpi-sessions-40031"
    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
          Output information may be incomplete.
    lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
          Output information may be incomplete.

nothing... What else may I check?

Eric

On 14/09/16 08:47 AM, Joshua Ladd wrote:

Hi, Eric

I **think** this might be related to the following: https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.

Best,

Josh

On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Eric,

Can you please provide more information on how your tests are launched?
Do you "mpirun -np 1 ./a.out", or do you simply "./a.out"?
Do you use a batch manager? If yes, which one?
Do you run one test per job, or multiple tests per job? How are these tests launched?
Does the test that crashes use MPI_Comm_spawn?

I am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)

Can you also run hostname on the 'lorien' host?

If you configured Open MPI with --enable-debug, can you export OMPI_MCA_plm_base_verbose=5, then run one test and post the logs?

From orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325). The discrepancy could be explained by the use of a batch manager and/or a full hostname I am unaware of.

orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
Strictly speaking, it is therefore possible for two jobs launched on the same node to be assigned the same 16-bit job family. The easiest way to detect this could be to edit orte/mca/plm/base/plm_base_jobid.c and replace

    OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
                         "plm:base:set_hnp_name: final jobfam %lu",
                         (unsigned long)jobfam));

with

    OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
                         "plm:base:set_hnp_name: final jobfam %lu",
                         (unsigned long)jobfam));

then configure Open MPI with --enable-debug, rebuild, export OMPI_MCA_plm_base_verbose=4, and run your tests. When the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.

Cheers,

Gilles

On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

This is the third time this has happened in the last 10 days. While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:

    [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... i.e., if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...

Is there some kind of race condition that can happen on the creation of "tmp" files if many tests run together on the same node? (We are oversubscribing even sequential runs...)
Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt

Thanks,

Eric

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel