Just in the FWIW category: the HNP used to send the singleton's name down the pipe at startup, which bypassed the code you identified. Now we are pushing the name into the environment as a PMIx envar and having the PMIx component pick it up. It's a roundabout way of getting it, and that's what is causing the trouble.
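For illustration, here is a minimal standalone sketch of the two paths involved. This is not the actual Open MPI source: the OPAL_MCA_PREFIX value and the hash are assumptions (the hash only mimics a djb2-style OPAL_HASH_STR), and strtoul() stands in for opal_convert_string_to_jobid(). It just shows why setting the orte_launch envar, as Gilles proposes below, keeps a singleton off the hash-fallback path that produced the clashing identifiers:

    #define _XOPEN_SOURCE 700
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define OPAL_MCA_PREFIX "OMPI_MCA_"      /* assumed value of the OMPI envar prefix */

    /* djb2-style string hash, standing in for OPAL_HASH_STR (assumption) */
    static void hash_str(const char *s, uint32_t *out)
    {
        uint32_t h = 5381;
        while (*s) {
            h = (h << 5) + h + (uint32_t)(unsigned char)*s++;
        }
        *out = h;
    }

    int main(void)
    {
        const char *nspace = "611123200";    /* e.g. jobfam 9325 -> (9325 << 16), as a string */
        uint32_t jobid;

        /* the proposed fix in ess/singleton: advertise an OMPI RTE launch */
        putenv(OPAL_MCA_PREFIX "orte_launch=1");

        /* what pmix1_client_init() then decides */
        if (NULL != getenv(OPAL_MCA_PREFIX "orte_launch")) {
            /* native launch: the nspace already encodes the jobid, so parse it
             * (strtoul is a stand-in for opal_convert_string_to_jobid) */
            jobid = (uint32_t)strtoul(nspace, NULL, 10);
        } else {
            /* foreign launch: fall back to hashing the nspace -- the path that
             * produced the clashing [[9325,...]] / [[5590,...]] identifiers */
            hash_str(nspace, &jobid);
            jobid &= ~(uint32_t)0x8000;      /* same mask as the quoted snippet */
        }
        printf("jobid = %u (jobfam %u)\n", jobid, jobid >> 16);
        return 0;
    }

With the envar set, this prints the real job family (9325); without it, you get whatever the hash happens to produce, which is how two unrelated jobs can end up presenting the same name.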
> On Sep 14, 2016, at 8:26 PM, r...@open-mpi.org wrote:
>
> Ah... I take that back. We changed this, and now we _do_ indeed go down that code path. Not good.
>
> So yes, we need that putenv so it gets the jobid from the HNP that was launched, like it used to do. You want to throw that in?
>
> Thanks
> Ralph
>
>> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
>>
>> Nah, something isn't right here. The singleton doesn't go thru that code line, or it isn't supposed to do so. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons are working over there.
>>
>>> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Ralph,
>>>
>>> i think i just found the root cause :-)
>>>
>>> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c:
>>>
>>>     /* store our jobid and rank */
>>>     if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>>>         /* if we were launched by the OMPI RTE, then
>>>          * the jobid is in a special format - so get it */
>>>         mca_pmix_pmix112_component.native_launch = true;
>>>         opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
>>>     } else {
>>>         /* we were launched by someone else, so make the
>>>          * jobid just be the hash of the nspace */
>>>         OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>>>         /* keep it from being negative */
>>>         pname.jobid &= ~(0x8000);
>>>     }
>>>
>>> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment, so we end up using a jobid that is a hash of the namespace.
>>>
>>> the initial error message was
>>>
>>>     [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> and with little surprise, ((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated by OPAL_HASH_STR.
>>>
>>> i am now thinking the right fix is to simply
>>>
>>>     putenv(OPAL_MCA_PREFIX"orte_launch=1");
>>>
>>> in ess/singleton.
>>>
>>> makes sense?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>>> Ok,
>>>>
>>>> one test segfaulted *but* I can't tell if it is the *same* bug, because there has been a segfault:
>>>>
>>>> stderr:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>>
>>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>>> *** Error in `orted': realloc(): invalid next size: 0x0000000001e58770 ***
>>>> ...
>>>> ...
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>>> *** An error occurred in MPI_Init_thread
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> *** and potentially your MPI job)
>>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>
>>>> stdout:
>>>>
>>>> --------------------------------------------------------------------------
>>>> It looks like orte_init failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>>   orte_ess_init failed
>>>>   --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>>   ompi_mpi_init: ompi_rte_init failed
>>>>   --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>>
>>>> openmpi content of $TMP:
>>>>
>>>> /tmp/tmp.GoQXICeyJl> ls -la
>>>> total 1500
>>>> drwx------    3 cmpbib bib     250 Sep 14 13:34 .
>>>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>>>> ...
>>>> drwx------ 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0
>>>> srw-rw-r--    1 cmpbib bib       0 Sep 14 12:24 pmix-190552
>>>>
>>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . -type f
>>>> ./53310/contact.txt
>>>>
>>>> cat 53310/contact.txt
>>>> 3493724160.0;usock;tcp://132.203.7.36:54605
>>>> 190552
>>>>
>>>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr* | grep 53310
>>>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>>>
>>>> (this is the faulty test)
>>>>
>>>> full egrep:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt
>>>>
>>>> config.log:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log
>>>>
>>>> ompi_info:
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt
>>>>
>>>> Maybe it aborted (instead of giving the other message) while reporting the error because of "export OMPI_MCA_plm_base_verbose=5"?
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>> On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
>>>>> Eric,
>>>>>
>>>>> do you mean you have a unique $TMP per a.out?
>>>>> or a unique $TMP per "batch" of runs?
>>>>>
>>>>> in the first case, my understanding is that conflicts cannot happen ...
>>>>>
>>>>> once you hit the bug, can you please post the output of the failed a.out, and run
>>>>>
>>>>>     egrep 'jobfam|stop'
>>>>>
>>>>> on all your logs, so we might spot a conflict.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Wednesday, September 14, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>>>>>
>>>>> Lucky!
>>>>>
>>>>> Since each run has a specific TMP, I still have it on disc.
>>>>>
>>>>> for the faulty run, the TMP variable was:
>>>>>
>>>>>     TMP=/tmp/tmp.wOv5dkNaSI
>>>>>
>>>>> and into $TMP I have:
>>>>>
>>>>>     openmpi-sessions-40031@lorien_0
>>>>>
>>>>> and in this subdirectory I have a bunch of empty dirs:
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | wc -l
>>>>> 1841
>>>>>
>>>>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la | more
>>>>> total 68
>>>>> drwx------ 1840 cmpbib bib 45056 Sep 13 03:49 .
>>>>> drwx------    3 cmpbib bib   231 Sep 13 03:50 ..
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:10 10015
>>>>> drwx------    2 cmpbib bib     6 Sep 13 03:05 10049
>>>>> drwx------    2 cmpbib bib     6 Sep 13 03:15 10052
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10059
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:22 10110
>>>>> drwx------    2 cmpbib bib     6 Sep 13 02:41 10114
>>>>> ...
>>>>>
>>>>> If I do:
>>>>>
>>>>>     lsof | grep "openmpi-sessions-40031"
>>>>>
>>>>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>>>>       Output information may be incomplete.
>>>>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>>>>       Output information may be incomplete.
>>>>>
>>>>> nothing...
>>>>>
>>>>> What else may I check?
>>>>>
>>>>> Eric
>>>>>
>>>>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>>>>
>>>>> Hi, Eric
>>>>>
>>>>> I **think** this might be related to the following:
>>>>>
>>>>> https://github.com/pmix/master/pull/145
>>>>>
>>>>> I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files.
>>>>>
>>>>> Best,
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>
>>>>> Eric,
>>>>>
>>>>> can you please provide more information on how your tests are launched?
>>>>>
>>>>> do you
>>>>>
>>>>>     mpirun -np 1 ./a.out
>>>>>
>>>>> or do you simply
>>>>>
>>>>>     ./a.out
>>>>>
>>>>> do you use a batch manager? if yes, which one?
>>>>>
>>>>> do you run one test per job, or multiple tests per job?
>>>>>
>>>>> how are these tests launched?
>>>>>
>>>>> does the test that crashes use MPI_Comm_spawn?
>>>>>
>>>>> i am surprised by the process name [[9325,5754],0], which suggests MPI_Comm_spawn was called 5753 times (!)
>>>>>
>>>>> can you also run
>>>>>
>>>>>     hostname
>>>>>
>>>>> on the 'lorien' host?
>>>>>
>>>>> if you configured Open MPI with --enable-debug, can you
>>>>>
>>>>>     export OMPI_MCA_plm_base_verbose=5
>>>>>
>>>>> then run one test and post the logs?
>>>>>
>>>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce job family 5576 (but you get 9325).
>>>>>
>>>>> the discrepancy could be explained by the use of a batch manager and/or a full hostname i am unaware of.
>>>>>
>>>>> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit hash of the) hostname and the mpirun (32-bit?) pid.
>>>>>
>>>>> so strictly speaking, it is possible that two jobs launched on the same node are assigned the same 16-bit job family (a back-of-the-envelope estimate of how quickly that becomes likely is sketched at the end of this message).
>>>>>
>>>>> the easiest way to detect this could be to
>>>>>
>>>>> - edit orte/mca/plm/base/plm_base_jobid.c and replace
>>>>>
>>>>>       OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>>>>                            "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                            (unsigned long)jobfam));
>>>>>
>>>>>   with
>>>>>
>>>>>       OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>>>>                            "plm:base:set_hnp_name: final jobfam %lu",
>>>>>                            (unsigned long)jobfam));
>>>>>
>>>>> - configure Open MPI with --enable-debug and rebuild
>>>>>
>>>>> - and then
>>>>>
>>>>>       export OMPI_MCA_plm_base_verbose=4
>>>>>
>>>>>   and run your tests.
>>>>>
>>>>> when the problem occurs, you will be able to check which pids produced the faulty jobfam, and that could hint at a conflict.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It is the third time this has happened in the last 10 days.
>>>>>
>>>>> While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error:
>>>>>
>>>>>     [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>>>
>>>>> But I can't reproduce the problem right now... i.e. if I launch this test alone "by hand", it is successful... and the same test was successful yesterday...
>>>>>
>>>>> Is there some kind of "race condition" that can happen on the creation of "tmp" files if many tests run together on the same node? (we are oversubscribing even sequential runs...)
>>>>>
>>>>> Here are the build logs:
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric
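One footnote on the 16-bit job family space Gilles mentions above: assuming set_hnp_name() spreads distinct (hostname, pid) pairs roughly uniformly over the 65536 possible jobfam values (an assumption made only for this estimate; the real mixing lives in orte/mca/plm/base/plm_base_jobid.c), the birthday bound says a clash becomes more likely than not after only a few hundred jobs on the same node, well within a ~2200-test nightly run:

    #include <stdio.h>

    /* Birthday-bound estimate for 16-bit jobfam collisions, assuming the job
     * family is roughly uniform over 2^16 values for distinct (hostname, pid)
     * pairs.  This is an illustration, not the actual set_hnp_name() logic. */
    int main(void)
    {
        const double space = 65536.0;        /* 2^16 possible job families */
        double p_all_distinct = 1.0;

        for (int jobs = 2; jobs <= 2200; jobs++) {
            /* probability that the n-th job misses every jobfam already used */
            p_all_distinct *= (space - (double)(jobs - 1)) / space;
            if (jobs == 50 || jobs == 302 || jobs == 1000 || jobs == 2200) {
                printf("%4d jobs -> P(at least one jobfam clash) = %.3f\n",
                       jobs, 1.0 - p_all_distinct);
            }
        }
        return 0;
    }

On its own a jobfam clash should be survivable; combined with the hash fallback sketched earlier, though, two unrelated runs can end up presenting the same process identifier, which is what the usock_peer_recv_connect_ack message is complaining about.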