I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It looks to me
like it was patched for 4.0.2. Here is the openpmix issue:
https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with that fix.



From:   John DelSignore <jdelsign...@perforce.com>
To:     Open MPI Developers <devel@lists.open-mpi.org>
Cc:     Austen W Lauria <awlau...@us.ibm.com>, devel
            <devel-boun...@lists.open-mpi.org>
Date:   11/12/2019 02:25 PM
Subject:        [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging
            inside MPI_Init() when debugged with TotalView



Hi Austen,


Thanks for the reply. What I am seeing is consistent with your thought, in
that when I see the hang, one or more processes did not have the flag
updated. I don't understand the Open MPI code well enough to say whether it
is a memory barrier problem or not. It almost looks like an event delivery
or dropped-event problem to me.


The place in the MPI_Init() code where the MPI processes hang and the
number of "hung" processes seem to vary from run to run. In some cases the
processes are waiting for an event, in others for a fence (whatever that
is).


I did the following run today, which shows that it can hang waiting for an
event that apparently was not generated or was dropped:
   1.   Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
   2.   Ran the mpirun process until it hit the MPIR_Breakpoint() event.
   3.   TV attached to all four of the MPI processes and left all five
      processes stopped.
   4.   Continued all of the processes/threads and let them run freely for
      about 60 seconds. They should have run to completion in that amount
      of time.
   5.   Halted all of the processes. I included an aggregated backtrace of
      all of the processes below.


In this particular run, all four MPI processes were waiting in
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:


        /* let the MPI progress engine run while we wait for debugger release */
        OMPI_WAIT_FOR_COMPLETION(debugger_event_active);


I don't know how that is supposed to work, but I can clearly see that
debugger_event_active was true in all of the processes, even though TV set
MPIR_debug_gate to 1:


      d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
      Thread 2.1:
       debugger_event_active = true (1)
      Thread 3.1:
       debugger_event_active = true (1)
      Thread 4.1:
       debugger_event_active = true (1)
      Thread 5.1:
       debugger_event_active = true (1)
      d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
      Thread 2.1:
       MPIR_debug_gate = 0x00000001 (1)
      Thread 3.1:
       MPIR_debug_gate = 0x00000001 (1)
      Thread 4.1:
       MPIR_debug_gate = 0x00000001 (1)
      Thread 5.1:
       MPIR_debug_gate = 0x00000001 (1)
      d1.<>


I think the _release_fn() function in rte_orte_module.c is supposed to set
debugger_event_active to false, but that apparently did not happen in this
case. So, AFAICT, the reason debugger_event_active would not be set to
false is that the event was never delivered, so the _release_fn() function
was never called. If that's the case, then the lack of a memory barrier is
probably a moot point, and the problem is likely related to event
generation or dropped events.
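
For anyone following along, here is a minimal standalone sketch of the
pattern as I understand it - a volatile flag that MPI_Init() spins on, and
a release callback that a progress thread is supposed to invoke when the
debugger-release event arrives. Apart from debugger_event_active, the names
are illustrative only; this is not the actual rte_orte_module.c code:

    /* Sketch: one thread spins on a volatile flag while a progress thread
     * is expected to deliver a "debugger release" event whose callback
     * clears the flag. If the event is never delivered, the waiter spins
     * forever - which is what the hang looks like to me. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile int debugger_event_active = 1;  /* set before waiting */

    /* stands in for _release_fn(): runs when the release event is delivered */
    static void release_fn(void)
    {
        debugger_event_active = 0;
    }

    /* stands in for the PMIx/ORTE progress thread delivering the event */
    static void *progress_thread(void *arg)
    {
        (void)arg;
        sleep(1);      /* pretend the event arrives after a delay */
        release_fn();  /* if this never runs, the waiter below never exits */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, progress_thread, NULL);

        /* stands in for OMPI_WAIT_FOR_COMPLETION(debugger_event_active) */
        while (debugger_event_active) {
            usleep(100);  /* the real macro also drives opal_progress() */
        }

        printf("released\n");
        pthread_join(tid, NULL);
        return 0;
    }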


Cheers, John D.


FWIW: Here's the aggregated backtrace after the whole job was allowed to
run freely for about 60 seconds, and then stopped:


      d1.<> f g w -g f+l


      +/
       +__clone : 5:12[0-3.2-3, p1.2-5]
       |+start_thread
       | +listen_thread : 1:2[p1.3, p1.5]
       | |+__select_nocancel
       | +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
       | |+opal_libevent2022_event_base_loop@event.c#1630
       | | +poll_dispatch@poll.c#165
       | |  +__poll_nocancel
       | +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
       | |+opal_libevent2022_event_base_loop@event.c#1630
       | | +epoll_dispatch@epoll.c#407
       | |  +__epoll_wait_nocancel
       | +progress_engine : 1:2[p1.2, p1.4]
       |  +opal_libevent2022_event_base_loop@event.c#1630
       |   +epoll_dispatch@epoll.c#407 : 1:1[p1.2]
       |   |+__epoll_wait_nocancel
       |   +poll_dispatch@poll.c#165 : 1:1[p1.4]
       |    +__poll_nocancel
       +_start : 5:5[0-3.1, p1.1]
        +__libc_start_main
         +main@cpi.c#27 : 4:4[0-3.1]
         |+PMPI_Init@pinit.c#67
         | +ompi_mpi_init@ompi_mpi_init.c#890
         |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
         |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
         |   |+opal_progress_events@opal_progress.c#191
         |   | +opal_libevent2022_event_base_loop@event.c#1630
         |   |  +poll_dispatch@poll.c#165
         |   |   +__poll_nocancel
         |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
         |    +opal_progress_events@opal_progress.c#191
         |     +opal_libevent2022_event_base_loop@event.c#1630
         |      +poll_dispatch@poll.c#165
         |       +__poll_nocancel
         +orterun : 1:1[p1.1]
          +opal_libevent2022_event_base_loop@event.c#1630
           +poll_dispatch@poll.c#165
            +__poll_nocancel

      d1.<>





On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:


      Could it be that some processes are not seeing the flag get updated?
      I don't think a simple while loop on a volatile variable is
      sufficient in all cases in a multi-threaded environment. It's my
      understanding that the volatile keyword just tells the compiler not
      to optimize or do anything funky with accesses to the variable -
      because it can change at any time. However, it doesn't provide any
      memory barrier - so it's possible that the thread polling on this
      variable never sees the update.

      Looking at the code - I see:

      #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                               \
      do {                                                                     \
          opal_output_verbose(1, ompi_rte_base_framework.framework_output,     \
                              "%s lazy waiting on RTE event at %s:%d",         \
                              OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),              \
                              __FILE__, __LINE__);                             \
          while ((flg)) {                                                      \
              opal_progress();                                                 \
              usleep(100);                                                     \
          }                                                                    \
      }while(0);

      I think replacing that with:

      #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock)                   \
      do {                                                                     \
          opal_output_verbose(1, ompi_rte_base_framework.framework_output,     \
                              "%s lazy waiting on RTE event at %s:%d",         \
                              OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),              \
                              __FILE__, __LINE__);                             \
                                                                               \
          pthread_mutex_lock(&lock);                                           \
          while (flg) {                                                        \
              /* releases the lock while waiting for a signal from             \
               * another thread to wake up */                                  \
              pthread_cond_wait(&cond, &lock);                                 \
          }                                                                    \
          pthread_mutex_unlock(&lock);                                         \
      }while(0);

      is much more standard when dealing with threads updating a shared
      variable - and might lead to a more predictable result in this case.

      On the other end, this would require the thread updating this
      variable to:

      pthread_mutex_lock(&lock);
      flg = new_val;
      pthread_cond_signal(&cond);
      pthread_mutex_unlock(&lock);

      This provides the memory barrier needed for the waiting thread to see
      the update - something the volatile keyword doesn't provide on its
      own. I think it's also much cleaner, as it eliminates an arbitrary
      sleep from the code - which I see as a good thing as well.
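
      To make the pattern concrete, here is a small standalone example of
      the same two-sided wait/signal idea, using plain pthreads. It is only
      an illustration of the pattern, not OMPI code, and the names (flg,
      updater, etc.) are mine:

      /* Standalone sketch of the wait/signal pattern above. The mutex and
       * condition variable provide both the wakeup and the memory barrier,
       * so the waiting thread reliably sees the flag update. */
      #include <pthread.h>
      #include <stdio.h>

      static int flg = 1;  /* no volatile needed; the lock orders access */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

      /* the "other end": the thread that clears the flag and signals */
      static void *updater(void *arg)
      {
          (void)arg;
          pthread_mutex_lock(&lock);
          flg = 0;
          pthread_cond_signal(&cond);
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      int main(void)
      {
          pthread_t tid;
          pthread_create(&tid, NULL, updater, NULL);

          pthread_mutex_lock(&lock);
          while (flg) {
              /* atomically drops the lock and sleeps until signaled */
              pthread_cond_wait(&cond, &lock);
          }
          pthread_mutex_unlock(&lock);

          printf("flag cleared, proceeding\n");
          pthread_join(tid, NULL);
          return 0;
      }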



      From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
      To: "OpenMPI Devel" <devel@lists.open-mpi.org>
      Cc: "Ralph Castain" <r...@open-mpi.org>
      Date: 11/12/2019 09:24 AM
      Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is
      hanging inside MPI_Init() when debugged with TotalView
      Sent by: "devel" <devel-boun...@lists.open-mpi.org>







      > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel
      <devel@lists.open-mpi.org> wrote:
      >
      > John,
      >
      > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
      >
      >
      > is a simple loop that periodically checks the (volatile) "active"
      condition, which is expected to be updated by another thread.
      > So if you set your breakpoint too early, and **all** threads are
      stopped when this breakpoint is hit, you might experience
      > what looks like a race condition.
      > I guess a similar scenario can occur if the breakpoint is set in
      mpirun/orted too early and prevents the pmix (or oob/tcp) thread
      > from sending the message to all MPI tasks.
      >
      >
      >
      > Ralph,
      >
      > does the v4.0.x branch still need the oob/tcp progress thread
      running inside the MPI app?
      > or are we missing some commits (since all interactions with
      mpirun/orted are handled by PMIx, at least in the master branch) ?

      IIRC, that progress thread only runs if explicitly asked to do so by
      MCA param. We don't need that code any more as PMIx takes care of it.

      >
      > Cheers,
      >
      > Gilles
      >
      > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
      >> Hi John
      >>
      >> Sorry to say, but there is no way to really answer your question
      as the OMPI community doesn't actively test MPIR support. I haven't
      seen any reports of hangs during MPI_Init from any release series,
      including 4.x. My guess is that it may have something to do with the
      debugger interactions as opposed to being a true race condition.
      >>
      >> Ralph
      >>
      >>
      >>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel
      <devel@lists.open-mpi.org> wrote:
      >>>
      >>> Hi,
      >>>
      >>> An LLNL TotalView user on a Mac reported that their MPI job was
      hanging inside MPI_Init() when started under the control of
      TotalView. They were using Open MPI 4.0.1, and TotalView was using
      the MPIR Interface (sorry, we don't support the PMIx debugging hooks
      yet).
      >>>
      >>> I was able to reproduce the hang on my own Linux system with my
      own build of Open MPI 4.0.1, which I built with debug symbols. As far
      as I can tell, there is some sort of race inside of Open MPI 4.0.1,
      because placing breakpoints at certain points in the Open MPI code,
      and thus changing the timing slightly, was enough to avoid the hang.
      >>>
      >>> When the code hangs, it appears as if one or more MPI processes
      are waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a
      fence to be released. In one of the runs, rank 0 was the only one
      that was hanging there (though I have seen runs where two ranks were
      hung there).
      >>>
      >>> Here's a backtrace of the first thread in the rank 0 process in
      the case where one rank was hung:
      >>>
      >>> d1.<> f 10.1 w
      >>> >  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 [/lib64/libc.so.6]
      >>>    1 usleep               PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
      >>>    2 ompi_mpi_init        PC=0x7ffff7a64009, FP=0x7fffffffd350 [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
      >>>    3 PMPI_Init            PC=0x7ffff7ab0be4, FP=0x7fffffffd390 [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
      >>>    4 main                 PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
      >>>    5 __libc_start_main    PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
      >>>    6 _start               PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]
      >>>
      >>> Here's the block of code where the thread is hung:
      >>>
      >>>     /* if we executed the above fence in the background, then
      >>>      * we have to wait here for it to complete. However, there
      >>>      * is no reason to do two barriers! */
      >>>     if (background_fence) {
      >>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
      >>>     } else if (!ompi_async_mpi_init) {
      >>>         /* wait for everyone to reach this point - this is a hard
      >>>          * barrier requirement at this time, though we hope to relax
      >>>          * it at a later point */
      >>>         if (NULL != opal_pmix.fence_nb) {
      >>>             active = true;
      >>>             OPAL_POST_OBJECT(&active);
      >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
      >>>                                        fence_release, (void*)&active))) {
      >>>                 error = "opal_pmix.fence_nb() failed";
      >>>                 goto error;
      >>>             }
      >>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);   <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
      >>>         } else {
      >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
      >>>                 error = "opal_pmix.fence() failed";
      >>>                 goto error;
      >>>             }
      >>>         }
      >>>     }
      >>>
      >>> And here is an aggregated backtrace of all of the processes and
      threads in the job:
      >>>
      >>> d1.<> f g w -g f+l
      >>> +/
      >>>  +__clone : 5:12[0-3.2-3, p1.2-5]
      >>>  |+start_thread
      >>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
      >>>  | |+__select_nocancel
      >>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
      >>>  | |+__select_nocancel
      >>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
      >>>  | |+opal_libevent2022_event_base_loop@event.c#1632
      >>>  | | +poll_dispatch@poll.c#167
      >>>  | |  +__poll_nocancel
      >>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
      >>>  |  +opal_libevent2022_event_base_loop@event.c#1632
      >>>  |   +epoll_dispatch@epoll.c#409
      >>>  |    +__epoll_wait_nocancel
      >>>  +_start : 5:5[0-3.1, p1.1]
      >>>   +__libc_start_main
      >>>    +main@cpi.c#27 : 4:4[0-3.1]
      >>>    |+PMPI_Init@pinit.c#67
      >>>    | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]   <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
      >>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
      >>>    | | +opal_progress@opal_progress.c#251
      >>>    | |  +opal_progress_events@opal_progress.c#191
      >>>    | |   +opal_libevent2022_event_base_loop@event.c#1632
      >>>    | |    +poll_dispatch@poll.c#167
      >>>    | |     +__poll_nocancel
      >>>    | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]   <<<<---- THE THREAD THAT IS STUCK
      >>>    |  +usleep
      >>>    |   +__nanosleep_nocancel
      >>>    +main@main.c#14 : 1:1[p1.1]
      >>>     +orterun@orterun.c#200
      >>>      +opal_libevent2022_event_base_loop@event.c#1632
      >>>       +poll_dispatch@poll.c#167
      >>>        +__poll_nocancel
      >>>
      >>> d1.<>
      >>>
      >>> I have tested Open MPI 4.0.2 dozens of times, and the hang does
      not seem to happen. My concern is that if the problem is indeed a
      race, then it's possible (but perhaps not likely) that the same race
      exists in Open MPI 4.0.2, but that the timing is slightly different,
      such that it doesn't hang in my simple test setup. In other words,
      maybe I've just been "lucky" with my testing of Open MPI 4.0.2 and
      have failed to provoke the hang yet.
      >>>
      >>> My question is: Was this a known problem in Open MPI 4.0.1 that
      was fixed in Open MPI 4.0.2?
      >>>
      >>> Thanks, John D.
      >>>
      >>>
      >>






