I don't think there is a need for any protection around that variable. It will change value only once (in a callback triggered from opal_progress), and the volatile guarantees that a load will be issued for every access, so the waiting thread will eventually notice the change.
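For reference, a minimal sketch of the pattern being discussed (names and signatures are illustrative only, not the actual OMPI code): a volatile flag that is cleared exactly once by a completion callback driven from the progress engine, and polled by the waiting thread.

    #include <stdbool.h>
    #include <unistd.h>

    /* Set before the non-blocking fence is started, cleared exactly once
     * by the completion callback. */
    static volatile bool active = true;

    /* Completion callback (simplified signature for illustration),
     * invoked from inside the progress engine. */
    static void fence_release(void *cbdata)
    {
        *(volatile bool *)cbdata = false;
    }

    static void lazy_wait(void)
    {
        while (active) {        /* volatile forces a fresh load each pass */
            /* opal_progress();    drives the callback above */
            usleep(100);
        }
    }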
George.

On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <devel@lists.open-mpi.org> wrote:

> Could it be that some processes are not seeing the flag get updated? I
> don't think just using a simple while loop with a volatile variable is
> sufficient in all cases in a multi-threaded environment. It's my
> understanding that the volatile keyword just tells the compiler to not
> optimize or do anything funky with it - because it can change at any time.
> However, this doesn't provide any memory barrier - so it's possible that
> the thread polling on this variable never sees the update.
>
> Looking at the code - I see:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                \
>     do {                                                                  \
>         opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
>                             "%s lazy waiting on RTE event at %s:%d",      \
>                             OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
>                             __FILE__, __LINE__);                          \
>         while ((flg)) {                                                   \
>             opal_progress();                                              \
>             usleep(100);                                                  \
>         }                                                                 \
>     } while(0);
>
> I think replacing that with:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock)                    \
>     do {                                                                  \
>         opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
>                             "%s lazy waiting on RTE event at %s:%d",      \
>                             OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
>                             __FILE__, __LINE__);                          \
>         pthread_mutex_lock(&lock);                                        \
>         while ((flg)) {                                                   \
>             /* releases the lock while waiting for a signal from          \
>              * another thread to wake up */                               \
>             pthread_cond_wait(&cond, &lock);                              \
>         }                                                                 \
>         pthread_mutex_unlock(&lock);                                      \
>     } while(0);
>
> is much more standard when dealing with threads updating a shared variable
> - and might lead to a more expected result in this case.
>
> On the other end, this would require the thread updating this variable to:
>
> pthread_mutex_lock(&lock);
> flg = new_val;
> pthread_cond_signal(&cond);
> pthread_mutex_unlock(&lock);
>
> This provides the memory barrier for the thread polling on the flag to see
> the update - something the volatile keyword doesn't do on its own. I think
> it's also much cleaner as it eliminates an arbitrary sleep from the code -
> which I see as a good thing as well.
>
> From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
> To: "OpenMPI Devel" <devel@lists.open-mpi.org>
> Cc: "Ralph Castain" <r...@open-mpi.org>
> Date: 11/12/2019 09:24 AM
> Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
> Sent by: "devel" <devel-boun...@lists.open-mpi.org>
> ------------------------------
>
> > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
> >
> > John,
> >
> > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> >
> > is a simple loop that periodically checks the (volatile) "active"
> > condition, which is expected to be updated by another thread.
> > So if you set your breakpoint too early, and **all** threads are stopped
> > when this breakpoint is hit, you might experience
> > what looks like a race condition.
> > I guess a similar scenario can occur if the breakpoint is set in
> > mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
> > from sending the message to all MPI tasks.
> >
> > Ralph,
> >
> > does the v4.0.x branch still need the oob/tcp progress thread running
> > inside the MPI app?
> > or are we missing some commits (since all interactions with mpirun/orted
> > are handled by PMIx, at least in the master branch)?
>
> IIRC, that progress thread only runs if explicitly asked to do so by MCA
> param. We don't need that code any more as PMIx takes care of it.
>
> >
> > Cheers,
> >
> > Gilles
> >
> > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
> >> Hi John
> >>
> >> Sorry to say, but there is no way to really answer your question, as the
> >> OMPI community doesn't actively test MPIR support. I haven't seen any
> >> reports of hangs during MPI_Init from any release series, including 4.x.
> >> My guess is that it may have something to do with the debugger
> >> interactions as opposed to being a true race condition.
> >>
> >> Ralph
> >>
> >>
> >>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> An LLNL TotalView user on a Mac reported that their MPI job was
> >>> hanging inside MPI_Init() when started under the control of TotalView.
> >>> They were using Open MPI 4.0.1, and TotalView was using the MPIR
> >>> interface (sorry, we don't support the PMIx debugging hooks yet).
> >>>
> >>> I was able to reproduce the hang on my own Linux system with my own
> >>> build of Open MPI 4.0.1, which I built with debug symbols. As far as I
> >>> can tell, there is some sort of race inside Open MPI 4.0.1, because if I
> >>> placed breakpoints at certain points in the Open MPI code, and thus
> >>> changed the timing slightly, that was enough to avoid the hang.
> >>>
> >>> When the code hangs, it appears as if one or more MPI processes are
> >>> waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence
> >>> to be released. In one of the runs, rank 0 was the only one that was
> >>> hanging there (though I have seen runs where two ranks were hung there).
> >>>
> >>> Here's a backtrace of the first thread in the rank 0 process in the
> >>> case where one rank was hung:
> >>>
> >>> d1.<> f 10.1 w
> >>> >  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 [/lib64/libc.so.6]
> >>>    1 usleep PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
> >>>    2 ompi_mpi_init PC=0x7ffff7a64009, FP=0x7fffffffd350 [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
> >>>    3 PMPI_Init PC=0x7ffff7ab0be4, FP=0x7fffffffd390 [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
> >>>    4 main PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
> >>>    5 __libc_start_main PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
> >>>    6 _start PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]
> >>>
> >>> Here's the block of code where the thread is hung:
> >>>
> >>>     /* if we executed the above fence in the background, then
> >>>      * we have to wait here for it to complete. However, there
> >>>      * is no reason to do two barriers!
> >>>      */
> >>>     if (background_fence) {
> >>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
> >>>     } else if (!ompi_async_mpi_init) {
> >>>         /* wait for everyone to reach this point - this is a hard
> >>>          * barrier requirement at this time, though we hope to relax
> >>>          * it at a later point */
> >>>         if (NULL != opal_pmix.fence_nb) {
> >>>             active = true;
> >>>             OPAL_POST_OBJECT(&active);
> >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
> >>>                                     fence_release, (void*)&active))) {
> >>>                 error = "opal_pmix.fence_nb() failed";
> >>>                 goto error;
> >>>             }
> >>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);   <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
> >>>         } else {
> >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
> >>>                 error = "opal_pmix.fence() failed";
> >>>                 goto error;
> >>>             }
> >>>         }
> >>>     }
> >>>
> >>> And here is an aggregated backtrace of all of the processes and
> >>> threads in the job:
> >>>
> >>> d1.<> f g w -g f+l
> >>> +/
> >>>  +__clone : 5:12[0-3.2-3, p1.2-5]
> >>>  |+start_thread
> >>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
> >>>  | |+__select_nocancel
> >>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
> >>>  | |+__select_nocancel
> >>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
> >>>  | |+opal_libevent2022_event_base_loop@event.c#1632
> >>>  | | +poll_dispatch@poll.c#167
> >>>  | |  +__poll_nocancel
> >>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
> >>>  |  +opal_libevent2022_event_base_loop@event.c#1632
> >>>  |   +epoll_dispatch@epoll.c#409
> >>>  |    +__epoll_wait_nocancel
> >>>  +_start : 5:5[0-3.1, p1.1]
> >>>   +__libc_start_main
> >>>    +main@cpi.c#27 : 4:4[0-3.1]
> >>>    |+PMPI_Init@pinit.c#67
> >>>    | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]   <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
> >>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
> >>>    | | +opal_progress@opal_progress.c#251
> >>>    | |  +opal_progress_events@opal_progress.c#191
> >>>    | |   +opal_libevent2022_event_base_loop@event.c#1632
> >>>    | |    +poll_dispatch@poll.c#167
> >>>    | |     +__poll_nocancel
> >>>    | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]   <<<<---- THE THREAD THAT IS STUCK
> >>>    |  +usleep
> >>>    |   +__nanosleep_nocancel
> >>>    +main@main.c#14 : 1:1[p1.1]
> >>>     +orterun@orterun.c#200
> >>>      +opal_libevent2022_event_base_loop@event.c#1632
> >>>       +poll_dispatch@poll.c#167
> >>>        +__poll_nocancel
> >>>
> >>> d1.<>
> >>>
> >>> I have tested Open MPI 4.0.2 dozens of times, and the hang does not
> >>> seem to happen. My concern is that if the problem is indeed a race, then
> >>> it's /possible/ (but perhaps not likely) that the same race exists in
> >>> Open MPI 4.0.2, but the timing could be slightly different such that it
> >>> doesn't hang using my simple test setup. In other words, maybe I've just
> >>> been "lucky" with my testing of Open MPI 4.0.2 and have failed to
> >>> provoke the hang yet.
> >>>
> >>> My question is: Was this a known problem in Open MPI 4.0.1 that was
> >>> fixed in Open MPI 4.0.2?
> >>>
> >>> Thanks, John D.
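For reference, a self-contained sketch of the mutex/condition-variable handshake Austen proposes above (illustrative names, not the actual OMPI macro). The waiter re-checks the predicate after every wakeup, which also covers spurious wakeups, and the lock/unlock pairs provide the ordering that a plain volatile flag does not.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool flag = true;                   /* protected by 'lock' */

    /* Waiting side: replaces the opal_progress()/usleep() polling loop. */
    static void wait_for_completion(void)
    {
        pthread_mutex_lock(&lock);
        while (flag) {
            pthread_cond_wait(&cond, &lock);   /* releases 'lock' while asleep */
        }
        pthread_mutex_unlock(&lock);
    }

    /* Updating side, e.g. what a fence-release callback would do. */
    static void *signal_completion(void *arg)
    {
        (void)arg;
        sleep(1);                              /* simulate the fence completing */
        pthread_mutex_lock(&lock);
        flag = false;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, signal_completion, NULL);
        wait_for_completion();                 /* returns once 'flag' is cleared */
        pthread_join(t, NULL);
        puts("fence released");
        return 0;
    }

Build with -pthread; the waiter sleeps in the kernel instead of burning a usleep(100) loop.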