I don't think there is a need for any protection around that variable. It will change value only once (in a callback triggered from opal_progress), and the volatile guarantees that a load will be issued for every access, so the waiting thread will eventually notice the change.
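For reference, a minimal sketch of the pattern being discussed (names and signatures are illustrative only, not the actual OMPI code): a volatile flag that is cleared exactly once by a completion callback driven from the progress engine, and polled by the waiting thread.

    #include <stdbool.h>
    #include <unistd.h>

    /* Set before the non-blocking fence is started, cleared exactly once
     * by the completion callback. */
    static volatile bool active = true;

    /* Completion callback (simplified signature for illustration),
     * invoked from inside the progress engine. */
    static void fence_release(void *cbdata)
    {
        *(volatile bool *)cbdata = false;
    }

    static void lazy_wait(void)
    {
        while (active) {        /* volatile forces a fresh load each pass */
            /* opal_progress();    drives the callback above */
            usleep(100);
        }
    }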
George.

On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <devel@lists.open-mpi.org> wrote:

> Could it be that some processes are not seeing the flag get updated? I
> don't think just using a simple while loop with a volatile variable is
> sufficient in all cases in a multi-threaded environment. It's my
> understanding that the volatile keyword just tells the compiler to not
> optimize or do anything funky with it - because it can change at any time.
> However, this doesn't provide any memory barrier - so it's possible that
> the thread polling on this variable never sees the update.
>
> Looking at the code - I see:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                \
>     do {                                                                  \
>         opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
>                             "%s lazy waiting on RTE event at %s:%d",      \
>                             OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
>                             __FILE__, __LINE__);                          \
>         while ((flg)) {                                                   \
>             opal_progress();                                              \
>             usleep(100);                                                  \
>         }                                                                 \
>     } while(0);
>
> I think replacing that with:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock)                    \
>     do {                                                                  \
>         opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
>                             "%s lazy waiting on RTE event at %s:%d",      \
>                             OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
>                             __FILE__, __LINE__);                          \
>         pthread_mutex_lock(&lock);                                        \
>         while ((flg)) {                                                   \
>             /* releases the lock while waiting for a signal from          \
>              * another thread to wake up */                               \
>             pthread_cond_wait(&cond, &lock);                              \
>         }                                                                 \
>         pthread_mutex_unlock(&lock);                                      \
>     } while(0);
>
> is much more standard when dealing with threads updating a shared variable
> - and might lead to a more expected result in this case.
>
> On the other end, this would require the thread updating this variable to:
>
> pthread_mutex_lock(&lock);
> flg = new_val;
> pthread_cond_signal(&cond);
> pthread_mutex_unlock(&lock);
>
> This provides the memory barrier for the thread polling on the flag to see
> the update - something the volatile keyword doesn't do on its own. I think
> it's also much cleaner as it eliminates an arbitrary sleep from the code -
> which I see as a good thing as well.
>
> From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
> To: "OpenMPI Devel" <devel@lists.open-mpi.org>
> Cc: "Ralph Castain" <r...@open-mpi.org>
> Date: 11/12/2019 09:24 AM
> Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
> Sent by: "devel" <devel-boun...@lists.open-mpi.org>
> ------------------------------
>
> > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
> >
> > John,
> >
> > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> >
> > is a simple loop that periodically checks the (volatile) "active"
> > condition, which is expected to be updated by another thread.
> > So if you set your breakpoint too early, and **all** threads are stopped
> > when this breakpoint is hit, you might experience
> > what looks like a race condition.
> > I guess a similar scenario can occur if the breakpoint is set in
> > mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
> > from sending the message to all MPI tasks.
> >
> > Ralph,
> >
> > does the v4.0.x branch still need the oob/tcp progress thread running
> > inside the MPI app?
> > or are we missing some commits (since all interactions with mpirun/orted
> > are handled by PMIx, at least in the master branch)?
>
> IIRC, that progress thread only runs if explicitly asked to do so by MCA
> param. We don't need that code any more as PMIx takes care of it.
>
> >
> > Cheers,
> >
> > Gilles
> >
> > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
> >> Hi John
> >>
> >> Sorry to say, but there is no way to really answer your question, as the
> >> OMPI community doesn't actively test MPIR support. I haven't seen any
> >> reports of hangs during MPI_Init from any release series, including 4.x.
> >> My guess is that it may have something to do with the debugger
> >> interactions as opposed to being a true race condition.
> >>
> >> Ralph
> >>
> >>
> >>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> An LLNL TotalView user on a Mac reported that their MPI job was
> >>> hanging inside MPI_Init() when started under the control of TotalView.
> >>> They were using Open MPI 4.0.1, and TotalView was using the MPIR
> >>> interface (sorry, we don't support the PMIx debugging hooks yet).
> >>>
> >>> I was able to reproduce the hang on my own Linux system with my own
> >>> build of Open MPI 4.0.1, which I built with debug symbols. As far as I
> >>> can tell, there is some sort of race inside Open MPI 4.0.1, because if I
> >>> placed breakpoints at certain points in the Open MPI code, and thus
> >>> changed the timing slightly, that was enough to avoid the hang.
> >>>
> >>> When the code hangs, it appears as if one or more MPI processes are
> >>> waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence
> >>> to be released. In one of the runs, rank 0 was the only one that was
> >>> hanging there (though I have seen runs where two ranks were hung there).
> >>>
> >>> Here's a backtrace of the first thread in the rank 0 process in the
> >>> case where one rank was hung:
> >>>
> >>> d1.<> f 10.1 w
> >>> >  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 [/lib64/libc.so.6]
> >>>    1 usleep PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
> >>>    2 ompi_mpi_init PC=0x7ffff7a64009, FP=0x7fffffffd350 [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
> >>>    3 PMPI_Init PC=0x7ffff7ab0be4, FP=0x7fffffffd390 [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
> >>>    4 main PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
> >>>    5 __libc_start_main PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
> >>>    6 _start PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]
> >>>
> >>> Here's the block of code where the thread is hung:
> >>>
> >>>     /* if we executed the above fence in the background, then
> >>>      * we have to wait here for it to complete. However, there
> >>>      * is no reason to do two barriers!
> >>>      */
> >>>     if (background_fence) {
> >>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
> >>>     } else if (!ompi_async_mpi_init) {
> >>>         /* wait for everyone to reach this point - this is a hard
> >>>          * barrier requirement at this time, though we hope to relax
> >>>          * it at a later point */
> >>>         if (NULL != opal_pmix.fence_nb) {
> >>>             active = true;
> >>>             OPAL_POST_OBJECT(&active);
> >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
> >>>                                     fence_release, (void*)&active))) {
> >>>                 error = "opal_pmix.fence_nb() failed";
> >>>                 goto error;
> >>>             }
> >>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);   <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
> >>>         } else {
> >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
> >>>                 error = "opal_pmix.fence() failed";
> >>>                 goto error;
> >>>             }
> >>>         }
> >>>     }
> >>>
> >>> And here is an aggregated backtrace of all of the processes and
> >>> threads in the job:
> >>>
> >>> d1.<> f g w -g f+l
> >>> +/
> >>>  +__clone : 5:12[0-3.2-3, p1.2-5]
> >>>  |+start_thread
> >>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
> >>>  | |+__select_nocancel
> >>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
> >>>  | |+__select_nocancel
> >>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
> >>>  | |+opal_libevent2022_event_base_loop@event.c#1632
> >>>  | | +poll_dispatch@poll.c#167
> >>>  | |  +__poll_nocancel
> >>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
> >>>  |  +opal_libevent2022_event_base_loop@event.c#1632
> >>>  |   +epoll_dispatch@epoll.c#409
> >>>  |    +__epoll_wait_nocancel
> >>>  +_start : 5:5[0-3.1, p1.1]
> >>>   +__libc_start_main
> >>>    +main@cpi.c#27 : 4:4[0-3.1]
> >>>    |+PMPI_Init@pinit.c#67
> >>>    | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]   <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
> >>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
> >>>    | | +opal_progress@opal_progress.c#251
> >>>    | |  +opal_progress_events@opal_progress.c#191
> >>>    | |   +opal_libevent2022_event_base_loop@event.c#1632
> >>>    | |    +poll_dispatch@poll.c#167
> >>>    | |     +__poll_nocancel
> >>>    | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]   <<<<---- THE THREAD THAT IS STUCK
> >>>    |  +usleep
> >>>    |   +__nanosleep_nocancel
> >>>    +main@main.c#14 : 1:1[p1.1]
> >>>     +orterun@orterun.c#200
> >>>      +opal_libevent2022_event_base_loop@event.c#1632
> >>>       +poll_dispatch@poll.c#167
> >>>        +__poll_nocancel
> >>>
> >>> d1.<>
> >>>
> >>> I have tested Open MPI 4.0.2 dozens of times, and the hang does not
> >>> seem to happen. My concern is that if the problem is indeed a race, then
> >>> it's /possible/ (but perhaps not likely) that the same race exists in
> >>> Open MPI 4.0.2, but the timing could be slightly different such that it
> >>> doesn't hang using my simple test setup. In other words, maybe I've just
> >>> been "lucky" with my testing of Open MPI 4.0.2 and have failed to
> >>> provoke the hang yet.
> >>>
> >>> My question is: Was this a known problem in Open MPI 4.0.1 that was
> >>> fixed in Open MPI 4.0.2?
> >>>
> >>> Thanks, John D.
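For reference, a self-contained sketch of the mutex/condition-variable handshake Austen proposes above (illustrative names, not the actual OMPI macro). The waiter re-checks the predicate after every wakeup, which also covers spurious wakeups, and the lock/unlock pairs provide the ordering that a plain volatile flag does not.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool flag = true;                   /* protected by 'lock' */

    /* Waiting side: replaces the opal_progress()/usleep() polling loop. */
    static void wait_for_completion(void)
    {
        pthread_mutex_lock(&lock);
        while (flag) {
            pthread_cond_wait(&cond, &lock);   /* releases 'lock' while asleep */
        }
        pthread_mutex_unlock(&lock);
    }

    /* Updating side, e.g. what a fence-release callback would do. */
    static void *signal_completion(void *arg)
    {
        (void)arg;
        sleep(1);                              /* simulate the fence completing */
        pthread_mutex_lock(&lock);
        flag = false;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, signal_completion, NULL);
        wait_for_completion();                 /* returns once 'flag' is cleared */
        pthread_join(t, NULL);
        puts("fence released");
        return 0;
    }

Build with -pthread; the waiter sleeps in the kernel instead of burning a usleep(100) loop.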