Re: [OMPI devel] Removing ORTE code
Based on silence plus today’s telecon, the stale code has been removed:
https://github.com/open-mpi/ompi/pull/5827

> On Sep 26, 2018, at 7:00 AM, Ralph H Castain wrote:
>
> We are considering a “purge” of stale ORTE code and want to know if anyone is
> using it before proceeding. With the advent of PMIx, several ORTE features
> are no longer required by OMPI itself. However, we acknowledge that it is
> possible that someone out there (e.g., a researcher) is using them. The
> specific features include:
>
> * OOB use from within an application process. We need to retain the OOB
>   itself for daemon-to-daemon communication. However, the application processes
>   no longer open a connection to their ORTE daemon, instead relying on the PMIx
>   connection to communicate their needs.
>
> * the DFS framework - allows an application process to access a remote file
>   via ORTE. It provided essentially a function-shipping service that was used
>   by map-reduce applications we no longer support
>
> * the notifier framework - supported output of messages to syslog and email.
>   PMIx now provides such services if someone wants to use them
>
> * iof/tool component - we are moving to PMIx for tool support, so there are
>   no ORTE tools using this any more
>
> We may discover additional candidates for removal as we go forward - we’ll
> update the list as we do. First, however, we’d really like to hear back from
> anyone who might have a need for any of the above.
>
> Please respond by Oct 5th
> Ralph
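For readers following along, here is a minimal C sketch of the kind of PMIx connection that application processes now rely on instead of opening an OOB link to their ORTE daemon. This is illustrative only, not code from the PR; it assumes a PMIx 2.x/3.x installation providing pmix.h, PMIx_Init, PMIx_Get, and the PMIX_JOB_SIZE key.

/* Minimal sketch of an application process talking to its local daemon
 * through PMIx rather than an ORTE OOB connection. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val;
    pmix_status_t rc;

    /* Connect to the local PMIx server hosted by the daemon */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %d\n", rc);
        exit(1);
    }

    /* Job-level queries travel over the same PMIx channel, e.g. job size */
    PMIX_PROC_CONSTRUCT(&wildcard);
    (void) strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;
    rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val);
    if (PMIX_SUCCESS == rc) {
        printf("rank %u of %u processes\n", myproc.rank, val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}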
Re: [OMPI devel] btl/vader: race condition in finalize on OS X
We already have the register_cleanup option in master - are you using an older version of PMIx that doesn’t support it?

> On Oct 2, 2018, at 4:05 AM, Jeff Squyres (jsquyres) via devel wrote:
>
> FYI: https://github.com/open-mpi/ompi/issues/5798 brought up what may be the
> same issue.
>
>> On Oct 2, 2018, at 3:16 AM, Gilles Gouaillardet wrote:
>>
>> Folks,
>>
>> When running a simple helloworld program on OS X, we can end up with the
>> following error message:
>>
>>   A system call failed during shared memory initialization that should
>>   not have. It is likely that your MPI job will now either abort or
>>   experience performance degradation.
>>
>>   Local host:  c7.kmc.kobe.rist.or.jp
>>   System call: unlink(2) /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
>>   Error:       No such file or directory (errno 2)
>>
>> The error does not occur on Linux by default since the vader segment is in
>> /dev/shm by default. The patch below can be used to reproduce the issue on
>> Linux:
>>
>> diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
>> index 115bceb..80fec05 100644
>> --- a/opal/mca/btl/vader/btl_vader_component.c
>> +++ b/opal/mca/btl/vader/btl_vader_component.c
>> @@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
>>                                             OPAL_INFO_LVL_3,
>>                                             MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
>>      OBJ_RELEASE(new_enum);
>> -    if (0 == access ("/dev/shm", W_OK)) {
>> +    if (0 && 0 == access ("/dev/shm", W_OK)) {
>>          mca_btl_vader_component.backing_directory = "/dev/shm";
>>      } else {
>>          mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;
>>
>> From my analysis, here is what happens:
>>
>> - each rank is supposed to have its own vader_segment unlinked by btl/vader
>>   in vader_finalize().
>>
>> - but this file might have already been destroyed by another task in
>>   orte_ess_base_app_finalize():
>>
>>       if (NULL == opal_pmix.register_cleanup) {
>>           orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>>       }
>>
>>   *all* the tasks end up removing
>>   opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")
>>
>> I am not really sure about the best way to fix this.
>>
>> - one option is to perform an intra-node barrier in vader_finalize()
>>
>> - another option would be to implement an opal_pmix.register_cleanup
>>
>> Any thoughts?
>>
>> Cheers,
>>
>> Gilles
>
> --
> Jeff Squyres
> jsquy...@cisco.com
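For context on the register_cleanup option Ralph mentions: the idea is to delegate removal of the session directory to the PMIx server hosted by the daemon, so no application process has to sweep shared files itself. The sketch below shows roughly what such a registration can look like at the raw PMIx level; it is an illustration under stated assumptions, not the actual OMPI glue code. It presumes a PMIx release that provides PMIx_Job_control_nb plus the PMIX_REGISTER_CLEANUP_DIR and PMIX_CLEANUP_RECURSIVE attributes, and register_session_dir_cleanup() is a hypothetical helper name.

/* Sketch: ask the local PMIx server (i.e. the daemon) to remove the
 * session directory once this process has exited. */
#include <stdbool.h>
#include <pmix.h>

static void cleanup_cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                           void *cbdata, pmix_release_cbfunc_t release_fn,
                           void *release_cbdata)
{
    /* the directives passed to PMIx_Job_control_nb must stay alive until
     * this callback fires, so they are freed here */
    pmix_info_t *directives = (pmix_info_t *) cbdata;
    PMIX_INFO_FREE(directives, 2);
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
    (void) status; (void) info; (void) ninfo;
}

static pmix_status_t register_session_dir_cleanup(const char *session_dir)
{
    pmix_info_t *directives;
    bool recurse = true;

    PMIX_INFO_CREATE(directives, 2);
    PMIX_INFO_LOAD(&directives[0], PMIX_REGISTER_CLEANUP_DIR, session_dir, PMIX_STRING);
    PMIX_INFO_LOAD(&directives[1], PMIX_CLEANUP_RECURSIVE, &recurse, PMIX_BOOL);

    /* NULL targets: let the server apply its default scope (assumed here
     * to mean the calling process) */
    return PMIx_Job_control_nb(NULL, 0, directives, 2, cleanup_cbfunc, directives);
}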
Re: [OMPI devel] btl/vader: race condition in finalize on OS X
FYI: https://github.com/open-mpi/ompi/issues/5798 brought up what may be the same issue.

> On Oct 2, 2018, at 3:16 AM, Gilles Gouaillardet wrote:
>
> Folks,
>
> When running a simple helloworld program on OS X, we can end up with the
> following error message:
>
>   A system call failed during shared memory initialization that should
>   not have. It is likely that your MPI job will now either abort or
>   experience performance degradation.
>
>   Local host:  c7.kmc.kobe.rist.or.jp
>   System call: unlink(2) /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
>   Error:       No such file or directory (errno 2)
>
> The error does not occur on Linux by default since the vader segment is in
> /dev/shm by default. The patch below can be used to reproduce the issue on
> Linux:
>
> diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
> index 115bceb..80fec05 100644
> --- a/opal/mca/btl/vader/btl_vader_component.c
> +++ b/opal/mca/btl/vader/btl_vader_component.c
> @@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
>                                             OPAL_INFO_LVL_3,
>                                             MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
>      OBJ_RELEASE(new_enum);
> -    if (0 == access ("/dev/shm", W_OK)) {
> +    if (0 && 0 == access ("/dev/shm", W_OK)) {
>          mca_btl_vader_component.backing_directory = "/dev/shm";
>      } else {
>          mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;
>
> From my analysis, here is what happens:
>
> - each rank is supposed to have its own vader_segment unlinked by btl/vader
>   in vader_finalize().
>
> - but this file might have already been destroyed by another task in
>   orte_ess_base_app_finalize():
>
>       if (NULL == opal_pmix.register_cleanup) {
>           orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>       }
>
>   *all* the tasks end up removing
>   opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")
>
> I am not really sure about the best way to fix this.
>
> - one option is to perform an intra-node barrier in vader_finalize()
>
> - another option would be to implement an opal_pmix.register_cleanup
>
> Any thoughts?
>
> Cheers,
>
> Gilles

--
Jeff Squyres
jsquy...@cisco.com
[OMPI devel] btl/vader: race condition in finalize on OS X
Folks,

When running a simple helloworld program on OS X, we can end up with the
following error message:

  A system call failed during shared memory initialization that should
  not have. It is likely that your MPI job will now either abort or
  experience performance degradation.

  Local host:  c7.kmc.kobe.rist.or.jp
  System call: unlink(2) /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
  Error:       No such file or directory (errno 2)

The error does not occur on Linux by default since the vader segment is in
/dev/shm by default. The patch below can be used to reproduce the issue on
Linux:

diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
index 115bceb..80fec05 100644
--- a/opal/mca/btl/vader/btl_vader_component.c
+++ b/opal/mca/btl/vader/btl_vader_component.c
@@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
                                            OPAL_INFO_LVL_3,
                                            MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
     OBJ_RELEASE(new_enum);
-    if (0 == access ("/dev/shm", W_OK)) {
+    if (0 && 0 == access ("/dev/shm", W_OK)) {
         mca_btl_vader_component.backing_directory = "/dev/shm";
     } else {
         mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;

From my analysis, here is what happens:

- each rank is supposed to have its own vader_segment unlinked by btl/vader
  in vader_finalize().

- but this file might have already been destroyed by another task in
  orte_ess_base_app_finalize():

      if (NULL == opal_pmix.register_cleanup) {
          orte_session_dir_finalize(ORTE_PROC_MY_NAME);
      }

  *all* the tasks end up removing
  opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")

I am not really sure about the best way to fix this.

- one option is to perform an intra-node barrier in vader_finalize()

- another option would be to implement an opal_pmix.register_cleanup

Any thoughts?

Cheers,

Gilles