Folks,
When running a simple helloworld program on OS X, we can end up with the
following error message:

    A system call failed during shared memory initialization that should
    not have. It is likely that your MPI job will now either abort or
    experience performance degradation.

      Local host:  c7.kmc.kobe.rist.or.jp
      System call: unlink(2)
        /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
      Error:       No such file or directory (errno 2)
The error does not occur on Linux by default, since the vader segment is
placed in /dev/shm there. The patch below forces the segment into the job
session directory, so the issue can be reproduced on Linux:
diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
index 115bceb..80fec05 100644
--- a/opal/mca/btl/vader/btl_vader_component.c
+++ b/opal/mca/btl/vader/btl_vader_component.c
@@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
                                 OPAL_INFO_LVL_3,
                                 MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
     OBJ_RELEASE(new_enum);
-    if (0 == access ("/dev/shm", W_OK)) {
+    if (0 && 0 == access ("/dev/shm", W_OK)) {
         mca_btl_vader_component.backing_directory = "/dev/shm";
     } else {
         mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;
From my analysis, here is what happens:
- each rank is supposed to have its own vader_segment unlinked by
  btl/vader in vader_finalize().
- but this file might have already been destroyed by another task in
  orte_ess_base_app_finalize():

      if (NULL == opal_pmix.register_cleanup) {
          orte_session_dir_finalize(ORTE_PROC_MY_NAME);
      }

  since opal_pmix.register_cleanup is NULL here, *all* the tasks end up
  calling opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1"), which
  removes the other ranks' segments as well as their own.
I am not really sure about the best way to fix this:
- one option is to perform an intra-node barrier in vader_finalize(), so
  no task tears down the session directory before every local rank has
  unlinked its own segment.
- another option would be to implement opal_pmix.register_cleanup, so
  orte_session_dir_finalize() is no longer invoked by every task.
Any thoughts?
Cheers,
Gilles
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel