Re: [OMPI devel] oshmem Fortran

2013-10-18 Thread Igor Ivanov

Hi Jeff,

Is there a naming convention for configure options in OMPI?
Do you have any objections to --enable-oshmem-xxx, or must they be 
replaced with --enable-shmem-xxx?


Regards,
Igor

On 17.10.2013 14:20, Jeff Squyres (jsquyres) wrote:

Mellanox --

In r29448, you deleted the comment without doing what it explicitly said to do. 
 For example, you can --disable-mpi-fortran --enable-oshmem-fortran and get a 
broken build.

Additionally, the shmem example in examples/ has several problems:

1. Why all the #defines?  This is supposed to be a trivial "hello world in shmem" 
program, not a test case.  Please make it the equivalent of "hello world".

2. The hello world shmem program does not follow the same naming conventions as 
the rest of the code in the examples/ directory.

3. There's no Fortran shmem example.

Adding C/Fortran shmem test cases to a test suite for MTT runs would be a very 
good thing.  Can they be added so that others can run shmem tests in an 
automated fashion?  Indeed, we have no proof of any shmem correctness right 
now; that makes me quite nervous...


On Oct 17, 2013, at 1:42 AM,  wrote:


Author: miked (Mike Dubman)
Date: 2013-10-17 01:42:43 EDT (Thu, 17 Oct 2013)
New Revision: 29448
URL: https://svn.open-mpi.org/trac/ompi/changeset/29448

Log:
add --enable-oshmem-fortran opt to configure

Text files modified:
   trunk/config/ompi_configure_options.m4   | 1 -
   trunk/config/oshmem_configure_options.m4 | 29 ++++++++++++-----------------
   2 files changed, 12 insertions(+), 18 deletions(-)

Modified: trunk/config/ompi_configure_options.m4
==============================================================================
--- trunk/config/ompi_configure_options.m4	Thu Oct 17 01:39:20 2013	(r29447)
+++ trunk/config/ompi_configure_options.m4	2013-10-17 01:42:43 EDT (Thu, 17 Oct 2013)	(r29448)
@@ -114,7 +114,6 @@
   [OMPI_WANT_FORTRAN_BINDINGS=1],
   [OMPI_WANT_FORTRAN_BINDINGS=0])

-
#
# MPI profiling
#

Modified: trunk/config/oshmem_configure_options.m4
==============================================================================
--- trunk/config/oshmem_configure_options.m4	Thu Oct 17 01:39:20 2013	(r29447)
+++ trunk/config/oshmem_configure_options.m4	2013-10-17 01:42:43 EDT (Thu, 17 Oct 2013)	(r29448)
@@ -79,26 +79,21 @@
fi
AM_CONDITIONAL(OSHMEM_PROFILING, test "$oshmem_profiling_support" = 1)

-# Whether to build the OpenShmem fortran support or not For the
-# moment, use the same value as was derived from --enable-mpi-fortra.
-# *This seems wrong*; someone should somehow unify these two
-# options... but the implications are complicated.
#
-# Option 1: make --enable-fortran that governs both MPI and shmem.
-# This has 2 implications:
-# - --enable-mpi-fortran needs to be maintained for at least the
-#   1.7/1.8 series
-# - what to do with --enable-mpi-cxx?  It should be made consistent --
-#   so make it --enable-cxx?
-#
-# Option 2: make separate --enable-oshmem-fortran.  This seems sucky,
-# though, because oshmem Fortran depends on a lot of MPI Fortran
-# infrastructure.  If it isin't there, then oshmem Fortran can't
-# built.
+# Fortran bindings
#
-# Option 3: ...? (something better than option 1/2?)
+AC_ARG_ENABLE(oshmem-fortran,
+AC_HELP_STRING([--enable-oshmem-fortran],
+   [enable OSHMEM Fortran bindings (default: enabled if Fortran compiler found)]))
+if test "$enable_oshmem_fortran" != "no"; then
+AC_MSG_RESULT([yes])
+OSHMEM_WANT_FORTRAN_BINDINGS=1
+else
+AC_MSG_RESULT([no])
+OSHMEM_WANT_FORTRAN_BINDINGS=0
+fi
+
AC_MSG_CHECKING([if want to build SHMEM fortran bindings])
-OSHMEM_WANT_FORTRAN_BINDINGS=$OMPI_WANT_FORTRAN_BINDINGS
AM_CONDITIONAL(OSHMEM_WANT_FORTRAN_BINDINGS,
 [test $OSHMEM_WANT_FORTRAN_BINDINGS -eq 1])
AS_IF([test $OSHMEM_WANT_FORTRAN_BINDINGS -eq 1],
___
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full






Re: [OMPI devel] oshmem Fortran

2013-10-18 Thread Igor Ivanov
As far as I know, OpenSHMEM is an effort to create a standardized SHMEM 
library for C and Fortran. SGI's SHMEM API is the baseline for the 
OpenSHMEM Specification 1.0.
Most OpenSHMEM API functions can be found in the various SHMEM API 
implementations, but some functions are specific to SGI's, Cray's, and 
other vendors' products.


The site www.shmem.org says that "The SHMEM™ Application Programming 
Interface (API) definition and the SHMEM trademark are the property of 
SGI."


Igor

On 18.10.2013 16:33, Jeff Squyres (jsquyres) wrote:

On Oct 18, 2013, at 1:18 AM, Igor Ivanov  wrote:


Is there a naming convention for configure options in OMPI?
Do you have any objections to --enable-oshmem-xxx, or must they be replaced 
with --enable-shmem-xxx?

Hmm.  Good question.

I don't know enough about shmem vs. open shmem to say.  Is the API the same?  
If the API is the same, then you make a good point -- perhaps we should replace 
all --with-oshmem-xxx and --enable-oshmem-xxx with --with-shmem-xxx and 
--enable-shmem-xxx, respectively.

Supporting this view, we have several --with-mpi-xxx switches (i.e., we don't 
have --with-ompi-xxx switches).  But this works because Open MPI is just an 
implementation of MPI.  Forgive my shmem ignorance: is the oshmem/ layer an 
implementation of Shmem?  Or an implementation of Open Shmem?  (is there a 
difference between the two?)





Re: [OMPI devel] shmem vs. oshmem

2013-10-25 Thread Igor Ivanov

Hi Jeff,

I would like to add a few notes inline.

Igor

On 25.10.2013 20:33, Jeff Squyres (jsquyres) wrote:

We had a few emails a little while ago, and decided that the branding should be 
"oshmem" because Open SHMEM is different than (original) SHMEM.

I notice that there's still:

- shmemcc / shmemfort / shmem_info / shmemrun
   --> should these all be "oshmem*" ?

- the examples are hello_shmem* and ring_shmem*
   --> should these all be "*_oshmem*" ?

These examples are not OpenSHMEM specific.


- there are header files named shmem*
   --> I'm guessing the names "shmem.h" and "shmem.fh" are mandated

The OpenSHMEM specification says:
>>>
10.1 Incorporating OpenSHMEM into Programs
C and C++ programs that use the OpenSHMEM library must
#include <shmem.h>
All Fortran OpenSHMEM programs should
include 'shmem.fh'
and Fortran OpenSHMEM programs that use constants defined by OpenSHMEM must
include 'shmem.fh'
10.1.1 Compatibility Note
Implementations must also provide these header files so that they can be 
referenced in C and C++ as
#include <mpp/shmem.h>
and in Fortran as
include 'mpp/shmem.fh'
for backward compatibility with OpenSHMEM 1.0 and SGI SHMEM.
<<<


   --> shmem_portable_platform.h.in should probably be 
oshmem_portable_platform.h.in, right?
   --> same for the internal headers shmem_api_logger.h and shmem_lock.h





Re: [OMPI devel] bug in mca framework?

2013-12-04 Thread Igor Ivanov
It is the first MCA variable of string type from btl/openib, 
'device_param_files'. If you disable it, you get the same failure on the 
second one.


Description of the case we see:
1. openib MCA variables are registered during startup, at the component 
selection phase;
2. but the cm component wins, and the openib MCA variables are 
deregistered as part of their MCA group;
3. the MCA variables are not removed from the global MCA array; they are 
marked as invalid and the memory for their strings is freed;

4. shmem needs openib for yoda and does the bml initialization;
5. openib MCA variables are registered again in "light mode": each one is 
looked up in the global array and its fields are refreshed;
6. for an unknown reason, bml finalization does not clean up these 
variables as is done in step 2;

7. mca_btl_openib.so is unloaded;
8. opal_finalize() destroys the MCA variables from the global array, 
encounters openib's variable, and tries to destroy it using an 
inaccessible address;


So the code under discussion fixes step 6.

Igor

On 03.12.2013 23:01, Jeff Squyres (jsquyres) wrote:

I don't think there is one -- you'll need to print it from the debugger.


On Dec 3, 2013, at 1:38 PM, Mike Dubman  wrote:


thanks
what magic "-mca base_verbose" param should print it?


On Tue, Dec 3, 2013 at 6:59 PM, Nathan Hjelm  wrote:
This usually happens when a string that belongs to the MCA system is freed
elsewhere. Can you find out the name of the variable that is being destructed
in frame 2?

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Dec 03, 2013 at 02:53:29PM +0200, Mike Dubman wrote:

Hi,
We observe a crash during shmem_finalize() (in trunk) with the new MCA
framework.
After investigation, we found that the MCA tear-down process can access
previously released memory (reproduced with the oshmem_hello_c.c test).
#0 0x7fffed3d51d0 in ?? ()
#1 
#2 0x7710e21e in var_destructor (var=0x6fa7e0) at
mca_base_var.c:1605
#3 0x7710ae99 in opal_obj_run_destructors (object=0x6fa7e0) at
../../../opal/class/opal_object.h:448
#4 0x7710ca18 in mca_base_var_finalize () at mca_base_var.c:954
#5 0x7710a7e2 in mca_base_param_finalize () at
mca_base_param.c:643
#6 0x770e08e2 in opal_finalize_util () at
runtime/opal_finalize.c:77
#7 0x77aa5319 in ompi_mpi_finalize () at
runtime/ompi_mpi_finalize.c:407
#8 0x77d900cc in oshmem_shmem_finalize () at
runtime/oshmem_shmem_finalize.c:75
#9 0x77d91119 in shmem_finalize () at shmem_finalize.c:24
#10 0x77d89b8f in __do_global_dtors_aux () from
/install/lib/libshmem.so.0
#11 0x in ?? ()
The crash can be resolved by following patch:
diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
index 9966627..48028d8 100644
--- a/opal/mca/base/mca_base_var.c
+++ b/opal/mca/base/mca_base_var.c
@@ -773,7 +773,7 @@ static int var_find_by_name (const char *full_name, int *index, bool invalidok)

 (void) var_get ((int)(uintptr_t) tmp, &var, false);

-if (invalidok || VAR_IS_VALID(var[0])) {
+if (VAR_IS_VALID(var[0])) {
 *index = (int)(uintptr_t) tmp;
 return OPAL_SUCCESS;
 }
I'm not sure we understand yet why it fixes the problem and where the
race is.
Could someone with knowledge of the MCA flows look at it and comment?
The "invalidok" flag was introduced by Jeff's commit.
Thanks
M
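
To make the effect of the one-line patch above easier to see, here is a minimal sketch of a name lookup with an "invalidok" flag. It is a toy model only: toy_var_t, toy_var_add and toy_var_find are hypothetical names invented for illustration and are not the structures or functions in mca_base_var.c, and the variable name in the usage note is just an example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy registry; NOT the real mca_base_var.c structures. */
#define TOY_MAX_VARS 8

typedef struct {
    const char *name;   /* full variable name */
    bool        valid;  /* cleared when the variable is deregistered */
} toy_var_t;

static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_var_count = 0;

/* Append a variable and return its index, or -1 if the table is full. */
static int toy_var_add(const char *name, bool valid)
{
    assert(name != NULL);
    if (toy_var_count >= TOY_MAX_VARS) {
        return -1;
    }
    toy_vars[toy_var_count].name = name;
    toy_vars[toy_var_count].valid = valid;
    return toy_var_count++;
}

/* Look a variable up by name and return its index, or -1 if not found.
 * With invalidok == true, deregistered entries (valid == false) are still
 * returned. With invalidok == false they are skipped, which is the
 * behavior the patch forces unconditionally. */
static int toy_var_find(const char *name, bool invalidok)
{
    assert(name != NULL);
    for (int i = 0; i < toy_var_count; ++i) {
        if (0 == strcmp(toy_vars[i].name, name) &&
            (invalidok || toy_vars[i].valid)) {
            return i;
        }
    }
    return -1;
}
```

In this model, after toy_var_add("btl_openib_device_param_files", false), a call to toy_var_find() with invalidok set to true returns index 0, while the same call with invalidok set to false returns -1. That is the sense in which the patch hides invalid entries from a later registration.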
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel








Re: [OMPI devel] bug in mca framework?

2013-12-04 Thread Igor Ivanov

On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:

On Dec 4, 2013, at 2:52 AM, Igor Ivanov  wrote:


It is the first MCA variable of string type from btl/openib, 
'device_param_files'. If you disable it, you get the same failure on the second 
one.

Description of the case we see:
1. openib MCA variables are registered during startup, at the component 
selection phase;
2. but the cm component wins, and the openib MCA variables are deregistered as 
part of their MCA group;
3. the MCA variables are not removed from the global MCA array; they are marked 
as invalid and the memory for their strings is freed;
4. shmem needs openib for yoda and does the bml initialization;
5. openib MCA variables are registered again in "light mode": each one is looked 
up in the global array and its fields are refreshed;

Can you explain what you mean by step 5?  I.e., what does "using light mode" 
mean?  Is the openib component register function invoked again?
That is correct, it is called twice. "Light mode" means that 
mca_base_var_register() does not allocate the MCA variable object again; 
it looks the variable up in the global array and, on finding it, updates 
the fields of the mca_base_var_t structure (at least mbv_storage).
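
The "light mode" behavior described here can be sketched as a find-or-create registration. This is a simplified illustration under stated assumptions, not the actual mca_base_var_register() code: reg_var_t, reg_register and reg_deregister are hypothetical names, and the storage field only stands in for the role of mbv_storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy registry; NOT the real mca_base_var_t. */
#define REG_MAX_VARS 8

typedef struct {
    const char *name;
    char      **storage;  /* where the component keeps the value (role of mbv_storage) */
    bool        valid;
} reg_var_t;

static reg_var_t reg_vars[REG_MAX_VARS];
static int reg_count = 0;

/* Register a variable and return its index, or -1 if the table is full.
 * First registration allocates a slot. A later registration under the
 * same name ("light mode") reuses the slot and only refreshes its fields. */
static int reg_register(const char *name, char **storage)
{
    assert(name != NULL && storage != NULL);
    for (int i = 0; i < reg_count; ++i) {
        if (0 == strcmp(reg_vars[i].name, name)) {
            reg_vars[i].storage = storage;  /* refresh, do not reallocate */
            reg_vars[i].valid = true;
            return i;
        }
    }
    if (reg_count >= REG_MAX_VARS) {
        return -1;
    }
    reg_vars[reg_count].name = name;
    reg_vars[reg_count].storage = storage;
    reg_vars[reg_count].valid = true;
    return reg_count++;
}

/* Deregister: the slot stays in the array, but is marked invalid and
 * loses its storage pointer. */
static void reg_deregister(int index)
{
    assert(index >= 0 && index < reg_count);
    reg_vars[index].valid = false;
    reg_vars[index].storage = NULL;
}
```

Registering, deregistering, and registering the same name again yields the same index both times and leaves reg_count at 1. Only the valid flag and the storage pointer change, which is why the second registration depends on the lookup being able to see the invalidated entry.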



6. for an unknown reason, bml finalization does not clean up these variables as 
is done in step 2;
7. mca_btl_openib.so is unloaded;
8. opal_finalize() destroys the MCA variables from the global array, encounters 
openib's variable, and tries to destroy it using an inaccessible address;

So the code under discussion fixes step 6.

Nathan: it sounds like an MCA var (and entire group) is registered, 
unregistered, and then registered again. Does the MCA var system get confused 
here when it tries to unregister the group a 2nd time?
The issue is probably related to incorrectly recognizing whether a 
variable is valid or invalid during the second call of 
mca_base_var_deregister().




Re: [OMPI devel] [EXTERNAL] Re: bug in mca framework?

2013-12-23 Thread Igor Ivanov

Brian,

Could you look at the patch based on your suggestion? It resolves the 
issue with the MCA variable.


Igor

On 18.12.2013 01:48, Barrett, Brian W wrote:

The proposed solution at the bottom is wrong.  There aren't two different
BMLs, there's one, and it lives in OMPI.

The solution is to open the bml and btls in ompi_mpi_init and not in the
pmls.  I checked, and the bml will deal with add_procs being called
multiple times on the same proc, so just moving the framework open / init
is sufficient.  This will also solve the MTL problem.
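
The property this relies on, that add_procs can safely be called several times for the same proc, can be illustrated with a toy model. The sketch below is hypothetical C and not the Open MPI BML API: toy_bml_t, toy_bml_has_proc and toy_bml_add_procs are invented names, and procs are reduced to integer ranks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical endpoint table; NOT the real BML data structures. */
#define TOY_MAX_PROCS 16

typedef struct {
    int    known[TOY_MAX_PROCS];  /* ranks that already have an endpoint */
    size_t count;
} toy_bml_t;

/* Return true if the rank was added earlier. */
static bool toy_bml_has_proc(const toy_bml_t *bml, int rank)
{
    assert(bml != NULL);
    for (size_t i = 0; i < bml->count; ++i) {
        if (bml->known[i] == rank) {
            return true;
        }
    }
    return false;
}

/* Add procs, skipping any that are already present, so that repeated
 * calls with the same ranks are harmless. Returns the number of newly
 * added procs, or -1 if the table would overflow. */
static int toy_bml_add_procs(toy_bml_t *bml, const int *ranks, size_t nranks)
{
    assert(bml != NULL && ranks != NULL);
    int added = 0;
    for (size_t i = 0; i < nranks; ++i) {
        if (toy_bml_has_proc(bml, ranks[i])) {
            continue;  /* already known: nothing to do */
        }
        if (bml->count >= TOY_MAX_PROCS) {
            return -1;
        }
        bml->known[bml->count++] = ranks[i];
        ++added;
    }
    return added;
}
```

Calling toy_bml_add_procs() twice with ranks {0, 1, 2} adds three entries the first time and none the second time. In this model, that is what lets two layers both call it without coordinating.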

Brian

On 12/17/13 8:33 AM, "Joshua Ladd"  wrote:


I believe Devendar Bureddy nailed the root cause. I am providing his
excellent analysis below:


From Devendar:

Out of curiosity I looked at this issue. Here are my 2 cents.
I think the issue is that the BTL components are opened and closed
twice (ompi_init, yoda), which leads to incorrect usage of var groups.
The following sequence of events creates the invalid memory:

1) All openib component parameters are registered in ompi_mpi_init:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> ompi_mpi_init ->
mca_base_framework_open -> mca_pml_base_open ... mca_bml_base_open ...
-> btl_openib_component_register()

*   For all string variables, it allocates a memory block
(var->mbv_storage = PTR)

At this time a new var group id: 114 (of parent group id: 112) is created
for all openib component variables.

2) This var group is deregistered in ompi_mpi_init. That marks all
variables as invalid, but the group and vars still exist:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> mca_pml_base_select
-> mca_base_components_close -> ... -> mca_bml_base_close ->
mca_base_framework_close -> mca_base_var_group_deregister(groupid: 114)

*   All string variable memory is deallocated (var->mbv_storage = NULL)

3) Because of step 2), the btl_openib.so shared lib is dlclosed.

4) Now we reopen openib in yoda and register the openib
variables again:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> _shmem_init ->
mca_base_framework_open -> mca_spml_base_open ->
mca_spml_yoda_component_open -> ... mca_bml_base_open ... ->
btl_openib_component_register -> register_variables()

*   In register_variables(), var_find() finds each variable (from the same
old group: 114) and resets it.
*   For string variables, it allocates the buffers again
(var->mbv_storage = PTR)
*   Note that group 114 does not belong to the yoda component.

5) On yoda component close, it never finds the above group (114) because
the group does not belong to this component. So
mca_base_var_group_deregister() is not called again on the var group, and
the string var memory is not deallocated:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> _shmem_init ->
mca_spml_base_select -> ... -> mca_spml_yoda_component_close ->
mca_bml_base_close -> mca_base_var_group_find().

6) Because of step 5), btl_openib.so is dlclosed. This step
invalidates all the openib string var memory (var->mbv_storage = PTR)
allocated in step 4).

7) ompi_mpi_finalize() loops through all vars, finalizes them,
and deallocates the string var memory (var->mbv_storage = PTR):
ompi_mpi_finalize -> ... -> mca_base_var_finalize

*   var->mbv_storage = PTR is invalid at this stage, which causes the
SEGFAULT.
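
The sequence above can be replayed in a small state model that never touches freed memory but records when the storage pointer would be dangling. This is a rough sketch under stated assumptions, not Open MPI code: lc_state_t, lc_open, lc_close and lc_storage_dangling are hypothetical names, group ownership is reduced to a string, a single variable stands in for all openib string variables, and a "load generation" counter stands in for dlopen/dlclose of btl_openib.so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-variable model of the lifecycle; NOT Open MPI code. */
typedef struct {
    const char *group_owner;  /* who created the var group (NULL = none yet) */
    bool        valid;        /* variable currently registered */
    bool        has_storage;  /* models var->mbv_storage != NULL */
    int         storage_gen;  /* DSO load generation that owns the storage */
    int         dso_gen;      /* bumped on every (re)load of the DSO */
    bool        dso_loaded;
} lc_state_t;

/* Open the component: load the DSO and register the variable. Only the
 * first registration creates the group; later ones reuse the old group
 * and just refresh the variable (steps 1 and 4). */
static void lc_open(lc_state_t *s, const char *opener)
{
    assert(s != NULL && opener != NULL);
    s->dso_loaded = true;
    s->dso_gen++;
    if (NULL == s->group_owner) {
        s->group_owner = opener;
    }
    s->valid = true;
    s->has_storage = true;
    s->storage_gen = s->dso_gen;
}

/* Close the component: deregister the group only if the closer owns it,
 * then unload the DSO regardless (steps 2-3 and 5-6). */
static void lc_close(lc_state_t *s, const char *closer)
{
    assert(s != NULL && closer != NULL);
    if (NULL != s->group_owner && 0 == strcmp(s->group_owner, closer)) {
        s->valid = false;
        s->has_storage = false;  /* string freed, pointer reset to NULL */
    }
    s->dso_loaded = false;
}

/* True if finalize would follow a pointer into an unloaded DSO (step 7). */
static bool lc_storage_dangling(const lc_state_t *s)
{
    assert(s != NULL);
    return s->has_storage &&
           !(s->dso_loaded && s->storage_gen == s->dso_gen);
}
```

Running lc_open/lc_close first as "ompi" and then as "yoda" leaves the variable valid with storage from an unloaded generation, so lc_storage_dangling() reports true. In this model, that is the state mca_base_var_finalize would encounter in step 7.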


This also explains why Dinar's patch, kostul_fix.patch
(http://bgate.mellanox.com/redmine/attachments/1643/kostul_fix.patch),
resolves the issue. His patch prevents you from finding the invalid,
already opened params.
So, I see that in a lot of these registration functions the signature has an
entry for the project name, but for now NULL is always passed. I see a note
by Nathan in

../opal/mca/base/mca_base_var.c +1311
{
/* XXX -- component_update -- We will stash the project name in the
component */
return mca_base_var_register (NULL, component->mca_type_name,


Seems knowing the project name, oshmem, would allow us to distinguish
between the different BMLs.

Nathan, please advise.

Josh


-Original Message-
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Monday, December 16, 2013 12:44 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] bug in mca framework?

On Mon, Dec 16, 2013 at 05:21:05PM +, Joshua Ladd wrote:

After speaking with Igor Ivanov about this this morning, he summarized
his findings as follows:

1. Valgrind comes up clean.

That's good to hear, but unfortunate, since this really seems like a
stomping-on-memory problem.


2. The issue is not reproduced with a static build.

This is a red herring. The variable itself contains garbage. The
mbv_storage pointer looked like it was on the stack, the name was not
valid, etc. Not sure how we got an mca_base_var_t into that state since
the only time we touch anything in them is in mca_base_var_finalize. That
function cleans up all of the state to two c

Re: [OMPI devel] [EXTERNAL] Re: bug in mca framework?

2014-01-15 Thread Igor Ivanov

Brian,

Sorry for the slow reaction.
I am not sure I understand your concern. Could you please clarify it 
and review the modified patch? (I have figured out the issue in my 
previous patch: the complete btl initialization needed for OSHMEM was 
missing when a PML component other than bfo or ob1 was selected.)


Igor

On 03.01.2014 00:04, Barrett, Brian W wrote:

Igor -

Sorry for the slow reply; I was on vacation for the last week and a half.

The patch doesn't look quite right to me.  If the cm PML is used, the spml
(or something else in the OSHMEM layer) is going to have to call add_procs
on the BML to initialize the procs arrays for the BTLs.

Brian


Re: [OMPI devel] [EXTERNAL] Re: bug in mca framework?

2014-01-17 Thread Igor Ivanov
I had assumed that BML add_procs() is called by the PML, and I see such a 
call in ompi_mpi_init(): "... ret = MCA_PML_CALL(add_procs(procs, 
nprocs)); ...". Moreover, BML add_procs() is called by the SPML (OSHMEM's 
PML) in oshmem_shmem_init().

So it looks like everything should be correct. Or am I still missing 
something?

Igor

On 16.01.2014 22:21, Barrett, Brian W wrote:

If a process is using the Portals 4 MTL and calls shmem_init, the BTLs
will be initialized properly, but as of right now, no one will call
add_procs() on the BML (which calls add_procs() on the BTLs).  So the
first shmem communication will fail, because the proc lookup will fail
inside the BTL.  If the MPI layer doesn't call add_procs(), someone else
has to.  In this case, that someone else is the OpenSHMEM layer.

Brian
