Folks,

This is a follow-up to a question from the users mailing list.


Is there any reason why plugins do not depend on the main Open MPI libraries (libopen-pal.so and libopen-rte.so, plus libompi.so and liboshmem.so if needed)?

I guess that would solve the issue here without having to use RTLD_GLOBAL.


Cheers,


Gilles



-------- Forwarded Message --------
Subject:        Re: [OMPI users] Problem with double shared library
Date:   Tue, 18 Oct 2016 10:45:42 +0900
From:   Gilles Gouaillardet <gil...@rist.or.jp>
To:     Open MPI Users <us...@lists.open-mpi.org>



Sean,


If I understand correctly, you built a libtransport_mpi.so library that depends on Open MPI, and your main program dlopens libtransport_mpi.so.

In this case, and at least for the time being, you need to use RTLD_GLOBAL in your dlopen flags.
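
For reference, here is a minimal sketch of that workaround on the application side, assuming the main program loads the transport by name as described above; the "transport_init" entry point is only a hypothetical placeholder:

    /* Load the MPI transport with RTLD_GLOBAL so that the Open MPI core
     * libraries it pulls in export their symbols (opal_show_help and
     * friends) to the global scope, where Open MPI's dlopen()ed MCA
     * plugins can resolve them. */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *handle = dlopen("libtransport_mpi.so", RTLD_NOW | RTLD_GLOBAL);
        if (handle == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        /* "transport_init" stands in for whatever entry point the
         * transport actually exposes. */
        int (*transport_init)(void) = (int (*)(void)) dlsym(handle, "transport_init");
        if (transport_init != NULL) {
            transport_init();
        }

        /* Keep the handle open for the lifetime of the MPI job. */
        return 0;
    }

(link with -ldl on Linux)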


Cheers,


Gilles


On 10/18/2016 4:53 AM, Sean Ahern wrote:
Folks,

For our code, we have a communication layer that abstracts the code that does the actual transfer of data. We call these "transports", and we link them as shared libraries. We have created an MPI transport that compiles/links against OpenMPI 2.0.1 using the compiler wrappers. When I compile OpenMPI with the --disable-dlopen option (thus cramming all of OpenMPI's plugins into the MPI library directly), things work great with our transport shared library. But when I have a "normal" OpenMPI (without --disable-dlopen) and create the same transport shared library, things fail. Upon launch, it appears that OpenMPI is unable to find the appropriate plugins:

    [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_patcher_overwrite:
    /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_patcher_overwrite.so:
    undefined symbol: mca_patcher_base_patch_t_class (ignored)
    [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_mmap:
    /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_mmap.so:
    undefined symbol: opal_show_help (ignored)
    [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_posix:
    /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_posix.so:
    undefined symbol: opal_show_help (ignored)
    [hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to open mca_shmem_sysv:
    /home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_sysv.so:
    undefined symbol: opal_show_help (ignored)
    --------------------------------------------------------------------------
    It looks like opal_init failed for some reason; your parallel process is
    likely to abort.  There are many reasons that a parallel process can
    fail during opal_init; some of which are due to configuration or
    environment problems.  This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      opal_shmem_base_select failed
      --> Returned value -1 instead of OPAL_SUCCESS
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    It looks like orte_init failed for some reason; your parallel process is
    likely to abort.  There are many reasons that a parallel process can
    fail during orte_init; some of which are due to configuration or
    environment problems.  This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      opal_init failed
      --> Returned value Error (-1) instead of ORTE_SUCCESS
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    It looks like MPI_INIT failed for some reason; your parallel process is
    likely to abort.  There are many reasons that a parallel process can
    fail during MPI_INIT; some of which are due to configuration or
    environment problems.  This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      ompi_mpi_init: ompi_rte_init failed
      --> Returned "Error" (-1) instead of "Success" (0)

If I skip our shared libraries and instead write a standard MPI-based "hello, world" program that links against MPI directly (without --disable-dlopen), everything is again fine.
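
For reference, the "hello, world" test is nothing more than the usual minimal MPI program, built with mpicc, along these lines:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank = 0, size = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("hello, world from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }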

It seems that the double dlopen is causing problems with OpenMPI finding its own shared libraries.
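
A rough way to check this from the host application on Linux/glibc is to see whether an Open MPI core symbol such as opal_show_help is visible in the global scope after the transport has been dlopen'ed; the transport name is the one from this thread, everything else is just a sketch:

    #define _GNU_SOURCE  /* for RTLD_DEFAULT on glibc */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* Without RTLD_GLOBAL, dlopen defaults to RTLD_LOCAL, so the Open MPI
         * libraries pulled in by the transport stay out of the global scope. */
        void *transport = dlopen("libtransport_mpi.so", RTLD_NOW);
        if (transport == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        /* Expected to print "no" here, and "yes" if the transport had been
         * loaded with RTLD_GLOBAL; "no" is exactly the situation the MCA
         * plugins run into when they try to resolve opal_show_help. */
        void *sym = dlsym(RTLD_DEFAULT, "opal_show_help");
        printf("opal_show_help visible in global scope: %s\n", sym != NULL ? "yes" : "no");
        return 0;
    }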

Note: I do have LD_LIBRARY_PATH pointing to …"openmpi-2.0.1/lib", as well as OPAL_PREFIX pointing to …"openmpi-2.0.1".

Any thoughts about how I can try to tease out what's going wrong here?

-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883


