Edgar,

I checked the various release branches, and I think this issue was
fixed by 
https://github.com/open-mpi/ompi/commit/ccf76b779130e065de326f71fe6bac868c565300

This was backported into the v3.0.x branch before the v3.1.x branch
was created, so v3.1.x should have it as well.

It has *not* been backported into the v2.x series; as far as I am
concerned, backporting it would fix the abstraction violation I
mentioned earlier.

I noted the fcoll framework is opened in mca_io_base_file_select(), so
another way (a bit convoluted imho, but it could require fewer changes)
could be to open the framework in the io/ompio component.
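
For anyone triaging similar output, here is a minimal sketch (not an
Open MPI tool) that tabulates which component failed on which symbol,
given the ompi_info warnings quoted below. The function name
undefined_symbols is hypothetical, and the pattern assumes only the
two message styles that appear in this thread (1.10.x and 2.1.x); other
releases may word the warning differently.

```python
import re

# Assumed message styles, taken from the warnings quoted in this thread:
#   1.10.x: "... component_find: unable to open <dir>/mca_X: <dir>/mca_X.so:
#            undefined symbol: SYM (ignored)"
#   2.1.x:  "mca_base_component_repository_open: unable to open mca_X:
#            <dir>/mca_X.so: undefined symbol: SYM (ignored)"
PATTERN = re.compile(
    r"unable to open\s+(?:\S*/)?(mca_\w+):.*?undefined symbol:\s*(\w+)"
)


def undefined_symbols(text):
    """Return (component, symbol) pairs found in ompi_info warning text."""
    # Mail clients and terminals re-wrap long lines, so collapse all
    # whitespace (including newlines) before matching.
    flat = " ".join(text.split())
    return PATTERN.findall(flat)
```

Calling undefined_symbols() on the captured stderr of ompi_info gives a
list of pairs; a symbol prefix such as mca_io_ompio_ then hints at which
other component is expected to provide it.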


Cheers,

Gilles
On Sat, Jun 9, 2018 at 7:59 AM Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>
> I wanted to add one item before I forget (although I agree with what Jeff
> said): The error messages shown remind me of the problem that we had with
> ompio in the 1.8/1.10 series when the RTLD_GLOBAL option was not correctly
> set. However, that was fixed in the 2.0 series and going forward, so if
> that shows up with later releases, it might be an indication of something
> else.
>
> Edgar
>
> > -----Original Message-----
> > From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres) via devel
> > Sent: Friday, June 8, 2018 4:54 PM
> > To: Open MPI Developers List <devel@lists.open-mpi.org>
> > Cc: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> > Subject: Re: [OMPI devel] Shared object dependencies
> >
> > Before digging any deeper, did you perchance install multiple versions
> > of Open MPI into the same prefix?
> >
> > If so, remember that Open MPI installs lots of plugins.  The exact set
> > of plugins changes every release.  So if you install version A.B.C into
> > /opt/openmpi, and then install version X.Y.Z into /opt/openmpi, note
> > that the installation of X.Y.Z did not *uninstall* A.B.C first.  Hence,
> > you might still have some stale A.B.C components in the tree that Open
> > MPI X.Y.Z may try to open.  Since the underlying libraries that these
> > plugins use have now been upgraded to X.Y.Z, the stale A.B.C component
> > may (and likely will) fail to open.
> >
> > If that's not what is happening, let us know and we can dig deeper.
> >
> >
> > > On Jun 8, 2018, at 5:37 PM, Tyson Whitehead <twhiteh...@gmail.com> wrote:
> > >
> > > > This email starts out talking about version 1.10.7 to give a complete
> > > > picture.  I tested 2.1.3 as well; it also exhibits this issue, although
> > > > to a lesser extent, and that is the release I am asking for help with.
> > >
> > > > I was compiling the OpenMPI 1.10.7 shipped with NixOS against a newer
> > > > libibverbs with a large set of drivers and got some strange errors
> > > > when running ompi_info (I've replaced the common prefix
> > > > /nix/store/9zm0pqsh67fw0xi5cpnybnd7hgzryffs-openmpi-1.10.7 with ...)
> > >
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_btl_openib: .../lib/openmpi/mca_btl_openib.so:
> > > > undefined symbol: mca_mpool_grdma_evict (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_individual:
> > > > .../lib/openmpi/mca_fcoll_individual.so: undefined symbol:
> > > > mca_io_ompio_file_write (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_ylib: .../lib/openmpi/mca_fcoll_ylib.so:
> > > > undefined symbol: ompi_io_ompio_scatter_data (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_dynamic:
> > > > .../lib/openmpi/mca_fcoll_dynamic.so: undefined symbol:
> > > > ompi_io_ompio_allgatherv_array (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_two_phase:
> > > > .../lib/openmpi/mca_fcoll_two_phase.so: undefined symbol:
> > > > ompi_io_ompio_set_aggregator_props (ignored)
> > > > [mon241:04077] mca: base: component_find: unable to open
> > > > .../lib/openmpi/mca_fcoll_static: .../lib/openmpi/mca_fcoll_static.so:
> > > > undefined symbol: ompi_io_ompio_allgather_array (ignored)
> > > >                Package: Open MPI nixbld@ Distribution
> > > >               Open MPI: 1.10.7
> > > > Open MPI repo revision: v1.10.6-48-g5e373bf
> > > >  Open MPI release date: May 16, 2017
> > > >               Open RTE: 1.10.7
> > > > Open RTE repo revision: v1.10.6-48-g5e373bf
> > > >  Open RTE release date: May 16, 2017
> > > >                   OPAL: 1.10.7
> > > >     OPAL repo revision: v1.10.6-48-g5e373bf
> > > >      OPAL release date: May 16, 2017
> > > ...
> > >
> > > > I dug into the first of these (figured out what library provided it,
> > > > looked at the declared dependencies, poked around in the automake
> > > > file), and, as far as I could determine, it seems that
> > > > mca_btl_openib.so simply isn't linked to list mca_mpool_grdma.so
> > > > (which provides the symbol) as a dependency.
> > >
> > > > Seeing as 1.10.7 is no longer supported, I figured I would try 2.1.3
> > > > in case this has been fixed.  I compiled it up as well, and it seems
> > > > all but the mca_fcoll_individual one have been resolved (I've replaced
> > > > /nix/store/4kh0zbn8pmdqhvwagicswg70rwnpm570-openmpi-2.1.3 with ...)
> > >
> > > > [mon241:05544] mca_base_component_repository_open: unable to open
> > > > mca_fcoll_individual: .../lib/openmpi/mca_fcoll_individual.so:
> > > > undefined symbol: ompio_io_ompio_file_read (ignored)
> > > >                Package: Open MPI nixbld@ Distribution
> > > >               Open MPI: 2.1.3
> > > > Open MPI repo revision: v2.1.2-129-gcfd8f3f
> > > >  Open MPI release date: Mar 13, 2018
> > > >               Open RTE: 2.1.3
> > > > Open RTE repo revision: v2.1.2-129-gcfd8f3f
> > > >  Open RTE release date: Mar 13, 2018
> > > >                   OPAL: 2.1.3
> > > >     OPAL repo revision: v2.1.2-129-gcfd8f3f
> > > >      OPAL release date: Mar 13, 2018
> > > ...
> > >
> > > > Again, I was able to find this symbol in the mca_io_ompio.so library.
> > > > I looked through the source again, and it seems pretty clear that the
> > > > function is indeed called, but the library isn't linked to list the
> > > > mca_io_ompio.so library as a dependency.
> > >
> > > > Looking through the various shared libraries in the .../lib/openmpi
> > > > directory, though, it seems none of them have dependencies on each
> > > > other.  How is this supposed to work?  Is the component library just
> > > > supposed to load everything so all symbols get resolved?  Is the above
> > > > error I'm seeing really an error, then?
> > >
> > > Any insight would be appreciated.
> > >
> > > Thanks!  -Tyson
> > >
> > > > PS:  Please note that the openmpi code was compiled without any
> > > > patches and without any special configure flags other than
> > > > --prefix=.... (NixOS also adds --disable-static and
> > > > --disable-dependency-tracking by default, but I removed those and it
> > > > didn't make a difference).
> > > _______________________________________________
> > > devel mailing list
> > > devel@lists.open-mpi.org
> > > https://lists.open-mpi.org/mailman/listinfo/devel
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
