I don't think this will be a problem. We are now setting the flags correctly and doing a dlopen, which should enable the components to find everything in libmpi.so. If I remember correctly, this new change would simply mean that all of the components are compiled in a consistent way.
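A minimal sketch of what that kind of preload can look like from the Python side, before the MPI extension module is imported (the library-name lookup below is an assumption; mpi4py's actual handling is more involved):

    import ctypes
    import ctypes.util

    # Load libmpi with RTLD_GLOBAL so its symbols are visible to the
    # components that Open MPI later dlopens.  The name lookup here is a
    # guess; the real library name/path is installation-specific.
    libmpi_name = ctypes.util.find_library("mpi") or "libmpi.so"
    libmpi = ctypes.CDLL(libmpi_name, mode=ctypes.RTLD_GLOBAL)

The important part is RTLD_GLOBAL: it puts libmpi's symbols into the global scope so the dlopen'ed components can resolve them.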
I will run this by Lisandro and see what he thinks though. If you don't hear back from us within a day, assume everything is fine.

Brian

On Dec 10, 2007 10:13 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Oct 16, 2007, at 11:20 AM, Brian Granger wrote:
>
> > Wow, that is quite a study of the different options. I will spend
> > some time looking over things to better understand the (complex)
> > situation. I will also talk with Lisandro Dalcin about what he thinks
> > the best approach is for mpi4py.
>
> Brian / Lisandro --
>
> I don't think that I heard back from you on this issue. Would you
> have major heartburn if I remove all linking of our components against
> libmpi (etc.)?
>
> (for a nicely-formatted refresher of the issues, check out
> https://svn.open-mpi.org/trac/ompi/wiki/Linkers)
>
> Thanks.
>
> > One question though. You said that
> > nothing had changed in this respect from 1.2.3 to 1.2.4, but 1.2.3
> > doesn't show the problem. Does this make sense?
> >
> > Brian
> >
> > On 10/16/07, Jeff Squyres <jsquy...@cisco.com> wrote:
> >> On Oct 12, 2007, at 3:5 PM, Brian Granger wrote:
> >>>> My guess is that Rmpi is dynamically loading libmpi.so, but not
> >>>> specifying the RTLD_GLOBAL flag. This means that libmpi.so is not
> >>>> available to the components the way it should be, and all goes
> >>>> downhill from there. It only mostly works because we do something
> >>>> silly with how we link most of our components, and Linux is just
> >>>> smart enough to cover our rears (thankfully).
> >>>
> >>> In mpi4py, libmpi.so is linked in at compile time, not loaded using
> >>> dlopen. Granted, the resulting mpi4py binary is loaded into python
> >>> using dlopen.
> >>
> >> I believe that means that libmpi.so will be loaded as an indirect
> >> dependency of mpi4py. See the table below.
> >>
> >>>> The pt2pt component (rightly) does not have a -lmpi in its link
> >>>> line. The other components that use symbols in libmpi.so (wrongly)
> >>>> do have a -lmpi in their link line. This can cause some problems on
> >>>> some platforms (Linux tends to do dynamic linking / dynamic loading
> >>>> better than most). That's why only the pt2pt component fails.
> >>>
> >>> Did this change from 1.2.3 to 1.2.4?
> >>
> >> No:
> >>
> >> % diff openmpi-1.2.3/ompi/mca/osc/pt2pt/Makefile.am openmpi-1.2.4/ompi/mca/osc/pt2pt/Makefile.am
> >> %
> >>
> >>>> Solutions:
> >>>>
> >>>> - Someone could make the pt2pt osc component link in libmpi.so
> >>>>   like the rest of the components and hope that no one ever
> >>>>   tries this on a non-friendly platform.
> >>>
> >>> Shouldn't the openmpi build system be able to figure this stuff out
> >>> on a per-platform basis?
> >>
> >> I believe that this would not be useful -- see the tables and
> >> conclusions below.
> >>
> >>>> - Debian (and all Rmpi users) could configure Open MPI with the
> >>>>   --disable-dlopen flag and ignore the problem.
> >>>
> >>> Are there disadvantages to this approach?
> >>
> >> You won't be able to add more OMPI components to your existing
> >> installation (e.g., 3rd party components). But that's probably ok,
> >> at least for now -- not many people are distributing 3rd party OMPI
> >> components.
> >>
> >>>> - Someone could fix Rmpi to dlopen libmpi.so with the RTLD_GLOBAL
> >>>>   flag and fix the problem properly.
> >>>
> >>> Again, my main problem with this solution is that it means I must
> >>> both link to libmpi at compile time and load it dynamically using
> >>> dlopen. This doesn't seem right.
> >>> Also, it makes it impossible on OS X to avoid setting
> >>> LD_LIBRARY_PATH (OS X doesn't have rpath). Being able to use
> >>> openmpi without setting LD_LIBRARY_PATH is important.
> >>
> >> This is a very complex issue. Here are the possibilities that I see...
> >> (prepare for confusion!)
> >>
> >> =====================================================================
> >>
> >> This first table represents what happens in the following scenarios:
> >>
> >> - compile an application against Open MPI's libmpi, or
> >> - compile an "application" DSO that is dlopen'ed with RTLD_GLOBAL, or
> >> - explicitly dlopen Open MPI's libmpi with RTLD_GLOBAL
> >>
> >>                                          OMPI DSO
> >>                 libmpi       OMPI DSO    components
> >> App linked      includes     components  depend on
> >> against         components?  available?  libmpi.so?  Result
> >> --------------  -----------  ----------  ----------  ----------
> >>  1. libmpi.so   no           no          NA          won't run
> >>  2. libmpi.so   no           yes         no          yes
> >>  3. libmpi.so   no           yes         yes         yes (*1*)
> >>  4. libmpi.so   yes          no          NA          yes
> >>  5. libmpi.so   yes          yes         no          maybe (*2*)
> >>  6. libmpi.so   yes          yes         yes         maybe (*3*)
> >> --------------  -----------  ----------  ----------  ----------
> >>  7. libmpi.a    no           no          NA          won't run
> >>  8. libmpi.a    no           yes         no          yes (*4*)
> >>  9. libmpi.a    no           yes         yes         no (*5*)
> >> 10. libmpi.a    yes          no          NA          yes
> >> 11. libmpi.a    yes          yes         no          maybe (*6*)
> >> 12. libmpi.a    yes          yes         yes         no (*7*)
> >> --------------  -----------  ----------  ----------  ----------
> >>
> >> All libmpi.a scenarios assume that libmpi.so is also available.
> >>
> >> In the OMPI v1.2 series, most components link against libmpi.so, but
> >> some do not (it's our mistake for not being uniform).
> >>
> >> (*1*) As far as we know, this works on all platforms that have dlopen
> >> (i.e., almost everywhere). But we've only tested (recently) Linux,
> >> OSX, and Solaris. These 3 dynamic loaders are smart enough to realize
> >> that they only need to load libmpi.so once (i.e., that the implicit
> >> dependency on libmpi.so brought in by the components is the same
> >> libmpi.so that is already loaded), so everything works fine.
> >>
> >> (*2*) If the *same* component is both in libmpi and available as a
> >> DSO, the same symbols will be defined twice when the component is
> >> dlopen'ed and Badness will ensue. If the components are different,
> >> all platforms should be ok.
> >>
> >> (*3*) Same caveat as (*2*) about a component being both in libmpi and
> >> available as a DSO. Same as (*1*) for whether libmpi.so is loaded
> >> multiple times by the dynamic loader or not.
> >>
> >> (*4*) Only works if the application was compiled with the equivalent
> >> of the GNU linker's --whole-archive flag.
> >>
> >> (*5*) This does not work because libmpi.a will be loaded and
> >> libmpi.so will also be pulled in as a dependency of the components.
> >> As such, all the data structures in libmpi will [attempt to] be in
> >> the process twice: the "main libmpi" will have one set and the libmpi
> >> pulled in by the component dependencies will have a different set.
> >> Nothing good will come of that: possibly dynamic linker run-time
> >> symbol conflicts or possibly two separate copies of the symbols.
> >> Both possibilities are Bad.
> >>
> >> (*6*) Same caveat as (*2*) about a component being both in libmpi and
> >> available as a DSO.
> >>
> >> (*7*) Same problem as (*5*).
> >>
> >> =====================================================================
> >>
> >> This second table represents what happens in the following scenarios:
> >>
> >> - compile an "application" DSO that is dlopen'ed with RTLD_LOCAL, or
> >> - explicitly dlopen Open MPI's libmpi with RTLD_LOCAL
> >>
> >>                                          OMPI DSO
> >> App             libmpi       OMPI DSO    components
> >> DSO linked      includes     components  depend on
> >> against         components?  available?  libmpi.so?  Result
> >> --------------  -----------  ----------  ----------  ----------
> >> 13. libmpi.so   no           no          NA          won't run
> >> 14. libmpi.so   no           yes         no          no (*8*)
> >> 15. libmpi.so   no           yes         yes         maybe (*9*)
> >> 16. libmpi.so   yes          no          NA          ok
> >> 17. libmpi.so   yes          yes         no          no (*10*)
> >> 18. libmpi.so   yes          yes         yes         maybe (*11*)
> >> --------------  -----------  ----------  ----------  ----------
> >> 19. libmpi.a    no           no          NA          won't run
> >> 20. libmpi.a    no           yes         no          no (*12*)
> >> 21. libmpi.a    no           yes         yes         no (*13*)
> >> 22. libmpi.a    yes          no          NA          ok
> >> 23. libmpi.a    yes          yes         no          no (*14*)
> >> 24. libmpi.a    yes          yes         yes         no (*15*)
> >> --------------  -----------  ----------  ----------  ----------
> >>
> >> All libmpi.a scenarios assume that libmpi.so is also available.
> >>
> >> (*8*) This does not work because the OMPI DSOs require symbols in
> >> libmpi that will not be able to be found because libmpi.so was not
> >> loaded in the global scope.
> >>
> >> (*9*) This is a fun case: the Linux dynamic linker is smart enough to
> >> make it work, but others likely will not. What happens is that
> >> libmpi.so is loaded in a LOCAL scope, but then OMPI dlopens its own
> >> DSOs that require symbols from libmpi. The Linux linker figures this
> >> out and resolves the required symbols from the already-loaded LOCAL
> >> libmpi.so. Other linkers will fail to figure out that there is a
> >> libmpi.so already loaded in the process and will therefore load a 2nd
> >> copy. This results in the problems cited in (*5*).
> >>
> >> (*10*) This does not work either a) because of the caveat stated in
> >> (*2*) or b) because of the unresolved symbol issue stated in (*8*).
> >>
> >> (*11*) This may not work either because of the caveat stated in (*2*)
> >> or because of the duplicate libmpi.so issue cited in (*9*). If you
> >> are using the Linux linker, then (*9*) is not an issue, and it should
> >> work.
> >>
> >> (*12*) Essentially the same as the unresolved symbol issue cited in
> >> (*8*), but with libmpi.a instead of libmpi.so.
> >>
> >> (*13*) Worse than (*9*); the Linux linker will not figure this one
> >> out because the libmpi symbols are not part of "libmpi" -- they are
> >> simply part of the application DSO, and therefore there's no way for
> >> the linker to know that by loading libmpi.so, it's going to be
> >> loading a 2nd set of the same symbols that are already in the
> >> process. Hence, we devolve down to the duplicate symbol issue cited
> >> in (*5*).
> >>
> >> (*14*) This does not work either a) because of the caveat stated in
> >> (*2*) or b) because of the unresolved symbol issue stated in (*8*).
> >>
> >> (*15*) This may not work either because of the caveat stated in (*2*)
> >> or because of the duplicate libmpi.so issue cited in (*13*).
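The RTLD_LOCAL table above is exactly the case a Python host creates: CPython normally loads C extension modules without RTLD_GLOBAL, so libmpi.so arrives as a private dependency of the extension (scenarios 14/15). A sketch, assuming a Unix Python with sys.setdlopenflags available, of one way to move back into the first table by raising the interpreter's dlopen flags before the extension is imported (the os.RTLD_* constants are the modern spelling; older Pythons exposed them via the DLFCN module, and the import shown is only a placeholder):

    import os
    import sys

    # Not mpi4py's actual code: raise Python's dlopen flags so that the
    # next C extension import (and anything it pulls in, e.g. libmpi.so)
    # is loaded with RTLD_GLOBAL instead of the default local scope.
    saved_flags = sys.getdlopenflags()
    sys.setdlopenflags(os.RTLD_NOW | os.RTLD_GLOBAL)
    try:
        # import the MPI extension module here, e.g. "from mpi4py import MPI";
        # "pass" keeps this sketch runnable without any MPI installation
        pass
    finally:
        sys.setdlopenflags(saved_flags)  # restore the interpreter default

Either this or the explicit ctypes preload sketched earlier amounts to the same thing: getting libmpi.so into the global symbol scope before MPI_INIT runs.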
> >>
> >> =====================================================================
> >>
> >> (I'm going to put this data on the OMPI web site somewhere because it
> >> took me all day yesterday to get it straight in my head and type it
> >> out :-) )
> >>
> >> In the OMPI v1.2 series, most OMPI configurations fall into scenarios
> >> 2 and 3 (as I mentioned above, we have some components that link
> >> against libmpi and others that don't -- our mistake for not being
> >> consistent).
> >>
> >> The problematic scenario that the R and Python MPI plugins are
> >> running into is 14, because the osc_pt2pt component does *not* link
> >> against libmpi. Most of the rest of our components do link against
> >> libmpi, and therefore fall into scenario 15 and work on Linux (but
> >> possibly not elsewhere).
> >>
> >> With all this being said, if you are looking for a general solution
> >> for the Python and R plugins, dlopen() of libmpi with RTLD_GLOBAL
> >> before MPI_INIT seems to be the way to go. Specifically, even if we
> >> updated osc_pt2pt to link against libmpi, that would work on Linux,
> >> but not elsewhere. dlopen'ing libmpi with GLOBAL seems to be the
> >> most portable solution.
> >>
> >> Indeed, table 1 also suggests that we should change our components
> >> (as Brian suggests) to all *not* link against libmpi, because then
> >> we'll gain the ability to work properly with a static libmpi.a,
> >> putting OMPI's common usage into scenarios 2 and 8 (which is better
> >> than the 2, 3, 8, and 9 scenarios that are used today, which means we
> >> don't work with libmpi.a).
> >>
> >> ...but I think that this would break the current R and Python plugins
> >> until they put in the explicit call to dlopen().
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
>
> --
> Jeff Squyres
> Cisco Systems