Ralph and I just chatted about this on the phone. I think I understand his position better now.

Just to be clear and put some context into this conversation:

1. PSM (aka "PSM1") supports TrueScale Intel networks
2. PSM2 supports OmniScale Intel networks

------

The following three solutions are more-or-less equivalent:

a. add "mtl = ^psm2" in the openmpi-mca-params.conf file (George's proposal)
b. configure --without-psm2 (similar to George's proposal)
c. we release 1.10.1 with no PSM2 MTL (Ralph's proposal)

In all 3 cases, the OmniScale end user will not have support for their network (and will likely fall back to TCP?).  TrueScale users are unaffected.

Technically, there's a 4th solution (proposed by Red Hat): the distro provides 2 different Open MPI installations -- one for (everything+PSM1), another for (everything+PSM2).

I agree that this is (very) undesirable.  In this case, *all* users are penalized -- not just TrueScale/OmniScale users -- because all users will now wonder "Which Open MPI should I use?" (even if they're not TS/OS users, and it doesn't matter which one they use, they still have to expend unnecessary mental energy trying to understand why there are two, and which they should use).  Meh.

Hence, we're back to the three possible "more-or-less equivalent" solutions: a, b, or c.

I say "more-or-less" because there *is* a semantic difference between a/b and c:

1. For a/b: packagers are responsible for the solution, and also responsible for *documenting* the solution (so that OmniScale users can figure out why they are getting lousy performance).
2. For c: Open MPI is responsible for the solution; we'll likely note in NEWS that PSM2 support was removed.

Hence, for the "let's release 1.10.1 without PSM2" solution, users have a (potentially) easier way of figuring out why they're not getting good performance.

That being said:

1. I'm not 100% convinced that users will go to the NEWS file to figure out why they're not getting good performance.
True, it's our officially-sanctioned method for publishing information to users, but I don't think that it's the first place that comes to mind when you're diagnosing a performance problem.

2. It seems like we have handled this kind of situation differently in the past.

2a. E.g., when we had the hcol/ml conflict, we asked Mellanox for a solution.  They promised to release a new libhcol that fixed the problem, and in the meantime, told their customers to get the Mellanox Open MPI from mellanox.com, which fixed the problem immediately.

2b. Similarly, Cisco distributed its own Cisco Open MPI when we wanted to have libfabric support in the Open MPI v1.8.x series.

2c. This case is not entirely the same as the above two examples, but I think it's similar in spirit: a distro is trying to be all-inclusive with other freely-distributed software in that distro (i.e., both PSM1 and PSM2), and a vendor-specific issue is causing a problem with that plan.

3. I therefore think we should take the same approach that we have taken with other vendors in the past:

3a. Red Hat (and other packagers) can do whatever they need to do to package Open MPI 1.10.0.  In this case, Red Hat is asking our advice as to how to package it (because they include both PSM1 and PSM2 support in their distro, and this creates a conflict in Open MPI).

==> My $0.02: we should tell Red Hat to build --without-psm2, because then users can see that "ompi_info | grep psm2" will be empty.  That's a dead giveaway that that Open MPI installation has no PSM2 support.

3b. Intel can support its customers by having an "Intel Open MPI" distribution (or whatever they want to name it, just as long as it is not named plain/vanilla "Open MPI") that is configured/built to support both PSM1/PSM2 via their normal software distribution mechanism.

3c. If there's some solution Intel would like to push upstream to the Open MPI community, great -- it can go through the normal review process and be accepted upstream (i.e., just like we work every day).  That solution can then be included in future releases.

How does that sound?


> On Sep 3, 2015, at 10:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> if I correctly read between the lines of your second point, omnipath (PSM2)
> is working out of the box. I am not sure this is the case, and/or my
> extrapolation might be incorrect.
>
> if I understood correctly, psm2 is a new feature.
> from a distro point of view, that could be a new package (known not to
> support PSM), or a mpirun-psm2 wrapper, or a release note (e.g. use --mca mtl
> ^psm or a psm2 param file)
>
> I still do not get how removing PSM2 makes things better
> (and the same result can be achieved by configuring with --without-psm2)
>
> Cheers,
>
> Gilles
>
> On Thursday, September 3, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> I guess I didn’t make it clear in my prior comment, so let me try again. I
> understand about dlopen and the fix that George proposed - we had internally
> discussed this as well. However, the questions that raises are:
>
> 1. how does the distro (Michal) decide which PSM module to disable by default
> in their package?
>
> 2. how does the user “discover” that their fabric has automatically been
> disabled, especially since this has never been the case before?
>
> I’ll raise the procedural question at our next telecon. I certainly take no
> pleasure out of generating releases, so if we have a better solution, I’m all
> for it!
>
>
> > On Sep 3, 2015, at 5:55 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> > wrote:
> >
> > I agree with what George says.
> >
> > AFAIK, Red Hat builds Open MPI support for dlopen, so the config file
> > option is probably suitable.
> >
> > However, I have to admit that I resent the fact that PSM's poor upgrade
> > path design is forcing both the Open MPI and libfabric communities to have
> > similar confusing conversations (e.g., see
> > https://github.com/ofiwg/libfabric/issues/1258#issuecomment-137426271).
> >
> > Specifically: because of the design of PSM1/PSM2, both Open MPI and
> > libfabric will have to adjust their configury and use dlopen/function
> > pointer indirection to "solve" the problem of supporting both PSM1 and PSM2.
> >
> > Does that seem weird to anyone else?
> >
> > IMNSHO, if you have to have extremely confusing conversations in multiple
> > software communities explaining your configury,
> > function-pointer-indirection code (i.e., PR
> > https://github.com/ofiwg/libfabric/pull/1259), compilation, and linking
> > scheme to upgrade to a new library, you're doing it wrong.
> >
> >
> >
> >
> >> On Sep 3, 2015, at 7:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>
> >> Hi Michael,
> >>
> >> I might have missed some context when proposing this solution. As Gilles
> >> suggested if you build Open MPI without support for dlopen (configure
> >> option --disable-dlopen) this simple solution will not work because the
> >> symbol conflict issue is generated deep inside the constructors of the 2
> >> libraries.
> >>
> >> Yes, the "mtl = ^psm" (or ^psm2 depending on which one you want to
> >> disable) should go in the openmpi-mca-params.conf that gets installed in
> >> the $(sysconfigdir).
> >>
> >> Thanks,
> >> George.
> >>
> >>
> >> On Thu, Sep 3, 2015 at 5:14 AM, Michal Schmidt <mschm...@redhat.com> wrote:
> >> [I apologize for not threading the email properly. I was not subscribed
> >> before and found the conversation in the web archive.]
> >>
> >> Hello,
> >>
> >> I am the one who discovered the PSM vs. PSM2 library conflict and
> >> proposed the temporary workaround of having two builds of the openmpi
> >> package.
> >>
> >> George Bosilca wrote:
> >>> 3. Except if the distro builds OMPI statically, I see no reason to
> >>> have 2 build of OMPI due to conflicting symbols between two shared
> >>> libraries that OMPI MCA load willingly. Why a simple "mtl = ^psm" in
> >>> the OMPI system wide configuration file is not enough to solve the
> >>> issue?
> >>
> >> Thank you for this suggestion. It would go into openmpi-mca-params.conf,
> >> right? I will try it.
> >>
> >> Regards,
> >> Michal
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2015/09/17927.php
> >>
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2015/09/17928.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/09/17931.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17933.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17937.php


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/