Looks like I was totally lying in http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:
https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.

My mistake; sorry. So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to the configure line will solve the issue, and that sounds correct to me. If there's a missing symbol because the SLURM libraries are not automatically pulling in the right dependent libraries, then *if* we put a workaround in OMPI to fix this issue, the right workaround is to add the relevant -llibs when that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would suggest adding a configure.m4 test to see if the additional -llibs are necessary. Perhaps AC_LINK_IFELSE looking for a symbol, and then if that fails, AC_LINK_IFELSE again with the additional -llibs to see if that works. Or something like that.


On Dec 2, 2014, at 6:38 AM, Artem Polyakov <[email protected]> wrote:

> Agree. First you should check what value OPAL_HAVE_LTDL_ADVISE is set to.
> If it is zero, this is very probably the same bug as mine.
>
> 2014-12-02 17:33 GMT+06:00 Ralph Castain <[email protected]>:
> It does look similar - question is: why didn’t this fix the problem? Will
> have to investigate.
>
> Thanks
>
>
>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <[email protected]> wrote:
>>
>>
>>
>> 2014-12-02 17:13 GMT+06:00 Ralph Castain <[email protected]>:
>> Hmmm…if that is true, then it didn’t fix this problem as it is being
>> reported in the master.
>>
>> I had this problem on my laptop installation. You can check my report; it was
>> detailed enough to see whether you are hitting the same issue. My fix was also
>> included in the 1.8 branch. I am not sure that this is the same issue, but they
>> look similar.
>>
>>
>>
>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <[email protected]> wrote:
>>>
>>> I think this might be related to the configuration problem I was fixing
>>> with Jeff a few months ago. Refer here:
>>> https://github.com/open-mpi/ompi/pull/240
>>>
>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <[email protected]>:
>>> If it isn’t too much trouble, it would be good to confirm that it remains
>>> broken. I strongly suspect it is, based on Moe’s comments.
>>>
>>> Obviously, other people are making this work. For Intel MPI, all you do is
>>> point it at libpmi and they can run. However, they do explicitly dlopen it
>>> in their code, and I don’t know what flags they might pass when they do so.
>>>
>>> If necessary, I suppose we could follow that pattern. In other words,
>>> rather than specifically linking the “s1” component to libpmi, instead
>>> require that the user point us to a pmi library via an MCA param, then
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>>> cited by Jeff, but resolves the pmi linkage problem.
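
For reference, the explicit-dlopen pattern Ralph describes above could look roughly like the following C sketch. The helper name load_pmi_global and the way the path is handed in are hypothetical (this is not existing OMPI code, and the MCA parameter lookup is left out entirely); only dlopen(), dlerror() and the RTLD_* flags are real POSIX API.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not OMPI code): open the PMI library at `path`
 * with RTLD_GLOBAL so that its symbols become visible to libraries
 * loaded afterwards. Returns the dlopen handle, or NULL on failure
 * with a message copied into errbuf.
 */
void *load_pmi_global(const char *path, char *errbuf, size_t errlen)
{
    void *handle;
    const char *msg;

    /* Start with an empty error string when a buffer was supplied. */
    if (errlen > 0) {
        errbuf[0] = '\0';
    }
    if (path == NULL) {
        snprintf(errbuf, errlen, "no PMI library path given");
        return NULL;
    }

    /* RTLD_NOW resolves all symbols at load time, so a library with
     * unresolved dependencies fails here instead of at the first call. */
    handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        msg = dlerror();
        snprintf(errbuf, errlen, "%s",
                 msg != NULL ? msg : "unknown dlopen error");
    }
    return handle;
}
```

A caller would pass the user-supplied path and, on success, look up the PMI entry points with dlsym(). On older glibc this needs -ldl at link time. Note that RTLD_GLOBAL only exports the opened library's symbols to later loads; it does not by itself supply symbols that the opened library is missing.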
>>>
>>>
>>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
>>>> <[email protected]> wrote:
>>>>
>>>> $ srun --version
>>>> slurm 2.6.6-VENDOR_PROVIDED
>>>>
>>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>>> I am 0 / 1
>>>>
>>>> $ srun -n 1 ~/hw
>>>> /csc/home1/gouaillardet/hw: symbol lookup error:
>>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or
>>>> received
>>>> srun: error: soleil: task 0: Exited with exit code 127
>>>>
>>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>>> linux-vdso.so.1 => (0x00007fff54478000)
>>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>>>> libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>>>> /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>>>
>>>>
>>>> now, if i relink auth_munge.so so it depends on libslurm :
>>>>
>>>> $ srun -n 1 ~/hw
>>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>>>> symbol: slurm_auth_get_arg_desc
>>>>
>>>>
>>>> i can give the latest slurm a try if needed
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>>> Out of curiosity - how are you testing these? I have more current
>>>>> versions of Slurm and would like to test the observations there.
>>>>>
>>>>>
>>>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
>>>>>> <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> I'd like to take a step back ...
>>>>>>
>>>>>> i previously tested with slurm 2.6.0, and it complained about the
>>>>>> slurm_verbose symbol that is defined in libslurm.so
>>>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>>>
>>>>>> now i tested with slurm 2.6.6 and it complains about the
>>>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>>>> defined in any dynamic library.
>>>>>> it is internally defined in the static
>>>>>> libcommon.a library, which is used to build the slurm binaries.
>>>>>>
>>>>>> as far as i understand, auth_munge.so can only be invoked from a slurm
>>>>>> binary, which means it cannot be invoked from an mpi application
>>>>>> even if it is linked with libslurm, libpmi, ...
>>>>>>
>>>>>> that looks like a slurm design issue that the slurm folks will take care
>>>>>> of.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>>>
>>>>>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1
>>>>>>> component as this is the only place that requires it, and it won’t hurt
>>>>>>> anything to do so.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
>>>>>>>> <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Jeff,
>>>>>>>>
>>>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>>>>>
>>>>>>>>
>>>>>>>> bottom line, i agree this is a slurm issue (slurm plugins should depend
>>>>>>>> on libslurm, but they do not, yet)
>>>>>>>>
>>>>>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>>>>>> dlopens with RTLD_GLOBAL the "real" component in which the job is done.
>>>>>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>>>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>>>>>> someone else's problem.
>>>>>>>> and that being said, configure could detect this broken pmi1 and not
>>>>>>>> build pmi1 support or print a user friendly error message if pmi1 is
>>>>>>>> used.
>>>>>>>>
>>>>>>>> any thoughts ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>>>
>>>>>>>>> Ok, if the problem is moot, great.
>>>>>>>>>
>>>>>>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain
>>>>>>>>> <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component.
>>>>>>>>>> This library is missing the linkage to libslurm that contains the
>>>>>>>>>> linkage to libauth where munge resides. So when we call a PMI
>>>>>>>>>> function, libpmi references a call to munge for authentication and
>>>>>>>>>> hits an “unresolved symbol” error.
>>>>>>>>>>
>>>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so
>>>>>>>>>> this problem goes away
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>>>>>>>> <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain
>>>>>>>>>>> <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly
>>>>>>>>>>>> against its dependencies (the pmi-2 one is correct). Moe is aware
>>>>>>>>>>>> of the problem and fixing it on their side. This won’t help
>>>>>>>>>>>> existing installations until they upgrade, but I tend to agree
>>>>>>>>>>>> with Jeff about not fixing other people’s problems.
>>>>>>>>>>>>
>>>>>>>>>>> Can you explain what is happening?
>>>>>>>>>>>
>>>>>>>>>>> I ask because I'm not sure I understand the problem such that using
>>>>>>>>>>> RTLD_GLOBAL would fix it.
>>>>>>>>>>> I.e., even if libpmi1.so isn't linked
>>>>>>>>>>> against its dependencies properly, that shouldn't cause a problem
>>>>>>>>>>> if OMPI components A and B are both linked against libpmi1.so, and
>>>>>>>>>>> then A is loaded, and then B is loaded.
>>>>>>>>>>>
>>>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>
>>>>>>>>>>> [email protected]
>>>>>>>>>>>
>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>>
>>>>>>>>>>> [email protected]
>>>>>>>>>>>
>>>>>>>>>>> Subscription:
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>
>>>>>>>>>>> Link to this post:
>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>>>
>>> --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov


--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
