Ralph,

no problem:

I just tried slurm-14-11-11-1, and *both* pmi1 and pmi2 fail with the same
error message:

symbol lookup error: /opt/slurm-14-11.11.1/lib/slurm/auth_munge.so: undefined symbol: slurm_debug

On the bright side, auth_munge.so has no undefined slurm_auth_get_arg_desc
symbol.

If I relink auth_munge.so so that it depends on libslurm.so, this fixes *both*
pmi1 and pmi2.

Cheers,

Gilles

On 2014/12/02 13:15, Ralph Castain wrote:
> If it isn't too much trouble, it would be good to confirm that it remains
> broken. I strongly suspect it is, based on Moe's comments.
>
> Obviously, other people are making this work. For Intel MPI, all you do is
> point it at libpmi and they can run. However, they do explicitly dlopen it in
> their code, and I don't know what flags they might pass when they do so.
>
> If necessary, I suppose we could follow that pattern. In other words, rather
> than specifically linking the "s1" component to libpmi, instead require that
> the user point us to a pmi library via an MCA param, then explicitly dlopen
> that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but
> resolves the pmi linkage problem.
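
The explicit-dlopen pattern described in the message above can be sketched
roughly as follows. This is a minimal Python illustration using ctypes, not
Open MPI code: PMI_LIBRARY_PATH and load_pmi_library are hypothetical names
that stand in for the MCA param and for whatever the component would do
internally, and the sketch does not claim that this alone resolves the Slurm
linkage problem.

```python
import ctypes
import os


def load_pmi_library(path=None, loader=ctypes.CDLL):
    """Load a user-specified PMI library with its symbols made global.

    PMI_LIBRARY_PATH is a hypothetical variable used only for this sketch;
    it stands in for the MCA param mentioned in the thread.
    """
    if path is None:
        path = os.environ.get("PMI_LIBRARY_PATH")
    if not path:
        raise ValueError("no PMI library path given")
    # RTLD_GLOBAL makes the library's symbols available for resolving
    # references in shared objects that are loaded afterwards.
    return loader(path, mode=ctypes.RTLD_GLOBAL)
```

A call such as load_pmi_library("/path/to/libpmi.so") would return a ctypes
handle; the path here is a placeholder. The loader argument exists only so
the function can be exercised without a real library.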
>
>
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 => (0x00007fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>
>>
>> now, if i relink auth_munge.so so it depends on libslurm :
>>
>> $ srun -n 1 ~/hw
>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc
>>
>>
>> i can give the latest slurm a try if needed
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/12/02 12:56, Ralph Castain wrote:
>>> Out of curiosity - how are you testing these? I have more current versions
>>> of Slurm and would like to test the observations there.
>>>
>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>> I'd like to take a step back ...
>>>>
>>>> i previously tested with slurm 2.6.0, and it complained about the
>>>> slurm_verbose symbol that is defined in libslurm.so
>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>
>>>> now i tested with slurm 2.6.6 and it complains about the
>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>> defined in any dynamic library.
>>>> it is internally defined in the static
>>>> libcommon.a library, which is used to build the slurm binaries.
>>>>
>>>> as far as i understand, auth_munge.so can only be invoked from a slurm
>>>> binary, which means it cannot be invoked from an mpi application
>>>> even if it is linked with libslurm, libpmi, ...
>>>>
>>>> that looks like a slurm design issue that the slurm folks will take care
>>>> of.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1
>>>>> component as this is the only place that requires it, and it won't hurt
>>>>> anything to do so.
>>>>>
>>>>>
>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>
>>>>>> Jeff,
>>>>>>
>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>>>
>>>>>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>>>>>> on libslurm, but they do not, yet)
>>>>>>
>>>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>>>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>>>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>>>> someone else's problem.
>>>>>> and that being said, configure could detect this broken pmi1 and not
>>>>>> build pmi1 support or print a user friendly error message if pmi1 is
>>>>>> used.
>>>>>>
>>>>>> any thoughts ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>> Ok, if the problem is moot, great.
>>>>>>>
>>>>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>
>>>>>>>
>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component.
>>>>>>>> This library is missing the linkage to libslurm that contains the
>>>>>>>> linkage to libauth where munge resides. So when we call a PMI
>>>>>>>> function, libpmi references a call to munge for authentication and
>>>>>>>> hits an "unresolved symbol" error.
>>>>>>>>
>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so
>>>>>>>> this problem goes away
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>
>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>> FWIW: It's Slurm's pmi-1 library that isn't linked correctly against
>>>>>>>>>> its dependencies (the pmi-2 one is correct). Moe is aware of the
>>>>>>>>>> problem and fixing it on their side.
>>>>>>>>>> This won't help existing installations until they upgrade, but I
>>>>>>>>>> tend to agree with Jeff about not fixing other people's problems.
>>>>>>>>> Can you explain what is happening?
>>>>>>>>>
>>>>>>>>> I ask because I'm not sure I understand the problem such that using
>>>>>>>>> RTLD_GLOBAL would fix it. I.e., even if libpmi1.so isn't linked
>>>>>>>>> against its dependencies properly, that shouldn't cause a problem if
>>>>>>>>> OMPI components A and B are both linked against libpmi1.so, and then
>>>>>>>>> A is loaded, and then B is loaded.
>>>>>>>>>
>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> jsquy...@cisco.com
>>>>>>>>> For corporate legal information go to:
>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
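
The manual ldd check shown in this thread can also be scripted. The following
is a minimal Python sketch, assuming the ldd output has already been captured
as text (here it is copied from the thread). ldd_sonames and links_against are
hypothetical helper names for this illustration and are not part of Slurm or
Open MPI.

```python
import os

# ldd output for /usr/lib64/slurm/auth_munge.so, as quoted in the thread.
LDD_OUTPUT = """\
linux-vdso.so.1 => (0x00007fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
"""


def ldd_sonames(ldd_text):
    """Return the library file names listed in ldd output, without directories."""
    names = []
    for line in ldd_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field is either a soname ("libc.so.6") or a full path
        # ("/lib64/ld-linux-x86-64.so.2"); keep only the file name.
        names.append(os.path.basename(fields[0]))
    return names


def links_against(ldd_text, library):
    """True if a listed dependency is `library` or a versioned form of it."""
    return any(
        name == library or name.startswith(library + ".")
        for name in ldd_sonames(ldd_text)
    )


print(links_against(LDD_OUTPUT, "libmunge"))  # True
print(links_against(LDD_OUTPUT, "libslurm"))  # False
```

The second result matches what the thread reports: the plugin lists libmunge
but not libslurm among its dependencies. A check of this kind only inspects
declared dependencies; it does not tell whether every undefined symbol would
actually resolve at run time.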