I'm able to replicate Edgar's problem. I'm investigating...
On Dec 2, 2014, at 10:39 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:

> the mailing list refused to let me add the config.log file, since it is too
> large, I can forward the output to you directly as well (as I did to Jeff).
>
> I honestly have not looked into the configure logic, I can just tell that
> OPAL_HAVE_LTDL_ADVISE is not set on my linux system for master, but is set on
> the 1.8 series (1.8 series checkout was from Nov. 20, so if something changed
> in between the result might be different).
>
> On 12/2/2014 9:27 AM, Artem Polyakov wrote:
>>
>> 2014-12-02 20:59 GMT+06:00 Edgar Gabriel <gabr...@cs.uh.edu>:
>>
>> didn't want to interfere with this thread, although I have a similar
>> issue, since I have the solution nearly fully cooked up. But anyway,
>> this last email gave the hint on why we have suddenly the problem in
>> ompio:
>>
>> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
>> set anymore, so the entire section is being skipped. I double
>> checked that with the 1.8 branch, it goes through the section, but
>> not with master.
>>
>> Hi, Edgar.
>>
>> Both master and ompi-release (isn't it 1.8?!) are equal in the sense of my
>> fix. Something else!? I'd like to see config.log too but will look into
>> it only tomorrow.
>>
>> Also I want to add that SLURM PMI2 communicates with local slurmstepd's
>> and doesn't need any authentication. All PMI1 processes otherwise
>> communicate to the srun process and thus need libslurm services for
>> communication and authentication.
>>
>> Thanks
>> Edgar
>>
>> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>>
>> Looks like I was totally lying in
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php
>> (where I said we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:
>>
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>
>> This ltdl advice object is passed to lt_dlopen() for all
>> components. My mistake; sorry.
>>
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is
>> incorrect.
>>
>> I believe someone said earlier in the thread that adding the
>> right -llibs to the configure line will solve the issue, and
>> that sounds correct to me. If there's a missing symbol because
>> the SLURM libraries are not automatically pulling in the right
>> dependent libraries, then *if* we put a workaround in OMPI to
>> fix this issue, then the right workaround is to add the relevant
>> -llibs when that component is linked.
>>
>> *If* you add that workaround (which is a whole separate
>> discussion), I would suggest adding a configure.m4 test to see
>> if adding the additional -llibs is necessary. Perhaps
>> AC_LINK_IFELSE looking for a symbol, and then if that fails,
>> AC_LINK_IFELSE again with the additional -llibs to see if that
>> works.
>>
>> Or something like that.
>>
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> Agree. First you should check to what value
>> OPAL_HAVE_LTDL_ADVISE is set. If it is zero, this is very probably
>> the same bug as mine.
>>
>> 2014-12-02 17:33 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>> It does look similar - question is: why didn’t this fix the
>> problem? Will have to investigate.
>>
>> Thanks
>>
>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> 2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>> Hmmm…if that is true, then it didn’t fix this problem as
>> it is being reported in the master.
>>
>> I had this problem on my laptop installation. You can
>> check my report (it was detailed enough) and see if you are
>> hitting the same issue. My fix was also included into the
>> 1.8 branch. I am not sure that this is the same issue
>> but they look similar.
>>
>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> I think this might be related to the configuration
>> problem I was fixing with Jeff a few months ago. Refer
>> here:
>> https://github.com/open-mpi/ompi/pull/240
>>
>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>> If it isn’t too much trouble, it would be good to
>> confirm that it remains broken. I strongly suspect
>> it is based on Moe’s comments.
>>
>> Obviously, other people are making this work. For
>> Intel MPI, all you do is point it at libpmi and they
>> can run. However, they do explicitly dlopen it in
>> their code, and I don’t know what flags they might
>> pass when they do so.
>>
>> If necessary, I suppose we could follow that
>> pattern. In other words, rather than specifically
>> linking the “s1” component to libpmi, instead
>> require that the user point us to a pmi library via
>> an MCA param, then explicitly dlopen that library
>> with RTLD_GLOBAL. This avoids the issues cited by
>> Jeff, but resolves the pmi linkage problem.
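
For illustration only: the pattern Ralph describes just above (the user points at a PMI library, and that library is then opened explicitly with RTLD_GLOBAL) could look roughly like the sketch below. It is written in Python with ctypes rather than in Open MPI's C, and it is not code from Open MPI or Intel MPI. The environment variable name `EXAMPLE_PMI_LIBRARY` and the helper `load_with_global_scope` are made up for this example; the thread names neither the MCA parameter nor the flags Intel MPI actually passes.

```python
import ctypes
import os
from typing import Optional

# Hypothetical setting name, used only for this illustration. The thread
# proposes an MCA parameter but does not name one.
PMI_LIBRARY_ENV = "EXAMPLE_PMI_LIBRARY"


def load_with_global_scope(path: Optional[str]) -> ctypes.CDLL:
    """Open `path` with RTLD_GLOBAL so that its symbols are available for
    resolving references in shared objects loaded afterwards.

    Passing None opens the running program itself (dlopen(NULL) on POSIX),
    which allows trying the call on a machine without a PMI library.
    Raises OSError if the library cannot be loaded.
    """
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


if __name__ == "__main__":
    # The user, not the build system, says where the PMI library lives.
    library_path = os.environ.get(PMI_LIBRARY_ENV)  # None if unset
    load_with_global_scope(library_path)
    print("loaded:", library_path or "<running program>")
```

Whether global symbol scope is enough depends on where the missing symbols live. The sketch shows only the loading step, not a fix for the Slurm linkage problem discussed in this thread.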
>>
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 => (0x00007fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>
>> now, if i relink auth_munge.so so it depends on libslurm :
>>
>> $ srun -n 1 ~/hw
>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc
>>
>> i can give a try to the latest slurm if needed
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:56, Ralph Castain wrote:
>>
>> Out of curiosity - how are you testing these? I have more current
>> versions of Slurm and would like to test the observations there.
>>
>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> I'd like to make a step back ...
>>
>> i previously tested with slurm 2.6.0, and it complained about the
>> slurm_verbose symbol that is defined in libslurm.so
>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>
>> now i tested with slurm 2.6.6 and it complains about the
>> slurm_auth_get_arg_desc symbol, and this symbol is not
>> defined in any dynamic library. it is internally defined in the static
>> libcommon.a library, which is used to build the slurm binaries.
>>
>> as far as i understand, auth_munge.so can only be invoked from a slurm
>> binary, which means it cannot be invoked from an mpi application
>> even if it is linked with libslurm, libpmi, ...
>>
>> that looks like a slurm design issue that the slurm folks will take care of.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:33, Ralph Castain wrote:
>>
>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1
>> component as this is the only place that requires it, and it won’t hurt
>> anything to do so.
>>
>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> Jeff,
>>
>> FWIW, you can read my analysis of what is going wrong at
>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>
>> bottom line, i agree this is a slurm issue (slurm plugins should depend
>> on libslurm, but they do not, yet)
>>
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopens with RTLD_GLOBAL the "real" component in which the job is done.
>> that being said, the impact is quite limited (no direct launch in slurm
>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>> someone else's problem.
>> and that being said, configure could detect this broken pmi1 and not
>> build pmi1 support or print a user friendly error message if
>> pmi1 is used.
>>
>> any thoughts ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>
>> Ok, if the problem is moot, great.
>>
>> (sidenote: this is moot, so ignore this if you want:
>> with this explanation, I'm still not sure how
>> RTLD_GLOBAL fixes the issue)
>>
>> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Easy enough to explain. We link libpmi into the pmix/s1 component. This
>> library is missing the linkage to libslurm that contains the linkage to
>> libauth where munge resides. So when we call a PMI function, libpmi
>> references a call to munge for authentication and hits an “unresolved
>> symbol” error.
>>
>> Moe acknowledges the error is in Slurm and is fixing the linkages so
>> this problem goes away
>>
>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly
>> against its dependencies (the pmi-2 one is correct).
>> Moe is aware of the problem and fixing it on their side.
>> This won’t help existing installations until they upgrade, but I
>> tend to agree with Jeff about not fixing other people’s problems.
>>
>> Can you explain what is happening?
>>
>> I ask because I'm not sure I understand the problem such that
>> using RTLD_GLOBAL would fix it. I.e., even if libpmi1.so
>> isn't linked against its dependencies properly, that
>> shouldn't cause a problem if OMPI components A and B
>> are both linked against libpmi1.so, and then A is
>> loaded, and then B is loaded.
>>
>> ...or perhaps we can just discuss this on the call tomorrow?
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>>
>> --
>> Best regards, Artem Y. Polyakov
>
> --
> Edgar Gabriel
> Associate Professor
> Parallel Software Technologies Lab   http://pstl.cs.uh.edu
> Department of Computer Science   University of Houston
> Philip G. Hoffman Hall, Room 524   Houston, TX-77204, USA
> Tel: +1 (713) 743-3857   Fax: +1 (713) 743-3335

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/