@#$%#@$% Can you send your configure output and config.log?
On Dec 2, 2014, at 10:06 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:

> I checked with the debugger, that it did skip the entire section
>
> On 12/2/2014 9:04 AM, Jeff Squyres (jsquyres) wrote:
>> Oy -- I thought we fixed that. :-(
>>
>> Are you saying that configure output says that ltdladvise is not found?
>>
>> On Dec 2, 2014, at 9:59 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>>
>>> didn't want to interfere with this thread, although I have a similar issue,
>>> since I have the solution nearly fully cooked up. But anyway, this last
>>> email gave the hint on why we have suddenly the problem in ompio:
>>>
>>> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set
>>> anymore, so the entire section is being skipped. I double checked that with
>>> the 1.8 branch, it goes through the section, but not with master.
>>>
>>> Thanks
>>> Edgar
>>>
>>> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>>>> Looks like I was totally lying in
>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I
>>>> said we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:
>>>>
>>>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>>>
>>>> This ltdl advice object is passed to lt_dlopen() for all components. My
>>>> mistake; sorry.
>>>>
>>>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>>>>
>>>> I believe someone said earlier in the thread that adding the right -llibs
>>>> to the configure line will solve the issue, and that sounds correct to me.
>>>> If there's a missing symbol because the SLURM libraries are not
>>>> automatically pulling in the right dependent libraries, then *if* we put a
>>>> workaround in OMPI to fix this issue, then the right workaround is to add
>>>> the relevant -llibs when that component is linked.
>>>>
>>>> *If* you add that workaround (which is a whole separate discussion), I
>>>> would suggest adding a configure.m4 test to see if adding the additional
>>>> -llibs is necessary. Perhaps AC_LINK_IFELSE looking for a symbol, and
>>>> then if that fails, AC_LINK_IFELSE again with the additional -llibs to see
>>>> if that works.
>>>>
>>>> Or something like that.
>>>>
>>>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>
>>>>> Agree. The first thing you should check is what value
>>>>> OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very probably
>>>>> the same bug as mine.
>>>>>
>>>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>>> It does look similar - question is: why didn’t this fix the problem? Will
>>>>> have to investigate.
>>>>>
>>>>> Thanks
>>>>>
>>>>>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>>>
>>>>>> 2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>>>> Hmmm…if that is true, then it didn’t fix this problem as it is being
>>>>>> reported in the master.
>>>>>>
>>>>>> I had this problem on my laptop installation. You can check my report (it
>>>>>> was detailed enough) and see if you are hitting the same issue. My fix was
>>>>>> also included in the 1.8 branch. I am not sure that this is the same
>>>>>> issue, but they look similar.
>>>>>>
>>>>>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>>>>>
>>>>>>> I think this might be related to the configuration problem I was fixing
>>>>>>> with Jeff a few months ago. Refer here:
>>>>>>> https://github.com/open-mpi/ompi/pull/240
>>>>>>>
>>>>>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>>>>> If it isn’t too much trouble, it would be good to confirm that it
>>>>>>> remains broken. I strongly suspect it is, based on Moe’s comments.
>>>>>>>
>>>>>>> Obviously, other people are making this work.
>>>>>>> For Intel MPI, all you do
>>>>>>> is point it at libpmi and they can run. However, they do explicitly
>>>>>>> dlopen it in their code, and I don’t know what flags they might pass
>>>>>>> when they do so.
>>>>>>>
>>>>>>> If necessary, I suppose we could follow that pattern. In other words,
>>>>>>> rather than specifically linking the “s1” component to libpmi, instead
>>>>>>> require that the user point us to a pmi library via an MCA param, then
>>>>>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>>>>>>> cited by Jeff, but resolves the pmi linkage problem.
>>>>>>>
>>>>>>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>>
>>>>>>>> $ srun --version
>>>>>>>> slurm 2.6.6-VENDOR_PROVIDED
>>>>>>>>
>>>>>>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>>>>>>> I am 0 / 1
>>>>>>>>
>>>>>>>> $ srun -n 1 ~/hw
>>>>>>>> /csc/home1/gouaillardet/hw: symbol lookup error:
>>>>>>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>>>>>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>>>>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted
>>>>>>>> or received
>>>>>>>> srun: error: soleil: task 0: Exited with exit code 127
>>>>>>>>
>>>>>>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>>>>>>> linux-vdso.so.1 => (0x00007fff54478000)
>>>>>>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>>>>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>>>>>>>> libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>>>>>>>> /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>>>>>>>
>>>>>>>> now, if i relink auth_munge.so so it depends on libslurm :
>>>>>>>>
>>>>>>>> $ srun -n 1 ~/hw
>>>>>>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>>>>>>>> symbol: slurm_auth_get_arg_desc
>>>>>>>>
>>>>>>>> i can give a try to the latest slurm if needed
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>>>>>>> Out of curiosity - how are you testing these? I have more current
>>>>>>>>> versions of Slurm and would like to test the observations there.
>>>>>>>>>
>>>>>>>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
>>>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>>>>
>>>>>>>>>> I'd like to take a step back ...
>>>>>>>>>>
>>>>>>>>>> i previously tested with slurm 2.6.0, and it complained about the
>>>>>>>>>> slurm_verbose symbol that is defined in libslurm.so
>>>>>>>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>>>>>>>
>>>>>>>>>> now i tested with slurm 2.6.6 and it complains about the
>>>>>>>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>>>>>>>> defined in any dynamic library. it is internally defined in the
>>>>>>>>>> static libcommon.a library, which is used to build the slurm
>>>>>>>>>> binaries.
>>>>>>>>>>
>>>>>>>>>> as far as i understand, auth_munge.so can only be invoked from a
>>>>>>>>>> slurm binary, which means it cannot be invoked from an mpi
>>>>>>>>>> application even if it is linked with libslurm, libpmi, ...
>>>>>>>>>>
>>>>>>>>>> that looks like a slurm design issue that the slurm folks will take
>>>>>>>>>> care of.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Gilles
>>>>>>>>>>
>>>>>>>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Another option is to simply add the -lslurm -lauth flags to the
>>>>>>>>>>> pmix/s1 component as this is the only place that requires it, and
>>>>>>>>>>> it won’t hurt anything to do so.
>>>>>>>>>>>
>>>>>>>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
>>>>>>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Jeff,
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>>>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>>>>>>>>>
>>>>>>>>>>>> bottom line, i agree this is a slurm issue (slurm plugins should
>>>>>>>>>>>> depend on libslurm, but they do not, yet)
>>>>>>>>>>>>
>>>>>>>>>>>> a possible workaround would be to make the pmi component a "proxy"
>>>>>>>>>>>> that dlopens with RTLD_GLOBAL the "real" component in which the job
>>>>>>>>>>>> is done.
>>>>>>>>>>>> that being said, the impact is quite limited (no direct launch in
>>>>>>>>>>>> slurm with pmi1, but pmi2 works fine) so it makes sense not to work
>>>>>>>>>>>> around someone else's problem.
>>>>>>>>>>>> and that being said, configure could detect this broken pmi1 and
>>>>>>>>>>>> not build pmi1 support or print a user friendly error message if
>>>>>>>>>>>> pmi1 is used.
>>>>>>>>>>>>
>>>>>>>>>>>> any thoughts ?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Gilles
>>>>>>>>>>>>
>>>>>>>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, if the problem is moot, great.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>>>>>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1
>>>>>>>>>>>>>> component. This library is missing the linkage to libslurm that
>>>>>>>>>>>>>> contains the linkage to libauth where munge resides. So when we
>>>>>>>>>>>>>> call a PMI function, libpmi references a call to munge for
>>>>>>>>>>>>>> authentication and hits an “unresolved symbol” error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the
>>>>>>>>>>>>>> linkages so this problem goes away
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly
>>>>>>>>>>>>>>>> against its dependencies (the pmi-2 one is correct). Moe is
>>>>>>>>>>>>>>>> aware of the problem and fixing it on their side. This won’t
>>>>>>>>>>>>>>>> help existing installations until they upgrade, but I tend to
>>>>>>>>>>>>>>>> agree with Jeff about not fixing other people’s problems.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you explain what is happening?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I ask because I'm not sure I understand the problem such that
>>>>>>>>>>>>>>> using RTLD_GLOBAL would fix it.
>>>>>>>>>>>>>>> I.e., even if libpmi1.so isn't
>>>>>>>>>>>>>>> linked against its dependencies properly, that shouldn't cause
>>>>>>>>>>>>>>> a problem if OMPI components A and B are both linked against
>>>>>>>>>>>>>>> libpmi1.so, and then A is loaded, and then B is loaded.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>>>> Link to this post:
>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>>>>>>>
>>>>>>> --
>>>>>>> С Уважением, Поляков Артем Юрьевич
>>>>>>> Best regards, Artem Y. Polyakov
>>>
>>> --
>>> Edgar Gabriel
>>> Associate Professor
>>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>>> Department of Computer Science          University of Houston
>>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
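
For reference, the explicit-dlopen pattern suggested in the quoted thread (have the user point at a pmi library, then dlopen it with RTLD_GLOBAL) could look roughly like the sketch below. This is only an illustration, not Open MPI code: the helper name open_pmi_global is hypothetical, and so is the assumption that the library path arrives as a plain string (for example from an MCA parameter). As noted in the thread, this would not help when the missing symbol is not exported by any shared library at all.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not part of Open MPI): open a user-chosen library
 * and make its symbols visible to libraries that are loaded afterwards.
 * RTLD_NOW asks the loader to resolve the library's undefined symbols at
 * load time, so a missing dependency is reported here instead of at the
 * first call into the library. */
static void *open_pmi_global(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        /* dlerror() describes the most recent dlopen failure. */
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path ? path : "(main program)", dlerror());
    }
    return handle;
}
```

A caller would pass the user-supplied path and either fall back or print a friendly error when the result is NULL. On older glibc versions the program may also need to be linked with -ldl.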