Looks like I was totally lying in 
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said 
we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.  My 
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
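
For readers less familiar with the flag, here is a minimal sketch in C of what
"global" versus "local" loading means at the dlopen() level, assuming a POSIX
<dlfcn.h>.  The helper name load_component() is hypothetical and used only for
illustration; it is not OMPI's code, which goes through the ltdl advice object
linked above.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of OMPI or ltdl): open a shared object
 * with either global or local symbol visibility.
 *
 * RTLD_GLOBAL makes the symbols of the opened object available for
 * resolving references in objects that are loaded afterwards;
 * RTLD_LOCAL keeps them private to the returned handle.
 *
 * A NULL path asks dlopen() for a handle to the main program.
 */
void *load_component(const char *path, int global)
{
    int flags = RTLD_NOW | (global ? RTLD_GLOBAL : RTLD_LOCAL);
    void *handle = dlopen(path, flags);

    if (handle == NULL) {
        /* dlerror() describes the most recent dl* failure. */
        const char *msg = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path ? path : "(main program)",
                msg ? msg : "unknown error");
    }
    return handle;
}
```

A call would look like load_component("/path/to/component.so", 1), where the
path is a placeholder.  Note that RTLD_GLOBAL only widens the visibility of
symbols in what gets loaded; it cannot supply a symbol that no loaded library
defines.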

I believe someone said earlier in the thread that adding the right -llibs to 
the configure line will solve the issue, and that sounds correct to me.  If 
there's a missing symbol because the SLURM libraries are not automatically 
pulling in the right dependent libraries, and *if* we put a workaround in OMPI 
to fix this issue, then the right workaround is to add the relevant -llibs when 
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would 
suggest adding a configure.m4 test to see if adding the additional -llibs is 
necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if that 
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.



On Dec 2, 2014, at 6:38 AM, Artem Polyakov <[email protected]> wrote:

> Agreed. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is set 
> to. If it is zero, this is very probably the same bug as mine.
> 
> 2014-12-02 17:33 GMT+06:00 Ralph Castain <[email protected]>:
> It does look similar - question is: why didn’t this fix the problem? Will 
> have to investigate.
> 
> Thanks
> 
> 
>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov <[email protected]> wrote:
>> 
>> 
>> 
>> 2014-12-02 17:13 GMT+06:00 Ralph Castain <[email protected]>:
>> Hmmm…if that is true, then it didn’t fix this problem as it is being 
>> reported in the master.
>> 
>> I had this problem on my laptop installation. You can check my report, which 
>> was detailed enough, and see if you are hitting the same issue. My fix was 
>> also included in the 1.8 branch. I am not sure that this is the same issue, 
>> but they look similar.
>>  
>> 
>> 
>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov <[email protected]> wrote:
>>> 
>>> I think this might be related to the configuration problem I was fixing 
>>> with Jeff a few months ago. Refer here:
>>> https://github.com/open-mpi/ompi/pull/240
>>> 
>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain <[email protected]>:
>>> If it isn’t too much trouble, it would be good to confirm that it remains 
>>> broken. I strongly suspect it is, based on Moe’s comments.
>>> 
>>> Obviously, other people are making this work. For Intel MPI, all you do is 
>>> point it at libpmi and they can run. However, they do explicitly dlopen it 
>>> in their code, and I don’t know what flags they might pass when they do so.
>>> 
>>> If necessary, I suppose we could follow that pattern. In other words, 
>>> rather than specifically linking the “s1” component to libpmi, instead 
>>> require that the user point us to a pmi library via an MCA param, then 
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
>>> cited by Jeff, but resolves the pmi linkage problem.
>>> 
>>> 
>>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>>>> <[email protected]> wrote:
>>>> 
>>>> $ srun --version
>>>> slurm 2.6.6-VENDOR_PROVIDED
>>>> 
>>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>>> I am 0 / 1
>>>> 
>>>> $ srun -n 1 ~/hw
>>>> /csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
>>>> srun: error: soleil: task 0: Exited with exit code 127
>>>> 
>>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>>>     linux-vdso.so.1 =>  (0x00007fff54478000)
>>>>     libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
>>>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
>>>>     libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
>>>>     /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
>>>> 
>>>> 
>>>> now, if i relink auth_munge.so so it depends on libslurm :
>>>> 
>>>> $ srun -n 1 ~/hw
>>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc
>>>> 
>>>> 
>>>> i can give a try to the latest slurm if needed
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> 
>>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>>> Out of curiosity - how are you testing these? I have more current 
>>>>> versions of Slurm and would like to test the observations there.
>>>>> 
>>>>> 
>>>>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> I'd like to take a step back ...
>>>>>> 
>>>>>> i previously tested with slurm 2.6.0, and it complained about the 
>>>>>> slurm_verbose symbol that is defined in libslurm.so
>>>>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>>>> 
>>>>>> now i tested with slurm 2.6.6 and it complains about the 
>>>>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>>>>> defined in any dynamic library. it is internally defined in the static 
>>>>>> libcommon.a library, which is used to build the slurm binaries.
>>>>>> 
>>>>>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>>>>>> binary, which means it cannot be invoked from an mpi application
>>>>>> even if it is linked with libslurm, libpmi, ...
>>>>>> 
>>>>>> that looks like a slurm design issue that the slurm folks will take care 
>>>>>> of.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>>>> 
>>>>>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>>>>>> component as this is the only place that requires it, and it won’t hurt 
>>>>>>> anything to do so.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Jeff,
>>>>>>>> 
>>>>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>>>>> 
>>>>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> bottom line, i agree this is a slurm issue (slurm plugins should depend
>>>>>>>> on libslurm, but they do not yet)
>>>>>>>> 
>>>>>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>>>>>> dlopens, with RTLD_GLOBAL, the "real" component in which the job is done.
>>>>>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>>>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>>>>>> someone else's problem.
>>>>>>>> that said, configure could also detect this broken pmi1 and either not
>>>>>>>> build pmi1 support or print a user-friendly error message if pmi1 is 
>>>>>>>> used.
>>>>>>>> 
>>>>>>>> any thoughts ?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Gilles
>>>>>>>> 
>>>>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>>>> 
>>>>>>>>> Ok, if the problem is moot, great.
>>>>>>>>> 
>>>>>>>>> (sidenote: this is moot, so ignore this if you want: with this 
>>>>>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain 
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. 
>>>>>>>>>> This library is missing the linkage to libslurm that contains the 
>>>>>>>>>> linkage to libauth where munge resides. So when we call a PMI 
>>>>>>>>>> function, libpmi references a call to munge for authentication and 
>>>>>>>>>> hits an “unresolved symbol” error.
>>>>>>>>>> 
>>>>>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so 
>>>>>>>>>> this problem goes away
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) 
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain 
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly 
>>>>>>>>>>>> against its dependencies (the pmi-2 one is correct).  Moe is aware 
>>>>>>>>>>>> of the problem and fixing it on their side. This won’t help 
>>>>>>>>>>>> existing installations until they upgrade, but I tend to agree 
>>>>>>>>>>>> with Jeff about not fixing other people’s problems.
>>>>>>>>>>>> 
>>>>>>>>>>> Can you explain what is happening?
>>>>>>>>>>> 
>>>>>>>>>>> I ask because I'm not sure I understand the problem such that using 
>>>>>>>>>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked 
>>>>>>>>>>> against its dependencies properly, that shouldn't cause a problem 
>>>>>>>>>>> if OMPI components A and B are both linked against libpmi1.so, and 
>>>>>>>>>>> then A is loaded, and then B is loaded.
>>>>>>>>>>> 
>>>>>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>>>>> 
>>>>>>>>>>> -- 
>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>> [email protected]
>>>>>>>>>>> For corporate legal information go to: 
>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> 
>>>>>>>>>>> [email protected] <mailto:[email protected]>
>>>>>>>>>>> 
>>>>>>>>>>> Subscription: 
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>>>>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
>>>>>>>>>>> 
>>>>>>>>>>> Link to this post: 
>>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php 
>>>>>>>>>>> <http://www.open-mpi.org/community/lists/devel/2014/12/16383.php>
>>> 
>>> 
>>> 
>>> -- 
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>> 
>> 
>> 
>> -- 
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov


-- 
Jeff Squyres
[email protected]
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
