$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
    linux-vdso.so.1 =>  (0x00007fff54478000)
    libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)


Now, if I relink auth_munge.so so that it depends on libslurm:

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc


I can give the latest Slurm a try if needed.

Cheers,

Gilles


On 2014/12/02 12:56, Ralph Castain wrote:
> Out of curiosity - how are you testing these? I have more current versions of 
> Slurm and would like to test the observations there.
>
>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> I'd like to take a step back...
>>
>> I previously tested with Slurm 2.6.0, and it complained about the
>> slurm_verbose symbol, which is defined in libslurm.so.
>> So with Slurm 2.6.0, using RTLD_GLOBAL or relinking is OK.
>>
>> Now I tested with Slurm 2.6.6, and it complains about the
>> slurm_auth_get_arg_desc symbol, which is not defined in any dynamic
>> library. It is defined internally in the static libcommon.a library,
>> which is used to build the Slurm binaries.
>>
>> As far as I understand, auth_munge.so can only be invoked from a Slurm
>> binary, which means it cannot be invoked from an MPI application,
>> even if that application is linked with libslurm, libpmi, ...
>>
>> That looks like a Slurm design issue that the Slurm folks will take care of.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:33, Ralph Castain wrote:
>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>> component as this is the only place that requires it, and it won't hurt 
>>> anything to do so.
>>>
>>>
>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>> Jeff,
>>>>
>>>> FWIW, you can read my analysis of what is going wrong at
>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>>>>
>>>> Bottom line, I agree this is a Slurm issue (Slurm plugins should depend
>>>> on libslurm, but they do not, yet).
>>>>
>>>> A possible workaround would be to make the pmi component a "proxy" that
>>>> dlopens, with RTLD_GLOBAL, the "real" component in which the work is done.
>>>> That being said, the impact is quite limited (no direct launch in Slurm
>>>> with pmi1, but pmi2 works fine), so it makes sense not to work around
>>>> someone else's problem.
>>>> And that being said, configure could detect this broken pmi1 and either not
>>>> build pmi1 support or print a user-friendly error message if pmi1 is used.
>>>>
>>>> Any thoughts?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>> Ok, if the problem is moot, great.
>>>>>
>>>>> (sidenote: this is moot, so ignore this if you want: with this 
>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>
>>>>>
>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
>>>>>> library is missing the linkage to libslurm that contains the linkage to 
>>>>>> libauth where munge resides. So when we call a PMI function, libpmi 
>>>>>> references a call to munge for authentication and hits an "unresolved 
>>>>>> symbol" error.
>>>>>>
>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so
>>>>>> this problem goes away.
>>>>>>
>>>>>>
>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>
>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> FWIW: It's Slurm's pmi-1 library that isn't linked correctly against 
>>>>>>>> its dependencies (the pmi-2 one is correct).  Moe is aware of the 
>>>>>>>> problem and fixing it on their side. This won't help existing 
>>>>>>>> installations until they upgrade, but I tend to agree with Jeff about 
>>>>>>>> not fixing other people's problems.
>>>>>>> Can you explain what is happening?
>>>>>>>
>>>>>>> I ask because I'm not sure I understand the problem such that using 
>>>>>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked 
>>>>>>> against its dependencies properly, that shouldn't cause a problem if 
>>>>>>> OMPI components A and B are both linked against libpmi1.so, and then A 
>>>>>>> is loaded, and then B is loaded.
>>>>>>>
>>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>>
>>>>>>> -- 
>>>>>>> Jeff Squyres
>>>>>>> jsquy...@cisco.com
>>>>>>> For corporate legal information go to:
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php