Jeff,

FWIW, you can read my analysis of what is going wrong at
http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php

bottom line, i agree this is a slurm issue (slurm plugin should depend
on libslurm, but they do not, yet)

a possible workaround would be to make the pmi component a "proxy" that
dlopen with RTLD_GLOBAL the "real" component in which the job is done.
that being said, the impact is quite limited (no direct launch in slurm
with pmi1, but pmi2 works fine) so it makes sense not to work around
someone else problem.
and that being said, configure could detect this broken pmi1 and not
build pmi1 support or print a user friendly error message if pmi1 is used.

any thoughts ?

Cheers,

Gilles

On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
> Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this explanation, 
> I'm still not sure how RTLD_GLOBAL fixes the issue)
>
>
> On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
>> library is missing the linkage to libslurm that contains the linkage to 
>> libauth where munge resides. So when we call a PMI function, libpmi 
>> references a call to munge for authentication and hits an “unresolved 
>> symbol” error.
>>
>> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
>> problem goes away
>>
>>
>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>> wrote:
>>>
>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
>>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
>>>> fixing it on their side. This won’t help existing installations until they 
>>>> upgrade, but I tend to agree with Jeff about not fixing other people’s 
>>>> problems.
>>> Can you explain what is happening?
>>>
>>> I ask because I'm not sure I understand the problem such that using 
>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against 
>>> its dependencies properly, that shouldn't cause a problem if OMPI 
>>> components A and B are both linked against libpmi1.so, and then A is 
>>> loaded, and then B is loaded.
>>>
>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16384.php
>

Reply via email to