I didn't want to interfere with this thread, although I have a similar issue, since I have the solution nearly fully cooked up. But anyway, this last email gave the hint as to why we suddenly have the problem in ompio:

It looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set anymore, so the entire section is being skipped. I double-checked with the 1.8 branch: it goes through the section there, but not on master.
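
To illustrate why an unset macro makes the section vanish silently, here is a minimal sketch. It is not the code from mca_base_component_repository.c: component_open_flags() is a hypothetical helper, and plain dlopen()-style flags stand in for the ltdl advice object. It assumes the guarded section is the one that requests global symbol visibility.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Assumption: configure normally defines this macro to 0 or 1.
 * If it is missing altogether, treat it as 0 so the sketch compiles. */
#ifndef OPAL_HAVE_LTDL_ADVISE
#define OPAL_HAVE_LTDL_ADVISE 0
#endif

/* Hypothetical helper: the flags a component would be opened with. */
static int component_open_flags(void)
{
    int flags = RTLD_NOW;   /* baseline: resolve symbols at load time */
#if OPAL_HAVE_LTDL_ADVISE
    /* This section is compiled out when the macro is 0, so the request
     * for global visibility disappears without any warning. */
    flags |= RTLD_GLOBAL;
#endif
    return flags;
}

/* Hypothetical helper: report which path was compiled in. */
static const char *component_open_mode(void)
{
    return (component_open_flags() & RTLD_GLOBAL) ? "global" : "local";
}
```

Printing component_open_mode() at startup on both branches would be one quick way to confirm whether the guarded section is being compiled.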

Thanks
Edgar



On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
Looks like I was totally lying in 
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said 
we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.  My 
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to 
the configure line will solve the issue, and that sounds correct to me.  If 
there's a missing symbol because the SLURM libraries are not automatically 
pulling in the right dependent libraries, and *if* we put a workaround in OMPI 
to fix this issue, then the right workaround is to add the relevant -llibs when 
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would 
suggest adding a configure.m4 test to see if adding the additional -llibs is 
necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if that 
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.



On Dec 2, 2014, at 6:38 AM, Artem Polyakov <artpo...@gmail.com> wrote:

Agree. The first thing you should check is what value OPAL_HAVE_LTDL_ADVISE is 
set to. If it is zero, this is very probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
It does look similar - question is: why didn’t this fix the problem? Will have 
to investigate.

Thanks


On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:



2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
Hmmm…if that is true, then it didn’t fix this problem as it is being reported 
in the master.

I had this problem on my laptop installation. You can check my report, which 
was detailed enough, and see if you are hitting the same issue. My fix was also 
included in the 1.8 branch. I am not sure that this is the same issue, but they 
look similar.



On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:

I think this might be related to the configuration problem I was fixing with 
Jeff a few months ago. See:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
If it isn’t too much trouble, it would be good to confirm that it remains 
broken. I strongly suspect it is based on Moe’s comments.

Obviously, other people are making this work. For Intel MPI, all you do is 
point it at libpmi and they can run. However, they do explicitly dlopen it in 
their code, and I don’t know what flags they might pass when they do so.

If necessary, I suppose we could follow that pattern. In other words, rather 
than specifically linking the “s1” component to libpmi, instead require that 
the user point us to a pmi library via an MCA param, then explicitly dlopen 
that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
resolves the pmi linkage problem.
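
A minimal sketch of that pattern, assuming POSIX dlopen(). The helper names open_pmi_global() and find_pmi_symbol() are hypothetical, and the path is assumed to come from whatever MCA param ends up being chosen; none of this is existing OMPI code.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: open the user-supplied PMI library so that its
 * symbols are visible to libraries loaded afterwards (RTLD_GLOBAL).
 * Returns the handle, or NULL after printing the loader's message. */
static void *open_pmi_global(const char *path)
{
    assert(path != NULL);
    dlerror();   /* clear any stale error string */
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "cannot open %s: %s\n", path,
                err != NULL ? err : "unknown error");
    }
    return handle;
}

/* Hypothetical helper: look up one entry point in the opened library.
 * The caller casts the result to the matching function pointer type. */
static void *find_pmi_symbol(void *handle, const char *name)
{
    assert(handle != NULL && name != NULL);
    return dlsym(handle, name);
}
```

With this approach the component itself would no longer be linked against libpmi; every PMI call would go through pointers obtained from find_pmi_symbol(). On older glibc the program also needs -ldl at link time.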


On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> 
wrote:

$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error: 
/usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
     linux-vdso.so.1 =>  (0x00007fff54478000)
     libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
     libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
     /lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)


now, if i relink auth_munge.so so it depends on libslurm :

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
slurm_auth_get_arg_desc


i can give the latest slurm a try if needed

Cheers,

Gilles


On 2014/12/02 12:56, Ralph Castain wrote:
Out of curiosity - how are you testing these? I have more current versions of 
Slurm and would like to test the observations there.


On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org>
  wrote:

I'd like to take a step back ...

i previously tested with slurm 2.6.0, and it complained about the slurm_verbose 
symbol that is defined in libslurm.so
so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok

now i tested with slurm 2.6.6 and it complains about the 
slurm_auth_get_arg_desc symbol, and this symbol is not
defined in any dynamic library. it is internally defined in the static 
libcommon.a library, which is used to build the slurm binaries.

as far as i understand, auth_munge.so can only be invoked from a slurm binary, 
which means it cannot be invoked from an mpi application
even if it is linked with libslurm, libpmi, ...

that looks like a slurm design issue that the slurm folks will take care of.

Cheers,

Gilles

On 2014/12/02 12:33, Ralph Castain wrote:

Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
component as this is the only place that requires it, and it won’t hurt 
anything to do so.



On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> 
wrote:

Jeff,

FWIW, you can read my analysis of what is going wrong at

http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php


bottom line, i agree this is a slurm issue (slurm plugins should depend
on libslurm, but they do not yet)

a possible workaround would be to make the pmi component a "proxy" that
dlopens with RTLD_GLOBAL the "real" component in which the job is done.
that being said, the impact is quite limited (no direct launch in slurm
with pmi1, but pmi2 works fine) so it makes sense not to work around
someone else's problem.
and that being said, configure could detect this broken pmi1 and not
build pmi1 support or print a user friendly error message if pmi1 is used.
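
The "user friendly error message" idea could also be sketched as a run-time probe rather than a configure test. This assumes POSIX dlopen(); probe_library() is a hypothetical helper, not existing OMPI code. Which file to probe is also an assumption: in the srun output quoted in this thread, the unresolved symbol is reported for auth_munge.so, not for libpmi itself.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical probe: returns 0 if `path` loads with every symbol
 * resolved, or -1 after copying the loader's message into `msg`. */
static int probe_library(const char *path, char *msg, size_t msglen)
{
    assert(path != NULL && msg != NULL && msglen > 0);
    msg[0] = '\0';
    /* RTLD_NOW resolves all undefined symbols at load time, so a broken
     * dependency fails here instead of at the first call.  RTLD_LOCAL
     * keeps the probe from exporting symbols to the rest of the process. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();
        snprintf(msg, msglen, "%s", err != NULL ? err : "unknown dlopen error");
        return -1;
    }
    dlclose(handle);
    return 0;
}
```

A caller could run the probe before selecting pmi1 and print msg together with a hint to use --mpi=pmi2. Note that loading a library also runs its constructors, so the probe is not completely free of side effects.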

any thoughts ?

Cheers,

Gilles

On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:

Ok, if the problem is moot, great.

(sidenote: this is moot, so ignore this if you want: with this explanation, I'm 
still not sure how RTLD_GLOBAL fixes the issue)


On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:


Easy enough to explain. We link libpmi into the pmix/s1 component. This library 
is missing the linkage to libslurm that contains the linkage to libauth where 
munge resides. So when we call a PMI function, libpmi references a call to 
munge for authentication and hits an “unresolved symbol” error.

Moe acknowledges the error is in Slurm and is fixing the linkages so this 
problem goes away



On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
wrote:

On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:


FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
fixing it on their side. This won’t help existing installations until they 
upgrade, but I tend to agree with Jeff about not fixing other people’s problems.

Can you explain what is happening?

I ask because I'm not sure I understand the problem such that using RTLD_GLOBAL 
would fix it.  I.e., even if libpmi1.so isn't linked against its dependencies 
properly, that shouldn't cause a problem if OMPI components A and B are both 
linked against libpmi1.so, and then A is loaded, and then B is loaded.

...or perhaps we can just discuss this on the call tomorrow?

--
Jeff Squyres

jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


_______________________________________________
devel mailing list

de...@open-mpi.org

Subscription:
http://www.open-mpi.org/mailman/listinfo.cgi/devel

Link to this post:
http://www.open-mpi.org/community/lists/devel/2014/12/16383.php

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
