2014-12-02 20:59 GMT+06:00 Edgar Gabriel <gabr...@cs.uh.edu>:
I didn't want to interfere with this thread, although I have a similar
issue, since I have the solution nearly fully cooked up. But anyway,
this last email gave the hint as to why we suddenly have the problem in
ompio:
it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is no
longer set, so the entire section is being skipped. I double-checked:
on the 1.8 branch the section is executed, but on master it is not.
Hi, Edgar.
Both master and ompi-release (isn't it 1.8?!) are identical as far as my
fix is concerned, so it may be something else. I'd like to see config.log
too, but I will only be able to look into it tomorrow.
I also want to add that SLURM PMI2 communicates with the local slurmstepd
daemons and doesn't need any authentication. PMI1 processes, by contrast,
communicate with the srun process and thus need libslurm services for
communication and authentication.
Thanks
Edgar
On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
Looks like I was totally lying in
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where
I said we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:
https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
This ltdl advice object is passed to lt_dlopen() for all
components. My mistake; sorry.
So the idea that using RTLD_GLOBAL will fix this SLURM bug is
incorrect.
I believe someone said earlier in the thread that adding the
right -llibs to the configure line will solve the issue, and
that sounds correct to me. If there's a missing symbol because
the SLURM libraries are not automatically pulling in the right
dependent libraries, then *if* we put a workaround in OMPI to
fix this issue, then the right workaround is to add the relevant
-llibs when that component is linked.
*If* you add that workaround (which is a whole separate
discussion), I would suggest adding a configure.m4 test to see
if adding the additional -llibs is necessary. Perhaps
AC_LINK_IFELSE looking for a symbol, and then if that fails,
AC_LINK_IFELSE again with the additional -llibs to see if that
works.
Or something like that.
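
A minimal sketch of the two-step link test described above. In the real
build this logic would live in configure.m4 and use AC_LINK_IFELSE; the
Python below only illustrates the control flow. The function names
(links_ok, pick_link_libs), the compiler name "cc", and the example
library names are assumptions made for illustration, not anything taken
from Open MPI.

```python
import os
import subprocess
import tempfile


def links_ok(symbol, libs, cc="cc"):
    """Return True if a tiny C program that calls `symbol` links with `libs`."""
    # Same trick autoconf uses: declare the symbol with a dummy prototype.
    source = (
        f"char {symbol}(void);\n"
        f"int main(void) {{ return (int){symbol}(); }}\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "conftest.c")
        exe = os.path.join(tmp, "conftest")
        with open(src, "w") as fh:
            fh.write(source)
        cmd = [cc, src, "-o", exe] + [f"-l{lib}" for lib in libs]
        try:
            result = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
        except OSError:
            # No usable compiler: treat as "does not link".
            return False
        return result.returncode == 0


def pick_link_libs(symbol, base_libs, extra_libs, try_link=links_ok):
    """Try base_libs alone, then base_libs + extra_libs; None if both fail."""
    if try_link(symbol, base_libs):
        return list(base_libs)
    candidate = list(base_libs) + list(extra_libs)
    if try_link(symbol, candidate):
        return candidate
    return None
```

For example, pick_link_libs("PMI_Init", ["pmi"], ["slurm"]) would return
["pmi"] on a system where libpmi links cleanly by itself, and
["pmi", "slurm"] only where the extra library is really needed.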
On Dec 2, 2014, at 6:38 AM, Artem Polyakov <artpo...@gmail.com> wrote:
Agreed. The first thing to check is what value
OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very
probably the same bug as mine.
2014-12-02 17:33 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
It does look similar - question is: why didn’t this fix the
problem? Will have to investigate.
Thanks
On Dec 2, 2014, at 3:17 AM, Artem Polyakov <artpo...@gmail.com> wrote:
2014-12-02 17:13 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
Hmmm…if that is true, then it didn’t fix this problem as
it is being reported in the master.
I had this problem on my laptop installation. You can
check my report, which was fairly detailed, and see whether
you are hitting the same issue. My fix was also included in
the 1.8 branch. I am not sure that this is the same issue,
but they look similar.
On Dec 1, 2014, at 9:40 PM, Artem Polyakov <artpo...@gmail.com> wrote:
I think this might be related to the configuration
problem I was fixing with Jeff a few months ago. See:
https://github.com/open-mpi/ompi/pull/240
2014-12-02 10:15 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
If it isn’t too much trouble, it would be good to
confirm that it remains broken. I strongly suspect
it is, based on Moe’s comments.
Obviously, other people are making this work. For
Intel MPI, all you do is point it at libpmi and they
can run. However, they do explicitly dlopen it in
their code, and I don’t know what flags they might
pass when they do so.
If necessary, I suppose we could follow that
pattern. In other words, rather than specifically
linking the “s1” component to libpmi, instead
require that the user point us to a pmi library via
an MCA param, then explicitly dlopen that library
with RTLD_GLOBAL. This avoids the issues cited by
Jeff, but resolves the pmi linkage problem.
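
A minimal sketch of that pattern, written with Python's ctypes rather
than C for brevity. The environment variable EXAMPLE_PMI_LIBRARY is a
hypothetical stand-in for the MCA param, and open_pmi_library is an
illustrative name; neither exists in Open MPI.

```python
import ctypes
import os

# Hypothetical stand-in for the MCA parameter; not a real Open MPI setting.
PMI_LIB_ENV = "EXAMPLE_PMI_LIBRARY"


def open_pmi_library(path=None):
    """dlopen() a user-supplied library with RTLD_GLOBAL.

    RTLD_GLOBAL makes the library's symbols available for resolving
    references from libraries that are loaded afterwards.
    """
    if path is None:
        path = os.environ.get(PMI_LIB_ENV)
    if not path:
        raise ValueError(f"no PMI library given; set {PMI_LIB_ENV} or pass a path")
    # ctypes passes `mode` through to dlopen() on POSIX systems.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

The caller decides which library is opened, so nothing has to be linked
against libpmi at build time.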
On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
$ srun --version
slurm 2.6.6-VENDOR_PROVIDED
$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1
$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or received
srun: error: soleil: task 0: Exited with exit code 127
$ ldd /usr/lib64/slurm/auth_munge.so
linux-vdso.so.1 => (0x00007fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
Now, if I relink auth_munge.so so that it depends on libslurm:
$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc
I can give the latest slurm a try if needed.
Cheers,
Gilles
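
The ldd check above can be automated. The following is a rough sketch
that parses ldd output and reports whether a given library appears among
the dependencies; parse_ldd and depends_on are illustrative names, and
the sample text is the ldd output quoted in the message above.

```python
import os

SAMPLE_LDD = """\
linux-vdso.so.1 => (0x00007fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x00007f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x00007f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf5400000)
"""


def parse_ldd(output):
    """Map each library named in ldd output to its resolved path (or None)."""
    deps = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the trailing load address, e.g. "(0x00007f744760f000)".
        if line.endswith(")") and "(" in line:
            line = line[: line.rindex("(")].strip()
        if "=>" in line:
            name, _, path = line.partition("=>")
            path = path.strip()
            deps[name.strip()] = path if path and path != "not found" else None
        else:
            # Entries such as the dynamic loader are listed by path only.
            deps[line] = line
    return deps


def depends_on(output, prefix):
    """Return True if any dependency's file name starts with `prefix`."""
    return any(
        os.path.basename(name).startswith(prefix) for name in parse_ldd(output)
    )
```

With the sample above, depends_on(SAMPLE_LDD, "libslurm") is False,
which matches the missing dependency that leads to the undefined
slurm_verbose symbol.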
On 2014/12/02 12:56, Ralph Castain wrote:
Out of curiosity - how are you testing
these? I have more current versions of Slurm
and would like to test the observations there.
On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
I'd like to take a step back ...
I previously tested with slurm 2.6.0, and it complained about the
slurm_verbose symbol, which is defined in libslurm.so.
So with slurm 2.6.0, RTLD_GLOBAL or relinking is ok.
Now I tested with slurm 2.6.6 and it complains about the
slurm_auth_get_arg_desc symbol, and this symbol is not defined in any
dynamic library. It is internally defined in the static libcommon.a
library, which is used to build the slurm binaries.
As far as I understand, auth_munge.so can only be invoked from a slurm
binary, which means it cannot be invoked from an MPI application even
if it is linked with libslurm, libpmi, ...
That looks like a slurm design issue that the slurm folks will take
care of.
Cheers,
Gilles
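
One quick way to check whether a symbol is reachable through a shared
library is to ask the dynamic loader directly. This is only a sketch
using Python's ctypes; exports_symbol is an illustrative name. Note that
a lookup through a handle also searches that library's own dependencies,
so a True result does not prove the symbol is defined in that exact
file.

```python
import ctypes


def exports_symbol(library, symbol):
    """Return True if the dynamic loader can resolve `symbol` via `library`.

    `library` is a path or soname; None means the running program itself.
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        # Library missing, or it failed to load at all.
        return False
    # ctypes looks the name up on attribute access and raises
    # AttributeError when it is not found, so hasattr() is enough.
    return hasattr(handle, symbol)
```

On a system showing the slurm 2.6.6 behaviour described above, one would
expect a call such as exports_symbol("libslurm.so", "slurm_auth_get_arg_desc")
to return False, but that has not been verified here.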
On 2014/12/02 12:33, Ralph Castain wrote:
Another option is to simply add the -lslurm -lauth flags to the
pmix/s1 component as this is the only place that requires it, and it
won’t hurt anything to do so.
On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
Jeff,
FWIW, you can read my analysis of what is going wrong at
http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
Bottom line, I agree this is a slurm issue (slurm plugins should depend
on libslurm, but they do not, yet).
A possible workaround would be to make the pmi component a "proxy" that
dlopens, with RTLD_GLOBAL, the "real" component in which the job is done.
That being said, the impact is quite limited (no direct launch in slurm
with pmi1, but pmi2 works fine), so it makes sense not to work around
someone else's problem.
And that being said, configure could detect this broken pmi1 and either
not build pmi1 support or print a user-friendly error message if pmi1
is used.
Any thoughts?
Cheers,
Gilles
On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
Ok, if the problem is moot, great.
(sidenote: this is moot, so ignore this if you want: with this
explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
On Dec 1, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
Easy enough to explain. We link libpmi into the pmix/s1 component. This
library is missing the linkage to libslurm that contains the linkage to
libauth where munge resides. So when we call a PMI function, libpmi
references a call to munge for authentication and hits an “unresolved
symbol” error.
Moe acknowledges the error is in Slurm and is fixing the linkages so
this problem goes away.
On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
On Dec 1, 2014, at 5:07 PM, Ralph Castain <r...@open-mpi.org> wrote:
FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against
its dependencies (the pmi-2 one is correct). Moe is aware of the
problem and fixing it on their side. This won’t help existing
installations until they upgrade, but I tend to agree with Jeff about
not fixing other people’s problems.
Can you explain what is happening?
I ask because I'm not sure I understand the problem such that using
RTLD_GLOBAL would fix it. I.e., even if libpmi1.so isn't linked against
its dependencies properly, that shouldn't cause a problem if OMPI
components A and B are both linked against libpmi1.so, and then A is
loaded, and then B is loaded.
...or perhaps we can just discuss this on the call tomorrow?
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post:
http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post:
http://www.open-mpi.org/community/lists/devel/2014/12/16404.php