Re: [OMPI devel] RTLD_GLOBAL question

2014-12-04 Thread Artem Polyakov
2014-12-04 17:29 GMT+06:00 Jeff Squyres (jsquyres) :

> On Dec 3, 2014, at 11:35 PM, Artem Polyakov  wrote:
>
> > Jeff, I must admit that I don't completely understand how your fix works.
> Can you explain to me why this variant was failing:
> >
> > CPPFLAGS="-I$srcdir/opal/libltdl/"
> > AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h]
> >
> > while the new one:
> >
> > CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl/"
> > AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
> >  [OPAL_HAVE_LTDL_ADVISE=1])
> >
> > works well?
> >
> > Are there additional header files that are included in conftest.c and have
> to be reached through $srcdir?
>
> No, it was simpler than that: "." (i.e., $srcdir in a non-VPATH build) is
> not necessarily in the default include search path for <> files (which is
> what AC_EGREP_HEADER uses).  For example:
>
> -
> [3:24] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
> $ cat test.c
> #include <./opal/libltdl/ltdl.h>
> [3:24] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
> $ gcc -E test.c > /dev/null
> test.c:1:33: fatal error: ./opal/libltdl/ltdl.h: No such file or directory
>  #include <./opal/libltdl/ltdl.h>
>  ^
> compilation terminated.
> -
>
> Notice that if I don't have -I. (i.e., -I$srcdir), the above compilation
> fails because it can't find <./opal/libltdl/ltdl.h>.
>
> But if I add -I., then the file can be found:
>
> -
> [3:24] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
> $ gcc -E test.c -I. > /dev/null
> [3:25] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
> $ echo $status
> 0
> -
>
> And since we're passing -I$srcdir, there's no need to include $srcdir in the
> filename.  Indeed, if $srcdir==., then adding it in the filename is
> harmless.  But if $srcdir=/path/to/somewhere, it's actually a problem.
> Regardless, $srcdir should no longer be in the filename.
>
> The part I forgot was that your version of libtool also requires some sub
> header files in the $srcdir/opal/libltdl tree, so a -I for that also needs
> to be there.
>
> Make sense?
>
Yes. Thank you!


>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16433.php




-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-04 Thread Jeff Squyres (jsquyres)
On Dec 3, 2014, at 11:35 PM, Artem Polyakov  wrote:

> Jeff, I must admit that I don't completely understand how your fix works. Can 
> you explain to me why this variant was failing:
> 
> CPPFLAGS="-I$srcdir/opal/libltdl/"
> AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h]
> 
> while the new one:
> 
> CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl/"
> AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
>  [OPAL_HAVE_LTDL_ADVISE=1])
> 
> works well?
> 
> Are there additional header files that are included in conftest.c and have to 
> be reached through $srcdir?

No, it was simpler than that: "." (i.e., $srcdir in a non-VPATH build) is not 
necessarily in the default include search path for <> files (which is what 
AC_EGREP_HEADER uses).  For example:

-
[3:24] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
$ cat test.c
#include <./opal/libltdl/ltdl.h>
[3:24] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
$ gcc -E test.c > /dev/null
test.c:1:33: fatal error: ./opal/libltdl/ltdl.h: No such file or directory
 #include <./opal/libltdl/ltdl.h>
 ^
compilation terminated.
-

Notice that if I don't have -I. (i.e., -I$srcdir), the above compilation fails 
because it can't find <./opal/libltdl/ltdl.h>.

But if I add -I., then the file can be found:

-
[3:24] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
$ gcc -E test.c -I. > /dev/null
[3:25] savbu-usnic-a:~/g/ompi (topic/master-libfabric●)
$ echo $status
0
-

And since we're passing -I$srcdir, there's no need to include $srcdir in the filename.  
Indeed, if $srcdir==., then adding it in the filename is harmless.  But if 
$srcdir=/path/to/somewhere, it's actually a problem.  Regardless, $srcdir 
should no longer be in the filename.

The part I forgot was that your version of libtool also requires some sub 
header files in the $srcdir/opal/libltdl tree, so a -I for that also needs to 
be there.
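
A minimal shell sketch of that lookup rule, assuming a throwaway directory that stands in for the source tree. resolve_header is a hypothetical helper, not part of Autoconf or gcc, and it deliberately ignores the system include directories a real preprocessor would also search:

```shell
#!/bin/sh
# Simplified model of how the preprocessor resolves #include <name>:
# try each -I directory in order. The current directory is NOT implied.
resolve_header() {
    name=$1; shift
    for dir in "$@"; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Hypothetical stand-in for $srcdir, populated with empty header files.
srcdir=$(mktemp -d)
mkdir -p "$srcdir/opal/libltdl/libltdl"
: > "$srcdir/opal/libltdl/ltdl.h"
: > "$srcdir/opal/libltdl/libltdl/lt_system.h"

# Only -I$srcdir/opal/libltdl: <opal/libltdl/ltdl.h> cannot be resolved.
resolve_header opal/libltdl/ltdl.h "$srcdir/opal/libltdl" || echo "not found"

# -I$srcdir plus -I$srcdir/opal/libltdl: both the top-level header and
# the sub-header that ltdl.h pulls in can be resolved.
resolve_header opal/libltdl/ltdl.h "$srcdir" "$srcdir/opal/libltdl"
resolve_header libltdl/lt_system.h "$srcdir" "$srcdir/opal/libltdl"
```

The first call models the search path without -I$srcdir; the last two model the search path with both -I flags.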

Make sense?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 Thread Artem Polyakov
Jeff, I must admit that I don't completely understand how your fix works.
Can you explain to me why this variant was failing:

CPPFLAGS="-I$srcdir/opal/libltdl/"
AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h]

while the new one:

CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl/"
AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
 [OPAL_HAVE_LTDL_ADVISE=1])

works well?

Are there additional header files that are included in conftest.c and have to
be reached through $srcdir?

2014-12-03 20:51 GMT+06:00 Jeff Squyres (jsquyres) :

> Thanks!
>
> On Dec 3, 2014, at 7:03 AM, Artem Polyakov  wrote:
>
> >
> >
> > On Wednesday, December 3, 2014, Jeff Squyres (jsquyres) wrote:
> > They were equivalent until yesterday.  :-)
> > I see. Got that!
> >
> > I was going to file a PR to bring the changes over to v1.8, but not
> until they had shaken out on master.
> >
> > Would you mind filing a PR?
> > Sure, will do that asap.
> >
> >
> >
> >
> > On Dec 3, 2014, at 5:56 AM, Artem Polyakov  wrote:
> >
> > > I finally found the clear reason of this strange situation!
> > >
> > > In ompi opal_setup_libltdl.m4 has the following content:
> > > CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl"
> > > AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
> > > [OPAL_HAVE_LTDL_ADVISE=1])
> > >
> > > And in ompi-release opal_setup_libltdl.m4:
> > > CPPFLAGS="-I$srcdir/opal/libltdl/"
> > > # Must specifically mention $srcdir here for VPATH builds
> > > # (this file is in the src tree).
> > > AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h],
> > >   [OPAL_HAVE_LTDL_ADVISE=1])
> > >
> > > This was thesource of my mistake and confusion. In ompi we check for
> "opal/libltdl/ltdl.h" and we do need -I$srcdir and in ompi-release we check
> for "$srcdir/opal/libltdl/ltdl.h". I didn't noticed that wen did the
> backport from ompi-release to ompi. I really thought that this files are
> equal.
> > >
> > > I think we need to converge to the unified solution.
> > >
> > >
> > > 2014-12-03 10:23 GMT+06:00 Ralph Castain :
> > > It is working for me, but I’m not sure if that is because of these
> changes or if it always worked for me. I haven’t tested the slurm
> integration in awhile.
> > >
> > >
> > >> On Dec 2, 2014, at 7:59 PM, Artem Polyakov 
> wrote:
> > >>
> > >> Howard, does current master fix your problems?
> > >>
> > >> On Wednesday, December 3, 2014, Artem Polyakov wrote:
> > >>
> > >> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):
> > >> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
> > >>
> > >> > Jeff, your fix brakes my system again. Actually you just reverted
> my changes.
> > >>
> > >> No, I didn't just revert them -- I made changes.  I did forget about
> the second -I, though (to be fair, the 2nd -I was the *only* -I in there
> before I committed).
> > >> Yeah! I was speaking figurally :).
> > >>
> > >> Sorry about that -- I've tested your change (without the trailing /)
> and it seems to work ok.  I'd go ahead and merge.
> > >>
> > >> --
> > >> Jeff Squyres
> > >> jsquy...@cisco.com
> > >> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> > >>
> > >> ___
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
> > >>
> > >>
> > >>
> > >> --
> > >> С Уважением, Поляков Артем Юрьевич
> > >> Best regards, Artem Y. Polyakov
> > >>
> > >>
> > >> --
> > >> -
> > >> Best regards, Artem Polyakov
> > >> (Mobile mail)
> > >> ___
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php
> > >
> > >
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16417.php
> > >
> > >
> > >
> > > --
> > > С Уважением, Поляков Артем Юрьевич
> > > Best regards, Artem Y. Polyakov
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16421.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 Thread Howard Pritchard
Hello Artem,

No, but I was also told by schedmd that the slurm we have on our systems is
ancient.

So I'm no longer considering this problem very important.  We have a
workaround of always configuring
with --disable-dlopen.

Thanks,

Howard



2014-12-02 20:59 GMT-07:00 Artem Polyakov :

> Howard, does current master fix your problems?
>
> On Wednesday, December 3, 2014, Artem Polyakov wrote:
>
>>
>> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :
>>
>>> On Dec 2, 2014, at 8:43 PM, Artem Polyakov  wrote:
>>>
>>> > Jeff, your fix brakes my system again. Actually you just reverted my
>>> changes.
>>>
>>> No, I didn't just revert them -- I made changes.  I did forget about the
>>> second -I, though (to be fair, the 2nd -I was the *only* -I in there before
>>> I committed).
>>>
>> Yeah! I was speaking figurally :).
>>
>>
>>> Sorry about that -- I've tested your change (without the trailing /) and
>>> it seems to work ok.  I'd go ahead and merge.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>>>
>>
>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>
>
> --
> -
> Best regards, Artem Polyakov
> (Mobile mail)
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php
>


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 Thread Jeff Squyres (jsquyres)
Thanks!

On Dec 3, 2014, at 7:03 AM, Artem Polyakov  wrote:

> 
> 
> On Wednesday, December 3, 2014, Jeff Squyres (jsquyres) wrote:
> They were equivalent until yesterday.  :-)
> I see. Got that! 
> 
> I was going to file a PR to bring the changes over to v1.8, but not until 
> they had shaken out on master.
> 
> Would you mind filing a PR?
> Sure, will do that asap. 
> 
> 
>  
> 
> On Dec 3, 2014, at 5:56 AM, Artem Polyakov  wrote:
> 
> > I finally found the clear reason of this strange situation!
> >
> > In ompi opal_setup_libltdl.m4 has the following content:
> > CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl"
> > AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
> > [OPAL_HAVE_LTDL_ADVISE=1])
> >
> > And in ompi-release opal_setup_libltdl.m4:
> > CPPFLAGS="-I$srcdir/opal/libltdl/"
> > # Must specifically mention $srcdir here for VPATH builds
> > # (this file is in the src tree).
> > AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h],
> >   [OPAL_HAVE_LTDL_ADVISE=1])
> >
> > This was thesource of my mistake and confusion. In ompi we check for 
> > "opal/libltdl/ltdl.h" and we do need -I$srcdir and in ompi-release we check 
> > for "$srcdir/opal/libltdl/ltdl.h". I didn't noticed that wen did the 
> > backport from ompi-release to ompi. I really thought that this files are 
> > equal.
> >
> > I think we need to converge to the unified solution.
> >
> >
> > 2014-12-03 10:23 GMT+06:00 Ralph Castain :
> > It is working for me, but I’m not sure if that is because of these changes 
> > or if it always worked for me. I haven’t tested the slurm integration in 
> > awhile.
> >
> >
> >> On Dec 2, 2014, at 7:59 PM, Artem Polyakov  wrote:
> >>
> >> Howard, does current master fix your problems?
> >>
> >> On Wednesday, December 3, 2014, Artem Polyakov wrote:
> >>
> >> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :
> >> On Dec 2, 2014, at 8:43 PM, Artem Polyakov  wrote:
> >>
> >> > Jeff, your fix brakes my system again. Actually you just reverted my 
> >> > changes.
> >>
> >> No, I didn't just revert them -- I made changes.  I did forget about the 
> >> second -I, though (to be fair, the 2nd -I was the *only* -I in there 
> >> before I committed).
> >> Yeah! I was speaking figurally :).
> >>
> >> Sorry about that -- I've tested your change (without the trailing /) and 
> >> it seems to work ok.  I'd go ahead and merge.
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to: 
> >> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
> >>
> >>
> >>
> >> --
> >> С Уважением, Поляков Артем Юрьевич
> >> Best regards, Artem Y. Polyakov
> >>
> >>
> >> --
> >> -
> >> Best regards, Artem Polyakov
> >> (Mobile mail)
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php
> >
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/12/16417.php
> >
> >
> >
> > --
> > С Уважением, Поляков Артем Юрьевич
> > Best regards, Artem Y. Polyakov
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/12/16421.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16422.php
> 
> 
> -- 
> -
> Best regards, Artem Polyakov
> (Mobile mail)
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16423.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 Thread Artem Polyakov
On Wednesday, December 3, 2014, Jeff Squyres (jsquyres) wrote:

> They were equivalent until yesterday.  :-)

I see. Got that!

>
> I was going to file a PR to bring the changes over to v1.8, but not until
> they had shaken out on master.
>
> Would you mind filing a PR?

Sure, will do that asap.


>

>
> On Dec 3, 2014, at 5:56 AM, Artem Polyakov wrote:
>
> > I finally found the clear reason of this strange situation!
> >
> > In ompi opal_setup_libltdl.m4 has the following content:
> > CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl"
> > AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
> > [OPAL_HAVE_LTDL_ADVISE=1])
> >
> > And in ompi-release opal_setup_libltdl.m4:
> > CPPFLAGS="-I$srcdir/opal/libltdl/"
> > # Must specifically mention $srcdir here for VPATH builds
> > # (this file is in the src tree).
> > AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h],
> >   [OPAL_HAVE_LTDL_ADVISE=1])
> >
> > This was thesource of my mistake and confusion. In ompi we check for
> "opal/libltdl/ltdl.h" and we do need -I$srcdir and in ompi-release we check
> for "$srcdir/opal/libltdl/ltdl.h". I didn't noticed that wen did the
> backport from ompi-release to ompi. I really thought that this files are
> equal.
> >
> > I think we need to converge to the unified solution.
> >
> >
> > 2014-12-03 10:23 GMT+06:00 Ralph Castain:
> > It is working for me, but I’m not sure if that is because of these
> changes or if it always worked for me. I haven’t tested the slurm
> integration in awhile.
> >
> >
> >> On Dec 2, 2014, at 7:59 PM, Artem Polyakov wrote:
> >>
> >> Howard, does current master fix your problems?
> >>
> >> On Wednesday, December 3, 2014, Artem Polyakov wrote:
> >>
> >> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):
> >> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
> >>
> >> > Jeff, your fix brakes my system again. Actually you just reverted my
> changes.
> >>
> >> No, I didn't just revert them -- I made changes.  I did forget about
> the second -I, though (to be fair, the 2nd -I was the *only* -I in there
> before I committed).
> >> Yeah! I was speaking figurally :).
> >>
> >> Sorry about that -- I've tested your change (without the trailing /)
> and it seems to work ok.  I'd go ahead and merge.
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com 
> >> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org 
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
> >>
> >>
> >>
> >> --
> >> С Уважением, Поляков Артем Юрьевич
> >> Best regards, Artem Y. Polyakov
> >>
> >>
> >> --
> >> -
> >> Best regards, Artem Polyakov
> >> (Mobile mail)
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org 
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php
> >
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16417.php
> >
> >
> >
> > --
> > С Уважением, Поляков Артем Юрьевич
> > Best regards, Artem Y. Polyakov
> > ___
> > devel mailing list
> > de...@open-mpi.org 
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16421.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16422.php



-- 
-
Best regards, Artem Polyakov
(Mobile mail)


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 Thread Jeff Squyres (jsquyres)
They were equivalent until yesterday.  :-)

I was going to file a PR to bring the changes over to v1.8, but not until they 
had shaken out on master.

Would you mind filing a PR?


On Dec 3, 2014, at 5:56 AM, Artem Polyakov  wrote:

> I finally found the clear reason of this strange situation!
> 
> In ompi opal_setup_libltdl.m4 has the following content:
> CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl"
> AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
> [OPAL_HAVE_LTDL_ADVISE=1])
> 
> And in ompi-release opal_setup_libltdl.m4:
> CPPFLAGS="-I$srcdir/opal/libltdl/"
> # Must specifically mention $srcdir here for VPATH builds
> # (this file is in the src tree).
> AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h],
>   [OPAL_HAVE_LTDL_ADVISE=1])
> 
> This was thesource of my mistake and confusion. In ompi we check for 
> "opal/libltdl/ltdl.h" and we do need -I$srcdir and in ompi-release we check 
> for "$srcdir/opal/libltdl/ltdl.h". I didn't noticed that wen did the backport 
> from ompi-release to ompi. I really thought that this files are equal.
> 
> I think we need to converge to the unified solution.
> 
> 
> 2014-12-03 10:23 GMT+06:00 Ralph Castain :
> It is working for me, but I’m not sure if that is because of these changes or 
> if it always worked for me. I haven’t tested the slurm integration in awhile.
> 
> 
>> On Dec 2, 2014, at 7:59 PM, Artem Polyakov  wrote:
>> 
>> Howard, does current master fix your problems?
>> 
>> On Wednesday, December 3, 2014, Artem Polyakov wrote:
>> 
>> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :
>> On Dec 2, 2014, at 8:43 PM, Artem Polyakov  wrote:
>> 
>> > Jeff, your fix brakes my system again. Actually you just reverted my 
>> > changes.
>> 
>> No, I didn't just revert them -- I made changes.  I did forget about the 
>> second -I, though (to be fair, the 2nd -I was the *only* -I in there before 
>> I committed).
>> Yeah! I was speaking figurally :).
>>  
>> Sorry about that -- I've tested your change (without the trailing /) and it 
>> seems to work ok.  I'd go ahead and merge.
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>> 
>> 
>> 
>> -- 
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>> 
>> 
>> -- 
>> -
>> Best regards, Artem Polyakov
>> (Mobile mail)
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16417.php
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16421.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 Thread Artem Polyakov
I finally found the clear reason for this strange situation!

In ompi, opal_setup_libltdl.m4 has the following content:
CPPFLAGS="-I$srcdir -I$srcdir/opal/libltdl"
AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h],
[OPAL_HAVE_LTDL_ADVISE=1])

And in ompi-release opal_setup_libltdl.m4:
CPPFLAGS="-I$srcdir/opal/libltdl/"
# Must specifically mention $srcdir here for VPATH builds
# (this file is in the src tree).
AC_EGREP_HEADER([lt_dladvise_init], [$srcdir/opal/libltdl/ltdl.h],
[OPAL_HAVE_LTDL_ADVISE=1])

This was the source of my mistake and confusion. In ompi we check for
"opal/libltdl/ltdl.h", so we do need -I$srcdir, while in ompi-release we check
for "$srcdir/opal/libltdl/ltdl.h". I didn't notice that when I did the
backport from ompi-release to ompi. I really thought that these files were
identical.

I think we need to converge on a unified solution.


2014-12-03 10:23 GMT+06:00 Ralph Castain :

> It is working for me, but I’m not sure if that is because of these changes
> or if it always worked for me. I haven’t tested the slurm integration in
> awhile.
>
>
> On Dec 2, 2014, at 7:59 PM, Artem Polyakov  wrote:
>
> Howard, does current master fix your problems?
>
> On Wednesday, December 3, 2014, Artem Polyakov wrote:
>
>>
>> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :
>>
>>> On Dec 2, 2014, at 8:43 PM, Artem Polyakov  wrote:
>>>
>>> > Jeff, your fix brakes my system again. Actually you just reverted my
>>> changes.
>>>
>>> No, I didn't just revert them -- I made changes.  I did forget about the
>>> second -I, though (to be fair, the 2nd -I was the *only* -I in there before
>>> I committed).
>>>
>> Yeah! I was speaking figurally :).
>>
>>
>>> Sorry about that -- I've tested your change (without the trailing /) and
>>> it seems to work ok.  I'd go ahead and merge.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>>>
>>
>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>
>
> --
> -
> Best regards, Artem Polyakov
> (Mobile mail)
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16417.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Ralph Castain
It is working for me, but I'm not sure if that is because of these changes or 
if it always worked for me. I haven't tested the slurm integration in a while.


> On Dec 2, 2014, at 7:59 PM, Artem Polyakov  wrote:
> 
> Howard, does current master fix your problems?
> 
> On Wednesday, December 3, 2014, Artem Polyakov wrote:
> 
> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):
> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
> 
> > Jeff, your fix brakes my system again. Actually you just reverted my 
> > changes.
> 
> No, I didn't just revert them -- I made changes.  I did forget about the 
> second -I, though (to be fair, the 2nd -I was the *only* -I in there before I 
> committed).
> Yeah! I was speaking figurally :).
>  
> Sorry about that -- I've tested your change (without the trailing /) and it 
> seems to work ok.  I'd go ahead and merge.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com <>
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/ 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org <>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php 
> 
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> 
> 
> -- 
> -
> Best regards, Artem Polyakov
> (Mobile mail)
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php 
> 


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
Howard, does current master fix your problems?

On Wednesday, December 3, 2014, Artem Polyakov wrote:

>
> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):
>
>> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
>>
>> > Jeff, your fix brakes my system again. Actually you just reverted my
>> changes.
>>
>> No, I didn't just revert them -- I made changes.  I did forget about the
>> second -I, though (to be fair, the 2nd -I was the *only* -I in there before
>> I committed).
>>
> Yeah! I was speaking figurally :).
>
>
>> Sorry about that -- I've tested your change (without the trailing /) and
>> it seems to work ok.  I'd go ahead and merge.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com 
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>>
>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>


-- 
-
Best regards, Artem Polyakov
(Mobile mail)


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres) :

> On Dec 2, 2014, at 8:43 PM, Artem Polyakov  wrote:
>
> > Jeff, your fix brakes my system again. Actually you just reverted my
> changes.
>
> No, I didn't just revert them -- I made changes.  I did forget about the
> second -I, though (to be fair, the 2nd -I was the *only* -I in there before
> I committed).
>
Yeah! I was speaking figuratively :).


> Sorry about that -- I've tested your change (without the trailing /) and
> it seems to work ok.  I'd go ahead and merge.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Jeff Squyres (jsquyres)
On Dec 2, 2014, at 8:43 PM, Artem Polyakov  wrote:

> Jeff, your fix brakes my system again. Actually you just reverted my changes.

No, I didn't just revert them -- I made changes.  I did forget about the second 
-I, though (to be fair, the 2nd -I was the *only* -I in there before I 
committed).

Sorry about that -- I've tested your change (without the trailing /) and it 
seems to work ok.  I'd go ahead and merge.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
Hello,

Jeff, your fix breaks my system again. Actually you just reverted my
changes. Here is what I have:

configure:5441: *** GNU libltdl setup
configure:296939: checking location of libltdl
configure:296952: result: internal copy
configure:297028: OPAL configuring in opal/libltdl
configure:297113: running /bin/bash '.../opal/libltdl/configure'
 '--prefix=.../ompi-pmix-refactoring_install/' '--enable-debug'
'--disable-oshmem' '--with-pmi=/home/artpol/sandboxes/slurm/'
--enable-ltdl-convenience --disable-ltdl-install --enable-shared
--disable-static --cache-file=/dev/null --srcdir=.../opal/libltdl
--disable-option-checking
configure:297119: /bin/bash '.../opal/libltdl/configure' succeeded for
opal/libltdl
In file included from conftest.c:718:0:
.../opal/libltdl/ltdl.h:36:31: fatal error: libltdl/lt_system.h: No such
file or directory
 #include <libltdl/lt_system.h>
   ^
compilation terminated.
configure:297864: checking for lt_dladvise
configure:297870: result: no
configure:297923: creating ./config.lt

To my surprise, this error (I am sure!) occurs on every system, but only on
mine does the check fail to enable advise! I checked that on other machines!

The reason was pointed out in the original PR:
ltdl.h has these includes:

#include <libltdl/lt_system.h>
#include <libltdl/lt_error.h>


Those can't be found without "-I$srcdir/opal/libltdl/".

The point is that we DO need "-I$srcdir/opal/libltdl/", but we ALSO need
"-I$srcdir"! I filed a new PR (https://github.com/open-mpi/ompi/pull/301)
but won't merge it until Edgar confirms that it's OK on his system.

So the original mistake was removing -I$srcdir. I was sure that we had
converged on this, and I found another valuable discussion in ompi-release:
https://github.com/open-mpi/ompi-release/pull/34

There I was looking into configure script and found:

CPPFLAGS="-I$srcdir/ -I$srcdir/opal/libltdl/"
# Must specifically mention $srcdir here for VPATH builds
# (this file is in the src tree).
cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h.  */
#include <$srcdir/opal/libltdl/ltdl.h>
_ACEOF


It seemed obvious that we don't need "-I$srcdir/" because $srcdir was hardcoded
in the include, but it turns out that I was wrong; maybe some other
build system emits different code. I would like to see Edgar's original
config.log. Jeff, could you send it to me directly?

So, everybody, sorry for the inconvenience!
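
To illustrate why both directories are needed, here is a simplified model of
how an #include <...> name is resolved: each -I directory is tried in order
and the first existing file wins. This is only a sketch. resolve_include is a
hypothetical helper (it is not part of configure), the throwaway tree only
mimics the layout implied by the error message above, and the compiler's
built-in system directories are ignored.

```shell
#!/bin/sh
# Simplified model of angle-bracket include lookup: try each -I directory
# in order and print the first path that exists. resolve_include is a
# hypothetical helper for illustration; real compilers also search their
# built-in system directories.
resolve_include() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Build a throwaway tree that mimics the layout discussed in this thread.
srcdir=$(mktemp -d)
mkdir -p "$srcdir/opal/libltdl/libltdl"
: > "$srcdir/opal/libltdl/ltdl.h"
: > "$srcdir/opal/libltdl/libltdl/lt_system.h"

# With only -I$srcdir/opal/libltdl, <opal/libltdl/ltdl.h> is not found ...
resolve_include opal/libltdl/ltdl.h "$srcdir/opal/libltdl" \
    || echo "ltdl.h: not found with -I\$srcdir/opal/libltdl alone"

# ... and with only -I$srcdir, the nested <libltdl/lt_system.h> is not found.
resolve_include libltdl/lt_system.h "$srcdir" \
    || echo "lt_system.h: not found with -I\$srcdir alone"

# With both directories on the list, both names resolve.
resolve_include opal/libltdl/ltdl.h "$srcdir" "$srcdir/opal/libltdl"
resolve_include libltdl/lt_system.h "$srcdir" "$srcdir/opal/libltdl"
```

The same reasoning applies to the real tree: the header named in the check is
reached through -I$srcdir, and the headers it pulls in are reached through
-I$srcdir/opal/libltdl.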


2014-12-03 0:41 GMT+06:00 Jeff Squyres (jsquyres) :

> See https://github.com/open-mpi/ompi/pull/298 for a fix.
>
> There's 2 commits on that PR -- the 2nd is just a cleanup.  The real fix
> is the 1st commit, here:
>
>
> https://github.com/jsquyres/ompi/commit/a736d83fb9a7b27986a008a2cda6eb1fea839fb3
>
> If someone can confirm that this works for them, we can bring it to master.
>
> It may have the side effect of "fixing / working around" (by coincidence)
> the SLURM bug (we all agree that the Right solution is to have SLURM fix it
> upstream, but I think this will put us back in the case of "working by
> accident / despite the SLURM bug").
>
>
>
> On Dec 2, 2014, at 10:59 AM, Jeff Squyres (jsquyres) 
> wrote:
>
> > I'm able to replicate Edgar's problem.
> >
> > I'm investigating...
> >
> >
> > On Dec 2, 2014, at 10:39 AM, Edgar Gabriel  wrote:
> >
> >> the mailing list refused to let me add the config.log file, since it is
> >> too large, I can forward the output to you directly as well (as I did to
> >> Jeff).
> >>
> >> I honestly have not looked into the configure logic, I can just tell
> >> that OPAL_HAVE_LTDL_ADVISE is not set on my linux system for master, but is
> >> set on the 1.8 series (1.8 series checkout was from Nov. 20, so if
> >> something changed in between the result might be different).
> >>
> >>
> >>
> >> On 12/2/2014 9:27 AM, Artem Polyakov wrote:
> >>>
> >>> 2014-12-02 20:59 GMT+06:00 Edgar Gabriel:
> >>>
> >>>   didn't want to interfere with this thread, although I have a similar
> >>>   issue, since I have the solution nearly fully cooked up. But anyway,
> >>>   this last email gave the hint on why we have suddenly the problem in
> >>>   ompio:
> >>>
> >>>   it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
> >>>   set anymore, so the entire section is being skipped. I double
> >>>   checked that with the 1.8 branch, it goes through the section, but
> >>>   not with master.
> >>>
> >>>
> >>> Hi, Edgar.
> >>>
> >>> Both master and ompi-release (isn't it 1.8?!) are the same as far as my
> >>> fix is concerned. Something else!? I'd like to see config.log too but
> >>> will look into it only tomorrow.
> >>>
> >>> Also I want to add that SLURM PMI2 communicates with the local slurmstepd's
> >>> and doesn't need any authentication. All PMI1 processes, on the other hand,
> >>> communicate with the srun process and thus need libslurm services for
> >>> communication and authentication.
> >>>
> >>>
> >>>   Thanks
> >>>   Edgar
> >>>
> >>>
> >>>
> >>>
> >>>   On 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Jeff Squyres (jsquyres)
I'm able to replicate Edgar's problem.

I'm investigating...


On Dec 2, 2014, at 10:39 AM, Edgar Gabriel  wrote:

> the mailing list refused to let me add the config.log file, since it is too 
> large, I can forward the output to you directly as well (as I did to Jeff).
> 
> I honestly have not looked into the configure logic, I can just tell that 
> OPAL_HAVE_LTDL_ADVISE is not set on my linux system for master, but is set on 
> the 1.8 series (1.8 series checkout was from Nov. 20, so if something changed 
> in between the result might be different).
> 
> 
> 
> On 12/2/2014 9:27 AM, Artem Polyakov wrote:
>> 
>> 2014-12-02 20:59 GMT+06:00 Edgar Gabriel:
>> 
>>didn't want to interfere with this thread, although I have a similar
>>issue, since I have the solution nearly fully cooked up. But anyway,
>>this last email gave the hint on why we have suddenly the problem in
>>ompio:
>> 
>>it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
>>set anymore, so the entire section is being skipped. I double
>>checked that with the 1.8 branch, it goes through the section, but
>>not with master.
>> 
>> 
>> Hi, Edgar.
>> 
>> Both master and ompi-release (isn't it 1.8?!) are the same as far as my
>> fix is concerned. Something else!? I'd like to see config.log too but
>> will look into it only tomorrow.
>> 
>> Also I want to add that SLURM PMI2 communicates with the local slurmstepd's
>> and doesn't need any authentication. All PMI1 processes, on the other hand,
>> communicate with the srun process and thus need libslurm services for
>> communication and authentication.
>> 
>> 
>>Thanks
>>Edgar
>> 
>> 
>> 
>> 
>>On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>> 
>>Looks like I was totally lying in
>>http://www.open-mpi.org/community/lists/devel/2014/12/16381.php
>> 
>> (where
>>I said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>> 
>>
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>
>> 
>> 
>>This ltdl advice object is passed to lt_dlopen() for all
>>components.  My mistake; sorry.
>> 
>>So the idea that using RTLD_GLOBAL will fix this SLURM bug is
>>incorrect.
>> 
>>I believe someone said earlier in the thread that adding the
>>right -llibs to the configure line will solve the issue, and
>>that sounds correct to me.  If there's a missing symbol because
>>the SLURM libraries are not automatically pulling in the right
>>dependent libraries, then *if* we put a workaround in OMPI to
>>fix this issue, then the right workaround is to add the relevant
>>-llibs when that component is linked.
>> 
>>*If* you add that workaround (which is a whole separate
>>discussion), I would suggest adding a configure.m4 test to see
>>if adding the additional -llibs are necessary.  Perhaps
>>AC_LINK_IFELSE looking for a symbol, and then if that fails,
>>AC_LINK_IFELSE again with the additional -llibs to see if that
>>works.
>> 
>>Or something like that.
>> 
>> 
>> 
>>On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:
>> 
>>Agree. The first thing to check is what value
>>OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very
>>probably the same bug as mine.
>> 
>>2014-12-02 17:33 GMT+06:00 Ralph Castain:
>>It does look similar - question is: why didn’t this fix the
>>problem? Will have to investigate.
>> 
>>Thanks
>> 
>> 
>>On Dec 2, 2014, at 3:17 AM, Artem Polyakov wrote:
>> 
>> 
>> 
>>2014-12-02 17:13 GMT+06:00 Ralph Castain:
>>Hmmm…if that is true, then it didn’t fix this problem as
>>it is being reported in the master.
>> 
>>I had this problem on my laptop installation. You can
>>check my report (it was detailed enough) and see if you are
>>hitting the same issue. My fix was also included in the
>>1.8 branch. I am not sure that this is the same issue,
>>but they look similar.
>> 
>> 
>> 
>>On Dec 1, 2014, at 9:40 PM, Artem Polyakov wrote:
>> 
>>I think this might be related to the configuration
>>problem I 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Edgar Gabriel
The mailing list refused to let me attach the config.log file since it is 
too large; I can forward the output to you directly as well (as I did to 
Jeff).


I honestly have not looked into the configure logic; I can just tell 
that OPAL_HAVE_LTDL_ADVISE is not set on my Linux system for master, but 
is set on the 1.8 series (the 1.8 series checkout was from Nov. 20, so if 
something changed in between, the result might be different).
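
For anyone who wants to check the same thing on their own build, the value can
be read back from a generated header after configure has run. The following is
a minimal sketch: get_define is a hypothetical helper, and the sample header
is made up for the demonstration rather than taken from a real configure run.

```shell
#!/bin/sh
# get_define FILE NAME: print the value of "#define NAME VALUE" found in FILE.
# Returns non-zero if NAME is not defined there. Hypothetical helper.
get_define() {
    awk -v name="$2" '
        $1 == "#define" && $2 == name { print $3; found = 1 }
        END { exit !found }
    ' "$1"
}

# Sample header standing in for the file configure would generate.
hdr=$(mktemp)
cat > "$hdr" <<'EOF'
/* sample contents, not real configure output */
#define OPAL_HAVE_LTDL_ADVISE 0
EOF

get_define "$hdr" OPAL_HAVE_LTDL_ADVISE   # prints 0
```

On a real build tree one would point it at the generated config header (I
believe that is opal/include/opal_config.h, but treat the path as an
assumption) and compare master against the 1.8 series.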




On 12/2/2014 9:27 AM, Artem Polyakov wrote:


2014-12-02 20:59 GMT+06:00 Edgar Gabriel:

didn't want to interfere with this thread, although I have a similar
issue, since I have the solution nearly fully cooked up. But anyway,
this last email gave the hint on why we have suddenly the problem in
ompio:

it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
set anymore, so the entire section is being skipped. I double
checked that with the 1.8 branch, it goes through the section, but
not with master.


Hi, Edgar.

Both master and ompi-release (isn't it 1.8?!) are the same as far as my
fix is concerned. Something else!? I'd like to see config.log too but
will look into it only tomorrow.

Also I want to add that SLURM PMI2 communicates with the local slurmstepd's
and doesn't need any authentication. All PMI1 processes, on the other hand,
communicate with the srun process and thus need libslurm services for
communication and authentication.


Thanks
Edgar




On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:

Looks like I was totally lying in
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php
 (where
I said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:


https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124



This ltdl advice object is passed to lt_dlopen() for all
components.  My mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is
incorrect.

I believe someone said earlier in the thread that adding the
right -llibs to the configure line will solve the issue, and
that sounds correct to me.  If there's a missing symbol because
the SLURM libraries are not automatically pulling in the right
dependent libraries, then *if* we put a workaround in OMPI to
fix this issue, then the right workaround is to add the relevant
-llibs when that component is linked.

*If* you add that workaround (which is a whole separate
discussion), I would suggest adding a configure.m4 test to see
if adding the additional -llibs are necessary.  Perhaps
AC_LINK_IFELSE looking for a symbol, and then if that fails,
AC_LINK_IFELSE again with the additional -llibs to see if that
works.

Or something like that.



On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:

Agree. The first thing to check is what value
OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very
probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain:
It does look similar - question is: why didn’t this fix the
problem? Will have to investigate.

Thanks


On Dec 2, 2014, at 3:17 AM, Artem Polyakov wrote:



2014-12-02 17:13 GMT+06:00 Ralph Castain:
Hmmm…if that is true, then it didn’t fix this problem as
it is being reported in the master.

I had this problem on my laptop installation. You can
check my report (it was detailed enough) and see if you are
hitting the same issue. My fix was also included in the
1.8 branch. I am not sure that this is the same issue,
but they look similar.



On Dec 1, 2014, at 9:40 PM, Artem Polyakov wrote:

I think this might be related to the configuration
problem I was fixing with Jeff a few months ago. Refer
here:
https://github.com/open-mpi/ompi/pull/240


2014-12-02 10:15 GMT+06:00 Ralph Castain:
If it isn’t too much trouble, it would be good to

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
2014-12-02 20:59 GMT+06:00 Edgar Gabriel :

> didn't want to interfere with this thread, although I have a similar
> issue, since I have the solution nearly fully cooked up. But anyway, this
> last email gave the hint on why we have suddenly the problem in ompio:
>
> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set
> anymore, so the entire section is being skipped. I double checked that with
> the 1.8 branch, it goes through the section, but not with master.
>

Hi, Edgar.

Both master and ompi-release (isn't it 1.8?!) are the same as far as my fix
is concerned. Something else!? I'd like to see config.log too but will look
into it only tomorrow.

Also I want to add that SLURM PMI2 communicates with the local slurmstepd's
and doesn't need any authentication. All PMI1 processes, on the other hand,
communicate with the srun process and thus need libslurm services for
communication and authentication.


>
> Thanks
> Edgar
>
>
>
>
> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>
>> Looks like I was totally lying in
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I
>> said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>>
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>
>> This ltdl advice object is passed to lt_dlopen() for all components.  My
>> mistake; sorry.
>>
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>>
>> I believe someone said earlier in the thread that adding the right -llibs
>> to the configure line will solve the issue, and that sounds correct to me.
>> If there's a missing symbol because the SLURM libraries are not
>> automatically pulling in the right dependent libraries, then *if* we put a
>> workaround in OMPI to fix this issue, then the right workaround is to add
>> the relevant -llibs when that component is linked.
>>
>> *If* you add that workaround (which is a whole separate discussion), I
>> would suggest adding a configure.m4 test to see if adding the additional
>> -llibs are necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and
>> then if that fails, AC_LINK_IFELSE again with the additional -llibs to see
>> if that works.
>>
>> Or something like that.
>>
>>
>>
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:
>>
>>> Agree. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is
>>> set to. If it is zero, this is very probably the same bug as mine.
>>>
>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
>>> It does look similar - question is: why didn’t this fix the problem?
>>> Will have to investigate.
>>>
>>> Thanks
>>>
>>>
>>>  On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:



 2014-12-02 17:13 GMT+06:00 Ralph Castain :
 Hmmm…if that is true, then it didn’t fix this problem as it is being
 reported in the master.

 I had this problem on my laptop installation. You can check my report
 (it was detailed enough) and see if you are hitting the same issue. My fix
 was also included in the 1.8 branch. I am not sure that this is the same
 issue, but they look similar.



  On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>
> I think this might be related to the configuration problem I was
> fixing with Jeff a few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240
>
> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
> If it isn’t too much trouble, it would be good to confirm that it
> remains broken. I strongly suspect it is based on Moe’s comments.
>
> Obviously, other people are making this work. For Intel MPI, all you
> do is point it at libpmi and they can run. However, they do explicitly
> dlopen it in their code, and I don’t know what flags they might pass when
> they do so.
>
> If necessary, I suppose we could follow that pattern. In other words,
> rather than specifically linking the “s1” component to libpmi, instead
> require that the user point us to a pmi library via an MCA param, then
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
> cited by Jeff, but resolves the pmi linkage problem.
>
>
>  On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error:
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or
>> received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were
>> transmitted or received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Edgar Gabriel

I checked with the debugger that it did skip the entire section.

On 12/2/2014 9:04 AM, Jeff Squyres (jsquyres) wrote:

Oy -- I thought we fixed that.  :-(

Are you saying that configure output says that ltdladvise is not found?


On Dec 2, 2014, at 9:59 AM, Edgar Gabriel  wrote:


didn't want to interfere with this thread, although I have a similar issue, 
since I have the solution nearly fully cooked up. But anyway, this last email 
gave the hint on why we have suddenly the problem in ompio:

it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set 
anymore, so the entire section is being skipped. I double checked that with the 
1.8 branch, it goes through the section, but not with master.

Thanks
Edgar



On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:

Looks like I was totally lying in 
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said 
we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.  My 
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to 
the configure line will solve the issue, and that sounds correct to me.  If 
there's a missing symbol because the SLURM libraries are not automatically 
pulling in the right dependent libraries, then *if* we put a workaround in OMPI 
to fix this issue, then the right workaround is to add the relevant -llibs when 
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would 
suggest adding a configure.m4 test to see if adding the additional -llibs are 
necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if that 
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.



On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:


Agree. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is set to. 
If it is zero, this is very probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain :
It does look similar - question is: why didn’t this fix the problem? Will have 
to investigate.

Thanks



On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:



2014-12-02 17:13 GMT+06:00 Ralph Castain :
Hmmm…if that is true, then it didn’t fix this problem as it is being reported 
in the master.

I had this problem on my laptop installation. You can check my report (it was 
detailed enough) and see if you are hitting the same issue. My fix was also 
included in the 1.8 branch. I am not sure that this is the same issue, but 
they look similar.




On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:

I think this might be related to the configuration problem I was fixing with 
Jeff a few months ago. Refer here:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain :
If it isn’t too much trouble, it would be good to confirm that it remains 
broken. I strongly suspect it is based on Moe’s comments.

Obviously, other people are making this work. For Intel MPI, all you do is 
point it at libpmi and they can run. However, they do explicitly dlopen it in 
their code, and I don’t know what flags they might pass when they do so.

If necessary, I suppose we could follow that pattern. In other words, rather 
than specifically linking the “s1” component to libpmi, instead require that 
the user point us to a pmi library via an MCA param, then explicitly dlopen 
that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
resolves the pmi linkage problem.



On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet  
wrote:

$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error: 
/usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
 linux-vdso.so.1 =>  (0x7fff54478000)
 libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
 libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
 /lib64/ld-linux-x86-64.so.2 (0x003bf540)


now, if i relink auth_munge.so so it depends on libslurm :

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
slurm_auth_get_arg_desc


i can give a try to the latest slurm if needed

Cheers,

Gilles


On 2014/12/02 12:56, Ralph Castain wrote:

Out of 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Jeff Squyres (jsquyres)
Oy -- I thought we fixed that.  :-(

Are you saying that configure output says that ltdladvise is not found?


On Dec 2, 2014, at 9:59 AM, Edgar Gabriel  wrote:

> didn't want to interfere with this thread, although I have a similar issue, 
> since I have the solution nearly fully cooked up. But anyway, this last email 
> gave the hint on why we have suddenly the problem in ompio:
> 
> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set 
> anymore, so the entire section is being skipped. I double checked that with 
> the 1.8 branch, it goes through the section, but not with master.
> 
> Thanks
> Edgar
> 
> 
> 
> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>> Looks like I was totally lying in 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I 
>> said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>> 
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>> 
>> This ltdl advice object is passed to lt_dlopen() for all components.  My 
>> mistake; sorry.
>> 
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>> 
>> I believe someone said earlier in the thread that adding the right -llibs to 
>> the configure line will solve the issue, and that sounds correct to me.  If 
>> there's a missing symbol because the SLURM libraries are not automatically 
>> pulling in the right dependent libraries, then *if* we put a workaround in 
>> OMPI to fix this issue, then the right workaround is to add the relevant 
>> -llibs when that component is linked.
>> 
>> *If* you add that workaround (which is a whole separate discussion), I would 
>> suggest adding a configure.m4 test to see if adding the additional -llibs 
>> are necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if 
>> that fails, AC_LINK_IFELSE again with the additional -llibs to see if that 
>> works.
>> 
>> Or something like that.
>> 
>> 
>> 
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:
>> 
>>> Agree. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is 
>>> set to. If it is zero, this is very probably the same bug as mine.
>>> 
>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
>>> It does look similar - question is: why didn’t this fix the problem? Will 
>>> have to investigate.
>>> 
>>> Thanks
>>> 
>>> 
 On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
 
 
 
 2014-12-02 17:13 GMT+06:00 Ralph Castain :
 Hmmm…if that is true, then it didn’t fix this problem as it is being 
 reported in the master.
 
 I had this problem on my laptop installation. You can check my report (it 
 was detailed enough) and see if you are hitting the same issue. My fix was 
 also included in the 1.8 branch. I am not sure that this is the same issue, 
 but they look similar.
 
 
 
> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
> 
> I think this might be related to the configuration problem I was fixing 
> with Jeff a few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240
> 
> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
> If it isn’t too much trouble, it would be good to confirm that it remains 
> broken. I strongly suspect it is based on Moe’s comments.
> 
> Obviously, other people are making this work. For Intel MPI, all you do 
> is point it at libpmi and they can run. However, they do explicitly 
> dlopen it in their code, and I don’t know what flags they might pass when 
> they do so.
> 
> If necessary, I suppose we could follow that pattern. In other words, 
> rather than specifically linking the “s1” component to libpmi, instead 
> require that the user point us to a pmi library via an MCA param, then 
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
> cited by Jeff, but resolves the pmi linkage problem.
> 
> 
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>>  wrote:
>> 
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>> 
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>> 
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error: 
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted 
>> or received
>> srun: error: soleil: task 0: Exited with exit code 127
>> 
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 =>  (0x7fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>> libc.so.6 => 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Edgar Gabriel
I didn't want to interfere with this thread, although I have a similar 
issue, since I have the solution nearly fully cooked up. But anyway, 
this last email gave the hint on why we suddenly have the problem in ompio:


it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set 
anymore, so the entire section is being skipped. I double checked that 
with the 1.8 branch, it goes through the section, but not with master.


Thanks
Edgar



On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:

Looks like I was totally lying in 
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said 
we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.  My 
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to 
the configure line will solve the issue, and that sounds correct to me.  If 
there's a missing symbol because the SLURM libraries are not automatically 
pulling in the right dependent libraries, then *if* we put a workaround in OMPI 
to fix this issue, then the right workaround is to add the relevant -llibs when 
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would 
suggest adding a configure.m4 test to see if adding the additional -llibs are 
necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if that 
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.



On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:


Agree. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is set to. 
If it is zero, this is very probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain :
It does look similar - question is: why didn’t this fix the problem? Will have 
to investigate.

Thanks



On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:



2014-12-02 17:13 GMT+06:00 Ralph Castain :
Hmmm…if that is true, then it didn’t fix this problem as it is being reported 
in the master.

I had this problem on my laptop installation. You can check my report (it was 
detailed enough) and see if you are hitting the same issue. My fix was also 
included in the 1.8 branch. I am not sure that this is the same issue, but 
they look similar.




On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:

I think this might be related to the configuration problem I was fixing with 
Jeff a few months ago. Refer here:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain :
If it isn’t too much trouble, it would be good to confirm that it remains 
broken. I strongly suspect it is based on Moe’s comments.

Obviously, other people are making this work. For Intel MPI, all you do is 
point it at libpmi and they can run. However, they do explicitly dlopen it in 
their code, and I don’t know what flags they might pass when they do so.

If necessary, I suppose we could follow that pattern. In other words, rather 
than specifically linking the “s1” component to libpmi, instead require that 
the user point us to a pmi library via an MCA param, then explicitly dlopen 
that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
resolves the pmi linkage problem.



On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet  
wrote:

$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error: 
/usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
 linux-vdso.so.1 =>  (0x7fff54478000)
 libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
 libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
 /lib64/ld-linux-x86-64.so.2 (0x003bf540)


now, if i relink auth_munge.so so it depends on libslurm :

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
slurm_auth_get_arg_desc


i can give a try to the latest slurm if needed

Cheers,

Gilles


On 2014/12/02 12:56, Ralph Castain wrote:

Out of curiosity - how are you testing these? I have more current versions of 
Slurm and would like to test the observations there.



On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
  wrote:

I'd like to take a step back ...

i previously tested with slurm 2.6.0, and it 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Jeff Squyres (jsquyres)
Looks like I was totally lying in 
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said 
we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.  My 
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to 
the configure line will solve the issue, and that sounds correct to me.  If 
there's a missing symbol because the SLURM libraries are not automatically 
pulling in the right dependent libraries, then *if* we put a workaround in OMPI 
to fix this issue, then the right workaround is to add the relevant -llibs when 
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would 
suggest adding a configure.m4 test to see if adding the additional -llibs is 
necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if that 
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.
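
To make the idea concrete, here is a rough sketch of that two-step check,
written as plain shell rather than m4 so it can be run outside configure.
try_link and check_extra_libs are hypothetical helper names, not Open MPI or
Autoconf macros. The test program follows the usual Autoconf idiom of
declaring the symbol as "char sym (void)" and calling it. The sketch assumes a
C compiler is reachable as $CC or cc, and that any baseline libraries are
passed in $LIBS.

```shell
#!/bin/sh
# try_link SYMBOL [EXTRA_LIBS...]: return 0 if a tiny program that references
# SYMBOL links successfully with the given extra libraries (plus $LIBS).
try_link() {
    sym=$1
    shift
    tmp=$(mktemp -d) || return 2
    cat > "$tmp/conftest.c" <<EOF
char $sym (void);
int main (void) { return $sym (); }
EOF
    # $LIBS is deliberately unquoted so it can hold several -l flags.
    ${CC:-cc} -o "$tmp/conftest" "$tmp/conftest.c" "$@" $LIBS >/dev/null 2>&1
    rc=$?
    rm -rf "$tmp"
    return $rc
}

# check_extra_libs SYMBOL EXTRA_LIBS...: first try without the extra
# libraries, then retry with them, and report which case applies.
check_extra_libs() {
    sym=$1
    shift
    if try_link "$sym"; then
        echo "no extra libs needed for $sym"
    elif try_link "$sym" "$@"; then
        echo "extra libs needed for $sym: $*"
    else
        echo "cannot link $sym even with: $*" >&2
        return 1
    fi
}

# Harmless demonstration; the result depends on the platform's C library.
check_extra_libs cos -lm || echo "no working compiler or library found"
```

For the SLURM case, the symbol would be one exported by the PMI library, $LIBS
would carry the PMI library itself, and the extra libraries would be whatever
it fails to pull in on its own; I have not tested any specific combination.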



On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:

> Agree. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is set 
> to. If it is zero, this is very probably the same bug as mine.
> 
> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
> It does look similar - question is: why didn’t this fix the problem? Will 
> have to investigate.
> 
> Thanks
> 
> 
>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
>> 
>> 
>> 
>> 2014-12-02 17:13 GMT+06:00 Ralph Castain :
>> Hmmm…if that is true, then it didn’t fix this problem as it is being 
>> reported in the master.
>> 
>> I had this problem on my laptop installation. You can check my report it was 
>> detailed enough and see if you hitting the same issue. My fix was also 
>> included into 1.8 branch. I am not sure that this is the same issue but they 
>> looks similar.
>>  
>> 
>> 
>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>>> 
>>> I think this might be related to the configuration problem I was fixing 
>>> with Jeff few months ago. Refer here:
>>> https://github.com/open-mpi/ompi/pull/240
>>> 
>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>>> If it isn’t too much trouble, it would be good to confirm that it remains 
>>> broken. I strongly suspect it is based on Moe’s comments.
>>> 
>>> Obviously, other people are making this work. For Intel MPI, all you do is 
>>> point it at libpmi and they can run. However, they do explicitly dlopen it 
>>> in their code, and I don’t know what flags they might pass when they do so.
>>> 
>>> If necessary, I suppose we could follow that pattern. In other words, 
>>> rather than specifically linking the “s1” component to libpmi, instead 
>>> require that the user point us to a pmi library via an MCA param, then 
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
>>> cited by Jeff, but resolves the pmi linkage problem.
>>> 
>>> 
 On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
  wrote:
 
 $ srun --version
 slurm 2.6.6-VENDOR_PROVIDED
 
 $ srun --mpi=pmi2 -n 1 ~/hw
 I am 0 / 1
 
 $ srun -n 1 ~/hw
 /csc/home1/gouaillardet/hw: symbol lookup error: 
 /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
 srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
 srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
 received
 srun: error: soleil: task 0: Exited with exit code 127
 
 $ ldd /usr/lib64/slurm/auth_munge.so
 linux-vdso.so.1 =>  (0x7fff54478000)
 libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
 libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
 /lib64/ld-linux-x86-64.so.2 (0x003bf540)
 
 
 Now, if I relink auth_munge.so so that it depends on libslurm:
 
 $ srun -n 1 ~/hw
 srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined 
 symbol: slurm_auth_get_arg_desc
 
 
 i can give a try to the latest slurm if needed
 
 Cheers,
 
 Gilles
 
 
 On 2014/12/02 12:56, Ralph Castain wrote:
> Out of curiosity - how are you testing these? I have more current 
> versions of Slurm and would like to test the observations there.
> 
> 
>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>> 
>>  wrote:
>> 
>> I d like to make a step back ...
>> 
>> i previously tested with slurm 2.6.0, and it complained about the 
>> slurm_verbose symbol that is defined in libslurm.so
>> so with 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
Agreed. The first thing to check is what value OPAL_HAVE_LTDL_ADVISE is
set to. If it is zero, this is very probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain :

> It does look similar - question is: why didn’t this fix the problem? Will
> have to investigate.
>
> Thanks
>
>
> On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
>
>
>
> 2014-12-02 17:13 GMT+06:00 Ralph Castain :
>
>> Hmmm…if that is true, then it didn’t fix this problem as it is being
>> reported in the master.
>>
>
> I had this problem on my laptop installation. You can check my report it
> was detailed enough and see if you hitting the same issue. My fix was also
> included into 1.8 branch. I am not sure that this is the same issue but
> they looks similar.
>
>
>>
>>
>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>>
>> I think this might be related to the configuration problem I was fixing
>> with Jeff few months ago. Refer here:
>> https://github.com/open-mpi/ompi/pull/240
>>
>> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>>
>>> If it isn’t too much trouble, it would be good to confirm that it
>>> remains broken. I strongly suspect it is based on Moe’s comments.
>>>
>>> Obviously, other people are making this work. For Intel MPI, all you do
>>> is point it at libpmi and they can run. However, they do explicitly dlopen
>>> it in their code, and I don’t know what flags they might pass when they do
>>> so.
>>>
>>> If necessary, I suppose we could follow that pattern. In other words,
>>> rather than specifically linking the “s1” component to libpmi, instead
>>> require that the user point us to a pmi library via an MCA param, then
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>>> cited by Jeff, but resolves the pmi linkage problem.
>>>
>>>
>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>>
>>> $ srun --version
>>> slurm 2.6.6-VENDOR_PROVIDED
>>>
>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>> I am 0 / 1
>>>
>>> $ srun -n 1 ~/hw
>>> /csc/home1/gouaillardet/hw: symbol lookup error:
>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted
>>> or received
>>> srun: error: soleil: task 0: Exited with exit code 127
>>>
>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>> linux-vdso.so.1 =>  (0x7fff54478000)
>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>>>
>>>
>>> Now, if I relink auth_munge.so so that it depends on libslurm:
>>>
>>> $ srun -n 1 ~/hw
>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>>> symbol: slurm_auth_get_arg_desc
>>>
>>>
>>> i can give a try to the latest slurm if needed
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>
>>> Out of curiosity - how are you testing these? I have more current versions 
>>> of Slurm and would like to test the observations there.
>>>
>>>
>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>>   wrote:
>>>
>>> I d like to make a step back ...
>>>
>>> i previously tested with slurm 2.6.0, and it complained about the 
>>> slurm_verbose symbol that is defined in libslurm.so
>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>
>>> now i tested with slurm 2.6.6 and it complains about the 
>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>> defined in any dynamic library. it is internally defined in the static 
>>> libcommon.a library, which is used to build the slurm binaries.
>>>
>>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>>> binary, which means it cannot be invoked from an mpi application
>>> even if it is linked with libslurm, libpmi, ...
>>>
>>> that looks like a slurm design issue that the slurm folks will take care of.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>
>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>> component as this is the only place that requires it, and it won’t hurt 
>>> anything to do so.
>>>
>>>
>>>
>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>>
>>> Jeff,
>>>
>>> FWIW, you can read my analysis of what is going wrong at
>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
2014-12-02 17:13 GMT+06:00 Ralph Castain :

> Hmmm…if that is true, then it didn’t fix this problem as it is being
> reported in the master.
>

I had this problem on my laptop installation. You can check my report (it
was detailed enough) and see whether you are hitting the same issue. My fix was
also included in the 1.8 branch. I am not sure that this is the same issue, but
the two look similar.


>
>
> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>
> I think this might be related to the configuration problem I was fixing
> with Jeff few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240
>
> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>
>> If it isn’t too much trouble, it would be good to confirm that it remains
>> broken. I strongly suspect it is based on Moe’s comments.
>>
>> Obviously, other people are making this work. For Intel MPI, all you do
>> is point it at libpmi and they can run. However, they do explicitly dlopen
>> it in their code, and I don’t know what flags they might pass when they do
>> so.
>>
>> If necessary, I suppose we could follow that pattern. In other words,
>> rather than specifically linking the “s1” component to libpmi, instead
>> require that the user point us to a pmi library via an MCA param, then
>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>> cited by Jeff, but resolves the pmi linkage problem.
>>
>>
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>  $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error:
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or
>> received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 =>  (0x7fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>>
>>
>> Now, if I relink auth_munge.so so that it depends on libslurm:
>>
>> $ srun -n 1 ~/hw
>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>> symbol: slurm_auth_get_arg_desc
>>
>>
>> i can give a try to the latest slurm if needed
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/12/02 12:56, Ralph Castain wrote:
>>
>> Out of curiosity - how are you testing these? I have more current versions 
>> of Slurm and would like to test the observations there.
>>
>>
>>  On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>   wrote:
>>
>> I d like to make a step back ...
>>
>> i previously tested with slurm 2.6.0, and it complained about the 
>> slurm_verbose symbol that is defined in libslurm.so
>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>
>> now i tested with slurm 2.6.6 and it complains about the 
>> slurm_auth_get_arg_desc symbol, and this symbol is not
>> defined in any dynamic library. it is internally defined in the static 
>> libcommon.a library, which is used to build the slurm binaries.
>>
>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>> binary, which means it cannot be invoked from an mpi application
>> even if it is linked with libslurm, libpmi, ...
>>
>> that looks like a slurm design issue that the slurm folks will take care of.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:33, Ralph Castain wrote:
>>
>>  Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>> component as this is the only place that requires it, and it won’t hurt 
>> anything to do so.
>>
>>
>>
>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>
>> Jeff,
>>
>> FWIW, you can read my analysis of what is going wrong at
>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>
>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>> on libslurm, but they do not, yet)
>>
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopen with 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Artem Polyakov
I think this might be related to the configuration problem I was fixing
with Jeff a few months ago. See:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain :

> If it isn’t too much trouble, it would be good to confirm that it remains
> broken. I strongly suspect it is based on Moe’s comments.
>
> Obviously, other people are making this work. For Intel MPI, all you do is
> point it at libpmi and they can run. However, they do explicitly dlopen it
> in their code, and I don’t know what flags they might pass when they do so.
>
> If necessary, I suppose we could follow that pattern. In other words,
> rather than specifically linking the “s1” component to libpmi, instead
> require that the user point us to a pmi library via an MCA param, then
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
> cited by Jeff, but resolves the pmi linkage problem.
>
>
> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>  $ srun --version
> slurm 2.6.6-VENDOR_PROVIDED
>
> $ srun --mpi=pmi2 -n 1 ~/hw
> I am 0 / 1
>
> $ srun -n 1 ~/hw
> /csc/home1/gouaillardet/hw: symbol lookup error:
> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or
> received
> srun: error: soleil: task 0: Exited with exit code 127
>
> $ ldd /usr/lib64/slurm/auth_munge.so
> linux-vdso.so.1 =>  (0x7fff54478000)
> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>
>
> Now, if I relink auth_munge.so so that it depends on libslurm:
>
> $ srun -n 1 ~/hw
> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
> symbol: slurm_auth_get_arg_desc
>
>
> i can give a try to the latest slurm if needed
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/02 12:56, Ralph Castain wrote:
>
> Out of curiosity - how are you testing these? I have more current versions of 
> Slurm and would like to test the observations there.
>
>
>  On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>   wrote:
>
> I d like to make a step back ...
>
> i previously tested with slurm 2.6.0, and it complained about the 
> slurm_verbose symbol that is defined in libslurm.so
> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>
> now i tested with slurm 2.6.6 and it complains about the 
> slurm_auth_get_arg_desc symbol, and this symbol is not
> defined in any dynamic library. it is internally defined in the static 
> libcommon.a library, which is used to build the slurm binaries.
>
> as far as i understand, auth_munge.so can only be invoked from a slurm 
> binary, which means it cannot be invoked from an mpi application
> even if it is linked with libslurm, libpmi, ...
>
> that looks like a slurm design issue that the slurm folks will take care of.
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 12:33, Ralph Castain wrote:
>
>  Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
> component as this is the only place that requires it, and it won’t hurt 
> anything to do so.
>
>
>
> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>
> Jeff,
>
> FWIW, you can read my analysis of what is going wrong at
> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>
> bottom line, i agree this is a slurm issue (slurm plugin should depend
> on libslurm, but they do not, yet)
>
> a possible workaround would be to make the pmi component a "proxy" that
> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
> that being said, the impact is quite limited (no direct launch in slurm
> with pmi1, but pmi2 works fine) so it makes sense not to work around
> someone else problem.
> and that being said, configure could detect this broken pmi1 and not
> build pmi1 support or print a user friendly error message if pmi1 is used.
>
> any thoughts ?
>
> Cheers,
>
> Gilles
>
> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>
>  Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this explanation, 
> I'm still not sure 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
If it isn’t too much trouble, it would be good to confirm that it remains 
broken. I strongly suspect it is, based on Moe’s comments.

Obviously, other people are making this work. For Intel MPI, all you do is 
point it at libpmi and they can run. However, they do explicitly dlopen it in 
their code, and I don’t know what flags they might pass when they do so.

If necessary, I suppose we could follow that pattern. In other words, rather 
than specifically linking the “s1” component to libpmi, instead require that 
the user point us to a pmi library via an MCA param, then explicitly dlopen 
that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
resolves the pmi linkage problem.
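
The "explicitly dlopen that library with RTLD_GLOBAL" pattern can be illustrated with Python's ctypes module, whose `mode` argument is passed through as the dlopen flags. This is only a sketch of the idea, not Open MPI code: the function name is made up, and `path` is a hypothetical stand-in for whatever value an MCA parameter would supply.

```python
import ctypes


def open_pmi_library(path, make_global=True):
    """dlopen the library at `path` and return the ctypes handle wrapper.

    `path` is a hypothetical stand-in for the value of an MCA parameter.
    Raises OSError if the library cannot be loaded.
    """
    # RTLD_GLOBAL makes the library's symbols available for resolving
    # references in libraries loaded afterwards; RTLD_LOCAL does not.
    mode = ctypes.RTLD_GLOBAL if make_global else ctypes.RTLD_LOCAL
    return ctypes.CDLL(path, mode=mode)


if __name__ == "__main__":
    try:
        open_pmi_library("/nonexistent/libpmi.so")
    except OSError as err:
        print("could not load PMI library:", err)
```

Because the library is named at run time, a bad or missing path shows up as a load error that can be reported to the user rather than as a link-time dependency of the component.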


> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>  wrote:
> 
> $ srun --version
> slurm 2.6.6-VENDOR_PROVIDED
> 
> $ srun --mpi=pmi2 -n 1 ~/hw
> I am 0 / 1
> 
> $ srun -n 1 ~/hw
> /csc/home1/gouaillardet/hw: symbol lookup error: 
> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
> received
> srun: error: soleil: task 0: Exited with exit code 127
> 
> $ ldd /usr/lib64/slurm/auth_munge.so
> linux-vdso.so.1 =>  (0x7fff54478000)
> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
> 
> 
> Now, if I relink auth_munge.so so that it depends on libslurm:
> 
> $ srun -n 1 ~/hw
> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
> slurm_auth_get_arg_desc
> 
> 
> i can give a try to the latest slurm if needed
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/12/02 12:56, Ralph Castain wrote:
>> Out of curiosity - how are you testing these? I have more current versions 
>> of Slurm and would like to test the observations there.
>> 
>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>>   
>>> wrote:
>>> 
>>> I d like to make a step back ...
>>> 
>>> i previously tested with slurm 2.6.0, and it complained about the 
>>> slurm_verbose symbol that is defined in libslurm.so
>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>> 
>>> now i tested with slurm 2.6.6 and it complains about the 
>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>> defined in any dynamic library. it is internally defined in the static 
>>> libcommon.a library, which is used to build the slurm binaries.
>>> 
>>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>>> binary, which means it cannot be invoked from an mpi application
>>> even if it is linked with libslurm, libpmi, ...
>>> 
>>> that looks like a slurm design issue that the slurm folks will take care of.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>>> component as this is the only place that requires it, and it won’t hurt 
>>>> anything to do so.
>>>> 
>>>> 
>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>>>> 
>>>>> Jeff,
>>>>> 
>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>> 
>>>>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>>>>> on libslurm, but they do not, yet)
>>>>> 
>>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>>> someone else problem.
>>>>> and that being said, configure could detect this broken pmi1 and not
>>>>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>>>> 
>>>>> any thoughts ?
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>> Ok, if the problem is moot, great.

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error:
/usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted
or received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
linux-vdso.so.1 =>  (0x7fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x003bf540)
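
The ldd output above lists no libslurm dependency for auth_munge.so. As a rough sketch, the same check can be scripted by parsing ldd-style text; the helper names below are made up for illustration, and the sample text is copied from the output above.

```python
def ldd_dependencies(ldd_output):
    """Return the library names listed in ldd-style output."""
    names = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line starts with the library name (or its full path).
        first = line.split()[0]
        names.append(first.rsplit("/", 1)[-1])
    return names


def depends_on(ldd_output, prefix):
    """True if any listed library name starts with `prefix`."""
    return any(name.startswith(prefix) for name in ldd_dependencies(ldd_output))


SAMPLE = """\
linux-vdso.so.1 =>  (0x7fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x003bf540)
"""

if __name__ == "__main__":
    print(depends_on(SAMPLE, "libslurm"))
    print(depends_on(SAMPLE, "libmunge"))
```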


Now, if I relink auth_munge.so so that it depends on libslurm:

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
symbol: slurm_auth_get_arg_desc


I can try the latest Slurm if needed.

Cheers,

Gilles


On 2014/12/02 12:56, Ralph Castain wrote:
> Out of curiosity - how are you testing these? I have more current versions of 
> Slurm and would like to test the observations there.
>
>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>  wrote:
>>
>> I d like to make a step back ...
>>
>> i previously tested with slurm 2.6.0, and it complained about the 
>> slurm_verbose symbol that is defined in libslurm.so
>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>
>> now i tested with slurm 2.6.6 and it complains about the 
>> slurm_auth_get_arg_desc symbol, and this symbol is not
>> defined in any dynamic library. it is internally defined in the static 
>> libcommon.a library, which is used to build the slurm binaries.
>>
>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>> binary, which means it cannot be invoked from an mpi application
>> even if it is linked with libslurm, libpmi, ...
>>
>> that looks like a slurm design issue that the slurm folks will take care of.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:33, Ralph Castain wrote:
>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>> component as this is the only place that requires it, and it won't hurt 
>>> anything to do so.
>>>
>>>
 On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
   
 wrote:

 Jeff,

 FWIW, you can read my analysis of what is going wrong at
 http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php

 bottom line, i agree this is a slurm issue (slurm plugin should depend
 on libslurm, but they do not, yet)

 a possible workaround would be to make the pmi component a "proxy" that
 dlopen with RTLD_GLOBAL the "real" component in which the job is done.
 that being said, the impact is quite limited (no direct launch in slurm
 with pmi1, but pmi2 works fine) so it makes sense not to work around
 someone else problem.
 and that being said, configure could detect this broken pmi1 and not
 build pmi1 support or print a user friendly error message if pmi1 is used.

 any thoughts ?

 Cheers,

 Gilles

 On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
> Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this 
> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>
>
> On Dec 1, 2014, at 5:15 PM, Ralph Castain  
>  wrote:
>
>> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
>> library is missing the linkage to libslurm that contains the linkage to 
>> libauth where munge resides. So when we call a PMI function, libpmi 
>> references a call to munge for authentication and hits an "unresolved 
>> symbol" error.
>>
>> Moe acknowledges the error is in Slurm and is fixing the linkages so 
>> this problem goes away
>>
>>
>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) 
>>>   wrote:
>>>
>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  
>>>  wrote:
>>>
 FWIW: It's Slurm's pmi-1 library that isn't linked correctly against 
 its dependencies (the pmi-2 one is correct).  Moe is aware of the 
 problem and fixing it on their side. This won't help existing 
 installations until they upgrade, but I tend to agree with Jeff about 
 not fixing other people's problems.
>>> Can you explain what is happening?
>>>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Out of curiosity - how are you testing these? I have more current versions of 
Slurm and would like to test the observations there.

> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>  wrote:
> 
> I d like to make a step back ...
> 
> i previously tested with slurm 2.6.0, and it complained about the 
> slurm_verbose symbol that is defined in libslurm.so
> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
> 
> now i tested with slurm 2.6.6 and it complains about the 
> slurm_auth_get_arg_desc symbol, and this symbol is not
> defined in any dynamic library. it is internally defined in the static 
> libcommon.a library, which is used to build the slurm binaries.
> 
> as far as i understand, auth_munge.so can only be invoked from a slurm 
> binary, which means it cannot be invoked from an mpi application
> even if it is linked with libslurm, libpmi, ...
> 
> that looks like a slurm design issue that the slurm folks will take care of.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/12/02 12:33, Ralph Castain wrote:
>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>> component as this is the only place that requires it, and it won’t hurt 
>> anything to do so.
>> 
>> 
>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>>   
>>> wrote:
>>> 
>>> Jeff,
>>> 
>>> FWIW, you can read my analysis of what is going wrong at
>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>> 
>>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>>> on libslurm, but they do not, yet)
>>> 
>>> a possible workaround would be to make the pmi component a "proxy" that
>>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>>> that being said, the impact is quite limited (no direct launch in slurm
>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>> someone else problem.
>>> and that being said, configure could detect this broken pmi1 and not
>>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>> 
>>> any thoughts ?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
 Ok, if the problem is moot, great.
 
 (sidenote: this is moot, so ignore this if you want: with this 
 explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
 
 
 On Dec 1, 2014, at 5:15 PM, Ralph Castain  
  wrote:
 
> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
> library is missing the linkage to libslurm that contains the linkage to 
> libauth where munge resides. So when we call a PMI function, libpmi 
> references a call to munge for authentication and hits an “unresolved 
> symbol” error.
> 
> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
> problem goes away
> 
> 
>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
>>  wrote:
>> 
>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  
>>  wrote:
>> 
>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against 
>>> its dependencies (the pmi-2 one is correct).  Moe is aware of the 
>>> problem and fixing it on their side. This won’t help existing 
>>> installations until they upgrade, but I tend to agree with Jeff about 
>>> not fixing other people’s problems.
>> Can you explain what is happening?
>> 
>> I ask because I'm not sure I understand the problem such that using 
>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against 
>> its dependencies properly, that shouldn't cause a problem if OMPI 
>> components A and B are both linked against libpmi1.so, and then A is 
>> loaded, and then B is loaded.
>> 
>> ...or perhaps we can just discuss this on the call tomorrow?
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com 
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/ 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php 
>> 

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
I'd like to take a step back ...

I previously tested with Slurm 2.6.0, and it complained about the
slurm_verbose symbol, which is defined in libslurm.so.
So with Slurm 2.6.0, either RTLD_GLOBAL or relinking is OK.

Now I have tested with Slurm 2.6.6, and it complains about the
slurm_auth_get_arg_desc symbol, which is not
defined in any dynamic library. It is defined internally in the static
libcommon.a library, which is used to build the Slurm binaries.

As far as I understand, auth_munge.so can only be invoked from a Slurm
binary, which means it cannot be invoked from an MPI application,
even if that application is linked with libslurm, libpmi, ...

That looks like a Slurm design issue that the Slurm folks will take care of.
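
Whether a symbol can be resolved from a given process is easy to probe. The sketch below uses Python's ctypes, where attribute lookup on a loaded library performs a dlsym()-style search; the function name is made up for illustration, and passing `None` as the path (which opens the running process itself) is POSIX behaviour.

```python
import ctypes


def has_symbol(library_path, symbol):
    """True if `symbol` can be resolved through the given shared library.

    On POSIX systems, library_path=None opens the running process itself,
    which also searches the libraries it has already loaded globally.
    """
    lib = ctypes.CDLL(library_path)
    try:
        # Attribute access performs a dlsym()-style lookup.
        getattr(lib, symbol)
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    # The Python process is not a Slurm binary, so this symbol (which,
    # per the message above, lives in the static libcommon.a) should not
    # be resolvable here.
    print(has_symbol(None, "slurm_auth_get_arg_desc"))
```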

Cheers,

Gilles

On 2014/12/02 12:33, Ralph Castain wrote:
> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
> component as this is the only place that requires it, and it won't hurt 
> anything to do so.
>
>
>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>  wrote:
>>
>> Jeff,
>>
>> FWIW, you can read my analysis of what is going wrong at
>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>> 
>>
>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>> on libslurm, but they do not, yet)
>>
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>> that being said, the impact is quite limited (no direct launch in slurm
>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>> someone else problem.
>> and that being said, configure could detect this broken pmi1 and not
>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>
>> any thoughts ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>> Ok, if the problem is moot, great.
>>>
>>> (sidenote: this is moot, so ignore this if you want: with this explanation, 
>>> I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>
>>>
>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain  wrote:
>>>
>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. This
>>>> library is missing the linkage to libslurm that contains the linkage to
>>>> libauth where munge resides. So when we call a PMI function, libpmi
>>>> references a call to munge for authentication and hits an "unresolved
>>>> symbol" error.
>>>>
>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so this
>>>> problem goes away
>>>>
>>>>
>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)
>>>>> wrote:
>>>>>
>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
>>>>>
>>>>>> FWIW: It's Slurm's pmi-1 library that isn't linked correctly against its
>>>>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem
>>>>>> and fixing it on their side. This won't help existing installations
>>>>>> until they upgrade, but I tend to agree with Jeff about not fixing other
>>>>>> people's problems.
>>>>> Can you explain what is happening?
>>>>>
>>>>> I ask because I'm not sure I understand the problem such that using
>>>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against
>>>>> its dependencies properly, that shouldn't cause a problem if OMPI
>>>>> components A and B are both linked against libpmi1.so, and then A is
>>>>> loaded, and then B is loaded.
>>>>>
>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16384.php

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
Jeff,

FWIW, you can read my analysis of what is going wrong at
http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php

Bottom line: I agree this is a Slurm issue (the Slurm plugins should depend
on libslurm, but they do not, yet).

A possible workaround would be to make the pmi component a "proxy" that
dlopens, with RTLD_GLOBAL, the "real" component in which the job is done.
That being said, the impact is quite limited (no direct launch in Slurm
with pmi1, but pmi2 works fine), so it makes sense not to work around
someone else's problem.
That said, configure could detect this broken pmi1 and not
build pmi1 support, or print a user-friendly error message if pmi1 is used.
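
To make the "proxy" idea above concrete, here is a minimal sketch in C. It is
not code from Open MPI: the function name proxy_open_real, the init-function
signature, and whatever path and symbol name are passed in are all
hypothetical. It only shows the shape of the workaround, i.e. a thin component
that opens the real one with RTLD_GLOBAL so that the real component's symbols
(and those of the libraries it pulls in) become visible to libraries loaded
afterwards.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical signature for the real component's entry point. */
typedef int (*component_init_fn)(void);

/*
 * Open the "real" component with RTLD_GLOBAL and call its init function.
 * Returns the init function's result, or -1 if the component or the
 * symbol cannot be found. Both arguments are illustrative assumptions.
 */
int proxy_open_real(const char *real_path, const char *init_symbol)
{
    /* RTLD_GLOBAL: symbols of real_path and its dependencies are made
       available for resolving references in later-loaded libraries. */
    void *handle = dlopen(real_path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "proxy: %s\n", dlerror());
        return -1;
    }

    component_init_fn init = (component_init_fn) dlsym(handle, init_symbol);
    if (init == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "proxy: %s\n", err != NULL ? err : "symbol not found");
        dlclose(handle);
        return -1;
    }

    /* The handle is deliberately kept open for the life of the process. */
    return init();
}
```

On older glibc versions this needs -ldl at link time. Note that the sketch
leaks the handle on purpose; a real component would close it in its own
finalize path.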

Any thoughts?

Cheers,

Gilles

On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
> Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this explanation, 
> I'm still not sure how RTLD_GLOBAL fixes the issue)
>
>
> On Dec 1, 2014, at 5:15 PM, Ralph Castain  wrote:
>
>> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
>> library is missing the linkage to libslurm that contains the linkage to 
>> libauth where munge resides. So when we call a PMI function, libpmi 
>> references a call to munge for authentication and hits an “unresolved 
>> symbol” error.
>>
>> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
>> problem goes away
>>
>>
>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
>>> wrote:
>>>
>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
>>>
>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its
>>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and
>>>> fixing it on their side. This won’t help existing installations until they
>>>> upgrade, but I tend to agree with Jeff about not fixing other people’s
>>>> problems.
>>> Can you explain what is happening?
>>>
>>> I ask because I'm not sure I understand the problem such that using 
>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against 
>>> its dependencies properly, that shouldn't cause a problem if OMPI 
>>> components A and B are both linked against libpmi1.so, and then A is 
>>> loaded, and then B is loaded.
>>>
>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
Ok, if the problem is moot, great.

(sidenote: this is moot, so ignore this if you want: with this explanation, I'm 
still not sure how RTLD_GLOBAL fixes the issue)


On Dec 1, 2014, at 5:15 PM, Ralph Castain  wrote:

> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
> library is missing the linkage to libslurm that contains the linkage to 
> libauth where munge resides. So when we call a PMI function, libpmi 
> references a call to munge for authentication and hits an “unresolved symbol” 
> error.
> 
> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
> problem goes away
> 
> 
>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
>> 
>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
>>> fixing it on their side. This won’t help existing installations until they 
>>> upgrade, but I tend to agree with Jeff about not fixing other people’s 
>>> problems.
>> 
>> Can you explain what is happening?
>> 
>> I ask because I'm not sure I understand the problem such that using 
>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against its 
>> dependencies properly, that shouldn't cause a problem if OMPI components A 
>> and B are both linked against libpmi1.so, and then A is loaded, and then B 
>> is loaded.
>> 
>> ...or perhaps we can just discuss this on the call tomorrow?
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Easy enough to explain. We link libpmi into the pmix/s1 component. This library 
is missing the linkage to libslurm that contains the linkage to libauth where 
munge resides. So when we call a PMI function, libpmi references a call to 
munge for authentication and hits an “unresolved symbol” error.

Moe acknowledges the error is in Slurm and is fixing the linkages so this 
problem goes away
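
As a side illustration of the failure described above: a shared library that
is missing a link-time dependency can usually be caught at load time by
opening it with RTLD_NOW, which asks the dynamic loader to resolve all
undefined symbols before dlopen() returns rather than at the first call. The
sketch below is only an assumption about how one might probe for this; the
function name library_resolves_now is hypothetical and nothing here is taken
from Open MPI or Slurm.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Try to load `path` with immediate symbol resolution.
 * Returns 1 if the library loads with all symbols resolved, 0 otherwise.
 * On failure, the loader's message (e.g. an "undefined symbol" report,
 * or "cannot open shared object file") is copied into errbuf.
 */
int library_resolves_now(const char *path, char *errbuf, size_t errlen)
{
    /* RTLD_LOCAL keeps the probe from leaking symbols into the process. */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *msg = dlerror();
        snprintf(errbuf, errlen, "%s", msg != NULL ? msg : "unknown dlopen error");
        return 0;
    }
    dlclose(handle);
    return 1;
}
```

Run against the installed libpmi.so (the path depends on the Slurm
installation), a 0 return with an "undefined symbol" message in errbuf would
point at the kind of missing linkage discussed here. On older glibc versions
this needs -ldl at link time.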


> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
> 
>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
>> fixing it on their side. This won’t help existing installations until they 
>> upgrade, but I tend to agree with Jeff about not fixing other people’s 
>> problems.
> 
> Can you explain what is happening?
> 
> I ask because I'm not sure I understand the problem such that using 
> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against its 
> dependencies properly, that shouldn't cause a problem if OMPI components A 
> and B are both linked against libpmi1.so, and then A is loaded, and then B is 
> loaded.
> 
> ...or perhaps we can just discuss this on the call tomorrow?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:

> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
> fixing it on their side. This won’t help existing installations until they 
> upgrade, but I tend to agree with Jeff about not fixing other people’s 
> problems.

Can you explain what is happening?

I ask because I'm not sure I understand the problem such that using RTLD_GLOBAL 
would fix it.  I.e., even if libpmi1.so isn't linked against its dependencies 
properly, that shouldn't cause a problem if OMPI components A and B are both 
linked against libpmi1.so, and then A is loaded, and then B is loaded.

...or perhaps we can just discuss this on the call tomorrow?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
dependencies (the pmi-2 one is correct). Moe is aware of the problem and fixing 
it on their side. This won’t help existing installations until they upgrade, 
but I tend to agree with Jeff about not fixing other people’s problems.


> On Dec 1, 2014, at 1:55 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Dec 1, 2014, at 4:07 PM, Howard Pritchard  wrote:
> 
>> There has been some discussion of end case situations with use of dlopen
>> in the ompi mca framework that can lead to unresolved symbols when
>> subsequent shared libraries are dlopen'd that might needs symbols from
>> a library that had been opened previously.  Yes these libraries should be
>> doing something like a second dlopen of the lib they are dependent on,
>> but that's a different story involving other software projects outside of
>> ompi.
> 
> Those other projects should be fixed.  OMPI should not be the compromise 
> location where we compensate for other projects that do not obey proper 
> linking semantics.
> 
> Can you cite some specific examples?
> 
>> The default with the mca framework dlopen'ing of component libraries
>> is not to use RTLD_GLOBAL, and there does not currently appear to be a way
>> to change this behavior at runtime.
>> 
>> Is there a reason for avoiding use of RTLD_GLOBAL in libltdl's use of dlopen?
> 
> Yes.
> 
> There's at least two reasons that I can think of off the top of my head:
> 
> 1. It's the Right Thing to do.  I.e., we shouldn't pollute the general 
> namespace with symbols from dependent libraries.
> 
> 2. We've had specific user requests to not pollute the general namespace.  
> One specific case was because we use an embedded copy of libevent, and 
> another MPI-based program also uses libevent.  If we didn't keep libevent in 
> a private namespace, Bad Things (i.e., symbol clashes) would occur.
> 
>> Would it be okay to add RTLD_GLOBAL to the default module_flags used
>> in the vm_open - modulo detection of the definition of RTLD_GLOBAL at
>> compile time.
> 
> No.
> 
>> Perhaps adding a way with an env. or config option to not
>> enable RTLD_GLOBAL by default?
> 
> This just seems like a bad path to go down.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
On Dec 1, 2014, at 4:07 PM, Howard Pritchard  wrote:

> There has been some discussion of edge-case situations with the use of dlopen
> in the ompi mca framework that can lead to unresolved symbols when
> subsequent shared libraries are dlopen'd that might need symbols from
> a library that had been opened previously.  Yes, these libraries should be
> doing something like a second dlopen of the lib they are dependent on,
> but that's a different story involving other software projects outside of
> ompi.

Those other projects should be fixed.  OMPI should not be the compromise 
location where we compensate for other projects that do not obey proper linking 
semantics.

Can you cite some specific examples?

> The default with the mca framework dlopen'ing of component libraries
> is not to use RTLD_GLOBAL, and there does not currently appear to be a way
> to change this behavior at runtime.
> 
> Is there a reason for avoiding use of RTLD_GLOBAL in libltdl's use of dlopen?

Yes.

There are at least two reasons that I can think of off the top of my head:

1. It's the Right Thing to do.  I.e., we shouldn't pollute the general 
namespace with symbols from dependent libraries.

2. We've had specific user requests to not pollute the general namespace.  One 
specific case was because we use an embedded copy of libevent, and another 
MPI-based program also uses libevent.  If we didn't keep libevent in a private 
namespace, Bad Things (i.e., symbol clashes) would occur.
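
To illustrate the two points above, here is a minimal sketch of a component
loader that defaults to a private namespace. It is not Open MPI's or libltdl's
actual code; the names component_dlopen_flags and load_component are
hypothetical. The only thing it relies on is the documented dlopen() behavior:
with RTLD_LOCAL, symbols defined by the loaded library are not made available
for resolving references in libraries loaded later, while RTLD_GLOBAL does
make them available.

```c
#include <dlfcn.h>
#include <stddef.h>

/*
 * Pick dlopen() flags for a component. The default (make_global == 0)
 * is RTLD_LOCAL, so an embedded dependency such as a private libevent
 * copy does not clash with another copy loaded by the application.
 */
int component_dlopen_flags(int make_global)
{
    return RTLD_LAZY | (make_global ? RTLD_GLOBAL : RTLD_LOCAL);
}

/* Load a component; returns the dlopen() handle or NULL on failure. */
void *load_component(const char *path, int make_global)
{
    return dlopen(path, component_dlopen_flags(make_global));
}
```

Passing RTLD_LOCAL explicitly, rather than relying on the platform default,
keeps the intent visible at the call site. On older glibc versions this needs
-ldl at link time.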

> Would it be okay to add RTLD_GLOBAL to the default module_flags used
> in vm_open, modulo detection of the definition of RTLD_GLOBAL at
> compile time?

No.

> Perhaps adding a way with an env. or config option to not
> enable RTLD_GLOBAL by default?

This just seems like a bad path to go down.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/