date:20141202

We talked about this on the weekly conference call, and adding the usock 
component to 1.8 is just not within our procedures. It would involve bringing 
over much more of the OOB revisions (we’d have to handle the transfer of 
messages between components, if nothing else), and that involves a lot of 
change.

I’ll instead try to provide a faster error response so it is clearer what is 
happening, hopefully letting the user fix the problem by turning on the 
loopback interface.
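
To illustrate the idea of a faster error response, here is a minimal sketch, not the actual OOB change: it bounds a TCP connection attempt with a short timeout and returns a specific message. The function name `try_connect`, the default timeout, and the message wording are assumptions made for illustration only.

```python
import socket


def try_connect(host, port, timeout_s=5.0):
    """Attempt a TCP connection, giving up after timeout_s seconds.

    Returns (True, "") on success or (False, reason) on failure, so the
    caller can print a specific message instead of appearing to hang.
    """
    try:
        # create_connection applies the timeout to the connect step itself.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, ""
    except socket.timeout:
        return False, (
            f"connection to {host}:{port} timed out after {timeout_s}s; "
            "check that the loopback interface is up"
        )
    except OSError as exc:
        # Refused, unreachable, and similar errors arrive here immediately.
        return False, f"connection to {host}:{port} failed: {exc}"
```

A caller would check the returned flag and print the reason right away rather than waiting for the kernel's default connect timeout.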

> On Nov 25, 2014, at 7:05 PM, Ralph Castain  wrote:
> 
> 
>> On Nov 25, 2014, at 6:15 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>> 
>> Ralph and Paul,
>> 
>> On 2014/11/26 10:37, Ralph Castain wrote:
>>> So it looks like the issue isn’t so much with our code as it is with the OS 
>>> stack, yes? We aren’t requiring that the loopback be “up”, but the stack is 
>>> in order to establish the connection, even when we are trying a non-lo 
>>> interface.
>> this is correct (imho)
>>> I can look into generating a faster timeout on the socket creation. In the 
>>> trunk, we now use unix domain sockets instead of TCP to avoid such issues, 
>>> but that won’t help with the 1.8 series.
>> i was about to suggest this situation could have been avoided in the first 
>> place by using unix domain sockets instead of TCP sockets :-)
> 
> There were some historical reasons for not doing so - mostly because it 
> generally isn’t necessary on a cluster.
> 
>> 
>> is a backport (since this is already available in the trunk/master) simply 
>> out of the question ?
> 
> It would be against our normal procedures, but I can raise it at next week’s 
> meeting.
> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
 On Nov 25, 2014, at 4:50 PM, Paul Hargrove wrote:

 Ralph,

 I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello"
 I find that there is an attempt (by a secondary thread) to establish a TCP 
 socket from the rank process to the eth0 address of localhost (I am 
 guessing to reach the orted/mpirun).
 However, when the "lo" interface is down, the Linux kernel apparently 
 cannot establish that socket.

 In fact, if I am sufficiently patient, it turns out the "hang" is bounded, 
 and eventually one sees:

 phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out

 A process or daemon was unable to complete a TCP connection
 to another process:
   Local host:    blcr-armv7
   Remote host:   10.0.2.15
 This is usually caused by a firewall on the remote host. Please
 check that any firewall (e.g., iptables) has been disabled and
 try again.

 real    2m8.151s
 user    0m5.360s
 sys     0m57.430s

 Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.

 There is no firewall, but in case you doubt me on that, here is a 
 demonstration using ping to show that 10.0.2.15 is only reachable when the 
 loopback interface is enabled:

 phargrov@blcr-armv7:~$ sudo ifconfig lo up
 phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
 PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

 --- 10.0.2.15 ping statistics ---
 2 packets transmitted, 2 received, 0% packet loss, time 1002ms
 rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms

 phargrov@blcr-armv7:~$ sudo ifconfig lo down
 phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
 PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.

 --- 10.0.2.15 ping statistics ---
 2 packets transmitted, 0 received, 100% packet loss, time 1006ms

 So, there is no "hang" -- just a 2 minute pause before the error message 
 is generated.
 However, it may still be possible to present a better/earlier error 
 message when there is no loopback interface (and at least one rank process 
 is to be launched locally).
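
One way such an early check could look on Linux is sketched below. It is only an illustration under stated assumptions: it reads the interface flags from sysfs (`/sys/class/net/<ifname>/flags`) and tests the `IFF_UP` bit (0x1). The helper names are hypothetical and nothing here is taken from the Open MPI code base.

```python
IFF_UP = 0x1  # "interface is up" bit, as defined in <linux/if.h>


def flags_indicate_up(flags_text):
    """Return True if a sysfs flags string such as '0x9' has IFF_UP set."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def loopback_is_up(ifname="lo", sysfs_root="/sys/class/net"):
    """Best-effort check; returns None when the state cannot be read."""
    try:
        with open(f"{sysfs_root}/{ifname}/flags") as fh:
            return flags_indicate_up(fh.read())
    except OSError:
        # Missing interface or non-Linux system: report "unknown".
        return None
```

A launcher could call `loopback_is_up()` before starting local ranks and warn when it returns `False`, treating `None` as "cannot tell".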

 -Paul

 On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain wrote:
 I’ll have to look - there isn’t supposed to be such a requirement, and I 
 certainly haven’t seen it before.

> On Nov 25, 2014, at 3:26 PM, Paul Hargrove wrote:
> 
> Allan,
> 
> I am glad things are working for you now.
> I can confirm (on a QEMU-emulated Versatile Express A9 board running 
> Ubuntu 14.04) that disabling the "lo" interface reproduces the problem.
> I imagine this is true on other architectures, though I did not attempt 
> to verify.
> 
> Ralph,
> 
> If oob:tcp really does need the loopback interface, shouldn't its lack be 
> something that co

Re: [OMPI devel] RTLD_GLOBAL question

It is working for me, but I’m not sure if that is because of these changes or 
if it always worked for me. I haven’t tested the slurm integration in a while.


> On Dec 2, 2014, at 7:59 PM, Artem Polyakov  wrote:
> 
> Howard, does current master fix your problems?
> 
> On Wednesday, December 3, 2014, Artem Polyakov wrote:
> 
> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):
> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
> 
> > Jeff, your fix breaks my system again. Actually you just reverted my 
> > changes.
> 
> No, I didn't just revert them -- I made changes.  I did forget about the 
> second -I, though (to be fair, the 2nd -I was the *only* -I in there before I 
> committed).
> Yeah! I was speaking figuratively :).
>  
> Sorry about that -- I've tested your change (without the trailing /) and it 
> seems to work ok.  I'd go ahead and merge.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/ 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php 
> 
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> 
> 
> -- 
> -
> Best regards, Artem Polyakov
> (Mobile mail)
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16416.php 
>

Re: [OMPI devel] RTLD_GLOBAL question

Howard, does current master fix your problems?

On Wednesday, December 3, 2014, Artem Polyakov wrote:

>
> 2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):
>
>> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
>>
>> > Jeff, your fix breaks my system again. Actually you just reverted my
>> changes.
>>
>> No, I didn't just revert them -- I made changes.  I did forget about the
>> second -I, though (to be fair, the 2nd -I was the *only* -I in there before
>> I committed).
>>
> Yeah! I was speaking figuratively :).
>
>
>> Sorry about that -- I've tested your change (without the trailing /) and
>> it seems to work ok.  I'd go ahead and merge.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com 
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>>
>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>


-- 
-
Best regards, Artem Polyakov
(Mobile mail)

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-03 8:30 GMT+06:00 Jeff Squyres (jsquyres):

> On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:
>
> > Jeff, your fix breaks my system again. Actually you just reverted my
> changes.
>
> No, I didn't just revert them -- I made changes.  I did forget about the
> second -I, though (to be fair, the 2nd -I was the *only* -I in there before
> I committed).
>
Yeah! I was speaking figuratively :).


> Sorry about that -- I've tested your change (without the trailing /) and
> it seems to work ok.  I'd go ahead and merge.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16414.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov

Re: [OMPI devel] RTLD_GLOBAL question

On Dec 2, 2014, at 8:43 PM, Artem Polyakov wrote:

> Jeff, your fix breaks my system again. Actually you just reverted my changes.

No, I didn't just revert them -- I made changes.  I did forget about the second 
-I, though (to be fair, the 2nd -I was the *only* -I in there before I 
committed).

Sorry about that -- I've tested your change (without the trailing /) and it 
seems to work ok.  I'd go ahead and merge.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI devel] RTLD_GLOBAL question

Hello,

Jeff, your fix breaks my system again. Actually you just reverted my
changes. Here is what I have:

configure:5441: *** GNU libltdl setup
configure:296939: checking location of libltdl
configure:296952: result: internal copy
configure:297028: OPAL configuring in opal/libltdl
configure:297113: running /bin/bash '.../opal/libltdl/configure'
 '--prefix=.../ompi-pmix-refactoring_install/' '--enable-debug'
'--disable-oshmem' '--with-pmi=/home/artpol/sandboxes/slurm/'
--enable-ltdl-convenience --disable-ltdl-install --enable-shared
--disable-static --cache-file=/dev/null --srcdir=.../opal/libltdl
--disable-option-checking
configure:297119: /bin/bash '.../opal/libltdl/configure' succeeded for
opal/libltdl
In file included from conftest.c:718:0:
.../opal/libltdl/ltdl.h:36:31: fatal error: libltdl/lt_system.h: No such
file or directory
 #include <libltdl/lt_system.h>
   ^
compilation terminated.
configure:297864: checking for lt_dladvise
configure:297870: result: no
configure:297923: creating ./config.lt

To my surprise, this error (I am sure!) occurs on every system, but only on
mine does it cause the lt_dladvise check to fail! I checked that on other
machines!

The reason was pointed out in the original PR:
ltdl.h contains these includes:

#include <libltdl/lt_system.h>
#include <libltdl/lt_error.h>


These can't be found without "-I$srcdir/opal/libltdl/".

The point is that we DO need "-I$srcdir/opal/libltdl/" but we ALSO need
"-I$srcdir" too! I filed the new PR (
https://github.com/open-mpi/ompi/pull/301) but won't merge it until Edgar
confirms that it's OK with his system.
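
The need for both flags can be illustrated with a small simulation of how a preprocessor searches `-I` directories in order. This is a sketch under assumptions: the directory layout is inferred from the paths in the log above, and the include `opal/libltdl/ltdl.h` is a hypothetical stand-in for whatever requires `-I$srcdir`, which this thread does not identify.

```python
import os
import tempfile


def resolve_include(header, include_dirs):
    """Return the first existing path for `#include <header>`, else None.

    Mimics how a C preprocessor walks the -I directories in order.
    """
    for directory in include_dirs:
        candidate = os.path.join(directory, header)
        if os.path.isfile(candidate):
            return candidate
    return None


def make_demo_tree():
    """Create a throwaway tree shaped like the paths quoted above (assumed)."""
    srcdir = tempfile.mkdtemp()
    inner = os.path.join(srcdir, "opal", "libltdl", "libltdl")
    os.makedirs(inner)
    for path in (
        os.path.join(srcdir, "opal", "libltdl", "ltdl.h"),
        os.path.join(inner, "lt_system.h"),
        os.path.join(inner, "lt_error.h"),
    ):
        open(path, "w").close()  # empty placeholder header
    return srcdir
```

With only `srcdir` in the search list, `libltdl/lt_system.h` does not resolve. With only `srcdir/opal/libltdl`, a path written relative to `srcdir` does not resolve. Listing both directories covers both cases.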

So the original error was in removing -I$srcdir. I was sure that we
converged on this and found another valuable discussion in ompi-release:
https://github.com/open-mpi/ompi-release/pull/34

There I was looking into configure script and found:

CPPFLAGS="-I$srcdir/ -I$srcdir/opal/libltdl/"
# Must specifically mention $srcdir here for VPATH builds
# (this file is in the src tree).
cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h.  */
#include <$srcdir/opal/libltdl/ltdl.h>
_ACEOF


It seemed obvious that we don't need "-I$srcdir/" because the path was
hardcoded in the include, but it turns out that I was wrong; maybe some other
build system emits different code. I would like to see Edgar's original
config.log. Jeff, could you send it to me directly?

So, everybody, sorry for the inconvenience!


2014-12-03 0:41 GMT+06:00 Jeff Squyres (jsquyres) :

> See https://github.com/open-mpi/ompi/pull/298 for a fix.
>
> There's 2 commits on that PR -- the 2nd is just a cleanup.  The real fix
> is the 1st commit, here:
>
>
> https://github.com/jsquyres/ompi/commit/a736d83fb9a7b27986a008a2cda6eb1fea839fb3
>
> If someone can confirm that this works for them, we can bring it to master.
>
> It may have the side effect of "fixing / working around" (by coincidence)
> the SLURM bug (we all agree that the Right solution is to have SLURM fix it
> upstream, but I think this will put us back in the case of "working by
> accident / despite the SLURM bug").
>
>
>
> On Dec 2, 2014, at 10:59 AM, Jeff Squyres (jsquyres) 
> wrote:
>
> > I'm able to replicate Edgar's problem.
> >
> > I'm investigating...
> >
> >
> > On Dec 2, 2014, at 10:39 AM, Edgar Gabriel  wrote:
> >
> >> the mailing list refused to let me add the config.log file, since it is
> too large, I can forward the output to you directly as well (as I did to
> Jeff).
> >>
> >> I honestly have not looked into the configure logic, I can just tell
> that OPAL_HAVE_LTDL_ADVISE is not set on my linux system for master, but is
> set on the 1.8 series (1.8 series checkout was from Nov. 20, so if
> something changed in between the result might be different).
> >>
> >>
> >>
> >> On 12/2/2014 9:27 AM, Artem Polyakov wrote:
> >>>
> >>> 2014-12-02 20:59 GMT+06:00 Edgar Gabriel:
> >>>
> >>>   didn't want to interfere with this thread, although I have a similar
> >>>   issue, since I have the solution nearly fully cooked up. But anyway,
> >>>   this last email gave the hint on why we have suddenly the problem in
> >>>   ompio:
> >>>
> >>>   it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
> >>>   set anymore, so the entire section is being skipped. I double
> >>>   checked that with the 1.8 branch, it goes through the section, but
> >>>   not with master.
> >>>
> >>>
> >>> Hi, Edgar.
> >>>
> >>> Both master and ompi-release (isn't it 1.8?!) are equal in the sense of my
> >>> fix. Something else!? I'd like to see config.log too but will look into
> >>> it only tomorrow.
> >>>
> >>> Also I want to add that SLURM PMI2 communicates with local slurmstepd's
> >>> and doesn't need any authentication. PMI1 processes, by contrast,
> >>> communicate with the srun process and thus need libslurm services for
> >>> communication and authentication.
> >>>
> >>>
> >>>   Thanks
> >>>   Edgar
> >>>
> >>>
> >>>
> >>>
> >>>   On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
> >>>
> >>>   Looks like I

Re: [OMPI devel] RFC: update opal lifo class and add fifo class

2014-12-02 Thread Nathan Hjelm

On Tue, Dec 02, 2014 at 05:54:04PM -0500, George Bosilca wrote:
>The FIFO implementation doesn't look right to me. I don't have time to
>look at it right now, but just looking at the push it will not correctly
>succeed if two threads are pushing items at the same time.
>A FIFO is a very sensitive algorithm, and should be treated accordingly.
>Moreover, there is no immediate need for it, so I suggest you drop it from
>this RFC.

Agreed there is no immediate need for it so I am willing to push it
off. I included it because it does indeed work (passes the brutal unit
test included in the pull request). The design is based off of the
nemesis fifo enhanced to use either a head spin-lock or 128-bit
compare-and-swap to avoid ABA issues. The nemesis fifo is single-reader
multiple-writer.
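
For readers unfamiliar with the ABA problem, the general idea behind a double-width compare-and-swap can be sketched as follows. This is an emulation, not the code in the pull request: a lock stands in for the hardware's atomicity, and the class name `VersionedRef` and its methods are hypothetical.

```python
import threading


class VersionedRef:
    """A (value, version) pair updated by an emulated double-width CAS."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # emulates the hardware's atomicity

    def load(self):
        """Return a consistent (value, version) snapshot."""
        with self._lock:
            return self._value, self._version

    def compare_and_swap(self, expected, new_value):
        """Swap only if both value and version still match `expected`."""
        with self._lock:
            if (self._value, self._version) != expected:
                return False
            self._value = new_value
            self._version += 1  # every successful swap bumps the version
            return True
```

If one thread takes a snapshot while the value is A, and other threads then change A to B and back to A, the value alone looks unchanged. The version has advanced, however, so a swap based on the stale snapshot is rejected.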

-Nathan


Re: [OMPI devel] RFC: update opal lifo class and add fifo class

2014-12-02 Thread George Bosilca

The FIFO implementation doesn't look right to me. I don't have time to look
at it right now, but just looking at the push it will not correctly succeed
if two threads are pushing items at the same time.

A FIFO is a very sensitive algorithm, and should be treated accordingly.
Moreover, there is no immediate need for it, so I suggest you drop it from
this RFC.

  George.

PS: There are some known FIFO implementations that work correctly, but they
all require a CAS2. http://www.grame.fr/ressources/publications/LockFree.pdf


On Tue, Dec 2, 2014 at 5:41 PM, Nathan Hjelm  wrote:

>
> What: Update the interface to the atomic lifo to include non-atomic and
> opal_using_threads conditioned atomic versions.
>
> Why: We currently only have one type of lifo in the master branch:
> atomic. This has a negative impact on the performance of Open MPI when
> not using threads. To fix this issue I want to add two new flavors of
> lifo push and pop:
>
>  - opal_lifo_push_st/opal_lifo_pop_st: explicit single-threaded. These
>    versions can be used when it is guaranteed that no other thread will
>    touch the lifo.
>
>  - opal_lifo_push/opal_lifo_pop: use atomics if opal_using_threads is
>    set otherwise use the single-threaded versions.
>
>
> The existing functions: opal_atomic_lifo_push and opal_atomic_lifo_pop
> will be renamed to opal_lifo_push_atomic and opal_lifo_pop_atomic
> respectively. I have made some improvements to the atomic implementation
> and included implementations of push/pop that use the 128-bit
> compare-and-swap available on most modern x86_64 processors.
>
> Existing code (ompi_free_list_t) will use the conditioned version. This
> will give good out of the box performance with single threaded
> benchmarks while still providing support for the MPI_THREAD_MULTIPLE
> case.
>
> As part of this RFC I am pushing a fifo implementation and unit tests
> for both the fifo and lifo classes. More info can be found in the commit
> message:
>
>
> https://github.com/hjelmn/ompi/commit/b57b4b2df841a2d309b528717b40d9bf23355a82
>
>
> When: Tuesday, Dec 9.
>
>
> The pull request can be found @ https://github.com/open-mpi/ompi/pull/300
>
> -Nathan
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16410.php
>

[OMPI devel] RFC: update opal lifo class and add fifo class

2014-12-02 Thread Nathan Hjelm


What: Update the interface to the atomic lifo to include non-atomic and
opal_using_threads conditioned atomic versions.

Why: We currently only have one type of lifo in the master branch:
atomic. This has a negative impact on the performance of Open MPI when
not using threads. To fix this issue I want to add two new flavors of
lifo push and pop:

 - opal_lifo_push_st/opal_lifo_pop_st: explicit single-threaded. These
   versions can be used when it is guaranteed that no other thread will
   touch the lifo.

 - opal_lifo_push/opal_lifo_pop: use atomics if opal_using_threads is
   set otherwise use the single-threaded versions.


The existing functions: opal_atomic_lifo_push and opal_atomic_lifo_pop
will be renamed to opal_lifo_push_atomic and opal_lifo_pop_atomic
respectively. I have made some improvements to the atomic implementation
and included implementations of push/pop that use the 128-bit
compare-and-swap available on most modern x86_64 processors.

Existing code (ompi_free_list_t) will use the conditioned version. This
will give good out of the box performance with single threaded
benchmarks while still providing support for the MPI_THREAD_MULTIPLE
case.
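
The relationship between the three flavors can be sketched as below. This is an illustration only, not the OPAL implementation: a `threading.Lock` stands in for the atomic operations, the `using_threads` attribute plays the role of opal_using_threads, and the class and method names are assumptions.

```python
import threading


class Lifo:
    """Toy LIFO with single-threaded, locked, and conditioned push/pop."""

    def __init__(self, using_threads=False):
        self._items = []
        self._lock = threading.Lock()  # stand-in for atomic operations
        self.using_threads = using_threads  # role of opal_using_threads

    def push_st(self, item):
        """Single-threaded push: no synchronization at all."""
        self._items.append(item)

    def pop_st(self):
        """Single-threaded pop: returns None when the lifo is empty."""
        return self._items.pop() if self._items else None

    def push_atomic(self, item):
        """Thread-safe push (lock used here in place of atomics)."""
        with self._lock:
            self.push_st(item)

    def pop_atomic(self):
        """Thread-safe pop (lock used here in place of atomics)."""
        with self._lock:
            return self.pop_st()

    def push(self, item):
        """Conditioned push: pay for synchronization only when needed."""
        if self.using_threads:
            self.push_atomic(item)
        else:
            self.push_st(item)

    def pop(self):
        """Conditioned pop, mirroring push."""
        return self.pop_atomic() if self.using_threads else self.pop_st()
```

The conditioned versions let existing callers keep a single call site while skipping synchronization cost in single-threaded runs.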

As part of this RFC I am pushing a fifo implementation and unit tests
for both the fifo and lifo classes. More info can be found in the commit
message:

https://github.com/hjelmn/ompi/commit/b57b4b2df841a2d309b528717b40d9bf23355a82


When: Tuesday, Dec 9.


The pull request can be found @ https://github.com/open-mpi/ompi/pull/300

-Nathan



Re: [OMPI devel] RTLD_GLOBAL question

See https://github.com/open-mpi/ompi/pull/298 for a fix.

There's 2 commits on that PR -- the 2nd is just a cleanup.  The real fix is the 
1st commit, here:

https://github.com/jsquyres/ompi/commit/a736d83fb9a7b27986a008a2cda6eb1fea839fb3

If someone can confirm that this works for them, we can bring it to master.

It may have the side effect of "fixing / working around" (by coincidence) the 
SLURM bug (we all agree that the Right solution is to have SLURM fix it 
upstream, but I think this will put us back in the case of "working by accident 
/ despite the SLURM bug").



On Dec 2, 2014, at 10:59 AM, Jeff Squyres (jsquyres)  wrote:

> I'm able to replicate Edgar's problem.
> 
> I'm investigating...
> 
> 
> On Dec 2, 2014, at 10:39 AM, Edgar Gabriel  wrote:
> 
>> the mailing list refused to let me add the config.log file, since it is too 
>> large, I can forward the output to you directly as well (as I did to Jeff).
>> 
>> I honestly have not looked into the configure logic, I can just tell that 
>> OPAL_HAVE_LTDL_ADVISE is not set on my linux system for master, but is set 
>> on the 1.8 series (1.8 series checkout was from Nov. 20, so if something 
>> changed in between the result might be different).
>> 
>> 
>> 
>> On 12/2/2014 9:27 AM, Artem Polyakov wrote:
>>> 
>>> 2014-12-02 20:59 GMT+06:00 Edgar Gabriel:
>>> 
>>>   didn't want to interfere with this thread, although I have a similar
>>>   issue, since I have the solution nearly fully cooked up. But anyway,
>>>   this last email gave the hint on why we have suddenly the problem in
>>>   ompio:
>>> 
>>>   it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
>>>   set anymore, so the entire section is being skipped. I double
>>>   checked that with the 1.8 branch, it goes through the section, but
>>>   not with master.
>>> 
>>> 
>>> Hi, Edgar.
>>> 
>>> Both master and ompi-release (isn't it 1.8?!) are equal in the sense of my
>>> fix. Something else!? I'd like to see config.log too but will look into
>>> it only tomorrow.
>>> 
>>> Also I want to add that SLURM PMI2 communicates with local slurmstepd's
>>> and doesn't need any authentication. PMI1 processes, by contrast,
>>> communicate with the srun process and thus need libslurm services for
>>> communication and authentication.
>>> 
>>> 
>>>   Thanks
>>>   Edgar
>>> 
>>> 
>>> 
>>> 
>>>   On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>>> 
>>>   Looks like I was totally lying in
>>>   http://www.open-mpi.org/community/lists/devel/2014/12/16381.php
>>>   (where
>>>   I said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>>> 
>>>   https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>> 
>>> 
>>> 
>>>   This ltdl advice object is passed to lt_dlopen() for all
>>>   components.  My mistake; sorry.
>>> 
>>>   So the idea that using RTLD_GLOBAL will fix this SLURM bug is
>>>   incorrect.
>>> 
>>>   I believe someone said earlier in the thread that adding the
>>>   right -llibs to the configure line will solve the issue, and
>>>   that sounds correct to me.  If there's a missing symbol because
>>>   the SLURM libraries are not automatically pulling in the right
>>>   dependent libraries, then *if* we put a workaround in OMPI to
>>>   fix this issue, then the right workaround is to add the relevant
>>>   -llibs when that component is linked.
>>> 
>>>   *If* you add that workaround (which is a whole separate
>>>   discussion), I would suggest adding a configure.m4 test to see
>>>   if adding the additional -llibs are necessary.  Perhaps
>>>   AC_LINK_IFELSE looking for a symbol, and then if that fails,
>>>   AC_LINK_IFELSE again with the additional -llibs to see if that
>>>   works.
>>> 
>>>   Or something like that.
>>> 
>>> 
>>> 
>>>   On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:
>>> 
>>>   Agreed. The first thing to check is what value
>>>   OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very
>>>   probably the same bug as mine.
>>> 
>>>   2014-12-02 17:33 GMT+06:00 Ralph Castain:
>>>   It does look similar - question is: why didn’t this fix the
>>>   problem? Will have to investigate.
>>> 
>>>   Thanks
>>> 
>>> 
>>>   On Dec 2, 2014, at 3:17 AM, Artem Polyakov
>>>   <artpo...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>>   2014-12-02 17:13 GMT+06:00 Ralph Castain
>>>   <r...@open-mpi.org>:
>>>   Hmmm…if that is true, then it didn’t fix this problem as
>>>   it is being reported in the master.
>>> 
>>>

Re: [OMPI devel] Introducing memkind + Adding component in mpool framework

Vish --

In general, this sounds like a great idea.

We talked about this on the call today, and it looks like it's going to take a 
bit of thought into how to integrate this into OMPI.  I.e., we might have to 
adjust the mpool and/or allocator frameworks a bit first.

Is there any chance that you can attend the OMPI face-to-face dev meeting in 
late January?

https://github.com/open-mpi/ompi/wiki/Meeting-2015-01


On Nov 18, 2014, at 7:38 PM, Vishwanath Venkatesan  wrote:

> Hello all,
> 
> I have been working on an implementation for supporting the use of 
> MPI_Alloc_mem with our new allocator library called memkind 
> (https://github.com/memkind/). The memkind library allocates from
> different kinds of memory, where the kinds implemented within the library
> control NUMA and page size features.  This could be leveraged
> conveniently with MPI_Alloc_mem.
> 
> I was hoping to trigger the use of the memkind component by using either an 
> info object or an mca parameter (mpirun -np x --mca mpool memkind ).
> The modules of the mpool framework are loaded from components in the btl 
> framework and not in the base of mpool. But in the case of my implementation, 
> the component can remain independent from the btl framework. Is there a way 
> to introduce priority for mpool component selection? 
> 
> Also, with the use of info objects in mpool_base_alloc.c, it looks like the 
> same code path is taken irrespective of whether the info is null or not, as 
> the branch conditions seem to be commented out. Could this be un-commented or 
> will there be a different patch for this?
> 
> Please let me know,
> Thanks,
> Vish
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16320.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI devel] Introducing memkind + Adding component in mpool framework

Hi Vish

We talked about this on today’s telecon and people are generally supportive of 
the concept. However, the feeling was that this will take some thought and fair 
amount of work to modify mpool and the allocators properly to do this the 
“right way”.

So people asked if you could come to the Jan devel meeting in Dallas where we 
could hammer out the right way to integrate this, and maybe parcel out the work?

The meeting is detailed here:

https://github.com/open-mpi/ompi/wiki 

Hope you can come! Hope the wedding went well!
Ralph


> On Nov 18, 2014, at 4:38 PM, Vishwanath Venkatesan  
> wrote:
> 
> Hello all,
> 
> I have been working on an implementation for supporting the use of 
> MPI_Alloc_mem with our new allocator library called memkind 
> (https://github.com/memkind/). The memkind library allocates from
> different kinds of memory, where the kinds implemented within the library
> control NUMA and page size features.  This could be leveraged
> conveniently with MPI_Alloc_mem.
> 
> I was hoping to trigger the use of the memkind component by using either an 
> info object or an mca parameter (mpirun -np x --mca mpool memkind ).
> The modules of the mpool framework are loaded from components in the btl 
> framework and not in the base of mpool. But in the case of my implementation, 
> the component can remain independent from the btl framework. Is there a way 
> to introduce priority for mpool component selection? 
> 
> Also, with the use of info objects in mpool_base_alloc.c, it looks like the 
> same code path is taken irrespective of whether the info is null or not, as 
> the branch conditions seem to be commented out. Could this be un-commented or 
> will there be a different patch for this?
> 
> Please let me know,
> Thanks,
> Vish
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16320.php

Re: [OMPI devel] RTLD_GLOBAL question

I'm able to replicate Edgar's problem.

I'm investigating...


On Dec 2, 2014, at 10:39 AM, Edgar Gabriel  wrote:

> the mailing list refused to let me add the config.log file, since it is too 
> large, I can forward the output to you directly as well (as I did to Jeff).
> 
> I honestly have not looked into the configure logic, I can just tell that 
> OPAL_HAVE_LTDL_ADVISE is not set on my linux system for master, but is set on 
> the 1.8 series (1.8 series checkout was from Nov. 20, so if something changed 
> in between the result might be different).
> 
> 
> 
> On 12/2/2014 9:27 AM, Artem Polyakov wrote:
>> 
>> 2014-12-02 20:59 GMT+06:00 Edgar Gabriel:
>> 
>>didn't want to interfere with this thread, although I have a similar
>>issue, since I have the solution nearly fully cooked up. But anyway,
>>this last email gave the hint on why we have suddenly the problem in
>>ompio:
>> 
>>it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
>>set anymore, so the entire section is being skipped. I double
>>checked that with the 1.8 branch, it goes through the section, but
>>not with master.
>> 
>> 
>> Hi, Edgar.
>> 
>> Both master and ompi-release (isn't it 1.8?!) are equal in the sense of my
>> fix. Something else!? I'd like to see config.log too but will look into
>> it only tomorrow.
>> 
>> Also I want to add that SLURM PMI2 communicates with local slurmstepd's
>> and doesn't need any authentication. PMI1 processes, by contrast,
>> communicate with the srun process and thus need libslurm services for
>> communication and authentication.
>> 
>> 
>>Thanks
>>Edgar
>> 
>> 
>> 
>> 
>>On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>> 
>>Looks like I was totally lying in
>>http://www.open-mpi.org/community/lists/devel/2014/12/16381.php
>> (where
>>I said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>> 
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>
>> 
>> 
>>This ltdl advice object is passed to lt_dlopen() for all
>>components.  My mistake; sorry.
>> 
>>So the idea that using RTLD_GLOBAL will fix this SLURM bug is
>>incorrect.
>> 
>>I believe someone said earlier in the thread that adding the
>>right -llibs to the configure line will solve the issue, and
>>that sounds correct to me.  If there's a missing symbol because
>>the SLURM libraries are not automatically pulling in the right
>>dependent libraries, then *if* we put a workaround in OMPI to
>>fix this issue, then the right workaround is to add the relevant
>>-llibs when that component is linked.
>> 
>>*If* you add that workaround (which is a whole separate
>>discussion), I would suggest adding a configure.m4 test to see
>>if adding the additional -llibs are necessary.  Perhaps
>>AC_LINK_IFELSE looking for a symbol, and then if that fails,
>>AC_LINK_IFELSE again with the additional -llibs to see if that
>>works.
>> 
>>Or something like that.
>> 
>> 
>> 
>>On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:
>> 
>>Agreed. The first thing to check is what value
>>OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very
>>probably the same bug as mine.
>> 
>>2014-12-02 17:33 GMT+06:00 Ralph Castain:
>>It does look similar - question is: why didn’t this fix the
>>problem? Will have to investigate.
>> 
>>Thanks
>> 
>> 
>>On Dec 2, 2014, at 3:17 AM, Artem Polyakov
>><artpo...@gmail.com> wrote:
>> 
>> 
>> 
>>2014-12-02 17:13 GMT+06:00 Ralph Castain
>><r...@open-mpi.org>:
>>Hmmm…if that is true, then it didn’t fix this problem as
>>it is being reported in the master.
>> 
>>I had this problem on my laptop installation. You can
>>check my report (it was detailed enough) and see if you
>>are hitting the same issue. My fix was also included in the
>>1.8 branch. I am not sure that this is the same issue,
>>but they look similar.
>> 
>> 
>> 
>>On Dec 1, 2014, at 9:40 PM, Artem Polyakov
>><artpo...@gmail.com> wrote:
>> 
>>I think this might be related to the configuration
>>problem I was fixing with Jeff a few months ago. Refer
>>here:
>>https://github.com/open-mpi/ompi/pull/240
>>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Edgar Gabriel

The mailing list refused to let me attach the config.log file because it is
too large. I can forward the output to you directly as well (as I did to
Jeff).

I honestly have not looked into the configure logic. I can only tell you
that OPAL_HAVE_LTDL_ADVISE is not set on my Linux system for master, but
is set on the 1.8 series (my 1.8 checkout is from Nov. 20, so if
something changed in between, the result might be different).

On 12/2/2014 9:27 AM, Artem Polyakov wrote:

2014-12-02 20:59 GMT+06:00 Edgar Gabriel:

didn't want to interfere with this thread, although I have a similar
issue, since I have the solution nearly fully cooked up. But anyway,
this last email gave the hint on why we have suddenly the problem in
ompio:

it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not
set anymore, so the entire section is being skipped. I double
checked that with the 1.8 branch, it goes through the section, but
not with master.

Hi, Edgar.

Both master and ompi-release (isn't that 1.8?!) are the same with respect to
my fix. Is it something else? I'd like to see config.log too, but I will only
be able to look into it tomorrow.

Also, I want to add that SLURM PMI2 communicates with the local slurmstepd's
and doesn't need any authentication. All PMI1 processes, by contrast,
communicate with the srun process and thus need libslurm services for
communication and authentication.

Thanks
Edgar

On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:

Looks like I was totally lying in
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where
I said we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all
components. My mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is
incorrect.

I believe someone said earlier in the thread that adding the
right -llibs to the configure line will solve the issue, and
that sounds correct to me. If there's a missing symbol because
the SLURM libraries are not automatically pulling in the right
dependent libraries, then *if* we put a workaround in OMPI to
fix this issue, then the right workaround is to add the relevant
-llibs when that component is linked.

*If* you add that workaround (which is a whole separate
discussion), I would suggest adding a configure.m4 test to see
if adding the additional -llibs is necessary. Perhaps
AC_LINK_IFELSE looking for a symbol, and then if that fails,
AC_LINK_IFELSE again with the additional -llibs to see if that
works.

Or something like that.

On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:

Agree. First you should check what value
OPAL_HAVE_LTDL_ADVISE is set to. If it is zero, this is very probably
the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain:
It does look similar - question is: why didn’t this fix the
problem? Will have to investigate.

Thanks

On Dec 2, 2014, at 3:17 AM, Artem Polyakov wrote:

2014-12-02 17:13 GMT+06:00 Ralph Castain:
Hmmm…if that is true, then it didn’t fix this problem as
it is being reported in the master.

I had this problem on my laptop installation. You can
check my report (it was detailed enough) and see if you are
hitting the same issue. My fix was also included in the
1.8 branch. I am not sure that this is the same issue,
but they look similar.

On Dec 1, 2014, at 9:40 PM, Artem Polyakov wrote:

I think this might be related to the configuration
problem I was fixing with Jeff a few months ago. Refer
here:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain:
If it isn’t too much trouble, it would be good to
confirm that it remains broken. I strongly suspect
it is based on Moe’s comments.

Obviously, other people are making this work.

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 20:59 GMT+06:00 Edgar Gabriel :

> didn't want to interfere with this thread, although I have a similar
> issue, since I have the solution nearly fully cooked up. But anyway, this
> last email gave the hint on why we have suddenly the problem in ompio:
>
> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set
> anymore, so the entire section is being skipped. I double checked that with
> the 1.8 branch, it goes through the section, but not with master.
>

Hi, Edgar.

Both master and ompi-release (isn't that 1.8?!) are the same with respect to my
fix. Is it something else? I'd like to see config.log too, but I will only be
able to look into it tomorrow.

Also, I want to add that SLURM PMI2 communicates with the local slurmstepd's and
doesn't need any authentication. All PMI1 processes, by contrast, communicate
with the srun process and thus need libslurm services for communication and
authentication.


>
> Thanks
> Edgar
>
>
>
>
> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>
>> Looks like I was totally lying in
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I
>> said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>>
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>>
>> This ltdl advice object is passed to lt_dlopen() for all components.  My
>> mistake; sorry.
>>
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>>
>> I believe someone said earlier in the thread that adding the right -llibs
>> to the configure line will solve the issue, and that sounds correct to me.
>> If there's a missing symbol because the SLURM libraries are not
>> automatically pulling in the right dependent libraries, then *if* we put a
>> workaround in OMPI to fix this issue, then the right workaround is to add
>> the relevant -llibs when that component is linked.
>>
>> *If* you add that workaround (which is a whole separate discussion), I
>> would suggest adding a configure.m4 test to see if adding the additional
>> -llibs are necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and
>> then if that fails, AC_LINK_IFELSE again with the additional -llibs to see
>> if that works.
>>
>> Or something like that.
>>
>>
>>
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:
>>
>>  Agree. First you should check is to what value OPAL_HAVE_LTDL_ADVISE is
>>> set. If it is zero - very probably this is the same bug as mine.
>>>
>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
>>> It does look similar - question is: why didn’t this fix the problem?
>>> Will have to investigate.
>>>
>>> Thanks
>>>
>>>
>>>  On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:



 2014-12-02 17:13 GMT+06:00 Ralph Castain :
 Hmmm…if that is true, then it didn’t fix this problem as it is being
 reported in the master.

 I had this problem on my laptop installation. You can check my report
 it was detailed enough and see if you hitting the same issue. My fix was
 also included into 1.8 branch. I am not sure that this is the same issue
 but they looks similar.



  On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>
> I think this might be related to the configuration problem I was
> fixing with Jeff few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240
>
> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
> If it isn’t too much trouble, it would be good to confirm that it
> remains broken. I strongly suspect it is based on Moe’s comments.
>
> Obviously, other people are making this work. For Intel MPI, all you
> do is point it at libpmi and they can run. However, they do explicitly
> dlopen it in their code, and I don’t know what flags they might pass when
> they do so.
>
> If necessary, I suppose we could follow that pattern. In other words,
> rather than specifically linking the “s1” component to libpmi, instead
> require that the user point us to a pmi library via an MCA param, then
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
> cited by Jeff, but resolves the pmi linkage problem.
>
>
>  On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error:
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or
>> received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were
>> transmitted or received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd /usr/lib64/slurm/auth_munge.so
>>  linux-vdso.so.1 =>  (0x7fff54478000)
>>  libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x

Re: [OMPI devel] RTLD_GLOBAL question

@#$%#@$%

Can you send your configure output and config.log?


On Dec 2, 2014, at 10:06 AM, Edgar Gabriel  wrote:

> I checked with the debugger, that it did skip the entire section
> 
> On 12/2/2014 9:04 AM, Jeff Squyres (jsquyres) wrote:
>> Oy -- I thought we fixed that.  :-(
>> 
>> Are you saying that configure output says that ltdladvise is not found?
>> 
>> 
>> On Dec 2, 2014, at 9:59 AM, Edgar Gabriel  wrote:
>> 
>>> didn't want to interfere with this thread, although I have a similar issue, 
>>> since I have the solution nearly fully cooked up. But anyway, this last 
>>> email gave the hint on why we have suddenly the problem in ompio:
>>> 
>>> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set 
>>> anymore, so the entire section is being skipped. I double checked that with 
>>> the 1.8 branch, it goes through the section, but not with master.
>>> 
>>> Thanks
>>> Edgar
>>> 
>>> 
>>> 
>>> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
 Looks like I was totally lying in 
 http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I 
 said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
 
 https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
 
 This ltdl advice object is passed to lt_dlopen() for all components.  My 
 mistake; sorry.
 
 So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
 
 I believe someone said earlier in the thread that adding the right -llibs 
 to the configure line will solve the issue, and that sounds correct to me. 
  If there's a missing symbol because the SLURM libraries are not 
 automatically pulling in the right dependent libraries, then *if* we put a 
 workaround in OMPI to fix this issue, then the right workaround is to add 
 the relevant -llibs when that component is linked.
 
 *If* you add that workaround (which is a whole separate discussion), I 
 would suggest adding a configure.m4 test to see if adding the additional 
 -llibs are necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and 
 then if that fails, AC_LINK_IFELSE again with the additional -llibs to see 
 if that works.
 
 Or something like that.
 
 
 
 On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:
 
> Agree. First you should check is to what value OPAL_HAVE_LTDL_ADVISE is 
> set. If it is zero - very probably this is the same bug as mine.
> 
> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
> It does look similar - question is: why didn’t this fix the problem? Will 
> have to investigate.
> 
> Thanks
> 
> 
>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
>> 
>> 
>> 
>> 2014-12-02 17:13 GMT+06:00 Ralph Castain :
>> Hmmm…if that is true, then it didn’t fix this problem as it is being 
>> reported in the master.
>> 
>> I had this problem on my laptop installation. You can check my report it 
>> was detailed enough and see if you hitting the same issue. My fix was 
>> also included into 1.8 branch. I am not sure that this is the same issue 
>> but they looks similar.
>> 
>> 
>> 
>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>>> 
>>> I think this might be related to the configuration problem I was fixing 
>>> with Jeff few months ago. Refer here:
>>> https://github.com/open-mpi/ompi/pull/240
>>> 
>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>>> If it isn’t too much trouble, it would be good to confirm that it 
>>> remains broken. I strongly suspect it is based on Moe’s comments.
>>> 
>>> Obviously, other people are making this work. For Intel MPI, all you do 
>>> is point it at libpmi and they can run. However, they do explicitly 
>>> dlopen it in their code, and I don’t know what flags they might pass 
>>> when they do so.
>>> 
>>> If necessary, I suppose we could follow that pattern. In other words, 
>>> rather than specifically linking the “s1” component to libpmi, instead 
>>> require that the user point us to a pmi library via an MCA param, then 
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
>>> cited by Jeff, but resolves the pmi linkage problem.
>>> 
>>> 
 On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
  wrote:
 
 $ srun --version
 slurm 2.6.6-VENDOR_PROVIDED
 
 $ srun --mpi=pmi2 -n 1 ~/hw
 I am 0 / 1
 
 $ srun -n 1 ~/hw
 /csc/home1/gouaillardet/hw: symbol lookup error: 
 /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
 srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
 srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted 
 or received
 srun: error: soleil: tas

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Edgar Gabriel

I checked with the debugger: it did skip the entire section.

On 12/2/2014 9:04 AM, Jeff Squyres (jsquyres) wrote:

Oy -- I thought we fixed that. :-(

Are you saying that configure output says that ltdladvise is not found?

On Dec 2, 2014, at 9:59 AM, Edgar Gabriel wrote:

didn't want to interfere with this thread, although I have a similar issue,
since I have the solution nearly fully cooked up. But anyway, this last email
gave the hint on why we have suddenly the problem in ompio:

it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set
anymore, so the entire section is being skipped. I double checked that with the
1.8 branch, it goes through the section, but not with master.

Thanks
Edgar

On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:

Looks like I was totally lying in
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said
we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components. My
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to
the configure line will solve the issue, and that sounds correct to me. If
there's a missing symbol because the SLURM libraries are not automatically
pulling in the right dependent libraries, then *if* we put a workaround in OMPI
to fix this issue, then the right workaround is to add the relevant -llibs when
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would
suggest adding a configure.m4 test to see if adding the additional -llibs is
necessary. Perhaps AC_LINK_IFELSE looking for a symbol, and then if that
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.

On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:

Agree. First you should check what value OPAL_HAVE_LTDL_ADVISE is set to. If
it is zero, this is very probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain :
It does look similar - question is: why didn’t this fix the problem? Will have
to investigate.

Thanks

On Dec 2, 2014, at 3:17 AM, Artem Polyakov wrote:

2014-12-02 17:13 GMT+06:00 Ralph Castain :
Hmmm…if that is true, then it didn’t fix this problem as it is being reported
in the master.

I had this problem on my laptop installation. You can check my report (it was
detailed enough) and see if you are hitting the same issue. My fix was also
included in the 1.8 branch. I am not sure that this is the same issue, but they
look similar.

On Dec 1, 2014, at 9:40 PM, Artem Polyakov wrote:

I think this might be related to the configuration problem I was fixing with
Jeff a few months ago. Refer here:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain :
If it isn’t too much trouble, it would be good to confirm that it remains
broken. I strongly suspect it is based on Moe’s comments.

Obviously, other people are making this work. For Intel MPI, all you do is
point it at libpmi and they can run. However, they do explicitly dlopen it in
their code, and I don’t know what flags they might pass when they do so.

If necessary, I suppose we could follow that pattern. In other words, rather
than specifically linking the “s1” component to libpmi, instead require that
the user point us to a pmi library via an MCA param, then explicitly dlopen
that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but
resolves the pmi linkage problem.
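
The pattern suggested above (take a library path from the user, then open it explicitly with RTLD_GLOBAL) can be sketched in a few lines. This is only an illustration: it uses Python's ctypes as a stand-in for a C dlopen() call, and the environment variable EXAMPLE_PMI_LIBRARY is a hypothetical placeholder for the MCA param, not an existing Open MPI setting.

```python
import ctypes
import os


def open_global(path):
    """dlopen `path` with RTLD_GLOBAL so its symbols are visible to libraries loaded later."""
    # ctypes.CDLL raises OSError if the underlying dlopen() fails.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


def open_pmi_from_env(var="EXAMPLE_PMI_LIBRARY"):
    """Load the library named by a (hypothetical) environment variable.

    Returns None when the variable is unset or empty, so the caller can
    fall back to whatever it would normally do.
    """
    path = os.environ.get(var)
    if not path:
        return None
    return open_global(path)
```

With RTLD_GLOBAL, a plugin that the PMI library loads afterwards can resolve symbols against the libraries already opened this way; whether that is enough for a given SLURM version is exactly what the thread is trying to find out.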

On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
wrote:

$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error:
/usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or
received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
linux-vdso.so.1 => (0x7fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x003bf540)
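
The listing above has no libslurm entry, which is consistent with the undefined slurm_verbose symbol. As a small illustration, the following sketch parses ldd output and reports whether a given library appears among the dependencies. The helper names (`ldd_dependencies`, `depends_on`) are hypothetical, and the parsing assumes the usual "name => path (address)" layout shown above.

```python
def ldd_dependencies(ldd_output):
    """Return the file names of the libraries listed in `ldd` output."""
    names = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Lines look like "libfoo.so.1 => /path/libfoo.so.1 (0xaddr)"
        # or, for the dynamic loader, "/lib64/ld-linux-x86-64.so.2 (0xaddr)".
        name = line.split("=>")[0].split("(")[0].strip()
        names.append(name.rsplit("/", 1)[-1])
    return names


def depends_on(ldd_output, prefix):
    """True if any listed dependency's file name starts with `prefix`."""
    return any(name.startswith(prefix) for name in ldd_dependencies(ldd_output))
```

Feeding it the text printed by `ldd /usr/lib64/slurm/auth_munge.so`, `depends_on(text, "libslurm")` would be False for the output shown here.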

now, if i relink auth_munge.so so it depends on libslurm:

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol:
slurm_auth_get_arg_desc

i can give a try to the latest slurm if needed

Cheers,

Gilles

On 2014/12/02 12:56, Ralph Castain wrote:

Out of curiosity - how are you testing these? I have more current versions of
Slurm and would like to test the observations there.

On Dec 1, 2014, at 7:49 PM, Gilles Goua

Re: [OMPI devel] RTLD_GLOBAL question

Oy -- I thought we fixed that.  :-(

Are you saying that configure output says that ltdladvise is not found?


On Dec 2, 2014, at 9:59 AM, Edgar Gabriel  wrote:

> didn't want to interfere with this thread, although I have a similar issue, 
> since I have the solution nearly fully cooked up. But anyway, this last email 
> gave the hint on why we have suddenly the problem in ompio:
> 
> it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set 
> anymore, so the entire section is being skipped. I double checked that with 
> the 1.8 branch, it goes through the section, but not with master.
> 
> Thanks
> Edgar
> 
> 
> 
> On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:
>> Looks like I was totally lying in 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I 
>> said we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:
>> 
>> https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124
>> 
>> This ltdl advice object is passed to lt_dlopen() for all components.  My 
>> mistake; sorry.
>> 
>> So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.
>> 
>> I believe someone said earlier in the thread that adding the right -llibs to 
>> the configure line will solve the issue, and that sounds correct to me.  If 
>> there's a missing symbol because the SLURM libraries are not automatically 
>> pulling in the right dependent libraries, then *if* we put a workaround in 
>> OMPI to fix this issue, then the right workaround is to add the relevant 
>> -llibs when that component is linked.
>> 
>> *If* you add that workaround (which is a whole separate discussion), I would 
>> suggest adding a configure.m4 test to see if adding the additional -llibs 
>> are necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if 
>> that fails, AC_LINK_IFELSE again with the additional -llibs to see if that 
>> works.
>> 
>> Or something like that.
>> 
>> 
>> 
>> On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:
>> 
>>> Agree. First you should check is to what value OPAL_HAVE_LTDL_ADVISE is 
>>> set. If it is zero - very probably this is the same bug as mine.
>>> 
>>> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
>>> It does look similar - question is: why didn’t this fix the problem? Will 
>>> have to investigate.
>>> 
>>> Thanks
>>> 
>>> 
 On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
 
 
 
 2014-12-02 17:13 GMT+06:00 Ralph Castain :
 Hmmm…if that is true, then it didn’t fix this problem as it is being 
 reported in the master.
 
 I had this problem on my laptop installation. You can check my report it 
 was detailed enough and see if you hitting the same issue. My fix was also 
 included into 1.8 branch. I am not sure that this is the same issue but 
 they looks similar.
 
 
 
> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
> 
> I think this might be related to the configuration problem I was fixing 
> with Jeff few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240
> 
> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
> If it isn’t too much trouble, it would be good to confirm that it remains 
> broken. I strongly suspect it is based on Moe’s comments.
> 
> Obviously, other people are making this work. For Intel MPI, all you do 
> is point it at libpmi and they can run. However, they do explicitly 
> dlopen it in their code, and I don’t know what flags they might pass when 
> they do so.
> 
> If necessary, I suppose we could follow that pattern. In other words, 
> rather than specifically linking the “s1” component to libpmi, instead 
> require that the user point us to a pmi library via an MCA param, then 
> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
> cited by Jeff, but resolves the pmi linkage problem.
> 
> 
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>>  wrote:
>> 
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>> 
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>> 
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error: 
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted 
>> or received
>> srun: error: soleil: task 0: Exited with exit code 127
>> 
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 =>  (0x7fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>> 
>> 
>> now, if i relink auth_munge.so so it depends on

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Edgar Gabriel

it looks like OPAL_HAVE_LTDL_ADVISE (at least on my systems) is not set
anymore, so the entire section is being skipped. I double checked that
with the 1.8 branch, it goes through the section, but not with master.

Thanks
Edgar

On 12/2/2014 7:56 AM, Jeff Squyres (jsquyres) wrote:

Looks like I was totally lying in
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said
we should not use RTLD_GLOBAL). We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components. My
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to
the configure line will solve the issue, and that sounds correct to me. If
there's a missing symbol because the SLURM libraries are not automatically
pulling in the right dependent libraries, then *if* we put a workaround in OMPI
to fix this issue, then the right workaround is to add the relevant -llibs when
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would
suggest adding a configure.m4 test to see if adding the additional -llibs is
necessary. Perhaps AC_LINK_IFELSE looking for a symbol, and then if that
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.

On Dec 2, 2014, at 6:38 AM, Artem Polyakov wrote:

Agree. First you should check what value OPAL_HAVE_LTDL_ADVISE is set to. If
it is zero, this is very probably the same bug as mine.

2014-12-02 17:33 GMT+06:00 Ralph Castain :
It does look similar - question is: why didn’t this fix the problem? Will have
to investigate.

Thanks

On Dec 2, 2014, at 3:17 AM, Artem Polyakov wrote:

2014-12-02 17:13 GMT+06:00 Ralph Castain :
Hmmm…if that is true, then it didn’t fix this problem as it is being reported
in the master.

I had this problem on my laptop installation. You can check my report (it was
detailed enough) and see if you are hitting the same issue. My fix was also
included in the 1.8 branch. I am not sure that this is the same issue, but they
look similar.

On Dec 1, 2014, at 9:40 PM, Artem Polyakov wrote:

I think this might be related to the configuration problem I was fixing with
Jeff a few months ago. Refer here:
https://github.com/open-mpi/ompi/pull/240

2014-12-02 10:15 GMT+06:00 Ralph Castain :
If it isn’t too much trouble, it would be good to confirm that it remains
broken. I strongly suspect it is based on Moe’s comments.

On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet
wrote:

$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

now, if i relink auth_munge.so so it depends on libslurm:

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol:
slurm_auth_get_arg_desc

i can give a try to the latest slurm if needed

Cheers,

Gilles

On 2014/12/02 12:56, Ralph Castain wrote:

Out of curiosity - how are you testing these? I have more current versions of
Slurm and would like to test the observations there.

On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet
wrote:

I'd like to take a step back ...

i previously tested with slurm 2.6.0, and it complained about the slurm_verbose
symbol that is defined in libslurm.so
so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok

now i tested with slurm 2.6.6 and it complains about the

Re: [OMPI devel] RTLD_GLOBAL question

Looks like I was totally lying in 
http://www.open-mpi.org/community/lists/devel/2014/12/16381.php (where I said 
we should not use RTLD_GLOBAL).  We *do* use RTLD_GLOBAL:

https://github.com/open-mpi/ompi/blob/master/opal/mca/base/mca_base_component_repository.c#L124

This ltdl advice object is passed to lt_dlopen() for all components.  My 
mistake; sorry.

So the idea that using RTLD_GLOBAL will fix this SLURM bug is incorrect.

I believe someone said earlier in the thread that adding the right -llibs to 
the configure line will solve the issue, and that sounds correct to me.  If 
there's a missing symbol because the SLURM libraries are not automatically 
pulling in the right dependent libraries, then *if* we put a workaround in OMPI 
to fix this issue, then the right workaround is to add the relevant -llibs when 
that component is linked.

*If* you add that workaround (which is a whole separate discussion), I would
suggest adding a configure.m4 test to see if adding the additional -llibs is
necessary.  Perhaps AC_LINK_IFELSE looking for a symbol, and then if that
fails, AC_LINK_IFELSE again with the additional -llibs to see if that works.

Or something like that.
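
A rough sketch of the two-step probe described above, written in Python rather than m4 purely for illustration. It is not the Open MPI configure logic. The helper names (`link_succeeds`, `extra_libs_needed`) are hypothetical; the first one imitates what AC_LINK_IFELSE does by trying to link a tiny C program that references the symbol, and it assumes a `cc` compiler driver is on the PATH.

```python
import os
import subprocess
import tempfile


def link_succeeds(symbol, libs, cc="cc"):
    """Return True if a tiny C program that references `symbol` links against `libs`."""
    # Same trick autoconf uses: declare the symbol with a dummy prototype and call it.
    source = f"char {symbol}(void);\nint main(void) {{ return (int) {symbol}(); }}\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "conftest.c")
        exe = os.path.join(tmp, "conftest")
        with open(src, "w") as fh:
            fh.write(source)
        cmd = [cc, src, "-o", exe] + [f"-l{lib}" for lib in libs]
        try:
            return subprocess.run(cmd, capture_output=True).returncode == 0
        except OSError:
            # No usable compiler: treat it as a failed link.
            return False


def extra_libs_needed(symbol, base_libs, extra_libs, try_link=link_succeeds):
    """Two-step probe: link with base_libs first, then retry with extra_libs added.

    Returns [] if base_libs are enough, a copy of extra_libs if they are
    required, and None if the symbol cannot be resolved either way.
    """
    if try_link(symbol, base_libs):
        return []
    if try_link(symbol, base_libs + extra_libs):
        return list(extra_libs)
    return None
```

For example, `extra_libs_needed("PMI_Init", ["pmi"], ["slurm"])` would report whether `-lslurm` has to be added next to `-lpmi`; the symbol and library names are only illustrative. The `try_link` parameter exists so the decision logic can be exercised without a compiler.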



On Dec 2, 2014, at 6:38 AM, Artem Polyakov  wrote:

> Agree. First you should check is to what value OPAL_HAVE_LTDL_ADVISE is set. 
> If it is zero - very probably this is the same bug as mine.
> 
> 2014-12-02 17:33 GMT+06:00 Ralph Castain :
> It does look similar - question is: why didn’t this fix the problem? Will 
> have to investigate.
> 
> Thanks
> 
> 
>> On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
>> 
>> 
>> 
>> 2014-12-02 17:13 GMT+06:00 Ralph Castain :
>> Hmmm…if that is true, then it didn’t fix this problem as it is being 
>> reported in the master.
>> 
>> I had this problem on my laptop installation. You can check my report it was 
>> detailed enough and see if you hitting the same issue. My fix was also 
>> included into 1.8 branch. I am not sure that this is the same issue but they 
>> looks similar.
>>  
>> 
>> 
>>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>>> 
>>> I think this might be related to the configuration problem I was fixing 
>>> with Jeff few months ago. Refer here:
>>> https://github.com/open-mpi/ompi/pull/240
>>> 
>>> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>>> If it isn’t too much trouble, it would be good to confirm that it remains 
>>> broken. I strongly suspect it is based on Moe’s comments.
>>> 
>>> Obviously, other people are making this work. For Intel MPI, all you do is 
>>> point it at libpmi and they can run. However, they do explicitly dlopen it 
>>> in their code, and I don’t know what flags they might pass when they do so.
>>> 
>>> If necessary, I suppose we could follow that pattern. In other words, 
>>> rather than specifically linking the “s1” component to libpmi, instead 
>>> require that the user point us to a pmi library via an MCA param, then 
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues 
>>> cited by Jeff, but resolves the pmi linkage problem.
>>> 
>>> 
 On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
  wrote:
 
 $ srun --version
 slurm 2.6.6-VENDOR_PROVIDED
 
 $ srun --mpi=pmi2 -n 1 ~/hw
 I am 0 / 1
 
 $ srun -n 1 ~/hw
 /csc/home1/gouaillardet/hw: symbol lookup error: 
 /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
 srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
 srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
 received
 srun: error: soleil: task 0: Exited with exit code 127
 
 $ ldd /usr/lib64/slurm/auth_munge.so
 linux-vdso.so.1 =>  (0x7fff54478000)
 libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
 libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
 /lib64/ld-linux-x86-64.so.2 (0x003bf540)
 
 
 now, if i relink auth_munge.so so it depends on libslurm:
 
 $ srun -n 1 ~/hw
 srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined 
 symbol: slurm_auth_get_arg_desc
 
 
 i can give a try to the latest slurm if needed
 
 Cheers,
 
 Gilles
 
 
 On 2014/12/02 12:56, Ralph Castain wrote:
> Out of curiosity - how are you testing these? I have more current 
> versions of Slurm and would like to test the observations there.
> 
> 
>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>> 
>>  wrote:
>> 
>> I d like to make a step back ...
>> 
>> i previously tested with slurm 2.6.0, and it complained about the 
>> slurm_verbose symbol that is defined in libslurm.so
>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>> 
>> now i tested with slurm 2.6.6 and it complains about the 
>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>

Re: [OMPI devel] RTLD_GLOBAL question

Agree. First you should check what value OPAL_HAVE_LTDL_ADVISE is
set to. If it is zero, this is very probably the same bug as mine.
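
One way to check the value is to look for the macro in a generated configuration header. A minimal sketch, assuming the macro is written as a plain `#define` line and that an unset macro shows up as a commented-out `#undef`, as autoconf-generated headers usually do. The helper name `macro_value` is hypothetical, and which header to read in a given build tree is left to the caller.

```python
import re


def macro_value(header_text, name):
    """Return the value from a `#define name value` line, or None if there is none."""
    pattern = re.compile(
        r"^\s*#\s*define\s+" + re.escape(name) + r"\s+(\S+)",
        re.MULTILINE,
    )
    match = pattern.search(header_text)
    # A line such as "/* #undef NAME */" does not match, so it yields None.
    return match.group(1) if match else None
```

Reading the header with `open(path).read()` and calling `macro_value(text, "OPAL_HAVE_LTDL_ADVISE")` would then give "0", "1", or None if the macro is not defined at all.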

2014-12-02 17:33 GMT+06:00 Ralph Castain :

> It does look similar - question is: why didn’t this fix the problem? Will
> have to investigate.
>
> Thanks
>
>
> On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
>
>
>
> 2014-12-02 17:13 GMT+06:00 Ralph Castain :
>
>> Hmmm…if that is true, then it didn’t fix this problem as it is being
>> reported in the master.
>>
>
> I had this problem on my laptop installation. You can check my report it
> was detailed enough and see if you hitting the same issue. My fix was also
> included into 1.8 branch. I am not sure that this is the same issue but
> they looks similar.
>
>
>>
>>
>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>>
>> I think this might be related to the configuration problem I was fixing
>> with Jeff a few months ago. Refer here:
>> https://github.com/open-mpi/ompi/pull/240
>>
>> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>>
>>> If it isn’t too much trouble, it would be good to confirm that it
>>> remains broken. I strongly suspect it is based on Moe’s comments.
>>>
>>> Obviously, other people are making this work. For Intel MPI, all you do
>>> is point it at libpmi and they can run. However, they do explicitly dlopen
>>> it in their code, and I don’t know what flags they might pass when they do
>>> so.
>>>
>>> If necessary, I suppose we could follow that pattern. In other words,
>>> rather than specifically linking the “s1” component to libpmi, instead
>>> require that the user point us to a pmi library via an MCA param, then
>>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>>> cited by Jeff, but resolves the pmi linkage problem.
>>>
>>>
>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>>
>>> $ srun --version
>>> slurm 2.6.6-VENDOR_PROVIDED
>>>
>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>> I am 0 / 1
>>>
>>> $ srun -n 1 ~/hw
>>> /csc/home1/gouaillardet/hw: symbol lookup error:
>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted
>>> or received
>>> srun: error: soleil: task 0: Exited with exit code 127
>>>
>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>> linux-vdso.so.1 =>  (0x7fff54478000)
>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>>>
>>>
>>> now, if i relink auth_munge.so so it depends on libslurm :
>>>
>>> $ srun -n 1 ~/hw
>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>>> symbol: slurm_auth_get_arg_desc
>>>
>>>
>>> i can give a try to the latest slurm if needed
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 2014/12/02 12:56, Ralph Castain wrote:
>>>
>>> Out of curiosity - how are you testing these? I have more current versions 
>>> of Slurm and would like to test the observations there.
>>>
>>>
>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>>   wrote:
>>>
>>> I'd like to take a step back ...
>>>
>>> i previously tested with slurm 2.6.0, and it complained about the 
>>> slurm_verbose symbol that is defined in libslurm.so
>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>>
>>> now i tested with slurm 2.6.6 and it complains about the 
>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>> defined in any dynamic library. it is internally defined in the static 
>>> libcommon.a library, which is used to build the slurm binaries.
>>>
>>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>>> binary, which means it cannot be invoked from an mpi application
>>> even if it is linked with libslurm, libpmi, ...
>>>
>>> that looks like a slurm design issue that the slurm folks will take care of.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>
>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>> component as this is the only place that requires it, and it won’t hurt 
>>> anything to do so.
>>>
>>>
>>>
>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>>   
>>>   
>>> wrote:
>>>
>>> Jeff,
>>>
>>> FWIW, you can read my analysis of what is going wrong 
>>> at http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>

Re: [OMPI devel] RTLD_GLOBAL question

It does look similar - question is: why didn’t this fix the problem? Will have 
to investigate.

Thanks


> On Dec 2, 2014, at 3:17 AM, Artem Polyakov  wrote:
> 
> 
> 
> 2014-12-02 17:13 GMT+06:00 Ralph Castain  >:
> Hmmm…if that is true, then it didn’t fix this problem as it is being reported 
> in the master.
> 
> I had this problem on my laptop installation. You can check my report (it was 
> detailed enough) and see if you are hitting the same issue. My fix was also 
> included in the 1.8 branch. I am not sure that this is the same issue, but they 
> look similar.
>  
> 
> 
>> On Dec 1, 2014, at 9:40 PM, Artem Polyakov > > wrote:
>> 
>> I think this might be related to the configuration problem I was fixing with 
>> Jeff a few months ago. Refer here:
>> https://github.com/open-mpi/ompi/pull/240 
>> 
>> 
>> 2014-12-02 10:15 GMT+06:00 Ralph Castain > >:
>> If it isn’t too much trouble, it would be good to confirm that it remains 
>> broken. I strongly suspect it is based on Moe’s comments.
>> 
>> Obviously, other people are making this work. For Intel MPI, all you do is 
>> point it at libpmi and they can run. However, they do explicitly dlopen it 
>> in their code, and I don’t know what flags they might pass when they do so.
>> 
>> If necessary, I suppose we could follow that pattern. In other words, rather 
>> than specifically linking the “s1” component to libpmi, instead require that 
>> the user point us to a pmi library via an MCA param, then explicitly dlopen 
>> that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
>> resolves the pmi linkage problem.
>> 
>> 
>>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>>> mailto:gilles.gouaillar...@iferc.org>> 
>>> wrote:
>>> 
>>> $ srun --version
>>> slurm 2.6.6-VENDOR_PROVIDED
>>> 
>>> $ srun --mpi=pmi2 -n 1 ~/hw
>>> I am 0 / 1
>>> 
>>> $ srun -n 1 ~/hw
>>> /csc/home1/gouaillardet/hw: symbol lookup error: 
>>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
>>> received
>>> srun: error: soleil: task 0: Exited with exit code 127
>>> 
>>> $ ldd /usr/lib64/slurm/auth_munge.so
>>> linux-vdso.so.1 =>  (0x7fff54478000)
>>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>>> 
>>> 
>>> now, if i relink auth_munge.so so it depends on libslurm :
>>> 
>>> $ srun -n 1 ~/hw
>>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined 
>>> symbol: slurm_auth_get_arg_desc
>>> 
>>> 
>>> i can give a try to the latest slurm if needed
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> On 2014/12/02 12:56, Ralph Castain wrote:
 Out of curiosity - how are you testing these? I have more current versions 
 of Slurm and would like to test the observations there.
 
> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>   
> wrote:
> 
> I'd like to take a step back ...
> 
> i previously tested with slurm 2.6.0, and it complained about the 
> slurm_verbose symbol that is defined in libslurm.so
> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
> 
> now i tested with slurm 2.6.6 and it complains about the 
> slurm_auth_get_arg_desc symbol, and this symbol is not
> defined in any dynamic library. it is internally defined in the static 
> libcommon.a library, which is used to build the slurm binaries.
> 
> as far as i understand, auth_munge.so can only be invoked from a slurm 
> binary, which means it cannot be invoked from an mpi application
> even if it is linked with libslurm, libpmi, ...
> 
> that looks like a slurm design issue that the slurm folks will take care 
> of.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/12/02 12:33, Ralph Castain wrote:
>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>> component as this is the only place that requires it, and it won’t hurt 
>> anything to do so.
>> 
>> 
>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>>   
>>>  
>>>  wrote:
>>> 
>>> Jeff,
>>> 
>>> FWIW, you can read my analysis of what is going wrong at
>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>>>  
>>>  
>>>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 17:13 GMT+06:00 Ralph Castain :

> Hmmm…if that is true, then it didn’t fix this problem as it is being
> reported in the master.
>

I had this problem on my laptop installation. You can check my report (it
was detailed enough) and see if you are hitting the same issue. My fix was also
included in the 1.8 branch. I am not sure that this is the same issue, but
they look similar.


>
>
> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
>
> I think this might be related to the configuration problem I was fixing
> with Jeff a few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240
>
> 2014-12-02 10:15 GMT+06:00 Ralph Castain :
>
>> If it isn’t too much trouble, it would be good to confirm that it remains
>> broken. I strongly suspect it is based on Moe’s comments.
>>
>> Obviously, other people are making this work. For Intel MPI, all you do
>> is point it at libpmi and they can run. However, they do explicitly dlopen
>> it in their code, and I don’t know what flags they might pass when they do
>> so.
>>
>> If necessary, I suppose we could follow that pattern. In other words,
>> rather than specifically linking the “s1” component to libpmi, instead
>> require that the user point us to a pmi library via an MCA param, then
>> explicitly dlopen that library with RTLD_GLOBAL. This avoids the issues
>> cited by Jeff, but resolves the pmi linkage problem.
>>
>>
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>  $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error:
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or
>> received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 =>  (0x7fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>>
>>
>> now, if i relink auth_munge.so so it depends on libslurm :
>>
>> $ srun -n 1 ~/hw
>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
>> symbol: slurm_auth_get_arg_desc
>>
>>
>> i can give a try to the latest slurm if needed
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/12/02 12:56, Ralph Castain wrote:
>>
>> Out of curiosity - how are you testing these? I have more current versions 
>> of Slurm and would like to test the observations there.
>>
>>
>>  On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
>>   wrote:
>>
>> I'd like to take a step back ...
>>
>> i previously tested with slurm 2.6.0, and it complained about the 
>> slurm_verbose symbol that is defined in libslurm.so
>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>
>> now i tested with slurm 2.6.6 and it complains about the 
>> slurm_auth_get_arg_desc symbol, and this symbol is not
>> defined in any dynamic library. it is internally defined in the static 
>> libcommon.a library, which is used to build the slurm binaries.
>>
>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>> binary, which means it cannot be invoked from an mpi application
>> even if it is linked with libslurm, libpmi, ...
>>
>> that looks like a slurm design issue that the slurm folks will take care of.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:33, Ralph Castain wrote:
>>
>>  Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>> component as this is the only place that requires it, and it won’t hurt 
>> anything to do so.
>>
>>
>>
>>  On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>   
>>   wrote:
>>
>> Jeff,
>>
>> FWIW, you can read my analysis of what is going wrong 
>> at http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>
>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>> on libslurm, but they do not, yet)
>>
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>> that being said, the impact is quite limited (no direct launch in slurm
>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>> someone else

Re: [OMPI devel] RTLD_GLOBAL question

Hmmm…if that is true, then it didn’t fix this problem as it is being reported 
in the master.


> On Dec 1, 2014, at 9:40 PM, Artem Polyakov  wrote:
> 
> I think this might be related to the configuration problem I was fixing with 
> Jeff a few months ago. Refer here:
> https://github.com/open-mpi/ompi/pull/240 
> 
> 
> 2014-12-02 10:15 GMT+06:00 Ralph Castain  >:
> If it isn’t too much trouble, it would be good to confirm that it remains 
> broken. I strongly suspect it is based on Moe’s comments.
> 
> Obviously, other people are making this work. For Intel MPI, all you do is 
> point it at libpmi and they can run. However, they do explicitly dlopen it in 
> their code, and I don’t know what flags they might pass when they do so.
> 
> If necessary, I suppose we could follow that pattern. In other words, rather 
> than specifically linking the “s1” component to libpmi, instead require that 
> the user point us to a pmi library via an MCA param, then explicitly dlopen 
> that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
> resolves the pmi linkage problem.
> 
> 
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>> mailto:gilles.gouaillar...@iferc.org>> wrote:
>> 
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>> 
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>> 
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error: 
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
>> received
>> srun: error: soleil: task 0: Exited with exit code 127
>> 
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 =>  (0x7fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>> 
>> 
>> now, if i relink auth_munge.so so it depends on libslurm :
>> 
>> $ srun -n 1 ~/hw
>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
>> slurm_auth_get_arg_desc
>> 
>> 
>> i can give a try to the latest slurm if needed
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> On 2014/12/02 12:56, Ralph Castain wrote:
>>> Out of curiosity - how are you testing these? I have more current versions 
>>> of Slurm and would like to test the observations there.
>>> 
 On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
   
 wrote:
 
 I'd like to take a step back ...
 
 i previously tested with slurm 2.6.0, and it complained about the 
 slurm_verbose symbol that is defined in libslurm.so
 so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
 
 now i tested with slurm 2.6.6 and it complains about the 
 slurm_auth_get_arg_desc symbol, and this symbol is not
 defined in any dynamic library. it is internally defined in the static 
 libcommon.a library, which is used to build the slurm binaries.
 
 as far as i understand, auth_munge.so can only be invoked from a slurm 
 binary, which means it cannot be invoked from an mpi application
 even if it is linked with libslurm, libpmi, ...
 
 that looks like a slurm design issue that the slurm folks will take care 
 of.
 
 Cheers,
 
 Gilles
 
 On 2014/12/02 12:33, Ralph Castain wrote:
> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
> component as this is the only place that requires it, and it won’t hurt 
> anything to do so.
> 
> 
>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>   
>>  
>>  wrote:
>> 
>> Jeff,
>> 
>> FWIW, you can read my analysis of what is going wrong at
>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>>  
>>  
>>  
>>  
>>  
>>  
>> 
>> 
>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>> on libslurm, but they do not, yet)
>> 
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-02 Thread Gilles Gouaillardet

Ralph,

no problem :

I just tried slurm-14-11-11-1 and *both* pmi1 and pmi2 fail with the
same error message :

symbol lookup error: /opt/slurm-14-11.11.1/lib/slurm/auth_munge.so:
undefined symbol: slurm_debug

on the bright side, auth_munge.so has no slurm_auth_get_arg_desc
undefined symbol.
if i relink auth_munge.so so it depends on libslurm.so, this fixes
*both* pmi1 and pmi2

Cheers,

Gilles
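
The RTLD_GLOBAL idea discussed in this thread can be illustrated outside of Open MPI. A minimal sketch in Python, whose ctypes module wraps dlopen on POSIX systems; `load_global` is a hypothetical helper, and the library name "libslurm.so" is an assumption about the local installation. This is not how the pmix/s1 component is implemented, and it has not been tested against any of the slurm versions above:

```python
import ctypes


def load_global(path):
    """dlopen `path` with RTLD_GLOBAL; return the handle, or None on failure."""
    try:
        # RTLD_GLOBAL makes the library's symbols available for resolving
        # undefined symbols in objects that are loaded afterwards.
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return None


# "libslurm.so" is an assumed name; adjust it to the local installation.
handle = load_global("libslurm.so")
print("libslurm loaded globally" if handle else "libslurm could not be loaded")
```

Loading a library this way can only help with symbols that some shared library actually exports. It would not help with a symbol such as slurm_auth_get_arg_desc in slurm 2.6.6, which was reported earlier in the thread as defined only in the static libcommon.a.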

On 2014/12/02 13:15, Ralph Castain wrote:
> If it isn't too much trouble, it would be good to confirm that it remains 
> broken. I strongly suspect it is based on Moe's comments.
>
> Obviously, other people are making this work. For Intel MPI, all you do is 
> point it at libpmi and they can run. However, they do explicitly dlopen it in 
> their code, and I don't know what flags they might pass when they do so.
>
> If necessary, I suppose we could follow that pattern. In other words, rather 
> than specifically linking the "s1" component to libpmi, instead require that 
> the user point us to a pmi library via an MCA param, then explicitly dlopen 
> that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
> resolves the pmi linkage problem.
>
>
>> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet 
>>  wrote:
>>
>> $ srun --version
>> slurm 2.6.6-VENDOR_PROVIDED
>>
>> $ srun --mpi=pmi2 -n 1 ~/hw
>> I am 0 / 1
>>
>> $ srun -n 1 ~/hw
>> /csc/home1/gouaillardet/hw: symbol lookup error: 
>> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
>> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
>> received
>> srun: error: soleil: task 0: Exited with exit code 127
>>
>> $ ldd /usr/lib64/slurm/auth_munge.so
>> linux-vdso.so.1 =>  (0x7fff54478000)
>> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
>> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
>> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
>>
>>
>> now, if i relink auth_munge.so so it depends on libslurm :
>>
>> $ srun -n 1 ~/hw
>> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
>> slurm_auth_get_arg_desc
>>
>>
>> i can give a try to the latest slurm if needed
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/12/02 12:56, Ralph Castain wrote:
>>> Out of curiosity - how are you testing these? I have more current versions 
>>> of Slurm and would like to test the observations there.
>>>
 On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet 
   
 wrote:

 I'd like to take a step back ...

 i previously tested with slurm 2.6.0, and it complained about the 
 slurm_verbose symbol that is defined in libslurm.so
 so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok

 now i tested with slurm 2.6.6 and it complains about the 
 slurm_auth_get_arg_desc symbol, and this symbol is not
 defined in any dynamic library. it is internally defined in the static 
 libcommon.a library, which is used to build the slurm binaries.

 as far as i understand, auth_munge.so can only be invoked from a slurm 
 binary, which means it cannot be invoked from an mpi application
 even if it is linked with libslurm, libpmi, ...

 that looks like a slurm design issue that the slurm folks will take care 
 of.

 Cheers,

 Gilles

 On 2014/12/02 12:33, Ralph Castain wrote:
> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
> component as this is the only place that requires it, and it won't hurt 
> anything to do so.
>
>
>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet 
>>   
>>  
>>  wrote:
>>
>> Jeff,
>>
>> FWIW, you can read my analysis of what is going wrong at
>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>>  
>>  
>>  
>>  
>>  
>>  
>> 
>>
>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>> on libslurm, but they do not, yet)
>>
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>> that being said, the impact is quite limited (no direct lau

Re: [OMPI devel] RTLD_GLOBAL question