Re: [OMPI devel] Setting up debug environment on Eclipse PTP

2014-12-01 Thread Alvyn Liang
Hi Ralph,

Yes, Eclipse is currently being actively developed. To my understanding
https://www.eclipse.org/ptp/
is also active. I did drop a question on Eclipse forum, but I got no
response. https://www.eclipse.org/forums/index.php/t/869298/

I am still looking for answers. Hopefully I will find one myself, but
I still need hints from experienced users. I am looking at a page which says
that if the source code already exists, I should import it as described here:

http://help.eclipse.org/juno/index.jsp?topic=%2Forg.eclipse.cdt.doc.user%2Fgetting_started%2Fcdt_w_import.htm

I am confused by things that might be trivial for you. For instance,
should I set up an MPI project or a normal C project? I have
this confusion because the ompi project is itself a C project. If I set up an
MPI project, Eclipse will include its own tool chain. I am not sure whether
something odd would then prevent the debugging setup from working.

I will keep digging myself, but I appreciate any help. It would be good if
any of you could give a short guide on how to set up a debug environment,
even in pure terminal mode. Any link or documentation is also welcome.


List-Post: devel@lists.open-mpi.org
Date: Fri, 28 Nov 2014 08:08:37 -0800
> From: Ralph Castain 
> To: Open MPI Developers 
> Subject: Re: [OMPI devel] Setting up debug environment on Eclipse PTP
>
> I'm not sure we have any developers using PTP - have you tried asking this
> question on the PTP mailing list, assuming that project still exists?
>
>
> > On Nov 27, 2014, at 7:38 PM, Alvyn Liang  wrote:
> >
> > Dear all,
> >
> > I am trying to learn how Open MPI works. Following many instructions on
> > the Web, I tried to set up MPI Hello projects on Eclipse PTP. I am wondering
> > if there is any standard procedure for setting up such an environment.
> >
> > I did try a few combinations, but I am still stuck at the point where
> > sometimes:
> > 1. Little bug symbols show on the left panel (next to the line
> > numbers) while debugging, with messages like "Symbol 'ompi_mpi_finalized'
> > could not be resolved". I think this is due to environment variables or
> > paths not being set correctly, but I don't know what I have missed.
> > 2. I cannot toggle breakpoints, or toggled breakpoints are set on a
> > relative file path. As a result the threads do not stop at the breakpoints.
> >
> > My environment is CentOS 6.6 running on a machine with 32GB memory and an
> > Intel i7-3770. Since I am still experimenting with local debugging, I am
> > debugging on Generic Open MPI Interactive with connection type local or
> > remote to 127.0.0.1, and with only a few processes. Detailed Eclipse
> > installation configuration attached.
> >
> > My Open MPI is configured as
> > ../configure --enable-debug --enable-event-debug --enable-mem-debug --enable-mem-profile
> > The compiler is GNU C compiler.
> >
> > This gives a lot of information in the console while debugging, but it is
> > not very useful. I am not sure if I should run 'make install' to install
> > Open MPI to /usr, or simply set the Open MPI source tree as part of the
> > project, or both. Open MPI has an examples folder, but I don't know how to
> > use that code directly as my source code. For now I can step into the source
> > code of Open MPI, but sometimes I cannot toggle breakpoints. Attached is my
> > current debug configuration.
> >
> > Good day,
> >
> > Alvyn
> >
>


Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
If it isn’t too much trouble, it would be good to confirm that it remains
broken. I strongly suspect it is, based on Moe’s comments.

Obviously, other people are making this work. For Intel MPI, all you do is 
point it at libpmi and they can run. However, they do explicitly dlopen it in 
their code, and I don’t know what flags they might pass when they do so.

If necessary, I suppose we could follow that pattern. In other words, rather 
than specifically linking the “s1” component to libpmi, instead require that 
the user point us to a pmi library via an MCA param, then explicitly dlopen 
that library with RTLD_GLOBAL. This avoids the issues cited by Jeff, but 
resolves the pmi linkage problem.
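
As a rough illustration of the pattern described above (this is not Open MPI code), the sketch below loads a user-supplied library path with RTLD_GLOBAL through Python's ctypes, which calls dlopen underneath. The helper name load_global and the example path are hypothetical; in Open MPI the path would come from an MCA parameter.

```python
import ctypes


def load_global(path):
    """Try to dlopen `path` with RTLD_GLOBAL; return the handle or None.

    `load_global` is a hypothetical helper, not an Open MPI or Slurm API.
    """
    try:
        # mode=RTLD_GLOBAL makes the library's symbols visible to
        # libraries that are loaded afterwards.
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        # dlopen failed (file missing, wrong architecture, ...).
        return None


if __name__ == "__main__":
    # Hypothetical location; a real site would supply its own path.
    handle = load_global("/usr/lib64/libpmi.so")
    print("loaded" if handle is not None else "could not load libpmi")
```

Returning None instead of raising keeps the caller free to fall back to another PMI flavor or to print a friendlier error.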


> On Dec 1, 2014, at 8:09 PM, Gilles Gouaillardet wrote:
> 
> $ srun --version
> slurm 2.6.6-VENDOR_PROVIDED
> 
> $ srun --mpi=pmi2 -n 1 ~/hw
> I am 0 / 1
> 
> $ srun -n 1 ~/hw
> /csc/home1/gouaillardet/hw: symbol lookup error: 
> /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
> srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
> srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted or 
> received
> srun: error: soleil: task 0: Exited with exit code 127
> 
> $ ldd /usr/lib64/slurm/auth_munge.so
> linux-vdso.so.1 =>  (0x7fff54478000)
> libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
> libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
> /lib64/ld-linux-x86-64.so.2 (0x003bf540)
> 
> 
> now, if i relink auth_munge.so so it depends on libslurm :
> 
> $ srun -n 1 ~/hw
> srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: 
> slurm_auth_get_arg_desc
> 
> 
> i can give a try to the latest slurm if needed
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/12/02 12:56, Ralph Castain wrote:
>> Out of curiosity - how are you testing these? I have more current versions 
>> of Slurm and would like to test the observations there.
>> 
>>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet wrote:
>>> 
>>> I'd like to take a step back ...
>>> 
>>> i previously tested with slurm 2.6.0, and it complained about the 
>>> slurm_verbose symbol that is defined in libslurm.so
>>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>> 
>>> now i tested with slurm 2.6.6 and it complains about the 
>>> slurm_auth_get_arg_desc symbol, and this symbol is not
>>> defined in any dynamic library. it is internally defined in the static 
>>> libcommon.a library, which is used to build the slurm binaries.
>>> 
>>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>>> binary, which means it cannot be invoked from an mpi application
>>> even if it is linked with libslurm, libpmi, ...
>>> 
>>> that looks like a slurm design issue that the slurm folks will take care of.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/02 12:33, Ralph Castain wrote:
>>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1
>>>> component as this is the only place that requires it, and it won’t hurt
>>>> anything to do so.
>>>>
>>>>
>>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>>>>
>>>>> Jeff,
>>>>>
>>>>> FWIW, you can read my analysis of what is going wrong at
>>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>>
>>>>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>>>>> on libslurm, but they do not, yet)
>>>>>
>>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>>> someone else's problem.
>>>>> and that being said, configure could detect this broken pmi1 and not
>>>>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>>>>
>>>>> any thoughts ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>>> Ok, if the problem is moot, great.

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
$ srun --version
slurm 2.6.6-VENDOR_PROVIDED

$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1

$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error:
/usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: slurm_receive_msg[10.0.3.15]: Zero Bytes were transmitted
or received
srun: error: soleil: task 0: Exited with exit code 127

$ ldd /usr/lib64/slurm/auth_munge.so
linux-vdso.so.1 =>  (0x7fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x003bf540)
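
The point of the ldd listing above is that auth_munge.so has no dependency on libslurm. A minimal sketch of checking that mechanically, assuming the ldd output has been captured as text (the helper name ldd_dependencies is made up for this example):

```python
def ldd_dependencies(ldd_output):
    """Return the library names listed in `ldd` output, one per line."""
    names = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line starts with the library name (or its absolute path).
        first = line.split()[0]
        names.append(first.rsplit("/", 1)[-1])
    return names


# The ldd output quoted in the message above.
SAMPLE = """\
linux-vdso.so.1 =>  (0x7fff54478000)
libmunge.so.2 => /usr/lib64/libmunge.so.2 (0x7f744760f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f74473f1000)
libc.so.6 => /lib64/libc.so.6 (0x7f744705d000)
/lib64/ld-linux-x86-64.so.2 (0x003bf540)
"""

deps = ldd_dependencies(SAMPLE)
has_libslurm = any(name.startswith("libslurm") for name in deps)
print(has_libslurm)  # False: the plugin does not list libslurm
```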


now, if i relink auth_munge.so so it depends on libslurm :

$ srun -n 1 ~/hw
srun: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined
symbol: slurm_auth_get_arg_desc


i can give a try to the latest slurm if needed

Cheers,

Gilles


On 2014/12/02 12:56, Ralph Castain wrote:
> Out of curiosity - how are you testing these? I have more current versions of 
> Slurm and would like to test the observations there.
>
>> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet wrote:
>>
>> I'd like to take a step back ...
>>
>> i previously tested with slurm 2.6.0, and it complained about the 
>> slurm_verbose symbol that is defined in libslurm.so
>> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
>>
>> now i tested with slurm 2.6.6 and it complains about the 
>> slurm_auth_get_arg_desc symbol, and this symbol is not
>> defined in any dynamic library. it is internally defined in the static 
>> libcommon.a library, which is used to build the slurm binaries.
>>
>> as far as i understand, auth_munge.so can only be invoked from a slurm 
>> binary, which means it cannot be invoked from an mpi application
>> even if it is linked with libslurm, libpmi, ...
>>
>> that looks like a slurm design issue that the slurm folks will take care of.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 12:33, Ralph Castain wrote:
>>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>>> component as this is the only place that requires it, and it won't hurt 
>>> anything to do so.
>>>
>>>
>>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>>>
>>>> Jeff,
>>>>
>>>> FWIW, you can read my analysis of what is going wrong at
>>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>>>
>>>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>>>> on libslurm, but they do not, yet)
>>>>
>>>> a possible workaround would be to make the pmi component a "proxy" that
>>>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>>>> that being said, the impact is quite limited (no direct launch in slurm
>>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>>> someone else's problem.
>>>> and that being said, configure could detect this broken pmi1 and not
>>>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>>>
>>>> any thoughts ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>>> Ok, if the problem is moot, great.
>>>>>
>>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>>
>>>>>
>>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain wrote:
>>>>>
>>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. This
>>>>>> library is missing the linkage to libslurm that contains the linkage to
>>>>>> libauth where munge resides. So when we call a PMI function, libpmi
>>>>>> references a call to munge for authentication and hits an "unresolved
>>>>>> symbol" error.
>>>>>>
>>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so
>>>>>> this problem goes away
>>>>>>
>>>>>>
>>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>
>>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> FWIW: It's Slurm's pmi-1 library that isn't linked correctly against
>>>>>>>> its dependencies (the pmi-2 one is correct).  Moe is aware of the
>>>>>>>> problem and fixing it on their side. This won't help existing
>>>>>>>> installations until they upgrade, but I tend to agree with Jeff about
>>>>>>>> not fixing other people's problems.
>>>>>>> Can you explain what is happening?
>>>>>>>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Out of curiosity - how are you testing these? I have more current versions of 
Slurm and would like to test the observations there.

> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet wrote:
> 
> I'd like to take a step back ...
> 
> i previously tested with slurm 2.6.0, and it complained about the 
> slurm_verbose symbol that is defined in libslurm.so
> so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
> 
> now i tested with slurm 2.6.6 and it complains about the 
> slurm_auth_get_arg_desc symbol, and this symbol is not
> defined in any dynamic library. it is internally defined in the static 
> libcommon.a library, which is used to build the slurm binaries.
> 
> as far as i understand, auth_munge.so can only be invoked from a slurm 
> binary, which means it cannot be invoked from an mpi application
> even if it is linked with libslurm, libpmi, ...
> 
> that looks like a slurm design issue that the slurm folks will take care of.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/12/02 12:33, Ralph Castain wrote:
>> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
>> component as this is the only place that requires it, and it won’t hurt 
>> anything to do so.
>> 
>> 
>>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>> 
>>> Jeff,
>>> 
>>> FWIW, you can read my analysis of what is going wrong at
>>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
>>> 
>>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>>> on libslurm, but they do not, yet)
>>> 
>>> a possible workaround would be to make the pmi component a "proxy" that
>>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>>> that being said, the impact is quite limited (no direct launch in slurm
>>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>>> someone else's problem.
>>> and that being said, configure could detect this broken pmi1 and not
>>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>> 
>>> any thoughts ?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>>> Ok, if the problem is moot, great.
>>>>
>>>> (sidenote: this is moot, so ignore this if you want: with this
>>>> explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>>
>>>>
>>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain wrote:
>>>>
>>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. This
>>>>> library is missing the linkage to libslurm that contains the linkage to
>>>>> libauth where munge resides. So when we call a PMI function, libpmi
>>>>> references a call to munge for authentication and hits an “unresolved
>>>>> symbol” error.
>>>>>
>>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so this
>>>>> problem goes away
>>>>>
>>>>>
>>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>
>>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against
>>>>>>> its dependencies (the pmi-2 one is correct).  Moe is aware of the
>>>>>>> problem and fixing it on their side. This won’t help existing
>>>>>>> installations until they upgrade, but I tend to agree with Jeff about
>>>>>>> not fixing other people’s problems.
>>>>>> Can you explain what is happening?
>>>>>>
>>>>>> I ask because I'm not sure I understand the problem such that using
>>>>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against
>>>>>> its dependencies properly, that shouldn't cause a problem if OMPI
>>>>>> components A and B are both linked against libpmi1.so, and then A is
>>>>>> loaded, and then B is loaded.
>>>>>>
>>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>>
>>>>>> -- 
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>
>>>>>>
>>>>>> ___
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16383.php
>>>>>>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
I'd like to take a step back ...

i previously tested with slurm 2.6.0, and it complained about the
slurm_verbose symbol that is defined in libslurm.so
so with slurm 2.6.0, RTLD_GLOBAL or relinking is ok
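
To make the slurm 2.6.0 observation concrete, here is a deliberately simplified toy model of symbol visibility. It is not how the real dynamic loader is implemented, and the Library and Loader classes, the name component.so, and the assumption that libslurm is pulled in as a dependency of the component are all illustrative. A plugin that does not list libslurm as a dependency can only find slurm_verbose if libslurm was placed in the global scope.

```python
class Library:
    """Toy stand-in for a shared object: a name, its symbols, its dependencies."""

    def __init__(self, name, defines=(), needed=()):
        self.name = name
        self.defines = set(defines)
        self.needed = list(needed)

    def closure(self):
        """This library followed by everything it (transitively) depends on."""
        seen = [self]
        for dep in self.needed:
            for lib in dep.closure():
                if lib not in seen:
                    seen.append(lib)
        return seen


class Loader:
    """Toy loader that only tracks which libraries are globally visible."""

    def __init__(self):
        self.global_scope = []

    def dlopen(self, lib, make_global=False):
        # Loading "globally" exposes the library and its dependencies to all.
        if make_global:
            for item in lib.closure():
                if item not in self.global_scope:
                    self.global_scope.append(item)
        return lib

    def resolve(self, lib, symbol):
        # Search the global scope first, then the library's own dependencies.
        for candidate in self.global_scope + lib.closure():
            if symbol in candidate.defines:
                return candidate.name
        return None


def lookup_from_plugin(make_global):
    libslurm = Library("libslurm.so", defines={"slurm_verbose"})
    # Assumption: the component drags libslurm in as a dependency.
    component = Library("component.so", needed=[libslurm])
    # The plugin itself lists no dependency on libslurm.
    plugin = Library("auth_munge.so")
    loader = Loader()
    loader.dlopen(component, make_global=make_global)
    loader.dlopen(plugin)
    return loader.resolve(plugin, "slurm_verbose")


print(lookup_from_plugin(False))  # None: symbol not visible to the plugin
print(lookup_from_plugin(True))   # libslurm.so
```

The model says nothing about the slurm 2.6.6 case, where the missing symbol is not exported by any shared library at all.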

now i tested with slurm 2.6.6 and it complains about the
slurm_auth_get_arg_desc symbol, and this symbol is not
defined in any dynamic library. it is internally defined in the static
libcommon.a library, which is used to build the slurm binaries.

as far as i understand, auth_munge.so can only be invoked from a slurm
binary, which means it cannot be invoked from an mpi application
even if it is linked with libslurm, libpmi, ...

that looks like a slurm design issue that the slurm folks will take care of.

Cheers,

Gilles

On 2014/12/02 12:33, Ralph Castain wrote:
> Another option is to simply add the -lslurm -lauth flags to the pmix/s1 
> component as this is the only place that requires it, and it won't hurt 
> anything to do so.
>
>
>> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>>
>> Jeff,
>>
>> FWIW, you can read my analysis of what is going wrong at
>> http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php 
>> 
>>
>> bottom line, i agree this is a slurm issue (slurm plugin should depend
>> on libslurm, but they do not, yet)
>>
>> a possible workaround would be to make the pmi component a "proxy" that
>> dlopen with RTLD_GLOBAL the "real" component in which the job is done.
>> that being said, the impact is quite limited (no direct launch in slurm
>> with pmi1, but pmi2 works fine) so it makes sense not to work around
>> someone else's problem.
>> and that being said, configure could detect this broken pmi1 and not
>> build pmi1 support or print a user friendly error message if pmi1 is used.
>>
>> any thoughts ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
>>> Ok, if the problem is moot, great.
>>>
>>> (sidenote: this is moot, so ignore this if you want: with this explanation, 
>>> I'm still not sure how RTLD_GLOBAL fixes the issue)
>>>
>>>
>>> On Dec 1, 2014, at 5:15 PM, Ralph Castain  wrote:
>>>
>>>> Easy enough to explain. We link libpmi into the pmix/s1 component. This
>>>> library is missing the linkage to libslurm that contains the linkage to
>>>> libauth where munge resides. So when we call a PMI function, libpmi
>>>> references a call to munge for authentication and hits an "unresolved
>>>> symbol" error.
>>>>
>>>> Moe acknowledges the error is in Slurm and is fixing the linkages so this
>>>> problem goes away
>>>>
>>>>
>>>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres) wrote:
>>>>>
>>>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain wrote:
>>>>>
>>>>>> FWIW: It's Slurm's pmi-1 library that isn't linked correctly against its
>>>>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem
>>>>>> and fixing it on their side. This won't help existing installations
>>>>>> until they upgrade, but I tend to agree with Jeff about not fixing other
>>>>>> people's problems.
>>>>> Can you explain what is happening?
>>>>>
>>>>> I ask because I'm not sure I understand the problem such that using
>>>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against
>>>>> its dependencies properly, that shouldn't cause a problem if OMPI
>>>>> components A and B are both linked against libpmi1.so, and then A is
>>>>> loaded, and then B is loaded.
>>>>>
>>>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>>>
>>>>> -- 
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
Jeff,

FWIW, you can read my analysis of what is going wrong at
http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php

bottom line, i agree this is a slurm issue (slurm plugin should depend
on libslurm, but they do not, yet)

a possible workaround would be to make the pmi component a "proxy" that
dlopen with RTLD_GLOBAL the "real" component in which the job is done.
that being said, the impact is quite limited (no direct launch in slurm
with pmi1, but pmi2 works fine) so it makes sense not to work around
someone else's problem.
and that being said, configure could detect this broken pmi1 and not
build pmi1 support or print a user friendly error message if pmi1 is used.
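
One way such a detection could look, purely as a sketch: inspect the NEEDED entries that `readelf -d` reports for libpmi and flag the library when libslurm is not among them. This is only a heuristic based on the diagnosis in this thread, not what Open MPI's configure does, and the sample text and function names below are invented for illustration.

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[(?P<name>[^\]]+)\]")


def needed_libraries(readelf_output):
    """Extract the NEEDED library names from `readelf -d` output."""
    return [m.group("name") for m in NEEDED_RE.finditer(readelf_output)]


def pmi1_looks_broken(readelf_output):
    """Heuristic: libpmi is suspect if it does not list libslurm as NEEDED."""
    names = needed_libraries(readelf_output)
    return not any(name.startswith("libslurm") for name in names)


# Made-up sample, not taken from a real Slurm installation.
SAMPLE = """\
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libpmi.so.0]
"""

print(pmi1_looks_broken(SAMPLE))
```

A check like this only covers the missing-dependency case; it would not notice a plugin that needs symbols which exist only in the Slurm binaries.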

any thoughts ?

Cheers,

Gilles

On 2014/12/02 7:47, Jeff Squyres (jsquyres) wrote:
> Ok, if the problem is moot, great.
>
> (sidenote: this is moot, so ignore this if you want: with this explanation, 
> I'm still not sure how RTLD_GLOBAL fixes the issue)
>
>
> On Dec 1, 2014, at 5:15 PM, Ralph Castain  wrote:
>
>> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
>> library is missing the linkage to libslurm that contains the linkage to 
>> libauth where munge resides. So when we call a PMI function, libpmi 
>> references a call to munge for authentication and hits an “unresolved 
>> symbol” error.
>>
>> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
>> problem goes away
>>
>>
>>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
>>> wrote:
>>>
>>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
>>>
>>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its
>>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and
>>>> fixing it on their side. This won’t help existing installations until they
>>>> upgrade, but I tend to agree with Jeff about not fixing other people’s
>>>> problems.
>>> Can you explain what is happening?
>>>
>>> I ask because I'm not sure I understand the problem such that using 
>>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against 
>>> its dependencies properly, that shouldn't cause a problem if OMPI 
>>> components A and B are both linked against libpmi1.so, and then A is 
>>> loaded, and then B is loaded.
>>>
>>> ...or perhaps we can just discuss this on the call tomorrow?
>>>
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
Ok, if the problem is moot, great.

(sidenote: this is moot, so ignore this if you want: with this explanation, I'm 
still not sure how RTLD_GLOBAL fixes the issue)


On Dec 1, 2014, at 5:15 PM, Ralph Castain  wrote:

> Easy enough to explain. We link libpmi into the pmix/s1 component. This 
> library is missing the linkage to libslurm that contains the linkage to 
> libauth where munge resides. So when we call a PMI function, libpmi 
> references a call to munge for authentication and hits an “unresolved symbol” 
> error.
> 
> Moe acknowledges the error is in Slurm and is fixing the linkages so this 
> problem goes away
> 
> 
>> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
>> 
>>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
>>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
>>> fixing it on their side. This won’t help existing installations until they 
>>> upgrade, but I tend to agree with Jeff about not fixing other people’s 
>>> problems.
>> 
>> Can you explain what is happening?
>> 
>> I ask because I'm not sure I understand the problem such that using 
>> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against its 
>> dependencies properly, that shouldn't cause a problem if OMPI components A 
>> and B are both linked against libpmi1.so, and then A is loaded, and then B 
>> is loaded.
>> 
>> ...or perhaps we can just discuss this on the call tomorrow?
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Easy enough to explain. We link libpmi into the pmix/s1 component. This library 
is missing the linkage to libslurm that contains the linkage to libauth where 
munge resides. So when we call a PMI function, libpmi references a call to 
munge for authentication and hits an “unresolved symbol” error.

Moe acknowledges the error is in Slurm and is fixing the linkages so this 
problem goes away


> On Dec 1, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:
> 
>> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
>> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
>> fixing it on their side. This won’t help existing installations until they 
>> upgrade, but I tend to agree with Jeff about not fixing other people’s 
>> problems.
> 
> Can you explain what is happening?
> 
> I ask because I'm not sure I understand the problem such that using 
> RTLD_GLOBAL would fix it.  I.e., even if libpmi1.so isn't linked against its 
> dependencies properly, that shouldn't cause a problem if OMPI components A 
> and B are both linked against libpmi1.so, and then A is loaded, and then B is 
> loaded.
> 
> ...or perhaps we can just discuss this on the call tomorrow?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
On Dec 1, 2014, at 5:07 PM, Ralph Castain  wrote:

> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
> dependencies (the pmi-2 one is correct).  Moe is aware of the problem and 
> fixing it on their side. This won’t help existing installations until they 
> upgrade, but I tend to agree with Jeff about not fixing other people’s 
> problems.

Can you explain what is happening?

I ask because I'm not sure I understand the problem such that using RTLD_GLOBAL 
would fix it.  I.e., even if libpmi1.so isn't linked against its dependencies 
properly, that shouldn't cause a problem if OMPI components A and B are both 
linked against libpmi1.so, and then A is loaded, and then B is loaded.

...or perhaps we can just discuss this on the call tomorrow?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its 
dependencies (the pmi-2 one is correct). Moe is aware of the problem and fixing 
it on their side. This won’t help existing installations until they upgrade, 
but I tend to agree with Jeff about not fixing other people’s problems.


> On Dec 1, 2014, at 1:55 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Dec 1, 2014, at 4:07 PM, Howard Pritchard  wrote:
> 
>> There has been some discussion of end case situations with use of dlopen
>> in the ompi mca framework that can lead to unresolved symbols when
> >> subsequent shared libraries are dlopen'd that might need symbols from
>> a library that had been opened previously.  Yes these libraries should be
>> doing something like a second dlopen of the lib they are dependent on,
>> but that's a different story involving other software projects outside of
>> ompi.
> 
> Those other projects should be fixed.  OMPI should not be the compromise 
> location where we compensate for other projects that do not obey proper 
> linking semantics.
> 
> Can you cite some specific examples?
> 
>> The default with the mca framework dlopen'ing of component libraries
>> is not to use RTLD_GLOBAL, and there does not currently appear to be a way
>> to change this behavior at runtime.
>> 
>> Is there a reason for avoiding use of RTLD_GLOBAL in libltdl's use of dlopen?
> 
> Yes.
> 
> There's at least two reasons that I can think of off the top of my head:
> 
> 1. It's the Right Thing to do.  I.e., we shouldn't pollute the general 
> namespace with symbols from dependent libraries.
> 
> 2. We've had specific user requests to not pollute the general namespace.  
> One specific case was because we use an embedded copy of libevent, and 
> another MPI-based program also uses libevent.  If we didn't keep libevent in 
> a private namespace, Bad Things (i.e., symbol clashes) would occur.
> 
>> Would it be okay to add RTLD_GLOBAL to the default module_flags used
>> in the vm_open - modulo detection of the definition of RTLD_GLOBAL at
>> compile time.
> 
> No.
> 
>> Perhaps adding a way with an env. or config option to not
>> enable RTLD_GLOBAL by default?
> 
> This just seems like a bad path to go down.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
On Dec 1, 2014, at 4:07 PM, Howard Pritchard  wrote:

> There has been some discussion of end case situations with use of dlopen
> in the ompi mca framework that can lead to unresolved symbols when
> subsequent shared libraries are dlopen'd that might need symbols from
> a library that had been opened previously.  Yes these libraries should be
> doing something like a second dlopen of the lib they are dependent on,
> but that's a different story involving other software projects outside of
> ompi.

Those other projects should be fixed.  OMPI should not be the compromise 
location where we compensate for other projects that do not obey proper linking 
semantics.

Can you cite some specific examples?

> The default with the mca framework dlopen'ing of component libraries
> is not to use RTLD_GLOBAL, and there does not currently appear to be a way
> to change this behavior at runtime.
> 
> Is there a reason for avoiding use of RTLD_GLOBAL in libltdl's use of dlopen?

Yes.

There are at least two reasons that I can think of off the top of my head:

1. It's the Right Thing to do.  I.e., we shouldn't pollute the general 
namespace with symbols from dependent libraries.

2. We've had specific user requests to not pollute the general namespace.  One 
specific case was because we use an embedded copy of libevent, and another 
MPI-based program also uses libevent.  If we didn't keep libevent in a private 
namespace, Bad Things (i.e., symbol clashes) would occur.

> Would it be okay to add RTLD_GLOBAL to the default module_flags used
> in the vm_open - modulo detection of the definition of RTLD_GLOBAL at
> compile time.

No.

>  Perhaps adding a way with an env. or config option to not
> enable RTLD_GLOBAL by default?

This just seems like a bad path to go down.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
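
For readers following the RTLD_GLOBAL discussion above, here is a minimal sketch of the "private by default, global only on request" choice of dlopen flags. It is a Python analogue, not the libltdl C code: module_flags is a hypothetical helper named after the field mentioned in the thread, and the hasattr() test merely stands in for the compile-time detection of RTLD_GLOBAL that was proposed.

```python
import os


def module_flags(want_global=False):
    """Pick dlopen-style flags for loading a component (hypothetical helper)."""
    # Lazy binding as a baseline; fall back to 0 where the constant is missing.
    flags = getattr(os, "RTLD_LAZY", 0)
    if want_global and hasattr(os, "RTLD_GLOBAL"):
        # Opt-in only: exports the component's symbols to libraries loaded later.
        flags |= os.RTLD_GLOBAL
    else:
        # Default: keep the component's symbols in a private namespace.
        flags |= getattr(os, "RTLD_LOCAL", 0)
    return flags


if __name__ == "__main__":
    print("private:", module_flags(), "global:", module_flags(want_global=True))
```

The returned value could be passed as the mode argument of ctypes.CDLL. With the private default, two copies of a library such as libevent can coexist in one process without their symbols clashing, which is the case described in point 2 above.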



[OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Howard Pritchard
Hi ompi developers,

If you always configure ompi with --disable-dlopen you can delete this
message now.

There has been some discussion of end case situations with use of dlopen
in the ompi mca framework that can lead to unresolved symbols when
subsequent shared libraries are dlopen'd that might need symbols from
a library that had been opened previously.  Yes these libraries should be
doing something like a second dlopen of the lib they are dependent on,
but that's a different story involving other software projects outside of
ompi.

The default with the mca framework dlopen'ing of component libraries
is not to use RTLD_GLOBAL, and there does not currently appear to be a way
to change this behavior at runtime.

Is there a reason for avoiding use of RTLD_GLOBAL in libltdl's use of
dlopen?
Would it be okay to add RTLD_GLOBAL to the default module_flags used
in the vm_open - modulo detection of the definition of RTLD_GLOBAL at
compile time.  Perhaps adding a way with an env. or config option to not
enable RTLD_GLOBAL by default?

Thanks,

Howard


Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-12-01 Thread Ralph Castain
Looks like this should be fixed in my PR #101 - could you please review it?

Thanks
Ralph


> On Nov 26, 2014, at 8:14 PM, Ralph Castain  wrote:
> 
> Aha - I see what happened. I have that param set to false in my default mca 
> param file. If I set it to true on the cmd line, then I run without 
> segfaulting.
> 
> Thanks!
> Ralph
> 
> 
>> On Nov 26, 2014, at 5:55 PM, Gilles Gouaillardet wrote:
>> 
>> Ralph,
>> 
>> let me correct and enhance my previous statement :
>> 
>> - i cannot reproduce your crash in my environment (RHEL6 like vs your RHEL7 
>> like)
>> (i configured with --enable-debug --enable-picky)
>> 
>> - i can reproduce the crash with
>> mpirun --mca mpi_param_check false
>> 
>> - if you configured with --without-mpi-param-check, i assume you would get 
>> the same crash
>> (and if i understand correctly, there would be no way to --mca 
>> mpi_param_check true)
>> 
>> here is the relevant part of my config.status :
>> $ grep MPI_PARAM_CHECK config.status 
>> D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
>> D["OMPI_PARAM_CHECK"]=" 1"
>> 
>> i will try on a centos7 box from now.
>> in the meantime, can you check your config.status and try again with 
>> mpirun --mca mpi_param_check true
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/11/27 10:06, Gilles Gouaillardet wrote:
>>> I will double check this(afk right now)
>>> Are you running on a rhel6 like distro with gcc ?
>>> 
>>> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
>>> this...
>>> 
>>> Cheers,
>>> 
>>> Gilles 
>>> 
>>> Ralph Castain wrote:
 I tried it with both the fortran and c versions - got the same result.
 
 
 This was indeed with a debug build. I wouldn’t expect a segfault even with 
 an optimized build, though - I would expect an MPI error, yes?
 
 
 
 
 On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet wrote:
 
 
 I will have a look
 
 Btw, i was running the fortran version, not the c one.
 Did you configure with --enable-debug?
 The program sends to a rank *not* in the communicator, so this behavior 
 could make some sense on an optimized build.
 
 Cheers,
 
 Gilles
 
 Ralph Castain wrote:
 Ick - I’m getting a segfault when trying to run that test:
 
 
 MPITEST info  (0): Starting MPI_Errhandler_fatal test
 
 MPITEST info  (0): This test will abort after printing the results message
 
 MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
 
 [bend001:07714] *** Process received signal ***
 
 [bend001:07714] Signal: Segmentation fault (11)
 
 [bend001:07714] Signal code: Address not mapped (1)
 
 [bend001:07714] Failing at address: 0x50
 
 [bend001:07715] *** Process received signal ***
 
 [bend001:07715] Signal: Segmentation fault (11)
 
 [bend001:07715] Signal code: Address not mapped (1)
 
 [bend001:07715] Failing at address: 0x50
 
 [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
 
 [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
 
 [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
 
 [bend001:07713] *** Process received signal ***
 
 [bend001:07713] Signal: Segmentation fault (11)
 
 [bend001:07713] Signal code: Address not mapped (1)
 
 [bend001:07713] Failing at address: 0x50
 
 [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
 
 [bend001:07713] [ 1] 
 /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
 
 [bend001:07713] [ 2] [bend001:07714] [ 0] 
 /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
 
 [bend001:07714] [ 1] 
 /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
 
 [bend001:07714] [ 2] [bend001:07715] [ 0] 
 /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
 
 [bend001:07715] [ 1] 
 /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
 
 [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests 
 PASSED (3)
 
 
 
 This is with the head of the 1.8 branch. Any suggestions?
 
 Ralph
 
 
 
 On Nov 26, 2014, at 8:46 AM, Ralph Castain wrote:
 
 
 Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
 like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
 sure I remember how I fixed it) - thanks!
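
The thread above turns on whether a send to a rank outside the communicator produces an MPI error or a segfault, depending on mpi_param_check. The following is a toy Python model of that difference, not Open MPI's code: Communicator, peer_lookup and send are hypothetical names, and the TypeError stands in for the NULL-pointer dereference that the "Failing at address: 0x50" lines are consistent with.

```python
class Communicator:
    """Toy communicator holding one entry per rank (illustration only)."""

    def __init__(self, size):
        self.peers = ["proc%d" % rank for rank in range(size)]


def peer_lookup(comm, rank):
    # Unchecked lookup: an invalid index yields None, like a NULL proc pointer.
    if 0 <= rank < len(comm.peers):
        return comm.peers[rank]
    return None


def send(comm, dest, param_check=True):
    if param_check and not (0 <= dest < len(comm.peers)):
        # With parameter checking, the bad rank is reported as an error.
        return "MPI_ERR_RANK"
    proc = peer_lookup(comm, dest)
    # Without the check, the None is used directly and raises TypeError here,
    # the toy equivalent of the crash inside the send path.
    return "sent to " + proc


if __name__ == "__main__":
    comm = Communicator(3)
    print(send(comm, 1))
    print(send(comm, 3))
```

Calling send(comm, 3, param_check=False) raises instead of returning an error string, which mirrors the behaviour reproduced with "mpirun --mca mpi_param_check false".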

Re: [OMPI devel] Open MPI 1.8: link problem when Fortran+C+Platform LSF

2014-12-01 Thread Jeff Squyres (jsquyres)
Paul --

Sorry for the delay -- SC and the US Thanksgiving holiday last week got in the 
way of responding to this properly.

I talked with Dave Goodell about this issue a bunch today.  

Going back to the original email in this thread 
(http://www.open-mpi.org/community/lists/devel/2014/10/16064.php), it seems 
like this is the original problem:


$ make 
mpif90 -c main.f90 
yacc -d example4.y 
mpicc -c y.tab.c 
mpicc -c mymain.c 
lex example4.l 
mpicc -c lex.yy.c 
mpif90 -o example main.o y.tab.o mymain.o lex.yy.o 
ld: y.tab.o(.text+0xd9): unresolvable R_X86_64_PC32 relocation against symbol 
`yylval' 
ld: y.tab.o(.text+0x16f): unresolvable R_X86_64_PC32 relocation against symbol 
`yyval' 
-

You later confirmed that adding -fPIC to the compile/link lines makes everything 
work without adding -lbat -llsf.

Dave and I are sorta convinced (i.e., we could still be wrong, but we *think* 
this is right) that adding -lbat and -llsf to the link line is the Wrong 
solution.  The issue seems to be that a correct/matching yylval symbol is not 
being found during your final link.  

Crucial point: the yylval symbol should be in *your* code, not in the bat and 
lsf libraries.  Indeed, if adding -lbat -llsf resolves the problem (because a 
matching yylval symbol is found in libbat or liblsf), then it means you're 
using the lex/yacc-generated yylval symbol in the LSF libraries, not your code 
(!).

And that definitely does not seem right.

(even though it *works* [in v1.6 and/or by adding -lbat -llsf in v1.8], it may 
not be actually doing what you expect under the covers, and you're really just 
getting lucky that it actually works at all)

It *seems* like this is a generic C/Fortran linkage issue; i.e., it would be 
good to look at the docs for your version of icc/ifort to see if they are 
generating different modes of .o files by default, or somesuch (i.e., why 
adding -fPIC to the compile/link line makes it work).

Make sense?

That being said, you previously sent the v1.6/v1.8 differences between "mpicc 
--showme" -- can you send the differences between "mpif90 -o example main.o 
y.tab.o mymain.o lex.yy.o --showme"?

Thanks.
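
The linkage point made above (yylval should come from your own objects, not from libbat or liblsf) can be illustrated with a toy model of first-definition-wins symbol resolution. This is a sketch of the hypothesis only, not of how ld actually processes relocations; the symbol tables below are invented for illustration, and whether the LSF libraries really export yylval is exactly the open question.

```python
def resolve(symbol, link_order):
    """Return the name of the first link input that defines symbol, or None."""
    for name, defined in link_order:
        if symbol in defined:
            return name
    return None


# Hypothetical symbol tables, for illustration only.
user_objects = [
    ("main.o", {"main"}),
    ("y.tab.o", {"yyparse", "yylval"}),
    ("mymain.o", {"mymain"}),
    ("lex.yy.o", {"yylex"}),
]
lsf_libs = [("libbat", {"yylval"}), ("liblsf", set())]

if __name__ == "__main__":
    # Expected case: the definition generated from the user's grammar is used.
    print(resolve("yylval", user_objects + lsf_libs))
    # If the user's definition is unusable, another library's copy is picked up.
    stripped = [(name, syms - {"yylval"}) for name, syms in user_objects]
    print(resolve("yylval", stripped + lsf_libs))
```

In the second case the link "works", but the program ends up bound to a yylval that belongs to someone else's parser, which is why adding -lbat -llsf looks like the wrong fix.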



On Oct 21, 2014, at 4:13 AM, Paul Kapinos  wrote:

> Jeff,
> the output of "mpicc --showme" is attached below.
> 
> > Do you really need to add "-lbat -llsf" to the command line to make it work?
> As both the 1.6.5 and 1.8.3 versions are built to work with Platform LSF, yes, 
> we need libbat and liblsf. The 1.6.5 version links these libraries explicitly in 
> the link line. The 1.8.3 version does not.
> 
> 
> 
> ### 1.6.5:
> icc 
> -I/opt/MPI/openmpi-1.6.5/linux/intel/include/openmpi/opal/mca/hwloc/hwloc132/hwloc/include
>  -I/opt/MPI/openmpi-1.6.5/linux/intel/include 
> -I/opt/MPI/openmpi-1.6.5/linux/intel/include/openmpi -fexceptions -pthread 
> -L/opt/lsf/9.1/linux2.6-glibc2.3-x86_64/lib 
> -L/opt/MPI/openmpi-1.6.5/linux/intel/lib -lmpi -losmcomp -lrdmacm -libverbs 
> -lrt -lnsl -lutil -lpsm_infinipath -lbat -llsf -ldl -lm -lnuma -lrt -lnsl 
> -lutil
> 
> ### 1.8.3:
> icc 
> -I/opt/MPI/openmpi-1.8.3/linux/intel/include/openmpi/opal/mca/hwloc/hwloc172/hwloc/include
>  
> -I/opt/MPI/openmpi-1.8.3/linux/intel/include/openmpi/opal/mca/event/libevent2021/libevent
>  
> -I/opt/MPI/openmpi-1.8.3/linux/intel/include/openmpi/opal/mca/event/libevent2021/libevent/include
>  -I/opt/MPI/openmpi-1.8.3/linux/intel/include 
> -I/opt/MPI/openmpi-1.8.3/linux/intel/include/openmpi -fexceptions -pthread 
> -L/opt/lsf/9.1/linux2.6-glibc2.3-x86_64/lib -Wl,-rpath 
> -Wl,/opt/lsf/9.1/linux2.6-glibc2.3-x86_64/lib -Wl,-rpath 
> -Wl,/opt/MPI/openmpi-1.8.3/linux/intel/lib -Wl,--enable-new-dtags 
> -L/opt/MPI/openmpi-1.8.3/linux/intel/lib -lmpi
> 
> 
> On 10/18/14 01:56, Jeff Squyres (jsquyres) wrote:
>> I think the LSF part of this may be a red herring.  Do you really need to 
>> add "-lbat -llsf" to the command line to make it work?
>> 
>> The error message *sounds* like y.tab.o was compiled differently than 
>> others...?  It's hard to know without seeing the output of mpicc --showme.
>> 
>> 
>> On Oct 17, 2014, at 7:51 AM, Ralph Castain  wrote:
>> 
>>> Forwarding this for Paul until his email address gets updated on the User 
>>> list:
>>> 
 Begin forwarded message:
 
 Date: October 17, 2014 at 6:35:31 AM PDT
 From: Paul Kapinos 
 To: Open MPI Users 
 Cc: "Kapinos, Paul" , 
 
 Subject: Open MPI 1.8: link problem when Fortran+C+Platform LSF
 
 Dear Open MPI developer,
 
 we have both Open MPI 1.6(.5) and 1.8(.3) in our cluster, configured to be 
 used with Platform LSF.
 
 One of our users ran into an issue when trying to link his code 
 (combination of lex/C and Fortran) with v1.8, whereas with Open MPI 1.6 
 the code links OK.
 
> $ make
> mpif90 -c main.f90
> yacc -d 

Re: [OMPI devel] Question about tight integration with not-yet-supported queuing systems

2014-12-01 Thread Gilles Gouaillardet
Marc,

i am not aware of any mpi implementation in which mpirun does the job
allocation.

instead, mpirun gets job info from the batch manager (e.g. number of nodes)
so the job can be launched seamlessly and be properly killed in case of
a job abort
(bkill or equivalent)

Cheers,

Gilles
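
As a sketch of what "gets job info from the batch manager" can look like: LSF exports the allocation to the job environment in variables such as LSB_MCPU_HOSTS, a space-separated list of host/slot pairs. Whether OpenLava sets the same variable is an assumption here, and this is not necessarily how Open MPI itself reads an LSF allocation; parse_mcpu_hosts and read_allocation are hypothetical helper names.

```python
import os


def parse_mcpu_hosts(value):
    """Turn 'hostA 4 hostB 2' into [('hostA', 4), ('hostB', 2)]."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected host/slot pairs, got: %r" % value)
    # Walk the tokens two at a time: host name, then its slot count.
    return [(tokens[i], int(tokens[i + 1])) for i in range(0, len(tokens), 2)]


def read_allocation(environ=None):
    """Read the allocation from the environment; empty list if none is set."""
    if environ is None:
        environ = os.environ
    return parse_mcpu_hosts(environ.get("LSB_MCPU_HOSTS", ""))


if __name__ == "__main__":
    for host, slots in read_allocation({"LSB_MCPU_HOSTS": "node01 4 node02 2"}):
        print(host, slots)
```

Reading the allocation this way only tells the launcher where it may run; starting the daemons on those nodes still needs a launch mechanism such as ssh or the batch system's own remote-execution command.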

On 2014/12/01 17:47, marc.hoepp...@bils.se wrote:
>  
>
> Hi,
>
>  sorry for the late reply - I've been traveling with limited email
> access. I think you can leave this issue be. I think I was hoping for a
> way to just launch mpirun and have it create the allocation by itself.
> It's not super important right now, more something I was wondering
> about. 
>
>  Thanks again for looking into this!
>
>  /Marc
>
> On 28.11.2014 17:10, Ralph Castain wrote: 
>
>> Hey Marc - just wanted to check to see if you felt this would indeed solve 
>> the problem for you. I'd rather not invest the time if this isn't going to 
>> meet the need, and I honestly don't know of a better solution. 
>>
>> On Nov 20, 2014, at 2:13 PM, Ralph Castain  wrote: 
>>
>> Here's what I can provide: 
>>
>> * lsrun -n N bash This causes openlava to create an allocation and start you 
>> off in a bash shell (or pick your shell) 
>>
>> * mpirun . Will read the allocation and use openlava to start the 
>> daemons, and then the application, on the allocated nodes 
>>
>> You can execute as many mpirun's as you like, then release the allocation (I 
>> believe by simply exiting the shell) when done. 
>>
>> Does that match your expectations? 
>> Ralph 
>>
>> On Nov 20, 2014, at 2:03 PM, Marc Höppner  wrote: 
>>
>> Hi,
>>
>> yes, lsrun exists under openlava. 
>>
>> Using mpirun is fine, but openlava currently requires that to be launched 
>> through a bash script (openmpi-mpirun). Would be neater if one could do away 
>> with that. 
>>
>> Again, thanks for looking into this!
>>
>> /Marc
>>
>> Hold on - was discussing this with a (possibly former) OpenLava developer 
>> who made some suggestions that would make this work. It all hinges on one 
>> thing. 
>>
>> Can you please check and see if you have "lsrun" on your system? If you do, 
>> then I can offer a tight integration in that we would use OpenLava to 
>> actually launch the OMPI daemons. Still not sure I could support you 
>> directly launching MPI apps without using mpirun, if that's what you are 
>> after. 
>>
>> On Nov 18, 2014, at 7:58 AM, Marc Höppner  wrote: 
>>
>> Hi Ralph,
>>
>> I really appreciate you guys looking into this! At least now I know that 
>> there isn't a better way to run mpi jobs. Probably worth looking into LSF 
>> again..
>>
>> Cheers,
>>
>> Marc 
>> I took a brief gander at the OpenLava source code, and a couple of things 
>> jump out. First, OpenLava is a batch scheduler and only supports batch 
>> execution - there is no interactive command for "run this job". So you would 
>> have to "bsub" mpirun regardless. 
>>
>> Once you submit the job, mpirun can certainly read the local allocation via 
>> the environment. However, we cannot use the OpenLava internal functions to 
>> launch the daemons or processes as the code is GPL2, and thus has a viral 
>> incompatible license. Ordinarily, we get around that by just executing the 
>> interactive job execution command, but OpenLava doesn't have one. 
>>
>> So we'd have no other choice but to use ssh to launch the daemons on the 
>> remote nodes. This is exactly what the provided openmpi wrapper script that 
>> comes with OpenLava already does. 
>>
>> Bottom line: I don't see a way to do any deeper integration minus the 
>> interactive execution command. If OpenLava had a way of getting an 
>> allocation and then interactively running jobs, we could support what you 
>> requested. This doesn't seem to be what they are intending, unless I'm 
>> missing something (the documentation is rather incomplete). 
>>
>> Ralph 
>>
>> On Tue, Nov 18, 2014 at 6:20 AM, Marc Höppner  wrote:
>>
>> Hi, 
>>
>> sure, no problem. And about the C Api, I really don't know more than what I 
>> was told in the google group post I referred to (i.e. the API is essentially 
>> identical to LSF 4-6, which should be on the web). 
>>
>> The output of env can be found here: 
>> https://dl.dropboxusercontent.com/u/1918141/env.txt 
>>
>> /M 
>>
>> Marc P. Hoeppner, PhD 
>> Team Leader 
>> BILS Genome Annotation Platform 
>> Department for Medical Biochemistry and Microbiology 
>> Uppsala University, Sweden 
>> marc.hoepp...@bils.se 
>>
>> On 18 Nov 2014, at 15:14, Ralph Castain  wrote: 
>>
>> If you could just run a single copy of "env" and send the output along, that 
>> would help a lot. I'm not interested in the usual path etc, but would like 
>> to see the envars that OpenLava is setting. 
>>
>> Thanks 
>> Ralph 
>>
>> On Tue, Nov 18, 2014 at 2:19 AM, Gilles Gouaillardet 
>>  wrote:
>>
>> Marc,
>>
>> the reply you pointed is a 

Re: [OMPI devel] Question about tight integration with not-yet-supported queuing systems

2014-12-01 Thread marc . hoeppner


Hi,

 sorry for the late reply - I've been traveling with limited email
access. I think you can leave this issue be. I think I was hoping for a
way to just launch mpirun and have it create the allocation by itself.
It's not super important right now, more something I was wondering
about. 

 Thanks again for looking into this!

 /Marc

On 28.11.2014 17:10, Ralph Castain wrote: 

> Hey Marc - just wanted to check to see if you felt this would indeed solve 
> the problem for you. I'd rather not invest the time if this isn't going to 
> meet the need, and I honestly don't know of a better solution. 
> 
> On Nov 20, 2014, at 2:13 PM, Ralph Castain  wrote: 
> 
> Here's what I can provide: 
> 
> * lsrun -n N bash This causes openlava to create an allocation and start you 
> off in a bash shell (or pick your shell) 
> 
> * mpirun . Will read the allocation and use openlava to start the 
> daemons, and then the application, on the allocated nodes 
> 
> You can execute as many mpirun's as you like, then release the allocation (I 
> believe by simply exiting the shell) when done. 
> 
> Does that match your expectations? 
> Ralph 
> 
> On Nov 20, 2014, at 2:03 PM, Marc Höppner  wrote: 
> 
> Hi,
> 
> yes, lsrun exists under openlava. 
> 
> Using mpirun is fine, but openlava currently requires that to be launched 
> through a bash script (openmpi-mpirun). Would be neater if one could do away 
> with that. 
> 
> Again, thanks for looking into this!
> 
> /Marc
> 
> Hold on - was discussing this with a (possibly former) OpenLava developer who 
> made some suggestions that would make this work. It all hinges on one thing. 
> 
> Can you please check and see if you have "lsrun" on your system? If you do, 
> then I can offer a tight integration in that we would use OpenLava to 
> actually launch the OMPI daemons. Still not sure I could support you directly 
> launching MPI apps without using mpirun, if that's what you are after. 
> 
> On Nov 18, 2014, at 7:58 AM, Marc Höppner  wrote: 
> 
> Hi Ralph,
> 
> I really appreciate you guys looking into this! At least now I know that 
> there isn't a better way to run mpi jobs. Probably worth looking into LSF 
> again..
> 
> Cheers,
> 
> Marc 
> I took a brief gander at the OpenLava source code, and a couple of things 
> jump out. First, OpenLava is a batch scheduler and only supports batch 
> execution - there is no interactive command for "run this job". So you would 
> have to "bsub" mpirun regardless. 
> 
> Once you submit the job, mpirun can certainly read the local allocation via 
> the environment. However, we cannot use the OpenLava internal functions to 
> launch the daemons or processes as the code is GPL2, and thus has a viral 
> incompatible license. Ordinarily, we get around that by just executing the 
> interactive job execution command, but OpenLava doesn't have one. 
> 
> So we'd have no other choice but to use ssh to launch the daemons on the 
> remote nodes. This is exactly what the provided openmpi wrapper script that 
> comes with OpenLava already does. 
> 
> Bottom line: I don't see a way to do any deeper integration minus the 
> interactive execution command. If OpenLava had a way of getting an allocation 
> and then interactively running jobs, we could support what you requested. 
> This doesn't seem to be what they are intending, unless I'm missing something 
> (the documentation is rather incomplete). 
> 
> Ralph 
> 
> On Tue, Nov 18, 2014 at 6:20 AM, Marc Höppner  wrote:
> 
> Hi, 
> 
> sure, no problem. And about the C Api, I really don't know more than what I 
> was told in the google group post I referred to (i.e. the API is essentially 
> identical to LSF 4-6, which should be on the web). 
> 
> The output of env can be found here: 
> https://dl.dropboxusercontent.com/u/1918141/env.txt 
> 
> /M 
> 
> Marc P. Hoeppner, PhD 
> Team Leader 
> BILS Genome Annotation Platform 
> Department for Medical Biochemistry and Microbiology 
> Uppsala University, Sweden 
> marc.hoepp...@bils.se 
> 
> On 18 Nov 2014, at 15:14, Ralph Castain  wrote: 
> 
> If you could just run a single copy of "env" and send the output along, that 
> would help a lot. I'm not interested in the usual path etc, but would like to 
> see the envars that OpenLava is setting. 
> 
> Thanks 
> Ralph 
> 
> On Tue, Nov 18, 2014 at 2:19 AM, Gilles Gouaillardet 
>  wrote:
> 
> Marc,
> 
> the reply you pointed is a bit confusing to me :
> 
> "There is a native C API which can submit/start/stop/kill/re queue jobs"
> this is not what i am looking for :-(
> 
> "you need to make an appropriate call to openlava to start a remote process"
> this is what i am interested in :-)
> could you be more specific (e.g. point me to the functions, since the 
> OpenLava doc is pretty minimal ...)
> 
> the goal here is to spawn the orted daemons as part of the parallel job,
>