[OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-12 Thread Audet, Martin
Hi OMPI_Users && OMPI_Developers,

Is there an equivalent to the MCA parameter btl_openib_include_if when using 
MXM over InfiniBand (e.g. either (pml=cm, mtl=mxm) or (pml=yalla))?

I ask this question because I'm working on a cluster where LXC containers are 
used on the compute nodes (with SR-IOV, I think) and multiple mlx4 interfaces are 
reported by lstopo (e.g. mlx4_0, mlx4_1, ..., mlx4_16), even though only a single 
physical Mellanox ConnectX-3 HCA is present per node.

I found that when I use the plain openib BTL (e.g. (pml=ob1, btl=openib)), it 
is much faster if I specify the MCA parameter btl_openib_include_if=mlx4_0 to 
force Open MPI to use a single interface. By doing that, the latency is lower 
and the bandwidth higher. I guess this is because otherwise Open MPI gets 
confused trying to use all of the "virtual" interfaces at once.
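
For reference, the invocation that gives me the good latency and bandwidth looks 
roughly like this (the application and host names below are just placeholders):

  mpirun --mca pml ob1 --mca btl openib,self \
         --mca btl_openib_include_if mlx4_0 \
         -np 16 --host node1,node2 ./my_app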

However, we all know that MXM is better than plain openib since it allows the 
HCAs to perform message matching, transfer messages in the background, and 
provide communication progress.

So in this case is there a way to use only mlx4_0 ?

I mean when using the mxm MTL (pml=cm, mtl=mxm) or, preferably, using it more 
directly via the yalla PML (pml=yalla).
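
For example, I am wondering whether exporting something like the MXM_RDMA_PORTS 
environment variable (if I understand its purpose correctly) would be the way to 
pin the device, e.g.:

  mpirun --mca pml yalla -x MXM_RDMA_PORTS=mlx4_0:1 \
         -np 16 --host node1,node2 ./my_app

or whether there is a dedicated MCA parameter for this.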

Note that for now I'm using Open MPI 1.10.3, which I compiled myself, but I 
could use Open MPI 2.0 instead if necessary.

Thanks,

Martin Audet

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread r...@open-mpi.org

> On Aug 12, 2016, at 1:48 PM, Reuti  wrote:
> 
> 
> On 12.08.2016 at 21:44, r...@open-mpi.org wrote:
> 
>> Don’t know about the toolchain issue - I use those same versions, and don’t 
>> have a problem. I’m on CentOS-7, so that might be the difference?
>> 
>> Anyway, I found the missing code to assemble the cmd line for qrsh - not 
>> sure how/why it got deleted.
>> 
>> https://github.com/open-mpi/ompi/pull/1960 
>> 
> 
> Yep, it's working again - thx.
> 
> But surely there was a reason behind the removal, which may be worth 
> discussing within the Open MPI team to avoid any side effects from fixing this issue.

I actually don’t recall a reason - and I’m the one who generally maintains 
that code area. I think it fell off the map accidentally when I was updating 
that area.

However, we’ll toss it out there for comment - anyone recall?


> 
> -- Reuti
> 
> PS: The other items I'll investigate on Monday.
> 
> 
>>> On Aug 12, 2016, at 12:15 PM, Reuti  wrote:
>>> 
 
 On 12.08.2016 at 16:52, r...@open-mpi.org wrote:
 
 IIRC, the rationale behind adding the check was that someone using SGE 
 wanted to specify a custom launch agent, and we were overriding it with 
 qrsh. However, the check is incorrect as that MCA param cannot be NULL.
 
 I have updated this on master - can you see if this fixes the problem for 
 you?
 
 https://github.com/open-mpi/ompi/pull/1957
>>> 
>>> I updated my tools to:
>>> 
>>> autoconf-2.69
>>> automake-1.15
>>> libtool-2.4.6
>>> 
>>> but Open MPI's ./autogen.pl fails with:
>>> 
>>> configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL
>>> 
>>> I recall seeing this before already - how do I get rid of it? For now I fixed 
>>> the single source file just by hand.
>>> 
>>> -- Reuti
>>> 
>>> 
 As for the blank in the cmd line - that is likely due to a space reserved 
 for some entry that you aren’t using (e.g., when someone manually 
 specifies the prefix). It shouldn’t cause any harm as the cmd line parser 
 is required to ignore spaces
 
 The -ldl problem sounds like a configuration issue - you might want to 
 file a separate issue about it
 
> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
> 
> Hi,
> 
> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, 
> which seems to prevent the tight integration with SGE from starting:
> 
> if (NULL == mca_plm_rsh_component.agent) {
> 
> Why is it there (it wasn't in 1.10.3)?
> 
> If I just remove it I get:
> 
> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
> qrsh   orted --hnp-topo-sig ...
> 
> instead of the former:
> 
> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   
> orted --hnp-topo-sig ...
> 
> So, just removing the if-statement is not a perfect cure as the 
> $SGE_ROOT/$ARC does not prefix `qrsh`.
> 
> ==
> 
> BTW: why is there a blank before " orted" in the assembled command line? It 
> really is in the argument that `qrsh_starter` is supposed to start when I 
> check this on the slave nodes. As long as there is a wrapping shell, it will 
> be removed anyway, but in a special setup we noticed this additional blank.
> 
> ==
> 
> I also notice that I have to supply "-ldl" to `mpicc` to allow the 
> compilation of an application to succeed in 2.0.0.
> 
> -- Reuti
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread Reuti

On 12.08.2016 at 21:44, r...@open-mpi.org wrote:

> Don’t know about the toolchain issue - I use those same versions, and don’t 
> have a problem. I’m on CentOS-7, so that might be the difference?
> 
> Anyway, I found the missing code to assemble the cmd line for qrsh - not sure 
> how/why it got deleted.
> 
> https://github.com/open-mpi/ompi/pull/1960

Yep, it's working again - thx.

But surely there was a reason behind the removal, which may be worth discussing 
within the Open MPI team to avoid any side effects from fixing this issue.

-- Reuti

PS: The other items I'll investigate on Monday.


>> On Aug 12, 2016, at 12:15 PM, Reuti  wrote:
>> 
>>> 
>>> On 12.08.2016 at 16:52, r...@open-mpi.org wrote:
>>> 
>>> IIRC, the rationale behind adding the check was that someone using SGE 
>>> wanted to specify a custom launch agent, and we were overriding it with 
>>> qrsh. However, the check is incorrect as that MCA param cannot be NULL.
>>> 
>>> I have updated this on master - can you see if this fixes the problem for 
>>> you?
>>> 
>>> https://github.com/open-mpi/ompi/pull/1957
>> 
>> I updated my tools to:
>> 
>> autoconf-2.69
>> automake-1.15
>> libtool-2.4.6
>> 
>> but Open MPI's ./autogen.pl fails with:
>> 
>> configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL
>> 
>> I recall seeing this before already - how do I get rid of it? For now I fixed the 
>> single source file just by hand.
>> 
>> -- Reuti
>> 
>> 
>>> As for the blank in the cmd line - that is likely due to a space reserved 
>>> for some entry that you aren’t using (e.g., when someone manually specifies 
>>> the prefix). It shouldn’t cause any harm as the cmd line parser is required 
>>> to ignore spaces
>>> 
>>> The -ldl problem sounds like a configuration issue - you might want to file 
>>> a separate issue about it
>>> 
 On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
 
 Hi,
 
 In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, 
 which seems to prevent the tight integration with SGE from starting:
 
  if (NULL == mca_plm_rsh_component.agent) {
 
 Why is it there (it wasn't in 1.10.3)?
 
 If I just remove it I get:
 
 [node17:25001] [[27678,0],0] plm:rsh: final template argv:
  qrsh   orted --hnp-topo-sig ...
 
 instead of the former:
 
 /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   
 orted --hnp-topo-sig ...
 
 So, just removing the if-statement is not a perfect cure as the 
 $SGE_ROOT/$ARC does not prefix `qrsh`.
 
 ==
 
 BTW: why is there a blank before " orted" in the assembled command line? It 
 really is in the argument that `qrsh_starter` is supposed to start when I 
 check this on the slave nodes. As long as there is a wrapping shell, it will 
 be removed anyway, but in a special setup we noticed this additional blank.
 
 ==
 
 I also notice that I have to supply "-ldl" to `mpicc` to allow the 
 compilation of an application to succeed in 2.0.0.
 
 -- Reuti

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread r...@open-mpi.org
Don’t know about the toolchain issue - I use those same versions, and don’t 
have a problem. I’m on CentOS-7, so that might be the difference?

Anyway, I found the missing code to assemble the cmd line for qrsh - not sure 
how/why it got deleted.

https://github.com/open-mpi/ompi/pull/1960


> On Aug 12, 2016, at 12:15 PM, Reuti  wrote:
> 
>> 
>> On 12.08.2016 at 16:52, r...@open-mpi.org wrote:
>> 
>> IIRC, the rationale behind adding the check was that someone using SGE 
>> wanted to specify a custom launch agent, and we were overriding it with 
>> qrsh. However, the check is incorrect as that MCA param cannot be NULL.
>> 
>> I have updated this on master - can you see if this fixes the problem for 
>> you?
>> 
>> https://github.com/open-mpi/ompi/pull/1957 
>> 
> 
> I updated my tools to:
> 
> autoconf-2.69
> automake-1.15
> libtool-2.4.6
> 
> but Open MPI's ./autogen.pl fails with:
> 
> configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL
> 
> I recall seeing this before already - how do I get rid of it? For now I fixed the 
> single source file just by hand.
> 
> -- Reuti
> 
> 
>> As for the blank in the cmd line - that is likely due to a space reserved 
>> for some entry that you aren’t using (e.g., when someone manually specifies 
>> the prefix). It shouldn’t cause any harm as the cmd line parser is required 
>> to ignore spaces
>> 
>> The -ldl problem sounds like a configuration issue - you might want to file 
>> a separate issue about it
>> 
>>> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
>>> 
>>> Hi,
>>> 
>>> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, which 
>>> seems to prevent the tight integration with SGE from starting:
>>> 
>>>  if (NULL == mca_plm_rsh_component.agent) {
>>> 
>>> Why is it there (it wasn't in 1.10.3)?
>>> 
>>> If I just remove it I get:
>>> 
>>> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
>>>  qrsh   orted --hnp-topo-sig ...
>>> 
>>> instead of the former:
>>> 
>>> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   
>>> orted --hnp-topo-sig ...
>>> 
>>> So, just removing the if-statement is not a perfect cure as the 
>>> $SGE_ROOT/$ARC does not prefix `qrsh`.
>>> 
>>> ==
>>> 
>>> BTW: why is there a blank before " orted" in the assembled command line? It 
>>> really is in the argument that `qrsh_starter` is supposed to start when I 
>>> check this on the slave nodes. As long as there is a wrapping shell, it will 
>>> be removed anyway, but in a special setup we noticed this additional blank.
>>> 
>>> ==
>>> 
>>> I also notice that I have to supply "-ldl" to `mpicc` to allow the 
>>> compilation of an application to succeed in 2.0.0.
>>> 
>>> -- Reuti
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread Reuti

> On 12.08.2016 at 16:52, r...@open-mpi.org wrote:
> 
> IIRC, the rationale behind adding the check was that someone using SGE wanted 
> to specify a custom launch agent, and we were overriding it with qrsh. 
> However, the check is incorrect as that MCA param cannot be NULL.
> 
> I have updated this on master - can you see if this fixes the problem for you?
> 
> https://github.com/open-mpi/ompi/pull/1957

I updated my tools to:

autoconf-2.69
automake-1.15
libtool-2.4.6

but Open MPI's ./autogen.pl fails with:

configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL

I recall seeing this before already - how do I get rid of it? For now I fixed the 
single source file just by hand.

-- Reuti


> As for the blank in the cmd line - that is likely due to a space reserved for 
> some entry that you aren’t using (e.g., when someone manually specifies the 
> prefix). It shouldn’t cause any harm as the cmd line parser is required to 
> ignore spaces
> 
> The -ldl problem sounds like a configuration issue - you might want to file a 
> separate issue about it
> 
>> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
>> 
>> Hi,
>> 
>> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, which 
>> seems to prevent the tight integration with SGE from starting:
>> 
>>   if (NULL == mca_plm_rsh_component.agent) {
>> 
>> Why is it there (it wasn't in 1.10.3)?
>> 
>> If I just remove it I get:
>> 
>> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
>>   qrsh   orted --hnp-topo-sig ...
>> 
>> instead of the former:
>> 
>> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
>> --hnp-topo-sig ...
>> 
>> So, just removing the if-statement is not a perfect cure as the 
>> $SGE_ROOT/$ARC does not prefix `qrsh`.
>> 
>> ==
>> 
>> BTW: why is there a blank before " orted" in the assembled command line? It 
>> really is in the argument that `qrsh_starter` is supposed to start when I 
>> check this on the slave nodes. As long as there is a wrapping shell, it will 
>> be removed anyway, but in a special setup we noticed this additional blank.
>> 
>> ==
>> 
>> I also notice that I have to supply "-ldl" to `mpicc` to allow the 
>> compilation of an application to succeed in 2.0.0.
>> 
>> -- Reuti

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread Reuti

On 12.08.2016 at 16:52, r...@open-mpi.org wrote:

> IIRC, the rationale behind adding the check was that someone using SGE wanted 
> to specify a custom launch agent, and we were overriding it with qrsh. 
> However, the check is incorrect as that MCA param cannot be NULL.
> 
> I have updated this on master - can you see if this fixes the problem for you?
> 
> https://github.com/open-mpi/ompi/pull/1957

As written initially, I now get this verbose output with "--mca 
plm_base_verbose 10":

[node22:02220] mca: base: close: component isolated closed
[node22:02220] mca: base: close: unloading component isolated
[node22:02220] mca: base: close: component slurm closed
[node22:02220] mca: base: close: unloading component slurm
[node22:02220] [[28119,0],0] plm:rsh: final template argv:
qrsh   orted --hnp-topo-sig 2N:2S:2L3:8L2:8L1:8C:8H:x86_64
-mca ess "env" -mca ess_base_jobid "1842806784" -mca ess_base_vpid ""
-mca ess_base_num_procs "9" -mca orte_hnp_uri
"1842806784.0;usock;tcp://192.168.154.22,192.168.154.92:46186"
--mca plm_base_verbose "10" -mca plm "rsh" -mca pmix "^s1,s2,cray"
bash: node13: command not found
bash: node20: command not found
bash: node12: command not found
bash: node16: command not found
bash: node17: command not found
bash: node14: command not found
bash: node15: command not found
Your "qrsh" request could not be scheduled, try again later.

Sure, the name of the machine is allowed only after the additional "-inherit" 
option to `qrsh`. Please see below for the complete command line in 1.10.3; hence 
the assembly also seems not to be done in the correct way.
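
In other words, I would expect the assembled line to follow the 1.10.3 form 
quoted below, roughly like this (with the target host name as a placeholder):

  /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose <hostname> orted --hnp-topo-sig ...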

-- Reuti


> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
> ...
> instead of the former:
> 
> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
> --hnp-topo-sig ...
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] mpirun won't find programs from the PATH environment variable that are in directories that are relative paths

2016-08-12 Thread Reuti

On 12.08.2016 at 20:34, r...@open-mpi.org wrote:

> Sorry for the delay - I had to catchup on some other things before I could 
> come back to checking this one. Took me awhile to track this down, but the 
> change is in test for master:
> 
> https://github.com/open-mpi/ompi/pull/1958
> 
> Once complete, I’ll set it up for inclusion in v2.0.1
> 
> Thanks for reporting it!
> Ralph
> 
> 
>> On Jul 29, 2016, at 5:47 PM, Phil Regier  
>> wrote:
>> 
>> If I'm reading you right, you're presently unable to do the equivalent 
>> (albeit probably with PATH set on a different line somewhere above) of
>> 
>> PATH=arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
>> 
>> I'm mildly curious whether it would help to add a leading "./" to get the 
>> equivalent of
>> 
>> PATH=./arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
>> 
>> But to be clear, I'm advocating
>> 
>> PATH=$PWD/arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
>> 
>> as opposed to
>> 
>> mpirun -n 1 $PWD/arch/x86_64-rhel7-gcc48-opt/bin/psana
>> 
>> mostly because you still get to set the path once and use it many times 
>> without duplicating code.
>> 
>> 
>> For what it's worth, I've seen Ralph's suggestion generalized to something 
>> like
>> 
>> PREFIX=$PWD/arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 $PREFIX/psana

AFAICS $PREFIX is evaluated too early.

$ PREFIX=old_value
$ PREFIX=foobar /bin/echo $PREFIX
old_value

Unless exactly this is the desired effect.
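
(For the effect Phil probably intended, the assignment would have to be a 
separate command, so that the shell expands $PREFIX before the mpirun line is 
built, e.g.:

  PREFIX=$PWD/arch/x86_64-rhel7-gcc48-opt/bin
  mpirun -n 1 $PREFIX/psana
)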

-- Reuti


>> 
>> where PREFIX might be set above in the same script, or sourced from a common 
>> config script or a custom environment module.  I think this style appeals to 
>> many users on many levels.
>> 
>> 
>> In any event, though, if this really is a bug that gets fixed, you've got 
>> lots of options.
>> 
>> 
>> 
>> 
>> On Fri, Jul 29, 2016 at 5:24 PM, Schneider, David A. 
>>  wrote:
>> Hi, Thanks for the reply! It does look like mpirun runs from the same 
>> directory as where I launch it, and that the environment has the same value 
>> for PATH that I had before (with the relative directory in front), but of 
>> course, there are lots of other MPI based environment variables defined - 
>> maybe one of those means don't use the relative paths?
>> 
>> Explicitly setting the path with $PWD like you say - yes, I agree that is a 
>> good defensive practice, but it is more cumbersome; the actual path looks like
>> 
>>  mpirun -n 1 $PWD/arch/x86_64-rhel7-gcc48-opt/bin/psana
>> 
>> best,
>> 
>> David Schneider
>> SLAC/LCLS
>> 
>> From: users [users-boun...@lists.open-mpi.org] on behalf of Phil Regier 
>> [preg...@penguincomputing.com]
>> Sent: Friday, July 29, 2016 5:12 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] mpirun won't find programs from the PATH 
>> environment variable that are in directories that are relative paths
>> 
>> I might be three steps behind you here, but does "mpirun  pwd" show 
>> that all your launched processes are running in the same directory as the 
>> mpirun command?  I assume that "mpirun  env" would show that your PATH 
>> variable is being passed along correctly, since you don't have any problems 
>> with absolute paths.  In any event, is PATH=$PWD/dir/bin not an option?
>> 
>> Seems to me that this last would be good practice for location-sensitive 
>> launches in general, though I do tend to miss things.
>> 
>> On Fri, Jul 29, 2016 at 4:34 PM, Schneider, David A. wrote:
>> I am finding, on Linux (RHEL 7), with Open MPI 1.8.8 and 1.10.3, that mpirun 
>> won't find apps that are specified on a relative path, i.e., if I have
>> 
>> PATH=dir/bin
>> 
>> and I am in a directory which has dir/bin as a subdirectory, and an 
>> executable dir/bin/myprogram, I can't do
>> 
>> mpirun myprogram
>> 
>> I get the error message that
>> 
>> mpirun was unable to find the specified executable file, and therefore
>> did not launch the job.
>> 
>> whereas if I put an absolute path, something like
>> 
>> PATH=/home/me/dir/bin
>> 
>> then it works.
>> 
>> This causes some problematic silent failures: sometimes we use relative 
>> directories to override a 'base' release, so if I had
>> 
>> PATH=dir/bin:/central/install/dir/bin
>> 
>> and myprogram was in both dir/bin and /central/install/dir/bin, through 
>> mpirun, I would be running myprogram from the central install, but otherwise 
>> I would run it from my own directory.
>> 
>> Do other people find this is the case? I wonder if it is a problem that got 
>> introduced through our installation of openmpi.  We do create relocatable 
>> rpm's, and I'm also trying openmpi from a conda package that is relocatable, 
>> I think all the prefix paths in the binary and text files were corrected 
>> properly for the install - at least everything else seems to work.
>> 
>> best,
>> 
>> David Schneider
>> SLAC/LCLS

Re: [OMPI users] mpirun won't find programs from the PATH environment variable that are in directories that are relative paths

2016-08-12 Thread r...@open-mpi.org
Sorry for the delay - I had to catchup on some other things before I could come 
back to checking this one. Took me awhile to track this down, but the change is 
in test for master:

https://github.com/open-mpi/ompi/pull/1958

Once complete, I’ll set it up for inclusion in v2.0.1

Thanks for reporting it!
Ralph


> On Jul 29, 2016, at 5:47 PM, Phil Regier  wrote:
> 
> If I'm reading you right, you're presently unable to do the equivalent 
> (albeit probably with PATH set on a different line somewhere above) of
> 
> PATH=arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
> 
> I'm mildly curious whether it would help to add a leading "./" to get the 
> equivalent of
> 
> PATH=./arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
> 
> But to be clear, I'm advocating
> 
> PATH=$PWD/arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
> 
> as opposed to
> 
> mpirun -n 1 $PWD/arch/x86_64-rhel7-gcc48-opt/bin/psana
> 
> mostly because you still get to set the path once and use it many times 
> without duplicating code.
> 
> 
> For what it's worth, I've seen Ralph's suggestion generalized to something 
> like
> 
> PREFIX=$PWD/arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 $PREFIX/psana
> 
> where PREFIX might be set above in the same script, or sourced from a common 
> config script or a custom environment module.  I think this style appeals to 
> many users on many levels.
> 
> 
> In any event, though, if this really is a bug that gets fixed, you've got 
> lots of options.
> 
> 
> 
> 
> On Fri, Jul 29, 2016 at 5:24 PM, Schneider, David A. wrote:
> Hi, Thanks for the reply! It does look like mpirun runs from the same 
> directory as where I launch it, and that the environment has the same value 
> for PATH that I had before (with the relative directory in front), but of 
> course, there are lots of other MPI based environment variables defined - 
> maybe one of those means don't use the relative paths?
> 
> Explicitly setting the path with $PWD like you say - yes, I agree that is a 
> good defensive practice, but it is more cumbersome; the actual path looks like
> 
>  mpirun -n 1 $PWD/arch/x86_64-rhel7-gcc48-opt/bin/psana
> 
> best,
> 
> David Schneider
> SLAC/LCLS
> 
> From: users [users-boun...@lists.open-mpi.org] on behalf of Phil Regier 
> [preg...@penguincomputing.com]
> Sent: Friday, July 29, 2016 5:12 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] mpirun won't find programs from the PATH 
> environment variable that are in directories that are relative paths
> 
> I might be three steps behind you here, but does "mpirun  pwd" show 
> that all your launched processes are running in the same directory as the 
> mpirun command?  I assume that "mpirun  env" would show that your PATH 
> variable is being passed along correctly, since you don't have any problems 
> with absolute paths.  In any event, is PATH=$PWD/dir/bin not an option?
> 
> Seems to me that this last would be good practice for location-sensitive 
> launches in general, though I do tend to miss things.
> 
> On Fri, Jul 29, 2016 at 4:34 PM, Schneider, David A. wrote:
> I am finding, on Linux (RHEL 7), with Open MPI 1.8.8 and 1.10.3, that mpirun 
> won't find apps that are specified on a relative path, i.e., if I have
> 
> PATH=dir/bin
> 
> and I am in a directory which has dir/bin as a subdirectory, and an 
> executable dir/bin/myprogram, I can't do
> 
> mpirun myprogram
> 
> I get the error message that
> 
> mpirun was unable to find the specified executable file, and therefore
> did not launch the job.
> 
> whereas if I put an absolute path, something like
> 
> PATH=/home/me/dir/bin
> 
> then it works.
> 
> This causes some problematic silent failures: sometimes we use relative 
> directories to override a 'base' release, so if I had
> 
> PATH=dir/bin:/central/install/dir/bin
> 
> and myprogram was in both dir/bin and /central/install/dir/bin, through 
> mpirun, I would be running myprogram from the central install, but otherwise 
> I would run it from my own directory.
> 
> Do other people find this is the case? I wonder if it is a problem that got 
> introduced through our installation of openmpi.  We do create relocatable 
> rpm's, and I'm also trying openmpi from a conda package that is relocatable, 
> I think all the prefix paths in the binary and text files were corrected 
> properly for the install - at least everything else seems to work.
> 
> best,
> 
> David Schneider
> SLAC/LCLS

[OMPI users] Open MPI mail archives now back online

2016-08-12 Thread Jeff Squyres (jsquyres)
mail-archive.com now has all of the old Open MPI mail archives online.  Examples:

https://www.mail-archive.com/users@lists.open-mpi.org/
https://www.mail-archive.com/devel@lists.open-mpi.org/

Note that there are two different ways you can permalink to messages on 
mail-archive:

1. Take the "main" URL of the message (i.e., the one shown in the address bar 
when you're viewing a message) -- e.g.

https://www.mail-archive.com/users@lists.open-mpi.org/msg28978.html

2. Use the message ID (which uniquely identifies a message) in the form:

http://mid.mail-archive.com/MESSAGE_ID
NOTE: http, not https!

e.g.


http://mid.mail-archive.com/1233038409.12589.1460577425350.JavaMail.yahoo@mail.yahoo.com

The index on each of the mailing lists is a sliding window that only lasts for 
a few thousand messages, but *all* messages are available (even if they're not 
listed on the index pages):

- via their permalinks
- via Google search (give Google a little while to finish indexing all the new 
messages we recently uploaded to mail-archive.com)
- via the mail-archive.com web site search box

Finally, all of the old Open MPI mail archives are still available under 
https://www.open-mpi.org/community/lists/ (so that we don't break lots of old 
links from around the web), but they are frozen.  No new messages have been 
added to the frozen archive since late July 2016 or so.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread r...@open-mpi.org
IIRC, the rationale behind adding the check was that someone using SGE wanted 
to specify a custom launch agent, and we were overriding it with qrsh. However, 
the check is incorrect as that MCA param cannot be NULL.

I have updated this on master - can you see if this fixes the problem for you?

https://github.com/open-mpi/ompi/pull/1957
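
(For context, the custom launch agent in question is the one a user can set via 
the plm_rsh_agent MCA parameter - just a rough sketch here, not the exact command 
from that report:

  mpirun --mca plm_rsh_agent /path/to/custom_rsh -np 16 ./my_app
)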

As for the blank in the cmd line - that is likely due to a space reserved for 
some entry that you aren’t using (e.g., when someone manually specifies the 
prefix). It shouldn’t cause any harm as the cmd line parser is required to 
ignore spaces

The -ldl problem sounds like a configuration issue - you might want to file a 
separate issue about it

> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
> 
> Hi,
> 
> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, which 
> seems to prevent the tight integration with SGE from starting:
> 
>if (NULL == mca_plm_rsh_component.agent) {
> 
> Why is it there (it wasn't in 1.10.3)?
> 
> If I just remove it I get:
> 
> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
>qrsh   orted --hnp-topo-sig ...
> 
> instead of the former:
> 
> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
> --hnp-topo-sig ...
> 
> So, just removing the if-statement is not a perfect cure as the 
> $SGE_ROOT/$ARC does not prefix `qrsh`.
> 
> ==
> 
> BTW: why is there a blank before " orted" in the assembled command line? It 
> really is in the argument that `qrsh_starter` is supposed to start when I 
> check this on the slave nodes. As long as there is a wrapping shell, it will 
> be removed anyway, but in a special setup we noticed this additional blank.
> 
> ==
> 
> I also notice that I have to supply "-ldl" to `mpicc` to allow the 
> compilation of an application to succeed in 2.0.0.
> 
> -- Reuti

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OPENSHMEM ERROR with 2+ Distributed Machines

2016-08-12 Thread r...@open-mpi.org
Just as a suggestion: most of us are leery of opening Word attachments on 
mailing lists. I’d suggest sending this to us as plain text if you want us to 
read it.


> On Aug 12, 2016, at 4:03 AM, Debendra Das  wrote:
> 
> I have installed Open MPI 2.0.0 on 5 systems with IP addresses 172.16.5.29, 
> 172.16.5.30, 172.16.5.31, 172.16.5.32, and 172.16.5.33. While executing the 
> hello_oshmem_c.c program (under the examples directory), the correct output 
> comes only when running on 2 distributed machines, but an error occurs when 3 
> or more distributed machines are used. The outputs and the host file are 
> attached. Can anybody please help me to sort out this error?
> 
> Thanking You.
> Debendranath Das 

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] OPENSHMEM ERROR with 2+ Distributed Machines

2016-08-12 Thread Debendra Das
I have installed Open MPI 2.0.0 on 5 systems with IP addresses 172.16.5.29,
172.16.5.30, 172.16.5.31, 172.16.5.32, and 172.16.5.33. While executing the
hello_oshmem_c.c program (under the examples directory), the correct output
comes only when running on 2 distributed machines, but an error occurs when 3
or more distributed machines are used. The outputs and the host file are
attached. Can anybody please help me to sort out this error?

Thanking You.
Debendranath Das


Error Report.doc
Description: MS-Word document
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users