Re: [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Lenny Verkhovsky
Following Jeff's suggestion, I am adding the devel mailing list.

Hi All,
I am currently facing a strange situation in which I can't run OMPI on more than 65
nodes.
It seems like an environmental issue that does not allow me to open more
connections.
Any ideas?
The log is attached; more info is below in the mail.
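
For reference, a minimal standalone sketch (not from the attached log; only a guess
at possible limits, not a confirmed cause) that prints two per-process limits which
could in principle cap the number of new connections or daemons:

#include <stdio.h>
#include <sys/resource.h>

static void show_limit(const char *name, int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) == 0)
        printf("%-14s soft=%llu hard=%llu\n", name,
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
}

int main(void)
{
    /* Per-process limits that commonly cap how many descriptors/processes can be opened. */
    show_limit("RLIMIT_NOFILE", RLIMIT_NOFILE);   /* max open file descriptors */
    show_limit("RLIMIT_NPROC",  RLIMIT_NPROC);    /* max processes per user */
    return 0;
}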

Running OMPI from trunk
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288

Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky
Sent: Tuesday, August 12, 2014 1:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65


Hi,

Config:
./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc

Run:
/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log

Environment:
According to the attached log, it's an rsh environment.


Output attached

Notes:
The problem is always with the last node: 64 connections work, 65 connections
fail.
node-119.ssauniversal.ssa.kodiak.nx == ko0237

mpi.log line 1034:
--
An invalid value was supplied for an enum variable.
  Variable : orte_debug_daemons
  Value: 1,1
  Valid values : 0: f|false|disabled, 1: t|true|enabled
--

mpi.log line 1059:
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288



Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 4:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Okay, let's start with the basics :-)

How was this configured? What environment are you running in (rsh, slurm, ??)? 
If you configured --enable-debug, then please run it with

--mca plm_base_verbose 5 --debug-daemons

and send the output


On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky 
<len...@mellanox.com> wrote:

I don't think so,
It's always the 66th node, even if I swap between 65th and 66th
I also get the same error when setting np=66, while having only 65 hosts in 
hostfile
(I am using only tcp btl )


Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Looks to me like your 65th host is missing the dstore library - is it possible 
you don't have your paths set correctly on all hosts in your hostfile?


On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky 
<len...@mellanox.com> wrote:


Hi all,

Trying to run Open MPI (trunk revision 32428), I faced a problem running
OMPI with more than 65 procs.
It looks like MPI fails to open the 66th connection even when running `hostname`
over tcp.
It also seems to be unrelated to a specific host.
All hosts are Ubuntu 12.04.1 LTS.

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 288

...
It looks like an environment issue, but I can't find any related limit.
Any ideas?
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/l

Re: [OMPI users] OpenMPI fails with np > 65

2014-08-12 Thread Lenny Verkhovsky

Hi,

Config:
./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
--enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
--disable-openib-connectx-xrc

Run:
/home/sources/ompi-bin/bin/mpirun -np 65 --host 
ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
 --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 --mca 
btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
hostname 2>&1|tee > /tmp/mpi.log

Environment:
According to the attached log, it's an rsh environment.


Output attached

Notes:
The problem is always with the last node: 64 connections work, 65 connections
fail.
node-119.ssauniversal.ssa.kodiak.nx == ko0237

mpi.log line 1034:
--
An invalid value was supplied for an enum variable.
  Variable : orte_debug_daemons
  Value: 1,1
  Valid values : 0: f|false|disabled, 1: t|true|enabled
--

mpi.log line 1059:
[node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
Error in file base/ess_base_std_orted.c at line 288



Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 4:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Okay, let's start with the basics :-)

How was this configured? What environment are you running in (rsh, slurm, ??)? 
If you configured --enable-debug, then please run it with

--mca plm_base_verbose 5 --debug-daemons

and send the output


On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky 
<len...@mellanox.com> wrote:


I don't think so,
It's always the 66th node, even if I swap between 65th and 66th
I also get the same error when setting np=66, while having only 65 hosts in 
hostfile
(I am using only tcp btl )


Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Looks to me like your 65th host is missing the dstore library - is it possible 
you don't have your paths set correctly on all hosts in your hostfile?


On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky 
<len...@mellanox.com> wrote:



Hi all,

Trying to run Open MPI (trunk revision 32428), I faced a problem running
OMPI with more than 65 procs.
It looks like MPI fails to open the 66th connection even when running `hostname`
over tcp.
It also seems to be unrelated to a specific host.
All hosts are Ubuntu 12.04.1 LTS.

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 288

...
It looks like an environment issue, but I can't find any related limit.
Any ideas?
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24961.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24964.php



mpi65.tgz
Description: mpi65.tgz


Re: [OMPI users] OpenMPI fails with np > 65

2014-08-11 Thread Lenny Verkhovsky
I don't think so,
It's always the 66th node, even if I swap between 65th and 66th
I also get the same error when setting np=66, while having only 65 hosts in 
hostfile
(I am using only tcp btl )


Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, August 11, 2014 1:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI fails with np > 65

Looks to me like your 65th host is missing the dstore library - is it possible 
you don't have your paths set correctly on all hosts in your hostfile?


On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky 
<len...@mellanox.com> wrote:


Hi all,

Trying to run Open MPI (trunk revision 32428), I faced a problem running
OMPI with more than 65 procs.
It looks like MPI fails to open the 66th connection even when running `hostname`
over tcp.
It also seems to be unrelated to a specific host.
All hosts are Ubuntu 12.04.1 LTS.

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 288

...
It looks like an environment issue, but I can't find any related limit.
Any ideas?
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24961.php



[OMPI users] OpenMPI fails with np > 65

2014-08-10 Thread Lenny Verkhovsky
Hi all,

Trying to run Open MPI (trunk revision 32428), I faced a problem running
OMPI with more than 65 procs.
It looks like MPI fails to open the 66th connection even when running `hostname`
over tcp.
It also seems to be unrelated to a specific host.
All hosts are Ubuntu 12.04.1 LTS.

mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
--mca btl tcp,self hostname
[nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 288

...
It looks like an environment issue, but I can't find any related limit.
Any ideas?
Thanks.
Lenny Verkhovsky
SW Engineer,  Mellanox Technologies
www.mellanox.com

Office:+972 74 712 9244
Mobile:  +972 54 554 0233
Fax:+972 72 257 9400



Re: [OMPI users] Dual quad core Opteron hangs on Bcast.

2010-01-04 Thread Lenny Verkhovsky
Have you tried the IMB benchmark with Bcast?
I think the problem is in the app.
All ranks in the communicator should enter Bcast;
since you have an
if (rank==0) / else
structure, not all of them enter the same flow:
  if (iRank == 0)
 {
  iLength = sizeof (acMessage);
  MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast (acMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
  printf ("Process 0: Message sent\n");
 }
  else
 {
  MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
  pMessage = (char *) malloc (iLength);
  MPI_Bcast (pMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
  printf ("Process %d: %s\n", iRank, pMessage);
 }
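
For illustration only, a minimal self-contained variant in which every rank runs
the identical Bcast sequence (the real contents of acMessage are not shown in the
thread, so a placeholder string is assumed) and the receive buffer is freed:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int   iRank, iLength = 0;
    char  acMessage[] = "message from rank 0";   /* placeholder contents (assumed) */
    char *pMessage;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &iRank);

    /* Every rank executes the same broadcast sequence: length first, then data. */
    if (iRank == 0)
        iLength = (int) sizeof(acMessage);
    MPI_Bcast(&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);

    pMessage = (iRank == 0) ? acMessage : (char *) malloc(iLength);
    MPI_Bcast(pMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
    printf("Process %d: %s\n", iRank, pMessage);

    if (iRank != 0)
        free(pMessage);   /* free the receive buffer each time; see the leak note below */

    MPI_Finalize();
    return 0;
}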

Lenny.

On Mon, Jan 4, 2010 at 8:23 AM, Eugene Loh  wrote:

>  If you're willing to try some stuff:
>
> 1) What about "-mca coll_sync_barrier_before 100"?  (The default may be
> 1000.  So, you can try various values less than 1000.  I'm suggesting 100.)
> Note that broadcast has somewhat one-way traffic flow, which can have some
> undesirable flow control issues.
>
> 2) What about "-mca btl_sm_num_fifos 16"?  Default is 1.  If the problem is
> trac ticket 2043, then this suggestion can help.
>
> P.S.  There's a memory leak, right?  The receive buffer is being allocated
> over and over again.  Might not be that closely related to the problem you
> see here, but at a minimum it's bad style.
>
> Louis Rossi wrote:
>
> I am having a problem with BCast hanging on a dual quad core Opteron (2382,
> 2.6GHz, Quad Core, 4 x 512KB L2, 6MB L3 Cache) system running FC11 with
> openmpi-1.4.  The LD_LIBRARY_PATH and PATH variables are correctly set.  I
> have used the FC11 rpm distribution of openmpi and built openmpi-1.4 locally
> with the same results.  The problem was first observed in a larger reliable
> CFD code, but I can create the problem with a simple demo code (attached).
> The code attempts to execute 2000 pairs of broadcasts.
>
> The hostfile contains a single line
>  slots=8
>
> If I run it with 4 cores or fewer, the code will run fine.
>
> If I run it with 5 cores or more, it will hang some of the time after
> successfully executing several hundred broadcasts.  The number varies from
> run to run.  The code usually finishes with 5 cores.  The probability of
> hanging seems to increase with the number of nodes.  The syntax I use is
> simple.
>
> mpiexec -machinefile hostfile -np 5 bcast_example
>
> There was some discussion of a similar problem on the user list, but I
> could not find a resolution.  I have tried setting the processor affinity
> (--mca mpi_paffinity_alone 1).  I have tried varying the broadcast algorithm
> (--mca coll_tuned_bcast_algorithm 1-6).  I have also tried excluding (-mca
> oob_tcp_if_exclude) my eth1 interface (see ifconfig.txt attached) which is
> not connected to anything.  None of these changed the outcome.
>
> Any thoughts or suggestions would be appreciated.
>
> --
>
> ___
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] openmpi 1.4 broken -mca coll_tuned_use_dynamic_rules 1

2009-12-30 Thread Lenny Verkhovsky
It may crash if it doesn't see a file with rules.
Try providing it through the command line:
$mpirun -mca coll_tuned_use_dynamic_rules 1 -mca
coll_tuned_dynamic_rules_filename full_path_to_file_  .

On Wed, Dec 30, 2009 at 5:35 PM, Daniel Spångberg wrote:

> Thanks for the help with how to set up the collectives file. I am unable to
> make it work though,
>
> My simple alltoall test is still crashing, although I added even added a
> line specifically for my test commsize of 64 and 100 bytes using bruck.
>
> daniels@kalkyl1:~/.openmpi > cat mca-params.conf
>
> coll_tuned_use_dynamic_rules=1
> coll_base_verbose=0
>
> coll_tuned_dynamic_rules_filename="/home/daniels/.openmpi/dynamic_rules_file"
> daniels@kalkyl1:~/.openmpi > cat dynamic_rules_file
>
> 1 # num of collectives
> 3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
> 1 # number of com sizes
> 64 # comm size 64
> 3 # number of msg sizes
> 0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
> 100 3 0 0 # for message size 100, bruck 1, topo 0, 0 segmentation
>
> 8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
> # end of collective rule
>
> Still it useful to know how to do this, when this issue gets fixed in the
> future!
>
> Daniel
>
>
>
> On 2009-12-30 15:57:50, Lenny Verkhovsky wrote:
>
>
>  The only workaround that I found is a file with dynamic rules.
>> This is an example that George sent me once. It helped for me, until it
>> will
>> be fixed.
>>
>> " Lenny,
>>
>> You asked for dynamic rules but it looks like you didn't provide them.
>> Dynamic rules allow the user to specify which algorithm to be used for
>> each
>> collective based on a set of rules. I corrected the current behavior, so
>> it
>> will not crash. However, as you didn't provide dynamic rules, it will just
>> switch back to default behavior (i.e. ignore the
>> coll_tuned_use_dynamic_rules MCA parameter).
>>
>> As an example, here is a set of dynamic rules. I added some comment to
>> clarify it, but if you have any questions please ask.
>>
>> 2 # num of collectives
>> 3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
>> 1 # number of com sizes
>> 64 # comm size 64
>> 2 # number of msg sizes
>> 0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
>> 8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
>> # end of collective rule
>> #
>> 2 # ID = 2 Allreduce collective (ID in coll_tuned.h)
>> 1 # number of com sizes
>> 1 # comm size 2
>> 2 # number of msg sizes
>> 0 1 0 0 # for message size 0, basic linear 1, topo 0, 0 segmentation
>> 1024 2 0 0 # for messages size > 1024, nonoverlapping 2, topo 0, 0
>> segmentation
>> # end of collective rule
>> #
>>
>> And here is what I have in my $(HOME)/.openmpi/mca-params.conf to activate
>> them:
>> #
>> # Dealing with collective
>> #
>> coll_base_verbose = 0
>>
>> coll_tuned_use_dynamic_rules = 1
>> coll_tuned_dynamic_rules_filename = **the name of the file where you saved
>> the rules **
>>
>> "
>>
>> On Wed, Dec 30, 2009 at 4:44 PM, Daniel Spångberg > >wrote:
>>
>>  Interesting. I found your issue before I sent my report, but I did not
>>> realise that this was the same problem. I see now that your example is
>>> really for openmpi 1.3.4++
>>>
>>> Do you know of a work around? I have not used a rule file before and seem
>>> to be unable to find the documentation for how to use one, unfortunately.
>>>
>>> Daniel
>>>
>>> On 2009-12-30 15:17:17, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>>>
>>>
>>>  This is a known issue,
>>>
>>>>   https://svn.open-mpi.org/trac/ompi/ticket/2087
>>>> Maybe its priority should be raised.
>>>> Lenny.
>>>>
>>>>  ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>
> --
> Daniel Spångberg
> Materialkemi
> Uppsala Universitet
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] openmpi 1.4 broken -mca coll_tuned_use_dynamic_rules 1

2009-12-30 Thread Lenny Verkhovsky
The only workaround that I found is a file with dynamic rules.
This is an example that George sent me once. It worked for me, until this
gets fixed.

" Lenny,

You asked for dynamic rules but it looks like you didn't provide them.
Dynamic rules allow the user to specify which algorithm to be used for each
collective based on a set of rules. I corrected the current behavior, so it
will not crash. However, as you didn't provide dynamic rules, it will just
switch back to default behavior (i.e. ignore the
coll_tuned_use_dynamic_rules MCA parameter).

As an example, here is a set of dynamic rules. I added some comment to
clarify it, but if you have any questions please ask.

2 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
64 # comm size 64
2 # number of msg sizes
0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
# end of collective rule
#
2 # ID = 2 Allreduce collective (ID in coll_tuned.h)
1 # number of com sizes
1 # comm size 2
2 # number of msg sizes
0 1 0 0 # for message size 0, basic linear 1, topo 0, 0 segmentation
1024 2 0 0 # for messages size > 1024, nonoverlapping 2, topo 0, 0
segmentation
# end of collective rule
#

And here is what I have in my $(HOME)/.openmpi/mca-params.conf to activate
them:
#
# Dealing with collective
#
coll_base_verbose = 0

coll_tuned_use_dynamic_rules = 1
coll_tuned_dynamic_rules_filename = **the name of the file where you saved
the rules **

"

On Wed, Dec 30, 2009 at 4:44 PM, Daniel Spångberg wrote:

> Interesting. I found your issue before I sent my report, but I did not
> realise that this was the same problem. I see now that your example is
> really for openmpi 1.3.4++
>
> Do you know of a work around? I have not used a rule file before and seem
> to be unable to find the documentation for how to use one, unfortunately.
>
> Daniel
>
> On 2009-12-30 15:17:17, Lenny Verkhovsky wrote:
>
>
>  This is a known issue:
>>https://svn.open-mpi.org/trac/ompi/ticket/2087
>> Maybe its priority should be raised.
>> Lenny.
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] openmpi 1.4 broken -mca coll_tuned_use_dynamic_rules 1

2009-12-30 Thread Lenny Verkhovsky
This is a known issue:
https://svn.open-mpi.org/trac/ompi/ticket/2087
Maybe its priority should be raised.
Lenny.

On Wed, Dec 30, 2009 at 12:13 PM, Daniel Spångberg wrote:

> Dear OpenMPI list,
>
> I have used the dynamic rules for collectives to be able to select one
> specific algorithm. With the latest versions of openmpi this seems to be
> broken. Just enabling coll_tuned_use_dynamic_rules causes the code to
> segfault. However, I do not provide a file with rules, since I just want to
> modify the behavior of one routine.
>
> I have tried the below example code on openmpi 1.3.2, 1.3.3, 1.3.4, and
> 1.4. It *works* on 1.3.2, 1.3.3, but segfaults on 1.3.4 and 1.4. I have
> confirmed this on Scientific Linux 5.2, and 5.4. I have also successfully
> reproduced the crash using version 1.4 running on debian etch. All running
> on amd64, compiled from source without other options to configure than
> --prefix. The crash occurs whether I use the intel 11.1 compiler (via env
> CC) or gcc. It also occurs no matter the btl is set to openib,self tcp,self
> sm,self or combinations of those. See below for ompi_info and other info. I
> have tried MPI_Alltoall, MPI_Alltoallv, and MPI_Allreduce which behave the
> same.
>
> #include <stdlib.h>
> #include <mpi.h>
>


>
> int main(int argc, char **argv)
> {
>  int rank,size;
>  char *buffer, *buffer2;
>
>  MPI_Init(&argc,&argv);
>
>  MPI_Comm_size(MPI_COMM_WORLD,&size);
>  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>
>  buffer=calloc(100*size,1);
>  buffer2=calloc(100*size,1);
>
>  MPI_Alltoall(buffer,100,MPI_BYTE,buffer2,100,MPI_BYTE,MPI_COMM_WORLD);
>
>  MPI_Finalize();
>  return 0;
> }
>
> Demonstrated behaviour:
>
> $ ompi_info
> Package: Open MPI daniels@arthur Distribution
>Open MPI: 1.4
>   Open MPI SVN revision: r22285
>   Open MPI release date: Dec 08, 2009
>Open RTE: 1.4
>   Open RTE SVN revision: r22285
>   Open RTE release date: Dec 08, 2009
>OPAL: 1.4
>   OPAL SVN revision: r22285
>   OPAL release date: Dec 08, 2009
>Ident string: 1.4
>  Prefix:
> /home/daniels/src/MISC/openmpi-1.4/openmpi-1.4_install
>  Configured architecture: x86_64-unknown-linux-gnu
>  Configure host: arthur
>   Configured by: daniels
>   Configured on: Tue Dec 29 16:54:37 CET 2009
>  Configure host: arthur
>Built by: daniels
>Built on: Tue Dec 29 17:04:36 CET 2009
>  Built host: arthur
>  C bindings: yes
>C++ bindings: yes
>  Fortran77 bindings: yes (all)
>  Fortran90 bindings: yes
>  Fortran90 bindings size: small
>  C compiler: gcc
> C compiler absolute: /usr/bin/gcc
>C++ compiler: g++
>   C++ compiler absolute: /usr/bin/g++
>  Fortran77 compiler: gfortran
>  Fortran77 compiler abs: /usr/bin/gfortran
>  Fortran90 compiler: gfortran
>  Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
>   C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
>  C++ exceptions: no
>  Thread support: posix (mpi: no, progress: no)
>   Sparse Groups: no
>  Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
>   Heterogeneous support: no
>  mpirun default --prefix: no
> MPI I/O support: yes
>   MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
>   FT Checkpoint support: no  (checkpoint thread: no)
>   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4)
>  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4)
>   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4)
>
>   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4)
>   MCA carto: file (MCA v2.0, API v2.0, Component v1.4)
>   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4)
>   MCA timer: linux (MCA v2.0, API v2.0, Component v1.4)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4)
>  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4)
>   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4)
>   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: basic (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: inter (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: self (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: sm (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: sync (MCA v2.0, API v2.0, Component v1.4)
>MCA coll: tuned (MCA v2.0, API

Re: [OMPI users] mpirun not working on more than one node

2009-11-17 Thread Lenny Verkhovsky
I noticed that you also have different versions of OMPI: you have 1.3.2 on
node1 and 1.3 on node2.
Can you try putting the same version of OMPI on both nodes?
Can you also try running np 16 on node1 when you run on it separately?
Lenny.

On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller wrote:

>
>
> >>> Ralph Castain 11/17/09 4:04 PM >>>
>
> >Your cmd line is telling OMPI to run 17 processes. Since your hostfile
> indicates that only 16 of them are to >run on 10.4.23.107 (which I assume is
> your PS3 node?), 1 process is going to be run on 10.4.1.23 (I assume >this
> is node1?).
> node1 has 16 Cores (4 x AMD Quad Core Processors)
>
> node2 is the ps3 with two processors (slots)
>
>
> >I would guess that the executable is compiled to run on the PS3 given your
> specified path, so I would >expect it to bomb on node1 - which is exactly
> what appears to be happening.
> the executable is compiled on each node separately and lies at each node in
> the same directory
>
> /mnt/projects/PS3Cluster/Benchmark/pi
> on each node different directories are mounted. so there exists a separate
> executable file compiled at each node.
>
> in the end i want to run R on this cluster with Rmpi - as i get a similar
> problem there i first wanted to try with a C program.
>
> with R the same thing happens: it works when i start it on each node, but if
> i want to start more than 16 processes on node one it exits.
>
>
> On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:
>
> Hi,
>
> i want to build a cluster with openmpi.
>
> 2 nodes:
> node 1: 4 x Amd Quad Core, ubuntu 9.04, openmpi 1.3.2
> node 2: Sony PS3, ubuntu 9.04, openmpi 1.3
>
> both can connect with ssh to each other and to itself without passwd.
>
> I can run the sample program pi.c on both nodes separately (see below). But
> if i try to start it on node1 with --hostfile option to use node 2 "remote"
> i got this error:
>
> cluster@bioclust:~$  mpirun --hostfile
> /etc/openmpi/openmpi-default-hostfile -np 17
> /mnt/projects/PS3Cluster/Benchmark/pi
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> my hostfile:
> cluster@bioclust:~$  cat
> /etc/openmpi/openmpi-default-hostfile
> 10.4.23.107 slots=16
> 10.4.1.23 slots=2
> i can see with top that the processors of node2 begin to work shortly, then
> it apports on node1.
>
> I use this sample/test program:
> #include <stdio.h>
> #include <stdlib.h>
> #include "mpi.h"
> int main(int argc, char *argv[])
> {
>   int    i, n;
>   double h, pi, x;
>   int    me, nprocs;
>   double piece;
> /* --- */
>   MPI_Init (&argc, &argv);
>   MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
>   MPI_Comm_rank (MPI_COMM_WORLD, &me);
> /* --- */
>   if (me == 0)
>   {
>  printf("%s", "Input number of intervals:\n");
>  scanf ("%d", &n);
>   }
> /* --- */
>   MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
> /* --- */
>   h = 1. / (double) n;
>   piece = 0.;
>   for (i=me+1; i <= n; i+=nprocs)
>   {
>x = (i-1)*h;
>piece = piece + ( 4/(1+(x)*(x)) + 4/(1+(x+h)*(x+h))) / 2 * h;
>   }
>   printf("%d: pi = %25.15f\n", me, piece);
> /* --- */
>   MPI_Reduce (&piece, &pi, 1, MPI_DOUBLE,
>   MPI_SUM, 0, MPI_COMM_WORLD);
> /* --- */
>   if (me == 0)
>   {
>  printf("pi = %25.15f\n", pi);
>   }
> /* --- */
>  MPI_Finalize();
>   return 0;
> }
> it works on each node.
> node1:
> cluster@bioclust:~$  mpirun -np 4
> /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 20
> 0: pi = 0.822248040052981
> 2: pi = 0.773339953424083
> 3: pi = 0.747089984650041
> 1: pi = 0.798498008827023
> pi = 3.141175986954128
>
> node2:
> cluster@kasimir:~$  mpirun -np 2
> /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 5
> 1: pi = 1.267463056905495
> 0: pi = 1.867463056905495
> pi = 3.134926113810990
> cluster@kasimir:~$ 
>
> Thx in advance,
> Laurin
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] mpirun failure

2009-10-18 Thread Lenny Verkhovsky
You can use the full path to mpirun, or you can set the prefix:
$mpirun -prefix path/to/mpi/home -np ...
Lenny.

On Sun, Oct 18, 2009 at 12:03 PM, Oswald Benedikt wrote:

> Hi, thanks, that's what puzzled me when I saw the reference to 1.3, but the
> LD_LIBRARY_PATH was set to point
> to the respective version, i.e. 1.3.2 or 1.3.3 and the 1.3 executables were
> not in the PATH.
>
> Are there any other env variables or . files that need to be set ?
>
> Benedikt
>
>
> -Original Message-
> From: users-boun...@open-mpi.org on behalf of Ralph Castain
> Sent: Sun 18.10.2009 6:04
> To: Open MPI Users
> Subject: Re: [OMPI users] mpirun failure
>
> Looks to me like you may be picking up an earlier version when you
> launch. At least, when I look at the error message, it says that it
> came from a file in the openmpi-1.3 directory tree. Yet you say you
> installed 1.3.2 and 1.3.3.
>
> Any chance your ld_library_path is pointing at the older version?
>
>
> On Oct 17, 2009, at 11:29 AM, Oswald Benedikt wrote:
>
> > Dear open-mpi users / developers, maybe this problem has been
> > treated before but at least I can not find it:
> >
> > I have tried both open mpi 1.3.2 and 1.3.3 on Mac OS X (10.5.8).
> > Compilation and installation of openmpi
> > works well, also compilation and linking of users applications.
> > However, when I want to start an application
> > with mpirun, it crashes, both for open mpi 1.3.3. and 1.3.2 as
> > follows:
> >
> >
> >
> > benedikt-oswalds-macbook-pro:mieScatteringDispersive benediktoswald$
> > mpirun -np 2 ../../../hades3d/hades3d --option-
> > file=mieScatteringDispersive.job
> > [benedikt-oswalds-macbook-pro.local:50793] [[7314,1],0]
> > ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.3/orte/
> > mca/ess/env/ess_env_module.c at line 235
> > [benedikt-oswalds-macbook-pro.local:50793] [[7314,1],0]
> > ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.3/orte/
> > mca/ess/env/ess_env_module.c at line 261
> > [benedikt-oswalds-macbook-pro.local:50794] [[7314,1],1]
> > ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.3/
> > orte/mca/ess/base/ess_base_nidmap.c at line 153
> > [benedikt-oswalds-macbook-pro.local:50794] [[7314,1],1]
> > ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.3/orte/
> > mca/ess/env/ess_env_module.c at line 235
> > [benedikt-oswalds-macbook-pro.local:50794] [[7314,1],1]
> > ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.3/
> > orte/mca/ess/base/ess_base_nidmap.c at line 153
> > [benedikt-oswalds-macbook-pro.local:50794] [[7314,1],1]
> > ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.3/orte/
> > mca/ess/env/ess_env_module.c at line 261
> >
> --
> > It looks like MPI_INIT failed for some reason; your parallel process
> > is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or
> > environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >  orte_grpcomm_modex failed
> >  --> Returned "Not found" (-13) instead of "Success" (0)
> >
> --
> > *** An error occurred in MPI_Init
> > *** before MPI was initialized
> > *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > [benedikt-oswalds-macbook-pro.local:50794] Abort before MPI_INIT
> > completed successfully; not able to guarantee that all other
> > processes were killed!
> > [benedikt-oswalds-macbook-pro.local:50794] [[7314,1],1]
> > ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.3/orte/
> > mca/ess/env/ess_env_module.c at line 297
> > [benedikt-oswalds-macbook-pro.local:50794] [[7314,1],1]
> > ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.3/orte/
> > mca/grpcomm/bad/grpcomm_bad_module.c at line 559
> >
> --
> > mpirun has exited due to process rank 1 with PID 50794 on
> > node benedikt-oswalds-macbook-pro.local exiting without calling
> > "finalize". This may
> > have caused other processes in the application to be
> > terminated by signals sent by mpirun (as reported here).
> >
> --
> >
> >
> > Can anyone comment on this ? Is this a basic installation or path
> > problem ?
> >
> > openmpi 1.3 does not show this problem.
> >
> > Thanks, Benedikt
> >
> >
> >
> >
> >
> --
> > Benedikt Oswald, Dr. sc. techn., dipl. El. Ing. ETH, www.psi.ch,
> > Computational Accelerator Scientist
> > Paul Scherrer  Institute (PSI), CH-5232 Villigen, Suisse,
> benedikt.osw...@psi.ch
> > , +41(0)56 310 32 12
> > "Passion is required for any grea

Re: [OMPI users] cartofile

2009-09-21 Thread Lenny Verkhovsky
Hi Eugene,
A carto file is a file with a static graph topology of your node.
You can see an example in opal/mca/carto/file/carto_file.h.
(Yes, I know, this should be in a help/man page :) )
Basically it describes a map of your node and its internal interconnect.
Hopefully it will be discovered automatically someday,
but for now you can describe your node manually.
Best regards
Lenny.

On Thu, Sep 17, 2009 at 12:38 AM, Eugene Loh  wrote:

> I feel like I should know, but what's a cartofile?  I guess you supply
> "topological" information about a host, but I can't tell how this
> information is used by, say, mpirun.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] unable to access or execute

2009-09-15 Thread Lenny Verkhovsky
You can use a shared (e.g. NFS) folder for this app, or provide the full
PATH to it.
For example:
$mpirun -np 2 -hostfile hostfile  /home/user/app

2009/9/15 Dominik Táborský 

> So I have to manually copy the compiled hello world program to all of
> the nodes so that they can be executed? I really didn't expect that...
>
> So, where (in the filesystem) does the executable have to be? On the
> same place as on the master?
>
> Thanks
>
> Dominik
>
>
> Ralph Castain wrote on Tue, 15. 09. 2009 at 02:27 +0200:
> > I assume this
> > executable doesn't have to be on the node - that would be
> > silly
> >
> > Not silly at all - we don't preposition the binary for you. It has to
> > be present on the node where it is to be executed.
> >
> > I have added an option to preposition binaries in the OMPI developer's
> > trunk, but that feature isn't scheduled for release until the next
> > major code release.
> >
> > 2009/9/14 Dominik Táborský 
> > Hi again,
> >
> > Since last time I made progress - I compiled openMPI 1.3.3
> > from sources,
> > now I'm trying to run it on one of my nodes. I am using the
> > same
> > software on the master, but master is Ubuntu 9.04 (NOT using
> > openMPI
> > 1.3.2 from repos) and the node is my own Linux system - it
> > lacks many
> > features so there might be the source of the problem.
> >
> > When I try to run hello world program, it gives me this error:
> >
> > $ /openMPI/bin/mpirun
> > -hostfile /home/eddy/Dreddux/host.machine5
> ./projekty/openMPI/hello/hello
> >
> --
> > mpirun was unable to launch the specified application as it
> > could not
> > access
> > or execute an executable:
> >
> > Executable: ./projekty/openMPI/hello/hello
> > Node: machine5
> >
> > while attempting to start process rank 0.
> >
> --
> >
> > The executable is hello world program and is executable. I
> > assume this
> > executable doesn't have to be on the node - that would be
> > silly. So, I
> > don't understand what am I missing. Any ideas? Please!
> >
> > Dominik
> >
> > PS: thanks for your time!
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] [OMPI devel] Error message improvement

2009-09-09 Thread Lenny Verkhovsky
Hi All,
is a C99-compliant compiler something unusual,
or is there a policy among OMPI developers/users that prevents me
from using __func__ instead of hardcoded strings in the code?
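
For illustration, a minimal sketch of what I mean (the macro name and message
below are made up, not taken from the OMPI tree):

#include <stdio.h>

/* Hypothetical error macro: with a C99 compiler, __func__ supplies the
 * enclosing function name automatically instead of a hardcoded string. */
#define MY_ERROR_LOG(msg) \
    fprintf(stderr, "[%s:%d:%s] %s\n", __FILE__, __LINE__, __func__, (msg))

static void open_connection(void)
{
    MY_ERROR_LOG("could not open connection");   /* prints "open_connection" */
}

int main(void)
{
    open_connection();
    return 0;
}
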
Thanks.
Lenny.

On Wed, Sep 9, 2009 at 1:48 PM, Nysal Jan  wrote:

> __FUNCTION__ is not portable.
> __func__ is but it needs a C99 compliant compiler.
>
> --Nysal
>
> On Tue, Sep 8, 2009 at 9:06 PM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> fixed in r21952
>> thanks.
>>
>> On Tue, Sep 8, 2009 at 5:08 PM, Arthur Huillet 
>> wrote:
>>
>>> Lenny Verkhovsky wrote:
>>>
>>>> Why not using __FUNCTION__  in all our error messages ???
>>>>
>>>
>>> Sounds good, this way the function names are always correct.
>>>
>>> --
>>> Greetings, A. Huillet
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI users] Help: Infiniband interface hang

2009-09-02 Thread Lenny Verkhovsky
Have you tried running hostname?
$mpirun -np 2 --mca btl openib,self --host node1,node2 hostname
If it hangs, it's not an Open MPI problem; check your setup,
especially your firewall settings, and try disabling the firewall.

On Wed, Sep 2, 2009 at 2:06 PM, Lee Amy  wrote:

> Hi,
>
> I encountered a very confusing problem when running IMB across two
> nodes by using IB.
>
> OS: RHEL 5.2
> OFED Version: 1.4.1
> MPI: OpenMPI 1.3.2 (OFED owned)
>
> I run IMB-MPI1 provided by OFED OpenMPI tests. The command line is
>
> mpirun -np 2 --mca btl openib,self --host node1,node2 IMB-MPI1
>
> After that the machine hangs and no output, and I cannot see any exist
> mpirun related programs. Then I use Ctrl-C to stop the hang process.
> Following messages are reported when I press Ctrl-C.
>
>
> mpirun: killing job...
>
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> --
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
>node2 - daemon did not report back when launched
>
> I use strace to get some detailed messages.
>
> ===Strace output start===
> execve("/usr/mpi/gcc/openmpi-1.3.2/bin/mpirun", ["mpirun", "-np", "2",
> "--mca", "btl", "openib,self", "--host", "node1,node2", "IMB-MPI1",
> "pingpong"], [/* 26 vars */]) = 0
> brk(0)  = 0x1e45a000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x2aaab000
> uname({sys="Linux", node="node1", ...}) = 0
> access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or
> directory)
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/tls/x86_64/libopen-rte.so.0",
> O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gcc/openmpi-1.3.2/lib64/tls/x86_64", 0x743b6080) =
> -1 ENOENT (No such file or directory)
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/tls/libopen-rte.so.0",
> O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gcc/openmpi-1.3.2/lib64/tls", 0x743b6080) = -1
> ENOENT (No such file or directory)
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/x86_64/libopen-rte.so.0",
> O_RDONLY) = -1 ENOENT (No such file or directory)
> stat("/usr/mpi/gcc/openmpi-1.3.2/lib64/x86_64", 0x743b6080) = -1
> ENOENT (No such file or directory)
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/libopen-rte.so.0", O_RDONLY) = 3
> read(3,
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\316\0\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1308252, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x2aaac000
> mmap(NULL, 2382848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x2aaad000
> mprotect(0x2aaef000, 2093056, PROT_NONE) = 0
> mmap(0x2acee000, 16384, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x41000) = 0x2acee000
> mmap(0x2acf2000, 3072, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2acf2000
> close(3)= 0
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/libopen-pal.so.0", O_RDONLY) = 3
> read(3,
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\337\0\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1398411, ...}) = 0
> mmap(NULL, 2562984, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x2acf3000
> mprotect(0x2ad3f000, 2097152, PROT_NONE) = 0
> mmap(0x2af3f000, 12288, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4c000) = 0x2af3f000
> mmap(0x2af42000, 142248, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2af42000
> close(3)= 0
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/libdl.so.2", O_RDONLY) = -1
> ENOENT (No such file or directory)
> open("/etc/ld.so.cache", O_RDONLY)  = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=184283, ...}) = 0
> mmap(NULL, 184283, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2af65000
> close(3)= 0
> open("/lib64/libdl.so.2", O_RDONLY) = 3
> read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0
> \16\300F:\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=23520, ...}) = 0
> mmap(0x3a46c0, 2109728, PROT_READ|PROT_EXEC,
> MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3a46c0
> mprotect(0x3a46c02000, 2097152, PROT_NONE) = 0
> mmap(0x3a46e02000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x3a46e02000
> close(3)= 0
> open("/usr/mpi/gcc/openmpi-1.3.2/lib64/libnsl.so.1", O_RDONLY) = -1
> ENOENT (No 

Re: [OMPI users] rankfile error on openmpi/1.3.3

2009-09-01 Thread Lenny Verkhovsky
I changed the error message; I hope it is clearer now.
r21919.

On Tue, Sep 1, 2009 at 2:13 PM, Lenny Verkhovsky  wrote:

> please try using full ( drdb0235.en.desres.deshaw.com ) hostname
> in the hostfile/rankfile.
> It should help.
> Lenny.
>
> On Mon, Aug 31, 2009 at 7:43 PM, Ralph Castain  wrote:
>
>> I'm afraid the rank-file mapper in 1.3.3 has several known problems that
>> have been described on the list by users. We hopefully have those fixed in
>> the upcoming 1.3.4 release.
>>
>> On Aug 31, 2009, at 10:01 AM, Sacerdoti, Federico wrote:
>>
>>  Hi,
>>
>> I am trying to use the rankmap to bind a 4-proc mpi job to one socket of a
>> two-socket, 8 core machine. However I'm getting a strange error.
>>
>> CMDS USED
>> orterun --hostfile hostlist.1 -n 4  --mca rmaps_rank_file_path ./rankmap.1
>> desres-netscan  -o $OUTDIR
>>
>> $ cat rankmap.1
>> rank 0=drdb0235.en slot=0:0
>> rank 1=drdb0235.en slot=0:1
>> rank 2=drdb0235.en slot=0:2
>> rank 3=drdb0235.en slot=0:3
>>
>> $ cat hostlist.1
>> drdb0235.en slots=8
>> ERROR SEEN
>> --
>> Rankfile claimed host drdb0235.en that was not allocated or oversubscribed
>> it's slots:
>> --
>> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
>> parameter in file rmaps_rank_file.c at line 108
>> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
>> parameter in file base/rmaps_base_map_job.c at line 87
>> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
>> parameter in file base/plm_base_launch_support.c at line 77
>> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
>> parameter in file plm_rsh_module.c at line 985
>>
>> From looking at the code in rmaps_rank_file.c it seems the error occurs
>> when the node-gathering code wraps twice around the hostlist. However I dont
>> see why that is happening.
>>
>> If I specify 8 slots in the rankmap, I see a different error: Error,
>> invalid rank (4) in the rankfile (./rankmap.1)
>>
>> Thanks,
>> Federico
>>
>>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] rankfile error on openmpi/1.3.3

2009-09-01 Thread Lenny Verkhovsky
Please try using the full hostname (drdb0235.en.desres.deshaw.com)
in the hostfile/rankfile.
It should help.
Lenny.

On Mon, Aug 31, 2009 at 7:43 PM, Ralph Castain  wrote:

> I'm afraid the rank-file mapper in 1.3.3 has several known problems that
> have been described on the list by users. We hopefully have those fixed in
> the upcoming 1.3.4 release.
>
> On Aug 31, 2009, at 10:01 AM, Sacerdoti, Federico wrote:
>
>  Hi,
>
> I am trying to use the rankmap to bind a 4-proc mpi job to one socket of a
> two-socket, 8 core machine. However I'm getting a strange error.
>
> CMDS USED
> orterun --hostfile hostlist.1 -n 4  --mca rmaps_rank_file_path ./rankmap.1
> desres-netscan  -o $OUTDIR
>
> $ cat rankmap.1
> rank 0=drdb0235.en slot=0:0
> rank 1=drdb0235.en slot=0:1
> rank 2=drdb0235.en slot=0:2
> rank 3=drdb0235.en slot=0:3
>
> $ cat hostlist.1
> drdb0235.en slots=8
> ERROR SEEN
> --
> Rankfile claimed host drdb0235.en that was not allocated or oversubscribed
> it's slots:
> --
> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
> parameter in file rmaps_rank_file.c at line 108
> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
> parameter in file base/rmaps_base_map_job.c at line 87
> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
> parameter in file base/plm_base_launch_support.c at line 77
> [drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG: Bad
> parameter in file plm_rsh_module.c at line 985
>
> From looking at the code in rmaps_rank_file.c it seems the error occurs
> when the node-gathering code wraps twice around the hostlist. However I dont
> see why that is happening.
>
> If I specify 8 slots in the rankmap, I see a different error: Error,
> invalid rank (4) in the rankfile (./rankmap.1)
>
> Thanks,
> Federico
>
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Help: OFED Version problem

2009-08-31 Thread Lenny Verkhovsky
You need to check the release notes and compare the differences.
Also check the Open MPI version in both of them.
In general it's not a good idea to run different versions of the software,
for performance comparison or at all,
since both of them are open source and backward compatibility is not always
promised (IMHO).
Lenny.

On Mon, Aug 31, 2009 at 7:54 AM, Lee Amy  wrote:

> Hi,
>
> I have two machines with RHEL 5.2, then I installed OFED 1.4.1 driver
> on the first machine, the second machine is using OFED 1.3.1 by RHEL
> owned. My question is if the different version of OFED drivers will
> affect performance?
>
> Thanks.
>
> Eric Lee
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] VMware and OpenMPI

2009-08-27 Thread Lenny Verkhovsky
Hi all,
Does Open MPI support VMware?
I am trying to run Open MPI 1.3.3 on VMware and it got stuck during the OSU
benchmarks and IMB.
It looks like a random deadlock; I wonder if anyone has ever tried it?
thanks,
Lenny.


Re: [OMPI users] Program runs successfully...but with error messages displayed

2009-08-27 Thread Lenny Verkhovsky
Most likely you compiled Open MPI with the --with-openib flag, but since there are
no openib devices available on
the n06 machine, you got an error.
You can "disable" this message either by recompiling Open MPI without the openib
flag, or by disabling the openib btl:
-mca btl ^openib
or
-mca btl sm,self,tcp
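
As a sketch, the same choice can be made persistent in $HOME/.openmpi/mca-params.conf
(the same parameter in file form; adjust it to your own setup):

btl = ^openib
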
Lenny.

On Thu, Aug 27, 2009 at 1:36 PM, Jean Potsam  wrote:

> Dear All,
>   I have installed openmpi 1.3.2 on one of the nodes of our
> cluster and am running a simple hello world mpi program. The program runs fine
> but I get a lot of unexpected messages mixed in with the result.
>
> ##
>
> jean@n06:~/examples$ mpirun -np 2 --host n06 hello_c
> libibverbs: Fatal: couldn't read uverbs ABI version.
> --
> [[11410,1],1]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
>   Host: n06
>
> Another transport will be used instead, although this may result in
> lower performance.
> --
> libibverbs: Fatal: couldn't read uverbs ABI version.
>
> Hello, world, I am 0 of 2 and running on n06
> Hello, world, I am 1 of 2 and running on n06
>
>
> [n06:08470] 1 more process has sent help message help-mpi-btl-base.txt /
> btl:no-nics
> [n06:08470] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
>
> ##
>
> Does anyone know why these messages appear and how to fix this.
>
> Thanks
>
> Jean
>
> start: -00-00 end: -00-00
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] problem with LD_LIBRARY_PATH???

2009-08-19 Thread Lenny Verkhovsky
Sounds like an environment problem.
Try running:
$mpirun -prefix /home/jean/openmpisof/ ...
Lenny.

On Wed, Aug 19, 2009 at 5:36 PM, Jean Potsam  wrote:

> Hi All,
>   I'm trying to install openmpi with self. However, I am
> experiencing some problems with openmpi itself.
>
> I have successfully installed the software and added the path in the
> .bashrc file as follows:
>
> export PATH="/home/jean/openmpisof/bin:$PATH"
> export LD_LIBRARY_PATH="/home/jean/openmpisof/lib:$LD_LIBRARY_PATH"
>
> when i run my mpi application specifying the whole path to mpirun, it works
> fine.
>
> jean:$ /home/jean/openmpisof/bin/mpirun mympi
>
> however if I do:
> jean:$ mpirun mympi
>
> I get:
>
> 
> bash: orted: command not found
> --
> A daemon (pid 8464) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
>
> ##
>
> I am using  a single processor desktop PC with linux Ubuntu as the OS.
>
> Please email me of you have any solution for this problem.
>
> Cheers
>
> Jean
>
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] rank file error: Rankfile claimed...

2009-08-17 Thread Lenny Verkhovsky
Can you try not specifying "max-slots" in the hostfile?
If you are the only user of the nodes, there will be no oversubscribing of
the processors.
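
For example, a sketch of your th_02 file (quoted below) with the max-slots
entries dropped:

nano_00.uzh.ch  slots=2
nano_02.uzh.ch  slots=2
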
This one definitely looks like a bug,
but as Ralph said, there is ongoing discussion and work on this
component.
Lenny.

On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain  wrote:

> Is there an explanation for this?
>>
>
> I believe the word is "bug". :-)
>
> The rank_file mapper has been substantially revised lately - we are
> discussing now how much of that revision to bring to 1.3.4 versus the next
> major release.
>
> Ralph
>
> On Aug 17, 2009, at 4:45 AM, jody wrote:
>
>  Hi Lenny
>>
>>  I think it has something to do with your environment,  /etc/hosts, IT
>>> setup,
>>> hostname function return value e.t.c
>>> I am not sure if it has something to do with Open MPI at all.
>>>
>>
>> OK. I just thought this was Open MPI related because i was able to use the
>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>> the host file...
>>
>> However, I encountered a new problem:
>> if the rankfile lists all the entries which occur in the host file
>> there is an error message.
>> In the following example, the hostfile is
>> [jody@plankton neander]$ cat th_02
>> nano_00.uzh.ch  slots=2 max-slots=2
>> nano_02.uzh.ch  slots=2 max-slots=2
>>
>> and the rankfile is:
>> [jody@plankton neander]$ cat rf_02
>> rank  0=nano_00.uzh.ch  slot=0
>> rank  2=nano_00.uzh.ch  slot=1
>> rank  1=nano_02.uzh.ch  slot=0
>> rank  3=nano_02.uzh.ch  slot=1
>>
>> Here is the error:
>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02  -rf rf_02
>> ./HelloMPI
>> --
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   ./HelloMPI
>>
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>>
>> --
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> --
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> mpirun: clean termination accomplished
>>
>> If i use a hostfile with one more entry
>> [jody@aim-plankton neander]$ cat th_021
>> aim-nano_00.uzh.ch  slots=2 max-slots=2
>> aim-nano_02.uzh.ch  slots=2 max-slots=2
>> aim-nano_01.uzh.ch  slots=1 max-slots=1
>>
>> Then this works fine:
>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021  -rf rf_02
>> ./HelloMPI
>>
>> Is there an explanation for this?
>>
>> Thank You
>>  Jody
>>
>>  Lenny.
>>> On Mon, Aug 17, 2009 at 12:59 PM, jody  wrote:
>>>
>>>>
>>>> Hi Lenny
>>>>
>>>> Thanks - using the full names makes it work!
>>>> Is there a reason why the rankfile option treats
>>>> host names differently than the hostfile option?
>>>>
>>>> Thanks
>>>>  Jody
>>>>
>>>>
>>>>
>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>>>> Verkhovsky wrote:
>>>>
>>>>> Hi
>>>>> This message means
>>>>> that you are trying to use host "plankton", that was not allocated via
>>>>> hostfile or hostlist.
>>>>> But according to the files and command line, everything seems fine.
>>>>> Can you try using "plankton.uzh.ch" hostname instead of "plankton".
>>>>> thanks
>>>>> Lenny.
>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody  wrote:
>>>>>
>>>>>

Re: [OMPI users] rank file error: Rankfile claimed...

2009-08-17 Thread Lenny Verkhovsky
I think it has something to do with your environment: /etc/hosts, IT setup,
the hostname function return value, etc.
I am not sure it has anything to do with Open MPI at all.
Lenny.
On Mon, Aug 17, 2009 at 12:59 PM, jody  wrote:

> Hi Lenny
>
> Thanks - using the full names makes it work!
> Is there a reason why the rankfile option treats
> host names differently than the hostfile option?
>
> Thanks
>   Jody
>
>
>
> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
> Verkhovsky wrote:
> > Hi
> > This message means
> > that you are trying to use host "plankton", that was not allocated via
> > hostfile or hostlist.
> > But according to the files and command line, everything seems fine.
> > Can you try using "plankton.uzh.ch" hostname instead of "plankton".
> > thanks
> > Lenny.
> > On Mon, Aug 17, 2009 at 10:36 AM, jody  wrote:
> >>
> >> Hi
> >>
> >> When i use a rankfile, i get an error message which i don't understand:
> >>
> >> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
> >> ./HelloMPI
> >>
> --
> >> Rankfile claimed host plankton that was not allocated or
> >> oversubscribed it's slots:
> >>
> >>
> --
> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> >> file rmaps_rank_file.c at line 108
> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> >> file base/rmaps_base_map_job.c at line 87
> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> >> file base/plm_base_launch_support.c at line 77
> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> >> file plm_rsh_module.c at line 990
> >>
> --
> >> A daemon (pid unknown) died unexpectedly on signal 1  while attempting
> to
> >> launch so we are aborting.
> >>
> >> There may be more information reported by the environment (see above).
> >>
> >> This may be because the daemon was unable to find all the needed shared
> >> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the
> >> location of the shared libraries on the remote nodes and this will
> >> automatically be forwarded to the remote nodes.
> >>
> --
> >>
> --
> >> mpirun noticed that the job aborted, but has no info as to the process
> >> that caused that situation.
> >>
> --
> >> mpirun: clean termination accomplished
> >>
> >>
> >>
> >> With out the '-rf rankfile' option everything works as expected.
> >>
> >> My hostfile :
> >> [jody@plankton tests]$ cat testhosts
> >> # The following node is a quad-processor machine, and we absolutely
> >> # want to disallow over-subscribing it:
> >> plankton slots=3  max-slots=3
> >> # The following nodes are dual-processor machines:
> >> nano_00  slots=2 max-slots=2
> >> nano_01  slots=2 max-slots=2
> >> nano_02  slots=2 max-slots=2
> >> nano_03  slots=2 max-slots=2
> >> nano_04  slots=2 max-slots=2
> >> nano_05  slots=2 max-slots=2
> >> nano_06  slots=2 max-slots=2
> >>
> >> my rank file:
> >> [jody@plankton neander]$ cat rankfile
> >> rank  0=nano_00  slot=1
> >> rank  1=plankton slot=0
> >> rank  2=nano_01  slot=1
> >>
> >> my Open MPI version: 1.3.2
> >>
> >> i get the same error if i use a rankfile which has a single line
> >>  rank  0=plankton  slot=0
> >> (plankton is my local machine) and call mpirun with np 1
> >>
> >> What does the "Rankfile claimed..." message mean?
> >> Did i make an error in my rankfile?
> >> If yes, what would be the correct way to write it?
> >>
> >> Thank You
> >>  Jody
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] rank file error: Rankfile claimed...

2009-08-17 Thread Lenny Verkhovsky
Hi
This message means
that you are trying to use the host "plankton", which was not allocated via
a hostfile or host list.
But according to your files and command line, everything seems fine.
Can you try using the "plankton.uzh.ch" hostname instead of "plankton"?
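For example, a rankfile written with fully qualified names would look
something like this (a sketch, assuming all the nodes are in the uzh.ch domain):

$cat rankfile
rank  0=nano_00.uzh.ch  slot=1
rank  1=plankton.uzh.ch slot=0
rank  2=nano_01.uzh.ch  slot=1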
thanks
Lenny.

On Mon, Aug 17, 2009 at 10:36 AM, jody  wrote:

> Hi
>
> When i use a rankfile, i get an error message which i don't understand:
>
> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
> ./HelloMPI
> --
> Rankfile claimed host plankton that was not allocated or
> oversubscribed it's slots:
>
> --
> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> file rmaps_rank_file.c at line 108
> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> file base/rmaps_base_map_job.c at line 87
> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> file base/plm_base_launch_support.c at line 77
> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
> file plm_rsh_module.c at line 990
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
>
>
>
> With out the '-rf rankfile' option everything works as expected.
>
> My hostfile :
> [jody@plankton tests]$ cat testhosts
> # The following node is a quad-processor machine, and we absolutely
> # want to disallow over-subscribing it:
> plankton slots=3  max-slots=3
> # The following nodes are dual-processor machines:
> nano_00  slots=2 max-slots=2
> nano_01  slots=2 max-slots=2
> nano_02  slots=2 max-slots=2
> nano_03  slots=2 max-slots=2
> nano_04  slots=2 max-slots=2
> nano_05  slots=2 max-slots=2
> nano_06  slots=2 max-slots=2
>
> my rank file:
> [jody@plankton neander]$ cat rankfile
> rank  0=nano_00  slot=1
> rank  1=plankton slot=0
> rank  2=nano_01  slot=1
>
> my Open MPI version: 1.3.2
>
> i get the same error if i use a rankfile which has a single line
>  rank  0=plankton  slot=0
> (plankton is my local machine) and call mpirun with np 1
>
> What does the "Rankfile claimed..." message mean?
> Did i make an error in my rankfile?
> If yes, what would be the correct way to write it?
>
> Thank You
>   Jody
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Help: How to accomplish processors affinity

2009-08-17 Thread Lenny Verkhovsky
Hi
http://www.open-mpi.org/faq/?category=tuning#using-paffinity
I am not familiar with this cluster, but in the FAQ (see the link above) you
can find an example of a rankfile.
Another simple example is the following:
$cat rankfile
rank 0=host1 slot=0
rank 1=host2 slot=0
rank 2=host3 slot=0
rank 3=host4 slot=0
$mpirun -np 4 -H host1,host2,host3,host4 -rf rankfile ./app
If your OS sees your cluster as one machine, and $cat /proc/cpuinfo shows 4
CPUs ( let's assume 0-3 ) with a single IP and hostname,
then try this:
$cat rankfile
rank 0=host1 slot=0
rank 1=host1 slot=1
rank 2=host1 slot=2
rank 3=host1 slot=3
$mpirun -np 4 -H host1 -rf rankfile ./app
best regards
Lenny.

On Fri, Aug 14, 2009 at 6:50 AM, Lee Amy  wrote:

> HI,
>
> I read some howtos at OpenMPI official site but i still have some problems
> here.
>
> I build a Kerrighed Clusters with 4 nodes so they look like a big SMP
> machine. every node has 1 processor with dingle core.
>
> 1) Dose MPI programs could be running on such kinds of machine? If
> yes, could anyone show me some examples?
>
> 2) In this SMP machine there are 4 processors I could see. So how do I
> use OpenMPI to run some programs on these CPUs? Though I read how to
> make a rank file but I am still feel confused. Could anyone show me a
> simple rank file example for my Clusters?
>
> Thank you very much.
>
> Regards,
>
> Amy Lee
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMPI 1.3 Infiniband Hang

2009-08-13 Thread Lenny Verkhovsky
Hi,
1.
Mellanox has newer firmware for those HCAs:
http://www.mellanox.com/content/pages.php?pg=firmware_table_IH3Lx
I am not sure if it will help, but newer firmware usually has some bug fixes.
2.
Try disabling leave_pinned during the run; it is on by default in 1.3.3.
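For example, something like this should disable it for a single run (a sketch;
./app stands for your application):

$mpirun -np 4 --mca mpi_leave_pinned 0 ./app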
Lenny.

On Thu, Aug 13, 2009 at 5:12 AM, Allen Barnett wrote:

> Hi:
> I recently tried to build my MPI application against OpenMPI 1.3.3. It
> worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it hangs part way
> through. It does a fair amount of comm, but eventually it stops in a
> Send/Recv point-to-point exchange. If I turn off the openib btl, it runs
> to completion. Also, I built 1.3.3 with memchecker (which is very nice;
> thanks to everyone who worked on that!) and it runs to completion, even
> with openib active.
>
> Our cluster consists of dual dual-core opteron boxes with Mellanox
> MT25204 (InfiniHost III Lx) HCAs and a Mellanox MT47396 Infiniscale-III
> switch. We're running RHEL 4.8 which appears to include OFED 1.4. I've
> built everything using GCC 4.3.2. Here is the output from ibv_devinfo.
> "ompi_info --all" is attached.
> $ ibv_devinfo
> hca_id: mthca0
>fw_ver: 1.1.0
>node_guid:  0002:c902:0024:3284
>sys_image_guid: 0002:c902:0024:3287
>vendor_id:  0x02c9
>vendor_part_id: 25204
>hw_ver: 0xA0
>board_id:   MT_03B0140002
>phys_port_cnt:  1
>port:   1
>state:  active (4)
>max_mtu:2048 (4)
>active_mtu: 2048 (4)
>sm_lid: 1
>port_lid:   1
>port_lmc:   0x00
>
> I'd appreciate any tips for debugging this.
> Thanks,
> Allen
>
> --
> Allen Barnett
> Transpire, Inc
> E-Mail: al...@transpireinc.com
> Skype:  allenbarnett
> Ph: 518-887-2930
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Failure trying to use tuned collectives

2009-08-10 Thread Lenny Verkhovsky
By default the coll framework scans all available modules and selects the
available functions with the highest priorities.
So, to use the tuned collectives explicitly you can raise its priority:
-mca coll_tuned_priority 100
P.S. Collective modules may implement only a partial set of functions;
for example, the "sm" module does not necessarily implement MPI_Barrier.
In that case MPI_Barrier will be taken from the module where it is available
and has the highest priority. This means that if you call MPI_Scatter and
MPI_Barrier, MPI_Scatter will be taken from the sm collective and
MPI_Barrier from the tuned collective ( this is an example only ).
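For example, something like the following lets you check the current
priorities and then raise the tuned one (a sketch; ./app is a placeholder for
your binary):

$ompi_info --param coll tuned | grep priority
$mpirun -np 8 --mca coll_tuned_priority 100 ./app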
Lenny.

On Fri, Aug 7, 2009 at 5:41 PM, Craig Tierney wrote:

> To use tuned collectives, do all I have to do is add --mca coll tuned?
>
> I am trying to run with:
>
> # mpirun -np 8 --mca coll tuned --mca orte_base_help_aggregate 0 ./wrf.exe
>
> But all the processes fail with the folling message:
>
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>  mca_coll_base_comm_select(MPI_COMM_WORLD) failed
>  --> Returned "Not found" (-13) instead of "Success" (0)
> --
>
> Thanks,
> Craig
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] compile mpi program on Cell BE

2009-08-10 Thread Lenny Verkhovsky
can this be related ?
http://www.open-mpi.org/faq/?category=building#build-qs22

On Sun, Aug 9, 2009 at 12:22 PM, Attila Börcs wrote:

> Hi Everyone,
>
> What the regular method of compiling and running mpi code on Cell Broadband
> ppu-gcc and spu-gcc?
>
>
> Regards,
>
> Attila Borcs
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] bin/orted: Command not found.

2009-08-10 Thread Lenny Verkhovsky
Try specifying -prefix on the command line, for example:
ex:  mpirun -np 4 -prefix $MPIHOME ./app
Lenny.

On Sat, Aug 8, 2009 at 5:04 PM, Kenneth Yoshimoto  wrote:

>
> I don't own these nodes, so I have to use them with
> whatever path setups they came with.  In particular,
> my home directory has a different path on each set.
>
> It would be nice to be able to specify the path to orted
> on each remote node.
>
> Kenneth
>
> On Fri, 7 Aug 2009, Ralph Castain wrote:
>
>  Date: Fri, 7 Aug 2009 18:49:13 -0600
>> From: Ralph Castain 
>> To: Kenneth Yoshimoto , Open MPI Users <
>> us...@open-mpi.org>
>> Subject: Re: [OMPI users] bin/orted: Command not found.
>>
>> Not that I know of - I don't think we currently have any way for you to
>> specify a location for individual nodes.
>>
>> Is there some reason why you installed it this way?
>>
>>
>> On Fri, Aug 7, 2009 at 11:27 AM, Kenneth Yoshimoto 
>> wrote:
>>
>>
>>> Hello,
>>>
>>>  I have three sets of nodes, each with openmpi installed in
>>> a different location.  I am getting an error related to orted:
>>>
>>> /users/kenneth/info/openmpi/install/bin/orted: Command not found.
>>>
>>>  I think it's looking for orted in the wrong place on some of
>>> the nodes.  Is there an easy way to have mpirun look for
>>> orted in the right place on the different sets of nodes?
>>>
>>> Thanks,
>>> Kenneth
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>  ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Tuned collectives: How to choose them dynamically? (-mca coll_tuned_dynamic_rules_filename dyn_rules)"

2009-08-04 Thread Lenny Verkhovsky
Hi,
I too am looking for an example rules file for dynamic collectives.
Has anybody tried it? Where can I find the proper syntax for it?

thanks.
Lenny.



On Thu, Jul 23, 2009 at 3:08 PM, Igor Kozin wrote:

> Hi Gus,
> I played with collectives a few months ago. Details are here
> http://www.cse.scitech.ac.uk/disco/publications/WorkingNotes.ConnectX.pdf
> That was in the context of 1.2.6
>
> You can get available tuning options by doing
> ompi_info -all -mca coll_tuned_use_dynamic_rules 1 | grep alltoall
> and similarly for other collectives.
>
> Best,
> Igor
>
> 2009/7/23 Gus Correa :
>  > Dear OpenMPI experts
> >
> > I would like to experiment with the OpenMPI tuned collectives,
> > hoping to improve the performance of some programs we run
> > in production mode.
> >
> > However, I could not find any documentation on how to select the
> > different collective algorithms and other parameters.
> > In particular, I would love to read an explanation clarifying
> > the syntax and meaning of the lines on "dyn_rules"
> > file that is passed to
> > "-mca coll_tuned_dynamic_rules_filename ./dyn_rules"
> >
> > Recently there was an interesting discussion on the list
> > about this topic.  It showed that choosing the right collective
> > algorithm can make a big difference in overall performance:
> >
> > http://www.open-mpi.org/community/lists/users/2009/05/9355.php
> > http://www.open-mpi.org/community/lists/users/2009/05/9399.php
> > http://www.open-mpi.org/community/lists/users/2009/05/9401.php
> > http://www.open-mpi.org/community/lists/users/2009/05/9419.php
> >
> > However, the thread was concentrated on "MPI_Alltoall".
> > Nothing was said about other collective functions.
> > Not much was said about the
> > "tuned collective dynamic rules" file syntax,
> > the meaning of its parameters, etc.
> >
> > Is there any source of information about that which I missed?
> > Thank you for any pointers or clarifications.
> >
> > Gus Correa
> > -
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > -
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Help: Processors Binding

2009-08-03 Thread Lenny Verkhovsky
Hi,
You can find a lot of useful information in the FAQ section:
http://www.open-mpi.org/faq/
http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
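For example, the simplest form of binding (one process pinned per processor,
no rankfile needed) would be something like this (a sketch; ./app is your
binary):

$mpirun -np 16 --mca mpi_paffinity_alone 1 ./app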
Lenny.
On Mon, Aug 3, 2009 at 11:55 AM, Lee Amy  wrote:

> Hi,
>
> Dose OpenMPI has the processors binding like command "taskset"? For
> example, I started 16 MPI processes then I want to bind them with
> specific processor.
>
> How to do that?
>
> Thank you very much.
>
> Best Regards,
>
> Amy
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] selectively bind MPI to one HCA out of available ones

2009-07-15 Thread Lenny Verkhovsky
Make sure you have the Open MPI 1.3 series;
I don't think the if_include param is available in the 1.2 series.

The max btls parameter controls fragmentation and load balancing over similar
BTLs ( for example when using LMC > 0, or 2 ports connected to 1 network );
to select a specific HCA you need the if_include param.
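For example, something like this should restrict the job to a single HCA
(a sketch; adjust the device name and binary to your setup):

$mpirun -np 8 --mca btl openib,self --mca btl_openib_if_include mthca0 ./app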



On Wed, Jul 15, 2009 at 4:20 PM, Ralph Castain  wrote:

> Take a look at the output from "ompi_info --params btl openib" and you will
> see the available MCA params to direct the openib subsystem. I believe you
> will find that you can indeed specify the interface.
>
>
>   On Wed, Jul 15, 2009 at 7:15 AM,  wrote:
>
>>
>> Hi all,
>>
>> I have a cluster where both HCA's of blade are active, but
>> connected to different subnet.
>> Is there an option in MPI to select one HCA out of available
>> one's? I know it can be done by making changes in openmpi code, but i need
>> clean interface like option during mpi launch time to select mthca0 or
>> mthca1?
>>
>> Any help is appreciated. Btw i just checked Mvapich and feature is
>> there inside.
>>
>> Regards
>>
>> Neeraj Chourasia (MTS)
>> Computational Research Laboratories Ltd.
>> (A wholly Owned Subsidiary of TATA SONS Ltd)
>> B-101, ICC Trade Towers, Senapati Bapat Road
>> Pune 411016 (Mah) INDIA
>> (O) +91-20-6620 9863  (Fax) +91-20-6620 9862
>> M: +91.9225520634
>>
>> =-=-= Notice: The information contained in this e-mail
>> message and/or attachments to it may contain confidential or privileged
>> information. If you are not the intended recipient, any dissemination, use,
>> review, distribution, printing or copying of the information contained in
>> this e-mail message and/or attachments to it are strictly prohibited. If you
>> have received this communication in error, please notify us by reply e-mail
>> or telephone and immediately and permanently delete the message and any
>> attachments. Internet communications cannot be guaranteed to be timely,
>> secure, error or virus-free. The sender does not accept liability for any
>> errors or omissions.Thank you =-=-=
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-15 Thread Lenny Verkhovsky
Thanks, Ralph,
I guess your guess was correct; here is the display-allocation output:


$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 -host witch1 ./hello_world
-np 1 -host witch2 ./hello_world
$mpirun -np 2 -rf rankfile --display-allocation  -app appfile

==   ALLOCATED NODES   ==

 Data for node: Name: dellix7   Num slots: 0    Max slots: 0
 Data for node: Name: witch1    Num slots: 1    Max slots: 0
 Data for node: Name: witch2    Num slots: 1    Max slots: 0

=
--
Rankfile claimed host +n1 by index that is bigger than number of allocated
hosts.


On Wed, Jul 15, 2009 at 4:10 PM, Ralph Castain  wrote:

> What is supposed to happen is this:
>
> 1. each line of the appfile causes us to create a new app_context. We store
> the provided -host info in that object.
>
> 2. when we create the "allocation", we cycle through -all- the app_contexts
> and add -all- of their -host info into the list of allocated nodes
>
> 3. when we get_target_nodes, we start with the entire list of allocated
> nodes, and then use -host for that app_context to filter down to the hosts
> allowed for that specific app_context
>
> So you should have to only provide -np 1 and 1 host on each line. My guess
> is that the rankfile mapper isn't correctly behaving for multiple
> app_contexts.
>
> Add --display-allocation to your mpirun cmd line for the "not working" cse
> and let's see what mpirun thinks the total allocation is - I'll bet that
> both nodes show up, which would tell us that my "guess" is correct. Then
> I'll know what needs to be fixed.
>
> Thanks
> Ralph
>
>
>
> On Wed, Jul 15, 2009 at 6:08 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>>  Same result.
>> I still suspect that rankfile claims for node in small hostlist provided
>> by line in the app file, and not from the hostlist provided by mpirun on HNP
>> node.
>> According to my suspections your proposal should not work(and it does
>> not), since in appfile line I provide np=1, and 1 host, while rankfile tries
>> to allocate all ranks (np=2).
>>
>> $orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338
>>
>>   if(ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list,
>> &num_slots, app,
>>
>> map->policy))) {
>>
>> node_list will be partial, according to app, and not full provided by
>> mpirun cmd. If I didnt provide hostlist in the appfile line, mpirun uses
>> local host and not hosts from the hostfile.
>>
>>
>> Tell me if I am wrong by expecting the following behaivor
>>
>> I provide to mpirun NP, full_hostlist, full_rankfile, appfile
>> I provide in appfile only partial NP and partial hostlist.
>> and it works.
>>
>> Currently, in order to get it working I need to provide full hostlist in
>> the appfile. Which is quit a problematic.
>>
>>
>> $mpirun -np 2 -rf rankfile -app appfile
>> --
>> Rankfile claimed host +n1 by index that is bigger than number of allocated
>> hosts.
>> --
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>
>>
>> Thanks
>> Lenny.
>>
>>
>> On Wed, Jul 15, 2009 at 2:02 PM, Ralph Castain  wrote:
>>
>>> Try your "not working" example without the -H on the mpirun cmd line -
>>> i.e.,, just use "mpirun -np 2 -rf rankfile -app appfile". Does that work?
>>> Sorry to have to keep asking you to try things - I don't have a setup
>>> here where I can test this as everything is RM managed.
>>>
>>>
>>>  On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote:
>>>
>>>
>>> Thanks Ralph, after playing with prefixes it worked,
>>>
>>> I still have a problem running app file with rankfile, by providing full
>&

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-15 Thread Lenny Verkhovsky
Same result.
I still suspect that the rankfile mapper looks for the node in the small host
list provided by the line in the appfile, and not in the full host list
provided to mpirun on the HNP node.
According to my suspicions your proposal should not work (and it does not),
since in the appfile line I provide np=1 and 1 host, while the rankfile tries
to allocate all ranks (np=2).

$orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338

  if(ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list,
&num_slots, app,

map->policy))) {

node_list will be partial, according to the app, and not the full list given
on the mpirun cmd line. If I don't provide a host list in the appfile line,
mpirun uses the local host and not the hosts from the hostfile.


Tell me if I am wrong in expecting the following behavior:

I provide to mpirun NP, the full hostlist, the full rankfile, and an appfile;
I provide in the appfile only a partial NP and a partial hostlist;
and it works.

Currently, in order to get it working I need to provide the full hostlist in
the appfile, which is quite problematic.


$mpirun -np 2 -rf rankfile -app appfile
--
Rankfile claimed host +n1 by index that is bigger than number of allocated
hosts.
--
[dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
[dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
[dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
[dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001


Thanks
Lenny.


On Wed, Jul 15, 2009 at 2:02 PM, Ralph Castain  wrote:

> Try your "not working" example without the -H on the mpirun cmd line -
> i.e.,, just use "mpirun -np 2 -rf rankfile -app appfile". Does that work?
> Sorry to have to keep asking you to try things - I don't have a setup here
> where I can test this as everything is RM managed.
>
>
>  On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote:
>
>
> Thanks Ralph, after playing with prefixes it worked,
>
> I still have a problem running app file with rankfile, by providing full
> hostlist in mpirun command and not in app file.
> Is is planned behaviour, or it can be fixed ?
>
> See Working example:
>
> $cat rankfile
> rank 0=+n1 slot=0
> rank 1=+n0 slot=0
> $cat appfile
> -np 1 -H witch1,witch2  ./hello_world
> -np 1 -H witch1,witch2 ./hello_world
>
> $mpirun -rf rankfile -app appfile
> Hello world! I'm 1 of 2 on witch1
> Hello world! I'm 0 of 2 on witch2
>
> See NOT working example:
>
> $cat appfile
> -np 1 -H witch1 ./hello_world
> -np 1 -H witch2 ./hello_world
> $mpirun  -np 2 -H witch1,witch2 -rf rankfile -app appfile
> --
> Rankfile claimed host +n1 by index that is bigger than number of allocated
> hosts.
> --
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>
>
>
> On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain  wrote:
>
>> Took a deeper look into this, and I think that your first guess was
>> correct.
>> When we changed hostfile and -host to be per-app-context options, it
>> became necessary for you to put that info in the appfile itself. So try
>> adding it there. What you would need in your appfile is the following:
>>
>> -np 1 -H witch1 hostname
>> -np 1 -H witch2 hostname
>>
>> That should get you what you want.
>> Ralph
>>
>>  On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>>
>>  No,  it's not working as I expect , unless I expect somthing wrong .
>> ( sorry for the long PATH, I needed to provide it )
>>
>> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
>> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
>> -np 2 -H witch1,witch2 hostname
>> witch1
>> witch2
>>
>> $LD_LIBRARY_PATH=/hpc/home

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-15 Thread Lenny Verkhovsky
Thanks Ralph, after playing with prefixes it worked.

I still have a problem running an appfile with a rankfile when providing the
full hostlist on the mpirun command line and not in the appfile.
Is this the planned behaviour, or can it be fixed?

See Working example:

$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 -H witch1,witch2  ./hello_world
-np 1 -H witch1,witch2 ./hello_world

$mpirun -rf rankfile -app appfile
Hello world! I'm 1 of 2 on witch1
Hello world! I'm 0 of 2 on witch2

See NOT working example:

$cat appfile
-np 1 -H witch1 ./hello_world
-np 1 -H witch2 ./hello_world
$mpirun  -np 2 -H witch1,witch2 -rf rankfile -app appfile
--
Rankfile claimed host +n1 by index that is bigger than number of allocated
hosts.
--
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001



On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain  wrote:

> Took a deeper look into this, and I think that your first guess was
> correct.
> When we changed hostfile and -host to be per-app-context options, it became
> necessary for you to put that info in the appfile itself. So try adding it
> there. What you would need in your appfile is the following:
>
> -np 1 -H witch1 hostname
> -np 1 -H witch2 hostname
>
> That should get you what you want.
> Ralph
>
>  On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>
>  No,  it's not working as I expect , unless I expect somthing wrong .
> ( sorry for the long PATH, I needed to provide it )
>
> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> -np 2 -H witch1,witch2 hostname
> witch1
> witch2
>
> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> -np 2 -H witch1,witch2 -app appfile
> dellix7
> dellix7
> $cat appfile
> -np 1 hostname
> -np 1 hostname
>
>
> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain  wrote:
>
>> Run it without the appfile, just putting the apps on the cmd line - does
>> it work right then?
>>
>>  On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>>
>>  additional info
>> I am running mpirun on hostA, and providing hostlist with hostB and hostC.
>> I expect that each application would run on hostB and hostC, but I get all
>> of them running on hostA.
>>  dellix7$cat appfile
>> -np 1 hostname
>> -np 1 hostname
>>  dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
>> dellix7
>> dellix7
>>  Thanks
>> Lenny.
>>
>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain  wrote:
>>
>>> Strange - let me have a look at it later today. Probably something simple
>>> that another pair of eyes might spot.
>>>  On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>>
>>>  Seems like connected problem:
>>> I can't use rankfile with app, even after all those fixes ( working with
>>> trunk 1.4a1r21657).
>>> This is my case :
>>>
>>> $cat rankfile
>>> rank 0=+n1 slot=0
>>> rank 1=+n0 slot=0
>>> $cat appfile
>>> -np 1 hostname
>>> -np 1 hostname
>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>>>
>>> --
>>> Rankfile claimed host +n1 by index that is bigger than number of
>>> allocated hosts.
>>>
>>> --
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in fi

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-14 Thread Lenny Verkhovsky
No, it's not working as I expect, unless I am expecting something wrong.
( Sorry for the long PATH; I needed to provide it. )

$LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
-np 2 -H witch1,witch2 hostname
witch1
witch2

$LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
-np 2 -H witch1,witch2 -app appfile
dellix7
dellix7
$cat appfile
-np 1 hostname
-np 1 hostname


On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain  wrote:

> Run it without the appfile, just putting the apps on the cmd line - does it
> work right then?
>
>  On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>
>  additional info
> I am running mpirun on hostA, and providing hostlist with hostB and hostC.
> I expect that each application would run on hostB and hostC, but I get all
> of them running on hostA.
>  dellix7$cat appfile
> -np 1 hostname
> -np 1 hostname
>  dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
> dellix7
> dellix7
>  Thanks
> Lenny.
>
> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain  wrote:
>
>> Strange - let me have a look at it later today. Probably something simple
>> that another pair of eyes might spot.
>>  On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>
>>  Seems like connected problem:
>> I can't use rankfile with app, even after all those fixes ( working with
>> trunk 1.4a1r21657).
>> This is my case :
>>
>> $cat rankfile
>> rank 0=+n1 slot=0
>> rank 1=+n0 slot=0
>> $cat appfile
>> -np 1 hostname
>> -np 1 hostname
>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>> --
>> Rankfile claimed host +n1 by index that is bigger than number of allocated
>> hosts.
>> --
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>
>>
>> The problem is, that rankfile mapper tries to find an appropriate host in
>> the partial ( and not full ) hostlist.
>>
>> Any suggestions how to fix it?
>>
>> Thanks
>> Lenny.
>>
>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain  wrote:
>>
>>> Okay, I fixed this today toor21219
>>>
>>>
>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>>>
>>> Now there is another problem :)
>>>>
>>>> You can try oversubscribe node. At least by 1 task.
>>>> If you hostfile and rank file limit you at N procs, you can ask mpirun
>>>> for N+1 and it wil be not rejected.
>>>> Although in reality there will be N tasks.
>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5"
>>>> both works, but in both cases there are only 4 tasks. It isn't crucial,
>>>> because there is nor real oversubscription, but there is still some bug
>>>> which can affect something in future.
>>>>
>>>> --
>>>> Anton Starikov.
>>>>
>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>>>
>>>> This is fixed as of r21208.
>>>>>
>>>>> Thanks for reporting it!
>>>>> Ralph
>>>>>
>>>>>
>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>>>
>>>>> Although removing this check solves problem of having more slots in
>>>>>> rankfile than necessary, there is another problem.
>>>>>>
>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example:
>>>>>>
>>>>>>
>>>>>> hostfile:
>>>>>>
>>>>>> node01
>>>>>> node01
>>>>>> node02
>>>>>> node02
>>>>>>
>>>>>> rankfile:
>>>>>>
>>>>>> rank 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-14 Thread Lenny Verkhovsky
Additional info:
I am running mpirun on hostA and providing a host list with hostB and hostC.
I expect each application to run on hostB and hostC, but I get all
of them running on hostA.
dellix7$cat appfile
-np 1 hostname
-np 1 hostname
dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
dellix7
dellix7
Thanks
Lenny.

On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain  wrote:

> Strange - let me have a look at it later today. Probably something simple
> that another pair of eyes might spot.
> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>
> Seems like connected problem:
> I can't use rankfile with app, even after all those fixes ( working with
> trunk 1.4a1r21657).
> This is my case :
>
> $cat rankfile
> rank 0=+n1 slot=0
> rank 1=+n0 slot=0
> $cat appfile
> -np 1 hostname
> -np 1 hostname
> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
> --
> Rankfile claimed host +n1 by index that is bigger than number of allocated
> hosts.
> --
> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>
>
> The problem is, that rankfile mapper tries to find an appropriate host in
> the partial ( and not full ) hostlist.
>
> Any suggestions how to fix it?
>
> Thanks
> Lenny.
>
> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain  wrote:
>
>> Okay, I fixed this today toor21219
>>
>>
>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>>
>> Now there is another problem :)
>>>
>>> You can try oversubscribe node. At least by 1 task.
>>> If you hostfile and rank file limit you at N procs, you can ask mpirun
>>> for N+1 and it wil be not rejected.
>>> Although in reality there will be N tasks.
>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5"
>>> both works, but in both cases there are only 4 tasks. It isn't crucial,
>>> because there is nor real oversubscription, but there is still some bug
>>> which can affect something in future.
>>>
>>> --
>>> Anton Starikov.
>>>
>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>>
>>> This is fixed as of r21208.
>>>>
>>>> Thanks for reporting it!
>>>> Ralph
>>>>
>>>>
>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>>
>>>> Although removing this check solves problem of having more slots in
>>>>> rankfile than necessary, there is another problem.
>>>>>
>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example:
>>>>>
>>>>>
>>>>> hostfile:
>>>>>
>>>>> node01
>>>>> node01
>>>>> node02
>>>>> node02
>>>>>
>>>>> rankfile:
>>>>>
>>>>> rank 0=node01 slot=1
>>>>> rank 1=node01 slot=0
>>>>> rank 2=node02 slot=1
>>>>> rank 3=node02 slot=0
>>>>>
>>>>> mpirun -np 4 ./something
>>>>>
>>>>> complains with:
>>>>>
>>>>> "There are not enough slots available in the system to satisfy the 4
>>>>> slots
>>>>> that were requested by the application"
>>>>>
>>>>> but "mpirun -np 3 ./something" will work though. It works, when you ask
>>>>> for 1 CPU less. And the same behavior in any case (shared nodes, 
>>>>> non-shared
>>>>> nodes, multi-node)
>>>>>
>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all
>>>>> affinities set as it requested in rankfile, there is no oversubscription.
>>>>>
>>>>>
>>>>> Anton.
>>>>>
>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>>>>>
>>>>> Ah - thx for catching that, I'll remove that check

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-14 Thread Lenny Verkhovsky
Seems like a related problem:
I can't use a rankfile with an appfile, even after all those fixes ( working
with trunk 1.4a1r21657 ).
This is my case :

$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 hostname
-np 1 hostname
$mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
--
Rankfile claimed host +n1 by index that is bigger than number of allocated
hosts.
--
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
[dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001


The problem is that the rankfile mapper tries to find an appropriate host in
the partial ( and not the full ) hostlist.

Any suggestions how to fix it?

Thanks
Lenny.

On Wed, May 13, 2009 at 1:55 AM, Ralph Castain  wrote:

> Okay, I fixed this today toor21219
>
>
>
> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>
> Now there is another problem :)
>>
>> You can try oversubscribe node. At least by 1 task.
>> If you hostfile and rank file limit you at N procs, you can ask mpirun for
>> N+1 and it wil be not rejected.
>> Although in reality there will be N tasks.
>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5"
>> both works, but in both cases there are only 4 tasks. It isn't crucial,
>> because there is nor real oversubscription, but there is still some bug
>> which can affect something in future.
>>
>> --
>> Anton Starikov.
>>
>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>
>> This is fixed as of r21208.
>>>
>>> Thanks for reporting it!
>>> Ralph
>>>
>>>
>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>
>>> Although removing this check solves problem of having more slots in
>>>> rankfile than necessary, there is another problem.
>>>>
>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example:
>>>>
>>>>
>>>> hostfile:
>>>>
>>>> node01
>>>> node01
>>>> node02
>>>> node02
>>>>
>>>> rankfile:
>>>>
>>>> rank 0=node01 slot=1
>>>> rank 1=node01 slot=0
>>>> rank 2=node02 slot=1
>>>> rank 3=node02 slot=0
>>>>
>>>> mpirun -np 4 ./something
>>>>
>>>> complains with:
>>>>
>>>> "There are not enough slots available in the system to satisfy the 4
>>>> slots
>>>> that were requested by the application"
>>>>
>>>> but "mpirun -np 3 ./something" will work though. It works, when you ask
>>>> for 1 CPU less. And the same behavior in any case (shared nodes, non-shared
>>>> nodes, multi-node)
>>>>
>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all
>>>> affinities set as it requested in rankfile, there is no oversubscription.
>>>>
>>>>
>>>> Anton.
>>>>
>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>>>>
>>>> Ah - thx for catching that, I'll remove that check. It no longer is
>>>>> required.
>>>>>
>>>>> Thx!
>>>>>
>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <
>>>>> lenny.verkhov...@gmail.com> wrote:
>>>>> According to the code it does cares.
>>>>>
>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>>>>
>>>>> ival = orte_rmaps_rank_file_value.ival;
>>>>> if ( ival > (np-1) ) {
>>>>> orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival,
>>>>> rankfile);
>>>>> rc = ORTE_ERR_BAD_PARAM;
>>>>> goto unlock;
>>>>> }
>>>>>
>>>>> If I remember correctly, I used an array to map ranks, and since the
>>>>> length of array is NP, maximum index must be less than np, so if you have
>>>>> the number of rank > NP, you have no place to put

Re: [OMPI users] enable-mpi-threads

2009-07-09 Thread Lenny Verkhovsky
I guess this question has come up before:
https://svn.open-mpi.org/trac/ompi/ticket/1367
On Thu, Jul 9, 2009 at 10:35 AM, Lenny Verkhovsky <
lenny.verkhov...@gmail.com> wrote:

> BTW, What kind of threads Open MPI supports ?
> I found in the https://svn.open-mpi.org/trac/ompi/browser/trunk/README that we
> support  MPI_THREAD_MULTIPLE,
> and found few unclear mails about  MPI_THREAD_FUNNELED and
>  MPI_THREAD_SERIALIZED.
> Also found nothing in FAQ :(.
> Thanks,Lenny.
>
> On Thu, Jul 2, 2009 at 6:37 AM, rahmani  wrote:
>
>> Hi,
>> Very thanks for your discussion
>>
>> - Original Message -
>> From: "Jeff Squyres" 
>> To: "Open MPI Users" 
>> Sent: Tuesday, June 30, 2009 7:23:13 AM (GMT-0500) America/New_York
>> Subject: Re: [OMPI users] enable-mpi-threads
>>
>> On Jun 30, 2009, at 1:29 AM, rahmani wrote:
>>
>> > I want install openmpi in a cluster with multicore processor.
>> > Is it necessary to configure with --enable-mpi-threads option?
>> > when this option should be used?
>> >
>>
>>
>> Open MPI's threading support is functional but not optimized.
>>
>> It depends on the problem you're trying to solve.  There's many ways
>> to write software, but two not-uncommon models for MPI applications are:
>>
>> 1. Write the software such that MPI will launch one process for each
>> core.  You communicate between these processes via MPI communication
>> calls such as MPI_SEND, MPI_RECV, etc.
>>
>> 2. Write the software that that MPI will launch one process per host,
>> and then spawn threads for all the cores on that host.  The threads
>> communicate with each other via typical threaded IPC mechanisms
>> (usually not MPI); MPI processes communicate across hosts via MPI
>> communication calls.  Sometimes MPI function calls are restricted to
>> one thread; sometimes they're invoked by any thread.
>>
>> So it really depends on how you want to write your software.  Make
>> sense?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] enable-mpi-threads

2009-07-09 Thread Lenny Verkhovsky
BTW, what kind of thread support does Open MPI provide?
I found in https://svn.open-mpi.org/trac/ompi/browser/trunk/README that
we support MPI_THREAD_MULTIPLE,
and found a few unclear mails about MPI_THREAD_FUNNELED and
MPI_THREAD_SERIALIZED.
I also found nothing in the FAQ :(.
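One way to check what a given build actually provides (assuming ompi_info from
that build is in your PATH) is:

$ompi_info | grep -i thread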
Thanks, Lenny.

On Thu, Jul 2, 2009 at 6:37 AM, rahmani  wrote:

> Hi,
> Very thanks for your discussion
>
> - Original Message -
> From: "Jeff Squyres" 
> To: "Open MPI Users" 
> Sent: Tuesday, June 30, 2009 7:23:13 AM (GMT-0500) America/New_York
> Subject: Re: [OMPI users] enable-mpi-threads
>
> On Jun 30, 2009, at 1:29 AM, rahmani wrote:
>
> > I want install openmpi in a cluster with multicore processor.
> > Is it necessary to configure with --enable-mpi-threads option?
> > when this option should be used?
> >
>
>
> Open MPI's threading support is functional but not optimized.
>
> It depends on the problem you're trying to solve.  There's many ways
> to write software, but two not-uncommon models for MPI applications are:
>
> 1. Write the software such that MPI will launch one process for each
> core.  You communicate between these processes via MPI communication
> calls such as MPI_SEND, MPI_RECV, etc.
>
> 2. Write the software that that MPI will launch one process per host,
> and then spawn threads for all the cores on that host.  The threads
> communicate with each other via typical threaded IPC mechanisms
> (usually not MPI); MPI processes communicate across hosts via MPI
> communication calls.  Sometimes MPI function calls are restricted to
> one thread; sometimes they're invoked by any thread.
>
> So it really depends on how you want to write your software.  Make
> sense?
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMPI vs Intel MPI

2009-07-02 Thread Lenny Verkhovsky
Hi,
I am not an HPL expert, but this might help.

1.   The rankfile mapper is available only from the Open MPI 1.3 series; if you
are using Open MPI 1.2.8, try -mca mpi_paffinity_alone 1 instead.
2.   If you are using Open MPI 1.3 you don't have to pass -mca mpi_leave_pinned 1,
since it is the default value.
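For example, on 1.2.8 the run might look something like this (a sketch based on
your command line; the machinefile and binary name are yours):

$mpirun -n 8 --machinefile hf --mca btl self,sm --mca mpi_paffinity_alone 1 ./xhpl_ompi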

Lenny.

On Thu, Jul 2, 2009 at 4:47 PM, Swamy Kandadai  wrote:

>  Jeff:
>
> I am running on a 2.66 GHz Nehalem node. On this node, the turbo mode and
> hyperthreading are enabled.
> When I run LINPACK with Intel MPI, I get 82.68 GFlops without much trouble.
>
> When I ran with OpenMPI (I have OpenMPI 1.2.8 but my colleague was using
> 1.3.2). I was using the same MKL libraries both with OpenMPI and
> Intel MPI. But with OpenMPI, the best I got so far is 80.22 GFlops and I
> could never achieve close to what I am getting with Intel MPI.
> Here are muy options with OpenMPI:
>
> mpirun -n 8 --machinefile hf --mca rmaps_rank_file_path rankfile --mca
> coll_sm_info_num_procs 8 --mca btl self,sm -mca mpi_leave_pinned 1
> ./xhpl_ompi
>
> Here is my rankfile:
>
> at rankfile
> rank 0=i02n05 slot=0
> rank 1=i02n05 slot=1
> rank 2=i02n05 slot=2
> rank 3=i02n05 slot=3
> rank 4=i02n05 slot=4
> rank 5=i02n05 slot=5
> rank 6=i02n05 slot=6
> rank 7=i02n05 slot=7
>
> In this case the physical cores are 0-7 while the additional logical
> processors with hyperthreading are 8-15.
> With "top" command, I could see all the 8 tasks are running on 8 different
> physical cores. I did not see
> 2 MPI tasks running on the same physical core. Also, the program is not
> paging as the problem size
> fits in the meory.
>
> Do you have any ideas how I can improve the performance so that it matches
> with Intel MPI performance?
> Any suggestions will be greatly appreciated.
>
> Thanks
> Swamy Kandadai
>
>
> Dr. Swamy N. Kandadai
> IBM Senior Certified Executive IT Specialist
> STG WW Modular Systems Benchmark Center
> STG WW HPC and BI CoC Benchmark Center
> Phone:( 845) 433 -8429 (8-293) Fax:(845)432-9789
> sw...@us.ibm.com
> http://w3.ibm.com/sales/systems/benchmarks
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] mpirun fails on remote applications

2009-05-12 Thread Lenny Verkhovsky
Sounds like a firewall problem to or from anfield04.
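A quick check (assuming the same password-less ssh you already use, and that
you can run iptables, possibly as root): Open MPI opens TCP connections on
random ports in both directions, so any filtering between the nodes will make
mpirun hang exactly like this.

$/sbin/iptables -L -n
$ssh anfield04 /sbin/iptables -L -n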
Lenny,

On Tue, May 12, 2009 at 8:18 AM, feng chen  wrote:

>  hi all,
>
> First of all,i'm new to openmpi. So i don't know much about mpi setting.
> That's why i'm following manual and FAQ suggestions from the beginning.
> Everything went well untile i try to run a pllication on a remote node by
> using 'mpirun -np' command. It just hanging there without doing anything, no
> error messanges, no
> complaining or whatsoever. What confused me is that i can run application
> over ssh with no problem, while it comes to mpirun, just stuck in there does
> nothing.
> I'm pretty sure i got everyting setup in the right way manner, including no
> password signin over ssh, environment variables for bot interactive and
> non-interactive logons.
> A sample list of commands been used list as following:
>
>
>
>
>  [fch6699@anfield05 test]$ mpicc -o hello hello.f
> [fch6699@anfield05 test]$ ssh anfield04 ./hello
> 0 of 1: Hello world!
> [fch6699@anfield05 test]$ mpirun -host anfield05 -np 4 ./hello
> 0 of 4: Hello world!
> 2 of 4: Hello world!
> 3 of 4: Hello world!
> 1 of 4: Hello world!
> [fch6699@anfield05 test]$ mpirun -host anfield04 -np 4 ./hello
> just hanging there for years!!!
> need help to fix this !!
> if u try it in another way
> [fch6699@anfield05 test]$ mpirun -hostfile my_hostfile -np 4 ./hell
> still nothing happened, no warnnings, no complains, no error messages.. !!
>
> All other files related to this issue can be found in my_files.tar.gz in
> attachment.
>
> .cshrc
> The output of the "ompi_info --all" command.
> my_hostfile
> hello.c
> output of iptables
>
> The only thing i've noticed is that the port of our ssh has been changed
> from 22 to other number for security issues.
> Don't know will that have anything to with it or not.
>
>
> Any help will be highly appreciated!!
>
> thanks in advance!
>
> Kevin
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Lenny Verkhovsky
According to the code, it does care.

$vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572

ival = orte_rmaps_rank_file_value.ival;
  if ( ival > (np-1) ) {
  orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival,
rankfile);
  rc = ORTE_ERR_BAD_PARAM;
  goto unlock;
  }

If I remember correctly, I used an array to map the ranks, and since the length
of the array is NP, the maximum rank index must be less than NP; if a rank
number is greater than NP-1, there is no place to put it inside the array.
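In other words, the largest rank in the rankfile must be at most np-1. A small
illustration (hypothetical hosts and hostfile, same syntax as above):

$cat rankfile
rank 0=node01 slot=0
rank 1=node02 slot=0
$mpirun -np 2 -hostfile hosts -rf rankfile ./app   # ok: max rank 1 <= np-1
$mpirun -np 1 -hostfile hosts -rf rankfile ./app   # rejected by the check above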

"Likewise, if you have more procs than the rankfile specifies, we map the
additional procs either byslot (default) or bynode (if you specify that
option). So the rankfile doesn't need to contain an entry for every proc."
 - Correct point.

Lenny.

On 5/5/09, Ralph Castain  wrote:
>
> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if
> the rankfile contains additional info - it only maps up to the number of
> processes, and ignores anything beyond that number. So there is no need to
> remove the additional info.
>
> Likewise, if you have more procs than the rankfile specifies, we map the
> additional procs either byslot (default) or bynode (if you specify that
> option). So the rankfile doesn't need to contain an entry for every proc.
>
> Just don't want to confuse folks.
> Ralph
>
>
>
> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> Hi,
>> maximum rank number must be less then np.
>> if np=1 then there is only rank 0 in the system, so rank 1 is invalid.
>> please remove "rank 1=node2 slot=*" from the rankfile
>> Best regards,
>> Lenny.
>>
>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:
>>
>>> Hi ,
>>>
>>> I got the 
>>> openmpi-1.4a1r21095.tar.gz<http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21095.tar.gz>tarball,
>>>  but unfortunately my command doesn't work
>>>
>>> cat rankf:
>>> rank 0=node1 slot=*
>>> rank 1=node2 slot=*
>>>
>>> cat hostf:
>>> node1 slots=2
>>> node2 slots=2
>>>
>>> mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
>>> --host node2 -n 1 hostname
>>>
>>> Error, invalid rank (1) in the rankfile (rankf)
>>>
>>>
>>> --
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> rmaps_rank_file.c at line 403
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> base/rmaps_base_map_job.c at line 86
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> base/plm_base_launch_support.c at line 86
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> plm_rsh_module.c at line 1016
>>>
>>>
>>> Ralph, could you tell me if my command syntax is correct or not ? if not,
>>> give me the expected one ?
>>>
>>> Regards
>>>
>>> Geoffroy
>>>
>>>
>>>
>>>
>>> 2009/4/30 Geoffroy Pignot 
>>>
>>>> Immediately Sir !!! :)
>>>>
>>>> Thanks again Ralph
>>>>
>>>> Geoffroy
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Message: 2
>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>>>> From: Ralph Castain 
>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>> To: Open MPI Users 
>>>>> Message-ID:
>>>>><71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
>>>>> Content-Type: text/plain; charset="iso-8859-1"
>>>>>
>>>>> I believe this is fixed now in our development trunk - you can download
>>>>> any
>>>>> tarball starting from last night and give it a try, if you like. Any
>>>>> feedback would be appreciated.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>>>
>>>>> Ah now, I didn't say it -worked-, did I? :-)
>>>>>
>>>>> Clearly a bug exists in the program. I'll try to take a look at it (if
>>>>> Lenny
>>>>> doesn't get to it first), but it won't be until later

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Lenny Verkhovsky
e remote nodes.
>>> > >
>>> >
>>> --
>>> > >
>>> >
>>> --
>>> > > orterun noticed that the job aborted, but has no info as to the
>>> > process
>>> > > that caused that situation.
>>> > >
>>> >
>>> --
>>> > > orterun: clean termination accomplished
>>>
>>>
>>>
>>> Message: 4
>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>>> From: Ralph Castain 
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users 
>>> Message-ID: 
>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>   DelSp="yes"
>>>
>>> The rankfile cuts across the entire job - it isn't applied on an
>>> app_context basis. So the ranks in your rankfile must correspond to
>>> the eventual rank of each process in the cmd line.
>>>
>>> Unfortunately, that means you have to count ranks. In your case, you
>>> only have four, so that makes life easier. Your rankfile would look
>>> something like this:
>>>
>>> rank 0=r001n001 slot=0
>>> rank 1=r001n002 slot=1
>>> rank 2=r001n001 slot=1
>>> rank 3=r001n002 slot=2
>>>
>>> HTH
>>> Ralph
>>>
>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>>
>>> > Hi,
>>> >
>>> > I agree that my examples are not very clear. What I want to do is to
>>> > launch a multiexes application (masters-slaves) and benefit from the
>>> > processor affinity.
>>> > Could you show me how to convert this command , using -rf option
>>> > (whatever the affinity is)
>>> >
>>> > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host r001n002
>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
>>> > host r001n002 slave.x options4
>>> >
>>> > Thanks for your help
>>> >
>>> > Geoffroy
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Message: 2
>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>> > From: Lenny Verkhovsky 
>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > To: Open MPI Users 
>>> > Message-ID:
>>> ><453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
>>> > Content-Type: text/plain; charset="iso-8859-1"
>>> >
>>> > Hi,
>>> >
>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1
>>> > defined,
>>> > while n=1, which means only rank 0 is present and can be allocated.
>>> >
>>> > NP must be >= the largest rank in rankfile.
>>> >
>>> > What exactly are you trying to do ?
>>> >
>>> > I tried to recreate your seqv but all I got was
>>> >
>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>>> > hostfile.0
>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>> > [witch19:30798] mca: base: component_find: paffinity
>>> > "mca_paffinity_linux"
>>> > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> > supported MCA v2.0.0) -- ignored
>>> >
>>> --
>>> > It looks like opal_init failed for some reason; your parallel
>>> > process is
>>> > likely to abort. There are many reasons that a parallel process can
>>> > fail during opal_init; some of which are due to configuration or
>>> > environment problems. This failure appears to be an internal failure;
>>> > here's some additional information (which may only be relevant to an
>>> > Open MPI developer):
>>> >
>>> >  opal_carto_base_select failed
>>> >  --> Returned value -13 instead of OPAL_SUCCESS
>>> >
>>> --
>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not foun

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-21 Thread Lenny Verkhovsky
It's something in the base code, right.
I tried to investigate it yesterday and saw that for some reason
jdata->bookmark->index is 2 instead of 1 ( in this example ).

[dellix7:28454] [ ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c
+417 ]  node->index = 1, jdata->bookmark->index=2
[dellix7:28454] [ ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c
+417 ]  node->index = 2, jdata->bookmark->index=2
I am not so familiar with this part of the code, since it appears in all rmaps
components and I just copied it :).

I also do not quite understand what Geoffroy is trying to run, so I cannot yet
think of a workaround.
Lenny.
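
For reference, the reproducer from Geoffroy's earlier mail in this thread
( host names and slot values exactly as he posted them ):

cat hostfile.0
r011n002 slots=4
r011n003 slots=4

cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### fails: invalid rank (1) in the rankfile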



On Mon, Apr 20, 2009 at 8:30 PM, Ralph Castain  wrote:

> I'm afraid this is a more extensive rewrite than I had hoped - the
> revisions are most unlikely to make it for 1.3.2. Looks like it will be
> 1.3.3 at the earliest.
>
> Ralph
>
>
> On Mon, Apr 20, 2009 at 7:50 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>>  Me too, sorry, it definitely seems like a bug. Somewhere in the code there
>> is probably an undefined variable.
>> I just never tested this code with such a "bizarre" command line :)
>>
>> Lenny.
>>
>>   On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot 
>> wrote:
>>
>>> Thanks,
>>>
>>> I am not in a hurry but it would be nice if I could benefit from this
>>> feature in the next release.
>>> Regards
>>>
>>> Geoffroy
>>>
>>>
>>>
>>> 2009/4/20 
>>>
>>>> Send users mailing list submissions to
>>>>us...@open-mpi.org
>>>>
>>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> or, via email, send a message with subject or body 'help' to
>>>>users-requ...@open-mpi.org
>>>>
>>>> You can reach the person managing the list at
>>>>users-ow...@open-mpi.org
>>>>
>>>> When replying, please edit your Subject line so it is more specific
>>>> than "Re: Contents of users digest..."
>>>>
>>>>
>>>> Today's Topics:
>>>>
>>>>   1. Re: 1.3.1 -rf rankfile behaviour ?? (Ralph Castain)
>>>>
>>>>
>>>> --
>>>>
>>>> Message: 1
>>>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>>>> From: Ralph Castain 
>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>> To: Open MPI Users 
>>>> Message-ID: <6378a8c1-1763-4a1c-abca-c6fcc3605...@open-mpi.org>
>>>>
>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>>DelSp="yes"
>>>>
>>>> Honestly haven't had time to look at it yet - hopefully in the next
>>>> couple of days...
>>>>
>>>> Sorry for delay
>>>>
>>>>
>>>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>>>
>>>> > Do you have any news about this bug.
>>>> > Thanks
>>>> >
>>>> > Geoffroy
>>>> >
>>>> >
>>>> > Message: 1
>>>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>>>> > From: Ralph Castain 
>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>> > To: Open MPI Users 
>>>> > Message-ID: 
>>>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>> >DelSp="yes"
>>>> >
>>>> > Ah now, I didn't say it -worked-, did I? :-)
>>>> >
>>>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>>>> > Lenny doesn't get to it first), but it won't be until later in the
>>>> > week.
>>>> >
>>>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>> >
>>>> > > I agree with you Ralph , and that 's what I expect from openmpi but
>>>> > > my second example shows that it's not working
>>>> > >
>>>> > > cat hostfile.0
>>>> > >r011n002 slots=4
>>>> > >r011n003 slots=4
>>>> > >
>>>> > >  cat rankfile.0
>>>> > > rank 0=r011n002 slot=0
&

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Lenny Verkhovsky
ecause the daemon was unable to find all the needed
>> > > > shared
>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>> > to
>> > > > have the
>> > > > > location of the shared libraries on the remote nodes and this
>> > will
>> > > > > automatically be forwarded to the remote nodes.
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > > orterun noticed that the job aborted, but has no info as to the
>> > > > process
>> > > > > that caused that situation.
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > > orterun: clean termination accomplished
>> > >
>> > >
>> > >
>> > > Message: 4
>> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
>> > > From: Ralph Castain 
>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > > To: Open MPI Users 
>> > > Message-ID: 
>> > > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> > >DelSp="yes"
>> > >
>> > > The rankfile cuts across the entire job - it isn't applied on an
>> > > app_context basis. So the ranks in your rankfile must correspond to
>> > > the eventual rank of each process in the cmd line.
>> > >
>> > > Unfortunately, that means you have to count ranks. In your case, you
>> > > only have four, so that makes life easier. Your rankfile would look
>> > > something like this:
>> > >
>> > > rank 0=r001n001 slot=0
>> > > rank 1=r001n002 slot=1
>> > > rank 2=r001n001 slot=1
>> > > rank 3=r001n002 slot=2
>> > >
>> > > HTH
>> > > Ralph
>> > >
>> > > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I agree that my examples are not very clear. What I want to do
>> > is to
>> > > > launch a multiexes application (masters-slaves) and benefit from
>> > the
>> > > > processor affinity.
>> > > > Could you show me how to convert this command , using -rf option
>> > > > (whatever the affinity is)
>> > > >
>> > > > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host
>> > r001n002
>> > > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
>> > > > host r001n002 slave.x options4
>> > > >
>> > > > Thanks for your help
>> > > >
>> > > > Geoffroy
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > Message: 2
>> > > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>> > > > From: Lenny Verkhovsky 
>> > > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > > > To: Open MPI Users 
>> > > > Message-ID:
>> > > ><453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com
>> > >
>> > > > Content-Type: text/plain; charset="iso-8859-1"
>> > > >
>> > > > Hi,
>> > > >
>> > > > The first "crash" is OK, since your rankfile has ranks 0 and 1
>> > > > defined,
>> > > > while n=1, which means only rank 0 is present and can be
>> > allocated.
>> > > >
>> > > > NP must be >= the largest rank in rankfile.
>> > > >
>> > > > What exactly are you trying to do ?
>> > > >
>> > > > I tried to recreate your seqv but all I got was
>> > > >
>> > > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>> > > > hostfile.0
>> > > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> > > > [witch19:30798] mca: base: component_find: paffinity
>> > > > "mca_paffinity_linux"
>> > > > uses an

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-12 Thread Lenny Verkhovsky
Hi,

The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
while n=1, which means only rank 0 is present and can be allocated.

NP must be >= the largest rank in rankfile.
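
Concretely, with the rankfile.0 from your mail ( ranks 0 and 1 defined ):

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK, both ranks can be allocated
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname   ### invalid rank (1): only rank 0 exists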

What exactly are you trying to do ?

I tried to recreate your seqv but all I got was

~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
-rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
[witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
uses an MCA interface that is not recognized (component MCA v1.0.0 !=
supported MCA v2.0.0) -- ignored
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_carto_base_select failed
  --> Returned value -13 instead of OPAL_SUCCESS
--
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../orte/runtime/orte_init.c at line 78
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../orte/orted/orted_main.c at line 344
--
A daemon (pid 11629) died unexpectedly with status 243 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished


Lenny.


On 4/10/09, Geoffroy Pignot  wrote:
>
> Hi ,
>
> I am currently testing the process affinity capabilities of openmpi and I
> would like to know if the rankfile behaviour I will describe below is normal
> or not ?
>
> cat hostfile.0
> r011n002 slots=4
> r011n003 slots=4
>
> cat rankfile.0
> rank 0=r011n002 slot=0
> rank 1=r011n003 slot=1
>
>
> ##
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2  hostname ### OK
> r011n002
> r011n003
>
>
> ##
> but
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> ### CRASHED
> *
>  --
> Error, invalid rank (1) in the rankfile (rankfile.0)
> --
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> rmaps_rank_file.c at line 404
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/rmaps_base_map_job.c at line 87
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/plm_base_launch_support.c at line 77
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> plm_rsh_module.c at line 985
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> orterun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> orterun: clean termination accomplished
> *
> It seems that the rankfile option is not propagted to the second command
> line ; there is no global understanding of the ranking inside a mpirun
> command.
>
>
> ##
>
> Assuming that , I tried to provide a rankfile to each command line:
>
> cat rankfile.0
> rank 0=r011n002 slot=0
>
> cat rankfile.1
> rank 0=r011n003 slot=1
>
> mpirun --hostfile hostfile.0 

Re: [OMPI users] OpenMPI program getting stuck at poll()

2009-03-10 Thread Lenny Verkhovsky
Hi,
can you try Open MPI 1.3 version.


On 3/9/09, Prasanna Ranganathan  wrote:
>
>  Hi all,
>
>   I have a distributed program running on 400+ nodes and using OpenMPI. I
> have run the same binary with nearly the same setup successfully previously.
> However in my last two runs the program seems to be getting stuck after a
> while before it completes. The stack trace at the time it gets stuck is as
> follows:
>
>   #0  0x2adc00df in poll () from /lib/libc.so.6
>  #1  0x2acfffa49c27 in opal_poll_dispatch () from
> /usr/lib64/libopen-pal.so.0
>  #2  0x2acfffa47add in opal_event_base_loop () from
> /usr/lib64/libopen-pal.so.0
>  #3  0x2acfffa43203 in opal_progress () from /usr/lib64/libopen-pal.so.0
>  #4  0x2acfff78b315 in ompi_request_test_some () from
> /usr/lib64/libmpi.so.0
>  #5  0x2acfff7adf7a in PMPI_Testsome () from /usr/lib64/libmpi.so.0
>  
>
>  I checked all the nodes and they seem to be up and doing fine. Any
> suggestions/hints on what might be happening here would help greatly. Thanks
> in advance.
>
>  I am using OpenMPI 1.2.7 on gentoo linux.
>
>  Regards,
>
>  Prasanna.
> ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Problem with MPI_Comm_spawn_multiple & MPI_Info_fre

2009-03-10 Thread Lenny Verkhovsky
can you try Open MPI 1.3,

Lenny.

On 3/10/09, Tee Wen Kai  wrote:
>
> Hi,
>
> I am using version 1.2.8.
>
> Thank you.
>
> Regards,
> Wenkai
>
> --- On *Mon, 9/3/09, Ralph Castain * wrote:
>
>
> From: Ralph Castain 
> Subject: Re: [OMPI users] Problem with MPI_Comm_spawn_multiple &
> MPI_Info_free
> To: "Open MPI Users" 
> Date: Monday, 9 March, 2009, 7:42 PM
>
> Could you tell us what version of Open MPI you are using? It would help us
> to provide you with advice.
> Thanks
> Ralph
>
>  On Mar 9, 2009, at 2:18 AM, Tee Wen Kai wrote:
>
>  Hi,
>
> I have a program that allow user to enter their choice of operation. For
> example, when the user enter '4', the program will enter a function which
> will spawn some other programs stored in the same directory. When the user
> enter '5', the program will enter another function to request all spawned
> processes to exit. Therefore, initially only one process which act as the
> controller is spawned.
>
> My MPI_Info_create and MPI_Comm_spawn_multiple are called in a function.
> Everything is working fine except when I tried to free the MPI_Info in the
> function by calling MPI_Info_free, I have segmentation fault error. The
> second problem is when i do the spawning once and exit the controller
> program with MPI_Finalize, the program exited normally. But when spawning is
> done more than once and exit the controller program with MPI_Finalize, there
> is segmentation fault. I also realize that when the spawed processes exit,
> the 'orted' process is still running. Thus, when multiple
> MPI_Comm_spawn_multiple are called, there will be multiple 'orted'
> processes.
>
> Thank you and hope someone has a solution to my problem.
>
> Regards,
> Wenkai
>
> --
> New Email names for you!
> 
> Get the Email name you've always wanted on the new @ymail and @rocketmail.
> Hurry before someone else
> does!___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> -Inline Attachment Follows-
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
>  Adding more friends is quick and 
> easy.
> Import them over to Yahoo! Mail today!
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMPI 1.3.1 rpm build error

2009-03-01 Thread Lenny Verkhovsky
We saw the same problem with compilation;
the workaround for us was configuring without vt ( see ./configure --help ).
I hope the vt guys will fix it sometime.
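
For example, something along these lines ( the exact flag name is listed by
./configure --help; --enable-contrib-no-build=vt appears to be the spelling in
the 1.3 series, and the prefix is just a placeholder ):

./configure --prefix=/opt/openmpi-1.3 --enable-contrib-no-build=vt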

Lenny.

On Mon, Feb 23, 2009 at 11:48 PM, Jeff Squyres  wrote:
> It would be interesting to see what happens with the 1.3 build.
>
> It's hard to interpret the output of your user's test program without
> knowing exactly what that printf means...
>
>
> On Feb 23, 2009, at 4:44 PM, Jim Kusznir wrote:
>
>> I haven't had time to do the openmpi build from the nightly yet, but
>> my user has run some more tests and now has a simple program and
>> algorithm to "break" openmpi.  His notes:
>>
>> hey, just fyi, I can reproduce the error readily in a simple test case
>> my "way to break mpi" is as follows: Master proc runs MPI_Send 1000
>> times to each child, then waits for a "I got it" ack from each child.
>> Each child receives 1000 numbers from the Master, then sends "I got
>> it" to the master
>> running this on 25 nodes causes it to break about 60% of the time
>> interestingly, it usually breaks on the same process number each time
>>
>> ah. It looks like if I let it sit for about 5 minutes, sometimes it
>> will work. From my log
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 816
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 817
>> rank: 23 Mon Feb 23 13:29:44 2009 recieved 818
>> rank: 23 Mon Feb 23 13:33:08 2009 recieved 819
>> rank: 23 Mon Feb 23 13:33:08 2009 recieved 820
>>
>> Any thoughts on this problem?
>> (this is the only reason I'm currently working on upgrading openmpi)
>>
>> --Jim
>>
>> On Fri, Feb 20, 2009 at 1:59 PM, Jeff Squyres  wrote:
>>>
>>> There won't be an official SRPM until 1.3.1 is released.
>>>
>>> But to test if 1.3.1 is on-track to deliver a proper solution to you, can
>>> you try a nightly tarball, perhaps in conjunction with our "buildrpm.sh"
>>> script?
>>>
>>>
>>>
>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/buildrpm.sh
>>>
>>> It should build a trivial SRPM for you from the tarball.  You'll likely
>>> need
>>> to get the specfile, too, and put it in the same dir as buildrpm.sh.  The
>>> specfile is in the same SVN directory:
>>>
>>>
>>>
>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/openmpi.spec
>>>
>>>
>>>
>>> On Feb 20, 2009, at 3:51 PM, Jim Kusznir wrote:
>>>
 As long as I can still build the rpm for it and install it via rpm.
 I'm running it on a ROCKS cluster, so it needs to be an RPM to get
 pushed out to the compute nodes.

 --Jim

 On Fri, Feb 20, 2009 at 11:30 AM, Jeff Squyres 
 wrote:
>
> On Feb 20, 2009, at 2:20 PM, Jim Kusznir wrote:
>
>> I just went to www.open-mpi.org, went to download, then source rpm.
>> Looks like it was actually 1.3-1.  Here's the src.rpm that I pulled
>> in:
>>
>>
>>
>> http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3-1.src.rpm
>
> Ah, gotcha.  Yes, that's 1.3.0, SRPM version 1.  We didn't make up this
> nomenclature.  :-(
>
>> The reason for this upgrade is it seems a user found some bug that may
>> be in the OpenMPI code that results in occasionally an MPI_Send()
>> message getting lost.  He's managed to reproduce it multiple times,
>> and we can't find anything in his code that can cause it...He's got
>> logs of mpi_send() going out, but the matching mpi_receive() never
>> getting anything, thus killing his code.  We're currently running
>> 1.2.8 with ofed support (Haven't tried turning off ofed, etc. yet).
>
> Ok.  1.3.x is much mo' betta' then 1.2 in many ways.  We could probably
> help
> track down the problem, but if you're willing to upgrade to 1.3.x,
> it'll
> hopefully just make the problem go away.
>
> Can you try a 1.3.1 nightly tarball?
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] OpenMPI hangs across multiple nodes.

2009-02-04 Thread Lenny Verkhovsky
what kind of communication between nodes do you have - tcp, openib (
IB/IWARP ) ?
you can try

mpirun -np 4 -host node1,node2 -mca btl tcp,self random
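
If that also hangs, adding some BTL verbosity may show what each node is trying
to use ( parameter names can be checked with ompi_info --param btl all ):

mpirun -np 4 -host node1,node2 -mca btl tcp,self -mca btl_base_verbose 30 hostname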



On Wed, Feb 4, 2009 at 1:21 AM, Ralph Castain  wrote:
> Could you tell us which version of OpenMPI you are using, and how it was
> configured?
>
> Did you install the OMPI libraries and binaries on both nodes? Are they in
> the same absolute path locations?
>
> Thanks
> Ralph
>
>
> On Feb 3, 2009, at 3:46 PM, Robertson Burgess wrote:
>
>> Dear users,
>> I am quite new to OpenMPI, I have compiled it on two nodes, each node with
>> 8 CPU cores. The two nodes are identical. The code I am using works in
>> parallel across the 8 cores on a single node. However, whenever I try to run
>> across both nodes, OpenMPI simply hangs. There is no output whatsoever, when
>> I run it in background, outputting to a log file, the log file is always
>> empty. The cores do not appear to be doing anything at all, either on the
>> host node or on the remote node. This happens whether I am running my code,
>> or even if I when I tell it to run a process that doesn't even exist, for
>> instance
>>
>> mpirun -np 4 -host node1,node2 random
>>
>> Simply results in the terminal hanging, so all I can do is close the
>> terminal and open up a new one.
>>
>> mpirun -np 4 -host node1,node2 random >& log.log &
>>
>> simply produces and empty log.log file
>>
>> I am running Redhat Linux on the systems, and compiled OpenMPI with the
>> Intel Compilers 10.1. As I've said, it works fine on one node. I have set up
>> both nodes such that they can log into each other via ssh without the need
>> for a password, and I have altered my .bashrc file so the PATH and
>> LD_LIBRARY_PATH include the appropriate folders.
>> I have looked through the FAQ and mailing lists, but I was unable to find
>> anything that really matched my problem. Any help would be greatly
>> appreciated.
>>
>> Sincerely,
>> Robertson Burgess
>> University of Newcastle
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Problem with openmpi and infiniband

2009-01-04 Thread Lenny Verkhovsky
Hi, just to make sure:

you wrote in the previous mail that you tested IMB-MPI1 and it "reports for
the last test", and that the results are for "#processes = 6". Since you have
4- and 8-core machines, this test could have run on a single 8-core machine
over shared memory rather than over InfiniBand, as you suspected.

You can rerun the IMB-MPI1 test with -mca btl self,openib to be sure that the
test does not use shared memory or tcp.
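
For example, something like this ( np and the hostfile/IMB paths are
placeholders for your setup ):

mpirun -np 6 --hostfile hostfile -mca btl self,openib ./IMB-MPI1 PingPong Barrier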

Lenny.



On 12/24/08, Biagio Lucini  wrote:
> Pavel Shamis (Pasha) wrote:
>
> > Biagio Lucini wrote:
> >
> > > Hello,
> > >
> > > I am new to this list, where I hope to find a solution for a problem
> > > that I have been having for quite a longtime.
> > >
> > > I run various versions of openmpi (from 1.1.2 to 1.2.8) on a cluster
> > > with Infiniband interconnects that I use and administer at the same
> > > time. The openfabric stac is OFED-1.2.5, the compilers gcc 4.2 and
> > > Intel. The queue manager is SGE 6.0u8.
> > >
> > Do you use OpenMPI version that is included in OFED ? Did you was able
> > to run basic OFED/OMPI tests/benchmarks between two nodes ?
> >
> >
>
>  Hi,
>
>  yes to both questions: the OMPI version is the one that comes with OFED
> (1.1.2-1) and the basic tests run fine. For instance, IMB-MPI1 (which is
> more than basic, as far as I can see) reports for the last test:
>
>  #---
>  # Benchmarking Barrier
>  # #processes = 6
>  #---
>   #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>  100022.9322.9522.94
>
>
>  for the openib,self btl (6 processes, all processes on different nodes)
>  and
>
>  #---
>  # Benchmarking Barrier
>  # #processes = 6
>  #---
>   #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>  1000   191.30   191.42   191.34
>
>  for the tcp,self btl (same test)
>
>  No anomalies for other tests (ping-pong, all-to-all etc.)
>
>  Thanks,
>  Biagio
>
>
>  --
>  =
>
>  Dr. Biagio Lucini
>  Department of Physics, Swansea University
>  Singleton Park, SA2 8PP Swansea (UK)
>  Tel. +44 (0)1792 602284
>
>  =
>  ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Bug in 1.3 nightly

2008-12-16 Thread Lenny Verkhovsky
I didn't see any errors on 1.3rc3r20130; I am running MTT nightly
and it seems to be fine on x86-64 CentOS 5.

On Tue, Dec 16, 2008 at 10:27 AM, Gabriele Fatigati
 wrote:
> Dear OpenMPI developers,
> trying to compile 1.3 nightly version , i get the follow error:
>
> ../../../orte/.libs/libopen-rte.so: undefined reference to `ORTE_NAME_PRINT'
> ../../../orte/.libs/libopen-rte.so: undefined reference to `ORTE_JOBID_PRINT'
>
>
> The version affected are:
>
> openmpi-1.3rc3r20130
> openmpi-1.3rc3r20107
> openmpi-1.3rc3r20092
> openmpi-1.3rc2r20084
>
> Thanks in advance.
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
>
> g.fatig...@cineca.it
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-09 Thread Lenny Verkhovsky
Hi,
1.  please provide # cat /proc/cpuinfo
2.  see http://www.open-mpi.org/faq/?category=tuning#paffinity-defs.
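
If the processes are simply not being bound, binding can be switched on at run
time, e.g. ( the application name is just a placeholder ):

mpirun -np 4 -mca mpi_paffinity_alone 1 ./app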

Best regards
Lenny.


Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Lenny Verkhovsky
also see https://svn.open-mpi.org/trac/ompi/ticket/1449



On 12/9/08, Lenny Verkhovsky  wrote:
>
> maybe it's related to https://svn.open-mpi.org/trac/ompi/ticket/1378  ??
>
> On 12/5/08, Justin  wrote:
>>
>> The reason i'd like to disable these eager buffers is to help detect the
>> deadlock better.  I would not run with this for a normal run but it would be
>> useful for debugging.  If the deadlock is indeed due to our code then
>> disabling any shared buffers or eager sends would make that deadlock
>> reproduceable.In addition we might be able to lower the number of
>> processors down.  Right now determining which processor is deadlocks when we
>> are using 8K cores and each processor has hundreds of messages sent out
>> would be quite difficult.
>>
>> Thanks for your suggestions,
>> Justin
>> Brock Palen wrote:
>>
>>> OpenMPI has differnt eager limits for all the network types, on your
>>> system run:
>>>
>>> ompi_info --param btl all
>>>
>>> and look for the eager_limits
>>> You can set these values to 0 using the syntax I showed you before. That
>>> would disable eager messages.
>>> There might be a better way to disable eager messages.
>>> Not sure why you would want to disable them, they are there for
>>> performance.
>>>
>>> Maybe you would still see a deadlock if every message was below the
>>> threshold. I think there is a limit of the number of eager messages a
>>> receving cpus will accept. Not sure about that though.  I still kind of
>>> doubt it though.
>>>
>>> Try tweaking your buffer sizes,  make the openib  btl eager limit the
>>> same as shared memory. and see if you get locks up between hosts and not
>>> just shared memory.
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>>
>>>
>>> On Dec 5, 2008, at 2:10 PM, Justin wrote:
>>>
>>>  Thank you for this info.  I should add that our code tends to post a lot
>>>> of sends prior to the other side posting receives.  This causes a lot of
>>>> unexpected messages to exist.  Our code explicitly matches up all tags and
>>>> processors (that is we do not use MPI wild cards).  If we had a dead lock I
>>>> would think we would see it regardless of weather or not we cross the
>>>> roundevous threshold.  I guess one way to test this would be to to set this
>>>> threshold to 0.  If it then dead locks we would likely be able to track 
>>>> down
>>>> the deadlock.  Are there any other parameters we can send mpi that will 
>>>> turn
>>>> off buffering?
>>>>
>>>> Thanks,
>>>> Justin
>>>>
>>>> Brock Palen wrote:
>>>>
>>>>> When ever this happens we found the code to have a deadlock.  users
>>>>> never saw it until they cross the eager->roundevous threshold.
>>>>>
>>>>> Yes you can disable shared memory with:
>>>>>
>>>>> mpirun --mca btl ^sm
>>>>>
>>>>> Or you can try increasing the eager limit.
>>>>>
>>>>> ompi_info --param btl sm
>>>>>
>>>>> MCA btl: parameter "btl_sm_eager_limit" (current value:
>>>>>  "4096")
>>>>>
>>>>> You can modify this limit at run time,  I think (can't test it right
>>>>> now) it is just:
>>>>>
>>>>> mpirun --mca btl_sm_eager_limit 40960
>>>>>
>>>>> I think you can also in tweaking these values use env Vars in place of
>>>>> putting it all in the mpirun line:
>>>>>
>>>>> export OMPI_MCA_btl_sm_eager_limit=40960
>>>>>
>>>>> See:
>>>>> http://www.open-mpi.org/faq/?category=tuning
>>>>>
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> Center for Advanced Computing
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>>
>>>>>
>>>>>
>>>>> On Dec 5, 2008, at 12:22 PM, Justin wrote:
>>>>>
>>>>>  Hi,
>>>>>>
>>>>>> We are currently using OpenMPI 1.3 on Ranger for large processor jobs
>>>>>> (8K+).  Our code

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Lenny Verkhovsky
maybe it's related to https://svn.open-mpi.org/trac/ompi/ticket/1378  ??

On 12/5/08, Justin  wrote:
>
> The reason i'd like to disable these eager buffers is to help detect the
> deadlock better.  I would not run with this for a normal run but it would be
> useful for debugging.  If the deadlock is indeed due to our code then
> disabling any shared buffers or eager sends would make that deadlock
> reproduceable.In addition we might be able to lower the number of
> processors down.  Right now determining which processor is deadlocks when we
> are using 8K cores and each processor has hundreds of messages sent out
> would be quite difficult.
>
> Thanks for your suggestions,
> Justin
> Brock Palen wrote:
>
>> OpenMPI has differnt eager limits for all the network types, on your
>> system run:
>>
>> ompi_info --param btl all
>>
>> and look for the eager_limits
>> You can set these values to 0 using the syntax I showed you before. That
>> would disable eager messages.
>> There might be a better way to disable eager messages.
>> Not sure why you would want to disable them, they are there for
>> performance.
>>
>> Maybe you would still see a deadlock if every message was below the
>> threshold. I think there is a limit of the number of eager messages a
>> receving cpus will accept. Not sure about that though.  I still kind of
>> doubt it though.
>>
>> Try tweaking your buffer sizes,  make the openib  btl eager limit the same
>> as shared memory. and see if you get locks up between hosts and not just
>> shared memory.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>>
>>
>> On Dec 5, 2008, at 2:10 PM, Justin wrote:
>>
>>  Thank you for this info.  I should add that our code tends to post a lot
>>> of sends prior to the other side posting receives.  This causes a lot of
>>> unexpected messages to exist.  Our code explicitly matches up all tags and
>>> processors (that is we do not use MPI wild cards).  If we had a dead lock I
>>> would think we would see it regardless of weather or not we cross the
>>> roundevous threshold.  I guess one way to test this would be to to set this
>>> threshold to 0.  If it then dead locks we would likely be able to track down
>>> the deadlock.  Are there any other parameters we can send mpi that will turn
>>> off buffering?
>>>
>>> Thanks,
>>> Justin
>>>
>>> Brock Palen wrote:
>>>
 When ever this happens we found the code to have a deadlock.  users
 never saw it until they cross the eager->roundevous threshold.

 Yes you can disable shared memory with:

 mpirun --mca btl ^sm

 Or you can try increasing the eager limit.

 ompi_info --param btl sm

 MCA btl: parameter "btl_sm_eager_limit" (current value:
  "4096")

 You can modify this limit at run time,  I think (can't test it right
 now) it is just:

 mpirun --mca btl_sm_eager_limit 40960

 I think you can also in tweaking these values use env Vars in place of
 putting it all in the mpirun line:

 export OMPI_MCA_btl_sm_eager_limit=40960

 See:
 http://www.open-mpi.org/faq/?category=tuning


 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985



 On Dec 5, 2008, at 12:22 PM, Justin wrote:

  Hi,
>
> We are currently using OpenMPI 1.3 on Ranger for large processor jobs
> (8K+).  Our code appears to be occasionally deadlocking at random within
> point to point communication (see stacktrace below).  This code has been
> tested on many different MPI versions and as far as we know it does not
> contain a deadlock.  However, in the past we have ran into problems with
> shared memory optimizations within MPI causing deadlocks.  We can usually
> avoid these by setting a few environment variables to either increase the
> size of shared memory buffers or disable shared memory optimizations all
> together.   Does OpenMPI have any known deadlocks that might be causing 
> our
> deadlocks?  If are there any work arounds?  Also how do we disable shared
> memory within OpenMPI?
>
> Here is an example of where processors are hanging:
>
> #0  0x2b2df3522683 in mca_btl_sm_component_progress () from
> /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
> #1  0x2b2df2cb46bf in mca_bml_r2_progress () from
> /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
> #2  0x2b2df0032ea4 in opal_progress () from
> /opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
> #3  0x2b2ded0d7622 in ompi_request_default_wait_some () from
> /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
> #4  0x2b2ded109e34 in PMPI_Waitsome () from
> /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
>
>
> Thanks,
> Justin
> 

Re: [OMPI users] Hybrid program

2008-11-23 Thread Lenny Verkhovsky
Hi,

Sorry for not answering sooner,

In Open MPI 1.3 we added a paffinity mapping module.

The syntax is quite simple and flexible:

rank N=hostA slot=socket:core_range

rank M=hostB slot=cpu

see the following example:

ex:

#mpirun -rf rankfile_name ./app

#cat rankfile_name
rank 0=host1 slot=0
rank 1=host2 slot=0:*
rank 2=host3 slot=1:0,1
rank 3=host3 slot=1:2-3
rank 4=host1 slot=1:0,0:2

explanation:

Let's assume we have Quad core Dual CPU machines named host1,host2,host3

Using the rankfile above we get rank 0 running on host1, CPU #0 ( cat
/proc/cpuinfo shows which CPU is CPU #0 )

rank 1 will run on all cores of socket #0

rank 2 will run on host3 socket #1, cores 0,1

rank 3 will run on host3 socket #1, cores from #2 to #3

rank 4 will run on host1 socket1:core0 and socket0:core2

So, when using threads you should probably use slot=0:*; this way all threads
will run on all cores of socket #0 ( or any other socket you specify ),
or use a comma-separated list of exact pairs like rank 4 in the example above.

you can also use -mca paffinity_base_verbose 10 to see the mapping that took
place in the job.
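
For example, combining the rankfile above with the verbose mapping output:

#mpirun -rf rankfile_name -mca paffinity_base_verbose 10 ./app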

Best Regards.

Lenny.


On 11/20/08, Ralph Castain  wrote:
>
> At the very least, you would have to call these functions -after- MPI_Init
> so they could override what OMPI did.
>
>
> On Nov 20, 2008, at 8:03 AM, Gabriele Fatigati wrote:
>
>  And in the hybrid program MPi+OpenMP?
>> Are these considerations still good?
>>
>> 2008/11/20 Edgar Gabriel :
>>
>>> I don't think that they conflict with our paffinity module and setting.
>>> My
>>> understanding is that if you set a new affinity mask, it simply
>>> overwrites
>>> the previous setting. So in the worst case it voids the setting made by
>>> Open
>>> MPI, but I don't think that it should cause 'problems'. Admittedly, I
>>> haven't tried the library and the function calls yet, I just learned
>>> relatively recently about them...
>>>
>>> Thanks
>>> Edga
>>>
>>> Ralph Castain wrote:
>>>

 Interesting - learn something new every day! :-)

 How does this interact with OMPI's paffinity/maffinity assignments? With
 the rank/slot mapping and binding system?

 Should users -not- set paffinity if they include these numa calls in
 their
 code?

 Can we detect any potential conflict in OMPI and avoid setting
 paffinity_alone? Reason I ask: many systems set paffinity_alone in the
 default mca param file because they always assign dedicated nodes to
 users.
 While users can be told to be sure to turn it "off" when using these
 calls,
 it seems inevitable that they will forget - and complaints will appear.

 Thanks
 Ralph



 On Nov 20, 2008, at 7:34 AM, Edgar Gabriel wrote:

  if you look at recent versions of libnuma, there are two functions
> called
> numa_run_on_node() and numa_run_on_node_mask(), which allow
> thread-based
> assignments to CPUs
>
> Thanks
> Edgar
>
> Gabriele Fatigati wrote:
>
>>
>> Is there a way to assign one thread to one core? Also from code, not
>> necessary with OpenMPI option.
>> Thanks.
>> 2008/11/19 Stephen Wornom :
>>
>>>
>>> Gabriele Fatigati wrote:
>>>

 Ok,
 but in Ompi 1.3 how can i enable it?

  This may not be relevant, but I could not get a hybrid mpi+OpenMP
>>> code
>>> to
>>> work correctly.
>>> Would my problem be related to Gabriele's and perhaps fixed in
>>> openmpi
>>> 1.3?
>>> Stephen
>>>

 2008/11/18 Ralph Castain :

  I am afraid it is only available in 1.3 - we didn't backport it to
> the
> 1.2
> series
>
>
> On Nov 18, 2008, at 10:06 AM, Gabriele Fatigati wrote:
>
>
>  Hi,
>> how can i set "slot mapping" as you told me? With TASK GEOMETRY?
>> Or
>> is
>> a new 1.3 OpenMPI feature?
>>
>> Thanks.
>>
>> 2008/11/18 Ralph Castain :
>>
>>  Unfortunately, paffinity doesn't know anything about assigning
>>> threads
>>> to
>>> cores. This is actually a behavior of Linux, which only allows
>>> paffinity
>>> to
>>> be set at the process level. So, when you set paffinity on a
>>> process,
>>> you
>>> bind all threads of that process to the specified core(s). You
>>> cannot
>>> specify that a thread be given a specific core.
>>>
>>> In this case, your two threads/process are sharing the same core
>>> and
>>> thus
>>> contending for it. As you'd expect in that situation, one thread
>>> gets
>>> the
>>> vast majority of the attention, while the other thread is mostly
>>> idle.
>>>
>>> If you can upgrade to the beta 1.3 release, try using the slot

Re: [OMPI users] dual cores

2008-11-10 Thread Lenny Verkhovsky
you can also press "f" while "top" is running and choose option "j";
this way you will see which CPU was chosen, under the P column.
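
Another quick check, assuming a GNU procps ps ( the PSR column is the CPU each
process last ran on ):

ps -o pid,psr,comm -C hello_c
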
Lenny.

On Mon, Nov 10, 2008 at 7:38 AM, Hodgess, Erin  wrote:

> great!
>
> Thanks,
> Erin
>
>
> Erin M. Hodgess, PhD
> Associate Professor
> Department of Computer and Mathematical Sciences
> University of Houston - Downtown
> mailto: hodge...@uhd.edu
>
>
>
> -Original Message-
> From: users-boun...@open-mpi.org on behalf of Brock Palen
> Sent: Sun 11/9/2008 11:21 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] dual cores
>
>  Run 'top' For long running applications you should see 4 processes
> each at 50%  (4*50=200% two cpus).
>
> You are ok, your hello_c did what it should, each of thoese 'hello's
> could have came from any of the two cpus.
>
> Also if your only running on your local machine, you don't need a
> hostfile, and -byslot is meaningless in this case,
>
> mpirun -np 4 ./hello_c
>
> Would work just fine.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>
>
> On Nov 10, 2008, at 12:05 AM, Hodgess, Erin wrote:
>
> > Dear Open MPI gurus:
> >
> > I have just installed Open MPI this evening.
> >
> > I have a dual core laptop and I would like to have both cores running.
> >
> > Here is the following my-hosts file:
> > localhost slots=2
> >
> > and here is the command and output:
> >  mpirun --hostfile my-hosts -np 4 --byslot hello_c |sort
> > Hello, world, I am 0 of 4
> > Hello, world, I am 1 of 4
> > Hello, world, I am 2 of 4
> > Hello, world, I am 3 of 4
> > hodgesse@erinstoy:~/Desktop/openmpi-1.2.8/examples>
> >
> >
> > How do I know if both cores are running, please?
> >
> > thanks,
> > Erin
> >
> >
> > Erin M. Hodgess, PhD
> > Associate Professor
> > Department of Computer and Mathematical Sciences
> > University of Houston - Downtown
> > mailto: hodge...@uhd.edu
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Working with a CellBlade cluster

2008-10-27 Thread Lenny Verkhovsky
can you update me with the mapping or the way to get it from the OS on the
Cell.

thanks

On Thu, Oct 23, 2008 at 8:08 PM, Mi Yan  wrote:

> Lenny,
>
> Thanks.
> I asked the Cell/BE Linux Kernel developer to get the CPU mapping :) The
> mapping is fixed in current kernel.
>
> Mi
> "Lenny Verkhovsky" 
>
>
>
> *"Lenny Verkhovsky" *
> Sent by: users-boun...@open-mpi.org
>
> 10/23/2008 01:52 PM Please respond to
> Open MPI Users 
>
>
> To
>
> "Open MPI Users" 
> cc
>
>
> Subject
>
> Re: [OMPI users] Working with a CellBlade cluster
> According to
> https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 very soon,
> but you can download the trunk version http://www.open-mpi.org/svn/ and check
> if it works for you.
>
> how can you check mapping CPUs by OS , my cat /proc/cpuinfo shows very
> little info
> # cat /proc/cpuinfo
> processor : 0
> cpu : Cell Broadband Engine, altivec supported
> clock : 3200.00MHz
> revision : 48.0 (pvr 0070 3000)
> processor : 1
> cpu : Cell Broadband Engine, altivec supported
> clock : 3200.00MHz
> revision : 48.0 (pvr 0070 3000)
> processor : 2
> cpu : Cell Broadband Engine, altivec supported
> clock : 3200.00MHz
> revision : 48.0 (pvr 0070 3000)
> processor : 3
> cpu : Cell Broadband Engine, altivec supported
> clock : 3200.00MHz
> revision : 48.0 (pvr 0070 3000)
> timebase : 2666
> platform : Cell
> machine : CHRP IBM,0793-1RZ
>
>
>
> On Thu, Oct 23, 2008 at 3:00 PM, Mi Yan <*mi...@us.ibm.com*>
> wrote:
>
>Hi, Lenny,
>
>So rank file map will be supported in OpenMPI 1.3? I'm using
>    OpenMPI1.2.6 and did not find parameter "rmaps_rank_file_".
>Do you have idea when OpenMPI 1.3 will be available? OpenMPI 1.3 has
>quite a few features I'm looking for.
>
>Thanks,
>
>Mi
>"Lenny Verkhovsky" 
>
>
>
>
>   *"Lenny Verkhovsky" 
> <**lenny.verkhov...@gmail.com*
> *>*
> Sent by: 
> *users-boun...@open-mpi.org*
>
> 10/23/2008 05:48 AM
>
> Please respond to
> Open MPI Users <*us...@open-mpi.org* >
>  To
>
> "Open MPI Users" <*us...@open-mpi.org* >cc
> Subject
>
> Re: [OMPI users] Working with a CellBlade cluster
>
>
>Hi,
>
>
>If I understand you correctly the most suitable way to do it is by
>paffinity that we have in Open MPI 1.3 and the trank.
>how ever usually OS is distributing processes evenly between sockets by
>it self.
>
>There still no formal FAQ due to a multiple reasons but you can read
>how to use it in the attached scratch ( there were few name changings of 
> the
>params, so check with ompi_info )
>
>shared memory is used between processes that share same machine, and
>openib is used between different machines ( hostnames ), no special mca
>params are needed.
>
>Best Regards
>Lenny,
>
>
> On Sun, Oct 19, 2008 at 10:32 AM, Gilbert Grosdidier <*
>gro...@mail.cern.ch* > wrote:
>   Working with a CellBlade cluster (QS22), the requirement is to have
>  one
>  instance of the executable running on each socket of the blade
>  (there are 2
>  sockets). The application is of the 'domain decomposition' type,
>  and each
>  instance is required to often send/receive data with both the
>  remote blades and
>  the neighbor socket.
>
>  Question is : which specification must be used for the mca btl
>  component
>  to force 1) shmem type messages when communicating with this
>  neighbor socket,
>  while 2) using openib to communicate with the remote blades ?
>  Is '-mca btl sm,openib,self' suitable for this ?
>
>  Also, which debug flags could be used to crosscheck that the
>  messages are
>  _actually_ going thru the right channel for a given channel,
>  please ?
>
>  We are currently using OpenMPI 1.2.5 shipped with RHEL5.2
>  (ppc64).
>  Which version do you think is currently the most optimised for
>

Re: [OMPI users] Working with a CellBlade cluster

2008-10-23 Thread Lenny Verkhovsky
According to https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3,
very soon, but you can download the trunk version ( http://www.open-mpi.org/svn/ )
and check if it works for you.

How can you check the CPU mapping from the OS? My cat /proc/cpuinfo shows very
little info:
# cat /proc/cpuinfo
processor   : 0
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 48.0 (pvr 0070 3000)
processor   : 1
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 48.0 (pvr 0070 3000)
processor   : 2
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 48.0 (pvr 0070 3000)
processor   : 3
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 48.0 (pvr 0070 3000)
timebase: 2666
platform: Cell
machine : CHRP IBM,0793-1RZ



On Thu, Oct 23, 2008 at 3:00 PM, Mi Yan  wrote:

>  Hi, Lenny,
>
> So rank file map will be supported in OpenMPI 1.3? I'm using OpenMPI1.2.6
> and did not find parameter "rmaps_rank_file_".
> Do you have idea when OpenMPI 1.3 will be available? OpenMPI 1.3 has quite
> a few features I'm looking for.
>
> Thanks,
> Mi
> "Lenny Verkhovsky" 
>
>
>
> *"Lenny Verkhovsky" *
> Sent by: users-boun...@open-mpi.org
>
> 10/23/2008 05:48 AM   Please respond to
> Open MPI Users 
>
>
> To
>
> "Open MPI Users" 
> cc
>
>
> Subject
>
> Re: [OMPI users] Working with a CellBlade cluster
>
> Hi,
>
>
> If I understand you correctly the most suitable way to do it is by
> paffinity that we have in Open MPI 1.3 and the trank.
> how ever usually OS is distributing processes evenly between sockets by it
> self.
>
> There still no formal FAQ due to a multiple reasons but you can read how to
> use it in the attached scratch ( there were few name changings of the
> params, so check with ompi_info )
>
> shared memory is used between processes that share same machine, and openib
> is used between different machines ( hostnames ), no special mca params are
> needed.
>
> Best Regards
> Lenny,
>
>
>
>   On Sun, Oct 19, 2008 at 10:32 AM, Gilbert Grosdidier <*
> gro...@mail.cern.ch* > wrote:
>
>Working with a CellBlade cluster (QS22), the requirement is to have one
>instance of the executable running on each socket of the blade (there
>are 2
>sockets). The application is of the 'domain decomposition' type, and
>each
>instance is required to often send/receive data with both the remote
>blades and
>the neighbor socket.
>
>Question is : which specification must be used for the mca btl
>component
>to force 1) shmem type messages when communicating with this neighbor
>socket,
>while 2) using openib to communicate with the remote blades ?
>Is '-mca btl sm,openib,self' suitable for this ?
>
>Also, which debug flags could be used to crosscheck that the messages
>are
>_actually_ going thru the right channel for a given channel, please ?
>
>We are currently using OpenMPI 1.2.5 shipped with RHEL5.2 (ppc64).
>Which version do you think is currently the most optimised for these
>processors and problem type ? Should we go towards OpenMPI 1.2.8
>instead ?
>Or even try some OpenMPI 1.3 nightly build ?
>
>Thanks in advance for your help, Gilbert.
>
>___
>users mailing list*
>**us...@open-mpi.org* *
>
> **http://www.open-mpi.org/mailman/listinfo.cgi/users*<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>
> *(See attached file: RANKS_FAQ.doc)*
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Working with a CellBlade cluster

2008-10-23 Thread Lenny Verkhovsky
Hi,


If I understand you correctly, the most suitable way to do it is with the
paffinity support that we have in Open MPI 1.3 and the trunk.
However, the OS usually distributes processes evenly between sockets by
itself.

There is still no formal FAQ, due to multiple reasons, but you can read how to
use it in the attached draft ( there were a few name changes of the params, so
check with ompi_info ).

shared memory is used between processes that share the same machine, and
openib is used between different machines ( hostnames ); no special mca params
are needed.
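
As a rough sketch only ( blade host names are placeholders, and the QS22 core
numbering should be checked with cat /proc/cpuinfo ), a rankfile that puts one
rank on each socket of two blades could look like:

cat rankfile
rank 0=blade1 slot=0:*
rank 1=blade1 slot=1:*
rank 2=blade2 slot=0:*
rank 3=blade2 slot=1:*

mpirun -np 4 -rf rankfile -mca btl sm,openib,self ./app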

Best Regards
Lenny,




On Sun, Oct 19, 2008 at 10:32 AM, Gilbert Grosdidier wrote:

>  Working with a CellBlade cluster (QS22), the requirement is to have one
> instance of the executable running on each socket of the blade (there are 2
> sockets). The application is of the 'domain decomposition' type, and each
> instance is required to often send/receive data with both the remote blades
> and
> the neighbor socket.
>
>  Question is : which specification must be used for the mca btl component
> to force 1) shmem type messages when communicating with this neighbor
> socket,
> while 2) using openib to communicate with the remote blades ?
> Is '-mca btl sm,openib,self' suitable for this ?
>
>  Also, which debug flags could be used to crosscheck that the messages are
> _actually_ going thru the right channel for a given channel, please ?
>
>  We are currently using OpenMPI 1.2.5 shipped with RHEL5.2 (ppc64).
> Which version do you think is currently the most optimised for these
> processors and problem type ? Should we go towards OpenMPI 1.2.8 instead ?
> Or even try some OpenMPI 1.3 nightly build ?
>
>  Thanks in advance for your help,  Gilbert.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


RANKS_FAQ.doc
Description: MS-Word document


Re: [OMPI users] OpenMPI with openib partitions

2008-10-07 Thread Lenny Verkhovsky
Sorry, misunderstood the question,

thanks to Pasha, the right command line will be:

-mca btl openib,self -mca btl_openib_of_pkey_val 0x8109 -mca
btl_openib_of_pkey_ix 1

ex.

#mpirun -np 2 -H witch2,witch3 -mca btl openib,self -mca
btl_openib_of_pkey_val 0x8001 -mca btl_openib_of_pkey_ix 1 ./mpi_p1_4_TRUNK
-t lt
LT (2) (size min max avg) 1 3.443480 3.443480 3.443480

Best regards

Lenny.

On 10/6/08, Jeff Squyres  wrote:
>
> On Oct 5, 2008, at 1:22 PM, Lenny Verkhovsky wrote:
>
>  you should probably use -mca tcp,self  -mca btl_openib_if_include ib0.8109
>>
>>
> Really?  I thought we only took OpenFabrics device names in the
> openib_if_include MCA param...?  It looks like ib0.8109 is an IPoIB device
> name.
>
>
>  Lenny.
>>
>>
>> On 10/3/08, Matt Burgess  wrote:
>> Hi,
>>
>>
>> I'm trying to get openmpi working over openib partitions. On this cluster,
>> the partition number is 0x109. The ib interfaces are pingable over the
>> appropriate ib0.8109 interface:
>>
>> d2:/opt/openmpi-ib # ifconfig ib0.8109
>> ib0.8109  Link encap:UNSPEC  HWaddr
>> 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
>>  inet addr:10.21.48.2  Bcast:10.21.255.255  Mask:255.255.0.0
>>  inet6 addr: fe80::202:c902:26:ca01/64 Scope:Link
>>  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>  RX packets:16811 errors:0 dropped:0 overruns:0 frame:0
>>  TX packets:15848 errors:0 dropped:1 overruns:0 carrier:0
>>  collisions:0 txqueuelen:256
>>  RX bytes:102229428 (97.4 Mb)  TX bytes:102324172 (97.5 Mb)
>>
>>
>> I have tried the following:
>>
>> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca btl
>> openib,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 0x8109
>> -mca btl_openib_ib_pkey_ix 1 /cluster/pallas/x86_64-ib/IMB-MPI1
>>
>> but I just get a RETRY EXCEEDED ERROR. Is there a MCA parameter I am
>> missing?
>>
>> I was successful using tcp only:
>>
>> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca btl
>> tcp,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 0x8109
>> /cluster/pallas/x86_64-ib/IMB-MPI1
>>
>>
>>
>> Thanks,
>> Matt Burgess
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMPI with openib partitions

2008-10-05 Thread Lenny Verkhovsky
Hi,

you should probably use -mca tcp,self  -mca btl_openib_if_include ib0.8109

Lenny.

On 10/3/08, Matt Burgess  wrote:
>
> Hi,
>
>
> I'm trying to get openmpi working over openib partitions. On this cluster,
> the partition number is 0x109. The ib interfaces are pingable over the
> appropriate ib0.8109 interface:
>
> d2:/opt/openmpi-ib # ifconfig ib0.8109
> ib0.8109  Link encap:UNSPEC  HWaddr
> 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
>   inet addr:10.21.48.2  Bcast:10.21.255.255  Mask:255.255.0.0
>   inet6 addr: fe80::202:c902:26:ca01/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>   RX packets:16811 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:15848 errors:0 dropped:1 overruns:0 carrier:0
>   collisions:0 txqueuelen:256
>   RX bytes:102229428 (97.4 Mb)  TX bytes:102324172 (97.5 Mb)
>
>
> I have tried the following:
>
> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca btl
> openib,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 0x8109
> -mca btl_openib_ib_pkey_ix 1 /cluster/pallas/x86_64-ib/IMB-MPI1
>
> but I just get a RETRY EXCEEDED ERROR. Is there a MCA parameter I am
> missing?
>
> I was successful using tcp only:
>
> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile machinefile -mca btl
> tcp,self -mca btl_openib_max_btls 1 -mca btl_openib_ib_pkey_val 0x8109
> /cluster/pallas/x86_64-ib/IMB-MPI1
>
>
>
> Thanks,
> Matt Burgess
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] SM btl slows down bandwidth?

2008-08-13 Thread Lenny Verkhovsky
Hi,

just to try: can you run with np 2?

( the PingPong test is for 2 processes only )
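
For example ( keep whatever btl settings you were testing; sm,self and the IMB
path are shown here only as placeholders ):

mpirun -np 2 -mca btl sm,self ./IMB-MPI1 PingPong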

On 8/13/08, Daniël Mantione  wrote:
>
>
>
> On Tue, 12 Aug 2008, Gus Correa wrote:
>
> > Hello Daniel and list
> >
> > Could it be a problem with memory bandwidth / contention in multi-core?
>
>
> Yes, I believe we are somehow limited by memory performance. Here are
> some numbers from a dual Opteron 2352 system, which has much more memory
> bandwidth:
>
>
> #---
> # Benchmarking PingPong
> # #processes = 2
> # ( 6 additional processes waiting in MPI_Barrier)
> #---
> #bytes #repetitions  t[usec]  Mbytes/sec
>
>       0         1000     0.86        0.00
>       1         1000     0.97        0.98
>       2         1000     0.95        2.01
>       4         1000     0.96        3.97
>       8         1000     0.95        7.99
>      16         1000     0.96       15.85
>      32         1000     0.99       30.69
>      64         1000     0.97       63.09
>     128         1000     1.02      119.68
>     256         1000     1.18      207.25
>     512         1000     1.40      348.77
>    1024         1000     1.75      556.75
>    2048         1000     2.59      753.22
>    4096         1000     5.10      766.23
>    8192         1000     7.93      985.13
>   16384         1000    14.60     1070.57
>   32768         1000    27.92     1119.23
>   65536          640    46.67     1339.16
>  131072          320    86.03     1453.06
>  262144          160   163.16     1532.21
>  524288           80   310.01     1612.88
> 1048576           40   730.62     1368.69
> 2097152           20  1449.72     1379.57
> 4194304           10  2884.90     1386.53
>
> However, +/- 1200 MB/s (or +/- 1500 MB/s in the case of the AMD system) is not
> even close to the memory performance limits of these systems, so there
> should be room for optimization.
>
> After all, the openib btl manages to transfer the data from the memory of
> one process to the memory of another process just fine, with better
> performance.
>
>
> > It has been reported in many mailing lists (mpich, beowulf, etc).
> > Here it seems to happen in dual-processor dual-core with our memory
> intensive
> > programs.
>
>
> MPICH2 manages to get about 5GB/s in shared memory performance on the
> Xeon 5420 system.
>
>
> > Have you checked what happens to the shared memory runs as you
> > you increase the number of active cores/processes?
> > Would it help to set the processor affinity in the shared memory runs?
> >
> > http://www.open-mpi.org/faq/?category=building#build-paffinity
> > http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>
>
> Neither has any effect on the scores.
>
>
> Daniël
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Fail to install openmpi 1.2.5 on bladecenter with OFED 1.3

2008-08-13 Thread Lenny Verkhovsky
Hi,

Check in /usr/lib; it's usually the folder for 32-bit libraries.

I think OFED 1.3 already comes with Open MPI, so it should be installed by
default.

BTW, OFED 1.3.1 comes with Open MPI 1.2.6.

Lenny.
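
If you retry the 32-bit build, a configure line along these lines should pick
up the 32-bit verbs library explicitly (a sketch; /usr/lib as the 32-bit
libdir is an assumption that depends on how OFED was installed):

./configure --prefix=/usr/mpi/gcc/32bit --with-openib=/usr \
    --with-openib-libdir=/usr/lib CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 FCFLAGS=-m32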

On 8/12/08, Mohd Radzi Nurul Azri  wrote:
>
> Hi,
>
>
> Thanks for the prompt reply. This might be basic, but where are the 32-bit
> OFED libs typically? I think the default install prefix is /usr, and my guess
> is that the 64-bit libs are in /usr/lib64. Where do I look for the 32-bit OFED
> libs? I remember that passing the 32-bit build argument during the OFED build
> failed - will it still have installed the 32-bit OFED libs?
>
>
>
> On Wed, Aug 13, 2008 at 1:40 AM, Jeff Squyres  wrote:
>
>> You probably need to add
>> --with-openib-libdir=/path/to/your/32/bit/ofed/libs.  I'm guessing that the
>> system installed the 64 bit libs in the default location and the 32 bit libs
>> in a different location.  If that's the case, then --with-openib-libdir will
>> tell OMPI specifically where to look for those libs and use those instead.
>>
>>
>> On Aug 12, 2008, at 1:32 PM, Mohd Radzi Nurul Azri wrote:
>>
>>
>>> Hi,
>>>
>>>
>>> I've been trying to install openmpi 1.2.5 on my cluster system running
>>> RHEL 4 (x64) with OFED 1.3. I need openmpi 1.2.5 (32 bit) and OFED seems to
>>> only install 64 bit version. I tried to build OFED with 32 bit support but
>>> it failed so I figure it's best to just compile 32 bit openmpi. I followed
>>> the FAQ and few user experience on the web.
>>>
>>> I ran this command:
>>> ./configure --prefix=/usr/mpi/gcc/32bit --with-openib=/usr CFLAGS=-m32
>>> CXXFLAGS=-m32 FFLAGS=-m32 FCFLAGS=-m32
>>>
>>> and after few scrolling lines, it stops here:
>>> --- MCA component btl:openib (m4 configuration macro)
>>> checking for MCA component btl:openib compile mode... dso
>>> looking for header without includes
>>> checking infiniband/verbs.h usability... yes
>>> checking infiniband/verbs.h presence... yes
>>> checking for infiniband/verbs.h... yes
>>> looking for library without search path
>>> checking for ibv_open_device in -libverbs... no
>>> looking for library in lib
>>> checking for ibv_open_device in -libverbs... no
>>> looking for library in lib64
>>> checking for ibv_open_device in -libverbs... no
>>> checking for ibv_create_srq... no
>>> checking for ibv_get_device_list... no
>>> checking for ibv_resize_cq... no
>>> configure: WARNING: OpenFabrics support requested (via --with-openib) but
>>> not found.
>>> configure: WARNING: If you are using libibverbs v1.0 (i.e., OFED v1.0 or
>>> v1.1), you *MUST* have both the libsysfs headers and libraries installed.
>>> Later versions of libibverbs do not require libsysfs.
>>> configure: error: Aborting.
>>>
>>>
>>> What went wrong? From the error it says early OFED version which is not
>>> the one I'm using (running OFED 1.3 now).
>>>
>>> Any advice is greatly appreciated.
>>>
>>>
>>> --
>>> Thank you.
>>>
>>> azri
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Thank you.
>
> Nurul Azri Mohd Radzi
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Setting up Open MPI to run on multiple servers

2008-08-12 Thread Lenny Verkhovsky
you can also provide a full path to your mpi

#/usr/lib/openmpi/1.2.5-gcc/bin/mpiexec -n 2 ./a.out
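
Fixing the environment works too; a minimal sketch for .bash_profile, assuming
the install's lib directory sits next to the bin directory shown above:

export PATH=/usr/lib/openmpi/1.2.5-gcc/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib/openmpi/1.2.5-gcc/lib:$LD_LIBRARY_PATH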

On 8/12/08, jody  wrote:
>
> No.
> The PATH variable simply tells the system in which order the
> directories should be searched for executables.
>
> so in .bash_profile just add the line
>   PATH=/usr/lib/openmpi/1.2.5-gcc/bin:$PATH
> after the line
>   PATH=$PATH:$HOME/bin
>
> Then the system will search in /usr/lib/openmpi/1.2.5-gcc/bin before
> it will look
> in the directories it would have looked in anyway.
>
>
>
> Jody
>
>
> On Tue, Aug 12, 2008 at 11:59 AM, Rayne  wrote:
> > My .bash_profile and .bashrc on the server are exactly the same as that
> on my PC. However, I can run mpiexec without any problems just using my PC
> as a single node, i.e. without trying to login to other servers and using
> multiple nodes. I only get the errors on the server.
> >
> > In .bash_profile, I see
> >
> > PATH=$PATH:$HOME/bin
> >
> > If I change this, won't it affect other programs as well?
> >
> > Thank you.
> >
> > Regards,
> > Rayne
> >
> > --- On Tue, 12/8/08, jody  wrote:
> >
> >> From: jody 
> >> Subject: Re: [OMPI users] Setting up Open MPI to run on multiple servers
> >> To: lancer6...@yahoo.com, "Open MPI Users" 
> >> Date: Tuesday, 12 August, 2008, 5:23 PM
> >> What are the contents of your $PATH environment variable?
> >> Make sure that your Open-MPI folder
> >> (/usr/lib/openmpi/1.2.5-gcc/bin)
> >> precedes '/usr/bin' in $PATH,
> >> i.e.
> >> /usr/lib/openmpi/1.2.5-gcc/bin:/usr/bin
> >>
> >> then the Open-MPI version of mpirun or mpiexec will be used
> >> instead of
> >> the LAM-versions.
> >>
> >> This should also be the case on your other machines.
> >>
> >> BTW, since it seems you haven't correctly set your PATH
> >> variable, i
> >> suspect you have omitted
> >> to set LD_LIBRARY_PATH as well...
> >> see points 1,2 and 3 in
> >> http://www.open-mpi.org/faq/?category=running
> >>
> >> Jody
> >>
> >> On Tue, Aug 12, 2008 at 11:10 AM, Rayne
> >>  wrote:
> >> > Hi,
> >> >
> >> > I looked for any folders with 'lam', and found
> >> 2, under /usr/lib/lam and /etc/lam. I don't know if it
> >> means LAM was previously installed, because my PC also has
> >> /usr/lib/lam, although the contents are different. I renamed
> >> the 2 folders, and got the "*** Oops -- I cannot open
> >> the LAM help file." error below instead.
> >> >
> >> > I tried 'which mpiexec', and it gave me
> >> /usr/bin/mpiexec. I checked the mpiexec there and it's
> >> actually a Perl script, and I believe I installed OpenMPI in
> >> /usr/lib64/openmpi/1.2.5-gcc/
> >> >
> >> > So I tried mpirun instead and it gave me the following
> >> message:
> >> >
> >> > "*** Oops -- I cannot open the LAM help file.
> >> > *** I tried looking for it in the following places:
> >> > ***
> >> > ***   $HOME/lam-helpfile
> >> > ***   $HOME/lam-7.0.6-helpfile
> >> > ***   $HOME/etc/lam-helpfile
> >> > ***   $HOME/etc/lam-7.0.6-helpfile
> >> > ***   $LAMHELPDIR/lam-helpfile
> >> > ***   $LAMHELPDIR/lam-7.0.6-helpfile
> >> > ***   $LAMHOME/etc/lam-helpfile
> >> > ***   $LAMHOME/etc/lam-7.0.6-helpfile
> >> > ***   $SYSCONFDIR/lam-helpfile
> >> > ***   $SYSCONFDIR/lam-7.0.6-helpfile
> >> > ***
> >> > *** You were supposed to get help on the program
> >> "MPI"
> >> > *** about the topic "no-lamd"
> >> > ***
> >> > *** Sorry!"
> >> >
> >> > Firstly, how do I change the settings such that
> >> mpiexec points to the mpiexec in my installation folder,
> >> which I believe should be
> >> > /usr/lib/openmpi/1.2.5-gcc/bin/mpiexec, and the
> >> mpiexec there seems to be a shortcut that points to
> >> /usr/lib/openmpi/1.2.5-gcc/bin/orterun. Would this help?
> >> While I'm at it, it seems that mpirun, which is
> >> /usr/bin/mpirun currently, should also point to
> >> /usr/lib/openmpi/1.2.5-gcc/bin/mpirun, which also is a
> >> shortcut to /usr/lib/openmpi/1.2.5-gcc/bin/orterun.
> >> >
> >> > Thank you.
> >> >
> >> > Regards,
> >> > Rayne
> >> >
> >> > --- On Tue, 12/8/08, jody 
> >> wrote:
> >> >
> >> >> From: jody 
> >> >> Subject: Re: [OMPI users] Setting up Open MPI to
> >> run on multiple servers
> >> >> To: lancer6...@yahoo.com, "Open MPI
> >> Users" 
> >> >> Date: Tuesday, 12 August, 2008, 3:38 PM
> >> >> Hi Ryan
> >> >> Another thing:
> >> >> Have you checked if the mpiexec you call is really
> >> the one
> >> >> from your
> >> >> Open-MPI installation?
> >> >>
> >> >> Try 'which mpiexec' to find out.
> >> >>
> >> >> Jody
> >> >>
> >> >> On Tue, Aug 12, 2008 at 9:36 AM, jody
> >> >>  wrote:
> >> >> > Hi Ryan
> >> >> >
> >> >> > The message "Lamnodes Failed!"
> >> seems to
> >> >> indicate that you still have a
> >> >> > LAM/MPI installation somewhere.
> >> >> > You should get rid of that first.
> >> >> >
> >> >> > Jody
> >> >> >
> >> >> > On Tue, Aug 12, 2008 at 9:00 AM, Rayne
> >> >>  wrote:
> >> >> >> Hi, thanks for your reply.
> >> >> >>
> >> >> >> I did what you said, set up the
> >> password-less ssh,
> >> >> nfs etc, and put the IP address of the server in
> >> the d

Re: [OMPI users] Pathscale compiler and C++ bindings

2008-08-04 Thread Lenny Verkhovsky
SLES 10 SP1

On 8/1/08, Scott Beardsley  wrote:
>
> we might be running different OS's.  I'm running RHEL 4U4
>>
>
> CentOS 5.2 here
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMPI 1.4 nightly

2008-07-31 Thread Lenny Verkhovsky
Try using only openib.

Make sure you use a nightly build after r19092.
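
A sketch of such a run (the process count and binary are placeholders):

mpirun -np 2 -mca btl openib,self ./hello_world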

On 7/31/08, Gabriele Fatigati  wrote:
>
> Mm, I've tried disabling shared memory but the problem remains. Is that
> normal?
>
> 2008/7/31 Jeff Squyres 
>
>> There is very definitely a shared memory bug on the trunk at the moment
>> that can cause hangs like this:
>>
>>https://svn.open-mpi.org/trac/ompi/ticket/1378
>>
>> That being said, the v1.4 nightly is our normal development head, so all
>> the normal rules and disclaimers apply (it's *generally* stable, but
>> sometimes things break).
>>
>>
>> On Jul 31, 2008, at 10:27 AM, Gabriele Fatigati wrote:
>>
>>
>>> Dear OpenMPI users,
>>> I have installed an OpenMPI 1.4 nightly on an IBM BLADE system with
>>> Infiniband. I have some problems with MPI applications. A simple MPI Hello
>>> World doesn't work. After dispatch, every CPU runs at over 100% but does
>>> nothing. The jobs appear locked.
>>>
>>> I compiled with
>>>
>>>  --enable-mpi-threads
>>>  --enable-ft-thread
>>>  --with-ft=cr
>>> -with-blcr=/prod/tools/blcr/0.7.1/gnu--4.1.2
>>>
>>> (and other, but less important).
>>>
>>> Where is the problem? Is this version very unstable?
>>> --
>>> Gabriele Fatigati
>>>
>>> CINECA Systems & Tecnologies Department
>>>
>>> Supercomputing Group
>>>
>>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>>
>>> www.cineca.it Tel: +39 051 6171722
>>>
>>> g.fatig...@cineca.it
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
> --
> Gabriele Fatigati
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.it Tel: +39 051 6171722
>
> g.fatig...@cineca.it
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Strange problem with 1.2.6

2008-07-14 Thread Lenny Verkhovsky
Maybe it's related to #1378, "PML ob1 deadlock for ping/ping"?
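
For reference, the tcp-only runs described further down can be forced either on
the command line or through $HOME/.openmpi/mca-params.conf (a sketch; the
process count and binary are placeholders):

mpirun -np 8 -mca btl tcp,self ./app

or, equivalently, one line in $HOME/.openmpi/mca-params.conf:

btl = tcp,self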

On 7/14/08, Jeff Squyres  wrote:
>
> What application is it?  The majority of the message passing engine did not
> change in the 1.2 series; we did add a new option into 1.2.6 for disabling
> early completion:
>
>
> http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
>
> See if that helps you out.
>
> Note that I don't think many (any?) of us developers monitor the beowulf
> list.  Too much mail in our INBOXes already... :-(
>
>
> On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:
>
>  Hi folks:
>>
>>  I am running into a strange problem with Open-MPI 1.2.6, built using
>> gcc/g++ and intel ifort 10.1.015, atop an OFED stack (1.1-ish).  The problem
>> appears to be that if I run using the tcp btl, disabling sm and openib, the
>> run completes successfully (on several different platforms), and does so
>> repeatably.
>>
>>  Similarly, if I enable either openib or sm btl, the run does not
>> complete, hanging at different places.
>>
>>  An strace of the master thread while it is hanging shows it in a tight
>> loop
>>
>> Process 15547 attached - interrupt to quit
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
>> {fd=8, events=POLLIN}, {fd=9, events=
>> POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
>> {fd=8, events=POLLIN}, {fd=9, events=
>> POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>>
>> The code ran fine about 18 months ago with earlier OpenMPI.  This is
>> identical source and data to what is known to work, and demonstrated to work
>> on a few different platforms.
>>
>> When I posed the question on Beowulf, some suggested turning off sm and
>> openib. The run does indeed complete repeatedly when we do so. The suggestion
>> was that there was some sort of buffer size issue on the sm device.
>>
>> Turning off sm and tcp, leaving only openib, also appears to loop forever.
>>
>> So, with all this, are there any sort of tunables that I should be playing
>> with?
>>
>> I tried adjusting a few things by setting some MCA parameters in
>> $HOME/.openmpi/mca-params.conf , but this had no effect (and the mpirun
>> claimed it was going to ignore those anyway).
>>
>> Any clues?  Thanks.
>>
>> Joe
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: land...@scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>>  http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax  : +1 866 888 3112
>> cell : +1 734 612 4615
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>