Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:

> No, we do not have a firewall turned on. I can run smaller 96-slave cases
> with ln10 and ln13 included in the slave list.
> 
> Could there be another reason for this to fail?

What is in "procgroup"? Is it a single application?

Offhand, there is nothing in OMPI that would explain the problem. The only 
possibility I can think of would be if your "procgroup" file contains more than 
128 applications in it.
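
For reference, an mpirun --app file lists one application context per line,
each line carrying its own mpirun options. A minimal sketch of what a sane
"procgroup" might look like (program names and host are illustrative):

# procgroup: one application context per line
-np 1 -host ln01 ./master
-np 238 ./slave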

> 
> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, April 27, 2011 2:18 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
> 
> Perhaps a firewall? All it is telling you is that mpirun couldn't
> establish TCP communications with the daemon on ln10.
> 
> 
> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
> 
>> Hi,
>> I am getting an "oob-tcp: Communication retries exceeded" error
>> message when I run an MPI code with 238 slaves:
>> 
>> 
>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>> 
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered
>> an error:
>> 
>> Error name: Unknown error: 1
>> Node: ln10
>> 
>> when attempting to start process rank 234.
>> 
>> --------------------------------------------------------------------------
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>> orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>> orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>> exceeded.  Can not communicate with peer
>> 
>> Any help would be greatly appreciated.
>> 
>> Sincerely,
>> 
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>> 
>> 




Re: [OMPI users] Need help building OpenMPI with Intel v12.0 compilers on Linux

2011-04-27 Thread Tru Huynh
On Thu, Apr 28, 2011 at 12:46:27AM +0200, Tru Huynh wrote:
> On Thu, Apr 21, 2011 at 06:35:16PM -0400, Jeff Squyres wrote:
> > It's normal and expected for there to be lots of errors in config.log.  
> > 
> > There's a bunch of tests in configure that are designed to succeed on some 
> > systems and fail on others.  
> > 
> > So don't read anything into the failures that you see in config.log -- 
> > unless configure itself fails.  Then we generally go look at the *last* 
> > failures in config.log to start backtracking to figure out what went wrong.
> > 
> 
> for what it's worth, this works fine for me on CentOS 5 x86_64:
> ./configure --prefix=/c5/shared/openmpi/1.4.3/sge/6.2u4/intel/12.2011.3.174 \
>   --with-sge CC=icc FC=ifort CXX=icpc F77=ifort && make && make check && make install
> 
The above was with:
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, 
Version 12.0 Build 20110309
Copyright (C) 1985-2011 Intel Corporation.  All rights reserved.

I have just retried with the initial XE version:
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, 
Version 12.0 Build 20101006
Copyright (C) 1985-2010 Intel Corporation.  All rights reserved.

builds and passes the check too.

my 2 cents

Tru
-- 
Dr Tru Huynh  | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:t...@pasteur.fr | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France  


Re: [OMPI users] Need help building OpenMPI with Intel v12.0 compilers on Linux

2011-04-27 Thread Tru Huynh
On Thu, Apr 21, 2011 at 06:35:16PM -0400, Jeff Squyres wrote:
> It's normal and expected for there to be lots of errors in config.log.  
> 
> There's a bunch of tests in configure that are designed to succeed on some 
> systems and fail on others.  
> 
> So don't read anything into the failures that you see in config.log -- unless 
> configure itself fails.  Then we generally go look at the *last* failures in 
> config.log to start backtracking to figure out what went wrong.
> 

for what it's worth, this works fine for me on CentOS 5 x86_64:
./configure --prefix=/c5/shared/openmpi/1.4.3/sge/6.2u4/intel/12.2011.3.174 \
  --with-sge CC=icc FC=ifort CXX=icpc F77=ifort && make && make check && make install

Cheers,

Tru
-- 
Dr Tru Huynh  | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:t...@pasteur.fr | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France  


Re: [OMPI users] srun and openmpi

2011-04-27 Thread Jeff Squyres
On Apr 27, 2011, at 3:39 PM, Ralph Castain wrote:

> Nope, nope nope...in this mode of operation, we are using -static- ports.

Er.. right.  Sorry -- my bad for not reading the full context here... ignore 
what I said...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 1:27 PM, Jeff Squyres wrote:

> On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:
> 
>> Actually, I understood you correctly. I'm just saying that I find no 
>> evidence in the code that we try three times before giving up. What I see is 
>> a single attempt to bind the port - if it fails, then we abort. There is no 
>> parameter to control that behavior.
>> 
>> So if the OS hasn't released the port by the time a new job starts on that 
>> node, then it will indeed abort if the job was unfortunately given the same 
>> port reservation.
> 
> FWIW, the OS may be trying multiple times under the covers, but as far 
> as OMPI is concerned, we're just trying once.
> 
> OMPI asks for whatever port the OS has open (i.e., we pass in 0 when asking 
> for a specific port number, and the OS fills it in for us).  If it gives us 
> back a port that isn't actually available, that would be really surprising.

Nope, nope nope...in this mode of operation, we are using -static- ports.

The problem here is that srun is incorrectly handing out the same port 
reservation to the next job, causing the port binding to fail because the last 
job's binding hasn't yet timed out.
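
For anyone trying to confirm that, the reservation srun hands a step shows up
in the step's environment (assuming slurm.conf carries MpiParams=ports=... so
that --resv-ports is honored):

# show the port range SLURM reserved for this job step
srun -n 1 --resv-ports env | grep SLURM_STEP_RESV_PORTS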


> 
> If you have a bajiollion short jobs running, I wonder if there's some kind of 
> race condition occurring where some MPI processes are getting messages from 
> the wrong mpirun.  And then things go downhill from there.  
> 
> I can't immediately imagine how that would happen, but maybe there's some 
> kind of weird race condition in there somewhere...?  We pass specific IP 
> addresses and ports around on the command line, though, so I don't quite see 
> how that would happen...
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-27 Thread Sindhi, Waris PW
No, we do not have a firewall turned on. I can run smaller 96-slave cases
with ln10 and ln13 included in the slave list.

Could there be another reason for this to fail?


Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Wednesday, April 27, 2011 2:18 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded

Perhaps a firewall? All it is telling you is that mpirun couldn't
establish TCP communications with the daemon on ln10.
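
One quick way to rule out filtering is to probe raw TCP reachability from the
node running mpirun to ln10 (the port number is illustrative; the OOB binds
ephemeral ports by default, so cluster-internal traffic needs to be open):

# from ln13, probe a TCP port on ln10; "connection refused" means packets
# get through, while a timeout suggests a firewall is dropping them
nc -vz ln10 1024
# and look for filtering rules on either node
/sbin/iptables -L -n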


On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:

> Hi,
> I am getting an "oob-tcp: Communication retries exceeded" error
> message when I run an MPI code with 238 slaves:
> 
> 
> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
> 
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered
> an error:
> 
> Error name: Unknown error: 1
> Node: ln10
> 
> when attempting to start process rank 234.
> 
> --------------------------------------------------------------------------
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
> orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
> orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded.  Can not communicate with peer
> 
> Any help would be greatly appreciated.
> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> 



Re: [OMPI users] srun and openmpi

2011-04-27 Thread Jeff Squyres
On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:

> Actually, I understood you correctly. I'm just saying that I find no evidence 
> in the code that we try three times before giving up. What I see is a single 
> attempt to bind the port - if it fails, then we abort. There is no parameter 
> to control that behavior.
> 
> So if the OS hasn't released the port by the time a new job starts on that 
> node, then it will indeed abort if the job was unfortunately given the same 
> port reservation.

FWIW, the OS may be trying multiple times under the covers, but as far as 
OMPI is concerned, we're just trying once.

OMPI asks for whatever port the OS has open (i.e., we pass in 0 when asking for 
a specific port number, and the OS fills it in for us).  If it gives us back a 
port that isn't actually available, that would be really surprising.

If you have a bajiollion short jobs running, I wonder if there's some kind of 
race condition occurring where some MPI processes are getting messages from the 
wrong mpirun.  And then things go downhill from there.  

I can't immediately imagine how that would happen, but maybe there's some kind 
of weird race condition in there somewhere...?  We pass specific IP addresses 
and ports around on the command line, though, so I don't quite see how that 
would happen...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 2:46 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:
>
>> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>>>
>>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>>>
 Was this ever committed to the OMPI src as something not having to be
 run outside of OpenMPI, but as part of the PSM setup that OpenMPI
 does?
>>>
>>> Not that I know of - I don't think the PSM developers ever looked at it.
>>>

 I'm having some trouble getting Slurm/OpenMPI to play nice with the
 setup of this key.  Namely, with slurm you cannot export variables
 from the --prolog of an srun, only from a --task-prolog;
 unfortunately, if you use a task-prolog, each rank gets a different
 key, which doesn't work.
 
 I'm also guessing that each unique mpirun needs its own psm key, not
 one for the whole system, so I can't just make it a permanent
 parameter somewhere else.
 
 Also, I recall reading somewhere that the --resv-ports parameter that
 OMPI uses from slurm to choose a list of ports to use for TCP comms
 tries to lock a port from the pool three times before giving up.
>>>
>>> Had to look back at the code - I think you misread this. I can find no 
>>> evidence in the code that we try to bind that port more than once.
>>
>> Perhaps I misstated: I don't believe you're trying to bind to the same
>> port twice during the same session.  I believe the code re-uses
>> similar ports from session to session.  What I believe happens (but
>> I could be totally wrong) is that the previous session releases the port,
>> but Linux isn't quite done with it when the new session tries to bind to
>> the port, in which case it tries three times and then fails the job.
>
> Actually, I understood you correctly. I'm just saying that I find no evidence 
> in the code that we try three times before giving up. What I see is a single 
> attempt to bind the port - if it fails, then we abort. There is no parameter 
> to control that behavior.
>
> So if the OS hasn't released the port by the time a new job starts on that 
> node, then it will indeed abort if the job was unfortunately given the same 
> port reservation.

Oh, okay, sorry...



Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 12:38 PM, Michael Di Domenico wrote:

> On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>> 
>> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>> 
>>> Was this ever committed to the OMPI src as something not having to be
>>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>>> does?
>> 
>> Not that I know of - I don't think the PSM developers ever looked at it.
>> 
>>> 
>>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>>> setup of this key.  Namely, with slurm you cannot export variables
>>> from the --prolog of an srun, only from a --task-prolog;
>>> unfortunately, if you use a task-prolog, each rank gets a different
>>> key, which doesn't work.
>>> 
>>> I'm also guessing that each unique mpirun needs its own psm key, not
>>> one for the whole system, so I can't just make it a permanent
>>> parameter somewhere else.
>>> 
>>> Also, I recall reading somewhere that the --resv-ports parameter that
>>> OMPI uses from slurm to choose a list of ports to use for TCP comms
>>> tries to lock a port from the pool three times before giving up.
>> 
>> Had to look back at the code - I think you misread this. I can find no 
>> evidence in the code that we try to bind that port more than once.
> 
> Perhaps I misstated: I don't believe you're trying to bind to the same
> port twice during the same session.  I believe the code re-uses
> similar ports from session to session.  What I believe happens (but
> I could be totally wrong) is that the previous session releases the port,
> but Linux isn't quite done with it when the new session tries to bind to
> the port, in which case it tries three times and then fails the job.

Actually, I understood you correctly. I'm just saying that I find no evidence 
in the code that we try three times before giving up. What I see is a single 
attempt to bind the port - if it fails, then we abort. There is no parameter to 
control that behavior.

So if the OS hasn't released the port by the time a new job starts on that 
node, then it will indeed abort if the job was unfortunately given the same 
port reservation.


> 




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
On Wed, Apr 27, 2011 at 2:25 PM, Ralph Castain  wrote:
>
> On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:
>
>> Was this ever committed to the OMPI src as something not having to be
>> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
>> does?
>
> Not that I know of - I don't think the PSM developers ever looked at it.
>
>>
>> I'm having some trouble getting Slurm/OpenMPI to play nice with the
>> setup of this key.  Namely, with slurm you cannot export variables
>> from the --prolog of an srun, only from a --task-prolog;
>> unfortunately, if you use a task-prolog, each rank gets a different
>> key, which doesn't work.
>>
>> I'm also guessing that each unique mpirun needs its own psm key, not
>> one for the whole system, so I can't just make it a permanent
>> parameter somewhere else.
>>
>> Also, I recall reading somewhere that the --resv-ports parameter that
>> OMPI uses from slurm to choose a list of ports to use for TCP comms
>> tries to lock a port from the pool three times before giving up.
>
> Had to look back at the code - I think you misread this. I can find no 
> evidence in the code that we try to bind that port more than once.

Perhaps I misstated: I don't believe you're trying to bind to the same
port twice during the same session.  I believe the code re-uses
similar ports from session to session.  What I believe happens (but I
could be totally wrong) is that the previous session releases the port,
but Linux isn't quite done with it when the new session tries to bind to
the port, in which case it tries three times and then fails the job.
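
That window is easy to observe from the shell: a closed TCP socket lingers in
TIME_WAIT (typically on the order of a minute) before the kernel releases the
address, and a bind() to the same port during that window fails with
EADDRINUSE. Run right after a job exits, something like this would show it:

# lingering sockets left behind by the previous job
netstat -tan | grep TIME_WAIT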



Re: [OMPI users] srun and openmpi

2011-04-27 Thread Ralph Castain

On Apr 27, 2011, at 10:09 AM, Michael Di Domenico wrote:

> Was this ever committed to the OMPI src as something not having to be
> run outside of OpenMPI, but as part of the PSM setup that OpenMPI
> does?

Not that I know of - I don't think the PSM developers ever looked at it.

> 
> I'm having some trouble getting Slurm/OpenMPI to play nice with the
> setup of this key.  Namely, with slurm you cannot export variables
> from the --prolog of an srun, only from a --task-prolog;
> unfortunately, if you use a task-prolog, each rank gets a different
> key, which doesn't work.
> 
> I'm also guessing that each unique mpirun needs its own psm key, not
> one for the whole system, so I can't just make it a permanent
> parameter somewhere else.
> 
> Also, I recall reading somewhere that the --resv-ports parameter that
> OMPI uses from slurm to choose a list of ports to use for TCP comms
> tries to lock a port from the pool three times before giving up.

Had to look back at the code - I think you misread this. I can find no evidence 
in the code that we try to bind that port more than once.

> 
> Can someone tell me where that parameter is set? I'd like to set it to
> a higher value.  We're seeing issues where running a large number of
> short sruns sequentially is causing some of the mpiruns in the
> stream to be killed because they could not lock the ports.
> 
> I suspect that because the lag between when the port is actually closed
> in Linux and when OMPI re-opens a new port is very short, we're trying
> three times and giving up.  I have more than enough ports in the
> resv-ports list (30k), but I suspect there is some random re-use being
> done and it's failing.
> 
> thanks
> 
> 
> On Mon, Jan 3, 2011 at 10:00 AM, Jeff Squyres  wrote:
>> Yo Ralph --
>> 
>> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197. 
>>  Do you want to add a blurb in README about it, and/or have this executable 
>> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
>> ompi-psm-keygen)?
>> 
>> Right now, it's only compiled as part of "make check" and not installed, 
>> right?
>> 
>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>> 
>>> Run the program only once - it can be in the prolog of the job if you like. 
>>> The output value needs to be in the env of every rank.
>>> 
>>> You can reuse the value as many times as you like - it doesn't have to be 
>>> unique for each job. There is nothing magic about the value itself.
>>> 
>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>> 
 How early does this need to run? Can I run it as part of a task
 prolog, or does it need to be the shell env for each rank?  And does
 it need to run on one node or all the nodes in the job?
 
 On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
> Well, I couldn't do it as a patch - proved too complicated as the psm 
> system looks for the value early in the boot procedure.
> 
> What I can do is give you the attached key generator program. It outputs 
> the envar required to run your program. So if you run the attached 
> program and then export the output into your environment, you should be 
> okay. Looks like this:
> 
> $ ./psm_keygen
> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
> $
> 
> You compile the program with the usual mpicc.
> 
> Let me know if this solves the problem (or not).
> 




[OMPI users] OpenMPI out of band TCP retry exceeded

2011-04-27 Thread Sindhi, Waris PW
Hi,
 I am getting an "oob-tcp: Communication retries exceeded" error
message when I run an MPI code with 238 slaves:


/opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
--mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
/usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup

--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered
an error:

Error name: Unknown error: 1
Node: ln10

when attempting to start process rank 234.

--------------------------------------------------------------------------
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
orted/orted_comm.c at line 130
[ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
orted/orted_comm.c at line 130
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer
[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
exceeded.  Can not communicate with peer

Any help would be greatly appreciated.

Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486




Re: [OMPI users] srun and openmpi

2011-04-27 Thread Michael Di Domenico
Was this ever committed to the OMPI src as something not having to be
run outside of OpenMPI, but as part of the PSM setup that OpenMPI
does?

I'm having some trouble getting Slurm/OpenMPI to play nice with the
setup of this key.  Namely, with slurm you cannot export variables
from the --prolog of an srun, only from a --task-prolog;
unfortunately, if you use a task-prolog, each rank gets a different
key, which doesn't work.

I'm also guessing that each unique mpirun needs its own psm key, not
one for the whole system, so I can't just make it a permanent
parameter somewhere else.

Also, I recall reading somewhere that the --resv-ports parameter that
OMPI uses from slurm to choose a list of ports to use for TCP comms
tries to lock a port from the pool three times before giving up.

Can someone tell me where that parameter is set? I'd like to set it to
a higher value.  We're seeing issues where running a large number of
short sruns sequentially is causing some of the mpiruns in the
stream to be killed because they could not lock the ports.

I suspect that because the lag between when the port is actually closed
in Linux and when OMPI re-opens a new port is very short, we're trying
three times and giving up.  I have more than enough ports in the
resv-ports list (30k), but I suspect there is some random re-use being
done and it's failing.

thanks


On Mon, Jan 3, 2011 at 10:00 AM, Jeff Squyres  wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.  
> Do you want to add a blurb in README about it, and/or have this executable 
> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed, 
> right?
>
> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>
>> Run the program only once - it can be in the prolog of the job if you like. 
>> The output value needs to be in the env of every rank.
>>
>> You can reuse the value as many times as you like - it doesn't have to be 
>> unique for each job. There is nothing magic about the value itself.
>>
>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>
>>> How early does this need to run? Can I run it as part of a task
>>> prolog, or does it need to be the shell env for each rank?  And does
>>> it need to run on one node or all the nodes in the job?
>>>
>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
 Well, I couldn't do it as a patch - proved too complicated as the psm 
 system looks for the value early in the boot procedure.

 What I can do is give you the attached key generator program. It outputs 
 the envar required to run your program. So if you run the attached program 
 and then export the output into your environment, you should be okay. 
 Looks like this:

 $ ./psm_keygen
 OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
 $

 You compile the program with the usual mpicc.

 Let me know if this solves the problem (or not).
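
Since the key generator prints a single NAME=value pair on stdout, one way to
follow the advice above is to capture it once per job and export it before
launching (a sketch; the psm_keygen path and application name are
illustrative):

# run once per job, e.g. from the job script or an srun --prolog
export $(./psm_keygen)
mpirun ./my_app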



Re: [OMPI users] RES: RES: RES: Error with ARM target

2011-04-27 Thread Jeff Squyres
FWIW, my ARM contact tells me that he uses a native ARM Linux distro explicitly 
to avoid all the complexities of cross-compiling...  :-\
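
For anyone who does need to cross-compile, the usual autoconf convention
applies: --build names the machine you compile on, --host names the machine
the binaries will run on, and both must be full triples. A sketch (the
toolchain prefix and cache-file name are illustrative; the cache file is
where the pre-answered configure results would go):

./configure --build=x86_64-unknown-linux-gnu \
            --host=arm-unknown-linux-gnueabi \
            --cache-file=arm-cross.cache \
            CC=arm-unknown-linux-gnueabi-gcc CXX=arm-unknown-linux-gnueabi-g++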


On Apr 25, 2011, at 11:29 AM, Jeff Squyres wrote:

> There's some extra special mojo that needs to be supplied when 
> cross-compiling Open MPI (e.g., a file that specifies all the ./configure 
> answers for tests that it can't run in a cross-compiling environment).  
> 
> The wiki page Ralph was talking about was referring to instructions on how to 
> create this answer file.  I can't seem to find it, either.
> 
> Brian -- any idea what happened to that wiki page?
> 
> I've pinged our ARM contact to see how he compiles OMPI for the ARM platform.
> 
> 
> 
> On Apr 25, 2011, at 10:00 AM, Fernando Dutra Fagundes Macedo wrote:
> 
>> I tried 1.5.2 and 1.5.3.
>> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Barrett, Brian W
>> Sent: Monday, April 25, 2011 10:53 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] RES: RES: Error with ARM target
>> 
>> --host is the correct option, but the host string "arm" is not valid; it 
>> needs to be a proper triple, something like "x86_64-unknown-linux-gnu".
>> Either way, ARM was not a supported platform in the 1.4.x release; the 
>> earliest version of Open MPI to support the ARM platform was 1.5.2.
>> 
>> Brian
>> 
>> On 4/25/11 7:46 AM, "Ralph Castain"  wrote:
>> 
>>> I think you've reversed the role of host and target then. "host" is the 
>>> machine type you are compiling on, and "target" is the machine you are 
>>> compiling for.
>>> 
>>> There used to be a wiki page on cross-compiling OMPI, but I couldn't 
>>> locate it this morning - I'm sure it's still there, but it is hard to 
>>> find. Try searching the OMPI web site for info.
>>> 
>>> 
>>> On Apr 25, 2011, at 5:09 AM, Fernando Dutra Fagundes Macedo wrote:
>>> 
 I'm trying to cross-compile.
 
 -----Original Message-----
 From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
 Behalf Of Ralph Castain
 Sent: Saturday, April 23, 2011 5:21 PM
 To: Open MPI Users
 Subject: Re: [OMPI users] RES: Error with ARM target
 
 Don't give it a host argument - unless you are trying to 
 cross-compile, it should figure it out for itself
 
 
 On Apr 23, 2011, at 1:25 PM, Fernando Dutra Fagundes Macedo wrote:
 
> Correcting:
> 
> I tried 1.5.2 and 1.5.3.
> 
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org on behalf of Fernando Dutra Fagundes
> Macedo
> Sent: Sat 4/23/2011 4:16 PM
> To: us...@open-mpi.org
> Subject: [OMPI users] Error with ARM target
> 
> Hi,
> 
> I am trying to use Open MPI on a Friendly ARM board, but I can't
> compile it for the ARM target. I'm trying to configure the package this way:
> 
> ./configure -host="arm"
> 
> What can I do to make it work?
> 
> More information:
> 
> Error: "configure: error: No atomic primitives available for 
> arm-unknown-none"
> Version: 1.4.2 and 1.4.3
> 
> Thanks in advance,
> Fernando Macedo
> 
>> 
>> 
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-27 Thread Brock Palen
Argh, our messed-up environment with three generations of InfiniBand bit us.
Setting openib_cpc_include to rdmacm causes IB not to be used on our old DDR
fabric on some of our hosts.  Note that jobs will never span our old DDR IB
and our new QDR stuff, where rdmacm does work.

I am doing some testing with:
export OMPI_MCA_btl_openib_cpc_include=rdmacm,oob,xoob
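
An alternative to the environment variable, if rdmacm should be the site-wide
default, is the MCA parameter file, one "name = value" pair per line (the path
assumes a default installation prefix):

# $prefix/etc/openmpi-mca-params.conf
btl_openib_cpc_include = rdmacm,oob,xoob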

What I want to know is: is there a way to tell mpirun to 'dump all resolved
MCA settings', or something similar?
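
For reference, ompi_info reports every MCA parameter together with its current
value, and the mpi_show_mca_params MCA parameter (where supported) makes the
job itself print the values it resolved at startup (./my_app is a placeholder):

# list all MCA parameters and their current values
ompi_info --param all all
# or have the job report what it resolved
mpirun -mca mpi_show_mca_params all ./my_app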

The error we get, which I think is expected when we set only rdmacm, is:
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   nyx0665.engin.umich.edu
  Local device: mthca0
  Local port:   1
  CPCs attempted:   rdmacm
--------------------------------------------------------------------------

Again, I think this is expected on this older hardware. 

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Apr 22, 2011, at 10:23 AM, Brock Palen wrote:

> On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote:
> 
>> 
>> On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
>> 
>>> Given that part of our cluster is TCP-only, openib wouldn't even start up on 
>>> those hosts
>> 
>> That is correct - it would have no impact on those hosts
>> 
>>> and this would be ignored on hosts with IB adaptors?  
>> 
>> Ummm...not sure I understand this one. The param -will- be used on hosts 
>> with IB adaptors because that is what it is controlling.
>> 
>> However, it -won't- have any impact on hosts without IB adaptors, which is 
>> what I suspect you meant to ask?
> 
> Correct, that was a typo. I am going to add the environment variable to our 
> OpenMPI modules so rdmacm is our default for now. Thanks!
> 
>> 
>> 
>>> 
>>> Just checking thanks!
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> On Apr 21, 2011, at 6:21 PM, Jeff Squyres wrote:
>>> 
 Over IB, I'm not sure there is much of a drawback.  It might be slightly 
 slower to establish QP's, but I don't think that matters much.
 
 Over iWARP, rdmacm can cause connection storms as you scale to thousands 
 of MPI processes.
 
 
 On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
 
> We managed to have another user hit the bug that causes collectives (this 
> time MPI_Bcast()) to hang on IB, which was fixed by setting:
> 
> btl_openib_cpc_include rdmacm
> 
> My question is: if we set this as the default on our system with an 
> environment variable, does it introduce any performance or other issues we 
> should be aware of?
> 
> Is there a reason we should not use rdmacm?
> 
> Thanks!
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 
 
 
 -- 
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
>>> 
>>> 
>> 
>> 
> 
> 



