Re: [OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread Ralph Castain
It applies to both. In the rsh/ssh launcher, there is a limit on how many 
concurrent ssh/rsh sessions we have open at any one time. This is required due 
to OS limitations. As each daemon completes its launch, it "daemonizes" and 
closes the ssh/rsh session, thus enabling another daemon to be launched.

We have launched very large clusters with ssh/rsh without problem.

My best guess here is that Lydia has the -do-not-daemonize flag (or mca param) 
set somewhere, perhaps for debug purposes so that stdout/stderr will be 
forwarded by ssh/rsh. Unfortunately, that means the session doesn't get closed, 
and blocks the launch from completing. We are supposed to detect that situation 
and output an error message before aborting.
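
The behaviour described above can be pictured with a toy model. This is only a sketch of the reasoning, not Open MPI code: the function name daemons_launched and the node counts are made up for illustration, and the real launcher may behave differently in detail.

```shell
#!/bin/sh
# Toy model, not Open MPI code: how many daemons get launched when at most
# "limit" ssh/rsh sessions may be open at the same time.
#   daemonize=1: each session closes once its daemon is up, freeing a slot.
#   daemonize=0: sessions stay open, so slots are never freed.
daemons_launched() {
    nodes=$1
    limit=$2
    daemonize=$3
    if [ "$daemonize" -eq 1 ] || [ "$nodes" -le "$limit" ]; then
        echo "$nodes"    # every node eventually gets a session slot
    else
        echo "$limit"    # launch stalls once all slots are held open
    fi
}

# Hypothetical node counts, chosen only for illustration.
echo "daemonizing,     400 nodes, limit 256:  $(daemons_launched 400 256 1)"
echo "not daemonizing, 400 nodes, limit 256:  $(daemons_launched 400 256 0)"
echo "not daemonizing, 400 nodes, limit 1024: $(daemons_launched 400 1024 0)"
```

In this model, raising the limit above the node count hides the stall without removing its cause, which would be consistent with a larger plm_rsh_num_concurrent value appearing to fix the problem.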

However, without more info from her, there is nothing more I can do.


On Dec 14, 2010, at 1:53 PM, Gilbert Grosdidier wrote:

> Hello Ralph,
> 
> I wonder: does the plm_rsh_num_concurrent parameter apply ONLY to rsh,
> or to either ssh or rsh, depending on plm_rsh_agent?
> 
> Thanks,  Best,   G.
> 
> 
> On 14/12/2010 18:30, Ralph Castain wrote:
>> That's a big cluster to be starting with rsh! :-)
>> 
>> When you say it won't start, do you mean that it hangs? Or does it fail with 
>> some error message? How many nodes are involved (this is the important 
>> number, not the number of cores)?
>> 
>> Also, what version are you using?
>> 
>> 
>> On Dec 14, 2010, at 9:10 AM, Lydia Heck wrote:
>> 
>>> About 9 months ago we had a new installation with a system of 1800 cores 
>>> and at the time we found that jobs with more than 1028 cores would not 
>>> start. At the time a colleague found that setting
>>> 
>>> OMPI_MCA_plm_rsh_num_concurrent=256
>>> 
>>> helped with the problem.
>>> 
>>> We have now increased our processor count to more than 2700 cores and a job 
>>> with 2,500 processes does not start.
>>> 
>>> Is there any advice?
>>> 
>>> Best wishes,
>>> 
>>> Lydia Heck
>>> --
>>> Dr E L Heck
>>> Senior Computer Manager
>>> 
>>> University of Durham Institute for Computational Cosmology
>>> Ogden Centre
>>> Department of Physics South Road
>>> 
>>> DURHAM, DH1 3LE United Kingdom
>>> 
>>> e-mail: lydia.h...@durham.ac.uk
>>> 
>>> Tel.: + 44 191 - 334 3628
>>> Fax.: + 44 191 - 334 3645
>>> ___
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> -- 
> Regards,   Gilbert.
> 
> --
> *-*
>  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
>  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
>  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
>  B.P. 34, F-91898 Orsay Cedex (FRANCE)
> *-*
> 




Re: [OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread Gilbert Grosdidier

Hello Ralph,

I wonder: does the plm_rsh_num_concurrent parameter apply ONLY to rsh,
or to either ssh or rsh, depending on plm_rsh_agent?

Thanks,  Best,   G.


On 14/12/2010 18:30, Ralph Castain wrote:

> That's a big cluster to be starting with rsh! :-)
>
> When you say it won't start, do you mean that it hangs? Or does it fail with
> some error message? How many nodes are involved (this is the important
> number, not the number of cores)?
>
> Also, what version are you using?
>
>
> On Dec 14, 2010, at 9:10 AM, Lydia Heck wrote:
>
>> About 9 months ago we had a new installation with a system of 1800 cores
>> and at the time we found that jobs with more than 1028 cores would not
>> start. At the time a colleague found that setting
>>
>> OMPI_MCA_plm_rsh_num_concurrent=256
>>
>> helped with the problem.
>>
>> We have now increased our processor count to more than 2700 cores and a job
>> with 2,500 processes does not start.
>>
>> Is there any advice?
>>
>> Best wishes,
>>
>> Lydia Heck


--
 Regards,   Gilbert.




Re: [OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread John Hearns
On 14 December 2010 17:32, Lydia Heck  wrote:
>
> I have experimented a bit more and found that if I set
>
> OMPI_MCA_plm_rsh_num_concurrent=1024
>
> a job with more than 2,500 processes will start and run.
>
> However, when I searched the Open MPI web site for the variable, I could
> not find any mention of it.

Lydia, a quick search finds this page:
http://docs.sun.com/source/820-3176-10/appb-mca.html

It may be out of date, but it does describe the parameters.
What is your setting for plm_rsh_agent (i.e. are you using ssh or rsh)?
Also, have you tried setting plm_rsh_debug?
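
One way to answer both questions on the affected system is sketched below. It assumes the parameters were set through OMPI_MCA_* environment variables; values set elsewhere (for example in an MCA parameter file) will not show up in the first part. The helper name show_mca_env is made up for this example. The ompi_info call only runs if the tool is on the PATH, and newer releases may need an extra --level option to list every parameter.

```shell
#!/bin/sh
# Report an MCA parameter as seen in the environment, or "(not set)".
show_mca_env() {
    name="OMPI_MCA_$1"
    eval "value=\${$name:-}"
    echo "$name=${value:-(not set)}"
}

show_mca_env plm_rsh_agent
show_mca_env plm_rsh_debug
show_mca_env plm_rsh_num_concurrent

# If Open MPI is installed, list the rsh/ssh launcher's parameters together
# with their current values and help text.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param plm rsh
fi
```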


Re: [OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread Lydia Heck


I have experimented a bit more and found that if I set

OMPI_MCA_plm_rsh_num_concurrent=1024

a job with more than 2,500 processes will start and run.

However, when I searched the Open MPI web site for the variable, I could not
find any mention of it.
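
For reference, an MCA parameter such as this one can be given either as an environment variable, as above, or on the mpirun command line. A minimal sketch follows; the process count and the program name ./my_app are placeholders, and the mpirun command is only printed, not executed.

```shell
#!/bin/sh
# 1. Environment variable: an MCA parameter <name> can be supplied to
#    Open MPI as OMPI_MCA_<name>.
export OMPI_MCA_plm_rsh_num_concurrent=1024

# 2. Equivalent command-line form, built as a string and printed only.
#    ./my_app and -np 2500 are placeholders for the real job.
launch_cmd="mpirun --mca plm_rsh_num_concurrent 1024 -np 2500 ./my_app"

echo "environment:  OMPI_MCA_plm_rsh_num_concurrent=$OMPI_MCA_plm_rsh_num_concurrent"
echo "command line: $launch_cmd"
```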


Best wishes,
Lydia Heck




Message: 15
Date: Tue, 14 Dec 2010 16:10:01 + (GMT)
From: Lydia Heck 
Subject: [OMPI users] jobs with more that 2,500 processes will not
even start
To: us...@open-mpi.org
Message-ID:

Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII


About 9 months ago we had a new installation with a system of 1800 cores and at
the time we found that jobs with more than 1028 cores would not start. At the
time a colleague found that setting

OMPI_MCA_plm_rsh_num_concurrent=256

helped with the problem.

We have now increased our processor count to more than 2700 cores and a job with
2,500 processes does not start.

Is there any advice?

Best wishes,

Lydia Heck