Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje wrote:

>  On 11/16/2010 01:31 PM, Reuti wrote:
>
> Hi Ralph,
>
> Am 16.11.2010 um 15:40 schrieb Ralph Castain:
>
>
>  2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> does this automatically to constrain the procs to running on only those cores.
>
>  This is another "bug/feature" in SGE: it's a matter of discussion whether 
> the shepherd should get exactly one core for each call (in case you use more 
> than one `qrsh` per node), or *all* assigned cores (which we need right 
> now, as the processes in Open MPI will be forks of the orte daemon). I filed 
> an issue about such a situation a long time ago; a "limit_to_one_qrsh_per_host 
> yes/no" setting in the PE definition would do it (this setting should then 
> also change the core allocation of the master process):
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
>
> I believe this is indeed the crux of the issue
>
>  fantastic to share the same view.
>
>
>  FWIW, I think I agree too.
>
>   3. tell OMPI to --bind-to-core.
>
> In other words, tell SGE to allocate a certain number of cores on each node, 
> but to bind each proc to all of them (i.e., don't bind a proc to a specific 
> core). I'm pretty sure that is a standard SGE option today (at least, I know 
> it used to be). I don't believe any patch or devel work is required (to 
> either SGE or OMPI).
>
>  When you use a fixed allocation_rule and a matching -binding request it will 
> work today. But any other case won't be distributed in the correct way.
>
> Is it possible to not include the -binding request? If SGE is told to use a 
> fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
> the orted see
> itself bound to two specific cores on each node?
>
>  When you leave out the -binding, all jobs are allowed to run on any core.
>
>
>
>  We would then be okay as the spawned children of orted would inherit its 
> binding. Just don't tell mpirun to bind the processes and the threads of 
> those MPI procs will be able to operate across the provided cores.
>
> Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
> -binding given), but doesn't bind the orted to any two specific cores? If so, 
> then that would be a problem as the orted would think itself unconstrained. 
> If I understand the thread correctly, you're saying that this is what happens 
> today - true?
>
>  Exactly. It won't apply any binding at all, and orted would think it is 
> unconstrained, i.e. limited only by the number of slots it should use there.
>
>
>  So I guess I have a question for Ralph.  I thought, and this might be
> mixing some of the ideas Jeff and I have been talking about, that when an RM
> executes the orted with a bound set of resources (i.e. cores), the orted would
> bind the individual processes to a subset of those bound resources.  Is this
> not really the case for the 1.4.x branch?  I believe it is the case for the
> trunk, based on Jeff's refactoring.
>

You are absolutely correct, Terry, and the 1.4 release series does include
the proper code. The point here, though, is that SGE binds the orted to a
single core, even though other cores are also allocated. So the orted
detects an external binding of one core, and binds all its children to that
same core.

What I had suggested to Reuti was to not include the -binding flag to SGE in
the hopes that SGE would then bind the orted to all the allocated cores.
However, as I feared, SGE in that case doesn't bind the orted at all - and
so we assume the entire node is available for our use.

This is an SGE issue. We need them to bind the orted to -all- the allocated
cores (and only those cores) in order for us to operate correctly.
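
For anyone who wants to verify what the orted actually inherits, a quick check 
(a sketch, assuming a Linux execution host with util-linux's taskset installed; 
the script and binary names are made up) is to print the affinity mask the job 
script gets from the shepherd before mpirun starts:

  #!/bin/sh
  # submitted as in Chris's example: qsub -pe mpi 8 -binding linear:2 myScript.com
  # print the core set this script, and hence mpirun, inherited from the SGE shepherd
  taskset -cp $$
  mpirun --report-bindings ./my_mpi_app

With the -binding request, taskset should report just the bound cores; leave 
-binding off and it will typically report every core on the node, which matches 
the unconstrained case described above.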



>
>
> --
> Oracle
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
>  Oracle * - Performance Technologies*
>  95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] mpi-io, fortran, going crazy... (ADENDA)

2010-11-16 Thread Gus Correa

Ricardo Reis wrote:


 and sorry to be such a nuisance...

 but any reason for an MPI-IO "wall" between 2.0 and 2.1 GB?


Salve Ricardo Reis!

Is this "wall" perhaps the 2GB Linux file size limit on 32-bit systems?

Gus
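
Another candidate, if the filesystem handles large files fine: a byte offset or 
element count held in a default 4-byte Fortran INTEGER overflows at 2**31 bytes, 
which lands exactly at a 2.0-2.1 GB "wall". A minimal sketch of the pattern that 
avoids this (an assumption about the cause, not a diagnosis of test_io.f90; file 
and variable names are made up), with the offset declared as MPI_OFFSET_KIND:

  program big_write
    use mpi
    implicit none
    integer :: ierr, fh, i, n
    integer(kind=MPI_OFFSET_KIND) :: offset   ! 8-byte offset, not a default INTEGER
    real(kind=8), allocatable :: buf(:)

    call MPI_INIT(ierr)
    n = 1024*1024                             ! 1 Mi doubles = 8 MiB per write
    allocate(buf(n))
    buf = 1.0d0

    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'big.dat', &
         MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

    do i = 0, 319                             ! 320 x 8 MiB = 2.5 GiB in total
       ! compute the byte offset in 64-bit arithmetic; i*n*8 evaluated in
       ! default integers would wrap once it passes 2 GiB
       offset = int(i, MPI_OFFSET_KIND) * int(n, MPI_OFFSET_KIND) * 8_MPI_OFFSET_KIND
       call MPI_FILE_WRITE_AT(fh, offset, buf, n, MPI_DOUBLE_PRECISION, &
            MPI_STATUS_IGNORE, ierr)
    end do

    call MPI_FILE_CLOSE(fh, ierr)
    call MPI_FINALIZE(ierr)
  end program big_write

If the offsets or sizes in the failing code are computed in default integers 
(e.g. nx*ny*nz*8), promoting those expressions to MPI_OFFSET_KIND is usually 
the fix.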


 (1 mpi process)

 best,

 Ricardo Reis

 'Non Serviam'

 PhD candidate @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 http://www.lasef.ist.utl.pt

 Cultural Instigator @ Rádio Zero
 http://www.radiozero.pt

 Keep them Flying! Ajude a/help Aero Fénix!

 http://www.aeronauta.com/aero.fenix

 http://www.flickr.com/photos/rreis/

 contacts:  gtalk: kyriu...@gmail.com  skype: kyriusan

   < sent with alpine 2.00 >




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] mpi-io, fortran, going crazy... (ADENDA)

2010-11-16 Thread Ricardo Reis


 and sorry to be such a nuisance...

 but any reason for an MPI-IO "wall" between 2.0 and 2.1 GB?

 (1 mpi process)

 best,

 Ricardo Reis

 'Non Serviam'

 PhD candidate @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 http://www.lasef.ist.utl.pt

 Cultural Instigator @ Rádio Zero
 http://www.radiozero.pt

 Keep them Flying! Ajude a/help Aero Fénix!

 http://www.aeronauta.com/aero.fenix

 http://www.flickr.com/photos/rreis/

 contacts:  gtalk: kyriu...@gmail.com  skype: kyriusan

   < sent with alpine 2.00 >

Re: [OMPI users] mpi-io, fortran, going crazy...

2010-11-16 Thread Ricardo Reis


 In my last email...

 I forgot to add:

 It's a 12 GB machine and the file should be around 2.5 GB

 I'm using mpirun -np 1

 And it writes without problem if I try a file of 250Mb, for instance

 so it also seems to be a size-related problem

 I'm using the 'native' type for writing...

 ideas?




 Ricardo Reis

 'Non Serviam'

 PhD candidate @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 http://www.lasef.ist.utl.pt

 Cultural Instigator @ Rádio Zero
 http://www.radiozero.pt

 Keep them Flying! Ajude a/help Aero Fénix!

 http://www.aeronauta.com/aero.fenix

 http://www.flickr.com/photos/rreis/

 contacts:  gtalk: kyriu...@gmail.com  skype: kyriusan

   < sent with alpine 2.00 >

[OMPI users] mpi-io, fortran, going crazy...

2010-11-16 Thread Ricardo Reis


 Hi all

 I have been banging my head on a wall for three days trying to make a 
simple fortran mpi-io program work.


 I'm using gfortran 4.4, Open MPI 1.4.1 (it's a Debian box)

 I have other codes that work with MPI-IO without any problem, but for some 
reason I can't grasp, this one... doesn't write!


 the code is here:

 http://aero.ist.utl.pt/~rreis/test_io.f90

 can some kind soul just look at it and give some input?

 or, simply, also point me to where the meaning of Fortran error no. 3 is 
explained?
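
 Not sure which runtime message "error 3" corresponds to, but it is often more 
 useful to translate the MPI error code itself. A minimal, hypothetical sketch, 
 assuming the failure is in one of the MPI-IO calls; file errors default to 
 MPI_ERRORS_RETURN, so the code comes back in ierr rather than aborting:

   program probe_err
     use mpi
     implicit none
     integer :: ierr, fh, msglen
     character(len=MPI_MAX_ERROR_STRING) :: msg
     real(kind=8) :: buf(4) = 1.0d0

     call MPI_INIT(ierr)
     call MPI_FILE_OPEN(MPI_COMM_WORLD, 'probe.dat', &
          MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
     call MPI_FILE_WRITE(fh, buf, 4, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
     if (ierr /= MPI_SUCCESS) then
        ! translate the MPI error code into a readable message
        call MPI_ERROR_STRING(ierr, msg, msglen, ierr)
        print *, 'MPI-IO error: ', msg(1:msglen)
     end if
     call MPI_FILE_CLOSE(fh, ierr)
     call MPI_FINALIZE(ierr)
   end program probe_err

 The same check dropped after each MPI_FILE_* call in the failing program should 
 show which call is unhappy and why.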


 best and many thanks for your time,


 Ricardo Reis

 'Non Serviam'

 PhD candidate @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 http://www.lasef.ist.utl.pt

 Cultural Instigator @ Rádio Zero
 http://www.radiozero.pt

 Keep them Flying! Ajude a/help Aero Fénix!

 http://www.aeronauta.com/aero.fenix

 http://www.flickr.com/photos/rreis/

 contacts:  gtalk: kyriu...@gmail.com  skype: kyriusan

   < sent with alpine 2.00 >

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 01:31 PM, Reuti wrote:

Hi Ralph,

Am 16.11.2010 um 15:40 schrieb Ralph Castain:


2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core for each call (in case you use more than one `qrsh` per node), or *all* 
assigned cores (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). I filed an issue about such a situation a long time ago; a 
"limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do it (this setting should 
then also change the core allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

I believe this is indeed the crux of the issue

fantastic to share the same view.


FWIW, I think I agree too.

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

Is it possible to not include the -binding request? If SGE is told to use a 
fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
the orted see
itself bound to two specific cores on each node?

When you leave out the -binding, all jobs are allowed to run on any core.



We would then be okay as the spawned children of orted would inherit its 
binding. Just don't tell mpirun to bind the processes and the threads of those 
MPI procs will be able to operate across the provided cores.

Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
-binding given), but doesn't bind the orted to any two specific cores? If so, 
then that would be a problem as the orted would think itself unconstrained. If 
I understand the thread correctly, you're saying that this is what happens 
today - true?

Exactly. It won't apply any binding at all, and orted would think it is 
unconstrained, i.e. limited only by the number of slots it should use there.

So I guess I have a question for Ralph.  I thought, and this might be 
mixing some of the ideas Jeff and I have been talking about, that when an 
RM executes the orted with a bound set of resources (i.e. cores), the 
orted would bind the individual processes to a subset of those bound 
resources.  Is this not really the case for the 1.4.x branch?  I believe it 
is the case for the trunk, based on Jeff's refactoring.


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Hi Ralph,

Am 16.11.2010 um 15:40 schrieb Ralph Castain:

> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> > does this automatically to constrain the procs to running on only those 
> > cores.
> 
> This is another "bug/feature" in SGE: it's a matter of discussion whether 
> the shepherd should get exactly one core for each call (in case you use more 
> than one `qrsh` per node), or *all* assigned cores (which we need right 
> now, as the processes in Open MPI will be forks of the orte daemon). I filed 
> an issue about such a situation a long time ago; a "limit_to_one_qrsh_per_host 
> yes/no" setting in the PE definition would do it (this setting should then also 
> change the core allocation of the master process):
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
> 
> I believe this is indeed the crux of the issue

fantastic to share the same view.


> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each 
> > node, but to bind each proc to all of them (i.e., don't bind a proc to a 
> > specific core). I'm pretty sure that is a standard SGE option today (at 
> > least, I know it used to be). I don't believe any patch or devel work is 
> > required (to either SGE or OMPI).
> 
> When you use a fixed allocation_rule and a matching -binding request it will 
> work today. But any other case won't be distributed in the correct way.
> 
> Is it possible to not include the -binding request? If SGE is told to use a 
> fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
> the orted see 
> itself bound to two specific cores on each node?

When you leave out the -binding, all jobs are allowed to run on any core.


> We would then be okay as the spawned children of orted would inherit its 
> binding. Just don't tell mpirun to bind the processes and the threads of 
> those MPI procs will be able to operate across the provided cores.
> 
> Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
> -binding given), but doesn't bind the orted to any two specific cores? If so, 
> then that would be a problem as the orted would think itself unconstrained. 
> If I understand the thread correctly, you're saying that this is what happens 
> today - true?

Exactly. It won't apply any binding at all, and orted would think it is 
unconstrained, i.e. limited only by the number of slots it should use there.

-- Reuti


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell

On 16 Nov 2010, at 17:25, Terry Dontje wrote:
>>> 
>> Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe 
>> mpi 8 -binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
>> ras_gridengine_verbose 100 --report-bindings ./unterm':
>> 
>> [exec4:17384] System has detected external process binding to cores 0022
>> [exec4:17384] ras:gridengine: JOB_ID: 59352
>> [exec4:17384] ras:gridengine: PE_HOSTFILE: 
>> /usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
>> [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> 
>> 
>> 
> Is that all that came out?  I would have expected some output from each 
> process after the orted forked the processes but before the exec of unterm.

Yes.  It appears that if orted detects binding done by an external process, then 
this is all you get.  Without the GE-enforced binding, you get:

[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to 
cpus 0001
[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],1] to 
cpus 0002
[exec7:06781] [[23443,0],2] odls:default:fork binding child [[23443,1],3] to 
cpus 0001
[exec2:24160] [[23443,0],1] odls:default:fork binding child [[23443,1],2] to 
cpus 0001
[exec6:30097] [[23443,0],4] odls:default:fork binding child [[23443,1],5] to 
cpus 0001
[exec5:02736] [[23443,0],6] odls:default:fork binding child [[23443,1],7] to 
cpus 0001
[exec1:30779] [[23443,0],5] odls:default:fork binding child [[23443,1],6] to 
cpus 0001
[exec3:12818] [[23443,0],3] odls:default:fork binding child [[23443,1],4] to 
cpus 0001
.


C
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Am 16.11.2010 um 17:35 schrieb Terry Dontje:

> On 11/16/2010 10:59 AM, Reuti wrote:
>> Am 16.11.2010 um 15:26 schrieb Terry Dontje:
>> 
>> 
> 
> 1. allocate a specified number of cores on each node to your job
> 
> 
 this is currently the bug in the "slot <=> core" relation in SGE, which 
 has to be removed, updated or clarified. For now slot and core count are 
 out of sync AFAICS.
 
 
>>> Technically this isn't a bug but a gap in the allocation rule.  I think the 
>>> solution is a new allocation rule.
>>> 
>> Yes, you can phrase it this way. But what do you mean by "new allocation 
>> rule"? 
> The proposal is to have a slot allocation rule that forces the number of cores 
> allocated on each node to equal the number of slots.

Yep. But then you would end up with $round_robin_cores, $fill_up_cores, 
$pe_slots_cores and with a fixed value "4 cores". Maybe an additional flag 
would be more suitable.


>> The slot allocation should follow the specified cores? 
>> 
> The other way around I think.

Yep, agreed.


> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> does this automatically to constrain the procs to running on only those 
> cores.
> 
> 
 This is another "bug/feature" in SGE: it's a matter of discussion whether 
 the shepherd should get exactly one core for each call (in case you use more 
 than one `qrsh` per node), or *all* assigned cores (which we need 
 right now, as the processes in Open MPI will be forks of the orte daemon). 
 I filed an issue about such a situation a long time ago; a 
 "limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do it (this 
 setting should then also change the core allocation of the master process):
 
 
 
 http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
>>> Isn't it almost required to have the shepherd bind to all the cores so that 
>>> the orted inherits that binding?
>>> 
>> Yes, for orted. But if you want to have any other (legacy) application which 
>> uses `qrsh` N times to an exechost when you got N slots there, then only 
>> one core should be bound to each of the started shepherds.
>> 
>> 
> Blech.  Not sure of the solution for that but I see what you are saying now 
> :-).

:-)


> 3. tell OMPI to --bind-to-core.
> 
> In other words, tell SGE to allocate a certain number of cores on each 
> node, but to bind each proc to all of them (i.e., don't bind a proc to a 
> specific core). I'm pretty sure that is a standard SGE option today (at 
> least, I know it used to be). I don't believe any patch or devel work is 
> required (to either SGE or OMPI).
> 
> 
 When you use a fixed allocation_rule and a matching -binding request it 
 will work today. But any other case won't be distributed in the correct 
 way.
 
 
>>> Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?
>>> 
>> You posted the two cases yesterday. Do we agree that both cases aren't 
>> correct, or do you think it's a correct allocation for both cases? Even if 
>> it could be "repaired" in Open MPI, it would be better to fix the generated 
>> 'pe' PE hostfile and 'set' allocation, i.e. the "slot <=> cores" relation.
>> 
>> 
>> 
> So I am not a GE type of guy, but from what I've been led to believe, what 
> happened is correct (in some form of correct).  That is, in case one we asked 
> for a core allocation of 1 core per node, and in the other case a core 
> allocation of 2 cores.  That is what we were given.  I am not sure that the 
> fact that we distributed the slots in a non-uniform manner is GE's fault.  Note 
> I can understand where it may seem non-intuitive and not nice for people 
> wanting to do things like this.
>>> In the original case of 7 nodes and processes, if we do -binding pe 
>>> linear:2 and add -bind-to-core to mpirun, I'd actually expect the processes 
>>> on 6 of the nodes to bind to one core, and the 7th node, with 2 processes, 
>>> to have each of those processes bound to different cores on the same machine.
>>> 
>> Yes, possibly it could be repaired this way (for now I have no free machines 
>> to play with). But then the "reserved" cores by the "-binding pe linear:2" 
>> are lost for other processes on these 6 nodes, and the slot count gets out 
>> of sync with slots.
>> 
> Right, if you want to rightsize the amount of cores allocated to slots 
> allocated on each node then we are stuck unless a new allocation rule is 
> made.  

Great.


>>> Can we get a full output of such a run with -report-bindings turned on.  I 
>>> think we should find out that things actually are happening correctly 
>>> except for the fact that the 6 of the nodes have 2 cores allocated but only 
>>> one is being bound to by a process.
>>> 
>> You mean, to accept the current behavior as being the intended one, as 
>> finally for having only one job running on these machines we get what we 
>

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 12:13 PM, Chris Jewell wrote:

On 16 Nov 2010, at 14:26, Terry Dontje wrote:

In the original case of 7 nodes and processes, if we do -binding pe linear:2 
and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the 
nodes to bind to one core, and the 7th node, with 2 processes, to have each of 
those processes bound to different cores on the same machine.

Can we get a full output of such a run with -report-bindings turned on.  I 
think we should find out that things actually are happening correctly except 
for the fact that the 6 of the nodes have 2 cores allocated but only one is 
being bound to by a process.

Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':

[exec4:17384] System has detected external process binding to cores 0022
[exec4:17384] ras:gridengine: JOB_ID: 59352
[exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
[exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1


Is that all that came out?  I would have expected some output from 
each process after the orted forked the processes but before the exec of 
unterm.


--td

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778






___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell

On 16 Nov 2010, at 14:26, Terry Dontje wrote:
> 
> In the original case of 7 nodes and processes, if we do -binding pe linear:2 
> and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the 
> nodes to bind to one core, and the 7th node, with 2 processes, to have each of 
> those processes bound to different cores on the same machine.
> 
> Can we get a full output of such a run with -report-bindings turned on.  I 
> think we should find out that things actually are happening correctly except 
> for the fact that the 6 of the nodes have 2 cores allocated but only one is 
> being bound to by a process.

Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':

[exec4:17384] System has detected external process binding to cores 0022
[exec4:17384] ras:gridengine: JOB_ID: 59352
[exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
[exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1


Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








[OMPI users] architecture questions

2010-11-16 Thread Hicham Mouline
hello,

I currently have a serial application with a GUI that runs some calculations. My 
next step is to use Open MPI, with the help of the Boost.MPI wrapper library in 
C++, to parallelize those calculations. There is a set of static data objects 
created once at startup or loaded from files.

1. In terms of running MPI processes, I've chosen this route: starting up the GUI 
launches all the MPI processes. They wait, listening for calculations to perform 
(via broadcast?). The GUI is the sort of master process. I've used mpirun to 
launch x processes on the same box. I assume there's a different setup to launch 
MPI processes on different boxes. Is there a way to hide the explicit launching 
of the MPI runtime? I.e., can the user just start the GUI, and the program 
actually launches the MPI runtime and itself becomes one of the MPI processes (a 
master process)?

2. What are the pros/cons of loading the static data objects individually from 
each separate MPI process vs. broadcasting the static data via MPI itself after 
only the master reads/sets up the static data?

regards,
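
On the multi-box part of question 1, a minimal sketch of a multi-host launch 
(assuming passwordless ssh between the boxes and the same Open MPI install on 
each; hostnames and the executable name are made up):

  $ cat myhosts
  box1 slots=4
  box2 slots=4
  $ mpirun --hostfile myhosts -np 8 ./calc_server

On question 2, the usual trade-off: having only the master read the files and 
then broadcast the objects costs one collective at startup but avoids every 
process hitting the same files at once, while loading in every process avoids 
the communication but multiplies the file-system load.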

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 10:59 AM, Reuti wrote:

Am 16.11.2010 um 15:26 schrieb Terry Dontje:



1. allocate a specified number of cores on each node to your job


this is currently the bug in the "slot <=> core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync AFAICS.


Technically this isn't a bug but a gap in the allocation rule.  I think the 
solution is a new allocation rule.

Yes, you can phrase it this way. But what do you mean by "new allocation rule"?
The proposal is to have a slot allocation rule that forces the number of 
cores allocated on each node to equal the number of slots.

The slot allocation should follow the specified cores?

The other way around I think.



2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.


This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core for each call (in case you use more than one `qrsh` per node), or *all* 
assigned cores (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). I filed an issue about such a situation a long time ago; a 
"limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do it (this setting should 
then also change the core allocation of the master process):


http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

Isn't it almost required to have the shepherd bind to all the cores so that the 
orted inherits that binding?

Yes, for orted. But if you want to have any other (legacy) application which 
uses `qrsh` N times to an exechost when you got N slots there, then only one 
core should be bound to each of the started shepherds.

Blech.  Not sure of the solution for that but I see what you are saying 
now :-).

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).


When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.


Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

You posted the two cases yesterday. Do we agree that both cases aren't correct, or do you think it's a 
correct allocation for both cases? Even if it could be "repaired" in Open MPI, it would be 
better to fix the generated 'pe' PE hostfile and 'set' allocation, i.e. the "slot <=> 
cores" relation.


So I am not a GE type of guy, but from what I've been led to believe, what 
happened is correct (in some form of correct).  That is, in case one we 
asked for a core allocation of 1 core per node, and in the other case a 
core allocation of 2 cores.  That is what we were given.  I am not sure 
that the fact that we distributed the slots in a non-uniform manner is 
GE's fault.  Note I can understand where it may seem non-intuitive and not 
nice for people wanting to do things like this.

In the original case of 7 nodes and processes, if we do -binding pe linear:2 
and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the 
nodes to bind to one core, and the 7th node, with 2 processes, to have each of 
those processes bound to different cores on the same machine.

Yes, possibly it could be repaired this way (for now I have no free machines to play with). But 
then the "reserved" cores by the "-binding pe linear:2" are lost for other 
processes on these 6 nodes, and the slot count gets out of sync with slots.
Right, if you want to rightsize the amount of cores allocated to slots 
allocated on each node then we are stuck unless a new allocation rule is 
made.

Can we get a full output of such a run with -report-bindings turned on.  I 
think we should find out that things actually are happening correctly except 
for the fact that the 6 of the nodes have 2 cores allocated but only one is 
being bound to by a process.

You mean, to accept the current behavior as being the intended one, as finally 
for having only one job running on these machines we get what we asked for - 
despite the fact that cores are lost for other processes?

Yes, that is what I mean.  I first would like to prove, at least to 
myself, that things are working the way we think they are.  I believe the 
discussion of recovering the lost cores is the next step.  Either we 
redefine what -binding linear:X means in light of slots, make a new 
allocation rule -binding slots:X, or live with the lost cores.  Note, the 
"we" here is loosely used.  I am by no means the keeper of GE and just 
injected myself into this discussion because, like Ralph, I have dealt 
with binding and I work for Oracle, which develops GE.  Just to be clear, 
I do not work in

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti

Am 16.11.2010 um 15:26 schrieb Terry Dontje:

>>> 
>>> 1. allocate a specified number of cores on each node to your job
>>> 
>> this is currently the bug in the "slot <=> core" relation in SGE, which has 
>> to be removed, updated or clarified. For now slot and core count are out of 
>> sync AFAICS.
>> 
> Technically this isn't a bug but a gap in the allocation rule.  I think the 
> solution is a new allocation rule.

Yes, you can phrase it this way. But what do you mean by "new allocation rule"? 
The slot allocation should follow the specified cores? 


>>> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
>>> does this automatically to constrain the procs to running on only those 
>>> cores.
>>> 
>> This is another "bug/feature" in SGE: it's a matter of discussion whether 
>> the shepherd should get exactly one core for each call (in case you use more 
>> than one `qrsh` per node), or *all* assigned cores (which we need right 
>> now, as the processes in Open MPI will be forks of the orte daemon). I filed 
>> an issue about such a situation a long time ago; a 
>> "limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do it 
>> (this setting should then also change the core allocation of the master process):
>> 
>> 
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
> Isn't it almost required to have the shepherd bind to all the cores so that 
> the orted inherits that binding?

Yes, for orted. But if you want to have any other (legacy) application which 
uses `qrsh` N times to an exechost when you got N slots there, then only one 
core should be bound to each of the started shepherds.


>>> 3. tell OMPI to --bind-to-core.
>>> 
>>> In other words, tell SGE to allocate a certain number of cores on each 
>>> node, but to bind each proc to all of them (i.e., don't bind a proc to a 
>>> specific core). I'm pretty sure that is a standard SGE option today (at 
>>> least, I know it used to be). I don't believe any patch or devel work is 
>>> required (to either SGE or OMPI).
>>> 
>> When you use a fixed allocation_rule and a matching -binding request it will 
>> work today. But any other case won't be distributed in the correct way.
>> 
> Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

You posted the two cases yesterday. Do we agree that both cases aren't correct, 
or do you think it's a correct allocation for both cases? Even if it could be 
"repaired" in Open MPI, it would be better to fix the generated 'pe' PE 
hostfile and 'set' allocation, i.e. the "slot <=> cores" relation.


> In the original case of 7 nodes and processes, if we do -binding pe linear:2 
> and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the 
> nodes to bind to one core, and the 7th node, with 2 processes, to have each of 
> those processes bound to different cores on the same machine.

Yes, possibly it could be repaired this way (for now I have no free machines to 
play with). But then the "reserved" cores by the "-binding pe linear:2" are 
lost for other processes on these 6 nodes, and the slot count gets out of sync 
with slots.


> Can we get a full output of such a run with -report-bindings turned on.  I 
> think we should find out that things actually are happening correctly except 
> for the fact that the 6 of the nodes have 2 cores allocated but only one is 
> being bound to by a process.

You mean, to accept the current behavior as being the intended one, as finally 
for having only one job running on these machines we get what we asked for - 
despite the fact that cores are lost for other processes?

-- Reuti


Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
Hi Reuti


> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE
> does this automatically to constrain the procs to running on only those
> cores.
>
> This is another "bug/feature" in SGE: it's a matter of discussion whether
> the shepherd should get exactly one core for each call (in case you use more
> than one `qrsh` per node), or *all* assigned cores (which we need right
> now, as the processes in Open MPI will be forks of the orte daemon). I filed
> an issue about such a situation a long time ago; a
> "limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do it
> (this setting should then also change the core allocation of the master process):
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254


I believe this is indeed the crux of the issue


>
>
>
> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each
> node, but to bind each proc to all of them (i.e., don't bind a proc to a
> specific core). I'm pretty sure that is a standard SGE option today (at
> least, I know it used to be). I don't believe any patch or devel work is
> required (to either SGE or OMPI).
>
> When you use a fixed allocation_rule and a matching -binding request it
> will work today. But any other case won't be distributed in the correct way.
>

Is it possible to not include the -binding request? If SGE is told to use a
fixed allocation_rule, and to allocate (for example) 2 cores/node, then
won't the orted see itself bound to two specific cores on each node? We
would then be okay as the spawned children of orted would inherit its
binding. Just don't tell mpirun to bind the processes and the threads of
those MPI procs will be able to operate across the provided cores.

Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no
-binding given), but doesn't bind the orted to any two specific cores? If
so, then that would be a problem as the orted would think itself
unconstrained. If I understand the thread correctly, you're saying that this
is what happens today - true?



>
> -- Reuti
>
>
> >
> >
> > On Tue, Nov 16, 2010 at 4:07 AM, Reuti 
> wrote:
> > Am 16.11.2010 um 10:26 schrieb Chris Jewell:
> >
> > > Hi all,
> > >
> > >> On 11/15/2010 02:11 PM, Reuti wrote:
> > >>> Just to give my understanding of the problem:
> > 
> > >> Sorry, I am still trying to grok your email and what problem you
> > >> are trying to solve. Is the issue trying to have two jobs with
> > >> processes on the same node be able to bind their processes to
> > >> different resources, like core 1 for the first job and cores 2 and 3
> > >> for the 2nd job?
> > >>
> > >> --td
> > >> You can't get 2 slots on a machine, as it's limited by the core count
> to one here, so such a slot allocation shouldn't occur at all.
> > >
> > > So to clarify, the current -binding :
> allocates binding_amount cores to each sge_shepherd process associated with
> a job_id.  There appears to be only one sge_shepherd process per job_id per
> execution node, so all child processes run on these allocated cores.  This
> is irrespective of the number of slots allocated to the node.
> > >
> > > I agree with Reuti that the binding_amount parameter should be a
> maximum number of bound cores per node, with the actual number determined by
> the number of slots allocated per node.  FWIW, an alternative approach might
> be to have another binding_type ('slot', say) that automatically allocated
> one core per slot.
> > >
> > > Of course, a complex situation might arise if a user submits a combined
> MPI/multithreaded job, but then I guess we're into the realm of setting
> allocation_rule.
> >
> > IIRC there was a discussion on the [GE users] list about it, to get a
> uniform distribution on all slave nodes for such jobs, as also e.g.
> $OMP_NUM_THREADS will be set to the same value for all slave nodes for
> hybrid jobs. Otherwise it would be necessary to adjust SGE to set this value
> in the "-builtin-" startup method automatically on all nodes to the local
> granted slots value. For now a fixed allocation rule of 1,2,4 or whatever
> must be used and you have to submit by requesting a wildcard PE to get any
> of these defined PEs for an even distribution and you don't care whether
> it's two times two slots, one time four slots, or four times one slot.
> >
> > In my understanding, any type of parallel job should always request and
> get the total number of slots equal to the cores it needs to execute.
> Independent whether these are threads, forks or any hybrid type of jobs.
> Otherwise any resource planning and reservation will most likely fail.
> Nevertheless, there might exist rare cases where you submit an exclusive
> serial job but create threads/forks in the end. But such a setup should be
> an exception, not the default.
> >
> >
> > > Is it going to be worth looking at creating a patch for this?
> >
> > Absolutely.
> >
> >
> > >  I don'

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 09:08 AM, Reuti wrote:

Hi,

Am 16.11.2010 um 14:07 schrieb Ralph Castain:


Perhaps I'm missing it, but it seems to me that the real problem lies in the interaction 
between SGE and OMPI during OMPI's two-phase launch. The verbose output shows that SGE 
dutifully allocated the requested number of cores on each node. However, OMPI launches 
only one process on each node (the ORTE daemon), which SGE "binds" to a single 
core since that is what it was told to do.

Since SGE never sees the local MPI procs spawned by ORTE, it can't assign 
bindings to them. The ORTE daemon senses its local binding (i.e., to a single 
core in the allocation), and subsequently binds all its local procs to that 
core.

I believe all you need to do is tell SGE to:

1. allocate a specified number of cores on each node to your job

this is currently the bug in the "slot <=> core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync AFAICS.


Technically this isn't a bug but a gap in the allocation rule.  I think 
the solution is a new allocation rule.

2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core for each call (in case you use more than one `qrsh` per node), or *all* 
assigned cores (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). I filed an issue about such a situation a long time ago; a 
"limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do it (this setting should 
then also change the core allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
Isn't it almost required to have the shepherd bind to all the cores so 
that the orted inherits that binding?



3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

In the original case of 7 nodes and processes, if we do -binding pe 
linear:2 and add -bind-to-core to mpirun, I'd actually expect the 
processes on 6 of the nodes to bind to one core, and the 7th node, with 2 
processes, to have each of those processes bound to different cores on the 
same machine.


Can we get a full output of such a run with -report-bindings turned on.  
I think we should find out that things actually are happening correctly 
except for the fact that the 6 of the nodes have 2 cores allocated but 
only one is being bound to by a process.


--td


-- Reuti




On Tue, Nov 16, 2010 at 4:07 AM, Reuti  wrote:
Am 16.11.2010 um 10:26 schrieb Chris Jewell:


Hi all,


On 11/15/2010 02:11 PM, Reuti wrote:

Just to give my understanding of the problem:

Sorry, I am still trying to grok your email and what problem you
are trying to solve. Is the issue trying to have two jobs with
processes on the same node be able to bind their processes to different
resources, like core 1 for the first job and cores 2 and 3 for the 2nd job?

--td

You can't get 2 slots on a machine, as it's limited by the core count to one 
here, so such a slot allocation shouldn't occur at all.

So to clarify, the current -binding:  
allocates binding_amount cores to each sge_shepherd process associated with a job_id.  
There appears to be only one sge_shepherd process per job_id per execution node, so all 
child processes run on these allocated cores.  This is irrespective of the number of slots 
allocated to the node.

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated per node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocated one core per 
slot.

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

IIRC there was a discussion on the [GE users] list about it, to get a uniform 
distribution on all slave nodes for such jobs, as also e.g. $OMP_NUM_THREADS will be set 
to the same value for all slave nodes for hybrid jobs. Otherwise it would be necessary to 
adjust SGE to set this value in the "-builtin-" startup method automatically on 
all nodes to the local granted slots value. For now a fixed allocation rule of 1,2,4 or 
whatever must be used 

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
Hi,

Am 16.11.2010 um 14:07 schrieb Ralph Castain:

> Perhaps I'm missing it, but it seems to me that the real problem lies in the 
> interaction between SGE and OMPI during OMPI's two-phase launch. The verbose 
> output shows that SGE dutifully allocated the requested number of cores on 
> each node. However, OMPI launches only one process on each node (the ORTE 
> daemon), which SGE "binds" to a single core since that is what it was told to 
> do.
> 
> Since SGE never sees the local MPI procs spawned by ORTE, it can't assign 
> bindings to them. The ORTE daemon senses its local binding (i.e., to a single 
> core in the allocation), and subsequently binds all its local procs to that 
> core.
> 
> I believe all you need to do is tell SGE to:
> 
> 1. allocate a specified number of cores on each node to your job

this is currently the bug in the "slot <=> core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync 
AFAICS.


> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> does this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the 
shepherd should get exactly one core for each call (in case you use more than 
one `qrsh` per node), or *all* assigned cores (which we need right now, as the 
processes in Open MPI will be forks of the orte daemon). I filed an issue about 
such a situation a long time ago; a "limit_to_one_qrsh_per_host yes/no" setting 
in the PE definition would do it (this setting should then also change the core 
allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254


> 3. tell OMPI to --bind-to-core.
> 
> In other words, tell SGE to allocate a certain number of cores on each node, 
> but to bind each proc to all of them (i.e., don't bind a proc to a specific 
> core). I'm pretty sure that is a standard SGE option today (at least, I know 
> it used to be). I don't believe any patch or devel work is required (to 
> either SGE or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

-- Reuti


> 
> 
> On Tue, Nov 16, 2010 at 4:07 AM, Reuti  wrote:
> Am 16.11.2010 um 10:26 schrieb Chris Jewell:
> 
> > Hi all,
> >
> >> On 11/15/2010 02:11 PM, Reuti wrote:
> >>> Just to give my understanding of the problem:
> 
> >> Sorry, I am still trying to grok your email and what problem you
> >> are trying to solve. Is the issue trying to have two jobs with
> >> processes on the same node be able to bind their processes to different
> >> resources, like core 1 for the first job and cores 2 and 3 for the 2nd 
> >> job?
> >>
> >> --td
> >> You can't get 2 slots on a machine, as it's limited by the core count to 
> >> one here, so such a slot allocation shouldn't occur at all.
> >
> > So to clarify, the current -binding : 
> > allocates binding_amount cores to each sge_shepherd process associated with 
> > a job_id.  There appears to be only one sge_shepherd process per job_id per 
> > execution node, so all child processes run on these allocated cores.  This 
> > is irrespective of the number of slots allocated to the node.
> >
> > I agree with Reuti that the binding_amount parameter should be a maximum 
> > number of bound cores per node, with the actual number determined by the 
> > number of slots allocated per node.  FWIW, an alternative approach might be 
> > to have another binding_type ('slot', say) that automatically allocated one 
> > core per slot.
> >
> > Of course, a complex situation might arise if a user submits a combined 
> > MPI/multithreaded job, but then I guess we're into the realm of setting 
> > allocation_rule.
> 
> IIRC there was a discussion on the [GE users] list about it, to get a 
> uniform distribution on all slave nodes for such jobs, as also e.g. 
> $OMP_NUM_THREADS will be set to the same value for all slave nodes for hybrid 
> jobs. Otherwise it would be necessary to adjust SGE to set this value in the 
> "-builtin-" startup method automatically on all nodes to the local granted 
> slots value. For now a fixed allocation rule of 1,2,4 or whatever must be 
> used and you have to submit by requesting a wildcard PE to get any of these 
> defined PEs for an even distribution and you don't care whether it's two 
> times two slots, one time four slots, or four times one slot.
> 
> In my understanding, any type of parallel job should always request and get 
> the total number of slots equal to the cores it needs to execute. Independent 
> whether these are threads, forks or any hybrid type of jobs. Otherwise any 
> resource planning and reservation will most likely fail. Nevertheless, there 
> might exist rare cases where you submit an exclusive serial job but create 
> threads/forks in the end. But such a setup should be an excep

Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Ralph Castain
Perhaps I'm missing it, but it seems to me that the real problem lies in the
interaction between SGE and OMPI during OMPI's two-phase launch. The verbose
output shows that SGE dutifully allocated the requested number of cores on
each node. However, OMPI launches only one process on each node (the ORTE
daemon), which SGE "binds" to a single core since that is what it was told
to do.

Since SGE never sees the local MPI procs spawned by ORTE, it can't assign
bindings to them. The ORTE daemon senses its local binding (i.e., to a
single core in the allocation), and subsequently binds all its local procs
to that core.

I believe all you need to do is tell SGE to:

1. allocate a specified number of cores on each node to your job

2. have SGE bind procs it launches to -all- of those cores. I believe SGE
does this automatically to constrain the procs to running on only those
cores.

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node,
but to bind each proc to all of them (i.e., don't bind a proc to a specific
core). I'm pretty sure that is a standard SGE option today (at least, I know
it used to be). I don't believe any patch or devel work is required (to
either SGE or OMPI).
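
A concrete sketch of those three steps (hypothetical 4-core nodes, 2 cores per
node wanted; the PE name and script name are made up, and as noted elsewhere in
the thread this only behaves as hoped when the -binding request matches a fixed
allocation_rule). Only the relevant PE lines are shown:

  # PE with a fixed allocation rule of 2 slots per node (qconf -ap mpi2)
  pe_name            mpi2
  slots              999
  allocation_rule    2
  control_slaves     TRUE
  job_is_first_task  FALSE

  # steps 1 and 2: 8 slots, SGE binds 2 cores on each node
  qsub -pe mpi2 8 -binding linear:2 myScript.com

  # step 3, inside myScript.com
  mpirun --bind-to-core ./my_mpi_app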



On Tue, Nov 16, 2010 at 4:07 AM, Reuti  wrote:

> Am 16.11.2010 um 10:26 schrieb Chris Jewell:
>
> > Hi all,
> >
> >> On 11/15/2010 02:11 PM, Reuti wrote:
> >>> Just to give my understanding of the problem:
> 
> >> Sorry, I am still trying to grok your email and what problem you
> >> are trying to solve. Is the issue trying to have two jobs with
> >> processes on the same node be able to bind their processes to
> >> different resources, like core 1 for the first job and cores 2 and 3
> >> for the 2nd job?
> >>
> >> --td
> >> You can't get 2 slots on a machine, as it's limited by the core count to
> one here, so such a slot allocation shouldn't occur at all.
> >
> > So to clarify, the current -binding :
> allocates binding_amount cores to each sge_shepherd process associated with
> a job_id.  There appears to be only one sge_shepherd process per job_id per
> execution node, so all child processes run on these allocated cores.  This
> is irrespective of the number of slots allocated to the node.
> >
> > I agree with Reuti that the binding_amount parameter should be a maximum
> number of bound cores per node, with the actual number determined by the
> number of slots allocated per node.  FWIW, an alternative approach might be
> to have another binding_type ('slot', say) that automatically allocated one
> core per slot.
> >
> > Of course, a complex situation might arise if a user submits a combined
> MPI/multithreaded job, but then I guess we're into the realm of setting
> allocation_rule.
>
> IIRC there was a discussion on the [GE users] list about it, to get a
> uniform distribution on all slave nodes for such jobs, as also e.g.
> $OMP_NUM_THREADS will be set to the same value for all slave nodes for
> hybrid jobs. Otherwise it would be necessary to adjust SGE to set this value
> in the "-builtin-" startup method automatically on all nodes to the local
> granted slots value. For now a fixed allocation rule of 1,2,4 or whatever
> must be used and you have to submit by requesting a wildcard PE to get any
> of these defined PEs for an even distribution and you don't care whether
> it's two times two slots, one time four slots, or four times one slot.
>
> In my understanding, any type of parallel job should always request and get
> the total number of slots equal to the cores it needs to execute.
> Independent whether these are threads, forks or any hybrid type of jobs.
> Otherwise any resource planning and reservation will most likely fail.
> Nevertheless, there might exist rare cases where you submit an exclusive
> serial job but create threads/forks in the end. But such a setup should be
> an exception, not the default.
>
>
> > Is it going to be worth looking at creating a patch for this?
>
> Absolutely.
>
>
> >  I don't know much of the internals of SGE -- would it be hard work to
> do?  I've not that much time to dedicate towards it, but I could put some
> effort in if necessary...
>
> I don't know about the exact coding for it, but if it's for now a plain
> "copy" of the binding list, then it should become a loop that creates a list of
> cores from the original specification until all granted slots have got a core
> allocated.
>
> -- Reuti
>
>
> >
> > Chris
> >
> >
> > --
> > Dr Chris Jewell
> > Department of Statistics
> > University of Warwick
> > Coventry
> > CV4 7AL
> > UK
> > Tel: +44 (0)24 7615 0778
>


Re: [OMPI users] source code for presentation/papers

2010-11-16 Thread Jeff Squyres
We hosted the paper as a courtesy to the author; they aren't part of the Open 
MPI core community.  You should probably contact the author directly to obtain 
the work; it was not submitted upstream to us.


On Nov 13, 2010, at 4:39 AM, Vasiliy G Tolstov wrote:

> Hello. I read a very good paper about xenmpi and interdomain communication
> (http://www.open-mpi.org/papers/trinity-btl-2009/xenmpi_report.pdf)
> 
> The document contains some instructions on how to build xensocket and the xen
> btl with openmpi. Where can I find the source?
> 
> -- 
> Vasiliy G Tolstov 
> Selfip.Ru
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Reuti
On 16.11.2010, at 10:26, Chris Jewell wrote:

> Hi all,
> 
>> On 11/15/2010 02:11 PM, Reuti wrote: 
>>> Just to give my understanding of the problem: 
 
>> Sorry, I am still trying to grok all your emails as to what problem you
>> are trying to solve. So is the issue trying to have two jobs with
>> processes on the same node be able to bind their processes to different
>> resources? Like core 1 for the first job and cores 2 and 3 for the 2nd job?
>> 
>> --td 
>> You can't get 2 slots on a machine, as it's limited by the core count to one 
>> here, so such a slot allocation shouldn't occur at all. 
> 
> So to clarify, the current -binding <strategy>:<binding_amount>
> allocates binding_amount cores to each sge_shepherd process associated with a 
> job_id.  There appears to be only one sge_shepherd process per job_id per 
> execution node, so all child processes run on these allocated cores.  This is 
> irrespective of the number of slots allocated to the node.  
> 
> I agree with Reuti that the binding_amount parameter should be a maximum 
> number of bound cores per node, with the actual number determined by the 
> number of slots allocated per node.  FWIW, an alternative approach might be 
> to have another binding_type ('slot', say) that automatically allocated one 
> core per slot.
> 
> Of course, a complex situation might arise if a user submits a combined 
> MPI/multithreaded job, but then I guess we're into the realm of setting 
> allocation_rule.

IIRC there was a discussion on the [GE users] list about this, to get a uniform
distribution on all slave nodes for such jobs, as e.g. $OMP_NUM_THREADS will
also be set to the same value on all slave nodes for hybrid jobs. Otherwise it
would be necessary to adjust SGE so that the "-builtin-" startup method
automatically sets this value on each node to the locally granted slot count.
For now a fixed allocation_rule of 1, 2, 4 or whatever must be used, and you
have to submit by requesting a wildcard PE to get any of these defined PEs for
an even distribution; then you don't care whether it's two times two slots, one
time four slots, or four times one slot.
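
As a rough illustration of such a setup (the PE names, slot counts and job
script below are made up, not taken from a real configuration):

    # one PE per fixed allocation_rule, e.g. shown via "qconf -sp mpi_2"
    pe_name            mpi_2
    slots              999
    allocation_rule    2
    control_slaves     TRUE
    job_is_first_task  FALSE

    # submit against a wildcard PE so any of mpi_1, mpi_2, mpi_4 may be granted;
    # a -binding request can only "match" one of these allocation rules, which
    # is part of the problem discussed here
    qsub -pe "mpi_*" 4 job.sh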

In my understanding, any type of parallel job should always request and get a
total number of slots equal to the cores it needs to execute, independent of
whether these are threads, forks or any hybrid type of job. Otherwise any
resource planning and reservation will most likely fail. Nevertheless, there
might be rare cases where you submit an exclusive serial job but create
threads/forks in the end; such a setup should be the exception, not the
default.
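
For a hybrid job this could look roughly like the following (PE name, counts
and script are illustrative only):

    # 16 cores total = 4 nodes x 4 cores, one MPI rank per node,
    # 4 OpenMP threads per rank
    qsub -pe mpi_4 16 -v OMP_NUM_THREADS=4 job.sh

    # job.sh
    mpirun -npernode 1 ./hybrid_app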


> Is it going to be worth looking at creating a patch for this?

Absolutely.


>  I don't know much of the internals of SGE -- would it be hard work to do?  
> I've not that much time to dedicate towards it, but I could put some effort 
> in if necessary...

I don't know the exact coding for it, but if it's currently a plain "copy" of
the binding list, then it should become a loop that builds a list of cores from
the original specification until every granted slot has a core allocated (see
the sketch below).
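
Something along these lines, as a sketch of the idea only (plain shell with
made-up variable names, not actual SGE source):

    #!/bin/sh
    # Continue a "linear" binding from its start core until one core per
    # granted slot is in the list, instead of stopping at binding_amount.
    start_core=0       # first core of the original -binding linear:N request
    granted_slots=4    # slots SGE granted to this job on this node
    core_list=""
    c=$start_core
    while [ $(( c - start_core )) -lt $granted_slots ]; do
        core_list="${core_list:+$core_list,}$c"
        c=$(( c + 1 ))
    done
    echo "shepherd would be bound to cores: $core_list"   # -> 0,1,2,3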

-- Reuti


> 
> Chris
> 
> 
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> 
> 
> 
> 
> 




Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Terry Dontje

On 11/16/2010 04:26 AM, Chris Jewell wrote:

Hi all,


On 11/15/2010 02:11 PM, Reuti wrote:

Just to give my understanding of the problem:

Sorry, I am still trying to grok all your emails as to what problem you
are trying to solve. So is the issue trying to have two jobs with
processes on the same node be able to bind their processes to different
resources? Like core 1 for the first job and cores 2 and 3 for the 2nd job?

--td

You can't get 2 slots on a machine, as it's limited by the core count to one 
here, so such a slot allocation shouldn't occur at all.

So to clarify, the current -binding <strategy>:<binding_amount>
allocates binding_amount cores to each sge_shepherd process associated with a job_id.  
There appears to be only one sge_shepherd process per job_id per execution node, so all 
child processes run on these allocated cores.  This is irrespective of the number of slots 
allocated to the node.

I believe the above is correct.

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated per node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocated one core per 
slot.

That might be correct; I've put in a question to someone who should know.

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

Yes, that would get ugly.

Is it going to be worth looking at creating a patch for this?  I don't know 
much of the internals of SGE -- would it be hard work to do?  I've not that 
much time to dedicate towards it, but I could put some effort in if necessary...


Is the patch you're wanting for a "slot" binding_type?

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell
Hi all,

> On 11/15/2010 02:11 PM, Reuti wrote: 
>> Just to give my understanding of the problem: 
>>> 
> Sorry, I am still trying to grok all your emails as to what problem you
> are trying to solve. So is the issue trying to have two jobs with
> processes on the same node be able to bind their processes to different
> resources? Like core 1 for the first job and cores 2 and 3 for the 2nd job?
> 
> --td 
> You can't get 2 slots on a machine, as it's limited by the core count to one 
> here, so such a slot allocation shouldn't occur at all. 

So to clarify, the current -binding <strategy>:<binding_amount>
allocates binding_amount cores to each sge_shepherd process associated with a 
job_id.  There appears to be only one sge_shepherd process per job_id per 
execution node, so all child processes run on these allocated cores.  This is 
irrespective of the number of slots allocated to the node.  
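
For what it's worth, one way to observe this on an execution node (assuming
Linux with the procps and util-linux tools available) is to print the affinity
list of each shepherd:

    # print the core binding of every sge_shepherd on this node
    pgrep sge_shepherd | while read pid; do
        taskset -cp "$pid"
    done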

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated per node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocated one core per 
slot.

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

Is it going to be worth looking at creating a patch for this?  I don't know 
much of the internals of SGE -- would it be hard work to do?  I've not that 
much time to dedicate towards it, but I could put some effort in if necessary...

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778