[OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-04 Thread Chris Jewell
Hi all,

Firstly, hello to the mailing list for the first time!  Secondly, sorry for the 
non-descript subject line, but I couldn't really think how to be more specific! 
 

Anyway, I am currently having a problem getting OpenMPI to work within my 
installation of SGE 6.2u5.  I compiled OpenMPI 1.4.2 from source, and installed 
under /usr/local/packages/openmpi-1.4.2.  Software on my system is controlled 
by the Modules framework which adds the bin and lib directories to PATH and 
LD_LIBRARY_PATH respectively when a user is connected to an execution node.  I 
configured a parallel environment in which OpenMPI is to be used: 

pe_name            mpi
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

I then tried a simple job submission script:

#!/bin/bash
#
#$ -S /bin/bash
. /etc/profile
module add ompi gcc
mpirun hostname

If the parallel environment runs within one execution host (8 slots per host), 
then all is fine.  However, if the job is scheduled across several nodes, I get an error:

execv: No such file or directory
execv: No such file or directory
execv: No such file or directory
--
A daemon (pid 1629) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished


I'm at a loss as to how to start debugging this, and I don't seem to get 
anything useful from the mpirun '-d' and '-v' switches.  The SGE logs don't show 
anything.  Can anyone suggest either what is wrong, or how I might go about 
getting more information?

Many thanks,


Chris



--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-10-05 Thread Chris Jewell

> 
> It looks to me like your remote nodes aren't finding the orted executable. I 
> suspect the problem is that you need to forward the path and ld_library_path 
> to the remote nodes. Use the mpirun -x option to do so.
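For reference, a minimal variant of the submission script above that forwards both variables with -x (assuming the same 'ompi' and 'gcc' modules) would look something like:

#!/bin/bash
#$ -S /bin/bash
. /etc/profile
module add ompi gcc
# -x re-exports the named environment variables to the orteds on the remote nodes
mpirun -x PATH -x LD_LIBRARY_PATH hostname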


Hi, problem sorted.  It was actually caused by the system I currently use to 
create Linux cpusets on the execution nodes.  Grid Engine was trying to call 
execv on the slave nodes without supplying an executable to run, since that is 
deferred to OpenMPI.  I've scrapped this system now in favour of the new SGE 
core binding feature.

Thanks, sorry to waste people's time!

Chris








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-13 Thread Chris Jewell
Hi Dave, Reuti,

Sorry for kicking off this thread, and then disappearing.  I've been away for a 
bit.  Anyway, Dave, I'm glad you experienced the same issue as I had with my 
installation of SGE 6.2u5 and OpenMPI with core binding -- namely that with 
'qsub -pe openmpi 8 -binding set linear:1 ', if two or more of 
the parallel processes get scheduled to the same execution node, then the 
processes end up being bound to the same core.  Not good!

I've been playing around quite a bit trying to understand this issue, and ended 
up on the GE dev list:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=39&dsMessageId=285878

It seems that most people expect that calls to 'qrsh -inherit' (which I assume 
OpenMPI uses to bind parallel processes to reserved GE slots) activate a 
separate binding.  This does not appear to be the case.  I *was* hoping that 
using -binding pe linear:1 might enable me to write a script that reads the 
pe_hostfile and creates a machine file for OpenMPI, but this fails as GE does 
not appear to give information as to which cores are unbound, only the number 
required.

So, for now, my solution has been to use a JSV to remove core binding for the 
MPI jobs (but retain it for serial and SMP jobs).  Any more ideas??

Cheers,

Chris

(PS. Dave: how is my alma mater these days??)
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi Reuti,

Okay, so I tried what you suggested.  You essentially get the requested number 
of bound cores on each execution node, so if I use

$ qsub -pe openmpi 8 -binding linear:2 

then I get 2 bound cores per node, irrespective of the number of slots (and 
hence parallel processes) allocated by GE, and irrespective of which setting I 
use for the allocation_rule.

My aim with this was to deal with badly behaved multithreaded algorithms which 
end up spreading across more cores on an execution node than the number of 
GE-allocated slots (thereby interfering with other GE-scheduled tasks running 
on the same exec node).  By binding a process to one or more cores, one can 
"box in" processes and prevent them from spawning errant sub-processes and 
threads.  Unfortunately, the above approach gives every execution node the same 
core binding.

From exploring the software (both OpenMPI and GE) further, I have two comments:

1) The core binding feature in GE appears to apply the requested core-binding 
topology to every execution node involved in a parallel job, rather than 
assuming that the topology requested is *per parallel process*.  So, if I 
request 'qsub -pe mpi 8 -binding linear:1 ' with the intention of 
getting each of the 8 parallel processes to be bound to 1 core, I actually get 
all processes associated with the job_id on one exec node bound to 1 core.  
Oops!

2) OpenMPI has its own core-binding feature (-mca mpi_paffinity_alone 1) which 
works well to bind each parallel process to one processor.  Unfortunately, its 
binding framework (hwloc) is different from the one GE uses (PLPA), resulting 
in binding overlaps between GE-bound tasks (e.g. serial and SMP jobs) and 
OpenMPI-bound processes (i.e. my MPI jobs).  Again, oops ;-)
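(For reference, that MCA parameter is just passed on the mpirun command line, e.g.

$ mpirun -np 8 -mca mpi_paffinity_alone 1 ./myprog

where ./myprog stands in for whatever MPI binary the job runs.)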


If, indeed, it is not currently possible to implement this type of core-binding 
in a tightly integrated OpenMPI/GE setup, then a solution might lie in a custom 
script run from the parallel environment's start_proc_args.  This script would 
have to find out which slots are allocated where on the cluster, and write an 
OpenMPI rankfile.
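A rough sketch of such a start_proc_args helper (assuming $PE_HOSTFILE is visible to it, or that the $pe_hostfile path is passed as an argument, and naively assuming a single socket with cores numbered from 0 on every host -- which is exactly the assumption that breaks down when other jobs already occupy cores) might be:

#!/bin/sh
# Hypothetical sketch: turn the GE pe_hostfile into an OpenMPI rankfile,
# assigning cores 0,1,... per host in slot order.
RANKFILE="$TMPDIR/rankfile"
: > "$RANKFILE"                      # start with an empty rankfile
rank=0
while read host slots queue rest; do
    core=0
    while [ "$core" -lt "$slots" ]; do
        # OpenMPI rankfile syntax: "rank N=<host> slot=<socket>:<core>"
        echo "rank $rank=$host slot=0:$core" >> "$RANKFILE"
        rank=$((rank + 1))
        core=$((core + 1))
    done
done < "$PE_HOSTFILE"

The job would then run 'mpirun -rf $TMPDIR/rankfile ...', but without knowing which cores other GE jobs already occupy, this can still collide with GE-bound serial/SMP tasks.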

Any thoughts on that?

Cheers,

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi,

> > If, indeed, it is not possible currently to implement this type of 
> > core-binding in tightly integrated OpenMPI/GE, then a solution might lie in 
> > a custom script run in the parallel environment's 'start proc args'. This 
> > script would have to find out which slots are allocated where on the 
> > cluster, and write an OpenMPI rankfile. 
> 
> Exactly this should work. 
> 
> If you use "binding_instance" "pe" and reformat the information in the 
> $PE_HOSTFILE to a "rankfile", it should work to get the desired allocation. 
> Maybe you can share the script with this list once you got it working. 


As far as I can see, that's not going to work.  This is because, exactly like 
"binding_instance" "set", for -binding pe linear:n you get n cores bound per 
node.  This is easily verifiable by using a long job and examining the 
pe_hostfile.  For example, I submit a job with:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1

Notice that, because I have specified the -binding pe linear:1, each execution 
node binds processes for the job_id to one core.  If I have -binding pe 
linear:2, I get:

exec6.cluster.stats.local 2 batch.q@exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q@exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q@exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q@exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q@exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q@exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q@exec5.cluster.stats.local 0,1:0,2

So the pe_hostfile still doesn't give an accurate representation of the binding 
allocation for use by OpenMPI.  Question: is there a system file or command 
that I could use to check which processors are "occupied"?
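(Not from this thread, but one possible way to inspect what is already bound on a node is to query the affinity mask of each running sge_shepherd with taskset from util-linux, e.g.

$ for p in $(pgrep sge_shepherd); do taskset -pc "$p"; done

which prints each shepherd's current CPU affinity list.)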

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
> I confess I am now confused. What version of OMPI are you using? 
> 
> FWIW: OMPI was updated at some point to detect the actual cores of an 
> external binding, and abide by them. If we aren't doing that, then we have a 
> bug that needs to be resolved. Or it could be you are using a version that 
> predates the change. 
> 
> Thanks 
> Ralph

Hi Ralph,

I'm using OMPI version 1.4.2.  I can upgrade and try it out if necessary.  Is 
there anything I can give you as potential debug material?

Cheers,

Chris



--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
Hi Ralph,

Thanks for the tip.  With the command

$ qsub -pe mpi 8 -binding linear:1 myScript.com

I get the output

[exec6:29172] System has detected external process binding to cores 0008
[exec6:29172] ras:gridengine: JOB_ID: 59282
[exec6:29172] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec6/active_jobs/59282.1/pe_hostfile
[exec6:29172] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec6:29172] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec6:29172] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1

Presumably that means that OMPI is detecting the external binding okay.  If so, 
then that confirms my problem as an issue with how GE sets the processor 
affinity -- essentially the controlling sge_shepherd process on each physical 
exec node gets bound to the requested number of cores (in this case 1), 
resulting in any child process (i.e. the OMPI parallel processes) being bound to 
the same core.  What we really need is for GE to set the binding on each 
execution node according to the number of parallel processes that will run 
there.  Not sure this is doable currently...

Cheers,

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-15 Thread Chris Jewell
> Sorry, I am still trying to grok from all your emails what problem you 
> are trying to solve. So is the issue trying to have two jobs with 
> processes on the same node be able to bind their processes to different 
> resources? Like core 1 for the first job and cores 2 and 3 for the 2nd job? 
> 
> --td 

That's exactly it.  Each MPI process needs to be bound to 1 processor in a way 
that reflects GE's slot allocation scheme.

C

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell
Hi all,

> On 11/15/2010 02:11 PM, Reuti wrote: 
>> Just to give my understanding of the problem: 
>>> 
>>>>> Sorry, I am still trying to grok from all your emails what problem you 
>>>>> are trying to solve. So is the issue trying to have two jobs with 
>>>>> processes on the same node be able to bind their processes to different 
>>>>> resources? Like core 1 for the first job and cores 2 and 3 for the 2nd 
>>>>> job? 
>>>>> 
>>>>> --td 
> You can't get 2 slots on a machine, as it's limited by the core count to one 
> here, so such a slot allocation shouldn't occur at all. 

So to clarify: the current '-binding <binding_strategy>:<binding_amount>' 
request allocates binding_amount cores to each sge_shepherd process associated 
with a job_id.  There appears to be only one sge_shepherd process per job_id 
per execution node, so all child processes run on these allocated cores.  This 
is irrespective of the number of slots allocated to the node.

I agree with Reuti that the binding_amount parameter should be a maximum number 
of bound cores per node, with the actual number determined by the number of 
slots allocated on each node.  FWIW, an alternative approach might be to have 
another binding_type ('slot', say) that automatically allocates one core per 
slot.

Of course, a complex situation might arise if a user submits a combined 
MPI/multithreaded job, but then I guess we're into the realm of setting 
allocation_rule.

Is it going to be worth looking at creating a patch for this?  I don't know 
much of the internals of SGE -- would it be hard work to do?  I've not that 
much time to dedicate towards it, but I could put some effort in if necessary...

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell

On 16 Nov 2010, at 14:26, Terry Dontje wrote:
> 
> In the original case of 7 nodes and 8 processes, if we do -binding pe linear:2 
> and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the 
> nodes to bind to one core each, and the 2 processes on the 7th node to be 
> bound to different cores on the same machine.
> 
> Can we get a full output of such a run with -report-bindings turned on?  I 
> think we should find that things are actually happening correctly, except that 
> 6 of the nodes have 2 cores allocated but only one is being bound to by a 
> process.

Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 
-binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
ras_gridengine_verbose 100 --report-bindings ./unterm':

[exec4:17384] System has detected external process binding to cores 0022
[exec4:17384] ras:gridengine: JOB_ID: 59352
[exec4:17384] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
[exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1


Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-16 Thread Chris Jewell

On 16 Nov 2010, at 17:25, Terry Dontje wrote:
>>> 
>> Sure.   Here's the stderr of a job submitted to my cluster with 'qsub -pe 
>> mpi 8 -binding linear:2 myScript.com'  where myScript.com runs 'mpirun -mca 
>> ras_gridengine_verbose 100 --report-bindings ./unterm':
>> 
>> [exec4:17384] System has detected external process binding to cores 0022
>> [exec4:17384] ras:gridengine: JOB_ID: 59352
>> [exec4:17384] ras:gridengine: PE_HOSTFILE: 
>> /usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
>> [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
>> slots=2
>> [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
>> slots=1
>> 
>> 
>> 
> Is that all that came out?  I would have expected some output from each 
> process after the orted forked the processes but before the exec of unterm.

Yes.  It appears that if orted detects a binding imposed by an external process, 
then this is all you get.  If I scratch the GE-enforced binding, I get:

[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to 
cpus 0001
[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],1] to 
cpus 0002
[exec7:06781] [[23443,0],2] odls:default:fork binding child [[23443,1],3] to 
cpus 0001
[exec2:24160] [[23443,0],1] odls:default:fork binding child [[23443,1],2] to 
cpus 0001
[exec6:30097] [[23443,0],4] odls:default:fork binding child [[23443,1],5] to 
cpus 0001
[exec5:02736] [[23443,0],6] odls:default:fork binding child [[23443,1],7] to 
cpus 0001
[exec1:30779] [[23443,0],5] odls:default:fork binding child [[23443,1],6] to 
cpus 0001
[exec3:12818] [[23443,0],3] odls:default:fork binding child [[23443,1],4] to 
cpus 0001
.


C
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-17 Thread Chris Jewell
On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>> 
>> You are absolutely correct, Terry, and the 1.4 release series does include 
>> the proper code. The point here, though, is that SGE binds the orted to a 
>> single core, even though other cores are also allocated. So the orted 
>> detects an external binding of one core, and binds all its children to that 
>> same core.
> I do not think you are right here.  Chris sent the following, which looks like 
> OGE (fka SGE) actually did bind the hnp to multiple cores.  However, that 
> message, I believe, is not coming from the processes themselves and is 
> actually only shown by the hnp.  I wonder whether, if Chris adds the 
> "-bind-to-core" option, we'll see more output from the a.out's before they 
> exec unterm?

As requested, using

$ qsub -pe mpi 8 -binding linear:2 myScript.com

and

'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core 
-bind-to-core ./unterm':

[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1

No more info.  I note that the external binding is slightly different to what I 
had before, but our cluster is busier today :-)

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778








Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

2010-11-18 Thread Chris Jewell
> 
>> Perhaps if someone could run this test again with --report-bindings 
>> --leave-session-attached and provide -all- output we could verify that 
>> analysis and clear up the confusion?
>> 
> Yeah, however I bet you we still won't see output.

Actually, it seems we do get more output!  Results of 'qsub -pe mpi 8 -binding 
linear:2 myScript.com'

with

'mpirun -mca ras_gridengine_verbose 100 -report-bindings 
--leave-session-attached -bycore -bind-to-core ./unterm'

[exec1:06504] System has detected external process binding to cores 0028
[exec1:06504] ras:gridengine: JOB_ID: 59467
[exec1:06504] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
[exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to 
cpus 0008
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to 
cpus 0020
[exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to 
cpus 0008
[exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to 
cpus 0001
[exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to 
cpus 0001
[exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to 
cpus 0002
[exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to 
cpus 0001
[exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to 
cpus 0001

AHHA!  Now I get the following if I use 'qsub -pe mpi 8 -binding linear:1 
myScript.com' with the above mpirun command:

[exec1:06552] System has detected external process binding to cores 0020
[exec1:06552] ras:gridengine: JOB_ID: 59468
[exec1:06552] ras:gridengine: PE_HOSTFILE: 
/usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
[exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows 
slots=2
[exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows 
slots=1
[exec1:06552] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows 
slots=1
--
mpirun was unable to start the specified application as it encountered an error:

Error name: Unknown error: 1
Node: exec1

when attempting to start process rank 0.
--
[exec1:06552] [[59432,0],0] odls:default:fork binding child [[59432,1],0] to 
cpus 0020
--
Not enough processors were found on the local host to meet the requested
binding action:

  Local host:exec1
  Action requested:  bind-to-core
  Application name:  ./unterm

Please revise the request and try again.
--
[exec4:26816] [[59432,0],4] odls:default:fork binding child [[59432,1],5] to 
cpus 0001
[exec3:20345] [[59432,0],1] odls:default:fork binding child [[59432,1],2] to 
cpus 0020
[exec2:32486] [[59432,0],2] odls:default:fork binding child [[59432,1],3] to 
cpus 0001
[exec7:09921] [[59432,0],3] odls:default:fork binding child [[59432,1],4] to 
cpus 0002
[exec6:04257] [[59432,0],6] odls:default:fork binding child [[59432,1],7] to 
cpus 0001
[exec5:10861] [[59432,0],5] odls:default:fork binding child [[59432,1],6] to 
cpus 0001



Hope that helps clear up the confusion!  Please say it does, my head hurts...

Chris


--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778