Hi,

On 13.11.2010, at 15:39, Chris Jewell wrote:

> Sorry for kicking off this thread, and then disappearing.  I've been away for 
> a bit.  Anyway, Dave, I'm glad you experienced the same issue as I had with 
> my installation of SGE 6.2u5 and OpenMPI with core binding -- namely that 
> with 'qsub -pe openmpi 8 -binding set linear:1 <myscript.com>', if two or 
> more of the parallel processes get scheduled to the same execution node, then 
> the processes end up being bound to the same core.  Not good!
> 
> I've been playing around quite a bit trying to understand this issue, and 
> ended up on the GE dev list:
> 
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=39&dsMessageId=285878

as the [GE dev] list nearly died when Grid Engine went closed source, I'm no longer 
subscribed to it. [GE users] will reach a broader audience, I think.  Anyway,

I don't have a suitable cluster free at the moment, but could you please try the following:

$ qsub -pe openmpi 8 -binding linear:2 <myscript.com>

with a fixed "allocation_rule 2" in your PE. And also:

$ qsub -pe openmpi 8 -binding linear:8 <myscript.com>

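For reference, a PE set up with a fixed allocation rule might look like the sketch below (only "allocation_rule" matters here; the other values are placeholders and your real PE will differ):

```
$ qconf -sp openmpi
pe_name            openmpi
slots              999
allocation_rule    2
control_slaves     TRUE
job_is_first_task  FALSE
```

With "allocation_rule 2" the scheduler places exactly two slots per host, so "-binding linear:2" requests as many cores per host as slots actually land there.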

> It seems that most people expect that calls to 'qrsh -inherit' (which I assume 
> OpenMPI uses to bind parallel processes to reserved GE slots) activate a 
> separate binding.  This does not appear to be the case.  I *was* hoping that 
> using -binding pe linear:1 might enable me to write a script that reads the 
> pe_hostfile and creates a machine file for OpenMPI, but this fails as GE does 
> not appear to give information as to which cores are unbound, only the number 
> required.

You can get the information about the cores to be used when you specify "env" or, 
even better, "pe" as the "binding_instance" instead of "set". Then it should be 
possible (though you would need to implement it yourself) to let Open MPI do the 
core binding instead of SGE. From `man qsub`:

       pe     means that the information about the selected cores appears
              in the fourth column of the pe_hostfile.  Here the logical
              core and socket numbers are printed (they start at 0 and
              have no holes) in colon separated pairs (i.e. 0,0:1,0 which
              means core 0 on socket 0 and core 0 on socket 1).  For more
              information about the $pe_hostfile check ge_pe(5)
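To illustrate, a helper along these lines could turn that fourth pe_hostfile column into an Open MPI rankfile for "mpirun --rankfile". This is only a sketch: the sample hostfile line and function names are hypothetical, and the assumption that the column holds "socket,core" pairs joined by ":" comes from the man page excerpt above; it is untested against a real SGE installation.

```python
#!/usr/bin/env python
# Sketch: build an Open MPI rankfile from pe_hostfile lines whose 4th
# column carries the GE core selection (e.g. "0,0:1,0"). Hypothetical
# sample data; pair order (socket, core) assumed per the qsub man page.

def parse_binding(column):
    """Split e.g. '0,0:1,0' into [(socket, core), ...]."""
    pairs = []
    for pair in column.split(":"):
        socket, core = pair.split(",")
        pairs.append((int(socket), int(core)))
    return pairs

def rankfile_lines(pe_hostfile_lines, start_rank=0):
    """Emit 'rank N=host slot=socket:core' lines, one rank per pair."""
    lines = []
    rank = start_rank
    for line in pe_hostfile_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # no binding column for this host
        host = fields[0]
        for socket, core in parse_binding(fields[3]):
            lines.append("rank %d=%s slot=%d:%d" % (rank, host, socket, core))
            rank += 1
    return lines

if __name__ == "__main__":
    sample = ["node01 2 all.q@node01 0,0:1,0"]  # hypothetical pe_hostfile line
    for entry in rankfile_lines(sample):
        print(entry)
```

The resulting file could then be fed to Open MPI via "mpirun --rankfile <file>" in the job script, so that Open MPI performs the binding on the cores GE selected rather than SGE binding everything to the same core.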

-- Reuti


> So, for now, my solution has been to use a JSV to remove core binding for the 
> MPI jobs (but retain it for serial and SMP jobs).  Any more ideas??
> 
> Cheers,
> 
> Chris
> 
> (PS. Dave: how is my alma mater these days??)
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
