Hi,

On 13.11.2010 at 15:39, Chris Jewell wrote:

> Sorry for kicking off this thread, and then disappearing. I've been away for
> a bit. Anyway, Dave, I'm glad you experienced the same issue as I had with
> my installation of SGE 6.2u5 and OpenMPI with core binding -- namely that
> with 'qsub -pe openmpi 8 -binding set linear:1 <myscript.com>', if two or
> more of the parallel processes get scheduled to the same execution node, then
> the processes end up being bound to the same core. Not good!
>
> I've been playing around quite a bit trying to understand this issue, and
> ended up on the GE dev list:
>
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=39&dsMessageId=285878

As the [GE dev] list was nearly dead once it went closed source, I'm no longer subscribed to it. [GE users] will reach a broader audience, I think.

Anyway, I don't have a free suitable cluster, but can you please try the following:

$ qsub -pe openmpi 8 -binding linear:2 <myscript.com>

with a fixed "allocation_rule 2" in your PE. And also:

$ qsub -pe openmpi 8 -binding linear:8 <myscript.com>

> It seems that most people expect that calls to 'qrsh -inherit' (that I assume
> OpenMPI uses to bind parallel processes to reserved GE slots) activates a
> separate binding. This does not appear to be the case. I *was* hoping that
> using -binding pe linear:1 might enable me to write a script that read the
> pe_hostfile and created a machine file for OpenMPI, but this fails as GE does
> not appear to give information as to which cores are unbound, only the number
> required.

You can get the information about the cores to be used when you use "env" or, even better, "pe" as "binding_instance" instead of "set". Then it should be possible (and you even need to implement it) to let Open MPI do the core binding instead of SGE. From `man qsub`:

pe means that the information about the selected cores appears in the fourth column of the pe_hostfile.
Here the logical core and socket numbers are printed (they start at 0 and have no holes) in colon separated pairs (i.e. 0,0:1,0 which means core 0 on socket 0 and core 0 on socket 1). For more information about the $pe_hostfile check ge_pe(5)

-- Reuti

> So, for now, my solution has been to use a JSV to remove core binding for the
> MPI jobs (but retain it for serial and SMP jobs). Any more ideas??
>
> Cheers,
>
> Chris
>
> (PS. Dave: how is my alma mater these days??)
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
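
PS: To make the "pe" idea a bit more concrete, here is a rough, untested-on-a-cluster sketch in Python of how the fourth column of the pe_hostfile could be turned into lines for an Open MPI rankfile. Several things are assumptions on my side: the function names and the sample hostfile content are made up for illustration, the "UNDEFINED" placeholder for hosts without binding information may look different on your installation, and the `rank N=host slot=socket:core` syntax should be checked against your mpirun man page. Whether the logical numbering SGE prints matches what your Open MPI version expects in a rankfile also needs to be verified.

```python
def parse_binding(field):
    """Parse a fourth-column value such as '0,0:1,0' into (socket, core) pairs.

    Returns an empty list if the field holds no usable binding information
    (for example a placeholder string instead of number pairs).
    """
    pairs = []
    for item in field.split(":"):
        parts = item.split(",")
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            return []
        pairs.append((int(parts[0]), int(parts[1])))
    return pairs


def parse_pe_hostfile(text):
    """Return a list of (host, slots, [(socket, core), ...]) per hostfile line."""
    entries = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 4:
            continue  # skip blank or incomplete lines
        host, slots, binding = cols[0], int(cols[1]), cols[3]
        entries.append((host, slots, parse_binding(binding)))
    return entries


def rankfile_lines(entries):
    """Build one rankfile line per selected core (assumed Open MPI syntax)."""
    lines = []
    rank = 0
    for host, _slots, pairs in entries:
        for socket, core in pairs:
            lines.append(f"rank {rank}={host} slot={socket}:{core}")
            rank += 1
    return lines


if __name__ == "__main__":
    # Hypothetical pe_hostfile content; real jobs would read $PE_HOSTFILE.
    sample = (
        "node01 2 all.q@node01 0,0:1,0\n"
        "node02 1 all.q@node02 UNDEFINED\n"
    )
    for line in rankfile_lines(parse_pe_hostfile(sample)):
        print(line)
```

Hosts without binding information simply produce no rankfile lines here; a real wrapper script would have to decide whether to fail or to fall back to an unbound start in that case.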