Hi,

On 12.01.2012 at 08:00, Brendan Moloney wrote:

> I seem to have found a combination of resource quotas that is preventing
> the scheduler from scheduling parallel jobs across multiple queues.
>
> I have multiple queues for jobs with different run times: veryshort.q,
> short.q, long.q, and verylong.q. Each of these queues has an increasing
> 'h_rt' limit and an increasing sequence number (I have the scheduler sort
> by sequence numbers). Each of these queues also has a decreasing number of
> slots available. Jobs are then submitted with an h_rt value and the
> shortest queue with an open slot is used. I also have a parallel
> environment "mpi" that is enabled in all of these queues.
>
> The problem only occurs if I use resource quota sets to both limit the total
> number of slots for the queues and limit the number of slots on each node.
>
> For example:
>
> {
>    name         nodelimit
>    description  NONE
>    enabled      TRUE
>    limit        queues !debug.q hosts {*} to slots=$num_proc
> }
> {
>    name         shortlimit
>    description  NONE
>    enabled      TRUE
>    limit        queues short.q hosts * to slots=32

I think you can leave the "hosts *" out here and in the other RQS below. It means "used slots across all machines" are limited to 32 in this queue. The same can be achieved by specifying only the queue.

> }
> {
>    name         longlimit
>    description  NONE
>    enabled      TRUE
>    limit        queues long.q hosts * to slots=16
> }
> {
>    name         verylonglimit
>    description  NONE
>    enabled      TRUE
>    limit        queues verylong.q hosts * to slots=4
> }
> {
>    name         urgentlimit
>    description  NONE
>    enabled      TRUE
>    limit        users {*} queues urgent.q hosts * to slots=1
> }
> {
>    name         debuglimit
>    description  NONE
>    enabled      TRUE
>    limit        users {*} queues debug.q hosts {*} to slots=1
> }

As the above 5 limits are disjoint, they can also be put in one and the same RQS. You can give each rule a name so that the name is listed instead of the rule number, which is always 1 right now.

> This will cause a parallel job across multiple queues to never schedule.
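
The merge of the five per-queue limits suggested above could look roughly like the sketch below. The rule set name "queuelimits" and the rule names are made up for illustration; the limits are copied from the quoted rule sets, with the redundant "hosts *" dropped. Within one rule set only the first matching rule is applied, which is harmless here because each rule targets a different queue.

```
{
   name         queuelimits
   description  NONE
   enabled      TRUE
   limit        name short    queues short.q to slots=32
   limit        name long     queues long.q to slots=16
   limit        name verylong queues verylong.q to slots=4
   limit        name urgent   users {*} queues urgent.q to slots=1
   limit        name debug    users {*} queues debug.q hosts {*} to slots=1
}
```

With named rules, the scheduling info should then refer to something like "queuelimits/short" instead of "shortlimit/1".
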
> If I get rid of the "nodelimit" and instead set the number of slots using
> the complex value in the host configuration, then everything works (except
> my debug queue).

Do you have many machine types? What happens if you don't use $num_proc there but specify a hard-coded limit per hostgroup for each machine type?

   limit queues !debug.q hosts {@quadcore} to slots=4
   limit queues !debug.q hosts {@hexacore} to slots=6

> Below I give an example of a hanging job (with the scheduler output enabled).
> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
> verylong.q. I request 40 slots as that will have to span multiple queues.

If I understand you correctly, SGE could find different combinations for the slot allocation, depending on which algorithm is used, as all the queues are on the same machines?

-- Reuti

> $ qsub -w e -l h_rt=3:50:00 -pe mpi 40 test.sh
> Your job 13280 ("test.sh") has been submitted
>
> $ qstat -u '*'
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>   13280 0.00000 test.sh    moloney      qw    01/11/2012 21:21:32                                    40
>
> $ qstat -j 13280
> ==============================================================
> job_number:                 13280
> exec_file:                  job_scripts/13280
> submission_time:            Wed Jan 11 21:21:32 2012
> owner:                      moloney
> ...
> scheduling info:            cannot run in queue "debug.q" because PE "mpi" is not in pe list
>                             cannot run in queue "urgent.q" because PE "mpi" is not in pe list
>                             cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                             cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                             cannot run in PE "mpi" because it only offers 0 slots
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users