Hi,
On 12.01.2012 at 08:00, Brendan Moloney wrote:
> I seem to have found a combination of resource quotas that is preventing
> the scheduler from scheduling parallel jobs across multiple queues.
>
> I have multiple queues for jobs with different run times: veryshort.q,
> short.q, long.q, and verylong.q. Each of these queues has an increasing
> 'h_rt' limit and an increasing sequence number (I have the scheduler sort
> by sequence numbers). Each of these queues also has a decreasing number of
> slots available. Jobs are then submitted with an h_rt value and the
> shortest queue with an open slot is used. I also have a parallel
> environment "mpi" that is enabled in all of these queues.
>
> The problem only occurs if I use resource quota sets to both limit the total
> number of slots for the queues and limit the number of slots on each node.
>
> For example:
>
> {
> name nodelimit
> description NONE
> enabled TRUE
> limit queues !debug.q hosts {*} to slots=$num_proc
> }
> {
> name shortlimit
> description NONE
> enabled TRUE
> limit queues short.q hosts * to slots=32
I think you can leave out the "hosts *" here and in the other RQSs below. It
means that the slots used across all machines are limited to 32 in this queue.
The same can be achieved by specifying only the queue.
> }
> {
> name longlimit
> description NONE
> enabled TRUE
> limit queues long.q hosts * to slots=16
> }
> {
> name verylonglimit
> description NONE
> enabled TRUE
> limit queues verylong.q hosts * to slots=4
> }
> {
> name urgentlimit
> description NONE
> enabled TRUE
> limit users {*} queues urgent.q hosts * to slots=1
> }
> {
> name debuglimit
> description NONE
> enabled TRUE
> limit users {*} queues debug.q hosts {*} to slots=1
> }
As the five limits above are disjoint, they can also be put into one and the
same RQS. You can give each rule a name so that it is listed by name instead
of by rule number, which is always 1 right now.
> This will cause a parallel job across multiple queues to never schedule. If
> I get rid of the "nodelimit" and instead set the number of slots using
> the complex value in the host configuration, then everything works (except
> my debug queue).
Do you have many machine types? What happens if you don't use $num_proc there
but specify a hard-coded limit per host group for each machine type, e.g.:
limit queues !debug.q hosts {@quadcore} to slots=4
limit queues !debug.q hosts {@hexacore} to slots=6
> Below I give an example of a hanging job (with the scheduler output enabled).
> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
> verylong.q. I request 40 slots as that will have to span multiple queues.
If I understand you correctly, SGE could find different combinations for the
slot allocation, depending on the algorithm that is used, since all the queues
are on the same machines?
-- Reuti
> $ qsub -w e -l h_rt=3:50:00 -pe mpi 40 test.sh
> Your job 13280 ("test.sh") has been submitted
>
> $ qstat -u '*'
> job-ID prior   name    user    state submit/start at     queue slots ja-task-ID
> -------------------------------------------------------------------------------
> 13280  0.00000 test.sh moloney qw    01/11/2012 21:21:32       40
>
> $ qstat -j 13280
> ==============================================================
> job_number: 13280
> exec_file: job_scripts/13280
> submission_time: Wed Jan 11 21:21:32 2012
> owner: moloney
> ...
> scheduling info: cannot run in queue "debug.q" because PE "mpi" is not in pe list
>                  cannot run in queue "urgent.q" because PE "mpi" is not in pe list
>                  cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                  cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
>                  cannot run in PE "mpi" because it only offers 0 slots
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users