Hi,

> On 11.02.2015 at 20:03, Michael Stauffer <[email protected]> wrote:
> 
> OGS/GE 2011.11p1
> 
> Hi again,
> 
> I've got a user who's got 240+ running jobs (single slot) in the default 
> queue (and 1400 queued and waiting), when the usual slot quota is about 50. I 
> say 'usual' because I'm running a simple script that modifies everyone's slot 
> quota depending on the overall cluster usage. When lots of slots are 
> available, the quota goes up to a max of 100. I checked the logs from the 
> script (it runs every minute) and over the time period that these 240+ jobs 
> were submitted, the max slot quota never went above 97.
> 
> My script examines the current cluster state, then dumps out a new rqs file, 
> which then gets loaded via 'qconf -Mrqs'. The script gets called every 
> minute. The scheduler's schedule interval is one second:
> 
>   schedule_interval                 0:0:1

Are the jobs so short that such a short interval is necessary? It will put some 
load on the scheduler.


> Anyone have an idea how this might have happened? If the user submits a lot 
> of jobs in the split-second when 'qconf -Mrqs' is updating, could the 
> scheduler get confused and start more jobs than it should? Any suggestions on 
> how to dig around to see what happened? Thanks.

I can't say for sure, but instead of generating a complete new RQS file and 
loading it, it's also possible to change individual limits directly, like:

$ qconf -mattr resource_quota limit slots=4 general/3
$ qconf -mattr resource_quota limit slots=4 general/short  # here the limit has a name
$ qconf -mattr resource_quota enabled TRUE general

for an RQS called "general".
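
As a rough sketch of how a once-a-minute script could use this instead of 
reloading the whole file: the function names, the linear scaling rule and the 
DRY_RUN switch below are my assumptions, not taken from your script. Only the 
50/100 bounds and the "general/short" rule come from this thread.

```shell
#!/bin/sh
# Sketch only: compute_quota, apply_quota and DRY_RUN are hypothetical names,
# and the linear scaling is an assumed policy.

# Scale the per-user slot quota linearly between min and max,
# depending on how many slots are currently free.
compute_quota() {
    free=$1 total=$2 min=${3:-50} max=${4:-100}
    if [ "$total" -le 0 ]; then
        # No usable slot count: fall back to the minimum quota.
        echo "$min"
        return
    fi
    echo $(( min + (max - min) * free / total ))
}

# Change only the slots limit of the rule "short" in the RQS "general".
# By default the command is printed; set DRY_RUN=0 to really run it.
apply_quota() {
    cmd="qconf -mattr resource_quota limit slots=$1 general/short"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

Calling `apply_quota "$(compute_quota 120 480)"` would then print the qconf 
line for a quota of 62 slots. How the free and total slot counts are obtained 
is left out here on purpose.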

In addition, a safety net could be set up in the scheduler configuration with 
"maxujobs".
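
The current value can be checked with `qconf -ssconf` and changed with 
`qconf -msconf`. The entry would look something like the fragment below; the 
value 100 is only an illustration matching the maximum quota you mentioned, 
and 0 means no per-user limit on running jobs.

```
maxujobs                          100
```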

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
