On 11.02.2015 at 20:52, Michael Stauffer wrote:

> On Wed, Feb 11, 2015 at 2:30 PM, Reuti <[email protected]> wrote:
> Hi,
> 
> > On 11.02.2015 at 20:03, Michael Stauffer <[email protected]> wrote:
> >
> > OGS/GE 2011.11p1
> >
> > Hi again,
> >
> > I've got a user who's got 240+ running jobs (single slot) in the default 
> > queue (and 1400 queued and waiting), when the usual slot quota is about 50. 
> > I say 'usual' because I'm running a simple script that modifies everyone's 
> > slot quota depending on the overall cluster usage. When lots of slots are 
> > available, the quota goes up to a max of 100. I checked the logs from the 
> > script (it runs every minute) and over the time period that these 240+ jobs 
> > were submitted, the max slot quota never went above 97.
> >
> > My script examines the current cluster state, then dumps out a new rqs 
> > file, which then gets loaded via 'qconf -Mrqs'. The script gets called 
> > every minute. The queue schedule interval is one second:
> >
> >   schedule_interval                 0:0:1
> 
> Are the jobs so short that such a short interval is necessary? It will put 
> some load on the scheduler.
> 
> No they're not so short. I had this just to give the user the fastest 
> response possible. I don't notice any overhead on my system, usually there's 
> at most a few hundred jobs in the queue and we have an overpowered head node. 
> But I'll change it to 2 sec for good measure.
>  
> 
> 
> > Anyone have an idea how this might have happened? If the user submits a lot 
> > of jobs in the split-second when 'qconf -Mrqs' is updating, could the 
> > scheduler get confused and start more jobs than it should? Any suggestions 
> > on how to dig around to see what happened? Thanks.
> 
> I can't say for sure, but instead of creating an altered file of the output, 
> it's also possible to change individual lines like:
> 
> $ qconf -mattr resource_quota limit slots=4 general/3
> $ qconf -mattr resource_quota limit slots=4 general/short  # here the limit got a name
> $ qconf -mattr resource_quota enabled TRUE general
> 
> for an RQS called "general".
> 
> OK seems like a great idea. By 'can't say for sure' do you mean you don't 
> know for sure if this will avoid the problem?

Exactly. Sometimes RQS do not work although they should. It was never clear to 
me exactly when they fail.
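
As a rough illustration of the per-attribute update quoted above, here is a 
minimal shell sketch. It assumes an RQS called "general" with a rule named 
"short", as in the example, and uses the 50 and 100 bounds from the quota range 
mentioned in the thread. The function names clamp_quota and set_slot_limit are 
made up for this sketch. Whether this avoids the over-allocation is not known.

```shell
#!/bin/sh
# Sketch only: change one RQS limit in place instead of reloading the rule set.
# "general" and "short" are the names from the quoted example; adjust as needed.

# clamp_quota VALUE MIN MAX -> prints VALUE limited to the range [MIN, MAX]
clamp_quota() {
    v=$1; lo=$2; hi=$3
    if [ "$v" -lt "$lo" ]; then v=$lo; fi
    if [ "$v" -gt "$hi" ]; then v=$hi; fi
    echo "$v"
}

# set_slot_limit SLOTS RQS RULE
# With DRY_RUN=1 (the default here) the qconf command is only printed.
set_slot_limit() {
    cmd="qconf -mattr resource_quota limit slots=$1 $2/$3"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

A call such as `set_slot_limit "$(clamp_quota 120 50 100)" general short` would 
print the command with slots=100. Set DRY_RUN=0 to actually run it.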

-- Reuti


> Seems very likely.
>  
> 
> A safety net could be set up in addition in the scheduler configuration with 
> "maxujobs".
> 
> Yes, good idea. I had that set once but removed it for some reason, can't 
> remember.
> 
> Also I figure I could disable all queues before I make the changes, then 
> reenable.
> 
> -M
>  
> 
> -- Reuti
> 
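
The idea of disabling all queues around the RQS reload could be sketched roughly 
as below. This is an untested outline, not a confirmed fix: the helper names run 
and update_with_queues_disabled and the file path in the usage note are 
hypothetical. Disabling queues only stops new jobs from being dispatched; 
running jobs continue.

```shell
#!/bin/sh
# Sketch only: disable all queues, load the new RQS file, then re-enable.

# run CMD...: with DRY_RUN=1 (the default here) only print the command.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# update_with_queues_disabled RQS_FILE
update_with_queues_disabled() {
    run qmod -d '*' || return 1   # stop dispatching new jobs to all queues
    run qconf -Mrqs "$1"          # load the modified resource quota sets
    status=$?
    run qmod -e '*'               # always try to re-enable the queues
    return $status
}
```

For example, `update_with_queues_disabled /tmp/new.rqs` prints the three 
commands in order. The current maxujobs value can be checked with 
`qconf -ssconf | grep maxujobs`, where 0 means no per-user limit.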


