On 11.02.2015 at 20:52, Michael Stauffer wrote:

> On Wed, Feb 11, 2015 at 2:30 PM, Reuti <[email protected]> wrote:
> Hi,
>
> > On 11.02.2015 at 20:03, Michael Stauffer <[email protected]> wrote:
> >
> > OGS/GE 2011.11p1
> >
> > Hi again,
> >
> > I've got a user who's got 240+ running jobs (single slot) in the default
> > queue (and 1400 queued and waiting), when the usual slot quota is about 50.
> > I say 'usual' because I'm running a simple script that modifies everyone's
> > slot quota depending on the overall cluster usage. When lots of slots are
> > available, the quota goes up to a max of 100. I checked the logs from the
> > script (it runs every minute) and over the time period that these 240+ jobs
> > were submitted, the max slot quota never went above 97.
> >
> > My script examines the current cluster state, then dumps out a new rqs
> > file, which then gets loaded via 'qconf -Mrqs'. The script gets called
> > every minute. The queue schedule interval is one second:
> >
> > schedule_interval 0:0:1
>
> Are the jobs so short that such a short interval is necessary? It will put
> some load on the scheduler.
>
> No they're not so short. I had this just to give the user the fastest
> response possible. I don't notice any overhead on my system, usually there's
> at most a few hundred jobs in the queue and we have an overpowered head node.
> But I'll change it to 2 sec for good measure.
>
> >
> > Anyone have an idea how this might have happened? If the user submits a lot
> > of jobs in the split-second when 'qconf -Mrqs' is updating, could the
> > scheduler get confused and start more jobs than it should? Any suggestions
> > on how to dig around to see what happened? Thanks.
>
> I can't say for sure, but instead of creating an altered file of the output,
> it's also possible to change individual lines like:
>
> $ qconf -mattr resource_quota limit slots=4 general/3
> $ qconf -mattr resource_quota limit slots=4 general/short # here the limit got a name
> $ qconf -mattr resource_quota enabled TRUE general
>
> for an RQS called "general".
>
> OK seems like a great idea. By 'can't say for sure' do you mean you don't
> know for sure if this will avoid the problem?

Exactly. Sometimes RQS don't work although they should. It was never clear to
me when exactly they fail.

-- Reuti

> Seems very likely.
>
>
> A safety net could be set up in addition in the scheduler configuration with
> "maxujobs".
>
> Yes, good idea. I had that set once but removed it for some reason, can't
> remember.
>
> Also I figure I could disable all queues before I make the changes, then
> reenable.
>
> -M
>
>
> -- Reuti
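
For reference, the two ideas from this thread (changing a single limit with 'qconf -mattr' instead of reloading the whole file, and disabling the queues around the change) could be combined roughly as below. This is only a sketch and not tested against a live cluster: the function names and the DRY_RUN switch are made up for illustration, and using "qmod -d '*'" / "qmod -e '*'" to disable and re-enable all queues is my assumption about how the disable/reenable step would be done. Whether this actually avoids the over-scheduling is not known, as said above.

```shell
#!/bin/sh
# Sketch: change one limit of an existing resource quota set in place,
# with all queues disabled while the change is applied.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.

DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_slot_limit RQS_NAME RULE SLOTS
# RULE is the rule number or rule name inside the RQS, e.g. "3" or "short".
set_slot_limit() {
    run qconf -mattr resource_quota limit "slots=$3" "$1/$2"
}

# update_quota RQS_NAME RULE SLOTS
# Assumption: disabling the queues only blocks new dispatches and leaves
# running jobs alone, so it is used here as a guard around the change.
update_quota() {
    run qmod -d '*'
    set_slot_limit "$1" "$2" "$3"
    run qmod -e '*'
}
```

With the default dry run, 'update_quota general short 4' just prints the three commands it would issue; setting DRY_RUN=0 would run them for real. The "maxujobs" safety net is separate from this and lives in the scheduler configuration.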
