On Wed, Feb 11, 2015 at 2:30 PM, Reuti <[email protected]> wrote:
> Hi,
>
> > On 11.02.2015 at 20:03, Michael Stauffer <[email protected]> wrote:
> >
> > OGS/GE 2011.11p1
> >
> > Hi again,
> >
> > I've got a user who's got 240+ running jobs (single slot) in the default
> > queue (and 1400 queued and waiting), when the usual slot quota is about 50.
> > I say 'usual' because I'm running a simple script that modifies everyone's
> > slot quota depending on the overall cluster usage. When lots of slots are
> > available, the quota goes up to a max of 100. I checked the logs from the
> > script (it runs every minute) and over the time period that these 240+ jobs
> > were submitted, the max slot quota never went above 97.
> >
> > My script examines the current cluster state, then dumps out a new rqs
> > file, which then gets loaded via 'qconf -Mrqs'. The script gets called
> > every minute. The queue schedule interval is one second:
> >
> > schedule_interval 0:0:1
>
> Are the jobs so short that such a short interval is necessary? It will put
> some load on the scheduler.

No, they're not so short. I had this just to give the user the fastest
response possible. I don't notice any overhead on my system; usually there
are at most a few hundred jobs in the queue and we have an overpowered head
node. But I'll change it to 2 sec for good measure.

> > Anyone have an idea how this might have happened? If the user submits a
> > lot of jobs in the split-second when 'qconf -Mrqs' is updating, could the
> > scheduler get confused and start more jobs than it should? Any suggestions
> > on how to dig around to see what happened? Thanks.
>
> I can't say for sure, but instead of creating an altered file of the
> output, it's also possible to change individual lines like:
>
> $ qconf -mattr resource_quota limit slots=4 general/3
> $ qconf -mattr resource_quota limit slots=4 general/short # here the limit got a name
> $ qconf -mattr resource_quota enabled TRUE general
>
> for an RQS called "general".

OK, seems like a great idea.
By 'can't say for sure', do you mean you don't know for sure whether this
will avoid the problem? It seems very likely.

> A safety net could be set up in addition in the scheduler configuration
> with "maxujobs".

Yes, good idea. I had that set once but removed it for some reason I can't
remember. Also, I figure I could disable all queues before I make the
changes, then re-enable them.

-M

> -- Reuti
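
A minimal sketch of the per-rule update discussed above, assuming an RQS named "general" as in Reuti's example. The helper names `rqs_cmd` and `set_rqs_slots` and the `DRY_RUN` variable are hypothetical and only for illustration; the only part taken from the thread is the `qconf -mattr resource_quota limit slots=N general/RULE` invocation. By default it prints the command rather than running it, so it can be checked before touching a live cluster.

```shell
#!/bin/sh
# Sketch: change the slot limit of a single RQS rule in place,
# instead of dumping a whole new rqs file and loading it with 'qconf -Mrqs'.
# DRY_RUN is a hypothetical switch; it defaults to 1 (print only).
DRY_RUN="${DRY_RUN:-1}"

# rqs_cmd RQS_NAME RULE SLOTS
# Prints the qconf invocation for one rule. RULE is either the rule's
# position (e.g. 3) or its name (e.g. short), as in the thread.
rqs_cmd() {
    printf 'qconf -mattr resource_quota limit slots=%s %s/%s\n' "$3" "$1" "$2"
}

# set_rqs_slots RQS_NAME RULE SLOTS
# Prints the command in dry-run mode, otherwise executes it.
set_rqs_slots() {
    cmd=$(rqs_cmd "$1" "$2" "$3")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Unquoted on purpose so the string is split into command and arguments.
        $cmd
    fi
}

# Example: show the command that would set rule "short" of "general" to 4 slots.
set_rqs_slots general short 4
```

Running it with `DRY_RUN=0` would execute the command for real. Whether a per-rule update actually avoids the over-allocation seen here is not established in this thread, so a cap such as "maxujobs" may still be worth keeping as a backstop.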
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
