Hi,

> On 11.02.2015 at 20:03, Michael Stauffer <[email protected]> wrote:
>
> OGS/GE 2011.11p1
>
> Hi again,
>
> I've got a user who's got 240+ running jobs (single slot) in the default
> queue (and 1400 queued and waiting), when the usual slot quota is about 50. I
> say 'usual' because I'm running a simple script that modifies everyone's slot
> quota depending on the overall cluster usage. When lots of slots are
> available, the quota goes up to a max of 100. I checked the logs from the
> script (it runs every minute) and over the time period that these 240+ jobs
> were submitted, the max slot quota never went above 97.
>
> My script examines the current cluster state, then dumps out a new rqs file,
> which then gets loaded via 'qconf -Mrqs'. The script gets called every
> minute. The queue schedule interval is one second:
>
> schedule_interval 0:0:1

Are the jobs so short that such a short interval is necessary? It will put
some load on the scheduler.

> Anyone have an idea how this might have happened? If the user submits a lot
> of jobs in the split-second when 'qconf -Mrqs' is updating, could the
> scheduler get confused and start more jobs than it should? Any suggestions on
> how to dig around to see what happened? Thanks.

I can't say for sure, but instead of loading an altered copy of the whole
output, it's also possible to change individual lines. For an RQS called
"general":

$ qconf -mattr resource_quota limit slots=4 general/3
$ qconf -mattr resource_quota limit slots=4 general/short  # here the limit got a name
$ qconf -mattr resource_quota enabled TRUE general

A safety net could be set up in addition with "maxujobs" in the scheduler
configuration.

-- Reuti
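
P.S. In case it helps, here is a minimal sketch of how a per-minute script
could change single limit lines with 'qconf -mattr' instead of reloading the
whole rule set. It is untested against a live cluster. The rule set name
"general" and the rule name "short" are taken from the example commands; the
helper names (clamp, rqs_cmd, set_quota), the DRY_RUN switch and the lower
bound of 1 are my own assumptions, and the upper bound of 100 is the maximum
mentioned in the original post.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN and the 1..100 bounds are assumptions.

# Clamp a computed quota into [MIN, MAX] so a miscalculation in the
# usage logic cannot push the limit outside the intended range.
clamp() {
    # usage: clamp VALUE MIN MAX
    v=$1
    if [ "$v" -lt "$2" ]; then v=$2; fi
    if [ "$v" -gt "$3" ]; then v=$3; fi
    echo "$v"
}

# Build the qconf call that changes one limit line of a rule set.
rqs_cmd() {
    # usage: rqs_cmd SLOTS RULE   (RULE is e.g. general/short or general/3)
    echo "qconf -mattr resource_quota limit slots=$1 $2"
}

# Print the command by default; run it only when DRY_RUN=0 is set.
set_quota() {
    # usage: set_quota SLOTS RULE
    cmd=$(rqs_cmd "$(clamp "$1" 1 100)" "$2")
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

Calling 'set_quota 97 general/short' only prints the command it would run;
with DRY_RUN=0 in the environment it calls qconf for real.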
