OGS/GE 2011.11p1

Hi again,

I have a user with 240+ running single-slot jobs in the default queue
(and another 1400 queued and waiting), when the usual slot quota is about 50.
I say 'usual' because I'm running a simple script that modifies everyone's
slot quota depending on the overall cluster usage. When lots of slots are
available, the quota goes up to a max of 100. I checked the logs from the
script (it runs every minute) and over the time period that these 240+ jobs
were submitted, the max slot quota never went above 97.

My script examines the current cluster state, then dumps out a new rqs
file, which then gets loaded via 'qconf -Mrqs'. The script gets called
every minute. The scheduler's schedule interval is one second:

  schedule_interval                 0:0:1
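
For anyone who wants to reproduce the setup, here is a minimal sketch of
that kind of update step. It is not the actual script: the rule set name
'max_user_slots', the linear scaling rule, and the floor of 50 are
assumptions for illustration. Only the rqs file layout and the
'qconf -Mrqs <file>' call are taken from Grid Engine itself.

```python
import subprocess
import tempfile

MIN_QUOTA = 50    # assumed floor (the "usual" quota mentioned above)
MAX_QUOTA = 100   # cap mentioned above


def compute_quota(free_slots, total_slots):
    """Scale the per-user quota linearly with the free fraction (assumed rule)."""
    if total_slots <= 0:
        return MIN_QUOTA
    frac = max(0.0, min(1.0, free_slots / total_slots))
    return MIN_QUOTA + int(frac * (MAX_QUOTA - MIN_QUOTA))


def render_rqs(quota, name="max_user_slots"):
    """Build the text of a resource quota set limiting every user to `quota` slots."""
    return (
        "{\n"
        f"   name         {name}\n"
        '   description  "per-user slot limit, adjusted by script"\n'
        "   enabled      TRUE\n"
        f"   limit        users {{*}} to slots={quota}\n"
        "}\n"
    )


def load_rqs(text):
    """Write the rule set to a temporary file and load it with 'qconf -Mrqs'."""
    with tempfile.NamedTemporaryFile("w", suffix=".rqs", delete=False) as fh:
        fh.write(text)
        path = fh.name
    # Replaces the existing rule set of the same name; raises if qconf fails.
    subprocess.run(["qconf", "-Mrqs", path], check=True)
```

A cron job would then call something like
load_rqs(render_rqs(compute_quota(free, total))) once a minute, where
'free' and 'total' come from whatever the script uses to read cluster state.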

Anyone have an idea how this might have happened? If the user submits a lot
of jobs in the split-second when 'qconf -Mrqs' is updating, could the
scheduler get confused and start more jobs than it should? Any suggestions
on how to dig around to see what happened? Thanks.
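
One check that could be logged next to each quota update is the number of
running slots per user at that moment, so the two can be compared later.
The sketch below assumes that 'qstat -u "*" -s r -xml' emits <job_list>
elements with a state="running" attribute and <JB_owner> and <slots>
children; those element names should be verified against the output on
this version before relying on it.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter


def running_slots_by_user(xml_text):
    """Sum the slots of running jobs per owner from qstat XML output."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    for job in root.iter("job_list"):
        if job.get("state") != "running":
            continue
        owner = job.findtext("JB_owner")
        # Treat a missing <slots> element as a single-slot job (assumption).
        slots = int(job.findtext("slots", default="1"))
        if owner:
            totals[owner] += slots
    return totals


def fetch_qstat_xml():
    """Run qstat for all users, running jobs only, and return the XML text."""
    result = subprocess.run(
        ["qstat", "-u", "*", "-s", "r", "-xml"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Logging running_slots_by_user(fetch_qstat_xml()) alongside the quota value
each minute would show when a user's running count first exceeded the
limit in force at the time.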

-M
