OGS/GE 2011.11p1

Hi again,

I've got a user with 240+ running single-slot jobs in the default queue (and another 1400 queued and waiting), when the usual slot quota is about 50. I say 'usual' because I'm running a simple script that modifies everyone's slot quota depending on overall cluster usage. When lots of slots are available, the quota goes up to a max of 100. I checked the script's logs, and over the period in which these 240+ jobs were submitted, the slot quota never went above 97.

The script runs every minute. It examines the current cluster state, then dumps out a new rqs file, which gets loaded via 'qconf -Mrqs'. The schedule interval is one second:

    schedule_interval 0:0:1

Does anyone have an idea how this might have happened? If the user submits a lot of jobs in the split second when 'qconf -Mrqs' is updating, could the scheduler get confused and start more jobs than it should?

Any suggestions on how to dig around to see what happened?

Thanks.

-M
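For readers who want to picture the update cycle described above, here is a minimal sketch in Python. It is not the actual script: the linear scaling rule, the rule-set name `max_user_slots`, and the function names are all assumptions made for illustration. Only the 50 and 100 bounds and the 'qconf -Mrqs' call come from the message.

```python
import subprocess
import tempfile


def quota_for(free_slots, total_slots, low=50, high=100):
    """Scale the per-user slot quota linearly with the free fraction.

    Hypothetical rule: `low` when the cluster is full, `high` when idle.
    """
    if total_slots <= 0:
        return low
    fraction = max(0.0, min(1.0, free_slots / total_slots))
    return low + int((high - low) * fraction)


def render_rqs(name, limit):
    """Render a resource quota set that caps each user's slots."""
    return (
        "{\n"
        f"   name         {name}\n"
        '   description  "per-user slot limit"\n'
        "   enabled      TRUE\n"
        f"   limit        users {{*}} to slots={limit}\n"
        "}\n"
    )


def load_rqs(text):
    """Write the rule set to a temporary file and load it with qconf -Mrqs."""
    with tempfile.NamedTemporaryFile("w", suffix=".rqs") as handle:
        handle.write(text)
        handle.flush()
        subprocess.run(["qconf", "-Mrqs", handle.name], check=True)


if __name__ == "__main__":
    # Example numbers only; a real script would query the cluster state.
    print(render_rqs("max_user_slots", quota_for(free_slots=400, total_slots=1000)))
```

Running the file only prints the generated rule set; `load_rqs` is never called, so nothing is changed on a live cluster.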
