On Wed, Feb 11, 2015 at 2:30 PM, Reuti <[email protected]> wrote:

> Hi,
>
> > On 11.02.2015 at 20:03, Michael Stauffer <[email protected]> wrote:
> >
> > OGS/GE 2011.11p1
> >
> > Hi again,
> >
> > I've got a user who's got 240+ running jobs (single slot) in the default
> > queue (and 1400 queued and waiting), when the usual slot quota is about 50.
> > I say 'usual' because I'm running a simple script that modifies everyone's
> > slot quota depending on the overall cluster usage. When lots of slots are
> > available, the quota goes up to a max of 100. I checked the logs from the
> > script (it runs every minute) and over the time period that these 240+ jobs
> > were submitted, the max slot quota never went above 97.
> >
> > My script examines the current cluster state, then dumps out a new rqs
> > file, which then gets loaded via 'qconf -Mrqs'. The script gets called
> > every minute. The queue schedule interval is one second:
> >
> >   schedule_interval                 0:0:1
>
> Are the jobs so short that such a short interval is necessary? It will put
> some load on the scheduler.
>

No, they're not that short. I set it that low just to give users the fastest
response possible. I don't notice any overhead on my system; usually there
are at most a few hundred jobs in the queue, and we have an overpowered
head node. But I'll change it to 2 sec for good measure.


>
>
> > Anyone have an idea how this might have happened? If the user submits a
> > lot of jobs in the split-second when 'qconf -Mrqs' is updating, could the
> > scheduler get confused and start more jobs than it should? Any suggestions
> > on how to dig around to see what happened? Thanks.
>
> I can't say for sure, but instead of creating an altered file of the
> output, it's also possible to change individual lines like:
>
> $ qconf -mattr resource_quota limit slots=4 general/3
> $ qconf -mattr resource_quota limit slots=4 general/short  # here the limit got a name
> $ qconf -mattr resource_quota enabled TRUE general
>
> for an RQS called "general".
>

OK seems like a great idea. By 'can't say for sure' do you mean you don't
know for sure if this will avoid the problem? Seems very likely.
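
For the archive, here is a minimal sketch of how the per-minute script might
change a single limit in place rather than reloading the whole rule set with
'qconf -Mrqs'. It reuses the 'qconf -mattr' form quoted above. The RQS name
"general", the rule name "short", and the DRY_RUN switch are placeholders I
made up for illustration, and I have not verified that this avoids the
over-dispatch.

```shell
#!/bin/sh
# Sketch only: update one limit of an existing RQS in place.
# Hypothetical names: "general" (RQS), "short" (rule), DRY_RUN (test switch).

# Print the qconf command that sets the slot limit for one rule.
# $1 = new slot limit, $2 = RQS name, $3 = rule name or rule number
build_rqs_cmd() {
    case "$1" in
        ''|*[!0-9]*) echo "slot limit must be a number: '$1'" >&2; return 1 ;;
    esac
    printf 'qconf -mattr resource_quota limit slots=%s %s/%s\n' "$1" "$2" "$3"
}

# Run the command, or only print it when DRY_RUN=1.
set_slot_limit() {
    cmd=$(build_rqs_cmd "$1" "$2" "$3") || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Example (prints the command instead of running it):
#   DRY_RUN=1 set_slot_limit 50 general short
```

Running it with DRY_RUN=1 first makes it easy to check the generated command
against the cluster's actual RQS and rule names before letting cron call it.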


>
> A safety net could additionally be set up in the scheduler configuration
> with "maxujobs".
>

Yes, good idea. I had that set once but removed it for some reason I can't
remember.
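
A rough sketch of how maxujobs could be set without the interactive editor:
dump the scheduler configuration, rewrite the one line, and load it back.
The helper name set_sched_param and the value 100 are my own choices for
illustration (100 matches the script's maximum quota). Note that maxujobs
counts jobs per user rather than slots, which only lines up here because the
jobs are single-slot.

```shell
#!/bin/sh
# Sketch only: rewrite one "name value" line of 'qconf -ssconf' output.
# set_sched_param is a hypothetical helper, not part of Grid Engine.

# Reads a scheduler configuration on stdin, writes the modified one to stdout.
# $1 = parameter name, $2 = new value
set_sched_param() {
    awk -v key="$1" -v val="$2" '
        $1 == key { printf "%-33s %s\n", key, val; next }  # replace this line
        { print }                                          # keep all others
    '
}

# Intended use (needs a live qmaster, so left commented out here):
#   tmp=$(mktemp)
#   qconf -ssconf | set_sched_param maxujobs 100 > "$tmp"
#   qconf -Msconf "$tmp" && rm -f "$tmp"
```

The same helper would work for the schedule_interval change mentioned
earlier, e.g. 'set_sched_param schedule_interval 0:0:2'.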

Also, I figure I could disable all queues before I make the changes, then
re-enable them.
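
Something like the following is what I have in mind, as an untested sketch.
Disabling a queue only stops new jobs from being dispatched to it; running
jobs are not affected. The wrapper name with_queues_disabled and the QMOD
override are made up for illustration, and it assumes no queue is kept
disabled on purpose, since "qmod -e '*'" would re-enable those as well.

```shell
#!/bin/sh
# Sketch only: keep all queues disabled while a configuration change runs.
# with_queues_disabled and the QMOD variable are hypothetical names.

# "$@" = the command to run while the queues are disabled
with_queues_disabled() {
    ${QMOD:-qmod} -d '*' || return 1   # stop dispatching new jobs
    "$@"
    rc=$?
    ${QMOD:-qmod} -e '*'               # always re-enable, even on failure
    return "$rc"
}

# Example, with the quota file path left to the caller:
#   with_queues_disabled qconf -Mrqs "$rqs_file"
```

The wrapper returns the exit status of the wrapped command, so the calling
script can still log a failed quota update.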

-M


>
> -- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users