Well, the change was made with some additional input from the fellows here,
particularly Stuart.  (big thanks!)

1. I set seq_no on my exec hosts based on the type of hardware.  Newer,
better hardware gets a low seq_no; older hardware gets a high one.

2. Set slots=$num_proc on exec nodes (substitute each node's core count for
$num_proc).

3. Ditched both the host_slotcap RQS and the queue_slotcap RQS.

4. Enabled a default queue with max run time equal to that of my previous
long queue.  Drained and removed the devel and short queues.  Waiting for
medium and long to drain (will happen by next Friday).

5. Created devel, short, medium, and long complex attributes with
appropriate urgencies, and modified my JSV to request these based on
runtime.  Set limits on the global exec host so that only x long, y medium,
and unlimited short and devel jobs can run.

6. Used Stuart's load_formula to pack jobs better than with seq_no alone
(that was my previous approach).  Since seq_no has high relevance in this
formula, setting it on the exec nodes based on class of hardware ensures
that jobs land on the best available nodes.
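
To make the weighting in item 6 concrete, here is a rough Python sketch that
evaluates the formula from Stuart's quoted message (seq_no*100+m_core-slots)
for a few hosts.  The host names and numbers are made up, and two things are
assumptions on my part: that the lowest value is tried first, and whatever
value the scheduler actually substitutes for `slots` (this thread does not
spell that out).  It only shows the arithmetic, not real scheduler behaviour.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str    # hypothetical host name
    seq_no: int  # hardware class: low for newer hardware, high for older
    m_core: int  # core count reported for the host
    slots: int   # value substituted for `slots` (illustrative numbers only)


def load_value(h: Host) -> int:
    """Evaluate seq_no*100+m_core-slots for one host."""
    return h.seq_no * 100 + h.m_core - h.slots


def rank(hosts):
    """Sort hosts by ascending formula value (assumed: lowest is tried first)."""
    return sorted(hosts, key=load_value)


hosts = [
    Host("old-a", seq_no=3, m_core=8, slots=8),
    Host("new-a", seq_no=1, m_core=16, slots=4),
    Host("new-b", seq_no=1, m_core=16, slots=12),
    Host("mid-a", seq_no=2, m_core=12, slots=0),
]

for h in rank(hosts):
    print(h.name, load_value(h))
```

Because seq_no is multiplied by 100, a difference of one hardware class
outweighs any m_core-slots difference smaller than 100, so the hardware
class decides first and the slot term only orders hosts within a class.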

Schedule intervals are short and sweet and job scheduling never skipped a
beat during the changes.  (BTW, commercial schedulers boast about their new
features to modify configuration on the fly... GE has been doing this for
as long as I can remember).  ARs seem to be working correctly and I'll be
testing out resource reservations later today.  Utilization is still high
and job turnover time has seen a nice improvement.

Thanks to everyone who provided input on this.  I'll post the results of my
resource reservation tests also.

-Brian

On Fri, Aug 17, 2012 at 3:33 PM, Stuart Barkley <stua...@4gh.net> wrote:

> On Thu, 16 Aug 2012 at 12:07 -0000, Brian Smith wrote:
>
> > {
> >    name         host_slotcap
> >    description  make sure only the right number of slots get used
> >    enabled      TRUE
> >    limit        queues * hosts {*} to slots=$num_proc
> > }
>
> I used to have a rule similar to this (I didn't have 'queues *'
> clause).  I found that disabling the rule improved my scheduling
> performance by a huge amount (several minutes became a few seconds).
> You might try disabling this rule briefly and see if your scheduling
> performance changes.
>
> I'm still using 6.2u5; it is possible bugs have been fixed in other
> versions.
>
> I have queues defined to provide small, medium and large jobs (based
> upon run time).  Limits on the queues determine which jobs will run on
> which queue.  The significant definitions are:
>
> % qconf -sq small
> qname                 small
> hostlist              @small
> seq_no                20
> s_rt                  4:00:00
> h_rt                  4:00:00
> %
>
> % qconf -sq medium
> qname                 medium
> hostlist              @medium
> seq_no                30
> s_rt                  48:00:00
> h_rt                  48:00:00
> %
>
> % qconf -ssconf
> queue_sort_method                 load
> load_formula                      seq_no*100+m_core-slots
> default_duration                  48:00:00
> %
>
> seq_no controls the order the queues are searched.  qmaster will
> search until it finds a queue which can run the job so the most
> limiting queues should be first.
>
> The size of the different queues is controlled by putting hosts in
> different host groups.  Hosts become dedicated to jobs of a particular
> size (or smaller).  This does allow "small" jobs to run on hosts for
> "large" jobs.  In our case this is acceptable since the smaller jobs
> will finish in a reasonable amount of time if large jobs are queued.
>
> No jsv is required and users should not specify the queue, just the
> run time limit.
>
> We can adjust the run time limits and number of hosts in each host
> group over time to best match the workload.
>
> Stuart
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>                                         --  Daniel Boone
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
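
The queue search Stuart describes above (lowest seq_no first, stop at the
first queue whose run time limit fits) can be sketched roughly as follows.
This is only an illustration in Python: the two queues and their limits are
copied from his qconf output, while the function names are made up, a
request equal to the limit is assumed to fit, and free slots, host groups,
and every other scheduling factor are ignored.

```python
from typing import Optional

# (queue name, seq_no, h_rt limit in seconds), taken from the quoted
# `qconf -sq small` and `qconf -sq medium` output.
QUEUES = [
    ("medium", 30, 48 * 3600),
    ("small", 20, 4 * 3600),
]


def parse_hms(text: str) -> int:
    """Convert an h:mm:ss run time string into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def first_fitting_queue(requested_h_rt: str) -> Optional[str]:
    """Return the first queue, in seq_no order, whose limit covers the request."""
    need = parse_hms(requested_h_rt)
    for name, _seq_no, limit in sorted(QUEUES, key=lambda q: q[1]):
        if need <= limit:
            return name
    return None  # no listed queue can run a job this long


for request in ("1:00:00", "24:00:00", "72:00:00"):
    print(request, first_fitting_queue(request))
```

Putting the most limiting queue first matters here: if medium were searched
before small, a one-hour job would match medium and never reach small.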



-- 
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu