Reuti,
The PEs will stay because we've found the configuration to be incredibly
convenient and easy to support (it handles just about all known parallel
implementations). My PE support questions have virtually disappeared
since we deployed this configuration. I know each queue instance has to
be evaluated for each matching PE, but we'll drop from 5 queue
instances to two, with 3 fewer RQS lines. One of the queue instances will
(generally) be dropped by ACL, which is evaluated before matching
resource requests, including PEs, IIRC. It should be a good
improvement. Has anyone (outside of Sun/Oracle/Univa internal staff) spent
time optimizing the queue configuration with some emphasis on scheduler
iteration performance? Seems like a neat area.
Thanks,
-Brian
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu
On 08/16/2012 01:46 PM, Reuti wrote:
Am 16.08.2012 um 18:07 schrieb Brian Smith:
I know that in a lot of scheduling environments, queues such as short,
long, etc. are used to differentiate different classes of jobs. In our
environment, we're doing very much the same thing, and also using fancy
pe_list syntax to differentiate our various clusters. It occurred to me,
however, that it might be better to ditch that strategy and instead use JSV
and complex attributes with a single default queue instance.
Let's say I want to have the job classes
devel <= 1hr
short <= 6hr
medium <= 48hr
long <= 192hr
xlong > 192hr (no limit, restricted access)
Our current methodology for ensuring QoS for those queues involves RQS & JSV.
Scheduling intervals are pretty long and hairy even for a <500-node cluster
due to the complex PE configuration:
{
   name         host_slotcap
   description  make sure only the right number of slots get used
   enabled      TRUE
   limit        queues * hosts {*} to slots=$num_proc
}
{
   name         queue_slotcap
   description  slot limits for each queue
   enabled      TRUE
   limit        queues xlong to slots=512
   limit        queues long to slots=1436
   limit        queues medium to slots=1724
}
{
   name         user_slotcap
   description  make sure users can only use so much
   enabled      TRUE
   limit        users {*} to slots=512
}
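(For anyone following along: the configured rule sets and current
consumption against them can be inspected with the stock SGE clients, e.g.:

```
qconf -srqsl              # list the names of all resource quota sets
qconf -srqs user_slotcap  # dump one rule set
qquota -u '*'             # show current usage against the quota rules
```
)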
We use a jsv to classify the jobs into queues:
...
# Set queue based on specified runtime
if [ -z "$hrt" ]; then
    jsv_sub_add_param q_hard "devel"
    jsv_sub_add_param l_hard h_rt "01:00:00"
    do_correct="true"
else
    do_correct="true"
    if [ $hrt -le $((3600*1)) ]; then
        jsv_sub_add_param q_hard "devel"
    elif [ $hrt -gt $((3600*1)) -a $hrt -le $((3600*6)) ]; then
        jsv_sub_add_param q_hard "short"
    elif [ $hrt -gt $((3600*6)) -a $hrt -le $((3600*48)) ]; then
        jsv_sub_add_param q_hard "medium"
    elif [ $hrt -gt $((3600*48)) -a $hrt -le $((3600*168)) ]; then
        jsv_sub_add_param q_hard "long"
    elif [ $hrt -gt $((3600*168)) ]; then
        jsv_sub_add_param q_hard "xlong"
    fi
fi
...
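(The threshold logic above can be factored into a small standalone helper;
here's a sketch with the SGE jsv_* calls omitted so it can be tested outside
the JSV framework. `classify_runtime` is a hypothetical name, not part of
our actual script:

```shell
#!/bin/sh
# Hypothetical helper mirroring the JSV thresholds above:
# map a requested h_rt (in seconds) to a job class name.
classify_runtime() {
    hrt=$1
    if [ "$hrt" -le $((3600*1)) ]; then
        echo devel
    elif [ "$hrt" -le $((3600*6)) ]; then
        echo short
    elif [ "$hrt" -le $((3600*48)) ]; then
        echo medium
    elif [ "$hrt" -le $((3600*168)) ]; then
        echo long
    else
        echo xlong
    fi
}
```

Since each elif only runs when the previous test failed, the explicit
lower-bound checks from the JSV aren't needed here.)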
We also use my github project for pbs-esque parallel environment support:
https://github.com/brichsmith/gepetools
This means each queue has a complicated PE configuration:
pe_list make smp,[@cms_X7DBR-3=pe_cms_X7DBR-3_hg \
pe_cms_X7DBR-3_hg.1 pe_cms_X7DBR-3_hg.2 \
pe_cms_X7DBR-3_hg.4 pe_cms_X7DBR-3_hg.6 \
pe_cms_X7DBR-3_hg.8], \
...
[@MRI_Sun_X4150=pe_MRI_Sun_X4150_hg \
pe_MRI_Sun_X4150_hg.1 pe_MRI_Sun_X4150_hg.2 \
pe_MRI_Sun_X4150_hg.4 pe_MRI_Sun_X4150_hg.6 \
pe_MRI_Sun_X4150_hg.8], \
...
[@RC_Dell_R410=pe_RC_Dell_R410_hg \
pe_RC_Dell_R410_hg.1 \
pe_RC_Dell_R410_hg.12 pe_RC_Dell_R410_hg.2 \
pe_RC_Dell_R410_hg.4 pe_RC_Dell_R410_hg.6 \
pe_RC_Dell_R410_hg.8], \
...
[@RC_HP_DL165G7=pe_RC_HP_DL165G7_hg \
pe_RC_HP_DL165G7_hg.1 pe_RC_HP_DL165G7_hg.12 \
pe_RC_HP_DL165G7_hg.16 pe_RC_HP_DL165G7_hg.2 \
pe_RC_HP_DL165G7_hg.4 pe_RC_HP_DL165G7_hg.6 \
pe_RC_HP_DL165G7_hg.8], \
...
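(The naming scheme is regular enough that the per-hostgroup PE lists can be
generated rather than hand-maintained. A toy sketch of the idea --
`pe_names_for_hostgroup` is a hypothetical helper, not something from
gepetools itself:

```shell
#!/bin/sh
# Hypothetical generator for the pe_list fragments above: given a
# hostgroup name and the per-node slot counts, emit the base PE name
# followed by one suffixed PE name per slot count.
pe_names_for_hostgroup() {
    hg=$1; shift
    out="pe_${hg}_hg"
    for n in "$@"; do
        out="$out pe_${hg}_hg.$n"
    done
    echo "$out"
}
```
)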
We set a negative urgency value on h_rt so that longer jobs get lower priority.
This approach seems to confuse the scheduler in terms of resource reservations, so
we pretty much can't use them and end up with the occasional starving >128-slot
parallel job. It's also pretty difficult to determine scheduling bottlenecks, etc.
It's elegant from a user perspective, but somewhat difficult to administer and
troubleshoot (we've whipped up some tools to help, but there are still
limitations).
I want to ditch the "queues-as-classifiers" model and use complex attributes instead.
Think a single "default" queue, but my jsv will now:
...
# Set queue based on specified runtime
if [ -z "$hrt" ]; then
    jsv_sub_add_param l_hard h_rt "01:00:00"
    jsv_sub_add_param l_hard devel 1
    do_correct="true"
else
    do_correct="true"
    if [ $hrt -le $((3600*1)) ]; then
        jsv_sub_add_param l_hard devel 1
    elif [ $hrt -gt $((3600*1)) -a $hrt -le $((3600*6)) ]; then
        jsv_sub_add_param l_hard short 1
    elif [ $hrt -gt $((3600*6)) -a $hrt -le $((3600*48)) ]; then
        jsv_sub_add_param l_hard medium 1
    elif [ $hrt -gt $((3600*48)) -a $hrt -le $((3600*168)) ]; then
        jsv_sub_add_param l_hard long 1
    elif [ $hrt -gt $((3600*168)) ]; then
        jsv_sub_add_param q_hard "xlong"
    fi
fi
...
RQS gets simplified to:
{
   name         host_slotcap
   description  make sure only the right number of slots get used
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}
{
   name         user_slotcap
   description  make sure users can only use so much
   enabled      TRUE
   limit        users {*} to slots=512
}
And global host gets configured as such:
...
complex_values ...,short=4096,devel=4096,medium=1768,long=1534
...
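(If editing the global host interactively via qconf -me is inconvenient, the
attribute could also be set non-interactively; a sketch only, since -mattr
replaces the whole attribute and would drop the elided values above:

```
qconf -mattr exechost complex_values \
    short=4096,devel=4096,medium=1768,long=1534 global
```
)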
We drop the urgency from h_rt and instead associate it with the complex
attributes:
$ qconf -sc | egrep '^(devel|short|medium|long)[ ]+'
devel    devel    INT  <=  YES  YES  0  1000
long     long     INT  <=  YES  YES  0  0
medium   medium   INT  <=  YES  YES  0  10
short    short    INT  <=  YES  YES  0  100
(columns: name, shortcut, type, relop, requestable, consumable, default, urgency)
What say other GridEngine gurus about this approach? I believe this will help
with my resource reservation woes and at the very least, should make my
scheduler iterations much shorter. Is there a better way? Are there any
potential pitfalls I may have missed?
Yes, it's good to use fewer RQS rules, as it's known that having several of
them sometimes leads to jobs which will never get scheduled. And if you have no
(automatic) subordination, it can often all be put in one queue in SGE.
But I wonder: will you also get rid of all the PEs, which you used up to now
to pack jobs onto certain exechosts due to the setup of the network?
-- Reuti
Any input or suggestions would be appreciated.
Best Regards,
Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users