I am looking to use this differently.
The problem I am having is that users submit 200-1000 jobs at a time, and I have 80 servers with almost 1000 cores in total. For my normal queue, I want the SGE PE to place up to 4 jobs per server until it runs out of servers, then add up to 4 more per server until all the jobs are allocated. (1 job per server is fine to start, as long as it round-robins: adding a second job per server, then a third, until it runs out of jobs.)

Does the allocation rule limit the number of jobs per server PER qsub, or the total number of jobs allowed per server?

What I actually get is 20 jobs piled onto a couple of servers, overloading them, while the rest of the 80 servers sit idle. Each server has 10 cores and 128 GB of RAM, so each can handle up to 20 light jobs.
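
One way to keep any single host from being overloaded, regardless of the PE, might be to cap the slots on the queue itself; a minimal sketch, assuming the queue is named all.q (adjust the name and the count for your site):

# Cap every instance of the queue at 10 slots (one per core),
# so no host runs more than 10 jobs from this queue at once:
qconf -mattr queue slots 10 all.q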

Also, for the heavy CPU jobs I want a maximum of 4 per server; would I just replace $pe_slots with the integer 4 in allocation_rule?
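
For what it's worth, sge_pe(5) says a fixed integer in allocation_rule is applied per parallel job: every host used by that job must supply exactly that many slots. It is not a cap on the total number of jobs per host. A sketch of the submit side against the modified smp PE (the script name is made up):

# Request 8 slots; with allocation_rule 4 the scheduler must
# place exactly 4 slots on each of 2 hosts for this one job.
qsub -pe smp 8 heavy_job.sh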

Should I create a third PE, let's say "dan", with the desired settings? When I tried this before it threw errors.
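
In case it helps, here is a sketch of one way to clone an existing PE and register the copy (file, PE, and queue names are placeholders); a common cause of submit-time errors is that the new PE is not listed in any queue's pe_list:

# Dump an existing PE to a file, edit it, and register the copy:
qconf -sp make > dan.pe     # start from the round-robin PE
vi dan.pe                   # change pe_name to dan, set allocation_rule
qconf -Ap dan.pe            # add the new PE from the file

# The PE is only usable once a queue lists it:
qconf -mattr queue pe_list "make smp dan" all.q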


Am I correct that I want to change these settings? I suspect what I really want is a custom PE, since the ones below are the defaults.

I was looking at http://linux.die.net/man/5/sge_pe and http://www.softpanorama.org/HPC/Grid_engine/parallel_environment.shtml, but both seem to assume I already understand the details. For example: can I only put one allocation_rule setting per PE, and one PE per queue?
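
As far as I can tell from sge_pe(5) and queue_conf(5), each PE has exactly one allocation_rule, but a queue's pe_list can name several PEs, so you choose the behavior per job with qsub -pe. A quick way to check what a queue already accepts (queue name is a placeholder):

# Show which PEs the queue accepts:
qconf -sq all.q | grep pe_list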


[root@blade5-1-1 ~]# qconf -sp make
pe_name            make
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

[root@blade5-1-1 ~]# qconf -sp smp
pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE
[root@blade5-1-1 ~]# echo $pe_slots

(Nothing prints here: $pe_slots is not a shell variable, it is an allocation_rule keyword that only the SGE scheduler interprets.)


Here is the custom PE I tried, plus smp with allocation_rule changed to 4:

[root@blade5-1-1 ~]# qconf -sp DAN
pe_name            DAN
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

[root@blade5-1-1 ~]# qconf -sp smp
pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    4
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

Yep, we use functional tickets to accomplish this exact goal. Every user
gets 1000 functional tickets via auto_user_fshare in sge_conf(5), though
your exact number will depend on the number of tickets and the weights you
have elsewhere in your policy configuration.
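
The relevant knobs in the global configuration look roughly like this (edit with qconf -mconf; the share count is ours, yours may differ):

# In the global configuration (qconf -mconf global):
enforce_user        auto    # create a user object automatically at first submit
auto_user_fshare    1000    # each auto-created user gets 1000 functional shares
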
The waiting-time weight should also be set to 0, and the urgency given less
importance (by default the complex configuration grants 1000 urgency per slot,
which means jobs requesting more slots become more important):

weight_user                       0.900000
weight_project                    0.000000
weight_department                 0.000000
weight_job                        0.100000
weight_tickets_functional         1000000
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  F
weight_ticket                     1.000000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   32
default_duration                  8760:00:00
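
For reference, the block above is the scheduler configuration; it can be shown with qconf -ssconf and edited in place with qconf -msconf:

qconf -ssconf    # print the current scheduler configuration
qconf -msconf    # open it in $EDITOR for changes
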
We actually do weight waiting time, but at half the value of both the
functional and urgency tickets. We then give big urgency boosts to
difficult-to-schedule jobs (i.e. those needing lots of memory or CPUs in one
spot). It took us a while to arrive at a decent mix of short-run/small jobs
vs long-run/big jobs, and it will definitely be a site-dependent decision.
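
One way to express that kind of urgency boost is through the urgency column of the complex configuration (qconf -mc); the slots line below is the usual default, while the h_vmem line is adjusted for illustration (made consumable and given an invented urgency value):

# Excerpt from qconf -sc / qconf -mc; the last column is urgency.
#name    shortcut  type    relop requestable consumable default urgency
slots    s         INT     <=    YES         YES        1       1000
h_vmem   h_vmem    MEMORY  <=    YES         YES        0       100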

