Hello,
I set up a test queue to try out new prolog/epilog scripts, and I am seeing
some strange behavior when I submit a PE job to this queue: the job either
never gets scheduled or waits a very long time before it does. I tried
several PEs with allocation rules of '1', '2', and '4', all to no avail.
Submitting a job without a PE makes it run immediately. I am using SGE
6.2u5.
Checking why it is not running:
$ qalter -w v 7301747
...
Job 7301747 cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"
Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots
verification: no suitable queues
$ qconf -sp pe_1
pe_name pe_1
slots 9999999
user_lists NONE
xuser_lists NONE
start_proc_args startmpi.sh $pe_hostfile
stop_proc_args stopmpi.sh $pe_hostfile
allocation_rule 1
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
$ qconf -srqs limit_slots_for_users
{
name limit_slots_for_users
description "limit the number of simultaneous slots any user can use"
enabled TRUE
limit users {*} to slots=800
}
And finally,
$ qstat
job-ID  prior   name  user  state submit/start at     queue  slots ja-task-ID
-----------------------------------------------------------------------------
7301584 0.60051 sleep ilya  qw    04/20/2018 18:29:26        4
7301747 0.50051 sleep ilya  qw    04/20/2018 18:36:23        1
So I am not running anything at the moment. If I submit a job with the same
PE to a production queue, it will get scheduled.
A job that I left pending last night finally got scheduled after 7-8 hours.
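For reference, here is a minimal sketch of how the slots currently held could be summed from qstat output and compared against the 800-slot limit. It assumes the default qstat column layout with one job per line, and that jobs holding slots show a queue instance of the form queue@host in the queue column. running_slots is a hypothetical helper name, not an SGE command.

```shell
# Sum the slots of all jobs that have a queue instance assigned,
# reading `qstat` output from stdin.
# Assumed column layout (one job per line, no wrapping):
#   job-ID prior name user state date time queue slots [ja-task-ID]
# Pending jobs have an empty queue column, so their 8th field is the
# slot count and contains no "@"; they are skipped.
running_slots() {
    awk '
        $1 ~ /^[0-9]+$/ && $8 ~ /@/ { total += $9 }
        END { print total + 0 }
    '
}

# Example usage:
#   qstat -u ilya | running_slots
```

With only the two pending jobs listed in the qstat output above, this would print 0.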
The test queue is as follows:
$ qconf -sq test_gpu.q
qname test_gpu.q
hostlist @gpu
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make pe_1 pe_2 pe_3 pe_4 pe_slots
rerun TRUE
slots 4
tmpdir /data
shell /bin/sh
prolog [email protected]
epilog [email protected]
shell_start_mode unix_behavior
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method custom_kill -p $job_pid -j $job_id
notify 00:00:60
owner_list NONE
user_lists system.g
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core 1G
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
Any suggestions?
Thank you,
Ilya.