Hi,
I'm finally getting back to this post. I've looked at the links and
suggestions in the two replies to my original post a few months ago, but
they haven't helped. Here's my original post:
I'm getting some queued jobs with scheduling info that includes this line
at the end:
cannot run in PE "unihost" because it only offers 0 slots
'unihost' is the only PE I use. When users request multiple slots, they use
'unihost':
qsub ... -binding linear:2 -pe unihost 2 ...
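A complete submission looks roughly like the following (the script name here
is just a placeholder, not the actual job that got stuck):

# hypothetical example of a typical multi-slot submission
qsub -q all.q -binding linear:2 -pe unihost 2 ./myscript.sh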
What happens is that these jobs don't run when it otherwise seems like they
should, or they sit in the queue for a long time, even though the user has
plenty of quota available in the queue they've requested, qhost shows enough
resources available on the queue's nodes (slots and vmem are consumables),
and qquota doesn't show any RQS limits being hit.
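For reference, those checks were done with commands along these lines (using
one of our nodes and my own username as examples):

# per-host view of the slots and vmem consumables on one node
qhost -h compute-0-1 -F slots,h_vmem
# per-user view of any resource quota limits being hit
qquota -u mgstauff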
Below I've dumped relevant configurations.
Today I created a new PE called "int_test" to test the integer allocation
rule. I set it to 16 (16 cores per node), and have also tried 8. It's been
added to the pe_list of the queues we use. When I try to run a job in this
new PE, however, it *always* fails with the same "PE ... offers 0 slots"
error, even when I can run the same multi-slot job with the "unihost" PE at
the same time. I'm not sure whether this helps with debugging or not.
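This is roughly how I added it, from memory (the config file name is just
whatever I happened to use at the time):

# add the new PE from a saved config file (allocation_rule 8 or 16)
qconf -Ap int_test.conf
# append it to the pe_list of the queue
qconf -aattr queue pe_list int_test all.q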
Another thought: this behavior started more or less when I tried to
implement fairshare, which I never got working right. We haven't been able
to confirm it, but for some users the "PE 0 slots" issue seems to pop up
only after they've been running other jobs for a little while. So I'm
wondering if I've screwed up fairshare in some way that's causing this odd
behavior.
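If it would help to look at the fairshare side, I can post output from
something like:

# extended qstat view showing ticket counts for all users' jobs
qstat -u '*' -ext
# show the share tree configuration
qconf -sstree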
The default queue from the global config file is all.q.
Below are the various config dumps. Is there anything else that would be
helpful?
Thanks for any help! This has been plaguing me.
[root@chead ~]# qconf -sp unihost
pe_name unihost
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qsort_args NONE
[root@chead ~]# qconf -sp int_test
pe_name int_test
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule 8
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qsort_args NONE
[root@chead ~]# qconf -ssconf
algorithm default
schedule_interval 0:0:5
maxujobs 200
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info true
flush_submit_sec 0
flush_finish_sec 0
params none
reprioritize_interval 0:0:0
halftime 1
usage_weight_list cpu=0.700000,mem=0.200000,io=0.100000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 1000
weight_tickets_share 100000
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 2000
report_pjob_tickets TRUE
max_pending_tasks_per_job 100
halflife_decay_list none
policy_hierarchy OS
weight_ticket 0.000000
weight_waiting_time 1.000000
weight_deadline 3600000.000000
weight_urgency 0.100000
weight_priority 1.000000
max_reservation 0
default_duration INFINITY
[root@chead ~]# qconf -sconf
#global:
execd_spool_dir /opt/sge/default/spool
mailer /bin/mail
xterm /usr/bin/X11/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,bash,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 02:00:00
loglevel log_warning
administrator_mail none
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params ENABLE_BINDING=true
reporting_params accounting=true reporting=true \
flush_time=00:00:15 joblog=true sharelog=00:00:00
finished_jobs 100
gid_range 20000-20100
qlogin_command /opt/sge/bin/cfn-qlogin.sh
qlogin_daemon /usr/sbin/sshd -i
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 4000
max_jobs 0
max_advance_reservations 0
auto_user_oticket 0
auto_user_fshare 100
auto_user_default_project none
auto_user_delete_time 0
delegated_file_staging false
reprioritize 0
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
[root@chead ~]# qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH
ckpt_list NONE
pe_list make mpich mpi orte unihost serial int_test unihost2
rerun FALSE
slots 1,[compute-0-0.local=4],[compute-0-1.local=15], \
[compute-0-2.local=15],[compute-0-3.local=15], \
[compute-0-4.local=15],[compute-0-5.local=15], \
[compute-0-6.local=16],[compute-0-7.local=16], \
[compute-0-9.local=16],[compute-0-10.local=16], \
[compute-0-11.local=16],[compute-0-12.local=16], \
[compute-0-13.local=16],[compute-0-14.local=16], \
[compute-0-15.local=16],[compute-0-16.local=16], \
[compute-0-17.local=16],[compute-0-18.local=16], \
[compute-0-8.local=16],[compute-0-19.local=14], \
[compute-0-20.local=4],[compute-gpu-0.local=4]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem 40G,[compute-0-20.local=3.2G], \
[compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
h_vmem 40G,[compute-0-20.local=3.2G], \
[compute-gpu-0.local=3.2G],[compute-0-19.local=5G]
qstat -j on a stuck job as an example:
[mgstauff@chead ~]$ qstat -j 3714924
==============================================================
job_number: 3714924
exec_file: job_scripts/3714924
submission_time: Fri Aug 11 12:48:47 2017
owner: mgstauff
uid: 2198
group: mgstauff
gid: 2198
sge_o_home: /home/mgstauff
sge_o_log_name: mgstauff
sge_o_path:
/share/apps/mricron/ver_2015_06_01:/share/apps/afni/linux_xorg7_64_2014_06_16:/share/apps/c3d/c3d-1.0.0-Linux-x86_64/bin:/share/apps/freesurfer/5.3.0/bin:/share/apps/freesurfer/5.3.0/fsfast/bin:/share/apps/freesurfer/5.3.0/tktools:/share/apps/fsl/5.0.8/bin:/share/apps/freesurfer/5.3.0/mni/bin:/share/apps/fsl/5.0.8/bin:/share/apps/pandoc/1.12.4.2-in-rstudio/:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/opt/sge/bin:/opt/sge/bin/lx-amd64:/share/admin:/opt/perfsonar_ps/toolkit/scripts:/usr/dbxml-2.3.11/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/bio/ncbi/bin:/opt/bio/mpiblast/bin:/opt/bio/EMBOSS/bin:/opt/bio/clustalw/bin:/opt/bio/tcoffee/bin:/opt/bio/hmmer/bin:/opt/bio/phylip/exe:/opt/bio/mrbayes:/opt/bio/fasta:/opt/bio/glimmer/bin:/opt/bio/glimmer/scripts:/opt/bio/gromacs/bin:/opt/bio/gmap/bin:/opt/bio/tigr/bin:/opt/bio/autodocksuite/bin:/opt/bio/wgs/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/home/mgstauff/bin:/share/apps/R/R-3.1.1/bin:/share/apps/rstudio/rstudio-0.98.1091/bin/:/share/apps/ANTs/2014-06-23/build/bin/:/share/apps/matlab/R2014b/bin/:/share/apps/BrainVISA/brainvisa-Mandriva-2008.0-x86_64-4.4.0-2013_11_18:/share/apps/MIPAV/7.1.0_release:/share/apps/itksnap/itksnap-most-recent/bin/:/share/apps/MRtrix3/2016-04-25/mrtrix3/release/bin/:/share/apps/VoxBo/bin
sge_o_shell: /bin/bash
sge_o_workdir: /home/mgstauff
sge_o_host: chead
account: sge
hard resource_list: h_stack=128m
mail_list: [email protected]
notify: FALSE
job_name: myjobparam
jobshare: 0
hard_queue_list: all.q
env_list: TERM=NONE
job_args: 5
script_file: workshop-files/myjobparam
parallel environment: int_test range: 2
binding: set linear:2
job_type: NONE
scheduling info: queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is temporarily not available
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance
"[email protected]" dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
queue instance "[email protected]"
dropped because it is full
cannot run in PE "int_test" because it only
offers 0 slots
[mgstauff@chead ~]$ qquota -u mgstauff
resource quota rule limit filter
--------------------------------------------------------------------------------
[mgstauff@chead ~]$ qconf -srqs limit_user_slots
{
name limit_user_slots
description Limit the users' batch slots
enabled TRUE
limit users {pcook,mgstauff} queues {allalt.q} to slots=32
limit users {*} queues {allalt.q} to slots=0
limit users {*} queues {himem.q} to slots=6
limit users {*} queues {all.q,himem.q} to slots=32
limit users {*} queues {basic.q} to slots=40
}
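Following the earlier "-w p" suggestion, I can also re-check the stuck job
and the per-queue quotas like this, if the output would be useful (I'm
assuming -w works with qalter on a pending job the way I think it does):

# ask the scheduler to re-validate the pending job and report why it can't run
qalter -w p 3714924
# quota usage for the user against the queue the job requested
qquota -u mgstauff -q all.q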
There are plenty of consumables available:
[root@chead ~]# qstat -F h_vmem,s_vmem,slots -q all.q
queuename qtype resv/used/tot. load_avg arch
states
---------------------------------------------------------------------------------
[email protected] BP 0/4/4 5.24 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
qc:slots=0
---------------------------------------------------------------------------------
[email protected] BP 0/10/15 9.58 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
qc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/9/16 9.80 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=7
---------------------------------------------------------------------------------
[email protected] BP 0/11/16 9.18 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/11/16 9.72 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 9.14 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=6
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 9.66 lx-amd64
hc:h_vmem=28.890G
hc:s_vmem=30.990G
hc:slots=6
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 9.54 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=6
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 10.01 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=6
---------------------------------------------------------------------------------
[email protected] BP 0/11/16 9.75 lx-amd64
hc:h_vmem=29.963G
hc:s_vmem=32.960G
hc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/11/16 10.29 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/9/14 9.01 lx-amd64
qf:h_vmem=5.000G
qf:s_vmem=5.000G
qc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/10/15 9.24 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
qc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/0/4 0.00 lx-amd64
qf:h_vmem=3.200G
qf:s_vmem=3.200G
qc:slots=4
---------------------------------------------------------------------------------
[email protected] BP 0/11/15 9.62 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
qc:slots=4
---------------------------------------------------------------------------------
[email protected] BP 0/12/15 9.85 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
qc:slots=3
---------------------------------------------------------------------------------
[email protected] BP 0/12/15 10.18 lx-amd64
hc:h_vmem=36.490G
hc:s_vmem=39.390G
qc:slots=3
---------------------------------------------------------------------------------
[email protected] BP 0/12/16 9.95 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=4
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 9.59 lx-amd64
hc:h_vmem=36.935G
qf:s_vmem=40.000G
hc:slots=5
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 9.37 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=6
---------------------------------------------------------------------------------
[email protected] BP 0/10/16 9.38 lx-amd64
qf:h_vmem=40.000G
qf:s_vmem=40.000G
hc:slots=6
---------------------------------------------------------------------------------
[email protected] BP 0/0/4 0.05 lx-amd64
qf:h_vmem=3.200G
qf:s_vmem=3.200G
qc:slots=4
On Mon, Feb 13, 2017 at 2:42 PM, Jesse Becker <[email protected]> wrote:
> On Mon, Feb 13, 2017 at 02:26:18PM -0500, Michael Stauffer wrote:
>
>> SoGE 8.1.8
>>
>> Hi,
>>
>> I'm getting some queued jobs with scheduling info that includes this line
>> at the end:
>>
>> cannot run in PE "unihost" because it only offers 0 slots
>>
>> 'unihost' is the only PE I use. When users request multiple slots, they
>> use
>> 'unihost':
>>
>> ... -binding linear:2 -pe unihost 2 ...
>>
>> What happens is that these jobs aren't running when it otherwise seems
>> like
>> they should be, or they sit waiting in the queue for a long time even when
>> the user has plenty of quota available within the queue they've requested,
>> and there are enough resources available on the queue's nodes (slots and
>> vram are consumables).
>>
>> Any suggestions about how I might further understand this?
>>
>
> This *exact* problem has bitten me in the past. It seems to crop up
> about every 3 years--long enough to remember it was a problem, and long
> enough to forget just what the [censored] I did to fix it.
>
> As I recall, it has little to do with actual PEs, but everything to do
> with complexes and resource requests.
>
> You might glean a bit more information by running "qsub -w p" (or "-w e").
>
> Take a look at these previous discussions:
>
> http://gridengine.org/pipermail/users/2011-November/001932.html
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/1700
>
>
> --
> Jesse Becker (Contractor)
>