I have pinpointed the problem, but I don’t know how to solve it.
It looks like hosts with a virtual_free complex set in complex_values cannot run
jobs that request a PE, even though the job did not request virtual_free at all.
I set up virtual_free so that jobs can request RAM; the goal is for each job to
request both RAM and a number of CPU cores. Hopefully this helps in figuring out
a solution. Thanks.
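For reference, a virtual_free complex that lets jobs request RAM is typically
defined as a consumable along the lines shown below; the column values here are
illustrative and not copied from this cluster's actual qconf -sc output:
# qconf -sc | grep virtual_free
virtual_free        vf         MEMORY      <=    YES         YES        0        0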
Here’s an example of one host that doesn’t work:
# qconf -se ibm038
hostname ibm038
load_scaling NONE
complex_values virtual_free=16G
# qsub -V -b y -cwd -now n -pe cores 7 -q all.q@ibm038 xclock
Your job 143 ("xclock") has been submitted
# qstat -j 143
==============================================================
job_number: 143
exec_file: job_scripts/143
submission_time: Fri Dec 16 10:46:02 2016
owner: johnt
uid: 162
group: sa
gid: 4563
sge_o_home: /home/johnt
sge_o_log_name: johnt
sge_o_path: /home/sge/sge8.1.9-1.el5/bin:/home/sge/sge8.1.9-1.el5/bin/lx-amd64:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/johnt/bin:.
sge_o_shell: /bin/tcsh
sge_o_workdir: /home/johnt/sge8
sge_o_host: ibm005
account: sge
cwd: /home/johnt/sge8
mail_list: johnt@ibm005
notify: FALSE
job_name: xclock
jobshare: 0
hard_queue_list: all.q@ibm038
env_list: TERM=xterm,DISPLAY=dsls11:3.0,HOME= [..]
script_file: xclock
parallel environment: cores range: 7
binding: NONE
job_type: binary
scheduling info: cannot run in PE "cores" because it only offers 0 slots
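One thing that might be worth checking (this is only a guess on my part): if
virtual_free is a per-slot consumable with a non-zero default value, a PE job is
charged that default once per slot, so a 7-slot job could implicitly ask for more
than the 16G configured on ibm038, in which case the host offers 0 slots. The
definition and the host's current value can be inspected with:
# qconf -sc | grep virtual_free
# qhost -h ibm038 -F virtual_free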
Here’s an example of a host that does work:
# qconf -se ibm037
hostname ibm037
load_scaling NONE
complex_values NONE
# qsub -V -b y -cwd -now n -pe cores 7 -q all.q@ibm037 xclock
Your job 144 ("xclock") has been submitted
# qstat -j 144
==============================================================
job_number: 144
exec_file: job_scripts/144
submission_time: Fri Dec 16 10:49:35 2016
owner: johnt
uid: 162
group: sa
gid: 4563
sge_o_home: /home/johnt
sge_o_log_name: johnt
sge_o_path: /home/sge/sge8.1.9-1.el5/bin:/home/sge/sge8.1.9-1.el5/bin/lx-amd64:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/johnt/bin:.
sge_o_shell: /bin/tcsh
sge_o_workdir: /home/johnt/sge8
sge_o_host: ibm005
account: sge
cwd: /home/johnt/sge8
mail_list: johnt@ibm005
notify: FALSE
job_name: xclock
jobshare: 0
hard_queue_list: all.q@ibm037
env_list: TERM=xterm,DISPLAY=dsls11:3.0,HOME=/home/johnt [..]
script_file: xclock
parallel environment: cores range: 7
binding: NONE
job_type: binary
usage 1: cpu=00:00:00, mem=0.00000 GB s, io=0.00000 GB, vmem=N/A, maxvmem=N/A
binding 1: NONE
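For completeness, the submission I am ultimately aiming for would look roughly
like the line below. For a per-slot consumable the -l request is multiplied by
the granted slot count, so this example would reserve 7 x 2G = 14G on the host;
the 2G figure is purely illustrative:
# qsub -V -b y -cwd -now n -pe cores 7 -l virtual_free=2G -q all.q@ibm038 xclock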
From: [email protected] [mailto:[email protected]] On
Behalf Of John_Tai
Sent: Wednesday, December 14, 2016 3:52
To: Christopher Heiny
Cc: [email protected]; Coleman, Marcus [JRDUS Non-J&J]
Subject: Re: [gridengine users] John's cores pe (Was: users Digest...)
I’m actually using sge8.1.9-1 on all hosts. Is there a problem with that? It was
downloaded from here:
http://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/
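If it helps, the version a particular host is actually running can be
double-checked on that host with something like:
# qconf -help | head -1
which should print the version string of the locally installed binaries.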
From: Christopher Heiny [mailto:[email protected]]
Sent: Wednesday, December 14, 2016 3:26
To: John_Tai
Cc: [email protected]; Coleman, Marcus [JRDUS
Non-J&J]; Reuti; Christopher Heiny
Subject: Re: [gridengine users] John's cores pe (Was: users Digest...)
On Dec 13, 2016 7:04 PM, "John_Tai" <[email protected]> wrote:
I have 3 hosts in all.q. It seems the two servers running RHEL 5.3 (ibm037,
ibm038) do not work with the PE, while the server running RHEL 6.8 (ibm021) works
fine. Their configurations are identical:
Hmmmm. Might be a Grid Engine version mismatch issue. If you installed from RH
rpms, then I think EL5.3 is on 6.1u4 and EL6.8 is on 6.2u3 or 6.2u5.
# qconf -sq all.q@ibm038
qname all.q
hostname ibm038
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list cores
rerun FALSE
slots 8
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
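A quick way to verify that the queue instances really are identical is to dump
and diff them, comparing a working host against a non-working one (the /tmp file
names below are just placeholders):
# qconf -sq all.q@ibm021 > /tmp/q21
# qconf -sq all.q@ibm038 > /tmp/q38
# diff /tmp/q21 /tmp/q38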
-----Original Message-----
From: Christopher Heiny
[mailto:[email protected]]
Sent: Wednesday, December 14, 2016 10:21
To: John_Tai; Reuti
Cc: Coleman, Marcus [JRDUS Non-J&J];
[email protected]
Subject: Re: John's cores pe (Was: users Digest...)
On Wed, 2016-12-14 at 02:03 +0000, John_Tai wrote:
> I switched schedd_job_info to true; these are the outputs you
> requested:
>
>
>
> # qstat -j 95
> ==============================================================
> job_number: 95
> exec_file: job_scripts/95
> submission_time: Tue Dec 13 08:50:34 2016
> owner: johnt
> uid: 162
> group: sa
> gid: 4563
> sge_o_home: /home/johnt
> sge_o_log_name: johnt
> sge_o_path: /home/sge/sge8.1.9-
> 1.el5/bin:/home/sge/sge8.1.9-1.el5/bin/lx-
> amd64:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/johnt/bin:.
> sge_o_shell: /bin/tcsh
> sge_o_workdir: /home/johnt/sge8
> sge_o_host: ibm005
> account: sge
> cwd: /home/johnt/sge8
> mail_list: johnt@ibm005
> notify: FALSE
> job_name: xclock
> jobshare: 0
> hard_queue_list: all.q@ibm038
> env_list: TERM=xterm,DISPLAY=dsls11:3.0,HOME=/home/
> johnt,SHELL=/bin/tcsh,USER=johnt,LOGNAME=johnt,PATH=/home/sge/sge8.1.
> 9-1.el5/bin:/home/sge/sge8.1.9-1.el5/bin/lx-
> amd64:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/johnt/bin:.,H
> OSTTYPE=x86_64-
> linux,VENDOR=unknown,OSTYPE=linux,MACHTYPE=x86_64,SHLVL=1,PWD=/home/j
> ohnt/sge8,GROUP=sa,HOST=ibm005,REMOTEHOST=dsls11,MAIL=/var/spool/mail
> /johnt,LS_COLORS=no=00:fi=00:di=00;36:ln=00;34:pi=40;33:so=01;31:bd=4
> 0;33:cd=40;33:or=40;31:ex=00;31:*.tar=00;33:*.tgz=00;33:*.zip=00;33:*
> .bz2=00;33:*.z=00;33:*.Z=00;33:*.gz=00;33:*.ev=00;41,G_BROKEN_FILENAM
> ES=1,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-
> askpass,KDE_IS_PRELINKED=1,KDEDIR=/usr,LANG=en_US.UTF-
> 8,LESSOPEN=|/usr/bin/lesspipe.sh
> %s,HOSTNAME=ibm005,INPUTRC=/etc/inputrc,ASSURA_AUTO_64BIT=NONE,EDITOR
> =vi,TOP=-ores
> 60,CVSROOT=/home/edamgr/CVSTF,OPERA_PLUGIN_PATH=/usr/java/jre1.5.0_01
> /plugin/i386/ns7,NPX_PLUGIN_PATH=/usr/java/jre1.5.0_01/plugin/i386/ns
> 7,MANPATH=/home/sge/sg
> e8.1.9-
> 1.el5/man:/usr/share/man:/usr/X11R6/man:/usr/kerberos/man,LD_LIBRARY_
> PATH=/usr/lib:/usr/local/lib:/usr/lib64:/usr/local/lib64,MGC_HOME=/ho
> me/eda/mentor/aoi_cal_2015.3_25.16,CALIBRE_LM_LOG_LEVEL=WARN,MGLS_LIC
> ENSE_FILE=1717@ibm004:1717@ibm005:1717@ibm041:1717@ibm042:1717@ibm043
> :1717@ibm033:1717@ibm044:1717@td156:1717@td158:1717@ATD222,MGC_CALGUI
> _RELEASE_LICENSE_TIME=0.5,MGC_RVE_RELEASE_LICENSE_TIME=0.5,SOSCAD=/ca
> d,EDA_TOOL_SETUP_ROOT=/cad/toolSetup,EDA_TOOL_SETUP_VERSION=1.0,SGE_R
> OOT=/home/sge/sge8.1.9-1.el5,SGE_ARCH=lx-
> amd64,SGE_CELL=cell2,SGE_CLUSTER_NAME=p6444,SGE_QMASTER_PORT=6444,SGE
> _EXECD_PORT=6445,DRMAA_LIBRARY_PATH=/home/sge/sge8.1.9-
> 1.el5/lib//libdrmaa.so
> script_file: xclock
> parallel environment: cores range: 1
> binding: NONE
> job_type: binary
> scheduling info: cannot run in queue "pc.q" because it is not contained in its hard queue list (-q)
>                  cannot run in queue "sim.q" because it is not contained in its hard queue list (-q)
>                  cannot run in queue "all.q@ibm021" because it is not contained in its hard queue list (-q)
>                  cannot run in PE "cores" because it only offers 0 slots
Hmmmm. Just a wild idea, but I'm thinking maybe there's something wacky about
ibm038's particular configuration. What does
qconf -sq all.q@ibm038
say?
And what happens if you use this qsub command?
qsub -V -b y -cwd -now n -pe cores 2 -q all.q xclock
Cheers,
Chris