Dang! You are right!
The "incoherence" among jobs is due to the first core of the first
socket being available. On my previous socket report, all "linear X:0,0"
that were correctly reported were only the ones that could start in the
first core.
I have just modified my jsv to set the policy to linear_automatic, and
now it works fine!
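For reference, the change boils down to something like this — a minimal
sketch assuming the bash JSV helpers shipped in
$SGE_ROOT/util/resources/jsv/jsv_include.sh; everything else my script
does is omitted here:

#!/bin/bash
# Hypothetical excerpt of the JSV (the rest of the script is omitted).
. $SGE_ROOT/util/resources/jsv/jsv_include.sh

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   # "linear" (linear:X:S,C) pins the start to socket S, core C;
   # "linear_automatic" (linear:X) lets the execd pick the free cores.
   if [ "$(jsv_get_param binding_strategy)" = "linear" ]; then
      jsv_set_param binding_strategy "linear_automatic"
      jsv_correct "binding strategy changed to linear_automatic"
   else
      jsv_accept "no change needed"
   fi
   return
}

jsv_main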
Taking these two nodes:
compute-1-8 lx26-amd64 12 6.95 94.6G 32.3G 9.8G 39.4M
hl:m_topology_inuse=SccccccSCCCCCC
(bindings of the jobs already running there:)
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
binding: set linear:1:0,0
binding 1: NONE
compute-1-9 lx26-amd64 12 0.01 94.6G 10.1G 9.8G 39.0M
hl:m_topology_inuse=SCCCCCCSCCCCCC
compute-1-8 already has its first core bound, while compute-1-9 has it free.
I submit several single-core qlogin jobs to both nodes:
(compute-1-8)
[root@floquet ~]# qstat -j 4564595 -cb | grep binding
binding: set linear:1:0,0
binding 1: NONE
(compute-1-9)
[root@floquet ~]# qstat -j 4564594 -cb | grep binding
binding: set linear:1:0,0
binding 1: ScCCCCCSCCCCCC
Now I change the policy to linear_automatic and submit again:
(compute-1-8)
[root@floquet ~]# qstat -j 4564597 -cb | grep binding
binding: set linear:1
binding 1: SCCCCCCScCCCCC
(compute-1-9)
[root@floquet ~]# qstat -j 4564596 -cb | grep binding
binding: set linear:1
binding 1: ScCCCCCSCCCCCC
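As a sanity check, the effective affinity can also be inspected on the
node itself, assuming util-linux's taskset is installed there (<pid> is
the qlogin shell's process id, left as a placeholder):

[root@compute-1-9 ~]# taskset -cp <pid>

which should print an affinity list containing only the bound core.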
Thanks!!
Txema
On 27/06/14 13:19, Daniel Gruber wrote:
Hi,
Please notice the difference between "set linear:1:0,0" and
"set linear:1". The first one means: give me one core starting
at socket 0, core 0 (which here obviously means you are
requesting core 0 on socket 0). The second means that
you want one core on the host, and the execution daemon
takes care of which one.
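For example, on the command line the two requests would look like this
(the job script name is made up):

qsub -binding linear:1:0,0 job.sh   # exactly core 0 on socket 0, or no binding at all
qsub -binding linear:1 job.sh       # any free core, chosen by the execd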
So per design the core selection is done on the execd in SGE,
while in Univa Grid Engine we moved that to the qmaster
itself (which has many advantages due to its global
view of the cluster's job and core usage).
If the execd in your case now tries to bind the job, it figures
out that a different job already uses this core, and therefore
SGE just doesn't do any binding for the job (in order to avoid
overallocation).
I guess your linear:1:0,0 request is not intentional - it only
makes sense in scenarios where you are using your
host exclusively for one job.
This is probably caused by your JSV script, which sets binding_strategy
to "linear" (linear:X:S,C) instead of "linear_automatic" (linear:X).
Obviously the naming of the JSV parameter argument is unfortunate.
Might this be the reason?
Cheers
Daniel
On 27.06.2014 at 12:58, Txema Heredia <[email protected]> wrote:
On 27/06/14 12:32, Reuti wrote:
On 27.06.2014 at 12:24, Txema Heredia wrote:
On 27/06/14 11:31, Reuti wrote:
Hi,
On 26.06.2014 at 17:56, Txema Heredia wrote:
<snip>
# qstat -j 4561291 -cb | grep "job_name\|binding\|queue_list"
job_name: c0-1
hard_queue_list: *@compute-0-1.local
binding: set linear:1:0,0
binding 1: NONE
What am I missing here? What can be different about my nodes?
Does `qhost -F` output the fields:
$ qhost -F
...
hl:m_topology=SC
hl:m_topology_inuse=SC
hl:m_socket=1.000000
hl:m_core=1.000000
for this machine?
-- Reuti
Yes, qhost -F reports that for all nodes:
# qhost -F | grep "compute\|hl:m_"
compute-0-0 lx26-amd64 12 0.60 94.6G 10.1G 9.8G 53.8M
hl:m_topology=SCCCCCCSCCCCCC
hl:m_topology_inuse=SCCCCCCSCCCCCC
hl:m_socket=2.000000
hl:m_core=12.000000
compute-0-1 lx26-amd64 12 7.21 94.6G 14.9G 9.8G 86.6M
hl:m_topology=SCCCCCCSCCCCCC
hl:m_topology_inuse=ScCCCCCSCCCCCC
hl:m_socket=2.000000
hl:m_core=12.000000
...
But the inuse topology is blatantly wrong.
What version of SGE are you using? Maybe "PLPA", which was used
in former versions, doesn't support this particular CPU's topology.
It was replaced by "hwloc" later on.
-- Reuti
Originally it was SGE 6.2u5, but later on I replaced the
sge_qmaster binary with the one from OGS/GE 2011.11p1 (due to a
problem with parallel jobs and -hold_jid).